comparator icon indicating copy to clipboard operation
comparator copied to clipboard

Reduce memory usage for edit distances?

Open OlivierBinette opened this issue 2 years ago • 4 comments

@ngmarchant the Levenshtein distance can be implemented using only two rows for dmat, instead of using a square matrix. That could significantly reduce memory usage when comparing long sequences (400 Mb to 80 Kb when comparing strings of length 10,000).

Would it be worth it to implement this? I could propose the changes.

Example Python implementation:

import numpy as np
dmat = np.zeros((100,2))

def levenshtein(s, t, dmat):
    m = len(s)
    n = len(t)
    dmat[:, 0] = np.arange(dmat.shape[0])

    for j in range(1, n+1):
        dmat[0, (j-1) % 2] = j-1
        dmat[0, j % 2] = j
        for i in range(1, m+1):
            cost = 0
            if s[i-1] != t[j-1]:
                cost = 1
            dmat[i, j % 2] = min(dmat[i-1, j % 2] + 1, dmat[i, (j-1) % 2] +
                                 1, dmat[i-1, (j-1) % 2] + cost)
    return dmat[m, n % 2]

levenshtein("test", "testt", dmat)

OlivierBinette avatar Jan 07 '22 12:01 OlivierBinette

Hi Olivier,

I'd be in favor of this. Keeping the distance matrix in memory is only useful if one is interested in recording the optimal sequence of edit operations. But here, we're only interested in the distance (the last entry in the matrix).

One thing to be mindful of, is that the Levenshtein distance is related to other edit-based distances in the package through class inheritance. The inheritance diagram is as follows: DamerauLevenshtein -> OSA -> Levenshtein -> LCS.

Since all of these classes currently rely on the full distance matrix, they would also be impacted by your proposal. I believe it's possible to implement LCS and Levenshtein by keeping 2 rows in memory, and OSA by keeping 3 rows in memory. For DamerauLevenshtein, it may be necessary to keep all rows in memory (need to check this).

Would you be happy to submit a pull request?

ngmarchant avatar Jan 09 '22 07:01 ngmarchant

You might be interested in this: https://web.archive.org/web/20180612143641/https://bitbucket.org/clearer/iosifovich/

ngmarchant avatar Jan 09 '22 07:01 ngmarchant

@ngmarchant thanks for the link! I'll look into it for further optimization, and then see if I can make a pull request.

OlivierBinette avatar Jan 23 '22 22:01 OlivierBinette

For reference, here's the optimization I could get by reducing the memory usage a bit more:

def levenshtein(s, t):
    m = len(s)
    n = len(t)
    buffer = np.arange(max(n, m) + 1)

    p = m
    for j in range(1, n+1):
        temp = j-1
        p = j
        for i in range(1, m+1):
            p = min(p + 1, buffer[i] + 1, temp + (s[i-1] != t[j-1]))
            temp = buffer[i]
            buffer[i] = p
    return p

OlivierBinette avatar Jan 24 '22 00:01 OlivierBinette