fast_diff_match_patch icon indicating copy to clipboard operation
fast_diff_match_patch copied to clipboard

update dmp submodule to fix a buffer overflow

Open beqabeqa473 opened this issue 1 year ago • 15 comments

There was a problem in dmp project, where it was possible to get a fatal crash, when converting diff back to lines because of heap overflow. It is fixed in this submodule.

beqabeqa473 avatar Feb 09 '24 16:02 beqabeqa473

It looks like the commit isn't in the upstream repo, is that right?

JoshData avatar Feb 13 '24 00:02 JoshData

Yes, and i am not sure if they will accept pr as last commit was 8 years ago

On 2/13/24, Joshua Tauberer @.***> wrote:

It looks like the commit isn't in the upstream repo, is that right?

-- Reply to this email directly or view it on GitHub: https://github.com/JoshData/fast_diff_match_patch/pull/28#issuecomment-1939858233 You are receiving this because you authored the thread.

Message ID: @.***>

-- with best regards Beqa Gozalishvili Tell: +995593454005 Email: @.*** Web: https://gozaltech.org Skype: beqabeqa473 Telegram: https://t.me/gozaltech facebook: https://facebook.com/gozaltech twitter: https://twitter.com/beqabeqa473 Instagram: https://instagram.com/beqa.gozalishvili

beqabeqa473 avatar Feb 13 '24 06:02 beqabeqa473

It would be difficult for me to review your changes to know whether I should accept them here, and I'm not sure if it is meaningful to have a submodule point to a commit that isn't in the repository that it points to. So for both reasons I don't think I can merge this PR as is.

Is there a similar fix in Google's original DMP library, which the repository that the submodule points to is based on?

JoshData avatar Feb 14 '24 18:02 JoshData

I am not sure. I will try to make a pr to dmp-cpp-stl but i am not sure it will be merged unfortunately. I

On 2/14/24, Joshua Tauberer @.***> wrote:

It would be difficult for me to review your changes to know whether I should accept them here, and I'm not sure if it is meaningful to have a submodule point to a commit that isn't in the repository that it points to. So for both reasons I don't think I can merge this PR as is.

Is there a similar fix in Google's original DMP library, which the repository that the submodule points to is based on?

-- Reply to this email directly or view it on GitHub: https://github.com/JoshData/fast_diff_match_patch/pull/28#issuecomment-1944400859 You are receiving this because you authored the thread.

Message ID: @.***>

-- with best regards Beqa Gozalishvili Tell: +995593454005 Email: @.*** Web: https://gozaltech.org Skype: beqabeqa473 Telegram: https://t.me/gozaltech facebook: https://facebook.com/gozaltech twitter: https://twitter.com/beqabeqa473 Instagram: https://instagram.com/beqa.gozalishvili

beqabeqa473 avatar Feb 14 '24 19:02 beqabeqa473

I made a pr again diff-match-patch -cpp-stl but have not hope this to be merged unfortunately because of inactivity of author.

We have two ways, you can review and merge this pr as is, or fork my fork to your account to be sure nothing will happen and make a commit with that fork.

This issue is very serious, and makes whole application to crash.

fast-diff-match-patch is used in nvda screenreader and users of this screenreader are suffering from this issue. it is possible to downgrade to diffmatchpatch-python, but as testing revealed, it is 4 to 10 times slower than fast-diff-match-patch.

beqabeqa473 avatar Feb 17 '24 17:02 beqabeqa473

Got it. Ok let me find time to review what you did and figure out how to best include it.

Do you have a test case that I can run to see the crash?

JoshData avatar Feb 17 '24 23:02 JoshData

Unfortunately, just using dmp without nvda no, i don't have.

It is possible to reliably get a crash when you try to pass large data to it, like git log -P 10000 in nvaccess/nvda repository, with nvda running.

On 2/18/24, Joshua Tauberer @.***> wrote:

Got it. Ok let me find time to review what you did and figure out how to best include it.

Do you have a test case that I can run to see the crash?

-- Reply to this email directly or view it on GitHub: https://github.com/JoshData/fast_diff_match_patch/pull/28#issuecomment-1950508287 You are receiving this because you authored the thread.

Message ID: @.***>

-- with best regards Beqa Gozalishvili Tell: +995593454005 Email: @.*** Web: https://gozaltech.org Skype: beqabeqa473 Telegram: https://t.me/gozaltech facebook: https://facebook.com/gozaltech twitter: https://twitter.com/beqabeqa473 Instagram: https://instagram.com/beqa.gozalishvili

beqabeqa473 avatar Feb 18 '24 07:02 beqabeqa473

Hello @JoshData You can use this test script to see the problem.

Test text files contains some commits from NVDA repo, searching for diff before crashing takes a while (20 seconds on my machine).

Danstiv avatar Feb 18 '24 08:02 Danstiv

  hello @JoshData  By the chance, did you have a possibility to look at pr?

beqabeqa473 avatar Feb 24 '24 14:02 beqabeqa473

I've had a chance to look now. I was able to reproduce the crash with @Danstiv's example (thanks for that!).

In DMP's method of doing a line-by-line diff, it maps each unique line to an integer, stores the integer as a character in a new string, and then does a diff over those fake characters. The DMP library doesn't check that the character type of the string can hold integers for as many unique lines that occurs. In this library, when comparing byte strings, it runs DMP using 'char' as the string's underlying character type, which means it won't be able to store more than ~255 unique lines. In the example, there are ~45,000 unique lines. The issue seems to be that since 'char' is signed, it doesn't correctly round-trip all the integers.

Rather than patching the DMP library, I think the right thing to do would be for me to avoid using the char type and always use a larger unsigned integer, i.e. wchar_t, the type used when comparing Python strings. This will also ensure that more unique lines can be represented in the string's character type, which would stay true to the DMP library's intent.

I'll try to get a fix done by the end of next weekend.

Edit: A work-around would be to do a comparison using Python strings (or unicode strings in Py2) rather than byte strings.

JoshData avatar Mar 04 '24 04:03 JoshData

I pushed a fix to the main branch. Hopefully this fixes it. If you can give it a try before I post a release that would be appreciated. (I am trying to get some pre-built packages to test ahead of time.)

JoshData avatar Mar 06 '24 19:03 JoshData

@JoshData, Unfortunately, another incorrect behavior is now observed. When running on the same data the function hangs probably forever, I waited for about two minutes but didn't get the diff.

Danstiv avatar Mar 06 '24 21:03 Danstiv

It takes 14 minutes to complete for me. I'm not sure this is incorrect behavior since these files are quite large. To check that it isn't related to this PR, I changed the test to read the files as string (rather than bytes) and I also reverted to an earlier version of the library - same time.

JoshData avatar Mar 09 '24 13:03 JoshData

@JoshData, But with the initial fix the processing is much faster.

Fix by @beqabeqa473

Danstiv@Danstiv-PC UCRT64 /d/temp/dmp
$ time py -3.11 test_dmp.py
Before diff
after diff

real    0m14.218s
user    0m0.000s
sys     0m0.000s

Fix by @JoshData

Danstiv@Danstiv-PC UCRT64 /d/temp/dmp
$ time py -3.11 test_dmp.py
Before diff
after diff

real    6m44.171s
user    0m0.015s
sys     0m0.046s

It seems your fix has caused some performance degradation.

I'm also attaching new files for testing, collected from the NVDA screen reader when running git log. The test results above were obtained by comparing these files. test_dmp.zip

Danstiv avatar Mar 09 '24 18:03 Danstiv

Thanks. I've spent some time trying to figure out what's happening and haven't figured it out yet.

JoshData avatar Mar 24 '24 13:03 JoshData

Hello @JoshData

Please update official dmp-stl submodule, where my pr was merged. After your update submodule to the latest commit, i will close this pr.

beqabeqa473 avatar Apr 09 '24 11:04 beqabeqa473

Also, please revert your changes, as they are slowing down diffing

beqabeqa473 avatar Apr 09 '24 11:04 beqabeqa473

Done.

JoshData avatar Apr 09 '24 18:04 JoshData

Hello @JoshData, Are you planning to release on pypi?

Danstiv avatar Apr 15 '24 14:04 Danstiv

Thanks for the reminder. I just published a release. Hopefully it all works OK now!

JoshData avatar Apr 23 '24 15:04 JoshData