fast_diff_match_patch
update dmp submodule to fix a buffer overflow
There was a problem in the dmp project where a heap overflow could cause a fatal crash when converting a diff back to lines. It is fixed in this submodule.
It looks like the commit isn't in the upstream repo, is that right?
Yes, and I am not sure if they will accept a PR, as the last commit was 8 years ago.
It would be difficult for me to review your changes to know whether I should accept them here, and I'm not sure if it is meaningful to have a submodule point to a commit that isn't in the repository that it points to. So for both reasons I don't think I can merge this PR as is.
Is there a similar fix in Google's original DMP library, which the repository that the submodule points to is based on?
I am not sure. I will try to make a PR to diff-match-patch-cpp-stl, but I am not sure it will be merged, unfortunately.
I made a PR against diff-match-patch-cpp-stl, but I have little hope it will be merged, unfortunately, because of the author's inactivity.
We have two options: you can review and merge this PR as is, or you can fork my fork to your account, to be sure nothing happens to it, and commit a submodule update pointing at that fork.
This issue is very serious: it crashes the whole application.
fast-diff-match-patch is used in the NVDA screen reader, and users of this screen reader are suffering from this issue. It is possible to downgrade to diffmatchpatch-python, but as testing revealed, it is 4 to 10 times slower than fast-diff-match-patch.
Got it. Ok let me find time to review what you did and figure out how to best include it.
Do you have a test case that I can run to see the crash?
Unfortunately, no, I don't have one that uses dmp without NVDA.
It is possible to reliably trigger the crash by passing large data to it, for example by running git log -P 10000 in the nvaccess/nvda repository with NVDA running.
Hello @JoshData. You can use this test script to see the problem.
The test text files contain some commits from the NVDA repo; computing the diff takes a while before the crash (20 seconds on my machine).
Hello @JoshData. By any chance, have you had an opportunity to look at the PR?
I've had a chance to look now. I was able to reproduce the crash with @Danstiv's example (thanks for that!).
In DMP's method of doing a line-by-line diff, it maps each unique line to an integer, stores the integer as a character in a new string, and then does a diff over those fake characters. The DMP library doesn't check that the string's character type can hold integers for as many unique lines as occur. In this library, when comparing byte strings, it runs DMP using 'char' as the string's underlying character type, which means it can't represent more than ~255 unique lines. In the example, there are ~45,000 unique lines. The issue seems to be that since 'char' is signed, it doesn't correctly round-trip all the integers.
Rather than patching the DMP library, I think the right thing to do would be for me to avoid using the char type and always use a larger unsigned integer, i.e. wchar_t, the type used when comparing Python strings. This will also ensure that more unique lines can be represented in the string's character type, which would stay true to the DMP library's intent.
I'll try to get a fix done by the end of next weekend.
Edit: A work-around would be to do a comparison using Python strings (or unicode strings in Py2) rather than byte strings.
I pushed a fix to the main branch. Hopefully this fixes it. If you can give it a try before I post a release that would be appreciated. (I am trying to get some pre-built packages to test ahead of time.)
@JoshData, unfortunately, another incorrect behavior is now observed. When running on the same data, the function hangs, possibly forever; I waited for about two minutes but didn't get the diff.
It takes 14 minutes to complete for me. I'm not sure this is incorrect behavior, since these files are quite large. To check that it isn't related to this PR, I changed the test to read the files as strings (rather than bytes) and also reverted to an earlier version of the library; the time was the same.
@JoshData, but with the initial fix the processing is much faster.
Fix by @beqabeqa473:

```
Danstiv@Danstiv-PC UCRT64 /d/temp/dmp
$ time py -3.11 test_dmp.py
Before diff
after diff

real    0m14.218s
user    0m0.000s
sys     0m0.000s
```

Fix by @JoshData:

```
Danstiv@Danstiv-PC UCRT64 /d/temp/dmp
$ time py -3.11 test_dmp.py
Before diff
after diff

real    6m44.171s
user    0m0.015s
sys     0m0.046s
```
It seems your fix has caused some performance degradation.
I'm also attaching new files for testing, collected from the NVDA screen reader when running git log. The test results above were obtained by comparing these files. test_dmp.zip
Thanks. I've spent some time trying to figure out what's happening and haven't figured it out yet.
Hello @JoshData
Please update the official dmp-stl submodule, where my PR was merged. After you update the submodule to the latest commit, I will close this PR.
Also, please revert your changes, as they are slowing down diffing.
Done.
Hello @JoshData, are you planning to release on PyPI?
Thanks for the reminder. I just published a release. Hopefully it all works OK now!