DAWG-Python
DAWG-Python copied to clipboard
Adding edges() and iteredges() Functions for DAWGs
As discussed at https://github.com/kmike/marisa-trie/issues/20, this is support for adding the edges() and iteredges() methods for CompletionDAWG. If this looks good, I'll add similar support for RecordDAWGs and ByteDAWGs. The code isn't as optimized as it could be, but it works, it's clean (IMO), and it's fast enough for me.
Coverage decreased (-2.82%) to 91.02% when pulling fa6cd76748a273d8cf05103cf209e1e1d89af834 on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Coverage decreased (-2.82%) to 91.02% when pulling fa6cd76748a273d8cf05103cf209e1e1d89af834 on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Coverage decreased (-2.82%) to 91.02% when pulling fa6cd76748a273d8cf05103cf209e1e1d89af834 on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
These latest additions add edges() and iteredges() functionality for all applicable DAWGs and clean up the code since the original pull request. They complete all the work I planned to implement that we originally discussed. Would love to hear your thoughts @kmike.
Coverage decreased (-4.37%) to 89.46% when pulling 30bf53bebcd7d4eda64dbda8c247ad866e168b4b on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Coverage decreased (-4.37%) to 89.46% when pulling 30bf53bebcd7d4eda64dbda8c247ad866e168b4b on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Coverage decreased (-4.2%) to 89.64% when pulling 15355be17f1433adca05549e649fdd001593a729 on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Coverage decreased (-4.2%) to 89.64% when pulling 15355be17f1433adca05549e649fdd001593a729 on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Coverage decreased (-4.03%) to 89.81% when pulling dee560c4079d7e15127385416d4889318f70defb on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Coverage decreased (-3.85%) to 89.98% when pulling 2a931734694c90e7475d510736fe3e64bc72ba9a on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Coverage decreased (-3.85%) to 89.98% when pulling 2a931734694c90e7475d510736fe3e64bc72ba9a on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Coverage decreased (-4.96%) to 88.87% when pulling 8cb08f348bd9d0a70ff59bc12197b76fb2511fc4 on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
@kmike latest change makes edges() and iter_edges() always return (str, boolean) to denote what the edge key is and whether or not it's a terminal. It also adds edges_data() and edgesiter_data() for dawgs that store data which always return (str, data or None).
Everything looks done as far as I can tell. I'd like to start using this in prod soon. I can use my fork, but if you plan to merge soon, that would be even better.
Coverage decreased (-4.93%) to 88.91% when pulling f3baac8455a2e48f1edc5db8188d6cc53dd6db8a on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Coverage decreased (-3.93%) to 89.91% when pulling ae7472ab7c5d9e7f6257e5f97da705b0d3a8cdb2 on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Hi @EliFinkelshteyn,
Everything looks done as far as I can tell. I'd like to start using this in prod soon. I can use my fork, but if you plan to merge soon, that would be even better.
It seems the main complexity is that some characters are represented by multiple transitions, right? You solved it by trying to decode data until is succeeds, which is reasonable for UTF-8.
Regarding the API - so .edges is like .keys, but it only traverses graph to depth of 1 unicode character, and also returns if the result is terminal or not? I think it is reasonable.
One question is whether it should return full keys or partial keys, without the prefix. You've implemented it the same way as Completer, which looks fine.
Could you please add more tests? For example, based on https://coveralls.io/builds/2376072/source?filename=dawg_python%2Fwrapper.py, the code which handles UnicodeDecodeErrors is untested; some conditions are also missing in dawgs.py (see https://coveralls.io/builds/2376072/source?filename=dawg_python%2Fdawgs.py).
Thanks for your PR! It is wel-written :+1: But I need a bit more time to review it. I'm not sure I'll be able to finish the review during this work week; weekend is more likely.
So, there's actually an issue here. When unicode chars share the same first bytes, this will only return one of the chars. I am working on fixing that now. I realized you can tell exactly how many bytes are in a unicode char by how many leading ones the first byte has, so I can use this to speed up the whole thing a bit as well.
A good catch. For some reason I thought that UTF8 synchronization is enough to make repeated decoding work, but it is not.
Coverage decreased (-2.12%) to 91.72% when pulling 54629163f010d6da620b04a3d32d438569aa3738 on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Coverage decreased (-2.12%) to 91.72% when pulling 54629163f010d6da620b04a3d32d438569aa3738 on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Coverage decreased (-1.82%) to 92.02% when pulling 0b81a9fede3ec31906a1a0415206e2512bc9808e on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Coverage decreased (-1.82%) to 92.02% when pulling 2cbd3404148f627fe03cb850194d150a122d7abd on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Coverage decreased (-1.82%) to 92.02% when pulling 2cbd3404148f627fe03cb850194d150a122d7abd on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
That was my bad. I didn't know python3.2 has an issue with 'u'. I'm also just tired, so this took way longer than it should have. Should all be fixed and working now though.
Coverage decreased (-1.82%) to 92.02% when pulling f56e2b9530dbc39cf2c3b1cceb8adefe48fba1a2 on EliFinkelshteyn:master into 3a5692264c98f5d085c707ad90d2fea64a53d07c on kmike:master.
Ping here. Anything else you want done for this to be merged in?