ahocorapy icon indicating copy to clipboard operation
ahocorapy copied to clipboard

Question about Python representation of matcher

Open baoilleach opened this issue 2 years ago • 3 comments

I'm curious whether it would be possible to generate a Python representation of the matcher data structure, and if so, whether that might be even faster at matching (e.g. fewer lookups as things would be hardcoded, more scope for Pypy to optimize?). Just tossing that out there as a potential idea, but in any case curious to know what you think.

baoilleach avatar Feb 19 '23 11:02 baoilleach

Hi, not sure what you mean by generating a Python representation. Can you provide some more detailed thoughts on that. It's been a while since I did most of the core implementation. The finalized keyword tree is already a pretty optimized python representation of the matcher data structure. If you have any specific ideas on how to further optimize that, let me know and maybe even fork the repo, then we can compare performance. :) Thanks for your input

FrederikP avatar Mar 02 '23 16:03 FrederikP

I was thinking by analogy with a prefix tree. You can create a data structure that takes a bunch of strings and will match anything that starts with any of those strings. But given a fixed set of strings, you can also convert that datastructure to Python (or C++) code which would do the matching as if you had written the Python code yourself (laboriously). I don't know if something similar could be done in this case.

On Thu, 2 Mar 2023 at 16:30, Frederik Petersen @.***> wrote:

Hi, not sure what you mean by generating a Python representation. Can you provide some more detailed thoughts on that. It's been a while since I did most of the core implementation. The finalized keyword tree is already a pretty optimized python representation of the matcher data structure. If you have any specific ideas on how to further optimize that, let me know and maybe even fork the repo, then we can compare performance. :) Thanks for your input

— Reply to this email directly, view it on GitHub https://github.com/abusix/ahocorapy/issues/15#issuecomment-1452163301, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAALCGOCV73SOCZ3TCAXSGLW2DDLTANCNFSM6AAAAAAVA46H6M . You are receiving this because you authored the thread.Message ID: @.***>

baoilleach avatar Mar 02 '23 17:03 baoilleach

Okay, got it.

Mhm, I would be curious if there could be any input-specific generated Python code that would be much faster than the code that is actually running as of now, as it's already pretty optimized as stated above. I def don't see a big speed up potential through that. The generated data structure in its finalized form is already set up in a way that there isn't a lot of overhead compared to the baseline of the algorithm.

Obviously generated C(++) code would be faster, but then why not use pyahocorasick?

Overall I probably just don't grasp where the big speedup should come from when comparing the current implementation and datastructure vs. a static generated-code version of a specific result graph. The only place that might be a bit faster would maybe be the lookups of transitions, but then again it might be outweighed by the additional complexity of the code.

When it comes to pypy: Yeah the generated code might improve performance there. I could see that. But I'm not an expert on how the JIT component of it works and if it would actually be worth it. Overall I'm also not a big fan of generating code. Either it can be error prone (when done poorly) or it takes quite some effort to get right (using ASTs, etc.).

If someone would come up with a working version of that I would be very interested to see it. I don't have time to look into it myself though, at this moment.

FrederikP avatar Mar 02 '23 17:03 FrederikP