
MemoryError for a name with a lot of prefixes

Open · Ronserruya opened this issue 4 years ago · 1 comment

I don't really think this is a "bug", more like an extreme edge case.

While using the library to parse millions of names, I encountered this user input:

"<first_name> van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der <last_name>"

This name quickly caused a MemoryError on a PC with 60+ GB of RAM. More specifically, this list: https://github.com/derek73/python-nameparser/blob/master/nameparser/parser.py#L799 grows exponentially in size very fast.

Again, I'm not expecting you to fix this, since it's obviously a user input error (which I worked around by setting a maximum length for the string), but I thought you might be interested in this edge case.
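For reference, the length-cap workaround mentioned above can be as simple as a pre-check before parsing. This is just a sketch: `MAX_NAME_LEN` and `is_parseable` are names I made up here, not anything from the library, and the threshold is arbitrary.

```python
# Sketch of the length-cap workaround: reject pathologically long
# inputs before they ever reach the parser. MAX_NAME_LEN is an
# arbitrary threshold, not a nameparser setting.
MAX_NAME_LEN = 150

def is_parseable(raw_name: str, max_len: int = MAX_NAME_LEN) -> bool:
    """Return True if the input is short enough to parse safely."""
    return len(raw_name) <= max_len

# A pathological input like the one reported above gets filtered out:
name = "John " + "van der " * 40 + "Smith"
if is_parseable(name):
    pass  # safe to hand the string to nameparser's HumanName here
```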

Ronserruya avatar Mar 22 '20 15:03 Ronserruya

Thanks for the bug report. I wondered if this would ever be an issue when I wrote it that way.

When the parser encounters a new combination of titles joined with a conjunction, it saves the complete string as a new title in the module's shared config (by default) and takes another pass. So each pass would result in a title with one additional conjunction or title added to the end. That somewhat explains the exponential nature, but it might also depend on how you're using the parser. I wonder if you would have the same problem with something like this:

from nameparser import HumanName

parser = HumanName()
parser.full_name = name1
parsed_name1 = str(parser)

parser.full_name = name2
parsed_name2 = str(parser)

This should ensure that the module-level config is shared across all the instances. I'm not clear on why that list would grow so large as to throw a memory error; in my understanding, it should just be storing fewer than 50 different versions of that very long title.
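As a rough illustration of the pass mechanism described above, here is a toy model (not the library's actual code): if each pass saves one more, longer combined title into a shared set, the total text held in the config grows at least quadratically with the number of repeated prefixes, and real growth can be far worse if combinations of runs are saved rather than a single chain.

```python
# Toy model of the re-parse behavior (NOT nameparser's actual code):
# each pass over a name with n repeated "van der" units saves one new,
# longer combined title into a shared set.
def simulate_passes(n_units: int) -> int:
    saved_titles = set()
    for i in range(1, n_units + 1):
        saved_titles.add("van der " * i)  # one longer title per pass
    # total characters held in the shared config after all passes
    return sum(len(t) for t in saved_titles)

# With 10x the repeats, stored text grows roughly 100x in this model.
small = simulate_passes(10)
big = simulate_passes(100)
```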

Anyway, it would be nice if the library didn't throw ambiguous memory errors, so maybe we can give it a better exception. Here is where new titles with conjunctions are saved to the module-level config:

https://github.com/derek73/python-nameparser/blob/master/nameparser/parser.py#L721

We could test for some maximum around there, with a default that can be overridden via the config object. I'm not sure exactly where it would need to go, though. I wonder if the problem is in that group_contiguous_integers(conj_index) call?
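The guard could look something like this sketch. Everything here is hypothetical: `MAX_COMBINED_TITLE_WORDS`, `NameTooComplexError`, and `save_combined_title` are stand-in names, not the library's real config keys or internals, and the check is shown in isolation rather than wired into parser.py.

```python
# Sketch of the suggested guard: before saving a newly combined title
# to the shared config, check it against a configurable maximum.
# MAX_COMBINED_TITLE_WORDS is a hypothetical setting, not part of
# nameparser's real config.
MAX_COMBINED_TITLE_WORDS = 10

class NameTooComplexError(ValueError):
    """Raised when conjunction-joined titles exceed a configured maximum."""

def save_combined_title(titles: set, new_title: str,
                        max_words: int = MAX_COMBINED_TITLE_WORDS) -> None:
    """Add a combined title to the shared set, or fail informatively."""
    if len(new_title.split()) > max_words:
        raise NameTooComplexError(
            "combined title exceeds %d words: %r..."
            % (max_words, new_title[:40]))
    titles.add(new_title.lower())
```

Raising a dedicated exception (or a warning, as mentioned below) would at least tell the caller *why* parsing stopped, instead of an opaque MemoryError deep inside a list operation.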

If you are able to poke around or have any ideas, let me know. I haven't fired up my dev environment yet to try that name string, but when I do I'll try to find somewhere to put in a maximum and then throw a more informative exception, or maybe a warning.

derek73 avatar Mar 22 '20 19:03 derek73