not_alphanum_paragraph_v1 tagger takes forever to run on certain inputs.
While running taggers on the hplt dataset, I encountered a problem where the not_alphanum_paragraph_v1 tagger stalls indefinitely. To debug the problem I created a minimal working example by copy-pasting some code from the TaggerProcessor. I have attached the debugging code in this archive, together with some text that triggers the problem.
mwe.tar.gz
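The timing loop in the MWE boils down to roughly the sketch below. This is a reconstruction rather than the attached code: the dolma import paths and the `TaggerRegistry.get` / `Document` / `predict` names are written from memory and may need adjusting, and the emoji strings are just stand-ins for the hplt paragraphs quoted further down.

```python
import time

# NOTE: reconstruction from memory; exact dolma import paths / names may differ.
from dolma.core.data_types import Document
from dolma.core.registry import TaggerRegistry

# Placeholder texts standing in for the hplt paragraphs quoted below.
texts = {
    "short": "😀 😁 " * 2,
    "long": "😀 😁 😂 🤣 😃 😄 😅 😆 😉 😊 " * 6,
}

# Look up and instantiate the tagger by name (registry API assumed).
tagger = TaggerRegistry.get("not_alphanum_paragraph_v1")()

for doc_id, text in texts.items():
    # Document constructor signature assumed to mirror the InputSpec fields.
    doc = Document(source="hplt1.2", version=None, id=doc_id, text=text)
    start = time.perf_counter()
    tagger.predict(doc)
    print(f"{doc_id}: {time.perf_counter() - start:.6f} seconds")
```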
It looks like long sequences of emojis stall the tagger indefinitely. Here are some timings on emoji text from the hplt dataset:
InputSpec(id='7', text='๐ ๐ก', source='hplt1.2', version=None) took 0.000039 seconds
InputSpec(id='4', text='๐ ๐ก ๐ค ๐ ๐ ๐ฆ ๐ง ๐ ๐ ๐ ๐ \n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 0.000025 seconds
InputSpec(id='11', text='๐ ๐ก ๐ค ๐ ๐ ๐ด ๐ ๐ ๐ ๐ ๐ ๐ฒ ๐ฎ ๐ ๐ถ โค ๐ ๐ ๐ ๐ ๐ ๐ ๐ฐ ๐ ๐ ๐ ๐ โ๏ธ ๐ค ๐ ๐ต โ๏ธ ๐ต Anti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 0.000021 seconds
InputSpec(id='5', text='๐ ๐ก ๐ค ๐ ๐ ๐ด ๐ ๐ ๐ ๐ ๐ ๐ฒ ๐ฎ ๐ ๐ถ โค ๐ ๐ ๐ ๐ ๐ ๐ ๐ฐ ๐ ๐ ๐ ๐ โ๏ธ ๐ค ๐ ๐ต โ๏ธ ๐ต ๐บ ๐ท ๐ผ โ๏ธ ๐ค ๐ฆ ๐ง ๐ ๐ ๐ ๐ \n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 64.204857 seconds
InputSpec(id='3', text='\nGæstebogs indlæg: *\n๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ณ ๐ ๐ฌ ๐ ๐ ๐ข ๐ ๐ญ ๐ ๐ ๐ฉ ๐ฎ ๐ฑ ๐ ๐ก ๐ค ๐ ๐ ๐ด ๐ ๐ ๐ ๐ ๐ ๐ฒ ๐ฎ ๐ ๐ถ โค ๐ ๐ ๐ ๐ ๐ ๐ ๐ฐ ๐ ๐ ๐ ๐ โ๏ธ ๐ค ๐ ๐ต โ๏ธ ๐ต ๐บ ๐ท ๐ผ โ๏ธ ๐ค ๐ฆ ๐ง ๐ ๐ ๐ ๐ \n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) ... takes 'forever'
It seems to be a bug in the regex Python package. If I swap the regex package for the standard library re package, the tagger takes only milliseconds again. I am not sure which feature of the regex package makes it necessary here, but this bug makes me question whether we will run into something similar with other regex patterns.
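For anyone who wants to reproduce the engine comparison without the full tagger, something along these lines works. The pattern here is only a placeholder, not the pattern that not_alphanum_paragraph_v1 actually compiles, so the absolute numbers will differ; the point is just timing the same pattern and text under both engines.

```python
import re
import time

import regex

PATTERN = r"[^\w\s]+"               # placeholder; not the tagger's actual pattern
TEXT = "😀 😁 😂 🤣 😃 😄 😅 " * 10  # emoji-heavy paragraph, similar in spirit to the hplt rows

for engine in (re, regex):
    compiled = engine.compile(PATTERN)
    start = time.perf_counter()
    compiled.findall(TEXT)
    print(f"{engine.__name__}: {time.perf_counter() - start:.6f} seconds")
```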
We encountered the bug while trying to create an overview of the taggers: https://github.com/centre-for-humanities-computing/danish-foundation-models/issues/207#issuecomment-1946399142
Yikes. Probably the easiest way to tackle this is to create two versions of the taggers: one using regex, the other using re.
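Roughly, something like the sketch below, where the same scoring logic is parameterised over the engine module. The class name, pattern, and scoring are illustrative only, not dolma's actual implementation or registration mechanism.

```python
import re

import regex


class NotAlphanumParagraphTagger:
    """Sketch: identical logic, selectable regex engine."""

    _PATTERN = r"[^\w\s]+"  # placeholder, not the real tagger's pattern

    def __init__(self, engine=regex):
        # `engine` is either the stdlib `re` module or the third-party `regex` module
        self.expr = engine.compile(self._PATTERN)

    def score(self, text: str) -> float:
        # fraction of characters that are neither alphanumeric nor whitespace
        matched = sum(len(m) for m in self.expr.findall(text))
        return matched / max(len(text), 1)


# two hypothetical variants: the existing regex-based tagger and an re-based one
not_alphanum_paragraph_regex = NotAlphanumParagraphTagger(engine=regex)
not_alphanum_paragraph_re = NotAlphanumParagraphTagger(engine=re)
```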