not_alphanum_paragraph_v1 tagger takes forever to run on certain inputs.
While running taggers on the hplt dataset, I encountered a problem where the not_alphanum_paragraph_v1 tagger stalls indefinitely. To debug the problem I created a minimal working example by copy-pasting some code from the TaggerProcessor. I have attached the debugging code in this archive, together with some text that triggers the problem.
mwe.tar.gz
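The timing loop in the MWE boils down to roughly the sketch below. This is a reconstruction rather than the attached code: the dolma import paths and the `TaggerRegistry.get` / `Document` / `predict` names are written from memory and may need adjusting, and the emoji strings are just stand-ins for the hplt paragraphs quoted further down.

```python
import time

# NOTE: reconstruction from memory; exact dolma import paths / names may differ.
from dolma.core.data_types import Document
from dolma.core.registry import TaggerRegistry

# Placeholder texts standing in for the hplt paragraphs quoted below.
texts = {
    "short": "😀 😁 " * 2,
    "long": "😀 😁 😂 🤣 😃 😄 😅 😆 😉 😊 " * 6,
}

# Look up and instantiate the tagger by name (registry API assumed).
tagger = TaggerRegistry.get("not_alphanum_paragraph_v1")()

for doc_id, text in texts.items():
    # Document constructor signature assumed to mirror the InputSpec fields.
    doc = Document(source="hplt1.2", version=None, id=doc_id, text=text)
    start = time.perf_counter()
    tagger.predict(doc)
    print(f"{doc_id}: {time.perf_counter() - start:.6f} seconds")
```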
It looks like long sequences of emojis stall the tagger indefinitely. Here are some timings on emoji text from the hplt dataset:
InputSpec(id='7', text='๐ ๐ก', source='hplt1.2', version=None) took 0.000039 seconds
InputSpec(id='4', text='๐ ๐ก ๐ค ๐ ๐ ๐ฆ ๐ง ๐ ๐ ๐ ๐ \n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 0.000025 seconds
InputSpec(id='11', text='๐ ๐ก ๐ค ๐ ๐ ๐ด ๐ ๐ ๐ ๐ ๐ ๐ฒ ๐ฎ ๐ ๐ถ โค ๐ ๐ ๐ ๐ ๐ ๐ ๐ฐ ๐ ๐ ๐ ๐ โ๏ธ ๐ค ๐ ๐ต โ๏ธ ๐ต Anti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 0.000021 seconds
InputSpec(id='5', text='๐ ๐ก ๐ค ๐ ๐ ๐ด ๐ ๐ ๐ ๐ ๐ ๐ฒ ๐ฎ ๐ ๐ถ โค ๐ ๐ ๐ ๐ ๐ ๐ ๐ฐ ๐ ๐ ๐ ๐ โ๏ธ ๐ค ๐ ๐ต โ๏ธ ๐ต ๐บ ๐ท ๐ผ โ๏ธ ๐ค ๐ฆ ๐ง ๐ ๐ ๐ ๐ \n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 64.204857 seconds
InputSpec(id='3', text='\nGæstebogs indlæg: *\n๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ณ ๐ ๐ฌ ๐ ๐ ๐ข ๐ ๐ญ ๐ ๐ ๐ฉ ๐ฎ ๐ฑ ๐ ๐ก ๐ค ๐ ๐ ๐ด ๐ ๐ ๐ ๐ ๐ ๐ฒ ๐ฎ ๐ ๐ถ โค ๐ ๐ ๐ ๐ ๐ ๐ ๐ฐ ๐ ๐ ๐ ๐ โ๏ธ ๐ค ๐ ๐ต โ๏ธ ๐ต ๐บ ๐ท ๐ผ โ๏ธ ๐ค ๐ฆ ๐ง ๐ ๐ ๐ ๐ \n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) ... takes 'forever'
It seems to be a bug in the regex Python package. If I swap the regex package for the standard library re package, the tagger takes only milliseconds again. I am not sure which feature of the regex package makes it necessary here, but this bug makes me question whether we will run into something similar with other regex patterns.
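For anyone who wants to reproduce the engine comparison without the full tagger, something along these lines works. The pattern here is only a placeholder, not the pattern that not_alphanum_paragraph_v1 actually compiles, so the absolute numbers will differ; the point is just timing the same pattern and text under both engines.

```python
import re
import time

import regex

PATTERN = r"[^\w\s]+"               # placeholder; not the tagger's actual pattern
TEXT = "😀 😁 😂 🤣 😃 😄 😅 " * 10  # emoji-heavy paragraph, similar in spirit to the hplt rows

for engine in (re, regex):
    compiled = engine.compile(PATTERN)
    start = time.perf_counter()
    compiled.findall(TEXT)
    print(f"{engine.__name__}: {time.perf_counter() - start:.6f} seconds")
```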
We encountered the bug while trying to create an overview of the taggers: https://github.com/centre-for-humanities-computing/danish-foundation-models/issues/207#issuecomment-1946399142
Yikes. Probably the easiest way to tackle this is to create two versions of the taggers: one using regex, the other using re.
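Roughly, something like the sketch below, where the same scoring logic is parameterised over the engine module. The class name, pattern, and scoring are illustrative only, not dolma's actual implementation or registration mechanism.

```python
import re

import regex


class NotAlphanumParagraphTagger:
    """Sketch: identical logic, selectable regex engine."""

    _PATTERN = r"[^\w\s]+"  # placeholder, not the real tagger's pattern

    def __init__(self, engine=regex):
        # `engine` is either the stdlib `re` module or the third-party `regex` module
        self.expr = engine.compile(self._PATTERN)

    def score(self, text: str) -> float:
        # fraction of characters that are neither alphanumeric nor whitespace
        matched = sum(len(m) for m in self.expr.findall(text))
        return matched / max(len(text), 1)


# two hypothetical variants: the existing regex-based tagger and an re-based one
not_alphanum_paragraph_regex = NotAlphanumParagraphTagger(engine=regex)
not_alphanum_paragraph_re = NotAlphanumParagraphTagger(engine=re)
```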