texterrors

RuntimeError: Word is too long!

Open kavyamanohar opened this issue 2 months ago • 3 comments

Is there any inherent limit on the length of reference and hypothesis strings?

The following code snippet raised a RuntimeError for me.

ref_string = "ಕಾಶ್ಮೀರ ವಿಚಾರ ಕುರಿತಂತೆ ಪ್ರತಿಕ್ರಿಯಿಸಿದ ಅಮೇರಿಕ ಅಧ್ಯಕ್ಷ ಡೊನಾಲ್ಡ್ ಟ್ರಂಪ್ ಈ ಸಮಸ್ನೆಯನ್ನು ಭಾರತ ಮತ್ತು ಪಾಕಿಸ್ತಾನ ಸದ್ಯವೇ ಬಗೆಹರಿಸಿಕೊಳ್ಳಲಿವೆ ಎಂದು ಹೇಳಿದರು"
pred_string = "ಕಾಶ್ಮೀರ ವಿಚಾರ ಕುರಿತಂತೆ ಪ್ರತಿಕ್ರಿಯಿಸಿದ ಅಮೇರಿಕಾ ಅಧ್ಯಕ್ಷ ಡೊನಾಲ್ಡ್ ಟ್ರಂಪ್ ಈ ಸಮಸ್ಯೆಯನ್ನು ಭಾರತ ಮತ್ತು ಪಾಕಿಸ್ತಾನದಲ್ಲಿದೆ, ಸ್ಥಾನ ಸಧ್ಯವೇ ಬಗೆಹರಿಸಿಕೊಳ್ಳಲಿದೆ ಎಂದು ಹೇಳಿದರು."

ref_string_vector = StringVector(ref_string.split())
hypothesis_string_vector = StringVector(pred_string.split())

[aligned_a, aligned_b, cost] = align_texts(
    ref_string_vector,
    hypothesis_string_vector,
    use_chardiff=True,
    debug=False,
)

    cost = texterrors_align.calc_sum_cost(summed_cost, words_a, words_b, use_chardiff, True)
RuntimeError: Word is too long! Increase buffer

Is there a workaround if I want to perform character-aware alignment with long strings?

kavyamanohar, Sep 24 '25 17:09

Hey, thank you for reporting this and for the example! A quick workaround is to download the codebase and modify the code to use a bigger buffer. However, after looking into it, I don't think this implementation of character-aware alignment actually makes sense for UTF-8. I'll aim to take a closer look and come up with something by the end of the week.

I'll also make a change so that one doesn't need to use StringVector when calling align_texts(). That makes it less work for a user, and I've been wanting to do this for a while.
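To illustrate the UTF-8 concern, here is a minimal sketch in plain Python. It does not call texterrors, and it assumes the internal buffer is sized in bytes, which this thread does not confirm. The helper name utf8_len is made up for the illustration. Every Kannada code point occupies three bytes in UTF-8, so a word's byte length is three times its length in code points, and a byte-sized buffer fills up much faster than the character count suggests.

```python
# Compare each word's length in Unicode code points with its length in
# UTF-8 bytes. The two Kannada words are taken from the example above.
words = ["Trump", "ಟ್ರಂಪ್", "ಬಗೆಹರಿಸಿಕೊಳ್ಳಲಿವೆ"]


def utf8_len(word: str) -> int:
    """Hypothetical helper: number of bytes the word occupies in UTF-8."""
    return len(word.encode("utf-8"))


for word in words:
    # len() counts code points; utf8_len() counts encoded bytes.
    print(f"{word}: {len(word)} code points, {utf8_len(word)} UTF-8 bytes")
```

If the character comparison walks over bytes rather than code points, it would also score partial matches between unrelated characters that happen to share a leading byte, which may be part of why the approach is questionable for UTF-8.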

RuABraun, Sep 24 '25 23:09

This is exciting @RuABraun. Looking forward to it.

On another note, align_texts() expecting a StringVector in fact allowed me to pass a custom-tokenized hypothesis and reference. In the code snippet above I split on spaces, but my actual implementation uses a different tokenization. Of course, the input need not be a StringVector; a Python list of strings would do.
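For anyone who needs something in the meantime, below is a rough pure-Python sketch of character-aware alignment over plain lists of tokens. It is not the texterrors implementation, and its cost model (insertions and deletions cost 1, substitutions cost the code-point edit distance divided by the longer word's length) is an assumption chosen for the illustration. The names GAP, char_distance, sub_cost and align are all hypothetical. Because it works on Python strings, it has no fixed buffer and accepts any tokenization.

```python
from typing import List, Sequence, Tuple

GAP = "<eps>"  # placeholder for an inserted or deleted token (assumed symbol)


def char_distance(a: str, b: str) -> int:
    """Levenshtein distance over Unicode code points, not bytes."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sub_cost(a: str, b: str) -> float:
    """Substitution cost in [0, 1]: 0 for identical tokens."""
    if a == b:
        return 0.0
    return char_distance(a, b) / max(len(a), len(b))


def align(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[List[str], List[str], float]:
    """Align two token lists; return both aligned lists and the total cost."""
    n, m = len(ref), len(hyp)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = float(i)
    for j in range(1, m + 1):
        cost[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + 1.0,  # token only in ref
                cost[i][j - 1] + 1.0,  # token only in hyp
                cost[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]),
            )
    # Walk back from the bottom-right cell to recover one optimal alignment.
    out_ref: List[str] = []
    out_hyp: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]):
            out_ref.append(ref[i - 1])
            out_hyp.append(hyp[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1.0:
            out_ref.append(ref[i - 1])
            out_hyp.append(GAP)
            i -= 1
        else:
            out_ref.append(GAP)
            out_hyp.append(hyp[j - 1])
            j -= 1
    return out_ref[::-1], out_hyp[::-1], cost[n][m]


# Any tokenizer works; whitespace splitting is used here only as an example.
aligned_ref, aligned_hyp, total = align("the cat sat".split(), "the cats sat down".split())
```

This runs in time proportional to the product of the two token counts, plus the character-level distance for each compared pair, so it will be much slower than a compiled implementation on long inputs.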

kavyamanohar, Sep 25 '25 03:09

Need more time, sorry.

RuABraun, Oct 01 '25 02:10