texterrors
RuntimeError: Word is too long!
Is there any inherent limit on the length of reference and hypothesis strings?
The following code snippet raised a RuntimeError for me.
ref_string = "ಕಾಶ್ಮೀರ ವಿಚಾರ ಕುರಿತಂತೆ ಪ್ರತಿಕ್ರಿಯಿಸಿದ ಅಮೇರಿಕ ಅಧ್ಯಕ್ಷ ಡೊನಾಲ್ಡ್ ಟ್ರಂಪ್ ಈ ಸಮಸ್ನೆಯನ್ನು ಭಾರತ ಮತ್ತು ಪಾಕಿಸ್ತಾನ ಸದ್ಯವೇ ಬಗೆಹರಿಸಿಕೊಳ್ಳಲಿವೆ ಎಂದು ಹೇಳಿದರು"
pred_string = "ಕಾಶ್ಮೀರ ವಿಚಾರ ಕುರಿತಂತೆ ಪ್ರತಿಕ್ರಿಯಿಸಿದ ಅಮೇರಿಕಾ ಅಧ್ಯಕ್ಷ ಡೊನಾಲ್ಡ್ ಟ್ರಂಪ್ ಈ ಸಮಸ್ಯೆಯನ್ನು ಭಾರತ ಮತ್ತು ಪಾಕಿಸ್ತಾನದಲ್ಲಿದೆ, ಸ್ಥಾನ ಸಧ್ಯವೇ ಬಗೆಹರಿಸಿಕೊಳ್ಳಲಿದೆ ಎಂದು ಹೇಳಿದರು."
ref_string_vector = StringVector(ref_string.split())
hypothesis_string_vector = StringVector(pred_string.split())
[aligned_a, aligned_b, cost] = align_texts(
    ref_string_vector,
    hypothesis_string_vector,
    use_chardiff=True,
    debug=False,
)
cost = texterrors_align.calc_sum_cost(summed_cost, words_a, words_b, use_chardiff, True)
RuntimeError: Word is too long! Increase buffer
Is there a workaround if I want to perform character-aware alignment with long strings?
Hey, thank you for reporting and for the example! A quick workaround is to download the codebase and modify the code to use a bigger buffer. However, after looking into it, I don't think this implementation of character-aware alignment actually makes sense for UTF-8. I'll aim to take a closer look and come up with something by the end of the week.
I'll also make a change so that one doesn't need to use StringVector when calling align_texts(). That makes it less work for the user; I've been wanting to do this for a while.
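To illustrate why UTF-8 matters here, the sketch below compares the number of Unicode code points in a token with the number of bytes in its UTF-8 encoding. It rests on an assumption that the thread does not confirm: that the "Word is too long" buffer is sized in bytes rather than in characters. The helper name `utf8_lengths` is made up for this example and is not part of texterrors.

```python
def utf8_lengths(tokens):
    """Return (token, code points, UTF-8 bytes) for each token."""
    return [(tok, len(tok), len(tok.encode("utf-8"))) for tok in tokens]


# A few tokens from the hypothesis string in the report above.
words = "ಕಾಶ್ಮೀರ ವಿಚಾರ ಪಾಕಿಸ್ತಾನದಲ್ಲಿದೆ,".split()

for tok, n_chars, n_bytes in utf8_lengths(words):
    # Every character in the Kannada block takes 3 bytes in UTF-8,
    # so the byte count grows much faster than the character count.
    print(f"{tok}: {n_chars} code points, {n_bytes} bytes")
```

If the buffer is indeed byte-based, a Kannada word would hit the limit at roughly a third of the character count that an ASCII word would, which may be why ordinary-looking words trigger the error.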
This is exciting @RuABraun. Looking forward to it.
On another note, the fact that align_texts() expects a StringVector in fact allowed me to pass a custom-tokenized hypothesis and reference. In the code snippet above I split on spaces, but my actual implementation uses a different tokenization. Of course, the input need not be a StringVector; a Python list of strings would do just as well.
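As a rough illustration of what a custom tokenization could look like, here is a minimal sketch that splits on whitespace and strips surrounding punctuation, producing a plain Python list of strings. The function `tokenize` and its `strip_chars` default are hypothetical and not the tokenizer from my actual implementation; whether the resulting list can be passed to align_texts() directly or still has to be wrapped in StringVector depends on the texterrors version.

```python
def tokenize(text, strip_chars=".,!?"):
    """Split on whitespace and strip leading/trailing punctuation."""
    stripped = (tok.strip(strip_chars) for tok in text.split())
    # Drop tokens that consisted only of punctuation.
    return [tok for tok in stripped if tok]


tokens = tokenize("ಪಾಕಿಸ್ತಾನದಲ್ಲಿದೆ, ಎಂದು ಹೇಳಿದರು.")
print(tokens)
```

Stripping by character set rather than with a `\w`-style regex is deliberate: Kannada words contain combining marks that a word-character pattern may not match, which would split words in the wrong places.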
Need more time, sorry.