
Key error in Alignment

Open sanjanasri opened this issue 8 years ago • 4 comments

Hi,

I have successfully created the model for another language pair, Tamil and English. But when I try to do the alignment with `python yalign-align -a en -b ta en-ta en.txt ta.txt > aligned.txt`, I get a KeyError:

    Traceback (most recent call last):
      File "yalign-align", line 64, in <module>
        document_b = read_document(args['<document_b>'], lang_b)
      File "yalign-align", line 44, in read_document
        return text_to_document(text, language)
      File "/home/sanjana/Documents/Python_pgms/yalign/yalign/input_conversion.py", line 65, in text_to_document
        splitter = _sentence_splitters[language]
      File "/home/sanjana/Documents/Python_pgms/yalign/yalign/utils.py", line 82, in __missing__
        x = self.default_factory(key)
      File "/home/sanjana/Documents/Python_pgms/yalign/yalign/input_conversion.py", line 51, in <lambda>
        _sentence_splitters = Memoized(lambda lang: nltkload("tokenizers/punkt/%s.pickle" % CODES_TO_LANGUAGE[lang]))
    KeyError: 'ta'

It would be great if I could get an earnest reply.

PS: nltk does not support the Tamil language.

sanjanasri avatar Dec 10 '16 05:12 sanjanasri

Tamil is currently not a supported language for nltk and therefore Yalign fails to load the sentence splitter for Tamil.

I would recommend that you hack the _sentence_splitters function in yalign/input_conversion.py to implement a custom sentence splitting algorithm for Tamil. It does not need to be anything fancy: if you preprocess the input to Yalign, it could be as simple as text.split('\n') (i.e., one sentence per line).

rafacarrascosa avatar Dec 11 '16 13:12 rafacarrascosa

Thank you. I did something like this in yalign/input_conversion.py: `_sentence_splitters = text.split("\n")`. It works for other languages, but returns an empty file for Tamil.

If I run the command as `python /home/yalign/scripts/yalign-align -a ta -b en ta-en 2.txt 1.txt > aligned.txt`, I get a KeyError:

      File "/home/sanjana/Documents/Python_pgms/yalign/scripts/yalign-align", line 63, in <module>
        document_a = read_document(args['<document_a>'], lang_a)
      File "/home/sanjana/Documents/Python_pgms/yalign/scripts/yalign-align", line 44, in read_document
        return text_to_document(text, language)
      File "build/bdist.linux-x86_64/egg/yalign/input_conversion.py", line 65, in text_to_document
      File "build/bdist.linux-x86_64/egg/yalign/utils.py", line 82, in __missing__
      File "build/bdist.linux-x86_64/egg/yalign/input_conversion.py", line 51, in <lambda>
    KeyError: 'ta'

So I used en instead of ta. I don't get an error, but the output file is empty. I do not know where I am going wrong. Please help.
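
A possible thing to check, offered as a guess rather than a diagnosis: the paths in this traceback point into `build/bdist.linux-x86_64/egg/...` rather than the source checkout, which suggests an installed egg is being imported, so edits made in the checkout may never run. The sketch below prints where a module is actually loaded from; `module_path` is a hypothetical helper, and `json` is used only as a stand-in so the snippet runs anywhere.

```python
import importlib


def module_path(name):
    """Return the file a module is actually imported from, if it has one."""
    module = importlib.import_module(name)
    return getattr(module, "__file__", None)


# "json" is only a stand-in so the sketch runs anywhere; for this issue the
# name to check would be "yalign.input_conversion".
print(module_path("json"))
```

If the printed path is not the file that was edited, reinstalling the package or running against the checkout would be the next thing to try.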

Thanks and regards,

Sanjanasri J.P

sanjanasri avatar Dec 12 '16 09:12 sanjanasri

I am sorry Sanjanasri, but I have too much work right now to walk you through debugging that output.

If you have some programming skills my recommendation remains: hack that function. If you do not, perhaps someone from the community can give you a hand.

Regards,

Rafael

rafacarrascosa avatar Dec 12 '16 13:12 rafacarrascosa

Sanjanasri, at line 31 (or thereabouts) of input_conversion.py there is a statement:

    CODES_TO_LANGUAGE = {
        "cs": "czech",
        "da": "danish",
        "de": "german",
        "el": "greek",
        "en": "english",
        "es": "spanish",
        "et": "estonian",
        "fi": "finnish",
        "fr": "french",
        "it": "italian",
        "nb": "norwegian",
        "pl": "polish",
        "pt": "portuguese",
        "nl": "dutch",
        "sv": "swedish",
        "tr": "turkish",
    }

I suggest you add `"ta": "tamil"` to that. You'll probably find more problems after that, but it should stop the key error at least.
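
To illustrate what the extra entry changes, here is a small sketch using an abbreviated copy of the mapping. `punkt_resource` is a hypothetical helper that only reproduces the string formatting visible in the traceback; it does not call NLTK.

```python
# Abbreviated copy of the mapping quoted above, with the suggested entry added.
CODES_TO_LANGUAGE = {
    "en": "english",
    "es": "spanish",
    "ta": "tamil",  # suggested addition
}


def punkt_resource(lang):
    """Build the resource name that the traceback shows being passed to nltkload."""
    return "tokenizers/punkt/%s.pickle" % CODES_TO_LANGUAGE[lang]


print(punkt_resource("ta"))  # tokenizers/punkt/tamil.pickle
```

With the entry in place the dictionary lookup no longer raises `KeyError: 'ta'`. The next step would then ask NLTK for a Tamil punkt model, which, going by the earlier comments in this thread, NLTK does not ship, so that load is presumably where the next problem would appear.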

simontite-capita-ti avatar Dec 12 '16 14:12 simontite-capita-ti