
[QUESTION] Rebuilding an existing language from sources

[Open] student-nlp-project opened this issue 3 years ago • 9 comments

This is my first question, so apologies for any mistakes, but I haven't found information about rebuilding an existing language from its sources. I plan to do this before building a variant of the language with a variant of the corpus.

student-nlp-project avatar Aug 18 '22 08:08 student-nlp-project

https://stanfordnlp.github.io/stanza/retrain_ud.html

AngledLuffa avatar Aug 18 '22 11:08 AngledLuffa
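For reference, the retraining flow on that page amounts to pointing stanza at the UD sources, running a data-preparation script, and then running the trainer for each model. A rough sketch, assuming the UD treebanks are unpacked under a directory exported as UDBASE and that the installed version still exposes these entry points (the dataset module appears in the tracebacks below; exact flags can vary between releases):

```bash
# Tell stanza where the Universal Dependencies treebanks live.
export UDBASE=/path/to/ud_treebanks

# Convert the raw UD treebank into stanza's tokenizer training format.
python -m stanza.utils.datasets.prepare_tokenizer_treebank UD_English-EWT

# Retrain the tokenizer on the prepared data.
python -m stanza.utils.training.run_tokenizer UD_English-EWT
```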

Thanks, I had missed this part of the documentation.

I have tried the English-EWT example, and it seems to work, though training takes a long time. Same result for Spanish-GSD, but Spanish-AnCora crashes at the very beginning, claiming that files are missing somewhere, and AnCora is a very interesting corpus.

student-nlp-project avatar Aug 20 '22 13:08 student-nlp-project

These are the error lines, where [[PythonPath]] shortens the path to Python folder.

2022-08-20 15:08:50 INFO: Datasets program called with: [[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py UD_Spanish-AnCora-master
Traceback (most recent call last):
  File "[[PythonPath]]\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "[[PythonPath]]\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "[[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 1136, in <module>
    main()
  File "[[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 1133, in main
    common.main(process_treebank, add_specific_args)
  File "[[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\common.py", line 134, in main
    process_treebank(treebank, paths, args)
  File "[[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 1089, in process_treebank
    short_name = common.project_to_short_name(treebank)
  File "[[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\common.py", line 19, in project_to_short_name
    return treebank_to_short_name(treebank)
  File "[[PythonPath]]\Python39\lib\site-packages\stanza\models\common\constant.py", line 177, in treebank_to_short_name
    assert len(splits) == 2, "Unable to process %s" % treebank
AssertionError: Unable to process Spanish-AnCora-master

student-nlp-project avatar Aug 20 '22 13:08 student-nlp-project

The name is UD_Spanish-AnCora, not UD_Spanish-AnCora-master.

AngledLuffa avatar Aug 20 '22 14:08 AngledLuffa
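For anyone hitting the same AssertionError: stanza derives the short name by splitting the part after UD_ on "-" and expects exactly two pieces, Language-Treebank, which is what the assertion in the traceback checks. The -master suffix that GitHub's "Download ZIP" appends produces three pieces. A minimal fix, assuming the treebank was downloaded that way:

```bash
# GitHub's "Download ZIP" names the unpacked directory with a -master
# suffix; strip it so the name splits cleanly into Language-Treebank.
mv UD_Spanish-AnCora-master UD_Spanish-AnCora
```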

Thanks, I renamed the directory and now the trainer can access it, but I now get two different errors, one on Windows and one on Linux, and they both seem related to the training data format.

In Windows:

2022-08-20 16:59:31 INFO: Datasets program called with: [[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py UD_Spanish-AnCora
Preparing data for UD_Spanish-AnCora: es_ancora, es
Traceback (most recent call last):
  File "[[PythonPath]]\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "[[PythonPath]]\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "[[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 1136, in <module>
    main()
  File "[[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 1133, in main
    common.main(process_treebank, add_specific_args)
  File "[[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\common.py", line 134, in main
    process_treebank(treebank, paths, args)
  File "[[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 1123, in process_treebank
    process_ud_treebank(treebank, udbase_dir, tokenizer_dir, short_name, short_language, args.augment)
  File "[[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 1026, in process_ud_treebank
    prepare_ud_dataset(treebank, udbase_dir, tokenizer_dir, short_name, short_language, "train", augment)
  File "[[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 1016, in prepare_ud_dataset
    write_augmented_dataset(input_conllu, output_conllu, augment_punct)
  File "[[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 728, in write_augmented_dataset
    sents = read_sentences_from_conllu(input_conllu)
  File "[[PythonPath]]\Python39\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 86, in read_sentences_from_conllu
    for line in infile:
  File "[[PythonPath]]\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 5198: character maps to <undefined>

In Linux: [screenshot of the error output attached as an image]

student-nlp-project avatar Aug 20 '22 15:08 student-nlp-project

The Linux error is fixed on the dev branch. Please use that. The Windows error is fixable, but then there's another unicode error after it, plus you need perl installed for a later tool which hasn't been converted yet, so it's a bit of a hassle.

We are planning a new release, but there's always another last minute thing to do...

In general you can post text with ``` rather than pasting images

AngledLuffa avatar Aug 20 '22 17:08 AngledLuffa
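For readers hitting the same UnicodeDecodeError on Windows: Python there defaults to the cp1252 codec for files opened without an explicit encoding, while .conllu files are UTF-8. A possible workaround, not a stanza fix but standard Python behavior since 3.7 (PEP 540), is to enable UTF-8 mode before rerunning; whether this clears every error depends on the stanza version.

```bash
# Force Python's UTF-8 mode so open() defaults to UTF-8 instead of cp1252.
# cmd.exe:
set PYTHONUTF8=1
# PowerShell equivalent:
#   $env:PYTHONUTF8 = "1"
python -m stanza.utils.datasets.prepare_tokenizer_treebank UD_Spanish-AnCora
```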

Thanks for the heads-up about perl, and sorry for not having a better way of copying text from the Linux machine.

student-nlp-project avatar Aug 22 '22 17:08 student-nlp-project

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 22 '22 18:10 stale[bot]

not stale - will leave it open as a reminder to one day get rid of the perl dependency

AngledLuffa avatar Oct 22 '22 20:10 AngledLuffa