
Error while adding a new language

Open AbdulHaseeb22 opened this issue 3 years ago • 3 comments

Hello, I'm trying to add a new language to Stanza, following https://stanfordnlp.github.io/stanza/new_language.html. I already have language data in CoNLL-U format, and with the command

python3 -m stanza.utils.charlm.conll17_to_text ./

I've successfully converted my CoNLL-U file into a .txt.xz file. Now I want to turn it into a suitable dataset with the following command:

python3 -m stanza.utils.charlm.make_lm_data ./extern_data/charlm_raw ./extern_data/charlm

When I run this command I get the output below, ending with an error, and I don't know if I'm doing something wrong.

Output:

Processing files: source root: ./extern_data/charlm_raw target root: ./extern_data/charlm

1 total languages found: ['SINDHIDATA.txt.xz']

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/msi/.local/lib/python3.10/site-packages/stanza/utils/charlm/make_lm_data.py", line 139, in <module>
    main()
  File "/home/msi/.local/lib/python3.10/site-packages/stanza/utils/charlm/make_lm_data.py", line 69, in main
    data_dirs = os.listdir(lang_root)
NotADirectoryError: [Errno 20] Not a directory: 'extern_data/charlm_raw/SINDHIDATA.txt.xz'
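For context, the traceback shows `os.listdir` being called on the `.txt.xz` file itself: `make_lm_data` lists each entry under the source root expecting per-language directories, and `os.listdir` raises `NotADirectoryError` when given a plain file. A minimal sketch of the failure, using a temporary directory in place of the real `extern_data` tree:

```python
import os
import pathlib
import tempfile

# Recreate the situation: a .txt.xz file sitting directly under the
# source root, where make_lm_data expects per-language directories.
root = pathlib.Path(tempfile.mkdtemp())
data_file = root / "SINDHIDATA.txt.xz"
data_file.write_bytes(b"")

try:
    # The traceback shows make_lm_data calling os.listdir() on this entry
    os.listdir(data_file)
except NotADirectoryError as err:
    print(err)  # e.g. [Errno 20] Not a directory: '.../SINDHIDATA.txt.xz'
```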

AbdulHaseeb22 avatar Jul 12 '22 16:07 AbdulHaseeb22

It looks like you have not arranged the directories as expected. You would need to put the file in a directory such as:

extern_data/charlm_raw/sd/<dataset_name>/SINDHIDATA.txt.xz
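A sketch of rearranging the file into that layout, assuming "sd" as the language code for Sindhi and an arbitrary, hypothetical dataset name ("sindhidata"), and again using a temporary directory in place of the real tree:

```python
import pathlib
import shutil
import tempfile

# Start from the layout that triggered the error: the converted file
# sits directly under the charlm_raw source root.
root = pathlib.Path(tempfile.mkdtemp())
raw = root / "extern_data" / "charlm_raw"
raw.mkdir(parents=True)
src = raw / "SINDHIDATA.txt.xz"
src.touch()  # stand-in for the real converted file

# Move it under <source root>/<language>/<dataset_name>/
dest_dir = raw / "sd" / "sindhidata"
dest_dir.mkdir(parents=True)
shutil.move(str(src), str(dest_dir / src.name))

print([p.relative_to(raw).as_posix() for p in raw.rglob("*.txt.xz")])
# ['sd/sindhidata/SINDHIDATA.txt.xz']
```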

I clarified some of the charlm steps here:

https://stanfordnlp.github.io/stanza/ner_new_language.html#charlm-and-bert

(I may separate that into a new page at some point)

May I ask, what data have you found for Sindhi? A brief search found that there's an NER dataset and a non-UD dependency treebank. In terms of tokenization, is there anything that needs to be done, or is it mostly sufficient to do whitespace tokenization for this language?


AngledLuffa avatar Jul 12 '22 17:07 AngledLuffa

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 16 '22 00:09 stale[bot]

@AbdulHaseeb22 Did you ever find any large corpus of written Sindhi? I have found a couple small sources: there is about 30MB of Wikipedia data and 110MB of Common Crawl data. We could theoretically build new word vectors and/or a charlm out of this data, but it's quite small. If you know of a larger source of data for this, we can make better models overall.

There's an NER dataset available, and I've been discussing tokenization data with a group at Isra University, so we should be able to put together some Sindhi models soon. More data will make for better models, though!

AngledLuffa avatar Sep 16 '22 00:09 AngledLuffa


I have a tokenizer and an NER model - let me make sure they're okay with releasing them.

AngledLuffa avatar Nov 19 '22 18:11 AngledLuffa


In fact, those models are now available on the dev branch, and they will be made part of an official release soon.

AngledLuffa avatar Jan 20 '23 22:01 AngledLuffa