Training in Windows
It's not so easy to use your code on Windows. I'm using Windows 10. It took me some time to figure out the preparation steps, but here they are:
- Download stanza-train.
- Put the UD directory into the data/udbase directory.
- Download Perl from https://strawberryperl.com/ (WHY DO WE NEED IT?) and install it.
IMPORTANT: As I couldn't install it (YES, I HAVE NO ADMIN RIGHTS!), I used the Portable version instead. In that case, extract the contents to the C:\StrawberryPerl directory. You also need to modify the config.bat file and add one more step:
set PATH=C:\StrawberryPerl\perl\site\bin;C:\StrawberryPerl\perl\bin;C:\StrawberryPerl\c\bin;%PATH%
The fully modified config.bat file now looks like this:
REM set environment variables for the training and testing of stanza modules.

REM set UDBASE to the location of UD data folder
REM The data should be CoNLL-U format
REM For details, see http://universaldependencies.org/conll18/data.html (CoNLL-18 UD data)
set UDBASE=data\udbase

REM set NERBASE to the location of NER data folder
REM The data should be BIO format
REM For details, see https://www.aclweb.org/anthology/W03-0419.pdf (CoNLL-03 NER paper)
set NERBASE=data\nerbase

REM set directories to store processed training/evaluation files
set DATA_ROOT=data\processed
set TOKENIZE_DATA_DIR=%DATA_ROOT%\tokenize
set MWT_DATA_DIR=%DATA_ROOT%\mwt
set LEMMA_DATA_DIR=%DATA_ROOT%\lemma
set POS_DATA_DIR=%DATA_ROOT%\pos
set DEPPARSE_DATA_DIR=%DATA_ROOT%\depparse
set ETE_DATA_DIR=%DATA_ROOT%\ete
set NER_DATA_DIR=%DATA_ROOT%\ner
set CHARLM_DATA_DIR=%DATA_ROOT%\charlm

REM set directories to store external word vector data
set WORDVEC_DIR=data\wordvec

REM set perl to PATH
set PATH=C:\StrawberryPerl\perl\site\bin;C:\StrawberryPerl\perl\bin;C:\StrawberryPerl\c\bin;%PATH%
I'm running everything from cmd. So I'm now in the stanza-train directory and have executed config.bat. The next step is executing the scripts:
python -m stanza.utils.datasets.prepare_tokenizer_treebank UD_Lithuanian-ALKSNIS
python -m stanza.utils.training.run_tokenizer UD_Lithuanian-ALKSNIS
The first error was about encoding. YES, UTF-8 characters can be up to 4 bytes! So I edited stanza/models/tokenization/data.py, line 49, to:
with open(txt_file, encoding="utf-8") as f:
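The underlying cause (a minimal illustration, not the project's code): on Windows, open() without an explicit encoding falls back to the locale's legacy code page rather than UTF-8, so multi-byte UTF-8 text in the treebank files breaks the read:

import locale

# On Windows this typically prints a legacy code page such as cp1252 or
# cp1257, not utf-8 -- and that is what open() uses when no encoding is given.
print(locale.getpreferredencoding(False))

# Hence the treebank files need the encoding spelled out:
# with open(txt_file, encoding="utf-8") as f:
#     text = f.read()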
It looks like training started, and then came the next error: PermissionError (permission denied).
The logs look like this:
It looks like I need to figure out how to manage temporary files. Most likely there are differences between Linux and Windows temp file handling. Hopefully it's the last error... once I figure out how to solve it.
Addendum.
For crying out loud, why does anyone need to use temp dirs? I modified the write_doc2conll() method in stanza/utils/conll.py, from this code:
with open(filename, mode, encoding=encoding) as outfile:
    outfile.write("{:C}\n\n".format(doc))
into this:
try:
    outfile = open(filename, mode, encoding=encoding)
except IOError as ioe:
    print(f"I/O ERROR. Can't write to {filename}")
    print(ioe)
else:
    outfile.write("{:C}\n\n".format(doc))
    outfile.close()
I'm still not sure if it's a good thing, but the training began. After some time... now this is interesting: training stopped after no more improvement, and again a PermissionError. It looks like the temporary file is necessary.
There is no sense or profit in making it work on Windows.
You should try using Windows Subsystem for Linux for this.
There is always a sense, but no profit.
The important part of your suggestion:
You can now install everything you need to run WSL with a single command. Open PowerShell or Windows Command Prompt in administrator mode by right-clicking and selecting "Run as administrator"...
I will repeat myself: I HAVE NO ADMIN RIGHTS.
Maybe the last update.
The culprit was stanza/utils/training/run_tokenizer.py. I modified these variables to:
train_pred = f"{tokenize_dir}/{short_name}.train.pred.conllu"
dev_pred = f"{tokenize_dir}/{short_name}.dev.pred.conllu"
test_pred = f"{tokenize_dir}/{short_name}.test.pred.conllu"
And voilà:
2025-01-21 09:51:57 INFO: lt_alksnis: token F1 = 99.95, sentence F1 = 89.98, mwt F1 = 99.95
2025-01-21 09:51:57 INFO: Dev score: 94.725 lr: 0.001974 -> 0.001972 Stopping training after 5200 steps with no improvement
2025-01-21 09:51:57 INFO: Best dev score=0.9613531110916763 at step 2200
...
2025-01-21 09:52:01 INFO: Finished running dev set on
UD_Lithuanian-ALKSNIS
Tokens Sentences Words
99.91 92.60 99.91
...
2025-01-21 09:52:04 INFO: Finished running test set on
UD_Lithuanian-ALKSNIS
Tokens Sentences Words
99.74 89.14 99.74
The primary reason for the temp dir is to make the results not permanently stick around. You can provide the --save_output flag to the run_... scripts and it should turn off the temp dirs on Windows.
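For example (assuming the flag is simply appended to the same run command used earlier):
python -m stanza.utils.training.run_tokenizer UD_Lithuanian-ALKSNIS --save_output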
Good to know.
But I think I also found a simple solution if you want to use temporary files:
- In the stanza-train directory we need to create a new "temp" directory.
- And in the config file we need to add one more line (well, two):
REM set directory for temporary files
set TEMP=temp
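This works because Python's tempfile module looks at the TMPDIR, TEMP and TMP environment variables before falling back to the platform default. A quick check (an illustration, not part of the project):

import os
import tempfile

# tempfile consults TMPDIR, TEMP and TMP (in that order) and caches the
# result on first use, so set the variable before anything asks for a
# temp directory. The directory must exist and be writable.
os.makedirs("temp", exist_ok=True)
os.environ["TEMP"] = os.path.abspath("temp")

print(tempfile.gettempdir())  # the local "temp" folder, unless TMPDIR is set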
Finally found the culprit. Check this piece of code:
import tempfile

with tempfile.NamedTemporaryFile() as tmp:
    print(f'File - {tmp.name}')
    f = open(tmp.name, 'w')
    f.write('test')
    f.close()
File - C:\Users\T2EEC~1.PLA\AppData\Local\Temp\tmp4u1kwsht
Traceback (most recent call last):
File "<pyshell#6>", line 3, in <module>
f = open(tmp.name, 'w')
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\T2EEC~1.PLA\\AppData\\Local\\Temp\\tmp4u1kwsht'
And now let's look at stanza/utils/training/common.py, lines 188-190:
with tempfile.NamedTemporaryFile() as temp_output_file:
    run_treebank(mode, paths, treebank, short_name,
                 temp_output_file.name, command_args, extra_args + save_name_args)
And the documentation says:
Opening the temporary file again by its name while it is still open works as follows:
- On Windows, make sure that at least one of the following conditions are fulfilled:
- delete is false ....etc.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    print(f'File - {tmp.name}')
    f = open(tmp.name, 'w')
    f.write('test')
    f.close()
    f = open(tmp.name, 'r')
    print(f'File content: {f.read()}')
    f.close()
File - C:\Users\T2EEC~1.PLA\AppData\Local\Temp\tmpapujsw25
4
File content: test
So. Can you check if setting delete=False works on Linux and change the code to work on both OSs?
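One cross-platform pattern (just a sketch of the idea, not the project's actual code) is to create the file with delete=False, close it immediately, hand the name to whatever needs it, and remove it manually afterwards:

import os
import tempfile

# Create the file non-deleting so it can be reopened by name on Windows,
# then clean it up ourselves once the work is done.
temp_output_file = tempfile.NamedTemporaryFile(delete=False, suffix=".conllu")
temp_output_file.close()
try:
    # here the name would be handed to run_treebank(...); as a stand-in,
    # just write to and read back from the file by name
    with open(temp_output_file.name, "w", encoding="utf-8") as f:
        f.write("test\n")
    with open(temp_output_file.name, encoding="utf-8") as f:
        print(f.read())
finally:
    os.unlink(temp_output_file.name)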
It is the point of tmp files to be deleted.
Did you read the documentation on how it works on Windows?
Very interesting. I used an online tool to test the delete parameter; the code is below. If the delete parameter is set to False, then after the with statement the file is still available, but if we comment out the with statement and run the same code again, it's gone. Most likely the file is only available until the program stops. The bigger issue is with the Python documentation: the NamedTemporaryFile class is different before version 3.12 and after. The descriptions differ drastically (or were poorly written), as does the number of parameters. But it looks safe to use the delete parameter.
import platform, tempfile, sys

print(platform.system(), platform.release(), platform.version())
print(f'Python {sys.version}')

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    fn = tmp.name
    print(f'File - {fn}')
    f = open(tmp.name, 'w')
    f.write('test')
    f.close()
    f = open(tmp.name, 'r')
    print(f'File content: {f.read()}')
    f.close()

f = open(fn)  # <- put the file name here after commenting out the with statement
print(f.read())
f.close()
Output
Linux 6.1.0-22-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.94-1 (2024-06-21)
Python 3.8.5 (default, Jul 20 2020, 23:11:29)
[GCC 9.3.0]
File - /tmp/tmp_ribtep1
File content: test
test
Output after commenting out the with statement:
Traceback (most recent call last):
File "main.py", line 15, in <module>
f = open('/tmp/tmp_ribtep1')
FileNotFoundError: [Errno 2] No such file or directory: ...
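For completeness: Python 3.12 added a delete_on_close parameter to NamedTemporaryFile that targets exactly this Windows case, letting the file be reopened by name while the context manager still deletes it on exit. A small sketch (requires 3.12+, so older Pythons would still need a fallback):

import sys
import tempfile

# delete_on_close is new in Python 3.12
assert sys.version_info >= (3, 12), "delete_on_close requires Python 3.12+"

with tempfile.NamedTemporaryFile(delete=True, delete_on_close=False) as tmp:
    tmp.close()  # closing early is allowed; the file stays on disk
    with open(tmp.name, "w") as f:
        f.write("test")
    with open(tmp.name) as f:
        print(f.read())  # -> test
# the temporary file is deleted here, on context manager exit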
Hmm, I wouldn't say it was necessary to close the issue. There are definitely some actionable items here, including finding a way to get rid of the perl dependency. It may also be helpful / optimal to detect if the permissions for deleting a temp file are not allowed, and if so, keep the results file instead of crashing.
Well, I can't promise this will fix everything in terms of training on Windows, but I converted the conllu_to_text.pl script to Python. It works exactly the same on several datasets I tested.
Regarding deleting temp files, I didn't want to keep the output with delete=False, especially not in the temp directory. If that's causing headaches (and I can verify it does with Python 3.10 on Windows), there is a flag --save_output which will keep the output file rather than trying to delete it.
Alright, I can't promise that each individual NER, Sentiment, etc. dataset is properly using utf-8 encoding on Windows, but the UD-based ones certainly are. If any NER models don't rebuild because of encoding issues, we can always revisit it. This fix is now merged in the dev branch and will be released soon in 1.11.