
Training in Windows

Open topl0305 opened this issue 10 months ago • 11 comments

It's not so easy to use your code in Windows. I'm using Windows 10.

It took me some time to figure out the preparation steps, but here they are:

Download stanza_train. Put the UD directory into the data/udbase directory. Download Perl from https://strawberryperl.com/ (WHY DO WE NEED IT?) and install it.

IMPORTANT. As I couldn't install it (YES, I HAVE NO ADMIN RIGHTS!), I used the Portable version and extracted its content to the C:\StrawberryPerl directory. In this case you also need to modify the config.bat file AND add one more step:

set PATH=C:\StrawberryPerl\perl\site\bin;C:\StrawberryPerl\perl\bin;C:\StrawberryPerl\c\bin;%PATH%

The fully modified config.bat file now looks like this:


REM set environment variables for the training and testing of stanza modules.

REM set UDBASE to the location of UD data folder
REM The data should be CoNLL-U format
REM For details, see http://universaldependencies.org/conll18/data.html (CoNLL-18 UD data)
set UDBASE=data\udbase

REM set NERBASE to the location of NER data folder
REM The data should be BIO format
REM For details, see https:\www.aclweb.org\anthology\W03-0419.pdf (CoNLL-03 NER paper)
set NERBASE=\data\nerbase

REM set directories to store processed training\evaluation files
set DATA_ROOT=data\processed
set TOKENIZE_DATA_DIR=%DATA_ROOT%\tokenize
set MWT_DATA_DIR=%DATA_ROOT%\mwt
set LEMMA_DATA_DIR=%DATA_ROOT%\lemma
set POS_DATA_DIR=%DATA_ROOT%\pos
set DEPPARSE_DATA_DIR=%DATA_ROOT%\depparse
set ETE_DATA_DIR=%DATA_ROOT%\ete
set NER_DATA_DIR=%DATA_ROOT%\ner
set CHARLM_DATA_DIR=%DATA_ROOT%\charlm

REM set directories to store external word vector data
set WORDVEC_DIR=\data\wordvec

REM set perl to PATH
set PATH=C:\StrawberryPerl\perl\site\bin;C:\StrawberryPerl\perl\bin;C:\StrawberryPerl\c\bin;%PATH%


I'm running everything from cmd. I'm now in the stanza_train directory and have executed config.bat. The next step is running the scripts:

python -m stanza.utils.datasets.prepare_tokenizer_treebank UD_Lithuanian-ALKSNIS
python -m stanza.utils.training.run_tokenizer UD_Lithuanian-ALKSNIS

The first error was about encoding. YES, characters can be up to 4 bytes! So I edited line 49 of stanza/models/tokenization/data.py:

with open(txt_file, encoding="utf-8") as f:
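
A small standalone illustration of why the explicit encoding matters. Without encoding=, open() uses the locale's preferred encoding, which on Windows is often a legacy code page rather than UTF-8. The helper name utf8_roundtrip is made up for this sketch and has nothing to do with stanza itself.

```python
import locale
import os
import tempfile


def utf8_roundtrip(text):
    """Write text to a scratch file as UTF-8 and read it back the same way."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)  # only the path is needed; release the handle first
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        # Passing encoding= explicitly makes the read independent of the locale.
        with open(path, encoding="utf-8") as f:
            return f.read()
    finally:
        os.remove(path)


# What open() would fall back to when no encoding is given.
print("locale default encoding:", locale.getpreferredencoding(False))
print("round trip ok:", utf8_roundtrip("ąčęėįšųūž") == "ąčęėįšųūž")
```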

Training appeared to start, and then came the next error: PermissionError, permission denied.

The log looks like this:

train_log.txt

Looks like I need to figure out how to manage temporary files. Most likely there are differences between Linux and Windows temp files. Hopefully it's the last error... once I figure out how to solve it.

topl0305 avatar Jan 20 '25 12:01 topl0305

Addendum.

For crying out loud, why does anyone need to use temp dirs in OSes? In stanza/utils/conll.py, method write_doc2conll(), I changed this code

with open(filename, mode, encoding=encoding) as outfile:
    outfile.write("{:C}\n\n".format(doc))

into this

try:
    outfile = open(filename, mode, encoding=encoding)
except IOError as ioe:
    print(f"I/O ERROR. Can't write to {filename}")
    print(ioe)
else:
    outfile.write("{:C}\n\n".format(doc))

Still not sure if it's a good thing, but now the training began. After some time... Interestingly, training stopped once there was no more improvement, and then the PermissionError came again. Looks like the temporary file is necessary.

topl0305 avatar Jan 21 '25 06:01 topl0305

There is no sense or profit in making it work on Windows.

You should try using Windows Subsystem for Linux for this.

k-sap avatar Jan 21 '25 06:01 k-sap

There is always sense in it, just no profit.

Important part of your suggestion:

You can now install everything you need to run WSL with a single command. Open PowerShell or Windows Command Prompt in administrator mode by right-clicking and selecting "Run as administrator",

I will repeat myself - I HAVE NO ADMIN RIGHTS

topl0305 avatar Jan 21 '25 07:01 topl0305

Maybe the last update.

The culprit was stanza/utils/training/run_tokenizer.py. I modified these variables to:

    train_pred = f"{tokenize_dir}/{short_name}.train.pred.conllu"
    dev_pred = f"{tokenize_dir}/{short_name}.dev.pred.conllu"
    test_pred = f"{tokenize_dir}/{short_name}.test.pred.conllu"

And voila

2025-01-21 09:51:57 INFO: lt_alksnis: token F1 = 99.95, sentence F1 = 89.98, mwt F1 = 99.95
2025-01-21 09:51:57 INFO: Dev score: 94.725     lr: 0.001974 -> 0.001972        Stopping training after 5200 steps with no improvement
2025-01-21 09:51:57 INFO: Best dev score=0.9613531110916763 at step 2200
...
2025-01-21 09:52:01 INFO: Finished running dev set on
UD_Lithuanian-ALKSNIS
   Tokens Sentences     Words
    99.91     92.60     99.91
...
2025-01-21 09:52:04 INFO: Finished running test set on
UD_Lithuanian-ALKSNIS
   Tokens Sentences     Words
    99.74     89.14     99.74

topl0305 avatar Jan 21 '25 07:01 topl0305

The primary reason for the temp dir is to make the results not permanently stick around. You can provide the --save_output flag to the run_... scripts and it should turn off the temp dirs on Windows.
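
As a rough sketch, this is how the flag could be added when launching the tokenizer run from Python. The module name and treebank are the ones used earlier in this thread; run_tokenizer_command is just an illustrative helper, not part of stanza.

```python
import sys


def run_tokenizer_command(treebank, save_output=True):
    """Build the command line for the tokenizer training script."""
    cmd = [sys.executable, "-m", "stanza.utils.training.run_tokenizer", treebank]
    if save_output:
        # Keep the prediction files instead of writing them to a temp file.
        cmd.append("--save_output")
    return cmd


cmd = run_tokenizer_command("UD_Lithuanian-ALKSNIS")
print(" ".join(cmd))
# To actually start training:
# import subprocess; subprocess.run(cmd, check=True)
```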

AngledLuffa avatar Jan 21 '25 08:01 AngledLuffa

Good to know.

But I think I also found a simple solution if you want to use temporary files.

  • In the stanza-train directory, create a new "temp" directory.
  • In the config file, add one more line (well, two):
REM set directory for temporary files
set TEMP=temp
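
To check which directory Python will actually use after such a change, tempfile.gettempdir() can be queried. The sketch below (temp_dir_for is a made-up helper) sets TMPDIR, TEMP and TMP temporarily, clears the cached value in tempfile.tempdir, and reports the result. Note that the directory has to exist and be writable, otherwise tempfile silently falls back to another candidate, and that a relative value such as "temp" is resolved against the current working directory.

```python
import os
import tempfile

_TEMP_VARS = ("TMPDIR", "TEMP", "TMP")


def temp_dir_for(directory):
    """Return the temp dir Python would pick if the temp variables pointed at directory."""
    saved_env = {name: os.environ.get(name) for name in _TEMP_VARS}
    saved_cache = tempfile.tempdir
    try:
        for name in _TEMP_VARS:
            os.environ[name] = directory
        tempfile.tempdir = None  # drop the cached choice so it is recomputed
        return tempfile.gettempdir()
    finally:
        # Restore the environment and the cached value exactly as they were.
        for name, value in saved_env.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value
        tempfile.tempdir = saved_cache
```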

topl0305 avatar Jan 22 '25 10:01 topl0305

Finally found the culprit. Check this piece of code:

import tempfile
with tempfile.NamedTemporaryFile() as tmp:
    print(f'File - {tmp.name}')
    f = open(tmp.name, 'w')
    f.write('test')
    f.close()

    
File - C:\Users\T2EEC~1.PLA\AppData\Local\Temp\tmp4u1kwsht
Traceback (most recent call last):
  File "<pyshell#6>", line 3, in <module>
    f = open(tmp.name, 'w')
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\T2EEC~1.PLA\\AppData\\Local\\Temp\\tmp4u1kwsht'

And now let's go to lines 188-190 of stanza/utils/training/common.py:

with tempfile.NamedTemporaryFile() as temp_output_file:
    run_treebank(mode, paths, treebank, short_name,
                 temp_output_file.name, command_args, extra_args + save_name_args)

And documentation says:

Opening the temporary file again by its name while it is still open works as follows:

  • On Windows, make sure that at least one of the following conditions are fulfilled:
  • delete is false ....etc.

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    print(f'File - {tmp.name}')
    f = open(tmp.name, 'w')
    f.write('test')
    f.close()
    f = open(tmp.name, 'r')
    print(f'File content: {f.read()}')
    f.close()

    
File - C:\Users\T2EEC~1.PLA\AppData\Local\Temp\tmpapujsw25
4
File content: test

So, can you check whether setting delete=False works on Linux and change the code to work on both OSes?
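
One pattern that should behave the same on both systems is to create the file with delete=False, close the handle right away, and remove the file manually once the work is done. This is a sketch of the general idea only, not what stanza does; run_with_temp_output and its callback argument are names invented for the example.

```python
import os
import tempfile


def run_with_temp_output(callback):
    """Call callback(path) with a temp file path that can be reopened by name."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.close()  # release the handle so Windows allows opening the file again
    try:
        return callback(tmp.name)
    finally:
        # Manual cleanup replaces the automatic delete-on-close behaviour.
        if os.path.exists(tmp.name):
            os.remove(tmp.name)


def demo(path):
    with open(path, "w") as f:
        f.write("test")
    with open(path) as f:
        return f.read()


print(run_with_temp_output(demo))
```

The file still disappears at the end, so the results do not stick around, but no second handle is ever opened while the first one is held.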

topl0305 avatar Jan 22 '25 11:01 topl0305

It is the point of tmp files to be deleted.

k-sap avatar Jan 22 '25 12:01 k-sap

It is the point of tmp files to be deleted.

Did you read the documentation on how it works on Windows?

topl0305 avatar Jan 22 '25 13:01 topl0305

Very interesting. I used an online tool to test the delete parameter; the code is below. If delete is set to False, the file is still available after the with statement, but if we comment out the with statement and run the same code again, it's gone. Most likely the file is available only until the program stops. The bigger issue is with the Python documentation: the NamedTemporaryFile description differs drastically before and after version 3.12 (or was poorly written), and so does the number of parameters. But it looks safe to use the delete parameter.

import platform, tempfile, sys
print(platform.system(), platform.release(), platform.version())
print(f'Python {sys.version}')

with tempfile.NamedTemporaryFile(delete=False) as tmp:
     fn = tmp.name
     print(f'File - {fn}')
     f = open(tmp.name, 'w')
     f.write('test')
     f.close()
     f = open(tmp.name, 'r')
     print(f'File content: {f.read()}')
     f.close()
    
f = open(fn) #<- Put file name here after commenting with statement
print(f.read())
f.close()

Output  

Linux 6.1.0-22-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.94-1 (2024-06-21)
Python 3.8.5 (default, Jul 20 2020, 23:11:29) 
[GCC 9.3.0]
File - /tmp/tmp_ribtep1
File content: test

test

Output after commenting:

Traceback (most recent call last):
  File "main.py", line 15, in <module>
    f = open('/tmp/tmp_ribtep1')
FileNotFoundError: [Errno 2] No such file or directory: ...
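
The extra parameter in 3.12 is delete_on_close. With delete left at True and delete_on_close=False, the file survives close() and is removed when the with block exits, which is the case the documentation describes for reopening by name on Windows. A sketch with a fallback for older versions (roundtrip_through_temp_file is an invented name):

```python
import os
import sys
import tempfile


def roundtrip_through_temp_file(text):
    """Write text to a temp file, reopen it by name, return (content, path)."""
    if sys.version_info >= (3, 12):
        # close() keeps the file; leaving the with block deletes it.
        with tempfile.NamedTemporaryFile(
            mode="w", encoding="utf-8", delete_on_close=False
        ) as tmp:
            tmp.write(text)
            tmp.close()
            with open(tmp.name, encoding="utf-8") as f:
                content = f.read()
        return content, tmp.name

    # Before 3.12 there is no delete_on_close, so delete by hand instead.
    tmp = tempfile.NamedTemporaryFile(mode="w", encoding="utf-8", delete=False)
    try:
        tmp.write(text)
        tmp.close()
        with open(tmp.name, encoding="utf-8") as f:
            content = f.read()
    finally:
        tmp.close()
        os.remove(tmp.name)
    return content, tmp.name
```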

topl0305 avatar Jan 23 '25 06:01 topl0305

Hmm, I wouldn't say it was necessary to close the issue. There are definitely some actionable items here, including finding a way to get rid of the perl dependency. It may also be helpful / optimal to detect if the permissions for deleting a temp file are not allowed, and if so, keep the results file instead of crashing.
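
Something along these lines is what I have in mind for the second item. It is only a sketch with a hypothetical helper name, not code that exists in stanza today:

```python
import logging
import os

logger = logging.getLogger(__name__)


def remove_or_keep(path):
    """Try to delete path; return True if it is gone, False if it was kept."""
    try:
        os.remove(path)
    except FileNotFoundError:
        return True  # nothing left to clean up
    except PermissionError as err:
        # Keep the results file rather than crashing the training run.
        logger.warning("Could not delete %s (%s); keeping it", path, err)
        return False
    return True
```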

AngledLuffa avatar Jan 23 '25 07:01 AngledLuffa

Well, I can't promise this will fix everything in terms of training on Windows, but I converted the conllu_to_text.pl script to Python. It works exactly the same on several datasets I tested.
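
For readers wondering what such a conversion involves, here is a heavily simplified sketch of the idea: rebuild each sentence's surface text from the FORM column and the SpaceAfter=No marker in MISC. This is not the converted script itself, which may handle more cases (document and paragraph boundaries, for example); conllu_to_text here is just an illustrative function.

```python
def conllu_to_text(conllu):
    """Return one reconstructed text string per sentence of a CoNLL-U string."""
    sentences, pieces, skip_until = [], [], 0
    for line in conllu.splitlines():
        if not line.strip():  # a blank line ends the sentence
            if pieces:
                sentences.append("".join(pieces).rstrip())
            pieces, skip_until = [], 0
            continue
        if line.startswith("#"):  # sentence-level comments
            continue
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if "." in tok_id:  # empty nodes carry no surface text
            continue
        if "-" in tok_id:
            # Multiword token: use its surface form, skip the words it spans.
            skip_until = int(tok_id.split("-")[1])
        elif int(tok_id) <= skip_until:
            continue
        space = "" if "SpaceAfter=No" in misc.split("|") else " "
        pieces.append(form + space)
    if pieces:
        sentences.append("".join(pieces).rstrip())
    return sentences
```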

Regarding deleting temp files, I didn't want to keep the output with delete=False, especially not in the temp directory. If that's causing headaches (and I can verify it does with Python 3.10 on Windows), there is a flag --save_output which will keep the output file rather than trying to delete it.

AngledLuffa avatar Sep 19 '25 07:09 AngledLuffa

Alright, I can't promise that each individual NER, Sentiment, etc. dataset is properly using utf-8 encoding on Windows, but the UD-based ones certainly are. If any NER models don't rebuild because of encoding issues, we can always revisit it. This fix is now merged in the dev branch and will be released soon in 1.11.

AngledLuffa avatar Sep 23 '25 02:09 AngledLuffa