Cannot run text_classifier end to end
🐛 Bug
There's a minor issue with the text_classifier in the examples folder. When I run run_script.sh, it creates a .data folder. The train command
python train.py AG_NEWS --device cpu --save-model-path model.i --dictionary vocab.i
works fine, but when it's finished there's only a single file, .data/datasets/AG_NEWS/train.csv, whereas the subsequent predict command
cut -f 2- -d "," .data/AG_NEWS/test.csv | python predict.py model.i vocab.i > predict_script.o
is expecting a test.csv file, and in a different folder (.data/AG_NEWS/test.csv rather than .data/datasets/AG_NEWS/test.csv).
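To check where the splits actually ended up, something like the following can help. It is only a sketch: `find_split_files` is a helper name I made up for illustration, and it assumes `.data` is the folder run_script.sh created in the current working directory.

```python
from pathlib import Path


def find_split_files(root, filename="test.csv"):
    """Return every file named `filename` found anywhere below `root`, sorted."""
    root = Path(root)
    if not root.is_dir():
        # Nothing has been downloaded under this folder yet.
        return []
    return sorted(root.rglob(filename))


if __name__ == "__main__":
    # ".data" is assumed to be the folder created by run_script.sh.
    for name in ("train.csv", "test.csv"):
        matches = find_split_files(".data", name)
        print(name, "->", [str(p) for p in matches] or "not found")
```

If test.csv is reported as not found here, it was never written under .data, which would match the `cut` error.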
To Reproduce
Steps to reproduce the behavior:
- ./run_script.sh
- After training, see error
cut: .data/AG_NEWS/test.csv: No such file or directory
Environment
- PyTorch Version (e.g., 1.0): 1.13.0+cu117
- OS (e.g., Linux): Linux (Ubuntu 20.04)
- How you installed PyTorch (conda, pip, source): pip
- Build command you used (if compiling from source):
- Python version: 3.8
- CUDA/cuDNN version: 11.7
Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==1.13.0+cu117
[pip3] torchdata==0.5.0
[pip3] torchtext==0.14.0
I think I found the problem: the train split is accessed twice, once to build the vocab and once to count the number of labels:
https://github.com/pytorch/text/blob/ed78e3b014e67c672b8fd224e0fc8ecea6282ab0/examples/text_classification/train.py#L113
and
https://github.com/pytorch/text/blob/ed78e3b014e67c672b8fd224e0fc8ecea6282ab0/examples/text_classification/train.py#L123
But the third time, both the train and test splits are accessed, and this time data_dir isn't specified, so I guess it's downloading into my home folder or wherever the datasets default is:
https://github.com/pytorch/text/blob/ed78e3b014e67c672b8fd224e0fc8ecea6282ab0/examples/text_classification/train.py#L131
If you change the line above to
train_iter, test_iter = DATASETS[args.dataset](root=data_dir)
and fix the path in run_script.sh to
cut -f 2- -d "," .data/datasets/AG_NEWS/test.csv | python predict.py model.i vocab.i > predict_script.o
it runs end to end.
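Here is a minimal sketch of the failure mode as I understand it. It does not use torchtext: `fake_dataset` and `DEFAULT_ROOT` are hypothetical stand-ins, and the assumption is that the real loader falls back to some default root when `root` isn't passed and writes to `<root>/datasets/AG_NEWS/`.

```python
import tempfile
from pathlib import Path

# Hypothetical default cache location, standing in for whatever the library uses.
DEFAULT_ROOT = Path(tempfile.mkdtemp(prefix="default_cache_"))


def fake_dataset(root=DEFAULT_ROOT, split=("train", "test")):
    """Hypothetical stand-in loader: writes <root>/datasets/AG_NEWS/<split>.csv."""
    target = Path(root) / "datasets" / "AG_NEWS"
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in split:
        path = target / f"{name}.csv"
        path.write_text('"3","title","description"\n')
        paths.append(path)
    return paths


# Stand-in for the data_dir that train.py passes on the first two calls.
data_dir = Path(tempfile.mkdtemp(prefix="data_"))

# Earlier calls pass root explicitly, so train.csv lands under data_dir.
fake_dataset(root=data_dir, split=("train",))

# Omitting root sends both splits to the default location instead.
misplaced = fake_dataset()

# Passing root on every call keeps all splits under data_dir.
fixed = fake_dataset(root=data_dir)

print("without root:", [str(p) for p in misplaced])
print("with root:   ", [str(p) for p in fixed])
```

The point is only that every call needs the same `root`; otherwise test.csv ends up somewhere the predict command never looks.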
@david-waterworth thanks so much for catching this. Do you want to submit a PR with these changes so I can help review? Otherwise I can get to this in a couple of weeks!