
Cannot run text_classifier end to end

Open david-waterworth opened this issue 3 years ago • 2 comments

🐛 Bug

There's a minor issue with the text_classifier in the examples folder. When I run run_script.sh, it creates a .data folder, and the train command

python train.py AG_NEWS --device cpu --save-model-path model.i --dictionary vocab.i

works fine, but when it's finished there's only a single file, .data/datasets/AG_NEWS/train.csv. The subsequent predict command

cut -f 2- -d "," .data/AG_NEWS/test.csv | python predict.py model.i vocab.i > predict_script.o

expects a test.csv file, and in a different folder (.data/AG_NEWS/test.csv rather than .data/datasets/AG_NEWS/test.csv).
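A quick way to confirm where the CSV files actually landed is to search the .data folder recursively. The snippet below is a minimal sketch and not part of the example scripts; `find_split_files` is a hypothetical helper name, and it assumes the script is run from the directory that contains .data.

```python
from pathlib import Path


def find_split_files(root, filename="test.csv"):
    """Return every file called `filename` anywhere below `root`, sorted."""
    root = Path(root)
    if not root.is_dir():
        # Nothing has been downloaded under this root yet.
        return []
    return sorted(p for p in root.rglob(filename) if p.is_file())


if __name__ == "__main__":
    # Report where each split ended up, if anywhere, under .data
    for name in ("train.csv", "test.csv"):
        matches = find_split_files(".data", name)
        print(name, "->", [str(p) for p in matches] or "not found under .data")
```

If test.csv is reported as not found, the test split was either never downloaded or was written somewhere outside .data.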

To Reproduce

Steps to reproduce the behavior:

  1. ./run_script.sh
  2. After training, see error cut: .data/AG_NEWS/test.csv: No such file or directory

Environment

  • PyTorch Version (e.g., 1.0): 1.13.0+cu117
  • OS (e.g., Linux): Linux (Ubuntu 20.04)
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.8
  • CUDA/cuDNN version: 11.7

Versions of relevant libraries:

  • [pip3] numpy==1.23.5
  • [pip3] torch==1.13.0+cu117
  • [pip3] torchdata==0.5.0
  • [pip3] torchtext==0.14.0

david-waterworth avatar Nov 27 '22 04:11 david-waterworth

I think I found the problem: the train split is accessed twice, once to build the vocab and once to count the number of labels

https://github.com/pytorch/text/blob/ed78e3b014e67c672b8fd224e0fc8ecea6282ab0/examples/text_classification/train.py#L113

and

https://github.com/pytorch/text/blob/ed78e3b014e67c672b8fd224e0fc8ecea6282ab0/examples/text_classification/train.py#L123

But then, the third time, both the train and test splits are accessed, and this time the data_dir isn't specified, so I guess it's downloading into my home folder or wherever the datasets default is:

https://github.com/pytorch/text/blob/ed78e3b014e67c672b8fd224e0fc8ecea6282ab0/examples/text_classification/train.py#L131

If you change the line above to

   train_iter, test_iter = DATASETS[args.dataset](root=data_dir)

and fix the path in run_script.sh to

cut -f 2- -d "," .data/datasets/AG_NEWS/test.csv | python predict.py model.i vocab.i > predict_script.o

it runs end to end.
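To illustrate why the missing root argument matters, here is a minimal sketch that uses a stand-in loader instead of the real dataset function. Everything in it is hypothetical: `fake_dataset`, `DEFAULT_ROOT`, and the dummy CSV row are made up for illustration, and the `<root>/datasets/<name>/<split>.csv` layout is assumed only because it matches the paths reported above.

```python
import tempfile
from pathlib import Path

# Hypothetical default location, standing in for wherever a real loader
# would put its files when no root is passed.
DEFAULT_ROOT = Path(tempfile.gettempdir()) / "fake_dataset_cache"


def fake_dataset(root=DEFAULT_ROOT, split=("train", "test"), name="AG_NEWS"):
    """Stand-in loader: writes <root>/datasets/<name>/<split>.csv, returns the path(s)."""
    splits = (split,) if isinstance(split, str) else tuple(split)
    target = Path(root) / "datasets" / name
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for s in splits:
        path = target / f"{s}.csv"
        path.write_text('"1","title","body"\n')  # dummy row, not real data
        paths.append(path)
    return paths[0] if len(paths) == 1 else tuple(paths)


if __name__ == "__main__":
    data_dir = Path(tempfile.mkdtemp()) / ".data"

    # Earlier accesses pass the root, so train.csv lands under data_dir.
    fake_dataset(root=data_dir, split="train")

    # Root omitted: both splits go to DEFAULT_ROOT, not data_dir.
    fake_dataset()
    print((data_dir / "datasets" / "AG_NEWS" / "test.csv").exists())

    # Root passed consistently: test.csv now sits next to train.csv.
    fake_dataset(root=data_dir)
    print((data_dir / "datasets" / "AG_NEWS" / "test.csv").exists())
```

The point of the sketch is only that every access has to receive the same root for the shell script's path to line up with what the training script wrote.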

david-waterworth avatar Nov 27 '22 04:11 david-waterworth

@david-waterworth thanks so much for catching this. Do you want to submit a PR with these changes so I can help review? Otherwise I can get to this in a couple of weeks!

Nayef211 avatar Dec 01 '22 22:12 Nayef211