gpt-2-tensorflow2.0
RuntimeError: Internal: /sentencepiece/src/trainer_interface.cc(336) [!sentences_.empty()]
Hi, first of all, thanks for your work.
When I am trying to do preprocessing. I get following error message: RuntimeError: Internal: /sentencepiece/src/trainer_interface.cc(336) [!sentences_.empty()]
I am using a *.txt file uploaded to my Colab. I would like to know what it means and how to fix it. Thanks
Vincent
I have the same problem while doing preprocessing locally.
I cd'ed to the gpt-2-tensorflow2.0 dir and ran the following command:
python pre_process.py --data-dir="/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data" --vocab-size=32000
Tried it with the data from the "scraped" dir provided with the repo.
Please find the log in the attached file.
I've installed the dependencies using conda, as follows:
conda install setuptools ftfy tqdm Click tensorflow numpy
pip install sentencepiece
conda list
output:
Hi @vincsous and @RomanPlusPlus
Thanks for reporting the issue. I have fixed the issue please pull the code and test.
Thanks
Hi @akanyaani, and thank you. Preprocessing is working for me now, but I have another problem with training. Since I am using Colab, I do not have multiple GPUs, so I chose --distributed=False. Training seems to start, but it stops ("Training Done....") at step 20 with 11% accuracy. Here is the log: log_train.txt
Thanks again
Hi @akanyaani, thank you for your speedy response.
Unfortunately, the problem persists. I still get the same [!sentences_.empty()]
error.
Please find the log in the attached file.
Hi @RomanPlusPlus
It's working on my system. Could you please print the files in that directory?
Add a print statement in the train method of pre_process.py:
text_files = glob.glob((data_dir + "/*.txt"))
print(text_files)  # Add this and see whether it prints your text files
process_text(text_files)
train_byte_pair_encoding(vocab_size)
create_tf_records(min_seq_len, max_seq_len)
print("Pre-processing is done............")
This error occurs when text_files contains no text files. If text_files is an empty list, resolve the path issue first.
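For what it's worth, the message comes straight from SentencePiece rather than from this repo: training on an input file that contains no sentences reproduces it. A minimal sketch (the path and line number in the message depend on your sentencepiece version):
import sentencepiece as spm

# An input file with no sentences triggers the same internal check.
open("empty.txt", "w").close()

# Raises: RuntimeError: Internal: .../trainer_interface.cc(...) [!sentences_.empty()]
spm.SentencePieceTrainer.Train(
    "--input=empty.txt --model_prefix=test_bpe --vocab_size=100 --model_type=bpe")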
Hi @vincsous
I will look into that.
Thanks
Hi @akanyaani ,
I added the line you suggested. It prints out the following:
['/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data/processed.txt']
I also checked the "processed.txt" file. It's empty.
Hi @RomanPlusPlus
You are getting this error because you are passing the wrong data directory. This repo ships sample data in /data/scraped, so try this:
python pre_process.py --data-dir="/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data/scraped" --vocab-size=32000
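If you want pre_process.py to fail with a readable message instead of SentencePiece's internal error, a small guard in the train method would do it. This is only a sketch (it reuses the repo's existing process_text, train_byte_pair_encoding, and create_tf_records helpers), not something already in the code:
import glob

def train(data_dir, vocab_size, min_seq_len, max_seq_len):
    text_files = glob.glob(data_dir + "/*.txt")
    if not text_files:
        # Stop early, before SentencePiece hits its [!sentences_.empty()] check.
        raise FileNotFoundError(
            "No *.txt files found in {!r}; check the --data-dir argument.".format(data_dir))
    print(text_files)
    process_text(text_files)
    train_byte_pair_encoding(vocab_size)
    create_tf_records(min_seq_len, max_seq_len)
    print("Pre-processing is done............")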
I am also getting this error. My command:
python pre_process.py --data-dir=/media/b/F:/patent_data_v2/patent_data_joined --vocab-size=50000
Checked the processed.txt file - it's got plenty of data.
Notably, this ran fine on my Mac (running Catalina). However, Macs don't have GPUs, so I'm moving all this over to a client's Linux machine.
My OS: Ubuntu Linux 20 (latest version)
Running in a custom conda environment.
My conda env.yaml file:
name: tf
channels:
- anaconda
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _tflow_select=2.1.0=gpu
- absl-py=0.9.0=py36_0
- astunparse=1.6.3=py_0
- blas=1.0=mkl
- blinker=1.4=py36_0
- brotlipy=0.7.0=py36h7b6447c_1000
- c-ares=1.15.0=h7b6447c_1001
- ca-certificates=2020.6.24=0
- cachetools=4.1.0=py_1
- certifi=2020.6.20=py36_0
- cffi=1.14.0=py36he30daa8_1
- chardet=3.0.4=py36_1003
- click=7.1.2=py_0
- cryptography=2.9.2=py36h1ba5d50_0
- cudatoolkit=10.1.243=h6bb024c_0
- cudnn=7.6.5=cuda10.1_0
- cupti=10.1.168=0
- ftfy=5.7=py_0
- gast=0.3.3=py_0
- google-auth=1.14.1=py_0
- google-auth-oauthlib=0.4.1=py_2
- google-pasta=0.2.0=py_0
- grpcio=1.27.2=py36hf8bcb03_0
- h5py=2.10.0=py36hd6299e0_1
- hdf5=1.10.6=hb1b8bf9_0
- idna=2.10=py_0
- intel-openmp=2020.1=217
- keras-preprocessing=1.1.0=py_1
- ld_impl_linux-64=2.33.1=h53a641e_7
- libedit=3.1.20191231=h14c3975_1
- libffi=3.3=he6710b0_2
- libgcc-ng=9.1.0=hdf63c60_0
- libgfortran-ng=7.3.0=hdf63c60_0
- libprotobuf=3.12.3=hd408876_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- markdown=3.1.1=py36_0
- mkl=2019.4=243
- mkl-service=2.3.0=py36he904b0f_0
- mkl_fft=1.1.0=py36h23d657b_0
- mkl_random=1.1.0=py36hd6b4f25_0
- ncurses=6.2=he6710b0_1
- numpy=1.18.5=py36ha1c710e_0
- numpy-base=1.18.5=py36hde5b4d6_0
- oauthlib=3.1.0=py_0
- openssl=1.1.1g=h7b6447c_0
- opt_einsum=3.1.0=py_0
- pip=20.1.1=py36_1
- protobuf=3.12.3=py36he6710b0_0
- pyasn1=0.4.8=py_0
- pyasn1-modules=0.2.7=py_0
- pycparser=2.20=py_0
- pyjwt=1.7.1=py36_0
- pyopenssl=19.1.0=py36_0
- pysocks=1.7.1=py36_0
- python=3.6.10=h7579374_2
- readline=8.0=h7b6447c_0
- requests=2.24.0=py_0
- requests-oauthlib=1.3.0=py_0
- rsa=4.0=py_0
- scipy=1.5.0=py36h0b6359f_0
- setuptools=47.3.1=py36_0
- six=1.15.0=py_0
- sqlite=3.32.3=h62c20be_0
- tensorboard=2.2.1=pyh532a8cf_0
- tensorboard-plugin-wit=1.6.0=py_0
- tensorflow=2.2.0=gpu_py36hf933387_0
- tensorflow-base=2.2.0=gpu_py36h8a81be8_0
- tensorflow-estimator=2.2.0=pyh208ff02_0
- tensorflow-gpu=2.2.0=h0d30ee6_0
- termcolor=1.1.0=py36_1
- tk=8.6.10=hbc83047_0
- tqdm=4.47.0=py_0
- urllib3=1.25.9=py_0
- wcwidth=0.2.5=py_0
- werkzeug=1.0.1=py_0
- wheel=0.34.2=py36_0
- wrapt=1.12.1=py36h7b6447c_1
- xz=5.2.5=h7b6447c_0
- zlib=1.2.11=h7b6447c_3
- pip:
- sentencepiece==0.1.85
prefix: /home/b/anaconda3/envs/tf
You can run into this error even if your path is correct, because the train method assumes your data files use the .txt extension. Files without a .txt extension are simply not picked up, which causes the error.
I'd recommend changing the train method to:
def train(data_dir, vocab_size, min_seq_len, max_seq_len):
    text_files = glob.glob(data_dir + "/*")  # match every file, not just *.txt
    process_text(text_files)
    train_byte_pair_encoding(vocab_size)
    create_tf_records(min_seq_len, max_seq_len)
    print("Pre-processing is done............")
In other words, change "/*.txt" to "/*".
Better yet, gather the file paths recursively, like so:
text_files = glob.glob(data_dir + "/**/*", recursive=True)
This lets you keep your data files in their own directories, which is useful when you have thousands of them and sometimes want to work with only a subset.
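One caveat that is easy to miss: glob only descends into subdirectories when recursive=True is passed along with the ** pattern. A quick comparison, using a hypothetical layout of data/a.txt and data/patents/b.txt:
import glob

data_dir = "data"  # hypothetical layout: data/a.txt, data/patents/b.txt

# Matches only files directly inside data/, whatever their extension.
print(glob.glob(data_dir + "/*"))

# With recursive=True, ** matches any depth of nested directories;
# without it, ** behaves like a plain * and does not recurse.
print(glob.glob(data_dir + "/**/*", recursive=True))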
I encountered this error when running the code on Windows. I fixed it by adding an explicit encoding to every with open call, like this:
with open(PROCESS_DATA_PATH, 'r', encoding = 'utf-8') as f:
with open(BPE_TSV_PATH, 'w', encoding = 'utf-8', newline='') as f_output:
The files that are read need to be encoded in UTF-8, but I guess that goes without saying.
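For background: when open() is called without an encoding argument, Python uses the platform's preferred encoding, which is UTF-8 on macOS and most Linux setups but usually a legacy code page such as cp1252 on Windows. That is why the same code can work on one machine and fail on another. You can check what your interpreter defaults to with a one-liner:
import locale

# The encoding open() uses when no encoding= argument is given:
# typically 'UTF-8' on Linux/macOS, 'cp1252' (or another code page) on Windows.
print(locale.getpreferredencoding(False))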