A problem when training XTransformer with the PECOS model
Description
When I train XTransformer with the PECOS model, a training error occurs in the matcher stage. The dataset has 108457 instances and the hierarchical label tree is [32, 1102]. While training the second layer of the label tree in the matcher stage (the first layer trains without problems), the run got stuck after matcher fine-tuning completed, when predicting on the training data; see pecos.xmc.xtransformer.matcher.
I think the cause is that my training dataset is too large, so I modified the prediction call in pecos.xmc.xtransformer.matcher to chunk it:
P_trn, inst_embeddings = matcher.predict(
    prob.X_text,
    csr_codes=csr_codes,
    pred_params=pred_params,
    batch_size=train_params.batch_size,
    batch_gen_workers=train_params.batch_gen_workers,
    max_pred_chunk=30000,
)
But then another problem occurred; see the training log below.
05/08/2023 10:31:56 - INFO - pecos.xmc.xtransformer.matcher - Reload the best checkpoint from /tmp/tmp0kdzh7n5
05/08/2023 10:31:58 - INFO - pecos.xmc.xtransformer.matcher - Predict with csr_codes_next((30000, 1102)) with avr_nnz=172.31423333333333
05/08/2023 10:31:58 - INFO - pecos.xmc.xtransformer.module - Constructed XMCTextTensorizer, tokenized=True, len=30000
05/08/2023 10:32:29 - INFO - pecos.xmc.xtransformer.matcher - Predict with csr_codes_next((30000, 1102)) with avr_nnz=172.2335
05/08/2023 10:32:29 - INFO - pecos.xmc.xtransformer.module - Constructed XMCTextTensorizer, tokenized=True, len=30000
Traceback (most recent call last):
File "/opt/conda/envs/nlp/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/nlp/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/train.py", line 564, in
I'm not sure if this is a bug; can you give me some advice? Thanks!
Environment
- Operating system: Ubuntu 20.04.4 LTS container
- Python version: Python 3.8.16
- PECOS version: libpecos 1.0.0
Hi xiaokening, the issue is caused by the pre-tensorized prob.X_text containing instance indices larger than the partitioned chunk size (30000). This should not happen if prob.X_text is not tensorized (i.e., it is a list of str).
If you want to manually chunk prediction, one simple workaround is to turn off train_params.pre_tokenize so that every chunk of data is tensorized independently.
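A minimal sketch of this workaround via the Python training API, in case it is useful. MLProblemWithText and XTransformer.train follow the public PECOS usage pattern; the toy data and the placeholder feature matrix below are my own assumptions, not part of this issue:

import numpy as np
import scipy.sparse as smat
from pecos.xmc.xtransformer.model import XTransformer
from pecos.xmc.xtransformer.module import MLProblemWithText

# Toy-sized placeholders, only to illustrate the expected input types.
X_text = ["red apple", "green apple", "blue sky", "cloudy sky"]    # raw strings, NOT pre-tensorized
X_feat = smat.csr_matrix(np.random.rand(4, 8).astype(np.float32))  # numerical features (e.g. tf-idf)
Y = smat.csr_matrix(np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=np.float32))

# Because X_text stays a list of str, every prediction chunk is tokenized
# independently, so a manual max_pred_chunk cannot index past the chunk size.
prob = MLProblemWithText(X_text, Y, X_feat=X_feat)
xtf = XTransformer.train(prob)

Keeping the raw strings means each max_pred_chunk slice is tokenized on its own, so chunked prediction never has to index into a single pre-tensorized feature set that is larger than the chunk.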
thanks! @jiong-zhang
@jiong-zhang When I train XTransformer with the PECOS model, the same training error occurs in the matcher stage. At first I thought my data volume was too large, but the problem still appears even after I increased the memory. It can occur at any matcher stage (I do not manually chunk prediction).
I used the top and free commands to monitor the program while it ran, and I found that the number of processes suddenly increased and then disappeared. I suspect it is a problem with the DataLoader; you can refer to this link.
Note: after matcher fine-tuning completed, it got stuck at the first step of predicting on the training data; see pecos.xmc.xtransformer.matcher.
Can you give me some advice? Thanks!
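As a diagnostic sketch for the DataLoader suspicion (an assumption on my part, not a confirmed fix): rerunning the same matcher.predict call as in the snippet above, but with single-process batch generation, should make the hang disappear if worker processes are the culprit. The variables are the same ones available at that point in pecos.xmc.xtransformer.matcher:

P_trn, inst_embeddings = matcher.predict(
    prob.X_text,
    csr_codes=csr_codes,
    pred_params=pred_params,
    batch_size=train_params.batch_size,
    batch_gen_workers=0,  # generate batches in the main process; no DataLoader worker processes
)

If the hang goes away with zero workers, that points at worker processes dying or deadlocking rather than at the data size.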
Environment
- Operating system: Ubuntu 20.04.4 LTS container
- Python version: Python 3.8.16
- PECOS version: libpecos 1.0.0
- PyTorch version: pytorch==1.11.0
- GPU: 4 x NVIDIA V100 16GB
- CUDA version: cudatoolkit=11.3