Very slow training on AMD CPU
Hi All,
I've been trying to train a BPE tokenizer with the options below. The data is around 300+ GB of text.
```python
spm.SentencePieceTrainer.Train(
    '--input={} --model_prefix=tokener '
    '--unk_piece=[UNK] --pad_piece=[PAD] --bos_piece=[CLS] --eos_piece=[SEP] '
    '--user_defined_symbols=[PAD],[MASK],[END_GEN] '
    '--vocab_size=110000 --input_sentence_size=15000000 '
    '--num_threads=1000'.format(file_paths),
    num_threads=1000,
)
```
My system specs are as below:
- CPU: AMD Ryzen 5 5600X
- RAM: 32 GB, 3200 MHz
The training has been running for almost 2.5 hrs and still hasn't completed, and my CPU utilization is only at about 13%. I'm not sure whether this is expected behavior or there is room for improvement.
I'm not an expert on this project, but 1000 threads is far too many for a single CPU. That processor can run 12 hardware threads at a time, so with more than 12 threads the OS has to keep switching between them, which adds overhead without adding throughput.

Try setting num_threads to 6 or 12. I wouldn't be surprised if that helps.
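For what it's worth, a minimal sketch of what I mean, with the thread count matched to the hardware. `file_paths` stands in for your original file list, and `--model_type=bpe` is an assumption on my part (your posted command doesn't set it, and SentencePiece defaults to unigram even though you mention training BPE):

```python
import sentencepiece as spm

file_paths = 'corpus.txt'  # placeholder for the original 300+ GB file list

spm.SentencePieceTrainer.Train(
    '--input={} --model_prefix=tokener '
    '--model_type=bpe '  # assumption: not in the original command; the default is unigram
    '--unk_piece=[UNK] --pad_piece=[PAD] --bos_piece=[CLS] --eos_piece=[SEP] '
    '--user_defined_symbols=[PAD],[MASK],[END_GEN] '
    '--vocab_size=110000 --input_sentence_size=15000000 '
    '--num_threads=12'.format(file_paths)  # match the 5600X's 12 hardware threads
)
```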
I've tried setting it to many different values, including 12, with no improvement; CPU usage wasn't going beyond 13-17%. I instead switched to the Hugging Face BPE trainer, and it still took me over 16 hrs to train.
@kjhanjee Could it be that most of the time SentencePiece is simply traversing your data to sample the 15M sentences you specified via the input_sentence_size parameter? Have you seen the first merges appearing in the logs?
It completed training but took 16 hrs. The problem isn't the number of sentences; it's the lack of parallel execution: while computing merges it likewise only utilizes about 16% of my CPU. I switched to the Hugging Face tokenizer to train BPE, though even that seems to be taking a lot of time, about 10 hrs. That could be because I'm keeping a huge vocabulary size.
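Roughly the kind of setup I mean, sketched here; the vocab size and special pieces are copied from the SentencePiece command above, while the file list and whitespace pre-tokenizer are placeholders:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# BPE model with the same special pieces as the SentencePiece command above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumed pre-tokenization

trainer = trainers.BpeTrainer(
    vocab_size=110000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]", "[END_GEN]"],
)

files = ["corpus.txt"]  # placeholder for the same 300+ GB corpus
tokenizer.train(files, trainer)
tokenizer.save("tokenizer.json")
```

Note that the `tokenizers` trainer parallelizes via Rayon, so the `RAYON_NUM_THREADS` environment variable (rather than an API argument) controls how many cores it uses.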
In general, larger corpora and vocabulary sizes require a lot of time. We can't identify the cause of the problem from the current reports alone, and we can't determine whether it's AMD-specific or not.
We are going to close this issue on March 1 if there are no further updates.