Very slow training on AMD CPU
Hi All,
I've been trying to train a BPE tokenizer with the options below. The data is around 300+ GB of text.
```python
spm.SentencePieceTrainer.Train(
    '--input={} --model_prefix=tokener '
    '--unk_piece=[UNK] --pad_piece=[PAD] --bos_piece=[CLS] --eos_piece=[SEP] '
    '--user_defined_symbols=[PAD],[MASK],[END_GEN] '
    '--vocab_size=110000 --input_sentence_size=15000000 '
    '--num_threads=1000'.format(file_paths),
    num_threads=1000,
)
```
My system specs are as below:
- CPU: AMD Ryzen 5 5600X
- RAM: 32 GB, 3200 MHz
The training has been running for almost 2.5 hrs and still hasn't completed, and my CPU utilization is only at about 13%. I'm not sure whether this is expected behavior or there is room for improvement.
I'm not an expert on this project, but 1000 threads is far too many for a single CPU. That processor can run 12 hardware threads at a time, so with more than 12 threads the OS has to keep switching between them, which adds overhead without adding throughput.

Try setting num_threads to 6 or 12. I wouldn't be surprised if that helps.
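For what it's worth, a minimal sketch of what I mean, with the thread count matched to the hardware. `file_paths` stands in for your original file list, and `--model_type=bpe` is an assumption on my part (your posted command doesn't set it, and SentencePiece defaults to unigram even though you mention training BPE):

```python
import sentencepiece as spm

file_paths = 'corpus.txt'  # placeholder for the original 300+ GB file list

spm.SentencePieceTrainer.Train(
    '--input={} --model_prefix=tokener '
    '--model_type=bpe '  # assumption: not in the original command; the default is unigram
    '--unk_piece=[UNK] --pad_piece=[PAD] --bos_piece=[CLS] --eos_piece=[SEP] '
    '--user_defined_symbols=[PAD],[MASK],[END_GEN] '
    '--vocab_size=110000 --input_sentence_size=15000000 '
    '--num_threads=12'.format(file_paths)  # match the 5600X's 12 hardware threads
)
```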
I've tried setting it to many different values, including 12, with no improvement; CPU usage wasn't going beyond 13-17%. I instead switched to the Hugging Face BPE trainer, and it still took me over 16 hrs to train.
@kjhanjee Could it be that most of the time SentencePiece is simply traversing your data to sample the 15M sentences you specified via the input_sentence_size parameter? Have you seen the first merges appearing in the logs?
It completed training but took 16 hrs. The problem isn't the number of sentences; it's the lack of parallel execution: while computing merges it likewise only utilizes about 16% of my CPU. I switched to the Hugging Face tokenizer to train BPE, though even that seems to be taking a lot of time, about 10 hrs. That could be because I'm keeping a huge vocabulary size.
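Roughly the kind of setup I mean, sketched here; the vocab size and special pieces are copied from the SentencePiece command above, while the file list and whitespace pre-tokenizer are placeholders:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# BPE model with the same special pieces as the SentencePiece command above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumed pre-tokenization

trainer = trainers.BpeTrainer(
    vocab_size=110000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]", "[END_GEN]"],
)

files = ["corpus.txt"]  # placeholder for the same 300+ GB corpus
tokenizer.train(files, trainer)
tokenizer.save("tokenizer.json")
```

Note that the `tokenizers` trainer parallelizes via Rayon, so the `RAYON_NUM_THREADS` environment variable (rather than an API argument) controls how many cores it uses.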
In general, larger corpora and vocabulary sizes require a lot of time. We can't identify the cause of the problem from the current reports alone, and we can't determine whether it's AMD-specific or not.
We are going to close this issue on March 1 if there are no further updates.