GPT2-chitchat
Improve tokenization performance
My train.txt is about 1 GB. The tokenization speed is about 1 dialogue per second, so preprocess_raw_data() would take about 80 days to finish. After making the change, it processes about 5000 dialogues per second.
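As a rough sanity check on those numbers (the dialogue count is only inferred from "80 days at ~1 dialogue per second", not measured from the file):

```python
# Back-of-the-envelope: ~6.9M dialogues implied by 80 days at 1 dialogue/s.
dialogues = 80 * 24 * 3600
print(dialogues)               # 6912000
print(dialogues / 5000 / 60)   # ~23 minutes at 5000 dialogues/s
```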
I have used the preprocess_raw_data() function to tokenize a train.txt of about 800 MB, and it only took a few minutes. There are indeed problems with efficiency; thanks for your advice, I will check and update the code.
Is your training sample a DOS text file? I use unix text files. I guess it might be because the `"\r\n" in data` check is slow for unix text files.
emm~, I have handled both cases separately:

```python
if "\r\n" in data:
    train_data = data.split("\r\n\r\n")
else:
    train_data = data.split("\n\n")
```
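A separator-agnostic alternative (just a sketch, not the current code in the repo) is to normalize the line endings once up front, so no later step needs to test for "\r\n" at all:

```python
# Sketch: normalize DOS line endings in one pass, then split on blank lines.
with open("train.txt", "rb") as f:
    data = f.read().decode("utf-8")

data = data.replace("\r\n", "\n")   # single pass over the corpus
train_data = data.split("\n\n")     # dialogues are separated by blank lines
```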
I'm trying to figure out why it's slow for me but fast for you. For example, 'a' in 'aaaaaaaaaaaaaaa' runs faster than 'b' in 'aaaaaaaaaaaaaaa', because the first check can stop at the very first character while the second has to scan the whole string. In my case, the unix file doesn't contain '\r\n', so each check has to scan the whole 1 GB file before it can return False; repeated for every dialogue, that adds up to about 80 days in total. In your case, '\r\n' appears within the first few bytes, so only a few bytes are scanned.
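To make the asymmetry concrete, here is a quick way to observe it with timeit (the 100 MB haystack is only an illustration, not the real corpus):

```python
import timeit

haystack = "a" * 100_000_000  # ~100 MB string as a stand-in for the corpus

# 'a' is found at the very first character, so the check returns immediately.
hit = timeit.timeit(lambda: "a" in haystack, number=10)
# 'b' never occurs, so every check scans all 100 MB before returning False.
miss = timeit.timeit(lambda: "b" in haystack, number=10)

print(f"early exit: {hit:.4f}s, full scan: {miss:.4f}s")
```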