GPT2-chitchat

Improve tokenization performance

Open wuyongzheng opened this issue 5 years ago • 4 comments

My train.txt is about 1 GB. The tokenization speed is about 1 dialogue per second, so preprocess_raw_data() will take about 80 days to finish. After making the change, it processes about 5,000 dialogues per second.
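For scale, a back-of-the-envelope check of those numbers (an illustration, not a measurement from the repository): 80 days at 1 dialogue per second implies roughly 7 million dialogues in the file, which at 5,000 dialogues per second would take around 23 minutes.

```python
# Rough arithmetic behind the numbers above.
dialogues = 80 * 24 * 3600      # ~6.9 million dialogues at 1 dialogue/sec for 80 days
print(dialogues / 5000 / 60)    # ~23 minutes at 5,000 dialogues/sec
```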

wuyongzheng · Dec 18 '19 09:12

I have used the preprocess_raw_data() function to tokenize a train.txt of about 800 MB, and it only took a few minutes. There are indeed efficiency problems, though. Thanks for your advice; I will check and update the code.

yangjianxin1 · Dec 19 '19 03:12

Is your training sample a DOS text file? I use Unix text files. I guess it might be because the `"\r\n" in data` check is slow for Unix text files.
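A minimal timing sketch (not the project's code) of why that check can behave so differently: Python's substring search returns as soon as it finds a match, but has to scan the entire string when the substring is absent, so `"\r\n" in data` is nearly free for a DOS file and a full pass over the data for a Unix file.

```python
import timeit

data = "a" * (10**8)  # ~100 MB string containing no "\r\n", like a Unix-style file

hit = timeit.timeit(lambda: "a" in data, number=10)      # match at index 0, returns immediately
miss = timeit.timeit(lambda: "\r\n" in data, number=10)  # no match, scans all ~100 MB every call

print(f"substring present: {hit:.4f}s for 10 checks")
print(f"substring absent:  {miss:.4f}s for 10 checks")
```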

wuyongzheng · Dec 19 '19 05:12

emm~, I have dealt with both situations separately:

    if "\r\n" in data:
        train_data = data.split("\r\n\r\n")
    else:
        train_data = data.split("\n\n")

yangjianxin1 · Dec 19 '19 07:12

I'm trying to figure out why it's slow for me but fast for you. For example, 'a' in 'aaaaaaaaaaaaaaa' runs faster than 'b' in 'aaaaaaaaaaaaaaa', because the search stops at the first match but has to scan the whole string when there is no match. In my case, the Unix file doesn't contain '\r\n', so the whole 1 GB file has to be scanned for every dialogue, which adds up to about 80 days in total. In your case, only a few bytes are scanned.
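If that check really runs once per dialogue against the full file contents (which is what the 80-day estimate suggests), the fix is to decide the newline style once and reuse it. A sketch under that assumption; the variable names and the exact structure of preprocess_raw_data() here are illustrative, not the repository's actual code:

```python
with open("train.txt", "rb") as f:
    data = f.read().decode("utf-8")

# Detect the newline style once: at most one scan of the whole file.
sep = "\r\n" if "\r\n" in data else "\n"

# Dialogues are separated by a blank line, utterances by a single newline.
train_data = data.split(sep * 2)
for dialogue in train_data:
    utterances = dialogue.split(sep)
    # ... tokenize `utterances`; no further scans of the full `data` string are needed
```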

wuyongzheng · Dec 19 '19 07:12