Megatron-DeepSpeed

C4-mC4 pre-processing

Open · sbmaruf opened this issue 4 years ago · 12 comments

The mC4 data is very large: for the 13 selected languages it's around 18 TB. I excluded the English data since Teven already processed it.

Arabic, Swahili (Bantu), Chinese, Catalan, English, French, Indic (Hindi,Urdu,Bangla), Indonesian, Portuguese, Spanish, Russian, Japanese, Amharic

This pull request adds pre-processing code for the mC4 data.

sbmaruf avatar Jul 23 '21 08:07 sbmaruf

Here are some stats on the data currently being processed:

Language : Data Size in mC4 (GB)
--------------------
zh-Latn :   0.65
am :   1.25
ru-Latn :   2.65
sw :   3.16
ur :   10.4
ca :   42.3
bn :   43.39
hi :   127.86
zh :   148.15
id :   248.72
ar :   251.31
pt :   478.09
ja :   773.92
fr :   1050.29
es :   1492.72
ru :   3905.98
--------------------
Total size : 8580.84
Expected size after resizing : 576 GB
Per-language allocated size : 36.0 GB
Low-resource languages (<36.0 GB) : sw(3.16) zh-Latn(0.65) ur(10.4) ru-Latn(2.65) am(1.25)
Total size consumed by low-resource languages : 18.11 GB
For high-resource languages, the predefined (user-given) minimum allocation size is 12 GB and the maximum allocation size is 100 GB.
Sampling high-resource languages based on a multinomial distribution with alpha 0.01.
--------------------------------------------------------------------------------
Language : ar, Sampling prob : 0.09 , (251.31 -> 23 GB)
Language : zh, Sampling prob : 0.09 , (148.15 -> 13 GB)
Language : ca, Sampling prob : 0.09 -> 0.28, (42.3 -> 12 GB)
Language : fr, Sampling prob : 0.09 , (1050.29 -> 97 GB)
Language : hi, Sampling prob : 0.09 -> 0.09, (127.86 -> 12 GB)
Language : bn, Sampling prob : 0.09 -> 0.28, (43.39 -> 12 GB)
Language : id, Sampling prob : 0.09 , (248.72 -> 23 GB)
Language : pt, Sampling prob : 0.09 , (478.09 -> 44 GB)
Language : es, Sampling prob : 0.09 -> 0.07, (1492.72 -> 100 GB)
Language : ru, Sampling prob : 0.09 -> 0.03, (3905.98 -> 100 GB)
Language : ja, Sampling prob : 0.09 , (773.92 -> 71 GB)
Expected high-resource size : 557.89, Total size : 505.85
Performing adjustment ...

Final Breakdown
---------------
Language : ar, Sampling prob : 0.13, Data resized : (251.31 -> 31.78 GB)
Language : sw, Sampling prob : 1.0, Data resized : (3.16 -> 3.16 GB)
Language : zh, Sampling prob : 0.15, Data resized : (148.15 -> 22.36 GB)
Language : zh-Latn, Sampling prob : 1.0, Data resized : (0.65 -> 0.65 GB)
Language : ca, Sampling prob : 0.28, Data resized : (42.3 -> 12.0 GB)
Language : fr, Sampling prob : 0.1, Data resized : (1050.29 -> 105.59 GB)
Language : hi, Sampling prob : 0.09, Data resized : (127.86 -> 12.0 GB)
Language : ur, Sampling prob : 1.0, Data resized : (10.4 -> 10.4 GB)
Language : bn, Sampling prob : 0.28, Data resized : (43.39 -> 12.0 GB)
Language : id, Sampling prob : 0.13, Data resized : (248.72 -> 31.55 GB)
Language : pt, Sampling prob : 0.11, Data resized : (478.09 -> 51.66 GB)
Language : es, Sampling prob : 0.07, Data resized : (1492.72 -> 100.0 GB)
Language : ru, Sampling prob : 0.03, Data resized : (3905.98 -> 100.0 GB)
Language : ru-Latn, Sampling prob : 1.0, Data resized : (2.65 -> 2.65 GB)
Language : ja, Sampling prob : 0.1, Data resized : (773.92 -> 78.95 GB)
Language : am, Sampling prob : 1.0, Data resized : (1.25 -> 1.25 GB)
Expected resource size 576, Total Size : 576.0
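
For clarity, here is a rough sketch (not the actual PR code) of the allocation logic implied by the log above: low-resource languages are kept in full, each high-resource language keeps an alpha-smoothed multinomial fraction p_l = f_l^alpha / Σ_j f_j^alpha of its own data clipped to the user-given [12 GB, 100 GB] bounds, and the leftover budget is then redistributed in the adjustment step. The function name and the equal-split adjustment are simplifying assumptions.

```python
# Simplified sketch of the size-allocation procedure described in the log above;
# the function name and the equal-split "adjustment" step are assumptions.
def allocate_sizes(sizes_gb, total_budget=576.0, min_gb=12.0, max_gb=100.0, alpha=0.01):
    """Return a {language: GB to keep} allocation summing to ~total_budget."""
    per_lang = total_budget / len(sizes_gb)

    # Languages already below the per-language share are kept in full (prob 1.0).
    low = {l: s for l, s in sizes_gb.items() if s < per_lang}
    high = {l: s for l, s in sizes_gb.items() if s >= per_lang}
    budget = total_budget - sum(low.values())

    # Alpha-smoothed multinomial probabilities over the high-resource languages:
    # p_l = f_l**alpha / sum_j f_j**alpha (alpha -> 0 gives near-uniform probs).
    denom = sum(s ** alpha for s in high.values())
    probs = {l: (s ** alpha) / denom for l, s in high.items()}

    # First pass: keep a p_l fraction of each language, clipped to [min_gb, max_gb].
    alloc = {l: min(max(s * probs[l], min_gb), max_gb) for l, s in high.items()}

    # "Performing adjustment": spread the unused budget over languages not at a bound.
    leftover = budget - sum(alloc.values())
    free = [l for l in high if min_gb < alloc[l] < max_gb]
    for l in free:
        alloc[l] = min(alloc[l] + leftover / len(free), high[l], max_gb)

    alloc.update(low)
    return alloc  # per-language sampling probability is alloc[l] / sizes_gb[l]
```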

Note: for some reason, writing the large files to disk fails (i.e., ru, es, fr) on my system. Hopefully in a few days, when I'm free, I can look into this in more detail.

sbmaruf avatar Jul 27 '21 17:07 sbmaruf

UPDATE: All data has been processed except es, fr, and ru. I cannot load, randomly shuffle, and sample these 3 languages with my current system (I am using HuggingFace Datasets); I spent a full day debugging errors on them. An alternative approach could be to use Allen AI's GitHub LFS data, which is split into small parts; I am trying that now. I am also uploading the data to HF Datasets. Unfortunately I have a very low upload speed (~1.5 MiB/s), so it will take 1-2 days to upload the full data.
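
For illustration, here is a minimal sketch of how one oversized language could be subsampled shard by shard without ever loading or shuffling the whole split in memory. The shard glob and output path are hypothetical placeholders; the keep probability comes from the breakdown above.

```python
# Minimal shard-by-shard subsampling sketch; the shard glob is a hypothetical
# placeholder pointing at the downloaded per-language json.gz shards.
import glob
import gzip
import random

def subsample_shards(shard_glob, keep_prob, out_path, seed=1234):
    rng = random.Random(seed)
    kept = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for shard in sorted(glob.glob(shard_glob)):
            with gzip.open(shard, "rt", encoding="utf-8") as f:
                for line in f:                      # one JSON document per line
                    if rng.random() < keep_prob:
                        out.write(line)
                        kept += 1
    return kept

# e.g. keep ~3% of Russian documents (roughly 3906 GB -> ~100 GB of raw text):
# subsample_shards("mc4/ru/*.json.gz", keep_prob=0.026, out_path="ru_sampled.json.gz")
```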

sbmaruf avatar Aug 02 '21 02:08 sbmaruf

Let's talk about what goes where:

  • examples aren't the best place - we now fully own this repo - so we want logical placements - probably should remove this folder altogether.
  • tools is not the place to put libraries, this dir is for project maintenance scripts - think janitor closet.

The following is a possible approach:

  1. Put all data-processing libraries under megatron/data/, perhaps megatron/data/c4 in this case?
  2. Create top-level scripts and probably arrange the .sh scripts for this PR in scripts/data/c4?

As mentioned above, I'm happy to discuss other layouts.

Thank you!

stas00 avatar Aug 03 '21 18:08 stas00

@stas00 tools is not the place to put libraries, this dir is for project maintenance scripts - think janitor closet.

  • Actually, I found that the openwebtext processing scripts are in the tools folder; that's why I put the c4_mc4 processing code in the tools folder as well. I also like your idea of putting them in megatron/data/c4.

examples aren't the best place - we now fully own this repo - so we want logical placements - probably should remove this folder altogether.

Create top-level scripts and probably arrange the .sh scripts for this PR in scripts/data/c4?

  • I actually proposed something similar in our last archi meeting. All my projects have a top-level scripts folder to track the runs, but I didn't create one here because I thought we were putting the scripts in the examples folder. Since I have never maintained a large project, I didn't want to interfere with the development cycle (creating new folders, adding new things to .gitignore, etc.). I am open to any proposal.

sbmaruf avatar Aug 03 '21 19:08 sbmaruf

Thank you for the feedback, @sbmaruf!

OK, let's leave tools as it is for now, and then we can move the whole thing at once if that makes more sense.

And let's start a top-level scripts folder and begin migrating from examples - and eventually probably remove examples altogether. Examples are for the multitude of Megatron-LM users who need examples; we are now the owners of this repo and need concrete solutions, not examples.

Since I have never maintained a large project, I didn't want to interfere with the development cycle (creating new folders, adding new things to .gitignore, etc.). I am open to any proposal.

This repo/project is currently very experimental, so please don't be afraid to experiment. This is all under git, so we can easily revert or change things if an experiment doesn't stick.

Since we don't quite have a multi-months spec on what will be done, the layout and structure will be evolving as the code emerges.

Please don't hesitate to tag me or others if you're not sure about something you're inspired to try.

stas00 avatar Aug 03 '21 19:08 stas00

Question for @sbmaruf and @stas00: I see you are using the mT5 tokenizer. I thought we were doing GPT-style modeling, either auto-regressive or prefix-LM. How would it work to feed mT5 tokens into the GPT model? I've played around with mT5 a bit and I think it's wonderful, except that it has around 250K tokens as I recall, and some of the tokens could be removed, like all the emoji stuff, some of the formatting stuff, and the tokens for languages we aren't modeling. Maybe I'm missing something? I know we are waiting for the tokenizer WG, so maybe they will have a decision on the tokenizer.

In the meantime, if we want to feed mT5 tokens into a GPT-style model for testing, I recommend trimming the vocabulary down to the languages we need, e.g. keeping the top N tokens for each of the mC4 languages we are testing on, and keeping shorter tokens to fill in any OOV tokens. Also, this tokenizer uses the HF SentencePiece tokenizer, which, while fast, is not easy to modify to shrink the number of tokens, although we could rewrite some of this to use a more vanilla Python tokenizer (which I've done in the past and can share). I suspect we can probably shrink the number of tokens down to a count similar to GPT's.

huu4ontocord avatar Aug 06 '21 00:08 huu4ontocord

I see you are using the mT5 tokenizer. I thought we were doing GPT-style modeling, either auto-regressive or prefix-LM

  • Any tokenizer can be used for an auto-regressive model; we actually discussed this in the meeting (a minimal illustration follows below).
  • We wanted to train our own tokenizer, but it seems the tokenization WG will ultimately propose all of the tokenizer hyperparameters, so we decided to use the regular mT5 tokenizer.
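
A minimal illustration (using the transformers mT5 tokenizer, not the PR's preprocessing code) of the first point above: any tokenizer, including mT5's 250K-piece SentencePiece vocabulary, can produce the flat token stream a causal/auto-regressive model is trained on.

```python
# Minimal illustration (not the PR's preprocessing code): tokenize documents
# with the mT5 tokenizer and concatenate them into one causal-LM token stream.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/mt5-small")

def to_causal_stream(documents):
    """Tokenize documents and concatenate them, separated by an EOS token."""
    ids = []
    for doc in documents:
        ids.extend(tok(doc, add_special_tokens=False)["input_ids"])
        ids.append(tok.eos_token_id)  # document boundary (</s> for mT5)
    return ids

print(to_causal_stream(["Bonjour le monde.", "こんにちは世界。"])[:10])
```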

In the meantime, if we want to feed mT5 tokens into a GPT-style model for testing, I recommend trimming the vocabulary down to the languages we need, e.g. keeping the top N tokens for each of the mC4 languages we are testing on, and keeping shorter tokens to fill in any OOV tokens.

  • SentencePiece tokenizers are not straightforward to resize and remap. In addition, it is complicated to identify the source language of a subword, since there is virtually no way to tell languages apart for typographically identical tokens.

except that it has around 250K tokens as I recall, and some of the tokens could be removed, like all the emoji stuff, some of the formatting stuff

  • Actually, the scaling law doesn't depend on the vocabulary size of the tokenizer.

although we could rewrite some of this to use a more vanilla Python tokenizer (which I've done in the past and can share). I suspect we can probably shrink the number of tokens down to a count similar to GPT's.

  • Shrinking the tokenizer won't bring any theoretical improvement; the improvement we would get is a speedup of the lookup operation. For me, it's easier to manually shrink a tokenizer than to train a new one, although so many things could go wrong while shrinking one.

@ontocord

sbmaruf avatar Aug 07 '21 17:08 sbmaruf

SentencePiece tokenizers are not straightforward to resize and remap.

Here is a hack that helps with shrinking an spm vocab: https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564

It may or may not help, as I was only seeking to make it much smaller.
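
For reference, here is a rough sketch in the spirit of the recipes in that thread (a hypothetical helper, for unigram SentencePiece models such as mT5's; the model's embedding matrix would still have to be remapped to the new ids): prune the model protobuf down to the pieces actually used on a sample of the target-language corpus.

```python
# Rough sketch (hypothetical helper, not taken verbatim from the linked thread):
# keep only the pieces observed on a corpus sample, plus all special pieces.
import sentencepiece as spm
import sentencepiece.sentencepiece_model_pb2 as sp_pb2

def shrink_spm(model_path, sample_texts, out_path):
    sp = spm.SentencePieceProcessor(model_file=model_path)

    # Piece ids actually used when tokenizing the sample corpus.
    used = set()
    for text in sample_texts:
        used.update(sp.encode(text, out_type=int))

    proto = sp_pb2.ModelProto()
    with open(model_path, "rb") as f:
        proto.ParseFromString(f.read())

    # Copy the pieces we keep: used NORMAL pieces plus unk/control/user-defined/byte.
    new_pieces = []
    for i, piece in enumerate(proto.pieces):
        if i in used or piece.type != sp_pb2.ModelProto.SentencePiece.NORMAL:
            kept = sp_pb2.ModelProto.SentencePiece()
            kept.CopyFrom(piece)
            new_pieces.append(kept)

    del proto.pieces[:]
    proto.pieces.extend(new_pieces)

    with open(out_path, "wb") as f:
        f.write(proto.SerializeToString())
```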

stas00 avatar Aug 07 '21 18:08 stas00

@stas00 Thank you for the great resources re: SentencePiece.

@sbmaruf I didn't realize we were considering running the GPT model with an mT5-sized token space, but as you said, any tokenization mechanism can be used. I just wonder what happens to the tokens/embeddings that are rarely, if ever, seen. Good luck, and I'd love to see the results!

huu4ontocord avatar Aug 08 '21 12:08 huu4ontocord

  • [x] Done with processing.
  • [x] Here is the readme.
  • [x] Processed data is here.
  • [x] Raw data (except English) is here.
  • [x] Please review this part carefully.
  • [x] The final iterator selection probabilities are here; you need to choose an alpha from here.
  • [x] All the JSON files for the iterator selection probabilities are here.

Awaiting review, @stas00 @ibeltagy @yongzx

sbmaruf avatar Aug 10 '21 10:08 sbmaruf

@sbmaruf For the "Please review this part carefully", is there any particular component to pay extra attention to? For instance, the reported figures?

yongzx avatar Aug 11 '21 03:08 yongzx

@sbmaruf For the "Please review this part carefully", is there any particular component to pay extra attention to? For instance, the reported figures?

@yongzx I find some of the numbers inconsistent. In particular, 86 GB of raw data becomes a 28 GB binary, while for English 784 GB of raw data becomes a 763 GB binary. There is no clear pattern in the raw-to-binary size ratio, which is what worries me most. Maybe that's how it should be; I'm not sure at this point.

sbmaruf avatar Aug 11 '21 14:08 sbmaruf