
mC4 sampling & pre-processing

Open: sbmaruf opened this issue 2 years ago • 1 comment

Hi @TevenLeScao,

I think there are some confusing and broken links in the mC4 data preprocessing section. Can you take a look?

Both of the links here are broken:

  1. mc4_preprocessing
  2. mc4_sampled_raw

The correct links should be:

  1. mc4_preprocessing
  2. mc4_sampled_raw

In addition, the multinomial data processing code that creates the different language splits is in this pull request: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/9

Here are a few things:

  1. Did you use this data for any of your experiments?
  2. If not, then I think you can update the doc: https://github.com/bigscience-workshop/bigscience/tree/master/data/mc4

For reference purposes, if you want to keep the code, I'm happy to open a pull request here. If not, I'll close the pull request in the bigscience/Megatron-Deepspeed repo.

Let me know what you think.

sbmaruf, Aug 17 '22 22:08

We did use mc4 for early multilingual experiments before switching to OSCAR - let's keep the code for future reference. Thanks for catching this!

TevenLeScao, Aug 18 '22 17:08