nnue-pytorch

[WIP] Automatic script to fetch current datasets.

Open Sopel97 opened this issue 1 year ago • 5 comments

The intended purpose of this script is to always document and fetch the datasets required for replicating the training of the current Stockfish master network.

Right now this is mostly a skeleton with a DSL for defining how the datasets are combined. Downloading from Kaggle and concatenation should work but are untested. Interleaving is not yet implemented. For now the script is only meant to be run in dry-run mode.

The files of a single Kaggle dataset are always combined into a single destination file by concatenation in alphabetical order. If this requirement is too rigid we can work on relaxing it, but I think it works for all the current datasets.
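The concatenation rule can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `concat_sorted`, its parameters, and the `*.binpack` glob pattern are assumptions, and "alphabetical" is taken here to mean Python's default string ordering of file names.

```python
import shutil
from pathlib import Path


def concat_sorted(src_dir, dest, pattern="*.binpack", dry_run=True):
    """Concatenate files in `src_dir` matching `pattern` into `dest`.

    Hypothetical helper. Inputs are appended in sorted order of their file
    names (plain string comparison, so uppercase sorts before lowercase).
    In dry-run mode nothing is written. The ordered inputs are returned.
    """
    dest_path = Path(dest).resolve()
    inputs = sorted(
        (p for p in Path(src_dir).glob(pattern) if p.resolve() != dest_path),
        key=lambda p: p.name,
    )
    if dry_run:
        return inputs
    with open(dest_path, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                # Stream in blocks so large files are never held in memory.
                shutil.copyfileobj(src, out)
    return inputs
```

Calling `concat_sorted(some_dir, some_dest)` with the default `dry_run=True` only reports the order in which the files would be appended.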

@linrock Could you please add a full specification for the currently used datasets? I included an example for the dataset used in the first stage of the training. If any needed functionality is missing, let me know.

Even the data for the first stage requires downloading 200GB of data, so I'm unable to verify the correctness right now. We'll see about it after we have the full process documented.
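With downloads of this size, a preflight check for free disk space may be useful. A minimal sketch, assuming a hypothetical helper `has_free_space` (not part of the script) and reading 200GB as 200 * 10**9 bytes:

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the filesystem containing `path` has at least
    `required_bytes` free. Hypothetical helper for illustration only."""
    return shutil.disk_usage(path).free >= required_bytes


# Assumed figure taken from the 200GB estimate for the first stage.
FIRST_STAGE_BYTES = 200 * 10**9

if not has_free_space(".", FIRST_STAGE_BYTES):
    print("warning: less than 200GB free in the current directory")
```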

Sopel97 (Jul 06 '23 14:07)

I tried this a couple of days ago; IMO it looks good. I would not delete intermediate files by default: I think for most people it is harder to download 200GB than to store it, even though you might have different constraints on your server.
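One way to express "keep by default" is to make deletion an explicit opt-in on the command line. A small sketch using `argparse`; the flag names `--dry-run` and `--delete-intermediate` are assumptions for illustration, not the script's actual options:

```python
import argparse


def build_parser():
    """Build a hypothetical CLI where intermediate files are kept unless
    the user explicitly asks for them to be deleted."""
    parser = argparse.ArgumentParser(description="Fetch and combine datasets.")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="only print what would be downloaded and combined",
    )
    parser.add_argument(
        "--delete-intermediate",
        action="store_true",
        help="delete downloaded parts after combining (off by default)",
    )
    return parser
```

With `store_true`, both options default to `False`, so running the script with no arguments keeps every downloaded file.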

Let's try to complete the list of datasets for current master nets. I can definitely upload more data as needed.

vondele (Jul 12 '23 06:07)

the current master stage2 dataset is composed of:

  • https://www.kaggle.com/datasets/joostvandevondele/t60t70wisrightfarseert60t74t75t76
  • https://www.kaggle.com/datasets/linrock/t78juntoaugt79mart80dec-16tb7p

   LeelaFarseer-T78juntoaugT79marT80dec.binpack (141G)
     T60T70wIsRightFarseerT60T74T75T76.binpack
     test78-junjulaug2022-16tb7p.no-db.min.binpack
     test79-mar2022-16tb7p.no-db.min.binpack
     test80-dec2022-16tb7p.no-db.min.binpack
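
a listing like the one above could be written down as data and printed in dry-run form. this is only a sketch: the `CombinedDataset` class and the `plan` function are hypothetical names, not the DSL from this PR, and the sketch does not say whether the parts are concatenated or interleaved. the file names are copied from the listing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CombinedDataset:
    """Hypothetical record: one destination file and its component files."""
    dest: str
    parts: tuple


def plan(spec):
    """Return the dry-run description of `spec` as a list of lines."""
    lines = [f"create {spec.dest}"]
    lines += [f"  from {part}" for part in spec.parts]
    return lines


STAGE2 = CombinedDataset(
    dest="LeelaFarseer-T78juntoaugT79marT80dec.binpack",
    parts=(
        "T60T70wIsRightFarseerT60T74T75T76.binpack",
        "test78-junjulaug2022-16tb7p.no-db.min.binpack",
        "test79-mar2022-16tb7p.no-db.min.binpack",
        "test80-dec2022-16tb7p.no-db.min.binpack",
    ),
)

for line in plan(STAGE2):
    print(line)
```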

i'll take a closer look at stage3 later. the current L1-2048 master final stage dataset is an unshuffled 800GB+ binpack that i'm no longer using since it's too inconvenient. i'm working on replacing it with a fully minimized ~330GB dataset which i'll document later as well.

linrock (Jul 12 '23 15:07)

the current master stage3 dataset is composed of:

  • https://www.kaggle.com/datasets/linrock/leela96-filt-v2-min
  • https://www.kaggle.com/datasets/linrock/dfrc99-16tb7p-filt-v2-min
  • https://www.kaggle.com/datasets/linrock/sfnnv7-s3

   leela96-dfrc99-v2-T80dectofeb-sk20-mar-v6-T77decT78janfebT79apr.binpack (223G)
     leela96-filt-v2.min.binpack
     dfrc99-16tb7p-eval-filt-v2.min.binpack
     test80-dec2022-16tb7p-filter-v6-sk20.min-mar2023.binpack
     test80-jan2023-16tb7p-filter-v6-sk20.min-mar2023.binpack
     test80-feb2023-16tb7p-filter-v6-sk20.min-mar2023.binpack
     test80-mar2023-2tb7p-filter-v6.min.binpack
     test77-dec2021-16tb7p.no-db.min.binpack
     test78-janfeb2022-16tb7p.no-db.min.binpack
     test79-apr2022-16tb7p.no-db.min.binpack

linrock (Jul 22 '23 18:07)

the current master stage4/5 dataset is composed of:

  • https://www.kaggle.com/datasets/linrock/leela96-filt-v2-min
  • https://www.kaggle.com/datasets/linrock/dfrc99-16tb7p-filt-v2-min
  • https://www.kaggle.com/datasets/linrock/0dd1cebea57-test80-v6-dd
  • https://www.kaggle.com/datasets/linrock/0dd1cebea57-misc-v6-dd
  • https://www.kaggle.com/datasets/linrock/test80-apr2023-2tb7p-no-db

the uploaded dataset components are all minimized. parts of the dataset were unminimized to increase randomness during training. however, it's unclear how much of an elo benefit this brings. see https://github.com/official-stockfish/Stockfish/pull/4606 for more details on this particular dataset.

as of now, all datasets for training the current master net (nn-c38c3d8d3920.nnue) are documented in this PR.

linrock (Sep 10 '23 04:09)

the current master stage6 dataset is composed of:

  • https://www.kaggle.com/datasets/linrock/leela96-filt-v2-min
  • https://www.kaggle.com/datasets/linrock/dfrc99-16tb7p-filt-v2-min
  • https://www.kaggle.com/datasets/linrock/0dd1cebea57-test80-v6-dd/versions/2
  • https://www.kaggle.com/datasets/linrock/0dd1cebea57-misc-v6-dd
  • https://www.kaggle.com/datasets/linrock/1ee1aba5ed-test60-2020-test77-nov2021-2tb7p
  • https://www.kaggle.com/datasets/linrock/1ee1aba5ed-test80-martojul2023-2tb7p
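
since one of these links pins a specific version, a fetch script would need to split each URL into owner, dataset name and optional version. a minimal sketch, assuming the URL shape seen in the list above (`/datasets/<owner>/<name>` with an optional `/versions/<n>` suffix); `parse_kaggle_dataset_url` is a hypothetical helper, not part of any Kaggle tooling:

```python
from urllib.parse import urlparse


def parse_kaggle_dataset_url(url):
    """Split a Kaggle dataset URL into (owner, name, version).

    `version` is an int when the URL ends in /versions/<n>, else None.
    Hypothetical helper; only the URL shape seen in this thread is handled.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Kaggle dataset URL: {url}")
    owner, name = parts[1], parts[2]
    version = None
    if len(parts) >= 5 and parts[3] == "versions":
        version = int(parts[4])
    return owner, name, version
```

for the third link in the list this gives owner `linrock`, name `0dd1cebea57-test80-v6-dd` and version `2`; the unpinned links give a version of `None`.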

since this was a retraining of the master net, all datasets for training the current master net (nn-1ee1aba5ed4c.nnue) are documented in this PR. more details about this dataset in: https://github.com/official-stockfish/Stockfish/pull/4782

linrock (Sep 14 '23 22:09)