nnue-pytorch
[WIP] Automatic script to fetch current datasets.
The intended purpose of this script is to always document and fetch the datasets required for replicating the training of the current Stockfish master network.
Right now this is mostly a skeleton with a DSL for defining how the datasets are combined. Downloading from Kaggle and concatenation should work but are untested. Interleaving is not yet implemented. For now it is only meant to be used in dry-run form.
A single Kaggle dataset is always combined into a single destination file by concatenation in alphabetical sort order. If that requirement turns out to be too rigid we can work on relaxing it, but I think it works for all the current datasets.
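As a minimal sketch of that combination rule, one Kaggle dataset's files could be joined like this (the function name and paths are hypothetical, not the script's actual API; the rule itself, alphabetical concatenation into one destination file, is from the description above):

```python
# Sketch of the fixed combination rule: every file in one downloaded
# Kaggle dataset is concatenated into a single destination binpack in
# alphabetical order. Names and layout here are hypothetical.
import shutil
from pathlib import Path

def concat_dataset(src_dir: str, dest: str) -> list[str]:
    # Collect the dataset parts and sort them by file name, since the
    # rule is alphabetical concatenation order.
    parts = sorted((p for p in Path(src_dir).iterdir() if p.is_file()),
                   key=lambda p: p.name)
    with open(dest, "wb") as out:
        for part in parts:
            # binpacks are combined by plain binary concatenation,
            # so a byte-for-byte append is sufficient.
            with open(part, "rb") as f:
                shutil.copyfileobj(f, out)
    # Return the write order, useful for a dry run.
    return [p.name for p in parts]
```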
@linrock Could you please add a full specification for the currently used datasets? I included an example for the dataset used in the first stage of training. If any needed functionality is missing, let me know.
Even the first stage alone requires downloading 200GB of data, so I'm unable to verify correctness right now. We'll get to that once the full process is documented.
I tried this a couple of days ago; IMO it looks good. I would not delete intermediate files by default: for most people it is harder to download 200GB than to store it, even though you might have different constraints on your server.
Let's try to complete the list of datasets for the current master nets. I can definitely upload more data as needed.
the current master stage2 dataset is composed of:
- https://www.kaggle.com/datasets/joostvandevondele/t60t70wisrightfarseert60t74t75t76
- https://www.kaggle.com/datasets/linrock/t78juntoaugt79mart80dec-16tb7p
LeelaFarseer-T78juntoaugT79marT80dec.binpack (141G)
T60T70wIsRightFarseerT60T74T75T76.binpack
test78-junjulaug2022-16tb7p.no-db.min.binpack
test79-mar2022-16tb7p.no-db.min.binpack
test80-dec2022-16tb7p.no-db.min.binpack
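Fetching the stage2 components could look like the sketch below using the official Kaggle CLI's `datasets download` subcommand (`-d`, `--unzip`, `-p` are real flags; the helper names and the output directory layout are assumptions, not part of the actual script):

```python
# Hedged sketch: build `kaggle datasets download` invocations for the
# stage2 components. STAGE2_DATASETS and the helpers are hypothetical
# names for illustration; running the commands requires Kaggle CLI
# credentials (~/.kaggle/kaggle.json).
STAGE2_DATASETS = [
    "joostvandevondele/t60t70wisrightfarseert60t74t75t76",
    "linrock/t78juntoaugt79mart80dec-16tb7p",
]

def kaggle_download_cmd(dataset: str, out_dir: str) -> list[str]:
    # `kaggle datasets download -d <owner/slug> --unzip -p <dir>`
    # fetches one dataset and extracts it into <dir>.
    return ["kaggle", "datasets", "download",
            "-d", dataset, "--unzip", "-p", out_dir]

def stage2_commands(out_root: str) -> list[list[str]]:
    # One command per component; a dry run would print these
    # instead of executing them (e.g. via subprocess.run).
    return [kaggle_download_cmd(d, f"{out_root}/{d.split('/')[1]}")
            for d in STAGE2_DATASETS]
```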
i'll take a closer look at stage3 later. the current L1-2048 master final stage dataset is an unshuffled 800GB+ binpack that i'm no longer using since it's too inconvenient. i'm working on replacing it with a fully minimized ~330GB dataset which i'll document later as well.
the current master stage3 dataset is:
- https://www.kaggle.com/datasets/linrock/leela96-filt-v2-min
- https://www.kaggle.com/datasets/linrock/dfrc99-16tb7p-filt-v2-min
- https://www.kaggle.com/datasets/linrock/sfnnv7-s3
leela96-dfrc99-v2-T80dectofeb-sk20-mar-v6-T77decT78janfebT79apr.binpack (223G)
leela96-filt-v2.min.binpack
dfrc99-16tb7p-eval-filt-v2.min.binpack
test80-dec2022-16tb7p-filter-v6-sk20.min-mar2023.binpack
test80-jan2023-16tb7p-filter-v6-sk20.min-mar2023.binpack
test80-feb2023-16tb7p-filter-v6-sk20.min-mar2023.binpack
test80-mar2023-2tb7p-filter-v6.min.binpack
test77-dec2021-16tb7p.no-db.min.binpack
test78-janfeb2022-16tb7p.no-db.min.binpack
test79-apr2022-16tb7p.no-db.min.binpack
the current master stage4/5 dataset is composed of:
- https://www.kaggle.com/datasets/linrock/leela96-filt-v2-min
- https://www.kaggle.com/datasets/linrock/dfrc99-16tb7p-filt-v2-min
- https://www.kaggle.com/datasets/linrock/0dd1cebea57-test80-v6-dd
- https://www.kaggle.com/datasets/linrock/0dd1cebea57-misc-v6-dd
- https://www.kaggle.com/datasets/linrock/test80-apr2023-2tb7p-no-db
the uploaded dataset components are all minimized. parts of the dataset were unminimized to increase randomness during training. however, it's unclear how much of an elo benefit this brings. see https://github.com/official-stockfish/Stockfish/pull/4606 for more details on this particular dataset.
as of now, all datasets for training the current master net (nn-c38c3d8d3920.nnue) are documented in this PR.
the current master stage6 dataset is composed of:
- https://www.kaggle.com/datasets/linrock/leela96-filt-v2-min
- https://www.kaggle.com/datasets/linrock/dfrc99-16tb7p-filt-v2-min
- https://www.kaggle.com/datasets/linrock/0dd1cebea57-test80-v6-dd/versions/2
- https://www.kaggle.com/datasets/linrock/0dd1cebea57-misc-v6-dd
- https://www.kaggle.com/datasets/linrock/1ee1aba5ed-test60-2020-test77-nov2021-2tb7p
- https://www.kaggle.com/datasets/linrock/1ee1aba5ed-test80-martojul2023-2tb7p
since this was a retraining of the master net, all datasets for training the current master net (nn-1ee1aba5ed4c.nnue) are documented in this PR. more details about this dataset in: https://github.com/official-stockfish/Stockfish/pull/4782