KataGo
Error during training
When I trained KataGo, I ran into a problem: "Shuffled data path does not exist, there seems to be no shuffled data yet, waiting and trying again later: /content/KataGo/python/selfplay/shuffleddata/current". How can I solve it?
Presumably you are running it with command line arguments such that it expects the shuffled data to be at /content/KataGo/python/selfplay/shuffleddata/current
Is that the path you intended? If not, you need to specify the correct path, presumably the path that you are actually outputting the shuffled data to via shuffle.py or shuffle.sh.
If this is the very start of training from scratch, there might not be enough data yet, which is why it says it is waiting and trying again later. Check all the logs and paths for your selfplay processes and your shuffle processes.
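For example, a quick sanity check (assuming the Colab paths from your error message; adjust if your base directory is different) is to confirm that raw selfplay data exists and to look at the directory train.py is polling:

find /content/KataGo/python/selfplay -name "*.npz" | head
ls /content/KataGo/python/selfplay/shuffleddata/current/ 2>/dev/null || echo "no shuffled data yet"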
The command I ran is: ./selfplay/train.sh selfplay b6c96 b6c96 256 main -lr-scale 1.0 >> log.txt 2>&1 & disown
Okay, but where is the data that you are trying to train on?
/content/KataGo/python/selfplay/
They are npz files, right?
Great, yes, npz files are the raw data, and assuming you're using the current training code that uses Tensorflow, the shuffled data should be tfrecord files. So where is your shuffle script outputting the shuffled data? Check the logs for your shuffle script.
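For instance, one rough way to see whether a shuffle run has produced anything (assuming everything lives under the same base directory and that the shuffled files carry a .tfrecord extension):

find /content/KataGo/python/selfplay -name "*.tfrecord" 2>/dev/null | head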
Logs are in log.txt?
I have no idea; it depends on what file paths you're using. Are you shuffling the data at all? https://github.com/lightvector/KataGo/blob/master/python/selfplay/shuffle.sh
And: https://github.com/lightvector/KataGo/blob/master/python/selfplay/shuffle_loop.sh
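For a local synchronous-style run, the invocation looks roughly like this (a sketch only; the argument order is an assumption based on the repo's example scripts, so check the usage message printed by shuffle.sh itself; it should write its output under BASEDIR/shuffleddata/current, which is exactly where train.py is looking):

cd /content/KataGo/python
./selfplay/shuffle.sh selfplay /tmp/shufflescratch 4 256 >> shuffle_log.txt 2>&1

Here "selfplay" is the same base directory you passed to train.sh, /tmp/shufflescratch is a scratch directory for temporary files, 4 is the number of shuffle threads, and 256 is the training batch size. shuffle_loop.sh just reruns the same step periodically as new selfplay data arrives.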
mkdir -p /content/KataGo/python/selfplay//train/b6c96
- git show --no-patch --no-color
- git diff --no-color
- git diff --staged --no-color
++ date +%Y%m%d-%H%M%S
- DATE_FOR_FILENAME=20220610-125909
- DATED_ARCHIVE=/content/KataGo/python/selfplay//scripts/train/dated/20220610-125909
- mkdir -p /content/KataGo/python/selfplay//scripts/train/dated/20220610-125909
- cp /content/KataGo/python/board.py /content/KataGo/python/common.py /content/KataGo/python/data.py /content/KataGo/python/elo.py /content/KataGo/python/export_model.py /content/KataGo/python/genboard_common.py /content/KataGo/python/genboard_run.py /content/KataGo/python/genboard_train.py /content/KataGo/python/inspect_variable.py /content/KataGo/python/migrate_sbscale.py /content/KataGo/python/modelconfigs.py /content/KataGo/python/model.py /content/KataGo/python/play.py /content/KataGo/python/set_global_step.py /content/KataGo/python/shuffle.py /content/KataGo/python/summarize_old_selfplay_files.py /content/KataGo/python/summarize_sgfs.py /content/KataGo/python/test.py /content/KataGo/python/tfrecordio.py /content/KataGo/python/train.py /content/KataGo/python/upload_model.py /content/KataGo/python/upload_poses.py /content/KataGo/python/visualize.py /content/KataGo/python/selfplay/train.sh /content/KataGo/python/selfplay//scripts/train/dated/20220610-125909
- git show --no-patch --no-color
- git diff --no-color
- git diff --staged --no-color
- '[' main == main ']'
- EXPORT_SUBDIR=tfsavedmodels_toexport
- EXTRAFLAG=
- python3 /content/KataGo/python/train.py -traindir /content/KataGo/python/selfplay//train/b6c96 -datadir /content/KataGo/python/selfplay//shuffleddata/current/ -exportdir /content/KataGo/python/selfplay//tfsavedmodels_toexport -exportprefix b6c96 -pos-len 19 -batch-size 256 -gpu-memory-frac 0.6 -model-kind b6c96 -sub-epochs 4 -swa-sub-epoch-scale 4 -lr-scale 1.0
- tee -a /content/KataGo/python/selfplay//train/b6c96/stdout.txt
['/content/KataGo/python/train.py', '-traindir', '/content/KataGo/python/selfplay//train/b6c96', '-datadir', '/content/KataGo/python/selfplay//shuffleddata/current/', '-exportdir', '/content/KataGo/python/selfplay//tfsavedmodels_toexport', '-exportprefix', 'b6c96', '-pos-len', '19', '-batch-size', '256', '-gpu-memory-frac', '0.6', '-model-kind', 'b6c96', '-sub-epochs', '4', '-swa-sub-epoch-scale', '4', '-lr-scale', '1.0']
Loading existing model config at /content/KataGo/python/selfplay//train/b6c96/model.config.json
{'version': 10, 'support_japanese_rules': True, 'use_fixup': True, 'use_scoremean_as_lead': False, 'use_initial_conv_3': True, 'use_fixed_sbscaling': True, 'trunk_num_channels': 96, 'mid_num_channels': 96, 'regular_num_channels': 64, 'dilated_num_channels': 32, 'gpool_num_channels': 32, 'block_kind': [['rconv1', 'regular'], ['rconv2', 'regular'], ['rconv3', 'gpool'], ['rconv4', 'regular'], ['rconv5', 'gpool'], ['rconv6', 'regular']], 'p1_num_channels': 32, 'g1_num_channels': 32, 'v1_num_channels': 32, 'sbv2_num_channels': 48, 'v2_size': 64}
WARNING:tensorflow:From /content/KataGo/python/model.py:1141: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Build SWA graph for SWA update and saving, 68 variables
Beginning training
INFO:tensorflow:Using config: {'_model_dir': '/content/KataGo/python/selfplay//train/b6c96', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000000000, '_save_checkpoints_secs': None, '_session_config': gpu_options { per_process_gpu_memory_fraction: 0.6 } , '_keep_checkpoint_max': 10, '_keep_checkpoint_every_n_hours': 1000000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f69774000d0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Shuffled data path does not exist, there seems to be no shuffled data yet, waiting and trying again later: /content/KataGo/python/selfplay/shuffleddata/current
My npz path is correct. I am a newbie, ha.
If you have the raw data already, then you need to run a shuffle script that will take the raw data, shuffle it, convert it to tfrecord files, and output it into shuffleddata/current; then the training script will be able to find the data it needs there.
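So the ordering is: selfplay writes the npz files, the shuffle step turns them into tfrecords under shuffleddata/current, and only then does train.sh stop waiting. Roughly, under the same assumptions as the sketch above (check shuffle.sh's usage message for the exact arguments):

cd /content/KataGo/python
# shuffle the raw npz data into BASEDIR/shuffleddata/current
./selfplay/shuffle.sh selfplay /tmp/shufflescratch 4 256 >> shuffle_log.txt 2>&1
# once shuffleddata/current is populated, the already-running train.sh should pick it up on its next retry
ls /content/KataGo/python/selfplay/shuffleddata/current/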
I'm going afk shortly, so I can't help you any further, but if you still need help, please go to https://discord.gg/3jfxmrSqgC and ask there.
OK, thanks a lot!