KataGo Error during training

When I tried to train KataGo, I got this message: Shuffled data path does not exist, there seems to be no shuffled data yet, waiting and trying again later: /content/KataGo/python/selfplay/shuffleddata/current How can I solve it?

Jun 10 '22 12:06 Harder-Run

Presumably you are running it with command line arguments such that it expects the shuffled data to be at /content/KataGo/python/selfplay/shuffleddata/current

Is that the path you intended? If not, you need to specify the correct path, presumably the path that you are actually outputting the shuffled data to via shuffle.py or shuffle.sh.

If this is the very start of training from scratch, there might not be enough data yet, which is why it says it is waiting and trying again later. Check all the logs and paths for your selfplay processes and your shuffle processes.

Jun 10 '22 12:06 lightvector
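
To make the "waiting and trying again later" behavior concrete, here is a minimal sketch of a poll-and-retry loop on a data directory. It is not KataGo's actual code: the function name `wait_for_shuffled_data` and its parameters (`poll_seconds`, `max_attempts`, `sleep`) are hypothetical and only illustrate why training appears to hang until the directory shows up.

```python
import time
from pathlib import Path


def wait_for_shuffled_data(datadir, poll_seconds=30.0, max_attempts=None, sleep=time.sleep):
    """Poll until `datadir` exists.

    Hypothetical helper, not part of KataGo. Returns True once the path
    exists, or False if `max_attempts` checks all found it missing.
    With max_attempts=None it keeps polling indefinitely.
    """
    path = Path(datadir)
    attempts = 0
    while not path.exists():
        attempts += 1
        if max_attempts is not None and attempts >= max_attempts:
            return False
        # Nothing there yet: wait before checking again.
        sleep(poll_seconds)
    return True
```

A loop like this never fails on its own, so a wrong path and a shuffler that has not produced output yet look identical from the trainer's side. That is why checking the shuffle logs and paths is the next step.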

The parameters I set are ./selfplay/train.sh selfplay b6c96 b6c96 256 main -lr-scale 1.0 >> log.txt 2>&1 & disown

Jun 10 '22 12:06 Harder-Run

Okay, but where is the data that you are trying to train on?

Jun 10 '22 13:06 lightvector

/content/KataGo/python/selfplay/

Jun 10 '22 13:06 Harder-Run

They are npz files, right?

Jun 10 '22 13:06 Harder-Run

Great, yes, npz files are the raw data, and assuming you're using the current training code that uses Tensorflow, the shuffled data should be tfrecord files. So where is your shuffle script outputting the shuffled data? Check the logs for your shuffle script.

Jun 10 '22 13:06 lightvector
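
A quick way to see which of the two kinds of data actually exists is to count files on disk. The sketch below assumes only what is stated in this thread: raw data is stored as .npz files somewhere under the base directory, and the trainer looks in shuffleddata/current. The function name `summarize_data` is hypothetical, and no particular extension is assumed for the shuffled files.

```python
from pathlib import Path


def summarize_data(base_dir):
    """Count raw .npz files and shuffled output files under base_dir.

    Hypothetical diagnostic helper, not part of KataGo.
    """
    base = Path(base_dir)
    shuffled_dir = base / "shuffleddata" / "current"

    # Raw data: every .npz file under base_dir, excluding anything
    # that lives inside the shuffleddata directory.
    raw_npz = []
    if base.is_dir():
        for p in base.rglob("*.npz"):
            if "shuffleddata" not in p.relative_to(base).parts:
                raw_npz.append(p)

    # Shuffled data: any regular file under shuffleddata/current.
    shuffled_files = []
    if shuffled_dir.exists():
        shuffled_files = [p for p in shuffled_dir.rglob("*") if p.is_file()]

    return {
        "raw_npz_count": len(raw_npz),
        "shuffled_dir_exists": shuffled_dir.exists(),
        "shuffled_file_count": len(shuffled_files),
    }
```

Calling `summarize_data("/content/KataGo/python/selfplay")` and seeing a nonzero `raw_npz_count` together with `shuffled_dir_exists` equal to False would indicate that selfplay data exists but the shuffle step has not produced any output.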

Logs are in log.txt?

Jun 10 '22 13:06 Harder-Run

I have no idea, it depends on what file paths you're using. Are you shuffling the data at all? https://github.com/lightvector/KataGo/blob/master/python/selfplay/shuffle.sh

Jun 10 '22 13:06 lightvector

And: https://github.com/lightvector/KataGo/blob/master/python/selfplay/shuffle_loop.sh

Jun 10 '22 13:06 lightvector

mkdir -p /content/KataGo/python/selfplay//train/b6c96

git show --no-patch --no-color
git diff --no-color
git diff --staged --no-color
++ date +%Y%m%d-%H%M%S
DATE_FOR_FILENAME=20220610-125909
DATED_ARCHIVE=/content/KataGo/python/selfplay//scripts/train/dated/20220610-125909
mkdir -p /content/KataGo/python/selfplay//scripts/train/dated/20220610-125909
cp /content/KataGo/python/board.py /content/KataGo/python/common.py /content/KataGo/python/data.py /content/KataGo/python/elo.py /content/KataGo/python/export_model.py /content/KataGo/python/genboard_common.py /content/KataGo/python/genboard_run.py /content/KataGo/python/genboard_train.py /content/KataGo/python/inspect_variable.py /content/KataGo/python/migrate_sbscale.py /content/KataGo/python/modelconfigs.py /content/KataGo/python/model.py /content/KataGo/python/play.py /content/KataGo/python/set_global_step.py /content/KataGo/python/shuffle.py /content/KataGo/python/summarize_old_selfplay_files.py /content/KataGo/python/summarize_sgfs.py /content/KataGo/python/test.py /content/KataGo/python/tfrecordio.py /content/KataGo/python/train.py /content/KataGo/python/upload_model.py /content/KataGo/python/upload_poses.py /content/KataGo/python/visualize.py /content/KataGo/python/selfplay/train.sh /content/KataGo/python/selfplay//scripts/train/dated/20220610-125909
git show --no-patch --no-color
git diff --no-color
git diff --staged --no-color
'[' main == main ']'
EXPORT_SUBDIR=tfsavedmodels_toexport
EXTRAFLAG=
python3 /content/KataGo/python/train.py -traindir /content/KataGo/python/selfplay//train/b6c96 -datadir /content/KataGo/python/selfplay//shuffleddata/current/ -exportdir /content/KataGo/python/selfplay//tfsavedmodels_toexport -exportprefix b6c96 -pos-len 19 -batch-size 256 -gpu-memory-frac 0.6 -model-kind b6c96 -sub-epochs 4 -swa-sub-epoch-scale 4 -lr-scale 1.0
tee -a /content/KataGo/python/selfplay//train/b6c96/stdout.txt
['/content/KataGo/python/train.py', '-traindir', '/content/KataGo/python/selfplay//train/b6c96', '-datadir', '/content/KataGo/python/selfplay//shuffleddata/current/', '-exportdir', '/content/KataGo/python/selfplay//tfsavedmodels_toexport', '-exportprefix', 'b6c96', '-pos-len', '19', '-batch-size', '256', '-gpu-memory-frac', '0.6', '-model-kind', 'b6c96', '-sub-epochs', '4', '-swa-sub-epoch-scale', '4', '-lr-scale', '1.0']
Loading existing model config at /content/KataGo/python/selfplay//train/b6c96/model.config.json
{'version': 10, 'support_japanese_rules': True, 'use_fixup': True, 'use_scoremean_as_lead': False, 'use_initial_conv_3': True, 'use_fixed_sbscaling': True, 'trunk_num_channels': 96, 'mid_num_channels': 96, 'regular_num_channels': 64, 'dilated_num_channels': 32, 'gpool_num_channels': 32, 'block_kind': [['rconv1', 'regular'], ['rconv2', 'regular'], ['rconv3', 'gpool'], ['rconv4', 'regular'], ['rconv5', 'gpool'], ['rconv6', 'regular']], 'p1_num_channels': 32, 'g1_num_channels': 32, 'v1_num_channels': 32, 'sbv2_num_channels': 48, 'v2_size': 64}
WARNING:tensorflow:From /content/KataGo/python/model.py:1141: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Build SWA graph for SWA update and saving, 68 variables
Beginning training
INFO:tensorflow:Using config: {'_model_dir': '/content/KataGo/python/selfplay//train/b6c96', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000000000, '_save_checkpoints_secs': None, '_session_config': gpu_options { per_process_gpu_memory_fraction: 0.6 } , '_keep_checkpoint_max': 10, '_keep_checkpoint_every_n_hours': 1000000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f69774000d0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Shuffled data path does not exist, there seems to be no shuffled data yet, waiting and trying again later: /content/KataGo/python/selfplay/shuffleddata/current

Jun 10 '22 13:06 Harder-Run
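
Reading the train.py invocation in the log above, the directories passed to the trainer all appear to hang off the same base directory. The sketch below reproduces that mapping as inferred from this one log. The function name `training_paths` is hypothetical, and the export directory shown is the one the log uses in "main" mode; other modes may differ.

```python
from pathlib import PurePosixPath


def training_paths(base_dir, train_name):
    """Derive the trainer's directories from the base directory.

    Inferred from the logged train.py command line in this thread,
    not taken from train.sh itself.
    """
    base = PurePosixPath(base_dir)
    return {
        # Checkpoints, model.config.json and stdout.txt for this run.
        "traindir": str(base / "train" / train_name),
        # Where the trainer expects shuffled data to appear.
        "datadir": str(base / "shuffleddata" / "current"),
        # Export location seen in the log for "main" mode.
        "exportdir": str(base / "tfsavedmodels_toexport"),
    }
```

With base directory `/content/KataGo/python/selfplay/` and name `b6c96`, the `datadir` entry matches the path in the error message, so the trainer is looking where it was told to look and the missing piece is the shuffle output.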

My npz path is correct. I am a newbie, ha.

Jun 10 '22 13:06 Harder-Run

If you already have the raw data, you need to run a shuffle script that takes the raw data, shuffles it, converts it to tfrecord files, and writes them to shuffleddata/current. The training script will then be able to find the data it needs in shuffleddata/current.

I'm going afk shortly, so I can't help you any further, but if you still need help, please go to https://discord.gg/3jfxmrSqgC and ask there.

Jun 10 '22 13:06 lightvector

OK, thanks a lot!

Jun 10 '22 13:06 Harder-Run
