
Error during training

Open Harder-Run opened this issue 2 years ago • 13 comments

When I tried to train KataGo, I got this message: "Shuffled data path does not exist, there seems to be no shuffled data yet, waiting and trying again later: /content/KataGo/python/selfplay/shuffleddata/current". How can I solve it?

Harder-Run avatar Jun 10 '22 12:06 Harder-Run

Presumably you are running it with command line arguments such that it expects the shuffled data to be at /content/KataGo/python/selfplay/shuffleddata/current

Is that the path you intended? If not, you need to specify the correct path, presumably the path that you are actually outputting the shuffled data to via shuffle.py or shuffle.sh.

If this is the very start of training from scratch, there might not be enough data yet, which is why it says it is waiting and trying again later. Check all the logs and paths for your selfplay processes and your shuffle processes.
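To see the "waiting and trying again later" behavior in isolation, here is a minimal sketch that polls for the shuffled data directory a few times before giving up. The function name `wait_for_shuffled_data` and its parameters are made up for illustration; this is not KataGo's own code, only a rough imitation of what the message describes.

```python
import time
from pathlib import Path


def wait_for_shuffled_data(data_dir, attempts=3, delay_seconds=1.0):
    """Return True as soon as data_dir exists, checking up to `attempts` times.

    Hypothetical helper for illustration; not part of KataGo.
    """
    path = Path(data_dir)
    for attempt in range(1, attempts + 1):
        if path.is_dir():
            return True
        print(f"[{attempt}/{attempts}] no shuffled data at {path}, will retry")
        if attempt < attempts:
            time.sleep(delay_seconds)
    return False


if __name__ == "__main__":
    # Path copied from the error message quoted in this thread.
    found = wait_for_shuffled_data(
        "/content/KataGo/python/selfplay/shuffleddata/current",
        attempts=2,
        delay_seconds=0.1,
    )
    print("found" if found else "still missing: check the shuffle step and its logs")
```

If this keeps returning False while selfplay data already exists, the likely cause is that the shuffle step is not running or is writing somewhere else.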

lightvector avatar Jun 10 '22 12:06 lightvector

The command I ran is: ./selfplay/train.sh selfplay b6c96 b6c96 256 main -lr-scale 1.0 >> log.txt 2>&1 & disown

Harder-Run avatar Jun 10 '22 12:06 Harder-Run

Okay, but where is the data that you are trying to train on?

lightvector avatar Jun 10 '22 13:06 lightvector

/content/KataGo/python/selfplay/

Harder-Run avatar Jun 10 '22 13:06 Harder-Run

They are npz files, right?

Harder-Run avatar Jun 10 '22 13:06 Harder-Run

Great, yes, npz files are the raw data, and assuming you're using the current training code that uses TensorFlow, the shuffled data should be tfrecord files. So where is your shuffle script outputting the shuffled data? Check the logs for your shuffle script.
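A quick way to tell the two kinds of data apart is to count files by suffix under the base directory. This is a minimal sketch that assumes raw files end in `.npz` and shuffled files end in `.tfrecord`; the helper name `count_data_files` is hypothetical.

```python
from pathlib import Path


def count_data_files(base_dir):
    """Count raw (.npz) and shuffled (.tfrecord) files anywhere under base_dir.

    Hypothetical helper; the suffixes are an assumption based on this thread.
    """
    counts = {".npz": 0, ".tfrecord": 0}
    for path in Path(base_dir).rglob("*"):
        if path.is_file() and path.suffix in counts:
            counts[path.suffix] += 1
    return counts
```

Calling `count_data_files("/content/KataGo/python/selfplay")` and getting many `.npz` files but zero `.tfrecord` files would suggest that the shuffle step has not produced any output yet.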

lightvector avatar Jun 10 '22 13:06 lightvector

Logs are in log.txt?

Harder-Run avatar Jun 10 '22 13:06 Harder-Run

I have no idea, it depends on what file paths you're using. Are you shuffling the data at all? https://github.com/lightvector/KataGo/blob/master/python/selfplay/shuffle.sh

lightvector avatar Jun 10 '22 13:06 lightvector

And: https://github.com/lightvector/KataGo/blob/master/python/selfplay/shuffle_loop.sh

lightvector avatar Jun 10 '22 13:06 lightvector

mkdir -p /content/KataGo/python/selfplay//train/b6c96

+ git show --no-patch --no-color
+ git diff --no-color
+ git diff --staged --no-color
++ date +%Y%m%d-%H%M%S
+ DATE_FOR_FILENAME=20220610-125909
+ DATED_ARCHIVE=/content/KataGo/python/selfplay//scripts/train/dated/20220610-125909
+ mkdir -p /content/KataGo/python/selfplay//scripts/train/dated/20220610-125909
+ cp /content/KataGo/python/board.py /content/KataGo/python/common.py /content/KataGo/python/data.py /content/KataGo/python/elo.py /content/KataGo/python/export_model.py /content/KataGo/python/genboard_common.py /content/KataGo/python/genboard_run.py /content/KataGo/python/genboard_train.py /content/KataGo/python/inspect_variable.py /content/KataGo/python/migrate_sbscale.py /content/KataGo/python/modelconfigs.py /content/KataGo/python/model.py /content/KataGo/python/play.py /content/KataGo/python/set_global_step.py /content/KataGo/python/shuffle.py /content/KataGo/python/summarize_old_selfplay_files.py /content/KataGo/python/summarize_sgfs.py /content/KataGo/python/test.py /content/KataGo/python/tfrecordio.py /content/KataGo/python/train.py /content/KataGo/python/upload_model.py /content/KataGo/python/upload_poses.py /content/KataGo/python/visualize.py /content/KataGo/python/selfplay/train.sh /content/KataGo/python/selfplay//scripts/train/dated/20220610-125909
+ git show --no-patch --no-color
+ git diff --no-color
+ git diff --staged --no-color
+ '[' main == main ']'
+ EXPORT_SUBDIR=tfsavedmodels_toexport
+ EXTRAFLAG=
+ python3 /content/KataGo/python/train.py -traindir /content/KataGo/python/selfplay//train/b6c96 -datadir /content/KataGo/python/selfplay//shuffleddata/current/ -exportdir /content/KataGo/python/selfplay//tfsavedmodels_toexport -exportprefix b6c96 -pos-len 19 -batch-size 256 -gpu-memory-frac 0.6 -model-kind b6c96 -sub-epochs 4 -swa-sub-epoch-scale 4 -lr-scale 1.0
+ tee -a /content/KataGo/python/selfplay//train/b6c96/stdout.txt
['/content/KataGo/python/train.py', '-traindir', '/content/KataGo/python/selfplay//train/b6c96', '-datadir', '/content/KataGo/python/selfplay//shuffleddata/current/', '-exportdir', '/content/KataGo/python/selfplay//tfsavedmodels_toexport', '-exportprefix', 'b6c96', '-pos-len', '19', '-batch-size', '256', '-gpu-memory-frac', '0.6', '-model-kind', 'b6c96', '-sub-epochs', '4', '-swa-sub-epoch-scale', '4', '-lr-scale', '1.0']
Loading existing model config at /content/KataGo/python/selfplay//train/b6c96/model.config.json
{'version': 10, 'support_japanese_rules': True, 'use_fixup': True, 'use_scoremean_as_lead': False, 'use_initial_conv_3': True, 'use_fixed_sbscaling': True, 'trunk_num_channels': 96, 'mid_num_channels': 96, 'regular_num_channels': 64, 'dilated_num_channels': 32, 'gpool_num_channels': 32, 'block_kind': [['rconv1', 'regular'], ['rconv2', 'regular'], ['rconv3', 'gpool'], ['rconv4', 'regular'], ['rconv5', 'gpool'], ['rconv6', 'regular']], 'p1_num_channels': 32, 'g1_num_channels': 32, 'v1_num_channels': 32, 'sbv2_num_channels': 48, 'v2_size': 64}
WARNING:tensorflow:From /content/KataGo/python/model.py:1141: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Build SWA graph for SWA update and saving, 68 variables
Beginning training
INFO:tensorflow:Using config: {'_model_dir': '/content/KataGo/python/selfplay//train/b6c96', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000000000, '_save_checkpoints_secs': None, '_session_config': gpu_options { per_process_gpu_memory_fraction: 0.6 } , '_keep_checkpoint_max': 10, '_keep_checkpoint_every_n_hours': 1000000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f69774000d0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Shuffled data path does not exist, there seems to be no shuffled data yet, waiting and trying again later: /content/KataGo/python/selfplay/shuffleddata/current
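The trace above shows which directories the train.py call ends up with: `-traindir`, `-datadir` and `-exportdir` all sit under the same base directory. Here is a minimal sketch that rebuilds those three paths and reports which ones are missing on disk. The layout is read off this one log only (with export mode `main`), and the helper names `training_paths` and `missing_paths` are hypothetical.

```python
from pathlib import Path


def training_paths(base_dir, training_name):
    """Rebuild the directories seen in the logged train.py command.

    Assumption: layout inferred from the single trace in this thread.
    """
    base = Path(base_dir)
    return {
        "traindir": base / "train" / training_name,
        "datadir": base / "shuffleddata" / "current",
        "exportdir": base / "tfsavedmodels_toexport",
    }


def missing_paths(paths):
    """Return the names of the paths that do not exist on disk."""
    return [name for name, path in paths.items() if not path.exists()]


if __name__ == "__main__":
    paths = training_paths("/content/KataGo/python/selfplay", "b6c96")
    for name, path in paths.items():
        print(f"{name}: {path}")
    print("missing:", missing_paths(paths))
```

In the situation described here, `datadir` would be expected to show up in the missing list, which matches the final line of the log.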

Harder-Run avatar Jun 10 '22 13:06 Harder-Run

My npz path is correct. I am a newbie, ha.

Harder-Run avatar Jun 10 '22 13:06 Harder-Run

If you already have the raw data, you need to run a shuffle script that takes the raw data, shuffles it, converts it to tfrecord files, and writes the result into shuffleddata/current. The training script will then be able to find the data it needs in shuffleddata/current.
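As a rough picture of what "shuffle" means here, the toy sketch below pools rows from several input files, shuffles them, and cuts the result into fixed-size output shards. It is only an illustration of the idea using plain Python lists; it is not KataGo's shuffle.py, it does not read npz or write tfrecord files, and the name `shuffle_into_shards` is made up.

```python
import random


def shuffle_into_shards(input_files, shard_size, seed=0):
    """Pool rows from every input file, shuffle them, and split into shards.

    Toy illustration only: each "file" is just a list of rows.
    """
    rows = [row for rows_in_file in input_files for row in rows_in_file]
    random.Random(seed).shuffle(rows)  # seeded so runs are repeatable
    return [rows[i:i + shard_size] for i in range(0, len(rows), shard_size)]


if __name__ == "__main__":
    raw_files = [[1, 2, 3], [4, 5], [6]]
    for index, shard in enumerate(shuffle_into_shards(raw_files, shard_size=4)):
        print(f"shard {index}: {len(shard)} rows")
```

The point of the mixing step is that each output shard draws rows from many different input files, so the training script does not see the games in the order they were generated.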

I'm going afk shortly, so I can't help you any further, but if you still need help, please go to https://discord.gg/3jfxmrSqgC and ask there.

lightvector avatar Jun 10 '22 13:06 lightvector

OK, thanks a lot!

Harder-Run avatar Jun 10 '22 13:06 Harder-Run