
Unable to train outside of docker

Open · fmocking opened this issue on Nov 18, 2021 · 0 comments

Hello, I enjoyed your work, thank you for sharing the code. I'm facing some issues training the model. I've already downloaded and placed the data folder. I believe the issue is with the parameters: I couldn't see anything in the docker run command that passes the parameters (presumably the configs) to the runner, so I modified the command to work outside of the Docker environment. Here is the command:

```
python -m src.graphqa.train config/train.yaml --model config/model.yaml --session config/session.yaml --in_memory=yes
```

and here is the error message:

```
usage: train.py [-h] [--logger [LOGGER]] [--checkpoint_callback [CHECKPOINT_CALLBACK]]
                [--default_root_dir DEFAULT_ROOT_DIR] [--gradient_clip_val GRADIENT_CLIP_VAL]
                [--process_position PROCESS_POSITION] [--num_nodes NUM_NODES]
                [--num_processes NUM_PROCESSES] [--gpus GPUS] [--auto_select_gpus [AUTO_SELECT_GPUS]]
                [--tpu_cores TPU_CORES] [--log_gpu_memory LOG_GPU_MEMORY]
                [--progress_bar_refresh_rate PROGRESS_BAR_REFRESH_RATE]
                [--overfit_batches OVERFIT_BATCHES] [--track_grad_norm TRACK_GRAD_NORM]
                [--check_val_every_n_epoch CHECK_VAL_EVERY_N_EPOCH] [--fast_dev_run [FAST_DEV_RUN]]
                [--accumulate_grad_batches ACCUMULATE_GRAD_BATCHES] [--max_epochs MAX_EPOCHS]
                [--min_epochs MIN_EPOCHS] [--max_steps MAX_STEPS] [--min_steps MIN_STEPS]
                [--limit_train_batches LIMIT_TRAIN_BATCHES] [--limit_val_batches LIMIT_VAL_BATCHES]
                [--limit_test_batches LIMIT_TEST_BATCHES] [--limit_predict_batches LIMIT_PREDICT_BATCHES]
                [--val_check_interval VAL_CHECK_INTERVAL]
                [--flush_logs_every_n_steps FLUSH_LOGS_EVERY_N_STEPS]
                [--log_every_n_steps LOG_EVERY_N_STEPS] [--accelerator ACCELERATOR]
                [--sync_batchnorm [SYNC_BATCHNORM]] [--precision PRECISION]
                [--weights_summary WEIGHTS_SUMMARY] [--weights_save_path WEIGHTS_SAVE_PATH]
                [--num_sanity_val_steps NUM_SANITY_VAL_STEPS]
                [--truncated_bptt_steps TRUNCATED_BPTT_STEPS]
                [--resume_from_checkpoint RESUME_FROM_CHECKPOINT] [--profiler [PROFILER]]
                [--benchmark [BENCHMARK]] [--deterministic [DETERMINISTIC]]
                [--reload_dataloaders_every_epoch [RELOAD_DATALOADERS_EVERY_EPOCH]]
                [--auto_lr_find [AUTO_LR_FIND]] [--replace_sampler_ddp [REPLACE_SAMPLER_DDP]]
                [--terminate_on_nan [TERMINATE_ON_NAN]] [--auto_scale_batch_size [AUTO_SCALE_BATCH_SIZE]]
                [--prepare_data_per_node [PREPARE_DATA_PER_NODE]] [--plugins PLUGINS]
                [--amp_backend AMP_BACKEND] [--amp_level AMP_LEVEL]
                [--distributed_backend DISTRIBUTED_BACKEND]
                [--automatic_optimization [AUTOMATIC_OPTIMIZATION]]
                [--move_metrics_to_cpu [MOVE_METRICS_TO_CPU]]
                [--enable_pl_optimizer [ENABLE_PL_OPTIMIZER]]
                [--multiple_trainloader_mode MULTIPLE_TRAINLOADER_MODE]
                [--stochastic_weight_avg [STOCHASTIC_WEIGHT_AVG]] [--resume RESUME]
                [rest [rest ...]]
train.py: error: unrecognized arguments: --model config/model.yaml --session config/session.yaml --in_memory=yes
```

If I remove the unrecognized arguments, it gets past that error, but then I hit an error caused by the PyTorch Geometric version:

```
RuntimeError: The 'data' object was created by an older version of PyG. If this error occurred while loading an already existing dataset, remove the 'processed/' directory in the dataset's root folder and try again.
```

I tried the suggested fix and removed the processed folder from CASP{i} for i in 9..13 (see the sketch below). Then I get the following error:

```
ValueError: With n_samples=0, test_size=None and train_size=0.85, the resulting train set will be empty. Adjust any of the aforementioned parameters.
```
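For reference, the removal step was equivalent to this minimal sketch; the `data/CASP{i}/processed` paths are an assumption based on the folder names mentioned above, so adjust them to the actual layout:

```python
# Delete the 'processed/' directory under each CASP dataset root
# (path layout assumed, not taken from the repo).
import shutil
from pathlib import Path

for i in range(9, 14):
    processed = Path(f"data/CASP{i}/processed")
    if processed.exists():
        shutil.rmtree(processed)
```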

This is probably because the script reads the processed files directly rather than checking for them and regenerating them when missing; hence n_samples=0.

Edit: solved by writing a custom script that loads the graphs with the older version (1.7.2) of torch_geometric, converts each torch_geometric.data.Data object to a dictionary with its to_dict() method, and then, under the newer torch_geometric (2.0.1), rebuilds them with Data.from_dict(). This fixes the preprocessed-data loading problem for recent versions of PyTorch Geometric.
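Roughly, the conversion looks like this. It is a minimal sketch in two steps, each run in a separate environment; the glob pattern is an assumption, and each processed file is assumed to pickle a single Data object, so adapt it to the actual dataset format:

```python
# Step 1 -- run in an environment with torch-geometric 1.7.2 installed.
# The "data/CASP*/processed/*.pt" pattern is a guess at the file layout.
import glob
import torch

for path in glob.glob("data/CASP*/processed/*.pt"):
    data = torch.load(path)                     # old-style Data object
    torch.save(data.to_dict(), path + ".dict")  # plain dict survives the upgrade

# Step 2 -- run in an environment with torch-geometric 2.0.1 installed.
import glob
import torch
from torch_geometric.data import Data

for path in glob.glob("data/CASP*/processed/*.pt.dict"):
    data = Data.from_dict(torch.load(path))    # rebuild with the new Data class
    torch.save(data, path[: -len(".dict")])    # overwrite the original file name
```

The intermediate plain-dict files carry only tensors and basic Python types, so they can be unpickled without either PyG version's Data class on the path.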

But unfortunately, even after the fixes above, the results are nowhere near the reported ones. Over a few runs, the best RMSE I got was 0.165, whereas the reported value is 0.130. Note that I trained with the provided train.yaml file. The hardware is different, but the gap between the reported and reproduced results is still very large. I would like to help improve reproducibility, but right now I have hit a roadblock. I hope the authors can provide some clarification.

Another problem I've noticed: train.raw.yml is not runnable; it fails with the following error:

```
AttributeError: 'NoneType' object has no attribute 'out_edge_feats'
```
