BCAI_kaggle_CHAMPS
Bosch solution to CHAMPS Kaggle competition
Hello!
Below you can find an outline of how to reproduce our solution for the CHAMPS competition. If you run into any trouble with the setup/code or have any questions, please contact us at [email protected]
Copyright 2019 Robert Bosch GmbH
Code authors: Zico Kolter, Shaojie Bai, Devin Wilmott, Mordechai Kornbluth, Jonathan Mailoa, part of Bosch Research (CR).
Archive Contents
- config/: Configuration files
- data/: Raw data
- models/: Saved models
- processed/: Processed data
- src/: Source code for preprocessing, training, and predicting
- submission/: Directory for the actual predictions
Hardware (The following specs were used to create the original solution)
The various models were trained on different machines, each running a Linux OS:
- 5 machines had 4 GPUs, each an NVIDIA GeForce RTX 2080 Ti
- 2 machines had 1 NVIDIA Tesla V100 GPU with 32 GB of memory
- 6 machines had 1 NVIDIA Tesla V100 GPU with 16 GB of memory
Software
- Python 3.5+
- CUDA 10.1
- NVIDIA APEX (at the time of writing, only available through its GitHub repository)
Python packages are detailed separately in requirements.txt.
Note: Though listed in requirements.txt, rdkit is not available with pip. We strongly suggest installing rdkit via conda:
conda install -c rdkit rdkit
Data Setup
We use only the train.csv, test.csv, and structures.csv files of the competition. They should be (unzipped and) placed in the data/ directory. All of the commands below are executed from the src/ directory.
Data Processing
cd src/
python pipeline_pre.py 1   (this could take 1-2 hours)
python pipeline_pre.py 2
(You may need to change the permissions of the .csv files via chmod to run the two scripts above.)
Model Build - There are three options to produce the solution.
While in src/:
- Very fast prediction: predictor.py fast to use the precomputed results for ensembling.
- Ordinary prediction: predictor.py to use the precomputed checkpoints for predicting and ensembling.
- Re-train models: train.py to train a new model from scratch. See train.py -h for allowed arguments, and the config files for each model for the arguments used.
The config/models.json file contains the following important keys:
- names: List of the names we will ensemble
- output file: The name of the ensembled output file
- num atom types, bond types, triplet types, quad types: These are arguments to pass to the GraphTransformer instantiator. Note that in the default setting, quadruplet information is not used by GTs.
- model_dir: The directory in models/ associated with each model. Each directory must have: graph_transformer.py with a GraphTransformer class (and any modules it needs); a config file with the kwargs to instantiate the GraphTransformer class; and [MODEL_NAME].ckpt, which can be loaded via load_state_dict(torch.load('[MODEL_NAME].ckpt').state_dict()) (to avoid PyTorch version conflicts). A minimal loading sketch is shown below this list.
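To make this layout concrete, here is a minimal sketch (run from src/) of instantiating and restoring a single model from these files. The JSON format of the per-model config file, the directory lookup, and the variable names are assumptions for illustration; only the load_state_dict(torch.load(...).state_dict()) pattern comes from this README.

```python
import json

import torch

from graph_transformer import GraphTransformer  # see the loading note below

# Read the ensemble configuration (key names as listed above).
with open("../config/models.json") as fh:
    cfg = json.load(fh)

names = cfg["names"]              # list of model names to ensemble
output_file = cfg["output file"]  # name of the ensembled output file

for name in names:
    # Hypothetical layout: the actual directory comes from the model's
    # model_dir entry in models.json.
    model_dir = f"../models/{name}"

    # Instantiate the model from the kwargs stored in its config file
    # (assumed here to be JSON).
    with open(f"{model_dir}/config") as fh:
        kwargs = json.load(fh)
    model = GraphTransformer(**kwargs)

    # Restore the weights via the checkpoint's state_dict, as described
    # above, to avoid PyTorch version conflicts.
    checkpoint = torch.load(f"{model_dir}/{name}.ckpt", map_location="cpu")
    model.load_state_dict(checkpoint.state_dict())
```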
Notes on (Pre-trained) Model Loading
All pretrained models are stored in models/. However, different models may have slightly different architectures (e.g., some GT models are followed by a 2-layer grouped residual network, while others have only one residual block). The training script (train.py), when run without the --debug flag, will automatically create a log folder in CHAMPS-GT/ that contains the code for the GT used. When loading a model, use the graph_transformer.py in that log folder (instead of the default one in src/).
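One way to load such a model programmatically is to import graph_transformer.py from the log folder with importlib before restoring the checkpoint. This is a sketch with placeholder paths and kwargs, not the project's own loader:

```python
import importlib.util

import torch

# Placeholder path to the log folder created by train.py (run without --debug).
log_dir = "CHAMPS-GT/<your_log_folder>"

# Import the graph_transformer.py saved in that log folder so the class
# definition matches the architecture of the trained checkpoint.
spec = importlib.util.spec_from_file_location(
    "graph_transformer", f"{log_dir}/graph_transformer.py")
gt = importlib.util.module_from_spec(spec)
spec.loader.exec_module(gt)

# Instantiate with the same kwargs used for training (see the config note
# above; the empty dict here is a placeholder), then restore the weights.
kwargs = {}
model = gt.GraphTransformer(**kwargs)
checkpoint = torch.load("path/to/[MODEL_NAME].ckpt", map_location="cpu")
model.load_state_dict(checkpoint.state_dict())
```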
Notes on Model Training
When trained from scratch, the default parameters should lead to a model achieving a score of around -3.06 to -3.07. Using the --debug flag will prevent the program from creating a log folder.
Notes on Saving Memory
What if you get a CUDA out-of-memory error? We suggest a few solutions:
- If you have a multi-GPU machine, use the --multi_gpu flag and tune the --gpu0_bsz flag (which controls the minibatch size passed to GPU device 0). For instance, on a 4-GPU machine, you can do python train.py [...] --batch_size 47 --multi_gpu --gpu0_bsz 11, which assigns a batch size of 12 to GPUs 1, 2, 3 and a batch size of 11 to GPU 0 (11 + 3 × 12 = 47).
- Use the --fp16 option, which applies NVIDIA APEX's mixed-precision training.
- Use the --batch_chunk option, which chunks a larger batch into a few smaller (equal) shares. The gradients from the smaller minibatches accumulate, so the effective batch size is still the same as --batch_size (see the sketch after this list).
- Use fewer --n_layer, or a smaller --batch_size :P
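To illustrate what --batch_chunk does conceptually, here is a minimal, generic gradient-accumulation sketch. The model, criterion, optimizer, and tensor-shaped batch are stand-ins; this is not the repository's actual training loop:

```python
import torch

def train_step(model, criterion, optimizer, inputs, targets, batch_chunk):
    """One optimizer step with gradient accumulation over `batch_chunk`
    equal shares of the batch; the effective batch size is unchanged."""
    optimizer.zero_grad()
    for x, y in zip(torch.chunk(inputs, batch_chunk, dim=0),
                    torch.chunk(targets, batch_chunk, dim=0)):
        # Rescale so the accumulated gradient matches one full-batch step.
        loss = criterion(model(x), y) / batch_chunk
        loss.backward()  # gradients from each share accumulate in .grad
    optimizer.step()

# Example: an effective batch of 48 processed as 4 shares of 12 each.
# train_step(model, criterion, optimizer, inputs, targets, batch_chunk=4)
```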