Poor training performance on specific cell type Hi-C data
Hello,
Thank you for the fantastic tool! However, I encountered poor training performance when attempting to train a model on Hi-C data for a specific cell type. The resulting model's Pearson correlation is only around 0.06, and I’m unsure where the issue lies. I suspect it might be related to parameters specific to Hi-C data or preprocessing steps. Below are the details of my setup:
Preprocessing code:
! /basenji/bin/akita_data.py --sample 0.1 \
-g /basenji/data/hg38_gaps_binsize2048_numconseq10.bed \
-l 1048576 --crop 65536 --local -k 1 -o /basenji/data/1m --as_obsexp \
-p 16 -t .1 -v .1 -w 2048 --snap 2048 \
/basenji/data/hg38.ml.fa /basenji/data/HiC_cools.txt
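As a quick sanity check on these flags, here is the bin arithmetic they imply (a minimal sketch; the numbers come straight from the command above and match the model parameters below, where target_crop is 32 bins):

```python
# Geometry implied by the akita_data.py flags above:
# 1048576 bp windows (-l), 2048 bp bins (-w), 65536 bp crop per side (--crop).
seq_length = 1048576   # -l
crop_bp = 65536        # --crop
bin_size = 2048        # -w

bins = seq_length // bin_size        # bins per window
crop_bins = crop_bp // bin_size      # bins cropped from each side
target_bins = bins - 2 * crop_bins   # bins remaining after cropping

print(bins, crop_bins, target_bins)  # 512 32 448
```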
Model parameters (params_tutorial.json):
{
"train": {
"batch_size": 2,
"optimizer": "sgd",
"learning_rate": 0.0065,
"momentum": 0.99575,
"loss": "mse",
"patience": 50,
"clip_norm": 10.0
},
"model": {
"seq_length": 1048576,
"target_length": 512,
"target_crop": 32,
"diagonal_offset": 2,
"augment_rc": true,
"augment_shift": 11,
"activation": "relu",
"norm_type": "batch",
"bn_momentum": 0.9265,
"trunk": [
{"name": "conv_block", "filters": 96, "kernel_size": 11, "pool_size": 2},
{"name": "conv_tower", "filters_init": 96, "filters_mult": 1.0, "kernel_size": 5, "pool_size": 2, "repeat": 10},
{"name": "dilated_residual", "filters": 48, "rate_mult": 1.75, "repeat": 8, "dropout": 0.4},
{"name": "conv_block", "filters": 64, "kernel_size": 5}
],
"head_hic": [
{"name": "one_to_two", "operation": "mean"},
{"name": "concat_dist_2d"},
{"name": "conv_block_2d", "filters": 48, "kernel_size": 3},
{"name": "symmetrize_2d"},
{"name": "dilated_residual_2d", "filters": 24, "kernel_size": 3, "rate_mult": 1.75, "repeat": 6, "dropout": 0.1},
{"name": "cropping_2d", "cropping": 32},
{"name": "upper_tri", "diagonal_offset": 2},
{"name": "final", "units": 1, "activation": "linear"}
]
}
}
Training code:
! akita_train.py -k -o ./data/1m/train_out/ ./data/1m/params_tutorial.json ./data/1m/
Also, my cool file is binned at 2048 bp and iteratively corrected (balanced) with cooler.
The trained model achieves a Pearson correlation of only ~0.06, and I'm uncertain whether the issue stems from Hi-C-specific parameters or from preprocessing. I noticed this comment in your original paper:
"To focus on locus-specific patterns and mitigate the impact of sparse sampling present in even the currently highest-resolution Hi-C maps, we adaptively coarse-grain, normalize for the distance-dependent decrease in contact frequency, take a natural log, clip to (−2,2), linearly interpolate missing bins and convolve with a small 2D Gaussian filter (sigma, 1 and width, 5). The first to third steps use cooltools functions (https://github.com/mirnylab/cooltools). Interpolation of low-coverage bins filtered out in typical Hi-C pipelines was crucial for learning with log(observed/expected) Hi-C targets, greatly outperforming replacing these bins with zeros."
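For reference, the log / clip / interpolate / smooth portion of that quoted description can be sketched with numpy and scipy. This is my own approximation of the quoted steps, not the actual cooltools or basenji code; the adaptive coarse-graining and distance-expected normalization are omitted, and the input is assumed to already be an observed/expected matrix with NaNs at filtered bins:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def preprocess_obsexp(obs_exp):
    """Approximate the quoted target transform on an obs/exp matrix
    that may contain NaNs for filtered (low-coverage) bins."""
    with np.errstate(divide="ignore", invalid="ignore"):
        m = np.log(obs_exp)                  # natural log
    m = np.clip(m, -2, 2)                    # clip to (-2, 2); NaNs pass through

    # Linearly interpolate missing bins, row by row (in place on views).
    for row in m:
        nan = np.isnan(row)
        if nan.any() and not nan.all():
            idx = np.arange(len(row))
            row[nan] = np.interp(idx[nan], idx[~nan], row[~nan])
    m = np.nan_to_num(m)                     # rows that were entirely NaN

    # Small 2D Gaussian: sigma 1, truncate at 2 sigma -> kernel width 5.
    return gaussian_filter(m, sigma=1, truncate=2)
```

The point of the quote, as I read it, is that the interpolation step (rather than zero-filling) is what matters most for log(observed/expected) targets.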
Question:
Are these preprocessing steps (adaptive coarse-graining, distance normalization, log transformation, clipping, interpolation, and Gaussian filtering) already included in akita_data.py? If not, could you provide guidance on how to incorporate them or suggest other potential causes for the poor performance?
I greatly appreciate your help and look forward to your response!
Here is the prediction result image generated by the trained model.
Hi Gemma, looks like you are using the -sample argument, which downsamples from the full training set. This is just for test runs. Still 0.06 is so low that there might also be a preprocessing issue prior to akita_data. The example patch looks a little bit lower coverage by eye than most of the training data we used, so it might be a sparser dataset.
Thank you for your prompt response. I have also tried training on the complete dataset (without downsampling) but still didn't achieve satisfactory results. This might be related to the limited resolution of my Hi-C data.
I also attempted to append my Hi-C data directly to the microc_cools.txt from the tutorial, training it together with the two mcool files provided there while using the same parameters. Unfortunately, the final val_pearsonr remained around 0.085, which is quite puzzling.
I'd like to confirm: are there any additional preprocessing steps specific to particular Hi-C data that I might have missed? Or is the poor performance solely due to my data's limited resolution?
You can look into the filters used for cooler balance as well as making sure mapping steps etc worked (see documentation for pairtools, cooler, etc)
Thanks for pointing me to these steps—I’ll review the documentation for pairtools and cooler carefully to check for any preprocessing issues. Besides, given my sparse data, which parameters would you recommend adjusting? Thanks again for your help!
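For context, here is the kind of quick check I mean by "sparse": the fraction of bins with zero total counts in a raw contact-map patch. A minimal numpy sketch (the helper name and the zero-count criterion are my own choices, not basenji code):

```python
import numpy as np

def empty_bin_fraction(patch):
    """Fraction of rows in a raw contact-map patch with zero total
    counts -- a rough proxy for how sparse the Hi-C library is."""
    row_sums = np.nansum(patch, axis=1)
    return float(np.mean(row_sums == 0))
```

Patches with a high empty-bin fraction lean heavily on the interpolation step described in the paper, so comparing this number against the tutorial coolers could indicate how much sparser my data really is.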
To be honest, I have not yet explored how to best mitigate sparsity in training data.
Good luck!
Hello. Thank you for this very impressive work.
I have the same issue as discussed in this thread. I also attempted to train the model using the HFF and H1hESC cool files provided in the training tutorial (https://nbviewer.org/github/gfudenberg/basenji/blob/master/manuscripts/akita/tutorial.ipynb), which I assume are balanced appropriately. I used the same preprocessing command as mentioned in this issue.
! /basenji/bin/akita_data.py --sample 0.1 \
-g /basenji/data/hg38_gaps_binsize2048_numconseq10.bed \
-l 1048576 --crop 65536 --local -k 1 -o /basenji/data/1m --as_obsexp \
-p 8 -t .1 -v .1 -w 2048 --snap 2048 \
/basenji/data/hg38.ml.fa /basenji/data/HiC_cools.txt
Pearson's correlation during training does not increase beyond 0.08. The predicted Hi-C looks exactly like this.
Here is the prediction result image generated by the trained model.
Can you please provide any insights on why this might be? Am I missing any preprocessing step?
Hi, like Geoff said, start by turning off the 10% down-sampling.
Hi David. Thank you for your prompt response. I indeed turned off the 10% down-sampling. Unfortunately, that did not help either.
Hi Geoff and David,
I would greatly appreciate your help with training Akita. This is a crucial comparison that the reviewers expect us to make.
We are using the H1hESC_hg38_4DNFI1O6IL1Q.mapq_30.2048.cool and HFF_hg38_4DNFIP5EUOFX.mapq_30.2048.cool files to train the model, as shown in the tutorial notebook.
To explain the issue better, here are the step-by-step scripts and hyperparameters that we used:
- Data preprocessing is done using the following command:
! /basenji/bin/akita_data.py -g /basenji/data/hg38_gaps_binsize2048_numconseq10.bed \
-l 1048576 --crop 65536 --local -o /basenji/data/1m --as_obsexp \
-p 8 -t .15 -v .15 -w 2048 --snap 2048 --stride_train 262144 --stride_test 32768 \
/basenji/data/hg38.ml.fa /basenji/data/HiC_cools.txt
Here, HiC_cools.txt looks as follows:
index identifier file clip sum_stat description
0 HFF /basenji/data/HFF_hg38_4DNFIP5EUOFX.mapq_30.2048.cool 2 sum HFF
1 H1 /basenji/data/H1hESC_hg38_4DNFI1O6IL1Q.mapq_30.2048.cool 2 sum H1
- Next, we trained the model using params.json, with the units of the final head_hic layer set to 2:
! akita_train.py -k -o /basenji/data/1m/train_out/ /basenji/data/1m/params_tutorial.json /basenji/data/1m/
The training converges after 24 epochs with a validation Pearson's R of 0.08030.
This trained model predicts the following contact maps for all inputs:
Am I missing some important argument during data preprocessing or training? Once again, thank you for your help, and I look forward to your response!
Hi Aayush, forgetting to turn off the sample step is the usual issue, but it seems you've addressed this...
Did you try training on the pre-generated tfrecords, and/or generating tfrecords for the genomic regions in the provided sequences.bed file? If that didn't work, then this might not work either. If that did work, then I am not sure what to suggest other than inspecting the training data patches manually.
Hi Geoff. Thank you for your suggestions! Training worked well with pre-generated tfrecords.
With the HFF and H1 datasets, the training also led to reasonable validation correlations when using the genomic regions in the provided sequences.bed file. In this case, I will continue using the regions in sequences.bed to train my Akita models and not generate them from scratch using akita_data.py. Thank you for your help!
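For anyone who wants to follow the suggestion above to inspect training patches manually: Akita stores each target as a flattened upper-triangular vector (with diagonal_offset 2 in the tutorial params), so viewing a patch means mapping that vector back onto a symmetric 2D matrix. A minimal numpy sketch (the helper name is mine; basenji ships its own equivalent):

```python
import numpy as np

def from_upper_triu(vec, side, diagonal_offset=2):
    """Rebuild a symmetric (side x side) map from a flattened
    upper-triangular target vector; the skipped diagonals stay NaN."""
    mat = np.full((side, side), np.nan)
    iu = np.triu_indices(side, k=diagonal_offset)
    mat[iu] = vec
    mat[(iu[1], iu[0])] = vec   # mirror into the lower triangle
    return mat
```

With the tutorial geometry (448-bin targets, diagonal_offset 2), the matrix returned here can be passed straight to matplotlib's imshow for visual inspection.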