
Poor training performance on specific cell type Hi-C data

Open Gemma-Zhang-326 opened this issue 11 months ago • 9 comments

Hello,

Thank you for the fantastic tool! However, I'm seeing poor training performance when training a model on Hi-C data from a specific cell type: the resulting model's Pearson correlation is only around 0.06, and I'm unsure where the issue lies. I suspect it might be related to Hi-C-specific parameters or preprocessing steps. Below are the details of my setup:

Preprocessing code:

! /basenji/bin/akita_data.py --sample 0.1 \
    -g /basenji/data/hg38_gaps_binsize2048_numconseq10.bed \
    -l 1048576 --crop 65536 --local -k 1 -o /basenji/data/1m --as_obsexp \
    -p 16 -t .1 -v .1 -w 2048 --snap 2048 \
    /basenji/data/hg38.ml.fa /basenji/data/HiC_cools.txt

Model parameters (params_tutorial.json):



{
  "train": {
    "batch_size": 2,
    "optimizer": "sgd",
    "learning_rate": 0.0065,
    "momentum": 0.99575,
    "loss": "mse",
    "patience": 50,
    "clip_norm": 10.0
  },
  "model": {
    "seq_length": 1048576,
    "target_length": 512,
    "target_crop": 32,
    "diagonal_offset": 2,
    "augment_rc": true,
    "augment_shift": 11,
    "activation": "relu",
    "norm_type": "batch",
    "bn_momentum": 0.9265,
    "trunk": [
      {"name": "conv_block", "filters": 96, "kernel_size": 11, "pool_size": 2},
      {"name": "conv_tower", "filters_init": 96, "filters_mult": 1.0, "kernel_size": 5, "pool_size": 2, "repeat": 10},
      {"name": "dilated_residual", "filters": 48, "rate_mult": 1.75, "repeat": 8, "dropout": 0.4},
      {"name": "conv_block", "filters": 64, "kernel_size": 5}
    ],
    "head_hic": [
      {"name": "one_to_two", "operation": "mean"},
      {"name": "concat_dist_2d"},
      {"name": "conv_block_2d", "filters": 48, "kernel_size": 3},
      {"name": "symmetrize_2d"},
      {"name": "dilated_residual_2d", "filters": 24, "kernel_size": 3, "rate_mult": 1.75, "repeat": 6, "dropout": 0.1},
      {"name": "cropping_2d", "cropping": 32},
      {"name": "upper_tri", "diagonal_offset": 2},
      {"name": "final", "units": 1, "activation": "linear"}
    ]
  }
}
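
For reference, these parameters imply the following geometry (simple arithmetic, not Akita code; the variable names are mine):

```python
# Geometry implied by params_tutorial.json (plain arithmetic sanity check)
seq_length = 1048576       # bp per training sequence
bin_size = 2048            # Hi-C bin width (-w 2048)
target_crop = 32           # bins cropped from each side of the map
diagonal_offset = 2        # diagonals excluded near the main diagonal

bins = seq_length // bin_size                 # 512, matches "target_length"
cropped = bins - 2 * target_crop              # 448-bin contact map after cropping
n = cropped - diagonal_offset
upper_tri_len = n * (n + 1) // 2              # 99681 flattened target values
```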

Training code:



! akita_train.py -k -o ./data/1m/train_out/ ./data/1m/params_tutorial.json ./data/1m/

Also, my .cool file is binned at 2048 bp and iteratively corrected (ICE-balanced) using cooler.

The trained model achieves a Pearson correlation of only ~0.06. I'm uncertain whether the issue stems from Hi-C-specific parameters or preprocessing steps. I noticed this comment in your original paper:

"To focus on locus-specific patterns and mitigate the impact of sparse sampling present in even the currently highest-resolution Hi-C maps, we adaptively coarse-grain, normalize for the distance-dependent decrease in contact frequency, take a natural log, clip to (−2,2), linearly interpolate missing bins and convolve with a small 2D Gaussian filter (sigma, 1 and width, 5). The first to third steps use cooltools functions (https://github.com/mirnylab/cooltools). Interpolation of low-coverage bins filtered out in typical Hi-C pipelines was crucial for learning with log(observed/expected) Hi-C targets, greatly outperforming replacing these bins with zeros."
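
For context, my understanding of those steps in plain numpy/scipy terms (a rough sketch, not the cooltools implementation; the function name and defaults are my own guess):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def preprocess_hic_target(mat, clip=2.0, sigma=1.0, truncate=2.0):
    """Rough sketch of the target preprocessing described in the paper:
    obs/exp distance normalization, natural log, clip, fill missing bins,
    and smooth with a small 2D Gaussian (sigma=1, truncate=2 gives a
    5-wide kernel). Illustrative only; cooltools does the real work."""
    n = mat.shape[0]
    out = np.full((n, n), np.nan)
    # 1) distance normalization: divide each diagonal by its mean (obs/exp)
    for d in range(n):
        diag = np.diagonal(mat, d).astype(float)
        if not np.isfinite(diag).any():
            continue
        exp = np.nanmean(diag)
        if np.isfinite(exp) and exp > 0:
            with np.errstate(divide="ignore", invalid="ignore"):
                vals = np.log(diag / exp)          # 2) natural log
            idx = np.arange(n - d)
            out[idx, idx + d] = vals
            out[idx + d, idx] = vals
    out = np.clip(out, -clip, clip)                # 3) clip to (-2, 2)
    # 4) fill missing bins and 5) smooth, via normalized Gaussian convolution
    missing = ~np.isfinite(out)
    filled = np.where(missing, 0.0, out)
    weight = gaussian_filter((~missing).astype(float), sigma, truncate=truncate)
    with np.errstate(divide="ignore", invalid="ignore"):
        out = gaussian_filter(filled, sigma, truncate=truncate) / weight
    return out
```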

Question:

Are these preprocessing steps (adaptive coarse-graining, distance normalization, log transformation, clipping, interpolation, and Gaussian filtering) already included in akita_data.py? If not, could you provide guidance on how to incorporate them or suggest other potential causes for the poor performance?

I greatly appreciate your help and look forward to your response!

Gemma-Zhang-326 avatar May 18 '25 06:05 Gemma-Zhang-326

[Image] Here is the prediction result generated by the trained model.

Gemma-Zhang-326 avatar May 18 '25 06:05 Gemma-Zhang-326

Hi Gemma, looks like you are using the --sample argument, which downsamples from the full training set. This is just for test runs. Still, 0.06 is so low that there might also be a preprocessing issue prior to akita_data. The example patch looks a little lower coverage by eye than most of the training data we used, so it might be a sparser dataset.

gfudenberg avatar May 18 '25 13:05 gfudenberg

> Hi Gemma, looks like you are using the --sample argument, which downsamples from the full training set. This is just for test runs. Still, 0.06 is so low that there might also be a preprocessing issue prior to akita_data. The example patch looks a little lower coverage by eye than most of the training data we used, so it might be a sparser dataset.

Thank you for your prompt response. Indeed, I have also tried training with the complete dataset (without downsampling) but still didn't achieve satisfactory results. This might be related to the limited resolution of my HiC data.

I also attempted to directly append my HiC data to the microc_cools.txt from the tutorial, training it together with the two mcool files provided in the tutorial while using the same parameters. Unfortunately, the final val_pearsonr remained around 0.085, which is quite puzzling.

I'd like to confirm: Are there any additional preprocessing steps specific to particular HiC data that I might have missed? Or is this poor performance solely due to my data's limited resolution?

Gemma-Zhang-326 avatar May 18 '25 14:05 Gemma-Zhang-326

You can look into the filters used for cooler balance, as well as making sure the mapping steps etc. worked (see the documentation for pairtools, cooler, etc.).
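
For a quick first check of sparsity, a generic numpy summary along these lines (illustrative, not a cooler tool; names are mine) can flag patches dominated by empty bins:

```python
import numpy as np

def sparsity_report(mat):
    """Quick sanity stats for a dense matrix patch extracted from a cooler:
    fraction of bins with zero marginal coverage, and fraction of pixels
    that are finite and nonzero."""
    marg = np.nansum(mat, axis=0)                       # per-bin coverage
    frac_empty_bins = float(np.mean(marg == 0))
    finite = np.isfinite(mat)
    frac_nonzero = float(np.mean(finite & (mat > 0)))
    return frac_empty_bins, frac_nonzero
```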

gfudenberg avatar May 18 '25 14:05 gfudenberg

> You can look into the filters used for cooler balance, as well as making sure the mapping steps etc. worked (see the documentation for pairtools, cooler, etc.).

Thanks for pointing me to these steps; I'll review the documentation for pairtools and cooler carefully to check for any preprocessing issues. Also, given my sparse data, which parameters would you recommend adjusting? Thanks again for your help!

Gemma-Zhang-326 avatar May 18 '25 14:05 Gemma-Zhang-326

To be honest, I have not yet explored how to best mitigate sparsity in training data.

Good luck!

gfudenberg avatar May 18 '25 15:05 gfudenberg

Hello. Thank you for this very impressive work.

I have the same issue as discussed in this thread. I also attempted to train the model using the HFF and H1hESC cool files provided in the training tutorial (https://nbviewer.org/github/gfudenberg/basenji/blob/master/manuscripts/akita/tutorial.ipynb), which I assume are balanced appropriately. I used the same preprocessing command as mentioned in this issue.

! /basenji/bin/akita_data.py --sample 0.1 \
    -g /basenji/data/hg38_gaps_binsize2048_numconseq10.bed \
    -l 1048576 --crop 65536 --local -k 1 -o /basenji/data/1m --as_obsexp \
    -p 8 -t .1 -v .1 -w 2048 --snap 2048 \
    /basenji/data/hg38.ml.fa /basenji/data/HiC_cools.txt

Pearson's correlation during training does not increase beyond 0.08. The predicted Hi-C looks exactly like this.

[Image] Here is the prediction result image generated by the trained model.

Can you please provide any insights on why this might be? Am I missing any preprocessing step?

AayushGrover avatar Jun 02 '25 11:06 AayushGrover

Hi, like Geoff said, start by turning off the 10% down-sampling.

davek44 avatar Jun 08 '25 21:06 davek44

Hi David. Thank you for your prompt response. I indeed turned off the 10% down-sampling. Unfortunately, that did not help either.

AayushGrover avatar Jun 10 '25 14:06 AayushGrover

Hi Geoff and David,

I would greatly appreciate your help with training Akita. This is a crucial comparison that the reviewers expect us to make.

We are using the H1hESC_hg38_4DNFI1O6IL1Q.mapq_30.2048.cool and HFF_hg38_4DNFIP5EUOFX.mapq_30.2048.cool files to train the model, as shown in the tutorial notebook.

To explain the issue better, here are the step-by-step scripts and hyperparameters that we used:

  1. Data preprocessing is done using the following command:
! /basenji/bin/akita_data.py -g /basenji/data/hg38_gaps_binsize2048_numconseq10.bed \
    -l 1048576 --crop 65536 --local -o /basenji/data/1m --as_obsexp \
    -p 8 -t .15 -v .15 -w 2048 --snap 2048 --stride_train 262144 --stride_test 32768 \
    /basenji/data/hg38.ml.fa /basenji/data/HiC_cools.txt

Here, HiC_cools.txt looks as follows:

index   identifier  file    clip    sum_stat    description
0   HFF /basenji/data/HFF_hg38_4DNFIP5EUOFX.mapq_30.2048.cool  2   sum HFF
1   H1  /basenji/data/H1hESC_hg38_4DNFI1O6IL1Q.mapq_30.2048.cool   2   sum H1
  2. Next, we trained the model using params.json, with the units of head_hic set to 2:
! akita_train.py -k -o /basenji/data/1m/train_out/ /basenji/data/1m/params_tutorial.json /basenji/data/1m/

The training converges after 24 epochs with a validation Pearson's R of 0.08030.

This trained model predicts the following contact maps for all inputs:

[Image: predicted contact maps]

Am I missing some important argument during data preprocessing or training? Once again, thank you for your help, and I look forward to your response!

AayushGrover avatar Jan 13 '26 10:01 AayushGrover

Hi Aayush, forgetting to turn off the --sample step is the usual issue, but it seems you've addressed this...

Did you try training on the pre-generated tfrecords, and/or generating tfrecords for the genomic regions in the sequences.bed file? If that didn't work, then this might not work either. If that did work, then I am not sure what to suggest other than inspecting the training data patches manually.
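
For manual inspection, a flattened target vector can be folded back into a symmetric map roughly like this (a numpy-only sketch in the spirit of the tutorial's from_upper_triu helper; this reimplementation and its argument names are mine):

```python
import numpy as np

def from_upper_triu(vec, matrix_len, num_diags):
    """Rebuild a symmetric 2D contact map from a flattened upper-triangular
    target vector; the num_diags diagonals nearest the main diagonal are
    left as NaN, mirroring the diagonal_offset used at training time."""
    z = np.full((matrix_len, matrix_len), np.nan)
    iu = np.triu_indices(matrix_len, num_diags)
    z[iu] = vec
    z[(iu[1], iu[0])] = vec  # mirror into the lower triangle
    return z
```

With the tutorial geometry (448-bin maps, diagonal_offset 2), the vector holds 446*447/2 = 99,681 values.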

gfudenberg avatar Jan 13 '26 21:01 gfudenberg

Hi Geoff. Thank you for your suggestions! Training worked well with pre-generated tfrecords.

With the HFF and H1 datasets, the training also led to reasonable validation correlations when using the genomic regions in the provided sequences.bed file. In this case, I will continue using the regions in sequences.bed to train my Akita models and not generate them from scratch using akita_data.py. Thank you for your help!

AayushGrover avatar Jan 26 '26 14:01 AayushGrover