remora Model improvement questions

Greetings,

This is mostly about how to improve the quality of Remora models, with a few other questions below. I have trained a custom modification Remora model and used it to basecall a modified strand of DNA. The resulting model is fairly poor in that it mistakes natural CG sites for modified ones. I presumed this was due to poor modification efficiency on my end. I then used the default 0.0.0 Remora mC model to basecall a methylated control strand and trained a model from it as well. I was surprised to see that my mC model was also of poor quality. Do you have any suggestions on how to improve the model training itself, given that I am unable to train a basic mC model using nearly 100% methylated DNA strands? I'm attaching some IGV pictures visualizing Remora's pre-trained mC model, my trained mC model, and a custom model for our modification, in that order:

Megalodon mod_mappings using a pre-trained remora 5mC model

rem-pre-mC

Megalodon mod_mappings using a 5mC model trained by me

rem-cmC

Megalodon mod_mappings using a 5ahyC model trained by me

Note that IGV shows 5ahyC in blue just like 5hmC

rem-ahyC

The pre-trained model makes me believe that the methylation efficiency is sufficient. Rather, it is the model training that I could improve. I used the workflow described in the repo's README. The models were prepared on a different ~1kb substrate with relatively well-spaced CG motifs (similar number and spacing as in the pictures). At this point I have a few questions about this type of model training, and some unrelated ones:

Do you have any suggestions for how I could improve the model training process? Are there settings to adjust, or is it the substrate that is lacking?
How exactly is the hmC-mC model trained? Is it possible to train a model that can separate hmC from my custom modification, as the hmC-mC model does for hmC and mC?
Is it possible to train a model using only + strands, i.e. by mapping the signals of + strands only or separating them afterwards? Alternatively, is there a way to process a fast5 file to separate the strands, assuming the sequence is not palindromic and is barcoded? This is important since we have difficulty modifying both strands.
What exactly does accuracy describe when a Remora model is in training? Note that my trained models had >0.99 accuracy.
More importantly, what kind of substrate do you recommend for model creation (CG content, length, etc.)?

Sep 14 '22 15:09 jorisbalc

It would be helpful if you could post a bit more detail on the nature of your training data. If your data is composed of 100% modified bases (i.e. all C swapped for a modified base), this may be the source of the poor results. Models trained on fully modified samples do not perform well on samples with sparse modified bases. The default parameters generally work quite well out of the box, and we have found that data preparation is the best place to focus for improved results.

The 5mC+5hmC model is trained with 3 samples derived from HG002: 1) a standard PCR sample, 2) a PCR+M.SssI-treated sample to convert all cytosines in CG contexts to 5mC, and 3) a sample treated with TET enzymes along with other co-factors to convert most 5mC to 5hmC. After some filtering, this training data is provided to the standard training command.

It is currently not very simple to select the training data. We are working on a major rewrite of the data preparation pipeline that will start from BAM and POD5 files to directly create Remora datasets. With this pipeline you will be able to filter a BAM file before it is input into the training pipeline. I will post here once this update is available.
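As an illustration of the kind of BAM-level filtering mentioned above (for example, keeping only + strand reads), here is a minimal sketch that works on SAM text records. It is not part of Remora or the planned pipeline; the function name `keep_forward_reads` and the example records are hypothetical. It relies only on the SAM flag bits 0x10 (reverse strand) and 0x4 (unmapped).

```python
REVERSE_FLAG = 0x10   # SAM flag bit: read is mapped to the reverse strand
UNMAPPED_FLAG = 0x4   # SAM flag bit: read is unmapped


def keep_forward_reads(sam_lines):
    """Return header lines plus mapped forward-strand alignment records.

    Hypothetical helper for illustration; it operates on SAM text lines.
    """
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            kept.append(line)
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & UNMAPPED_FLAG or flag & REVERSE_FLAG:
            continue
        kept.append(line)
    return kept


# Toy records (made up for this example, not real data).
example = [
    "@SQ\tSN:substrate\tLN:1000",
    "read_fwd\t0\tsubstrate\t1\t60\t4M\t*\t0\t0\tACGT\t*",
    "read_rev\t16\tsubstrate\t1\t60\t4M\t*\t0\t0\tACGT\t*",
    "read_unmapped\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
]
forward_only = keep_forward_reads(example)
```

On a real BAM file the same selection is commonly done with `samtools view -F 20`, which excludes reads that have either of these two flag bits set.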

In training, accuracy is the number of chunks whose call matches the training label divided by the total number of chunks tested.
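A minimal sketch of that ratio, assuming one predicted class and one training label per chunk. The function name `chunk_accuracy` and the toy labels are made up for illustration and are not Remora's internal code.

```python
def chunk_accuracy(predicted, labels):
    """Fraction of chunks whose predicted class equals the training label."""
    if len(predicted) != len(labels):
        raise ValueError("predicted and labels must have the same length")
    if not labels:
        raise ValueError("at least one chunk is required")
    matches = sum(p == t for p, t in zip(predicted, labels))
    return matches / len(labels)


# Toy example: 0 = canonical, 1 = modified (hypothetical encoding).
labels = [0, 0, 1, 1, 1]
predicted = [0, 1, 1, 1, 1]
acc = chunk_accuracy(predicted, labels)  # 4 of 5 chunks match
```

Because the score is measured against the labels that were supplied, it cannot reveal labels that are themselves wrong: a model can agree with mislabeled chunks and still report a high accuracy.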

The substrate for the training data is key to the robust performance of the model. At a high level the training data should look like the target data at inference time. I realize that this is often quite difficult to produce, but this should be the goal in producing the highest accuracy Remora models. We are aiming to provide more details on an exact protocol for the sample preparation and training of more exotic modified bases in the future.

Oct 06 '22 22:10 marcus1487

For both the custom modification and methylation model training we used two substrates. Both were ~1kb in length with a GC content of ~40%. Only the CpG sites were modified, with a modification efficiency of 50-100% on the custom modified DNA substrate and near 100% on the methylated DNA substrate. The PCR products contained 24 and 29 CpG sites. The CpG sites were quite spaced out, as you can see in the pictures attached to my first post (note that the pictures show practically the whole PCR product). I also forgot to mention that the unmodified DNA contains only canonical bases.

The biggest problem with the models is that canonical CpG sites are called as modified (this applies to both the methylation and custom modification models), whereas modified sites are recognized quite well. Is there any way to resolve this?

Oct 07 '22 13:10 jorisbalc

I see a couple of issues here. The first is that this limited context is not very likely to produce a robust model outside of the ~50 sites in this training data set. Production models are trained from millions of different contexts.

The second problem sounds like the partially unmodified training data. If the sites that are unmodified in the custom DNA substrate are included in training, these will train the Remora model to identify canonical sites as this modified base. You will likely have to perform some type of filtering or custom machine learning approaches to enhance this data to produce a robust model. This is not something that we can provide for the vast array of possible training data types.
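To put a rough number on the second issue, here is a small sketch of the label noise implied by incomplete modification. It assumes (a simplification not stated in this thread) that every CpG chunk from the modified sample is labeled as modified and that each site is modified independently with probability equal to the modification efficiency. The function name is hypothetical.

```python
def mislabeled_fraction(modification_efficiency):
    """Expected fraction of 'modified'-labeled chunks that are really canonical.

    Assumes all chunks from the modified sample carry the modified label.
    """
    if not 0.0 <= modification_efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return 1.0 - modification_efficiency


# The 50-100% efficiency range reported for the custom substrate.
for efficiency in (0.5, 0.75, 1.0):
    noise = mislabeled_fraction(efficiency)
    print(f"efficiency {efficiency:.0%}: ~{noise:.0%} of modified labels are wrong")
```

Under these assumptions, the low end of that range would mean about half of the chunks labeled as modified actually carry a canonical base.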

Oct 14 '22 16:10 marcus1487

Thank you for the reply. That's what we figured as well. You also mentioned that a model differentiating two distinct modifications can be trained (similar to the mC/hmC model provided). Could you provide a little more information on how this is done? Are the three mapped signals merged together, or are both modified DNA signals mapped to a reference and then provided to a Remora training command?

Oct 20 '22 09:10 jorisbalc

This is primarily a function of training data. Given training chunks from a canonical and two different modified bases, these datasets can simply be merged and a model trained, just as in the case of a single modified base. The key here is that the ground truth data be very high quality. The model can only learn what it is presented. So poorly annotated or mislabeled data will likely produce a poor quality model. I hope this helps, but please ask if you have any further issues.
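Conceptually, the merge described above can be sketched as follows. This is an illustration only and does not use Remora's dataset format or commands; `merge_labeled_chunks`, the class names, and the toy signal values are all hypothetical.

```python
import random


def merge_labeled_chunks(chunks_by_class, seed=0):
    """Merge per-class chunk lists into one shuffled (chunk, label) list.

    chunks_by_class maps a class name to its list of chunks; the integer
    label of a class is its position in the returned class_names list.
    """
    class_names = list(chunks_by_class)
    merged = [
        (chunk, label)
        for label, name in enumerate(class_names)
        for chunk in chunks_by_class[name]
    ]
    # Shuffle so that classes are interleaved rather than grouped.
    random.Random(seed).shuffle(merged)
    return class_names, merged


# Toy signal chunks (made-up numbers) for a canonical and two modified classes.
class_names, merged = merge_labeled_chunks({
    "canonical": [[0.1, 0.2], [0.0, 0.3]],
    "5hmC": [[0.9, 0.8]],
    "custom": [[0.5, 0.4], [0.6, 0.5]],
})
```

Each chunk keeps the label of the sample it came from, which is why the quality of the ground truth for every sample matters so much.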

Dec 08 '22 16:12 marcus1487
