
Various questions about Remora

Open tcb72 opened this issue 2 years ago • 8 comments

Hello,

I have a few questions that are not really related to each other.

First: I always assumed that, whether training or running the model on new data, you had to know the reference in each case in order to generate ground-truth sequences. However, I just noticed this paragraph:

"The Remora API can be applied to make modified base calls given a basecalled read via a RemoraRead object. sig should be a float32 numpy array. seq is a string derived from sig (can be either basecalls or other downstream derived sequence; e.g. mapped reference positions). seq_to_sig_map should be an int32 numpy array of length len(seq) + 1 and elements should be indices within sig array assigned to each base in seq."

Let's say I know the reference sequence for the training data but may not know the reference for some new, unseen data. Would it be advisable to do the following process?

Training:

  1. Basecall using Guppy
  2. Generate ground-truth sequences by mapping basecalls to a reference
  3. Use Taiyaki prepare_mapped_reads.py to map signals to ground-truth sequences
  4. Convert Taiyaki .hdf5 into Remora .npz using remora dataset prepare
  5. Remora model train on resulting .npz
  6. Generate .onnx model file

Testing on new data:

  1. Basecall using Guppy
  2. Use Taiyaki prepare_mapped_reads.py (?) to map signals to basecalls (NOT a reference)
  3. Convert Taiyaki .hdf5 into Remora .npz using remora dataset prepare
  4. Run remora infer from_remora_dataset on resulting .npz file with the .onnx model file generated during training.

If the answer to the above question is yes, then what is the best way to map signals to the basecalls? Would I just use the same process (prepare_mapped_reads.py with basecalls.fastq as the reference)?

My second question has to do with remora dataset prepare. The default for the --motif parameter is N 0. However, from what I understand, a canonical base (ACTG, or any combination) motif/position has to be declared when running Taiyaki's prepare_mapped_reads.py. Generating predictions in any context would be ideal for my situation, but I'm not sure how to get the default to work here. If I try to run prepare_mapped_reads.py with --alphabet ACTG --mod Y N mod_long_name_here, it throws an assertion error saying "Canonical coding for modified base must be a canonical base, got N." If I run remora dataset prepare with default parameters after successfully running prepare_mapped_reads.py with something that does work, such as --alphabet ACTG --mod Y A mod_long_name_here, then Remora throws a RemoraError saying "Canonical base within motif does not match canonical equivalent for modified base (A)."

What I'm getting at is that it doesn't seem possible to run Remora with the default "any context" --motif parameter N 0 because of limitations in the tools used further upstream, such as Taiyaki's prepare_mapped_reads.py. If there is a way to generate a dataset with which the default Remora --motif parameter works, it would be a great help to know how to do that.

Thanks!

tcb72 · Mar 25 '22

I would strongly suggest using Megalodon to prepare the training data for Remora (as shown in the README). The procedure for generating the ground truth data in Megalodon is essentially the same as the first two steps of the Taiyaki pipeline you have described. After this, though, Megalodon preserves the mapping from Guppy basecalls to signal and combines it with the mapping from basecalls to reference bases in order to prepare training data. Alternatively, Taiyaki performs a (slow) procedure to map reference bases to signal via the basecalling model output (without going through the basecalls themselves). The key distinction is that the signal-to-sequence alignment is more consistent between Megalodon data prep and Guppy/Bonito/Megalodon inference. We have seen this increase accuracy by a percent or two in internal tests (which is why Remora models are linked to a basecalling model). It also means that the first three steps above are completed in a single Megalodon command, simplifying the data prep procedure.

At inference time, Guppy, Bonito, and Megalodon can each apply the trained model against basecalls (so no reference sequence is required at inference time). In Megalodon this output corresponds to --outputs mod_basecalls. Megalodon additionally has the --outputs mod_mappings output type, which performs exactly the steps used in training data preparation, so this command will generally produce the highest-accuracy calls. For SUP basecalling models, though, this difference is negligible.

I hope this answers your first set of questions, but please ask more questions if this is still unclear.

Remora has an experimental feature to call SNPs (termed base prediction within the Remora codebase); in this mode --motif N 0 is allowed. For any modified base model, however, a canonical base alternative is required. If I understand your use case correctly, you have a modification that can be applied to any canonical base (e.g. backbone modifications). In that case I would suggest preparing four models, one for each canonical base. This may not be ideal, but I would consider this a fairly niche use case (not that it isn't important to explore!), so we don't have a nice interface for it at the moment.

The other thing I would note here is that "completely modified" samples are probably not the best training samples unless this matches the target modification pattern at inference time. I'm not sure exactly what the use case is here, so this comment may be irrelevant to your research, but we have found that, for example, a PCR sample in which cytosine is replaced with 5mC does not produce robust training data for biologically relevant samples (with sparse modified base content). It is certainly a useful sample, but in our experience a fully modified sample may not, on its own, produce robust results.

Best of luck with your research!

marcus1487 · Mar 30 '22

Thanks for the response Marcus.

Our modified sample has a single modified base at a single known point in a plasmid construct. So if I understand correctly, the process should be as follows:

  1. Use Megalodon to generate four different sets of mapped signals, each corresponding to a different base at this particular position. We will need four different references as well, each with the corresponding base at that position. In order to prevent false positives, --ref-mods-all-motifs will be a unique 13-mer with the 0th position at the base of interest.

  2. Merge the negative and positive datasets into a merged dataset using Taiyaki's merge_mappedsignalfiles.py

  3. Use remora dataset prepare on the four merged hdf5 files (each with a different alt canonical base) so they are compatible with Remora.

  4. Train each dataset separately, generating four different models.

  5. On new, unknown fast5 signal data, feed the four models separately into Guppy, Bonito basecaller with --modified-base-model, or Megalodon with --remora-modified-bases.

  6. Custom code to combine the results from each model into a single file so we can have predictions at every base (see the sketch after this list).
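Here is a rough sketch of what that combining step might look like, assuming each model's calls have been exported to a simple per-position table (read_id, position, canonical base, modification probability). The file names and columns are hypothetical; real Guppy/Bonito/Megalodon outputs (e.g. modBAM or per-read text files) would need their own parsing.

```python
import csv
from pathlib import Path

# Hypothetical per-model call files in a simple TSV layout:
# read_id <tab> position <tab> canonical_base <tab> mod_probability
model_outputs = {
    "A": Path("calls_A_model.tsv"),
    "C": Path("calls_C_model.tsv"),
    "G": Path("calls_G_model.tsv"),
    "T": Path("calls_T_model.tsv"),
}

merged = {}
for canon_base, path in model_outputs.items():
    with path.open() as fh:
        for read_id, pos, base, prob in csv.reader(fh, delimiter="\t"):
            # Only keep each model's calls at its own canonical base.
            if base == canon_base:
                merged[(read_id, int(pos))] = (base, float(prob))

# Write a single combined table covering every called base.
with open("combined_calls.tsv", "w") as out:
    for (read_id, pos), (base, prob) in sorted(merged.items()):
        out.write(f"{read_id}\t{pos}\t{base}\t{prob}\n")
```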

The other thing I wanted to ask about was the randomer model that was introduced (but is not public) yesterday. Funnily enough, we came up with a near-identical idea a few weeks ago for our use case when we were trying to figure out how to generate new data. The problem with our current data (from 2019) is that the modified base is always in the same sequence context, so the model will just learn the sequence instead of the modified base signal. That is why we want to generate new data with random sequences around the modified base.

It will probably look something like this:

modified sample: [plasmid vector sequence] [30 bp randomer] [1 bp modified base] [30 bp randomer] [plasmid vector sequence]

control 1: [plasmid vector sequence] [30 bp randomer] [1 bp A] [30 bp randomer] [plasmid vector sequence]

control 2: [plasmid vector sequence] [30 bp randomer] [1 bp T] [30 bp randomer] [plasmid vector sequence]

control 3: [plasmid vector sequence] [30 bp randomer] [1 bp G] [30 bp randomer] [plasmid vector sequence]

control 4: [plasmid vector sequence] [30 bp randomer] [1 bp C] [30 bp randomer] [plasmid vector sequence]
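For illustration, the construct layouts above could be written out as reference strings roughly like the Python sketch below; the vector arms and randomer length are placeholders, not the real sequences.

```python
# Placeholder vector arms; the real plasmid sequences would go here.
VECTOR_5P = "GATTACA" * 8
VECTOR_3P = "CATTAG" * 8
RANDOMER = "N" * 30  # 30 bp randomer region

def construct_ref(center_base: str) -> str:
    """Layout: vector / 30 bp randomer / 1 bp center base / 30 bp randomer / vector."""
    return VECTOR_5P + RANDOMER + center_base + RANDOMER + VECTOR_3P

# Controls 1-4 put each canonical base at the center; the modified pool uses the
# same layout with the modified base (its canonical equivalent) at the center.
refs = {f"control_{b}": construct_ref(b) for b in "ATGC"}
```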

We would then run a pipeline similar to the one I outlined above. Of course, the only problem is that this type of analysis isn't supported yet in Remora, as the references cannot contain ambiguous sequences. Is this something that will be supported in the near future? It would be a massive benefit to our research.

Again, I greatly appreciate the detailed response, and I'm looking forward to any details you might have about randomer model support in Remora.

Best,

Tom

PS: the README is out of date as of 1.0; --mod-bases m no longer works in remora dataset prepare. It should be --mod-base m 5mC :)

tcb72 · Mar 31 '22

This is very similar to our internal randomer modified base training setup. We are continuing to work on processing this data type to achieve models suitable for release. Thus we do not have an immediate timeline for the release of this pre-processing code base or exact specifications for the samples, but hope to release it at some point.

As you've likely deduced, the main idea is to use custom processing to identify duplex pairs from these runs, produce duplex basecalls for these randomer sequences, and then provide this data to Remora for training. There are a number of steps on the lab and algorithm side that took a long while to iron out (we've been working on this premise for about 4 years now), but we are quite close to providing a useful product here. I will post here with any updates on our randomer model offerings once they are available.

Thank you also for the docs note. We did not get to updating the docs for this release, but will push those updates to GitHub soon.

marcus1487 · Mar 31 '22

Thanks for the response! A few more questions:

First, is it safe to assume the duplex basecalls will essentially act as the "reference" (due to their high accuracy), and that this pseudo-reference will be passed to Megalodon? Don't worry about answering if it's too complex or private to make public yet.

Second, I wanted to get your opinion on getting more positive class data into my final remora training dataset. As I mentioned before, we currently have a plasmid with a single modified base at a known location. As a result, right now I'm doing something like --ref-mods-all-motifs <mod base code> <mod base long name> <unique-13mer where first base corresponds to the modified base> 0 in Megalodon.

Using the signal_mappings.hdf5 dataset produced above, I then run two separate remora dataset prepare scripts. One of these contains the parameters --motif <same unique 13-mer used in the Megalodon script> --mod-base <mod base code> <mod base long name>; this produces a set of positive-class samples. To generate the negative-class data, the other script has the parameters --ref-mods-all-motifs <mod base code> <mod base long name> T 0, where T is the modified base and the first base of the unique 13-mer. As you can imagine, this produces a very unbalanced dataset with a huge number of 0-class labels and very few 1-class labels.

Finally, I merge these two using remora dataset merge with the --balance parameter. I'm currently ignoring the sequence data with some custom code, using only 'sig_tensor' as X and 'labels' as y.
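That custom step could look something like the sketch below; the dataset path is hypothetical, and the 'sig_tensor'/'labels' key names follow the description above (they may differ between Remora versions).

```python
import numpy as np

# Load the merged Remora training dataset (path is hypothetical).
data = np.load("merged_dataset.npz")
X = data["sig_tensor"]   # signal chunks, one per training example
y = data["labels"]       # 0 = canonical, 1 = modified

# Check class balance after `remora dataset merge --balance`.
classes, counts = np.unique(y, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```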

Is this a good way to get a large, balanced dataset for my particular use case? Or is there another method that is less hack-y?

tcb72 · Apr 01 '22

First, is it safe to assume the duplex basecalls will essentially act as the "reference" (due to their high accuracy), and that this pseudo-reference will be passed to Megalodon? Don't worry about answering if it's too complex or private to make public yet.

That is the main idea, but with some extra filtering/processing to improve things further.

For the second point, I would suggest that the control data should also come from the exact same motif. In this case that would likely mean a second plasmid with a canonical base at the location that is modified in the positive sample. In the case you describe, I would venture that the model would simply learn the fixed sequence and not the modified base signal.

That being said, I imagine this 13-mer context is likely not all that useful outside of the training data. Unfortunately, the fix for this (using the randomer data) is much more complicated and has been a large focus internally.

marcus1487 · Apr 14 '22

To be honest, I was using this old data just to get acclimated to working with raw signal data, the different tools, etc., with the outside chance that the model would somehow learn the modified base signal instead of the sequence. We confirmed, unsurprisingly, that it learns the surrounding sequence rather than the modified base itself. Unfortunate, but it gives us confidence in the randomer design.

We're hoping to order some double-stranded sequences soon with our preferred construct, nearly identical to the randomer slide in that late-March presentation. One pool will contain the modified base in the center of the randomer, and the other pool will be all standard nucleotides. Our idea is to use code I found deep in the ONT Community boards that takes start and end basecalled sequence positions and returns the corresponding signal (see https://gist.github.com/fbrennen/257130d54fd2325c6ac1a7cfb708286d). First, we will align the basecalled reads to the known region just prior to the randomer oligo insertion. The start position in this case is the end of the alignment (via minimap2), and the end position is start + randomer_size. Labels will be 0 or 1 depending on whether the read comes from the modified or the standard pool. The returned signal data will be one part of our training data. Since we also plan on doing duplex basecalling, we can incorporate sequence information as well, similar to how Remora does training currently.
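As a rough sketch of that extraction: the names and numbers below are hypothetical, and in practice the signal, basecalls, and base-to-signal mapping would come from the move table (as in the linked gist) while the anchor position comes from a minimap2 alignment.

```python
import numpy as np

# Hypothetical per-read arrays standing in for move-table-derived values.
sig = np.random.randn(2000).astype(np.float32)                     # raw/normalized signal
seq = "ACGT" * 100                                                  # basecalled sequence
seq_to_sig_map = np.linspace(0, len(sig), len(seq) + 1).astype(np.int32)

randomer_size = 61    # 30 bp + 1 bp center base + 30 bp
anchor_end = 150      # hypothetical end of the alignment to the known upstream region
start_base = anchor_end
end_base = start_base + randomer_size

# Signal spanning the randomer region of this read.
randomer_signal = sig[seq_to_sig_map[start_base]:seq_to_sig_map[end_base]]
label = 1  # 1 = modified pool, 0 = canonical control pool
```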

Not sure how this will work but the design seems pretty good. Looking forward to the eventual Remora implementation as well!

tcb72 · Apr 14 '22

Hi!

I came across this thread while looking into creating a new model for a specific DNA modification. The generation of oligos where the modified base is surrounded by randomers is essentially the same strategy our group landed on as well, and I was just wondering if there were any updates regarding the incorporation of randomers into the Remora training step?

Apologies if this is more of a Megalodon question, I can post there as well.

NanoCoreUSA · Aug 03 '22

@marcus1487 Finally got our randomers sequenced and have started to build the model. Two more questions:

  1. In the production models, approx how many training samples do you guys use?

  2. I'm slightly confused by the sequence input into the model. How I understand Remora is as follows:

First, the location of the motif (aka focus base) is determined. By default, the signal chunk is 50 signals up and downstream of this position. Easy enough.

The sequence portion is what I would call an "extended sequence". Let's say the sequence is TAGAC and, using the move table, first sample template, block stride, etc., T spans 5 signals, A spans 5 signals, G spans 10 signals, A spans 5 signals, and C spans 5 signals. The input to the model would then be a one-hot encoding of TTTTTAAAAAGGGGGGGGGGAAAAACCCCC.
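A minimal sketch of that expansion and one-hot encoding, using the made-up spans from the example:

```python
import numpy as np

BASES = "ACGT"
seq = "TAGAC"
spans = [5, 5, 10, 5, 5]   # signal points assigned to each base (made-up numbers)

# Expand each base across its assigned signal points:
# "TTTTTAAAAAGGGGGGGGGGAAAAACCCCC"
expanded = "".join(base * n for base, n in zip(seq, spans))

# One-hot encode: 4 base channels x number of signal positions.
one_hot = np.zeros((4, len(expanded)), dtype=np.float32)
for pos, base in enumerate(expanded):
    one_hot[BASES.index(base), pos] = 1.0
```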

My confusion comes from the "kmer context" aspect of the input. I watched one of your presentations on Remora, and the example was using 3-mers. But after staring at that slide forever, I still couldn't figure it out. If you could provide a better example of how to incorporate k-mer context, it would be greatly appreciated.

tcb72 · Aug 31 '22

For randomer processing, we have now released the Betta tool as a developer release. See the community note for instructions on accessing the repository here: https://community.nanoporetech.com/posts/betta-tool-release

For the production models, we aim for 60-90 million training chunks.

For the k-mer encoding: in your example you have expanded only the central base of each k-mer. Instead, you could replace each base with the k-mer centered on that base. You then end up with a matrix with k-mer bases on one axis and signal positions on the other, which is one-hot encoded for input into the neural network.
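To make that concrete, here is a small sketch extending the earlier TAGAC example to 3-mers. The exact channel ordering and end-padding used inside Remora may differ, so treat this only as an illustration of the idea.

```python
import numpy as np

BASES = "ACGT"
seq = "TAGAC"
spans = [5, 5, 10, 5, 5]   # signal points per base, as in the earlier example
kmer_len = 3               # 3-mers centered on each base
half = kmer_len // 2

# k-mer centered on each base, padding the sequence ends with N.
padded = "N" * half + seq + "N" * half
kmers = [padded[i:i + kmer_len] for i in range(len(seq))]  # ['NTA','TAG','AGA','GAC','ACN']

# Repeat each k-mer across that base's signal points, then one-hot encode each of
# the kmer_len positions over the 4 bases: shape (kmer_len * 4, n_signal_points).
n_sig = sum(spans)
encoding = np.zeros((kmer_len * 4, n_sig), dtype=np.float32)
sig_pos = 0
for kmer, n in zip(kmers, spans):
    for _ in range(n):
        for k_idx, base in enumerate(kmer):
            if base in BASES:              # N positions stay all-zero
                encoding[k_idx * 4 + BASES.index(base), sig_pos] = 1.0
        sig_pos += 1
```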

Here is a slide from a slightly older presentation which might help explain the encoding. [slide image]

marcus1487 · May 18 '23