
Support analysis of spike-in sequences

Open sunta3iouxos opened this issue 4 years ago • 4 comments

Hi all, does this pipeline handle spike-ins (like PhiX from Illumina)? Thank you in advance.

sunta3iouxos avatar Aug 23 '19 15:08 sunta3iouxos

Hi @sunta3iouxos,

Do you mean PhiX from Illumina? Those reads are typically removed in the demultiplexing step that generates the FastQ files, which occurs before this pipeline runs. So it shouldn't require any special behaviour to handle PhiX.

Phil

ewels avatar Aug 24 '19 08:08 ewels

Thank you Phil for your time,

Correct, I was referring to PhiX, typo! [Edit: fixed in original post]

"Those reads are typically removed in the demultiplexing step to generate the FastQ files, which occurs before this pipeline runs"

I was not aware of that, I need to check it.

I would like to expand the spike-ins question:

  1. Mainly we would like to use spike ins to calculate the conversion efficiency. Is it possible with this pipeline?
  2. What about the new EM-kit from NEB? https://international.neb.com/products/e7120-nebnext-enzymatic-methyl-seq-kit#Product%20Information or https://www.diagenode.com/files/products/kits/premium-RRBS-spike-in%20controls-08_18.pdf

Sorry for all this but as I work in the university core facility we are mainly interested in producing nice quality control reports.

theo

sunta3iouxos avatar Sep 10 '19 13:09 sunta3iouxos

"Mainly we would like to use spike ins to calculate the conversion efficiency. Is it possible with this pipeline?"

Not with PhiX spike-ins, as these are added to the pool immediately prior to sequencing. As such they do not undergo bisulfite treatment.

I have used the pipeline for analysis with spike-ins before, but it doesn't have native support as such. What I did was to run the pipeline twice: once with the Human reference genome and once with a reference genome made from a Fasta file of the spike-in sequences. This second analysis will hopefully have a terrible alignment rate (roughly corresponding to a little less than the percentage of reads you spiked in), and the methylation rate will tell you the conversion efficiency.
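To make the last step concrete, here is a minimal sketch of turning the spike-in methylation rate into a conversion efficiency. It assumes the spike-in is fully unmethylated (so every "methylated" call is a conversion failure); the function name and the example counts are hypothetical and not part of the pipeline.

```python
def conversion_efficiency(methylated_calls: int, unmethylated_calls: int) -> float:
    """Estimate conversion efficiency from cytosine calls on a spike-in.

    Assumption: the spike-in is truly unmethylated, so any cytosine still
    called as methylated reflects a failed conversion.
    """
    total = methylated_calls + unmethylated_calls
    if total == 0:
        raise ValueError("no cytosine calls on the spike-in")
    methylation_rate = methylated_calls / total
    return 1.0 - methylation_rate


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    eff = conversion_efficiency(methylated_calls=50, unmethylated_calls=9950)
    print(f"conversion efficiency: {eff:.2%}")
```

The counts would come from the methylation summary of the spike-in run; for a methylated control the same rate would be read the other way round.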

"What about the new EM-kit from NEB?"

I have not personally tested the pipeline with data from this kit yet. In theory, I believe that the sequencing data should look pretty similar, so I see no reason why it shouldn't work. We have a project starting now to test it out, so I will be running it and testing it in the coming months. Maybe @felixkrueger has tried running Bismark with NEB EM-seq data already?

"or https://www.diagenode.com/files/products/kits/premium-RRBS-spike-in%20controls-08_18.pdf"

This appears to be a totally separate product from NEB EM-seq, it's from the Diagenode Premium RRBS Kit. I have no experience of this kit.

"Sorry for all this but as I work in the university core facility we are mainly interested in producing nice quality control reports."

No problem! I do too, and as you might be able to guess from some of my other projects I'm also a big fan of nice QC reports. I think it would be a nice feature to have native support for analysis of spike-in sequences along with the main run. With the current pipeline this would involve a lot of code. However, when we move to Nextflow DSL2 with modules, it should be fairly trivial to add this support. See https://github.com/nf-core/modules for more info (with modules we can have a small, clean conditional block that re-runs all of the same processes with a different set of input channels).

I will leave this issue open as a reminder to add support for this in the future.

Phil

ewels avatar Sep 10 '19 20:09 ewels

Good morning both. I don't think we have actively used the NEB EM-kit or the Diagenode RRBS kit ourselves, but I seem to recall that a few issues with these technologies have been raised before (e.g. https://github.com/FelixKrueger/Bismark/issues/225 or https://github.com/FelixKrueger/Bismark/issues/208).

One thing to add regarding spike-ins:

Instead of running two consecutive rounds of alignments, one to the genome of interest and then a second one against the spike-in sequence, another possibility would be to include the spike-in sequence (e.g. lambda, phiX, M13 etc.) as an additional 'chromosome' of the genome of interest, and then carry out the genome indexing once more. In this way you should be able to get both alignments and conversion rates in a single step.
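As a rough sketch of the "additional chromosome" idea, the snippet below appends spike-in FASTA records to a copy of the genome FASTA. The function name, the `spike_` header prefix and the file names in the usage note are all hypothetical; the prefix is only there so spike-in contigs cannot clash with genome contig names and are easy to pick out later.

```python
from pathlib import Path
from typing import Iterable, List


def merge_fastas(genome_fasta: Path, spike_in_fastas: Iterable[Path],
                 out_fasta: Path, prefix: str = "spike_") -> List[str]:
    """Write genome plus spike-in records to one FASTA; return spike-in contig names."""
    spike_names: List[str] = []
    with out_fasta.open("w") as out:
        # Copy the genome of interest unchanged (ensure a trailing newline).
        with genome_fasta.open() as fh:
            for line in fh:
                out.write(line if line.endswith("\n") else line + "\n")
        # Append each spike-in, prefixing its header name.
        for path in spike_in_fastas:
            with path.open() as fh:
                for line in fh:
                    line = line.rstrip("\n")
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        fields = line[1:].split()
                        name = prefix + (fields[0] if fields else "unnamed")
                        spike_names.append(name)
                        line = ">" + name
                    out.write(line + "\n")
    return spike_names
```

For example, `merge_fastas(Path("genome.fa"), [Path("lambda.fa")], Path("genome_plus_spike.fa"))` would write the combined file, which then still needs to go through the usual genome indexing step before alignment.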

This is not to say that the spike-ins will be a useful control: more often than not, they seem to behave in a slightly odd way, e.g. their conversion efficiencies appear to be worse than what one sees for methylation in the non-CG context of the genome of interest. This could be due to tertiary structures or other conversion artefacts, as has been nicely demonstrated for the methylation of mitochondria.

FelixKrueger avatar Sep 11 '19 09:09 FelixKrueger

Closing this issue as it's now very old. Happy to reopen in another state if we have a concrete proposal of what should be added.

ewels avatar Nov 03 '22 09:11 ewels

"Closing this issue as it's now very old. Happy to reopen in another state if we have a concrete proposal of what should be added."

Hi! I'm really interested in this functionality, mostly for alignment to unmethylated lambda as a negative control for the conversion. But pUC19 would be really useful as well (both spike-ins are included in the NEB EM-seq kit). I am currently either running the pipeline three times, once for each genome, or running Bismark standalone. It would be awesome if it were possible to have this as an option in the pipeline instead.
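For what a combined control summary could look like, here is a small sketch that pools cytosine calls per control sequence and reports a methylation rate for each. The input format (tuples of contig name, methylated count, unmethylated count), the contig names and the counts are assumptions made for illustration; they are not an existing pipeline output.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def methylation_by_contig(calls: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Pool (contig, methylated, unmethylated) counts into a per-contig methylation rate."""
    totals = defaultdict(lambda: [0, 0])
    for contig, meth, unmeth in calls:
        totals[contig][0] += meth
        totals[contig][1] += unmeth
    # Contigs without any calls are left out rather than reported as 0.
    return {c: m / (m + u) for c, (m, u) in totals.items() if m + u > 0}


if __name__ == "__main__":
    # Hypothetical per-position counts for two controls.
    example = [("lambda", 1, 199), ("lambda", 0, 200), ("pUC19", 95, 5)]
    for contig, rate in methylation_by_contig(example).items():
        print(f"{contig}: {rate:.2%} methylated")
```

With an unmethylated lambda control, a rate close to zero would indicate good conversion; as far as I know the pUC19 control in that kit is CpG-methylated, so there a high rate would be the expected result.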

I continued on this old and closed ticket since it has some really good comments above and the last comment mentions re-opening, but if you want a new feature request, I can create one.

/Sara

sarek928 avatar Oct 16 '23 09:10 sarek928

Hi Sara,

We also run it 3 times, but for lambda and pUC19 we use only the reads that Bismark could not align to human (with the --unmapped parameter on). These are about 15% of all the reads, so the alignment to the control sequences is much faster and more resource-efficient. We also tested the pipeline with all the input reads and got identical mapping and methylation results, which is expected as the 3 genomes are dissimilar and generate distinct reads.

But I agree that adding those into the pipeline would be beneficial as a subworkflow starting from the unmapped reads.

bounlu avatar Oct 16 '23 15:10 bounlu

Oh, that was a smart take on it, to only use the unaligned reads! I'll do that next time, great tip!

... I still want to run the pipeline just once though....

sarek928 avatar Oct 17 '23 08:10 sarek928