trimming confusion

Open JWDebler opened this issue 1 year ago • 15 comments

Hi all,

I'm a little confused about what gets trimmed when and what doesn't.

As far as I understand dorado basecaller trims adapter and barcode if demultiplexing is turned on.

My workflow works like this:

  1. simplex calling with demultiplexing
  2. extracting barcoded reads from bam file and using the ids to demultiplex the raw pod5s into barcode specific pod5s.
  3. Then I duplex call the individual barcode pod5s and run dorado trim on the resultant bam file.
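
For step 2, a minimal sketch of grouping read ids by barcode, assuming the BAM has been dumped to SAM text (for example with `samtools view -h`) and that the barcode sits in the `BC:Z:` tag. The function name, the `unclassified` fallback label, and the barcode values in the usage note are illustrative assumptions, not dorado output.

```python
from collections import defaultdict


def read_ids_by_barcode(sam_lines):
    """Map each BC:Z: tag value to the list of read ids that carry it."""
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        read_id, tags = fields[0], fields[11:]  # optional tags start at column 12
        barcode = next(
            (tag[5:] for tag in tags if tag.startswith("BC:Z:")),
            "unclassified",  # hypothetical label for reads without a BC tag
        )
        groups[barcode].append(read_id)
    return dict(groups)
```

Each resulting id list could then be written to a text file and used to subset the raw pod5 files per barcode.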

My question now is, does dorado trim only trim the adapter, or will it also trim the barcodes?

Cheers, Johannes

JWDebler avatar May 14 '24 08:05 JWDebler

Hi @JWDebler,

The trim command has no concept of barcodes - it will only trim the adapters (and primers, unless --no-trim-primers is specified).

malton-ont avatar May 14 '24 13:05 malton-ont

Any chance that can be added? Or maybe an option that does something like 'if Adapter trimmed, also trim the next X bases'. Cheers

JWDebler avatar May 14 '24 13:05 JWDebler

Alternatively, I could just crop 60 bp (NA top + barcode) from each end of an untrimmed read with something like chopper before assembly and skip dorado trim I suppose.
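
A minimal sketch of that fixed-length crop, assuming a read is held as a sequence string plus a quality string of equal length. The function name and the choice to drop reads that are too short are assumptions for illustration; this is not how chopper or dorado implement cropping.

```python
def crop_ends(seq, qual, n=60):
    """Remove n bases from both ends of a read, keeping seq and qual aligned.

    Returns None when the read is too short to leave any bases behind.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if len(seq) <= 2 * n:
        return None  # nothing would be left after cropping both ends
    return seq[n:-n], qual[n:-n]
```

For example, `crop_ends(seq, qual, n=60)` applies the 60 bp heuristic to one read. Only the sequence and quality strings are touched, so any per-base tags on the record would no longer line up.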

JWDebler avatar May 15 '24 00:05 JWDebler

Hi @JWDebler - that heuristic might work reasonably well (I'd maybe go up to 75).

Alternatively, you can adjust your pipeline to be -

  1. run dorado basecaller w/ demux and trimming enabled --> this will call all simplex reads with adapters/barcodes trimmed
  2. extract the per barcode read ids, and then run duplex for each set
  3. from the duplex output, simply keep the dx:1 reads and merge them with output from step 1.

This will keep all simplex reads plus the duplex reads. You can also extract the read ids for dx:0 and filter those from the output of step 1, then merge them with the output of step 3. It's a bit more effort but will handle all trimming, etc. correctly. You don't need to run trim on duplex reads because, by virtue of how duplex overlapping is determined, all barcodes/adapters will get trimmed anyway.
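
A minimal sketch of the merge described above, assuming both BAM files have been dumped to SAM text and that the duplex output carries the `dx:i:` tag. It reads "filter those from output of 1" as keeping only the trimmed simplex records whose ids are tagged dx:0 in the duplex output; that reading and the function names are assumptions.

```python
def dx_value(sam_line):
    """Return the integer value of the dx:i: tag, or None if it is absent."""
    for tag in sam_line.rstrip("\n").split("\t")[11:]:
        if tag.startswith("dx:i:"):
            return int(tag[5:])
    return None


def merge_simplex_duplex(trimmed_simplex, duplex_output):
    """Combine trimmed simplex records (step 1) with dx:1 duplex records."""
    duplex_reads = []
    simplex_only_ids = set()
    for line in duplex_output:
        if line.startswith("@"):  # skip SAM header lines
            continue
        dx = dx_value(line)
        if dx == 1:
            duplex_reads.append(line)  # keep the duplex read itself
        elif dx == 0:
            simplex_only_ids.add(line.split("\t", 1)[0])  # remember the read id
    kept_simplex = [
        line
        for line in trimmed_simplex
        if not line.startswith("@") and line.split("\t", 1)[0] in simplex_only_ids
    ]
    return kept_simplex + duplex_reads
```

To keep every simplex read instead (the first variant), skip the id filter and concatenate all step 1 records with the dx:1 records.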

tijyojwad avatar May 15 '24 01:05 tijyojwad

Hmm, good idea. I'm gonna give that a go. I keep my simplex and duplex fastq files separate anyways so I can extract them from separate bams. Any progress on integrating all that into 'dorado duplex'? 😊

JWDebler avatar May 15 '24 01:05 JWDebler

Barcoding and trimming in duplex is still planned, but it has been a bit lower priority compared to some other stuff in the pipeline. So it won't make it into the upcoming release, but I'll raise priority on this for the one after that.

tijyojwad avatar May 15 '24 01:05 tijyojwad

I used your suggestion above, extracting the trimmed simplex reads from the initial bam. Thanks, this works fine. There is still the odd barcode in there, but overall it looks much better. However, I just had a closer look at my duplex reads, and even though I keep hearing that duplex reads should be free of adapters and barcodes due to the way they are generated, I still have lots of adapters and barcodes left on mine.

JWDebler avatar May 16 '24 01:05 JWDebler

I still have lots of adapters and barcodes left on mine

what barcode kit are you using?

tijyojwad avatar May 16 '24 16:05 tijyojwad

SQK-NBD114-24

JWDebler avatar May 17 '24 00:05 JWDebler

Thanks! Going through the actual structure -

For the NBD barcode, a typical read would look like this for the template strand

5' - ADAPTER1 -  FRONT_FLANK1 - BC - REAR_FLANK1 - DNA - RC(REAR_FLANK2) - RC(BC) - RC(FRONT_FLANK2) - RC(ADAPTER2) - 3'

and the complement strand

5' - ADAPTER2 - FRONT_FLANK2 - BC - REAR_FLANK2 - RC(DNA) - RC(REAR_FLANK1) - RC(BC) - RC(FRONT_FLANK1) - RC(ADAPTER1) - 3'

In this case, when we determine the duplex pair overlaps, the reverse complement of the complement strand would align with the template strand along its full length. So here I would actually expect barcodes/adapters to be retained (at least on one end). So theoretically running demux/trim on the duplex output should also work!

However, if it was a different kit like RBK, then in the duplex pair overlap both adapter and barcode will be trimmed.

template => 5' - ADAPTER - FRONT_FLANK - BC - REAR_FLANK - DNA - 3'
complement => 5' - FRONT_FLANK - BC - REAR_FLANK - RC(DNA) - RC(ADAPTER) - 3'

So I apologize for the confusion earlier - whether or not barcodes/adapters get trimmed is kit dependent.
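
The kit dependence can be illustrated with a toy model that treats each read as a list of segment labels taken from the layouts above. It reverse-complements the complement strand and extends outward from the DNA insert for as long as both strands agree. This is a simplified sketch for intuition only; it is not dorado's overlap algorithm, and the function names are made up.

```python
def rc_label(label):
    """Toggle the RC(...) wrapper on one segment label."""
    if label.startswith("RC(") and label.endswith(")"):
        return label[3:-1]
    return f"RC({label})"


def reverse_complement(layout):
    """Reverse the segment order and flip the orientation of every segment."""
    return [rc_label(label) for label in reversed(layout)]


def shared_region(template, complement, anchor="DNA"):
    """Segments around the anchor shared by template and RC(complement)."""
    other = reverse_complement(complement)
    lo_t = hi_t = template.index(anchor)
    lo_o = hi_o = other.index(anchor)
    # Extend to the left while the neighbouring segments match.
    while lo_t > 0 and lo_o > 0 and template[lo_t - 1] == other[lo_o - 1]:
        lo_t, lo_o = lo_t - 1, lo_o - 1
    # Extend to the right while the neighbouring segments match.
    while (hi_t + 1 < len(template) and hi_o + 1 < len(other)
           and template[hi_t + 1] == other[hi_o + 1]):
        hi_t, hi_o = hi_t + 1, hi_o + 1
    return template[lo_t:hi_t + 1]


NBD_TEMPLATE = ["ADAPTER1", "FRONT_FLANK1", "BC", "REAR_FLANK1", "DNA",
                "RC(REAR_FLANK2)", "RC(BC)", "RC(FRONT_FLANK2)", "RC(ADAPTER2)"]
NBD_COMPLEMENT = ["ADAPTER2", "FRONT_FLANK2", "BC", "REAR_FLANK2", "RC(DNA)",
                  "RC(REAR_FLANK1)", "RC(BC)", "RC(FRONT_FLANK1)", "RC(ADAPTER1)"]
RBK_TEMPLATE = ["ADAPTER", "FRONT_FLANK", "BC", "REAR_FLANK", "DNA"]
RBK_COMPLEMENT = ["FRONT_FLANK", "BC", "REAR_FLANK", "RC(DNA)", "RC(ADAPTER)"]

print(shared_region(NBD_TEMPLATE, NBD_COMPLEMENT))  # every NBD segment is shared
print(shared_region(RBK_TEMPLATE, RBK_COMPLEMENT))  # ['DNA']
```

In this model the NBD pair shares the whole read, adapters and barcodes included, while the RBK pair shares only the DNA insert.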

Would you be open to sharing a few reads from your duplexed barcoded dataset?

tijyojwad avatar May 17 '24 00:05 tijyojwad

Barcoding and trimming in duplex is still planned, but it has been a bit lower priority compared to some other stuff in the pipeline. So it won't make it into the upcoming release, but I'll raise priority on this for the one after that.

@tijyojwad Great to hear this is planned! I really need this as well. An alternative would be to simply add a feature to dorado trim that trims a specific number of bases from the start or end. The problem for me is that other tools only remove the bases from the sequence and quality string, but do not keep the methylation information in sync.
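
To show what "in sync" involves, here is a minimal sketch of trimming a fixed number of bases while re-encoding modification calls. It assumes the calls are stored SAM-style as one `MM` entry (skip counts over occurrences of a single canonical base on the forward strand, e.g. `C+m,0,1;`) with one `ML` probability per call. The function name is hypothetical, and multi-entry `MM` strings or reverse-strand entries are not handled.

```python
def trim_with_mods(seq, qual, mm, ml, start, end):
    """Trim `start`/`end` bases and rewrite one MM/ML entry to match."""
    stop = len(seq) - end
    header, *deltas = mm.rstrip(";").split(",")  # e.g. "C+m" and ["0", "1"]
    base = header[0]

    # Decode skip counts into absolute indices among occurrences of `base`.
    positions, nth = [], -1
    for delta in deltas:
        nth += int(delta) + 1
        positions.append(nth)

    skipped = seq[:start].count(base)   # occurrences lost at the 5' end
    kept = seq[start:stop].count(base)  # occurrences that survive the trim

    # Re-encode the surviving calls relative to the trimmed sequence.
    new_deltas, new_ml, prev = [], [], -1
    for pos, prob in zip(positions, ml):
        rel = pos - skipped
        if 0 <= rel < kept:
            new_deltas.append(str(rel - prev - 1))
            new_ml.append(prob)
            prev = rel

    new_mm = ",".join([header] + new_deltas) + ";"
    return seq[start:stop], qual[start:stop], new_mm, new_ml
```

A tool that slices only the sequence and quality strings leaves the skip counts pointing at the wrong bases, which is the mismatch described above.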

simondrue avatar May 24 '24 06:05 simondrue

Hi @simondrue - thanks for the feedback! We're working on this now to get it out by the next release.

tijyojwad avatar May 25 '24 01:05 tijyojwad

Is trimming barcodes only possible during basecalling? We have data that is already basecalled but untrimmed (which took a month!). Barcode trimming in dorado trim would be greatly appreciated.

jonkristoffersen avatar Jun 20 '24 13:06 jonkristoffersen

Hi @jonkristoffersen

dorado demux will trim barcodes if it is classifying, but not if using --no-classify. It is possible to re-barcode untrimmed barcoded basecall data in order to apply trimming - just ensure that you use v0.7.1 or later so that the BC tag is updated rather than a second one being created.

malton-ont avatar Jun 20 '24 13:06 malton-ont

Hi @jonkristoffersen

dorado demux will trim barcodes if it is classifying, but not if using --no-classify. It is possible to re-barcode untrimmed barcoded basecall data in order to apply trimming - just ensure that you use v0.7.1 or later so that the BC tag is updated rather than a second one being created.

Thanks, that worked!

jonkristoffersen avatar Jun 21 '24 09:06 jonkristoffersen