samtools ```split``` on a select list of tags

Open cathalgking opened this issue 1 year ago • 1 comments

Is your feature request related to a problem? Please specify.

I would like to split my BAM file based on a specific flag in each entry. The input is a BAM file and a csv file containing each tag (one per line). Does samtools split -d do this for each entry in the BAM file? Or can I specify that reads should be grouped based on a common tag?

Describe the solution you would like.

Be able to feed in a select list of tags (one per line) to samtools split.

cathalgking avatar May 22 '24 06:05 cathalgking

I'm not quite sure what you want here? Is it just to split on a specific set of values, and put everything else in the unrecognised file? The -d option doesn't quite do that, as it makes a split for every value that it finds. If you only want a subset, one way to do it at the moment would be to make a script that adds an extra tag with just the values you want, split on that, and then maybe strip it out again at the end. It might not be the quickest solution, but it would get the job done...
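As a rough illustration of that workaround, the tagging step could be sketched as a small awk filter over SAM text. This is only a sketch under assumptions: the helper name tag_selected and the extra tag name XB are made up for the example (X? tags are reserved for local use), the list file is assumed to hold one tag value per line, and the tag being matched is assumed to be a string-typed CB tag.

```shell
#!/bin/sh
# tag_selected LIST
# Reads SAM text on stdin. For every alignment whose CB:Z value appears in
# LIST (one value per line), appends a hypothetical extra tag XB:Z:<value>.
# Header lines and all other alignments pass through unchanged.
tag_selected() {
    awk -v list="$1" '
        BEGIN {
            FS = OFS = "\t"
            # Load the wanted values into a lookup table.
            while ((getline line < list) > 0)
                if (line != "") keep[line] = 1
        }
        /^@/ { print; next }    # SAM header lines start with "@"
        {
            # Optional fields start at column 12.
            for (i = 12; i <= NF; i++) {
                if (substr($i, 1, 5) == "CB:Z:" && (substr($i, 6) in keep)) {
                    $0 = $0 OFS "XB:Z:" substr($i, 6)
                    break
                }
            }
            print
        }'
}

# Possible usage (untested, file names are placeholders):
#   samtools view -h in.bam | tag_selected values.txt |
#       samtools split -d XB --output-fmt bam -f 'out_%!.bam' -
# The XB tag could then be removed from each output with: samtools view -x XB
```

Going through SAM text like this is simple but not fast, which matches the caveat above about it not being the quickest solution.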

daviesrob avatar May 23 '24 11:05 daviesrob

Thanks for your reply @daviesrob. I will just reply here with exactly what I want to do and maybe you can suggest the best option. I have 1 BAM file which contains a 'CB' flag for each read. This CB flag contains a tag which is a 16 nt string followed by a '-1' as shown in the screenshot below. Multiple reads should contain the same tag. I would like to split my main BAM file into multiple BAM files based on this CB flag, using a list of tags read in from a txt file. I should end up with at most 5,000 BAM files, each containing the reads that share one CB tag. I tried this before with samtools view but it was much too slow, as shown in the example below, so I started looking into samtools split.

Can you suggest the best way to handle this task? Thanks!

BAM entry containing CB flag: Screenshot 2024-06-17 at 11 57 30 AM

Tried with samtools view

for i in $(cat PATH/spbars.txt) ; do samtools view -b -d CB:$i /PATH/genome_bam.bam > $i.bam ; done

Example of spbars.txt file:

Screenshot 2024-06-17 at 12 11 31 PM

cathalgking avatar Jun 17 '24 02:06 cathalgking

I don't think split or view will do what you want on their own, but combining the two might work. You could try:

samtools view -u -D CB:spbars.txt /PATH/genome_bam.bam | samtools split -d CB -M 6000 --output-fmt bam -f 'out_%!.bam' -

The samtools view -D selects the alignments with the tags you want, then split separates them into individual files. Note that split limits the number of output files it opens to prevent accidents where it opens a huge number of them. You want to output quite a lot, so you'll need to ensure the -M option is set high enough to stop it from hitting the limit. You may also need to adjust the shell limit on the number of open files (ulimit -n) to ensure split can open all of its outputs at the same time.
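To check the two limits before starting a long run, something like the following could be used. It is a minimal sketch: the function names and the default headroom of 16 are assumptions chosen for illustration, not anything samtools defines, and the list file is assumed to hold one tag value per line.

```shell
#!/bin/sh
# Count the distinct, non-empty values in a list file (one per line).
count_barcodes() {
    grep -v '^[[:space:]]*$' "$1" | sort -u | wc -l | tr -d ' '
}

# Suggest a value for split's -M option: the number of distinct values plus
# some headroom (second argument, default 16, an arbitrary safety margin).
max_split_for() {
    n=$(count_barcodes "$1")
    echo $((n + ${2:-16}))
}

# Succeed if the shell's open-file limit allows at least NEED descriptors.
check_fd_limit() {
    limit=$(ulimit -n)
    [ "$limit" = "unlimited" ] || [ "$limit" -ge "$1" ]
}

# Possible usage (untested):
#   M=$(max_split_for spbars.txt)
#   check_fd_limit "$M" || echo "raise the limit first, e.g. ulimit -n $((M + 64))" >&2
#   samtools view -u -D CB:spbars.txt /PATH/genome_bam.bam |
#       samtools split -d CB -M "$M" --output-fmt bam -f 'out_%!.bam' -
```

The process also needs a few descriptors for its input, standard streams and so on, so the open-file limit should sit somewhat above the -M value rather than exactly at it.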

daviesrob avatar Jun 18 '24 10:06 daviesrob