Controlling delimiter in -nms option of mergeBed
I use mergeBed frequently with the -nms feature to list multiple transcripts that BED intervals belong to, e.g.:
chr1 3195981 3197398 ENSMUST00000162897;ENSMUST00000159265 -
chr1 3203519 3207049 ENSMUST00000159265;ENSMUST00000162897;ENSMUST00000070533 -
This lists the transcripts in ;-separated list. When I call tagBam against a set of BED files that have these semicolon separated lists, you can get results like:
merged_exons:chr10:3426454-3428987,ENSMUST00000078070;ENSMUST00000144622;ENSMUST00000145156;ENSMUST00000165510;ENSMUST00000154998;ENSMUST00000086896;ENSMUST00000105617;ENSMUST00000149708;ENSMUST00000170680,+,;introns:chr10:3323400-3434581,ENSMUST00000086896,1,+,introns:chr10:3366295-3445776,ENSMUST00000165510,1,+,introns:chr10:3374343-3438478,ENSMUST00000129954,1,-
Since tagBam itself uses semicolons to delimit the intervals that a particular BAM read maps to, this makes it a pain to parse, since semicolons delineate the fields of the tagBam optional entry in the BAM, but have additional meaning inside the fields as delimiters of the BED intervals that the reads were mapped against.
Is it possible to add an option to mergeBed that would specify the delimiter? E.g. ; instead of ,? That would be useful and not require post-processing of each BED produced by mergeBed to manually change the delimiter.
A perhaps more general alternative: it would be probably better to allow the user to specify the delimiter for tagBam. That would also solve the problem.
Thanks.
Good suggestion, I'll look into it.
Thank you, Aaron!
I just pushed a commit (65e377d29c) to enable this through the use of a -delim option. The default delimiter for merge is now a "," instead of a ";". This is in the interest of consistency.
http://bedtools.readthedocs.org/en/latest/content/tools/merge.html#delim-change-the-delimiter-for-nms-and-scores-collapse
Let me know if this works for you.
Thanks Aaron, this is great. I did run into another twist with this though, that the -delim option does not address.
I noticed that mergeBed does not incorporate the score field in its output by default, without the -scores option. For example:
$ head exons/ensGene.exons.bed
chr1 3044313 3044814 ENSMUST00000160944 1 +
chr1 3092096 3092206 ENSMUST00000082908 1 +
chr1 3195981 3197398 ENSMUST00000162897 1 -
Loses the score strand when calling mergeBed:
$ sortBed -i exons/ensGene.exons.bed | mergeBed -i stdin -nms -s | head
chr1 3044313 3044814 ENSMUST00000160944 +
chr1 3092096 3092206 ENSMUST00000082908 +
chr1 3456667 3456768 ENSMUST00000161581 +
chr1 3503485 3503634 ENSMUST00000161581 +
When one of the input BED files to tagBam is score-less, it appears to me that the output uses inconsistent orderings of the strand and score fields. One of my tagBam reads has the following YK field:
merged_exons:chr1:132588831-132595320,ENSMUST00000050406,ENSMUST00000169659,ENSMUST00000066863,ENSMUST00000164463,ENSMUST00000164084,ENSMUST00000169659,ENSMUST00000066863,ENSMUST00000050406,ENSMUST00000171479,-,,merged_exons:chr1:132595764-132595827,ENSMUST00000066863,ENSMUST00000050406,ENSMUST00000171479,ENSMUST00000164463,-,;introns:chr1:132587726-132604084,ENSMUST00000171479,1,-,introns:chr1:132595321-132598280,ENSMUST00000050406,1,-;cds_only.merged_exons:chr1:132595066-132595320,ENSMUST00000164463,ENSMUST00000171479,ENSMUST00000050406,ENSMUST00000169659,ENSMUST00000066863,ENSMUST00000164084,-,,cds_only.merged_exons:chr1:132595764-132595827,ENSMUST00000050406,ENSMUST00000164463,ENSMUST00000066863,ENSMUST00000171479,-,
The merged_exons regions are sometimes separated by ,,. It turns out this empty field comes from the fact that the input BED merged_exons does not have the score field, and so the output order is:
nms,strand,[empty]
But for an input bed file like introns, which does have values for the score field, the order is:
nms,score,strand
One potential solution is to use -scores when calling mergeBed, but just wanted to point out the ordering issue. As is there's no consistent delimiter one can split on to determine which regions of each input BED does the read map to. Thanks again.