modkit
modkit copied to clipboard
modkit pileup bed output file has space delimited columns 10-18 and tab delimited 1-9
Seems there is a bug in the bed output from modkit pileup:
$ modkit --version
mod_kit 0.1.12
$ modkit pileup -n 4 --preset traditional -r ref.fasta methylated.ref.sorted.bam test.bed
$ head test.bed | awk -F'\t' '{ print $10; }'
38 0.00 0 38 0 0 1 2 2
36 0.00 0 36 0 6 1 0 4
37 0.00 0 37 0 0 4 4 2
29 3.45 1 28 0 0 12 3 3
39 2.56 1 38 0 0 1 4 3
40 0.00 0 40 0 1 2 0 4
41 0.00 0 41 0 1 1 0 4
40 0.00 0 40 0 0 3 0 4
43 0.00 0 43 0 0 4 0 1
44 0.00 0 44 0 0 1 0 3
All columns after 10 are space separated and should be tab delimited?
Hello @nextgenusfs
That is correct. As noted in a previous issue, the mixed delimiters are used so that the output is directly viewable with most commonly used genome viewers. In our testing, using all tabs resulted in a file that was not accepted by some. The --only-tabs flag will produce output that does not have the mixed delimiters. If you have a use case not supported by these options, I'd be happy to discuss it - enabling easy analysis is a primary goal of this project.
Sorry I missed the previous issue. Okay. So I started to look into this as I couldn't figure out how to get jbrowse2 to display the file. Notably, this doesn't conform then to the encode bed9+2 bedMethyl format does it? As it suggests that the percent/fraction of methylated should be in column 11.
just FYI opened this on jbrowse2: https://github.com/GMOD/jbrowse-components/issues/3842. So if you have a better way to do this I'm all ears! Thanks!
Hey @nextgenusfs,
Probably the easiest thing to do is run awk 'BEGIN{OFS="\t"} {NF=11; print} in.bed > out.bed' then give that output to jbrowse2. The default output from modkit should conform to the bedMethyl specification, with extra fields. We've added these since often folks want that data but they often aren't correctly interpreted by browsers. It would be great if some of the browsers would be a little more flexible to this format, I'd be more than happy to contribute to those discussions (thanks for linking). There is also the --bedgraph output option from modkit pileup if that helps.
@ArtRand which browsers are you thinking of when you say
the mixed delimiters are used so that the output is directly viewable with most commonly used genome viewers. In our testing, using all tabs resulted in a file that was not accepted by some.
@Ge0rges the default bedMethyl output should be acceptable input to IGV. I'm using 2.15.2 locally and it works fine.
I wonder if the default output shouldn't be reverted to a tab-only bedmethyl file, with the adjusted format for genome viewers as a flag instead. The motivation here is that TSVs are a much more standard format than a 2-delimiter type file, and are supported by default by other tools such as methylKit. This is not really a big issue as one can easily switch between formats, but I thought this question might deserve some new attention considering this and this issue were raised, and I'm unsure this workaround for the bedmethyl file will be supported in future versions of IGV (in fact, bedmethyl files do not work on the current IGV web app).
Just an opinion, but I think you will find more tools that expect BED files to be tab-delimited than whitespace delimited. I don't think UCSC ever defined this. I don't know what JBrowse requires, but Bedtools for example requires tabs https://bedtools.readthedocs.io/en/latest/content/overview.html#bed-starts-are-zero-based-and-bed-ends-are-one-based. @ArtRand I would be interested to know which genome viewers would not load tab-delimited BED files. The vast majority of BED files are in fact tab delimited.
Ahh, I was wrong, UCSC does provide some guidance here, note step 1.
Step 1. Format the data set: Format your data as a tab-separated file...
https://genome-preview.gi.ucsc.edu/goldenPath/help/customTrack.html
Hello @Ge0rges, @jrobinso ,
The motivation to use the mixed-delimiter output in modkit was to allow the output bedMethyl files to be directly loaded into browsers such as IGV. When you use the --only-tabs flag in modkit pileup you will get a tab-delimited BED, but IGV (testing 2.15.2) will raise Error parsing LineIteratorImpl(SynchronousLineReader) at the first record. Internally, we decided this is unacceptable, thus the mixed-delimiter hack. I appreciate that the schema for the bedMethyl is probably less conventional than typical BED data.
@ArtRand apologies for that, I am the IGV developer, that is just a bug. You can raise an issue at https://github.com/igvteam/igv or (for the web) at https://github.com/igvteam/igv.js when problems arise, I usually fix them pretty quickly.
I'll have a look at that for the next release, due out sometime in October.
WRT bed it is a somewhat constrained format in that client programs really have no way to know what extra columns are. The bigBed format addresses this with autosql. Nevertheless IGV should have read the ".bed" file, I will look into it. I also plan to add the ".bedmethyl" extension, however it will still expect tab delimited separators. It is probably interpreting all the columns including and after the rgb column as one, and choking when trying to convert it to a color. Anyway I'll look into it.
BTW bigbed is recommended for files this size anyway (my test file is 125 MB). If making a bigbed file use the autosql option to describe your columns.
@jrobinso,
Sounds good. My assumption was that by using spaces IGV was effectively ignoring the extra information. I'm very excited for the .bedmethyl extension, as you can imagine accessible interoperable visualization of these data will become very useful. Not trying to discredit any of the advancements in methylation visualization, of course.
Also, I can't reproduce an error with my test data in IGV desktop. I do have to rename the extension to ".bed", the ".bedmethyl" extension will be added to the next release. I tried with release 2.15.1 as well as the latest development branch. The extra columns (after 8) are not recognized with plain .bed, but otherwise it loads. If you are able to reproduce this issue please open a ticket with a test file. Thanks.
@ArtRand When parsing "bed" files IGV continues to read columns until it encounters something unexpected. For example a non-numeric string when expecting a number, etc. This isn't perfect but works most of the time. Its smarter about bigBed if the columns are defined by autosql.
@jrobinso try the file attached, generated by the released version of modkit using the --only-tabs flag.
duplex-tabs.bed.zip. It should correspond to GRCh38.
Of course I had to zip it because of github. I opened up a new issue as well if you want to move the thread over there, doesn't matter to me.
@ArtRand thanks, I see why this one failed and my other test file did not. The extra columns actually look o.k. in the working test file, they are being parsed as integers like regular bed files there. IGV is fooled into thinking these describe exons. In the working file because they all contain integer numbers. In your file there is a floating point number there. I'm fixing it now.
IGV desktop code is much older than the web version, and it tries to figure out the delimiter for a bed file. This has caused issues, especially with the "name" column which sometimes contains strings with non-encoded spaces because people reasonably assume bed files are tab delimited. And it makes for complex code trying to guess. The web igv just assume tab delimited. This has never arising as an issue until now. With the "bedmethyl" extension we can make special rules, although special rules can lead to unexpected problems.
@jrobinso,
Great, happy to help iterate. I'll start testing with the web version too.
The latest snapshot of IGV desktop supports the current and --only-tabs version of modkit output. It would be potentially an API-breaking change but modkit could make --only-tabs the default in a future version.
Yes, I was able to code IGV to support any type of delimiter for columns beyond the standard fields. This is possible because we added a specific type, i.e. it is not treated as generic "bed".
As of version 0.3.0 the bedmethyl output uses only tabs, I've tested on IGV 2.17.4 and it parses the data as expected.