modkit modkit pileup bed output file has space delimited columns 10-18 and tab delimited 1-9

Seems there is a bug in the bed output from modkit pileup:

$ modkit --version
mod_kit 0.1.12

$ modkit pileup -n 4 --preset traditional  -r ref.fasta methylated.ref.sorted.bam test.bed

$ head test.bed | awk -F'\t' '{ print $10; }'
38 0.00 0 38 0 0 1 2 2
36 0.00 0 36 0 6 1 0 4
37 0.00 0 37 0 0 4 4 2
29 3.45 1 28 0 0 12 3 3
39 2.56 1 38 0 0 1 4 3
40 0.00 0 40 0 1 2 0 4
41 0.00 0 41 0 1 1 0 4
40 0.00 0 40 0 0 3 0 4
43 0.00 0 43 0 0 4 0 1
44 0.00 0 44 0 0 1 0 3

All columns after 10 are space separated and should be tab delimited?

Aug 10 '23 20:08 nextgenusfs

Hello @nextgenusfs

That is correct. As noted in a previous issue, the mixed delimiters are used so that the output is directly viewable with most commonly used genome viewers. In our testing, using all tabs resulted in a file that was not accepted by some. The --only-tabs flag will produce output that does not have the mixed delimiters. If you have a use case not supported by these options, I'd be happy to discuss it - enabling easy analysis is a primary goal of this project.

Aug 10 '23 22:08 ArtRand

Sorry I missed the previous issue. Okay. So I started to look into this as I couldn't figure out how to get jbrowse2 to display the file. Notably, this doesn't conform then to the encode bed9+2 bedMethyl format does it? As it suggests that the percent/fraction of methylated should be in column 11.

Aug 10 '23 23:08 nextgenusfs

just FYI opened this on jbrowse2: https://github.com/GMOD/jbrowse-components/issues/3842. So if you have a better way to do this I'm all ears! Thanks!

Aug 10 '23 23:08 nextgenusfs

Hey @nextgenusfs,

Probably the easiest thing to do is run awk 'BEGIN{OFS="\t"} {NF=11; print} in.bed > out.bed' then give that output to jbrowse2. The default output from modkit should conform to the bedMethyl specification, with extra fields. We've added these since often folks want that data but they often aren't correctly interpreted by browsers. It would be great if some of the browsers would be a little more flexible to this format, I'd be more than happy to contribute to those discussions (thanks for linking). There is also the --bedgraph output option from modkit pileup if that helps.

Aug 10 '23 23:08 ArtRand

@ArtRand which browsers are you thinking of when you say

the mixed delimiters are used so that the output is directly viewable with most commonly used genome viewers. In our testing, using all tabs resulted in a file that was not accepted by some.

Sep 24 '23 18:09 Ge0rges

@Ge0rges the default bedMethyl output should be acceptable input to IGV. I'm using 2.15.2 locally and it works fine.

Sep 24 '23 20:09 ArtRand

I wonder if the default output shouldn't be reverted to a tab-only bedmethyl file, with the adjusted format for genome viewers as a flag instead. The motivation here is that TSVs are a much more standard format than a 2-delimiter type file, and are supported by default by other tools such as methylKit. This is not really a big issue as one can easily switch between formats, but I thought this question might deserve some new attention considering this and this issue were raised, and I'm unsure this workaround for the bedmethyl file will be supported in future versions of IGV (in fact, bedmethyl files do not work on the current IGV web app).

Sep 25 '23 03:09 Ge0rges

Just an opinion, but I think you will find more tools that expect BED files to be tab-delimited than whitespace delimited. I don't think UCSC ever defined this. I don't know what JBrowse requires, but Bedtools for example requires tabs https://bedtools.readthedocs.io/en/latest/content/overview.html#bed-starts-are-zero-based-and-bed-ends-are-one-based. @ArtRand I would be interested to know which genome viewers would not load tab-delimited BED files. The vast majority of BED files are in fact tab delimited.

Sep 25 '23 05:09 jrobinso

Ahh, I was wrong, UCSC does provide some guidance here, note step 1.

Step 1. Format the data set: Format your data as a tab-separated file...

https://genome-preview.gi.ucsc.edu/goldenPath/help/customTrack.html

Sep 25 '23 06:09 jrobinso

Hello @Ge0rges, @jrobinso ,

The motivation to use the mixed-delimiter output in modkit was to allow the output bedMethyl files to be directly loaded into browsers such as IGV. When you use the --only-tabs flag in modkit pileup you will get a tab-delimited BED, but IGV (testing 2.15.2) will raise Error parsing LineIteratorImpl(SynchronousLineReader) at the first record. Internally, we decided this is unacceptable, thus the mixed-delimiter hack. I appreciate that the schema for the bedMethyl is probably less conventional than typical BED data.

Sep 25 '23 21:09 ArtRand

@ArtRand apologies for that, I am the IGV developer, that is just a bug. You can raise an issue at https://github.com/igvteam/igv or (for the web) at https://github.com/igvteam/igv.js when problems arise, I usually fix them pretty quickly.

I'll have a look at that for the next release, due out sometime in October.

WRT bed it is a somewhat constrained format in that client programs really have no way to know what extra columns are. The bigBed format addresses this with autosql. Nevertheless IGV should have read the ".bed" file, I will look into it. I also plan to add the ".bedmethyl" extension, however it will still expect tab delimited separators. It is probably interpreting all the columns including and after the rgb column as one, and choking when trying to convert it to a color. Anyway I'll look into it.

Sep 25 '23 21:09 jrobinso

BTW bigbed is recommended for files this size anyway (my test file is 125 MB). If making a bigbed file use the autosql option to describe your columns.

Sep 25 '23 21:09 jrobinso

@jrobinso,

Sounds good. My assumption was that by using spaces IGV was effectively ignoring the extra information. I'm very excited for the .bedmethyl extension, as you can imagine accessible interoperable visualization of these data will become very useful. Not trying to discredit any of the advancements in methylation visualization, of course.

Sep 25 '23 21:09 ArtRand

Also, I can't reproduce an error with my test data in IGV desktop. I do have to rename the extension to ".bed", the ".bedmethyl" extension will be added to the next release. I tried with release 2.15.1 as well as the latest development branch. The extra columns (after 8) are not recognized with plain .bed, but otherwise it loads. If you are able to reproduce this issue please open a ticket with a test file. Thanks.

Sep 25 '23 21:09 jrobinso

@ArtRand When parsing "bed" files IGV continues to read columns until it encounters something unexpected. For example a non-numeric string when expecting a number, etc. This isn't perfect but works most of the time. Its smarter about bigBed if the columns are defined by autosql.

Sep 25 '23 21:09 jrobinso

@jrobinso try the file attached, generated by the released version of modkit using the --only-tabs flag. duplex-tabs.bed.zip. It should correspond to GRCh38.

Of course I had to zip it because of github. I opened up a new issue as well if you want to move the thread over there, doesn't matter to me.

Sep 25 '23 21:09 ArtRand

@ArtRand thanks, I see why this one failed and my other test file did not. The extra columns actually look o.k. in the working test file, they are being parsed as integers like regular bed files there. IGV is fooled into thinking these describe exons. In the working file because they all contain integer numbers. In your file there is a floating point number there. I'm fixing it now.

IGV desktop code is much older than the web version, and it tries to figure out the delimiter for a bed file. This has caused issues, especially with the "name" column which sometimes contains strings with non-encoded spaces because people reasonably assume bed files are tab delimited. And it makes for complex code trying to guess. The web igv just assume tab delimited. This has never arising as an issue until now. With the "bedmethyl" extension we can make special rules, although special rules can lead to unexpected problems.

Sep 25 '23 22:09 jrobinso

@jrobinso,

Great, happy to help iterate. I'll start testing with the web version too.

Sep 25 '23 22:09 ArtRand

The latest snapshot of IGV desktop supports the current and --only-tabs version of modkit output. It would be potentially an API-breaking change but modkit could make --only-tabs the default in a future version.

Oct 12 '23 17:10 ArtRand

Yes, I was able to code IGV to support any type of delimiter for columns beyond the standard fields. This is possible because we added a specific type, i.e. it is not treated as generic "bed".

Oct 14 '23 02:10 jrobinso

As of version 0.3.0 the bedmethyl output uses only tabs, I've tested on IGV 2.17.4 and it parses the data as expected.

Oct 10 '24 21:10 ArtRand

modkit modkit copied to clipboard

modkit pileup bed output file has space delimited columns 10-18 and tab delimited 1-9

modkit
modkit copied to clipboard