Clearer `A` and `B` notation
In the output of dmr multi there are columns such as samplea_counts and sampleb_counts. Given that the command is given sample names it would be nice if the file itself contained pairings between A, B and the sample names. Currently this relies on the user ensuring they don't modify the sample name. Perhaps this is not very practical if we're following any standards though, and that would be understandable in which case perhaps a comment at the top of the file?
PS: What is the difference between the nucleotide count and Total number of base modification calls in the region, including unmodified, for sample A for the output column samplea total? Is it because some nucleotides don't pass the threshold to be confidently called as unmodified?
Hello @Ge0rges,
For human-readable information such as the sample names, the place to look is the log file. The first line will always have the command itself, so any data provenance questions can be answered (hopefully) by inspecting that. Granted, the log file is optional, so maybe there should be some guard rails around that. I've considered adding flags that would add headers to the BED output files (pileup and dmr), just to make parsing with tabular tools easier. So I'll think about adding the sample name as you suggest. On the other hand, adding the sample name changes the schema.
To your second question, I'm going to assume you mean nucleotide count is the number of bases with potential modifications (e.g. all Cs in a region). sample_a_total is the total number of calls in the region as reported by the bedMethyl. As you know, some calls are filtered out before making this table. Further, every base has some amount of coverage >= 1. So the maximum this number could be, assuming no filtering, is the nucleotide count (# C bases) multiplied by the coverage. Hope this helps,
A
Thank you for the prompt reply as usual. I would argue that the sample naming data should be within the file to make it more independent and less prone to errors when sharing, or coming back to the data after a while. I think log files are often thought of as temporary.
Hello @Ge0rges,
I tend to agree. I find myself relying on the "command" comments in VCF files and PG records in BAMs all the time. There is a lot going into the 0.2.3 release, so I may not get this feature in there - but I'll try. Thanks as always for the suggestion.