woltka
Reduced coverage file size
The coverage files are big, so I made some tweaks to reduce their size while keeping them human-readable.
Previously, the file looked like:
G1 1 120
G1 145 180
G1 195 235
G1 250 250
G1 270 360
G1 365 390
G1 395 420
...
Now it is like:
G1 1-120,45-80,95-235,50,70-360,5-90,5-420,...
In a small-scale test, this treatment reduced the output file size from 11744975 to 4022174 bytes (65.8% decrease). If we gzip the files (note: Woltka supports compressing output files using Gzip, Bzip2 and Xz), the file size is reduced from 3358089 to 1775954 bytes (47.1% decrease).
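As a quick check of the percentages above, here is a small Python snippet that recomputes them from the four byte counts. The helper name `percent_decrease` is just for illustration.

```python
def percent_decrease(before: int, after: int) -> float:
    """Return the size reduction from `before` to `after` as a percentage."""
    return (1 - after / before) * 100


# Byte counts quoted in the test above.
plain = percent_decrease(11744975, 4022174)
gzipped = percent_decrease(3358089, 1775954)

print(f"plain text: {plain:.1f}% smaller")
print(f"gzipped:    {gzipped:.1f}% smaller")
```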
If we adopt this design, the downstream analysis (Zebra filter?) also needs to be modified to take this format.
What do you think? @dhakim87 @ElDeveloper
@dhakim87 Could you please comment on this PR? I am curious about your thoughts. Thanks!
File size in this range makes no real difference to me. From what I remember, the sam files are so much larger than the range lists that I don't think the file size matters all too much. So, better to use the smaller format I suppose.
In: G1 1-120,45-80,95-235,50,70-360,5-90,5-420
I'm not sure if the ,50, is a real output or a typo. I don't think it's possible to achieve based on our inputs, but if such a thing were to occur, I'd prefer it were written 50-50 or 50-51 (inclusive or exclusive).
The ranges should also be sorted by start index prior to being output (and I would have thought, already intersected?)
At some point we should probably come up with a way to handle circular genomes, but I don't think zebra filter understands that concept yet.
@dhakim87 Thanks for sharing your opinion!
The coverage files are not comparable to the original SAM files in size. But I would imagine that people usually treat them as metadata, and may analyze them on a local system instead of a supercomputing cluster. Making them laptop-friendly may be a good thing. But I agree that this is not the main priority.
That 50 is not a typo. As you guessed, it means 50-50 (both ends inclusive). According to the current abbr. rule, this needs to be written as 50-, because all digits that are identical to the left number are ignored in the right number. This scenario may be achieved if a read occupies just a single base (in an extreme case).
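To make the abbreviation rule concrete, here is a minimal Python sketch of an encoder and a decoder. It is inferred from the example line in this thread, not copied from the Woltka code. The names `abbreviate`, `expand`, `encode_ranges` and `decode_ranges` are hypothetical, and two points are my assumptions: leading digits are dropped only when both numbers have the same number of digits, and each start is abbreviated against the end of the preceding range.

```python
def abbreviate(left: str, right: str) -> str:
    """Drop the leading digits of `right` that are identical to `left`.

    Assumption: digits are dropped only when both numbers have the same length.
    """
    if len(left) != len(right):
        return right
    i = 0
    while i < len(right) and left[i] == right[i]:
        i += 1
    return right[i:]


def expand(left: str, token: str) -> str:
    """Restore a number from its abbreviated token, given the number to its left."""
    # Coordinates are ascending, so a token with fewer digits than `left`
    # can only be a replacement for the trailing digits of `left`.
    if len(token) >= len(left):
        return token
    return left[: len(left) - len(token)] + token


def encode_ranges(ranges):
    """Encode sorted, non-overlapping (start, end) pairs, both ends inclusive."""
    tokens, prev = [], ""
    for start, end in ranges:
        s, e = str(start), str(end)
        tokens.append(f"{abbreviate(prev, s)}-{abbreviate(s, e)}")
        prev = e
    return ",".join(tokens)


def decode_ranges(text):
    """Decode the compact string back into (start, end) integer pairs."""
    ranges, prev = [], ""
    for token in text.split(","):
        s_tok, _, e_tok = token.partition("-")
        start = expand(prev, s_tok)
        end = expand(start, e_tok)  # an empty end token means end == start
        ranges.append((int(start), int(end)))
        prev = end
    return ranges
```

Under these assumptions, the seven ranges from the original example encode to `1-120,45-80,95-235,50-,70-360,5-90,5-420`, and the decoder accepts both `50` and `50-` for the single-base range, so either spelling round-trips to 250-250.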
The ranges are already sorted by coordinate. This was enforced by the range merging function. In the outcome, there are no overlapping ranges. Therefore the start1, end1, start2, end2,... numbers are always in ascending order.
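For readers who have not looked at the merging step, the property can be illustrated with a short Python sketch. This is a generic interval merge, not the actual Woltka function; `merge_ranges` is a hypothetical name, and I am assuming that both ends are inclusive and that only overlapping ranges are merged (ranges that merely touch, such as 1-120 and 121-130, are left separate here).

```python
def merge_ranges(ranges):
    """Merge overlapping (start, end) ranges, both ends inclusive.

    Returns ranges sorted by start with no overlaps, so the flattened
    sequence start1, end1, start2, end2, ... is in ascending order.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the last kept range: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


reads = [(145, 180), (1, 100), (90, 120), (250, 250)]
merged = merge_ranges(reads)
flat = [x for r in merged for x in r]
print(merged)
print(flat == sorted(flat))
```

Because the output is sorted and free of overlaps, every number is at least as large as the one before it, which is what lets the abbreviated format drop shared leading digits safely.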
I think that handling circular genomes is a good idea.