Reduced coverage file size

Open qiyunzhu opened this issue 3 years ago • 3 comments

The coverage files are big, so I made some tweaks to reduce their size while keeping them human-readable.

Previously, the file looked like this:

G1	1	120
G1	145	180
G1	195	235
G1	250	250
G1	270	360
G1	365	390
G1	395	420
...

Now it looks like this:

G1	1-120,45-80,95-235,50,70-360,5-90,5-420,...

In a small-scale test, this treatment reduced the output file size from 11744975 to 4022174 (a 65.8% decrease). If we gzip the files (note: Woltka supports compressing output files using Gzip, Bzip2 and Xz), the file size is reduced from 3358089 to 1775954 (a 47.1% decrease).
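As a quick check of the figures above, the percentages can be recomputed directly from the four sizes quoted. The helper name `percent_decrease` is only illustrative.

```python
def percent_decrease(before, after):
    """Return the relative size reduction as a percentage."""
    return (1 - after / before) * 100


# Sizes quoted in the paragraph above.
plain = percent_decrease(11744975, 4022174)    # uncompressed output
gzipped = percent_decrease(3358089, 1775954)   # gzip-compressed output

print(f"plain: {plain:.1f}%, gzipped: {gzipped:.1f}%")
```

The saving shrinks after compression, which is expected: gzip already removes much of the redundancy that the abbreviation targets.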

If we adopt this design, the downstream analysis (Zebra filter?) also needs to be modified to take this format.

What do you think? @dhakim87 @ElDeveloper

qiyunzhu, Sep 05 '21 19:09

@dhakim87 Could you please comment on this PR? I am curious about your thoughts. Thanks!

qiyunzhu, Sep 17 '21 14:09

File size in this range makes no real difference to me. From what I remember, the SAM files are so much larger than the range lists that I don't think the file size matters much. So, better to use the smaller format, I suppose.

In: G1 1-120,45-80,95-235,50,70-360,5-90,5-420

I'm not sure if the ,50, is a real output or a typo. I don't think it's possible to achieve based on our inputs, but if such a thing were to occur, I'd prefer it were written 50-50 or 50-51 (inclusive or exclusive).

The ranges should also be sorted by start index prior to being output (and I would have thought, already intersected?)

At some point we should probably come up with a way to handle circular genomes, but I don't think zebra filter understands that concept yet.

dhakim87, Sep 17 '21 19:09

@dhakim87 Thanks for sharing your opinion!

The coverage files are not comparable to the original SAM files in size. But I would imagine that people usually treat them as metadata, and may analyze them on a local system instead of a computing cluster. Making them laptop-friendly may be a good thing. But I agree that this is not the main priority.

That 50 is not a typo. As you guessed, it means 50-50 (both ends inclusive). According to the current abbreviation rule, this needs to be written as 50-, because all digits of the right number that are identical to those of the left number are omitted. This scenario may occur if a read occupies just a single base (an extreme case).
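A minimal sketch of how this abbreviation rule could be implemented. It assumes, as the example in this thread suggests, that only the leading digits shared with the preceding number are dropped, and only when both numbers have the same number of digits. The function names are hypothetical and are not Woltka's actual API.

```python
def abbreviate(prev, curr):
    """Drop the leading digits of curr that it shares with prev.

    Assumption: digits are only dropped when both numbers have the same width.
    """
    p, c = str(prev), str(curr)
    if len(p) != len(c):
        return c
    i = 0
    while i < len(c) and p[i] == c[i]:
        i += 1
    return c[i:]  # empty string when curr == prev


def expand(prev, token):
    """Inverse of abbreviate: borrow the missing leading digits from prev."""
    p = str(prev)
    if len(token) >= len(p):
        return int(token)  # token is a full number
    return int(p[:len(p) - len(token)] + token)


def encode_ranges(ranges):
    """Encode sorted, non-overlapping, inclusive (start, end) pairs."""
    parts, prev = [], None
    for start, end in ranges:
        s = str(start) if prev is None else abbreviate(prev, start)
        parts.append(f"{s}-{abbreviate(start, end)}")
        prev = end
    return ",".join(parts)


def decode_ranges(text):
    """Decode a comma-separated range string back into (start, end) pairs."""
    ranges, prev = [], None
    for part in text.split(","):
        s_tok, _, e_tok = part.partition("-")
        start = int(s_tok) if prev is None else expand(prev, s_tok)
        end = expand(start, e_tok)  # empty token means end == start
        ranges.append((start, end))
        prev = end
    return ranges
```

Applied to the seven ranges from the original example, `encode_ranges` produces `1-120,45-80,95-235,50-,70-360,5-90,5-420`, with the single-base range written as `50-`. Because a token without a hyphen decodes to an empty end token, this `decode_ranges` also reads a bare `50` as 250-250.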

The ranges are already sorted by coordinate. This was enforced by the range merging function. In the outcome, there are no overlapping ranges. Therefore the start1, end1, start2, end2,... numbers are always in ascending order.
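For illustration, a range merging step with these properties might look like the sketch below. It is not the actual Woltka function; the name `merge_ranges` and the choice to merge only overlapping (not merely adjacent) ranges are assumptions.

```python
def merge_ranges(ranges):
    """Merge inclusive (start, end) ranges; the result is sorted and non-overlapping."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the last kept range: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


reads = [(145, 180), (1, 100), (90, 120), (250, 250)]
print(merge_ranges(reads))
```

Sorting by start first means each incoming range only has to be compared with the last merged one, which is what guarantees the ascending start1, end1, start2, end2, ... order.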

I think that handling circular genomes is a good idea.

qiyunzhu, Sep 17 '21 20:09