woltka Reduced coverage file size

The coverage files are big, so I made some tweaks to reduce their size while keeping them human-readable.

Previously, the file looked like:

Now it looks like:

G1	1-120,45-80,95-235,50,70-360,5-90,5-420,...

In a small-scale test, this treatment reduced the output file size from 11744975 to 4022174 (a 65.8% decrease). If we gzip the files (note: Woltka supports compressing output files using Gzip, Bzip2 and Xz), the file size is reduced from 3358089 to 1775954 (a 47.1% decrease).
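For reference, the two percentages can be re-derived from the sizes quoted above. This is only an arithmetic check; `percent_decrease` is a hypothetical helper name, not part of Woltka.

```python
def percent_decrease(before: int, after: int) -> float:
    """Return the relative size reduction as a percentage."""
    return (before - after) / before * 100


# Sizes quoted in the comment above (plain-text output, then gzipped output).
plain = percent_decrease(11744975, 4022174)
gzipped = percent_decrease(3358089, 1775954)

print(f"plain:   {plain:.1f}% decrease")
print(f"gzipped: {gzipped:.1f}% decrease")
```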

If we adopt this design, the downstream analysis (Zebra filter?) also needs to be modified to take this format.

What do you think? @dhakim87 @ElDeveloper

Sep 05 '21 19:09 qiyunzhu

@dhakim87 Could you please comment on this PR? I am curious about your thoughts. Thanks!

Sep 17 '21 14:09 qiyunzhu

File size in this range makes no real difference to me. From what I remember, the SAM files are so much larger than the range lists that I don't think the file size matters much. So, better to use the smaller format, I suppose.

In: G1 1-120,45-80,95-235,50,70-360,5-90,5-420

I'm not sure if the ,50, is a real output or a typo. I don't think it's possible to achieve based on our inputs, but if such a thing were to occur, I'd prefer it were written 50-50 or 50-51 (inclusive or exclusive).

The ranges should also be sorted by start index prior to being output (and I would have thought, already intersected?)

At some point we should probably come up with a way to handle circular genomes, but I don't think zebra filter understands that concept yet.

Sep 17 '21 19:09 dhakim87

@dhakim87 Thanks for sharing your opinion!

The coverage files are not comparable to the original SAM files in size. But I would imagine that people usually treat them as metadata, and may analyze them on a local system instead of a compute cluster. Making them laptop-friendly may be a good thing. But I agree that this is not the main priority.

That 50 is not a typo. As you guessed, it means 50-50 (both ends inclusive). According to the current abbreviation rule, this needs to be written as 50-, because all digits that are identical to the left number are omitted from the right number. This scenario can arise if a read occupies just a single base (an extreme case).
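To make the abbreviation rule concrete, here is a minimal sketch of one possible reading of it, not the code in this PR. It assumes that leading digits shared with the previous number in the line are dropped, and only when both numbers have the same number of digits. The function names and the full coordinates are hypothetical, chosen so that the output resembles the example line.

```python
def strip_shared(prev: str, num: str) -> str:
    """Drop leading digits of num that match prev (same length only)."""
    if len(prev) != len(num):
        return num
    i = 0
    while i < len(num) and prev[i] == num[i]:
        i += 1
    return num[i:]


def encode(ranges):
    """Abbreviate sorted, non-overlapping, inclusive (start, end) pairs."""
    prev, parts = "", []
    for start, end in ranges:
        tokens = []
        for num in (str(start), str(end)):
            tokens.append(strip_shared(prev, num))
            prev = num
        parts.append("-".join(tokens))
    return ",".join(parts)


def decode(text):
    """Restore full coordinates; a token shorter than the previous
    number borrows that number's leading digits."""
    prev, ranges = "", []
    for part in text.split(","):
        nums = []
        for token in part.split("-"):
            if len(token) < len(prev):
                token = prev[: len(prev) - len(token)] + token
            nums.append(int(token))
            prev = token
        # A bare token such as "50" is read as a single-base range.
        ranges.append((nums[0], nums[-1]))
    return ranges


# Hypothetical coordinates for illustration only.
ranges = [(1, 120), (145, 180), (195, 235), (250, 250),
          (270, 360), (365, 390), (395, 420)]
print(encode(ranges))
print(decode("1-120,45-80,95-235,50,70-360,5-90,5-420"))
```

Under this reading the single-base range is emitted as `50-`, and the decoder accepts both `50-` and the bare `50`. Decoding is unambiguous only because the numbers never decrease along the line.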

The ranges are already sorted by coordinate. This was enforced by the range merging function. In the outcome, there are no overlapping ranges. Therefore the start1, end1, start2, end2,... numbers are always in ascending order.
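For readers unfamiliar with range merging, the property described above can be sketched as follows. This is a generic interval-merging routine, not Woltka's actual function; `merge_ranges` is a hypothetical name, and the sketch assumes inclusive coordinates and merges only ranges that overlap, not ones that are merely adjacent.

```python
def merge_ranges(ranges):
    """Merge overlapping inclusive (start, end) ranges.

    The result is sorted by start and contains no overlaps, so the
    flattened start1, end1, start2, end2, ... sequence never decreases.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the last kept range: extend its end if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


# Hypothetical read coordinates, including a single-base read.
reads = [(50, 120), (1, 60), (145, 180), (150, 160), (250, 250)]
print(merge_ranges(reads))
```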

I think that handling circular genomes is a good idea.

Sep 17 '21 20:09 qiyunzhu
