
Flag to overwrite existing samples?

Open niemasd opened this issue 1 year ago • 4 comments

Thanks again for this awesome tool! I'm trying to compress a large dataset of viral genomes, and sometimes we get updated genome sequences for an existing ID. I like that the default behavior is for agc to (seemingly) skip existing sample names and print an error message, but in some cases, it might be nice to have an "overwrite" flag to force replacing the old entry with the new one. Would such a feature be feasible?

More broadly, I see that there's no option to remove specific sequence(s) from the archive; would this be feasible to add, or would one have to extract everything, remove those specific entries, and recreate the archive?

niemasd · Mar 22 '23 17:03

Well, removing (or overwriting) existing samples is a difficult task. It is related to how agc handles contigs: each contig is split into segments (substrings a few to a few tens of kbp long).

Segments with the same edge k-mers are grouped together. One of them (the first one) is taken as the reference (usually it comes from the reference sample, but in general it can also come from some other sample). All other segments in the group are then LZ-encoded against the reference. Finally, the LZ-encoded segments are divided into batches, and each batch is zstd-compressed.
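If it helps to see the pipeline end to end, here is a minimal Python sketch of the scheme just described. This is not agc's actual code: `lz_encode`/`lz_decode` are identity stand-ins for the real LZ parser, and the constants are purely illustrative.

```python
from collections import defaultdict

import zstandard as zstd  # third-party: pip install zstandard

K = 21        # edge k-mer length (illustrative; agc's k is configurable)
BATCH = 32    # LZ-encoded segments per zstd batch (illustrative)

def lz_encode(ref: bytes, seg: bytes) -> bytes:
    # Identity stand-in for agc's LZ parse of `seg` against `ref`;
    # a real parse would emit (match, literal) operations.
    return seg

def lz_decode(ref: bytes, enc: bytes) -> bytes:
    # Inverse of the stand-in above.
    return enc

def compress_collection(segments: list[bytes]) -> dict:
    # 1) group segments by their pair of edge k-mers
    groups = defaultdict(list)
    for seg in segments:
        groups[(seg[:K], seg[-K:])].append(seg)
    archive = {}
    cctx = zstd.ZstdCompressor()
    for key, segs in groups.items():
        # 2) the first segment of the group becomes the reference
        ref, rest = segs[0], segs[1:]
        # 3) remaining segments are LZ-encoded against the reference
        encoded = [lz_encode(ref, s) for s in rest]
        # 4) encoded segments are batched; each batch is zstd-compressed
        batches = [cctx.compress(b"\x00".join(encoded[i:i + BATCH]))
                   for i in range(0, len(encoded), BATCH)]
        archive[key] = (ref, batches)
    return archive
```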

Let's think about what happens when a single segment needs to be removed. If it is the reference segment for a group, all other segments (from other samples) must be decompressed, a new reference must be selected, all other segments must be LZ-parsed against the new reference, and finally, after dividing into new batches, they must be zstd-compressed. If the segment to remove is not a reference, things are a bit simpler: it suffices to zstd-decompress the batches in the group, remove the segment, form new batches, and zstd-compress them. In both cases, it would be necessary to decompress the metadata describing the collection (segment ids will change for segments after the removed one) and carefully update the segment ids.
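To make the two cases concrete, here is a sketch of the per-group removal logic, reusing the toy `lz_encode`/`lz_decode` stand-ins from the sketch above; `remove_from_group` is a hypothetical helper, not an agc function.

```python
def remove_from_group(ref: bytes, encoded: list[bytes], victim: int | None):
    """Remove one segment from a group.

    `encoded` holds the group's LZ-encoded members (parsed against `ref`);
    `victim` is an index into `encoded`, or None to remove `ref` itself.
    Returns the (possibly new) reference plus the surviving encoded
    members; re-batching, zstd-compression, and the renumbering of
    segment ids in the collection metadata are left to the caller.
    """
    if victim is None:
        # Case 1: the reference is removed. Every member must be fully
        # decoded, a new reference chosen (here: the first member), and
        # the rest re-parsed against it.
        segs = [lz_decode(ref, e) for e in encoded]
        new_ref, rest = segs[0], segs[1:]
        return new_ref, [lz_encode(new_ref, s) for s in rest]
    # Case 2: a non-reference member is removed. The surviving LZ parses
    # remain valid; only the batches must be re-formed and re-compressed.
    return ref, encoded[:victim] + encoded[victim + 1:]
```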

Summing up, this is not easy. I'm adding this feature to my TODO list. I hope we will be able to implement it in the future, but not in the forthcoming v3.1 release.

There is also one more problem. Currently, you can construct an archive containing sample A, then add sample B, then C, then D, then E. The archive will be exactly the same (if you do not use command-line storage) as if you had constructed the archive from samples A+B+C+D+E in a single run. Unfortunately, if the above-described sample removal is implemented and you compare the archive built from A+B+C+E with the one obtained by building for A+B+C+D+E and removing D, you will notice differences. The archives will contain exactly the same data, but the binary representation will not be the same. I do not know if this is an important issue. It could be solved, but that would be equivalent to decompressing the whole archive, removing one sample, and compressing from scratch, which would take time.
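As a concrete check of that append property, here is a sketch assuming the `create` and `append` subcommands documented in the agc README (exact invocation may differ, the input FASTA names are hypothetical, and, per the note above, command-line storage would have to be disabled for byte-identity to hold):

```python
import filecmp
import subprocess

def agc_create(out: str, fastas: list[str]) -> None:
    with open(out, "wb") as fh:
        subprocess.run(["agc", "create", *fastas], stdout=fh, check=True)

def agc_append(src: str, out: str, fasta: str) -> None:
    with open(out, "wb") as fh:
        subprocess.run(["agc", "append", src, fasta], stdout=fh, check=True)

samples = ["A.fa", "B.fa", "C.fa", "D.fa", "E.fa"]  # hypothetical inputs
agc_create("oneshot.agc", samples)     # single-run archive
agc_create("step1.agc", samples[:1])   # incremental archive, one sample at a time
for i, fa in enumerate(samples[1:], start=2):
    agc_append(f"step{i - 1}.agc", f"step{i}.agc", fa)
# expected: True, i.e. the two archives are byte-identical
print(filecmp.cmp("oneshot.agc", f"step{len(samples)}.agc", shallow=False))
```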

sebastiandeorowicz · Mar 04 '24 15:03

Thank you, this is super helpful context! I appreciate the thorough and thoughtful details; it helps me build a much better mental model of how agc works.

> There is also one more problem. Currently, you can construct an archive containing sample A, then add sample B, then C, then D, then E. The archive will be exactly the same (if you do not use command-line storage) as if you had constructed the archive from samples A+B+C+D+E in a single run. Unfortunately, if the above-described sample removal is implemented and you compare the archive built from A+B+C+E with the one obtained by building for A+B+C+D+E and removing D, you will notice differences. The archives will contain exactly the same data, but the binary representation will not be the same. I do not know if this is an important issue. It could be solved, but that would be equivalent to decompressing the whole archive, removing one sample, and compressing from scratch, which would take time.

I think this specific issue (same data, different representation) is probably tolerable relative to current state-of-the-art databases. For example, if I'm understanding correctly, SQLite runs into a similar issue (similar from the user's perspective; I imagine quite different under the hood) when a user repeatedly removes and adds many entries: performance degrades and the file size is artificially large due to fragmented data, which is why SQLite provides the VACUUM command to rebuild the database. I can imagine similar functionality in agc (e.g., a "rebuild" command), a time-consuming operation the user could choose to run infrequently, which performs agc compression on the entire dataset from scratch.
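For what it's worth, here is a rough sketch of what such a "rebuild" could look like as a wrapper script: extract every sample, then recreate the archive in a single run. It assumes the `listset`, `getset`, and `create` subcommands from the agc README; the `rebuild` function itself is hypothetical.

```python
import os
import subprocess
import tempfile

def rebuild(archive: str, rebuilt: str) -> None:
    # list all sample names stored in the archive
    names = subprocess.run(["agc", "listset", archive],
                           capture_output=True, text=True,
                           check=True).stdout.split()
    with tempfile.TemporaryDirectory() as tmp:
        fastas = []
        for name in names:
            path = os.path.join(tmp, f"{name}.fa")
            with open(path, "w") as fh:
                # extract one sample as FASTA
                subprocess.run(["agc", "getset", archive, name],
                               stdout=fh, check=True)
            fastas.append(path)
        # recompress everything from scratch in one run;
        # the first file acts as the reference sample
        with open(rebuilt, "wb") as out:
            subprocess.run(["agc", "create", *fastas],
                           stdout=out, check=True)
```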

Either way, I'll keep following your awesome work, and I look forward to trying out the new updates 😄

niemasd · Mar 04 '24 15:03

AGC 3.1 is ready. At the moment, removing/updating is not implemented. We are thinking about a provisional implementation in one of the next releases.

sebastiandeorowicz · Mar 18 '24 20:03

Thanks for the update! Looking forward to it 😄

niemasd · Mar 18 '24 20:03