ENH: allow user to subsample sequences to save on memory usage

Open corneliusroemer opened this issue 2 years ago • 2 comments

While it is easily possible for command line users to sample a big fasta file, users who are using the browser may not know how to do so.

When one has a big fasta, say 5k sequences, one may just be interested in a random subsample of the sequences.

We could offer a subsample mode that would sample either a certain proportion of sequences or (more complicated) a given number.

I'm not sure whether this use case is realistic, but for me it sometimes could be, for example when I download sequences from GISAID after having received EPI_ISLs for a query from covSpectrum.

The alternative would be for @chaoran-chen to allow specifying sampling number when exporting the GISAID EPI_ISL list - that may actually be useful more broadly in this use case. But it wouldn't help people who got a large number of samples from a colleague and just want to have a quick look at a representative sample.

corneliusroemer, Jun 01 '22 17:06

> certain proportion of sequences or (more complicated) a given number

In order to sample a portion you need to know the total, which means parsing and storing the entire input, or parsing it twice - once to count and another to sample.

With a hard number it's just a matter of maintaining a counter. Very easy to do in CLI (the counter is already there), very tricky for the web app.
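For illustration only, here is a minimal Python sketch of two single-pass options. Neither is confirmed by this thread as what Nextclade does or plans to do. `reservoir_sample` uses reservoir sampling (Algorithm R) to draw exactly `n` records uniformly at random without knowing the total. `bernoulli_sample` keeps each record with probability `p`, which gives only an approximate proportion but also needs no total. The function names and the simple `read_fasta` parser are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Optional, TextIO, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield (header, sequence) pairs one at a time (streaming, hypothetical helper)."""
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(
    records: Iterable[Record], n: int, rng: Optional[random.Random] = None
) -> List[Record]:
    """Return n records chosen uniformly at random in one pass (Algorithm R)."""
    if rng is None:
        rng = random.Random()
    reservoir: List[Record] = []
    for i, rec in enumerate(records):
        if i < n:
            reservoir.append(rec)  # fill the reservoir with the first n records
        else:
            j = rng.randint(0, i)  # uniform over 0..i, both ends inclusive
            if j < n:
                reservoir[j] = rec  # replace a kept record with probability n/(i+1)
    return reservoir


def bernoulli_sample(
    records: Iterable[Record], p: float, rng: Optional[random.Random] = None
) -> Iterator[Record]:
    """Keep each record independently with probability p (approximate proportion)."""
    if rng is None:
        rng = random.Random()
    for rec in records:
        if rng.random() < p:
            yield rec
```

Memory use of `reservoir_sample` is bounded by `n` records rather than by the input size. If the input has fewer than `n` records, all of them are returned.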

ivan-aksamentov, Jun 01 '22 17:06

@corneliusroemer, if you open the "GISAID list" from CoV-Spectrum, it is already randomly sorted (note that the URL contains orderBy=random). I.e., you just need to take the first n rows to subsample.

@ivan-aksamentov, do you know the file size before parsing? If you do, it should be possible to estimate the number of sequences.
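A rough sketch of what such an estimate could look like, in Python for illustration. It assumes an uncompressed FASTA file whose records are of roughly similar length (for example, genomes of a single virus), reads only a prefix of the file, and scales the number of headers found there by the total file size. The function name `estimate_record_count` and the `probe_bytes` parameter are hypothetical, not part of Nextclade.

```python
import os


def estimate_record_count(path: str, probe_bytes: int = 1 << 20) -> int:
    """Estimate the number of FASTA records from the file size and a prefix of the file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        probe = f.read(probe_bytes)
    if not probe:
        return 0
    # Count header lines in the prefix: a '>' at the very start or right after a newline.
    headers = probe.count(b"\n>") + (1 if probe.startswith(b">") else 0)
    if len(probe) >= size:
        return headers  # the whole file fit in the prefix, so the count is exact
    # Scale the header density of the prefix up to the full file size.
    return round(size * headers / len(probe))
```

The estimate degrades when record lengths vary widely or when the prefix is too short to contain several complete records, so it is better suited to choosing an approximate sampling proportion than to hitting an exact sample size.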

chaoran-chen, Jun 01 '22 20:06