cycledash

Handle large VCFs well

Open ihodes opened this issue 10 years ago • 13 comments

Right now, if someone tries to load a VCF of ~300MB (maybe even less), it just hangs without an error.

We should

  1. Show an error/explain to the user what's going on
  2. Fix this, so that we can load huge VCFs
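
For item 1, a minimal sketch of a pre-load size check, assuming the loader can see the file on disk before parsing it. The names `check_vcf_size`, `VcfTooLargeError`, and `MAX_VCF_BYTES` are hypothetical and not part of CycleDash; the 300MB threshold is only the rough figure mentioned above.

```python
import os

# Hypothetical threshold: the issue only says ~300MB (maybe less) hangs.
MAX_VCF_BYTES = 300 * 1024 * 1024


class VcfTooLargeError(Exception):
    """Raised when a VCF exceeds the size the loader is assumed to handle."""


def check_vcf_size(path, max_bytes=MAX_VCF_BYTES):
    """Return the file size in bytes, or raise VcfTooLargeError with a readable message."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise VcfTooLargeError(
            "VCF is %.1f MB, above the %.1f MB limit; it was not loaded."
            % (size / 1e6, max_bytes / 1e6)
        )
    return size
```

Catching `VcfTooLargeError` at the upload boundary would let the UI show the message instead of hanging; it does not address item 2.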

ihodes avatar Jan 28 '15 18:01 ihodes

Preliminary tests with Impala show it handling massive VCF joins quickly.

http://f.cl.ly/items/351T0m2c0B0K0W2U0X3T/Postgres%20vs.Impala%20(notes%20re%3A%20CycleDash).html

ihodes avatar Feb 10 '15 21:02 ihodes

What's the size of the result set for select contig, position, reference, alternates, quality, "info:VAF", "info:DPR" from genotypes where vcf_id = 12;? It's odd that PG would be so much faster. Are you hitting an index in PG? Maybe something is in the block cache?

hammer avatar Feb 11 '15 04:02 hammer

Ah yeah, as noted in the preamble, there are indices on the PG table. Without an index (on a freshly restarted PG server, so it should be uncached), that query takes ~9ms. I didn't try to clear the OS cache, but I don't think that's necessary.

ihodes avatar Feb 11 '15 15:02 ihodes

Hmm, restarting the server wouldn't clear the OS block cache. Curious to see if that has an effect.

hammer avatar Feb 11 '15 15:02 hammer

Okay, clearing the block cache ups the time to ~6s. Thereafter ~0.5ms.

ihodes avatar Feb 11 '15 15:02 ihodes

~6s with no index?

hammer avatar Feb 11 '15 15:02 hammer

Yeah, about 26ms with index.

ihodes avatar Feb 11 '15 15:02 ihodes

Also could you include the number of rows in the response to the query in the document?

hammer avatar Feb 11 '15 15:02 hammer

The query in the document limits it to 100.

ihodes avatar Feb 11 '15 15:02 ihodes

There's no LIMIT clause? I see that you include response time with LIMIT but the original query does not have that clause.

hammer avatar Feb 11 '15 15:02 hammer

Ah, I only included it for Impala in the document. All of the above were run with LIMIT 100; I'll run them again without that.

ihodes avatar Feb 11 '15 15:02 ihodes

With no limit, returning ~5k rows, ~10s (with no indices, cleared block cache), ~200ms with indices.
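
A rough sketch of how this kind of comparison (row count plus elapsed time, with and without LIMIT, before and after adding an index) can be scripted. It uses an in-memory SQLite database as a stand-in for Postgres, with a synthetic `genotypes` table and a hypothetical `timed_query` helper, so the timings it produces say nothing about the numbers above and the OS block cache is not modeled at all.

```python
import sqlite3
import time


def timed_query(conn, sql, params=()):
    """Run a query, fetch every row, and return (row_count, elapsed_seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return len(rows), time.perf_counter() - start


# In-memory SQLite stands in for Postgres here; the table contents are synthetic.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE genotypes (vcf_id INTEGER, contig TEXT, position INTEGER, quality REAL)"
)
conn.executemany(
    "INSERT INTO genotypes VALUES (?, ?, ?, ?)",
    [(i % 20, "chr1", i, 50.0) for i in range(100000)],
)

base = "SELECT contig, position, quality FROM genotypes WHERE vcf_id = ?"
full_count, full_time = timed_query(conn, base, (12,))
limited_count, limited_time = timed_query(conn, base + " LIMIT 100", (12,))

# Add an index on the filter column and repeat the unlimited query.
conn.execute("CREATE INDEX genotypes_vcf_id ON genotypes (vcf_id)")
indexed_count, indexed_time = timed_query(conn, base, (12,))

print("no index: %d rows in %.4fs" % (full_count, full_time))
print("LIMIT 100: %d rows in %.4fs" % (limited_count, limited_time))
print("indexed: %d rows in %.4fs" % (indexed_count, indexed_time))
```

Reporting the row count next to each timing makes it obvious whether a LIMIT was in effect for a given measurement.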

ihodes avatar Feb 11 '15 15:02 ihodes

+1'ing this, was just interested in browsing the 12GB (uncompressed) 1KG phase 3 VCF found here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz.

ryan-williams avatar Sep 16 '15 22:09 ryan-williams