cycledash
Handle large VCFs well
Right now, if someone tries to load a VCF of ~300MB (maybe even less), it just hangs without an error.
We should
- Show an error/explain to the user what's going on
- Fix this, so that we can load huge VCFs
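As a minimal sketch of the first point, the loader could check the file size up front and fail with a readable message instead of hanging. The threshold reuses the ~300MB figure above; `MAX_VCF_BYTES`, `VCFTooLargeError`, and `check_vcf_size` are hypothetical names for illustration, not existing CycleDash code.

```python
import os

# Hypothetical limit, taken from the ~300MB figure mentioned in this issue.
MAX_VCF_BYTES = 300 * 1024 * 1024


class VCFTooLargeError(Exception):
    """Raised when a VCF exceeds the size the loader is known to handle."""


def check_vcf_size(path, max_bytes=MAX_VCF_BYTES):
    """Return the file size in bytes, or raise with a user-facing message."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise VCFTooLargeError(
            "VCF is %.1f MB, which exceeds the %.1f MB limit this viewer can load."
            % (size / 1e6, max_bytes / 1e6)
        )
    return size
```

A caller would catch `VCFTooLargeError` and surface its message to the user. This only covers the error reporting; it does nothing to make huge VCFs loadable.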
Preliminary tests with Impala show it handling massive VCF joins quickly.
http://f.cl.ly/items/351T0m2c0B0K0W2U0X3T/Postgres%20vs.Impala%20(notes%20re%3A%20CycleDash).html
What's the size of the result set for this query?
select contig, position, reference, alternates, quality, "info:VAF", "info:DPR" from genotypes where vcf_id = 12;
It's odd that PG would be so much faster. Are you hitting an index in PG? Maybe something is in the block cache?
Ah yeah, as noted in the preamble, there are indices on the PG table. Without an index (on a freshly restarted PG server, so it should be uncached), that query takes ~9ms. I didn't try to clear the OS cache, but I don't think that's necessary.
Hmm, restarting the server wouldn't clear the OS block cache. Curious to see if that has an effect.
Okay, clearing the block cache ups the time to ~6s. Thereafter ~0.5ms.
~6s with no index?
Yeah, about 26ms with index.
Also could you include the number of rows in the response to the query in the document?
The query in the document limits it to 100.
There's no LIMIT clause? I see that you include response time with LIMIT, but the original query does not have that clause.
Ah I only included it for Impala in the document; all of the above have been run with limit 100; I'll run them again without that.
With no limit, returning ~5k rows, ~10s (with no indices, cleared block cache), ~200ms with indices.
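For anyone wanting to reproduce this kind of comparison, here is a rough sketch that reports the row count alongside the elapsed time, with and without an index. It uses an in-memory SQLite table as a stand-in (not Postgres or Impala), synthetic rows, and a hypothetical index name, so whatever it prints says nothing about the numbers quoted in this thread. It also does not touch the OS block cache.

```python
import sqlite3
import time

# Same query as above; the quoted "info:..." names are column identifiers.
QUERY = (
    'select contig, position, reference, alternates, quality, '
    '"info:VAF", "info:DPR" from genotypes where vcf_id = 12'
)


def make_db(rows_per_vcf=5000, n_vcfs=20):
    """Build an in-memory genotypes table filled with synthetic rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'create table genotypes (vcf_id integer, contig text, position integer, '
        'reference text, alternates text, quality real, '
        '"info:VAF" real, "info:DPR" text)'
    )
    rows = (
        (v, "chr1", p, "A", "T", 50.0, 0.5, "10,5")
        for v in range(n_vcfs)
        for p in range(rows_per_vcf)
    )
    conn.executemany("insert into genotypes values (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def timed_query(conn, sql=QUERY):
    """Run the query and return (number of rows, elapsed seconds)."""
    start = time.perf_counter()
    n_rows = len(conn.execute(sql).fetchall())
    return n_rows, time.perf_counter() - start


def query_plan(conn, sql=QUERY):
    """Return SQLite's plan text, which shows whether an index is used."""
    return " ".join(row[-1] for row in conn.execute("explain query plan " + sql))


if __name__ == "__main__":
    conn = make_db()
    print("no index:", timed_query(conn), query_plan(conn))
    # Hypothetical index name; the thread does not say which indices exist in PG.
    conn.execute("create index idx_genotypes_vcf_id on genotypes (vcf_id)")
    print("indexed: ", timed_query(conn), query_plan(conn))
```

Printing the plan next to the timing makes it obvious whether a given run actually hit the index, which was the open question earlier in the thread.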
+1'ing this, was just interested in browsing the 12GB (uncompressed) 1KG phase 3 VCF found here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz.