gemini icon indicating copy to clipboard operation
gemini copied to clipboard

"filter" column has no index

Open jdelafon opened this issue 9 years ago • 1 comments

AFAICT the quality 'filter' column (VQSR etc.) is not indexed, probably because it has many NULL. The consequence of the index not being present is a big slowdown on queries involving that column. Do you think it would be possible to change NULL to '' or 'PASS' and index efficiently? Or I read that there are ways to index nullable columns.

[Edit] Example with 6 million rows:

sqlite> select count(*) from variants where filter is NULL;
4737271

takes 50s, while after

sqlite> update variants set filter='PASS' where filter is NULL;
sqlite> create index pass_filter_idx on variants (filter);

the equivalent query

sqlite> select count(*) from variants where filter='PASS';
4737271

takes less than 1 s. [Edit ]This is obvious because it is read directly from the index, but more generally,

$ time sqlite3 wgs_orig.db "select filter from variants where filter is NULL"
real    2m3.638s
user    0m8.821s
sys 0m34.472s

$ time sqlite3 wgs.db "select filter from variants where filter='PASS'"
real    0m50.561s
user    0m2.854s
sys 0m3.947s

[Edit] I am using GATK, which says 'PASS' in the VCF when a variant passes all VQSR filters. I'd like to understand why it got translated to NULL through Gemini.

jdelafon avatar May 18 '16 06:05 jdelafon

In htslib, PASS is stored as NULL, this gets translated to None in python then back to NULL in SQL. We could explicitly store PASS which would allow users to create indicies. But I think that would break backwards compat so we will have to think about it.

Meanwhile you could probably issue an update statement set filter = 'PASS' where filter is NULL.

brentp avatar May 31 '16 19:05 brentp