"filter" column has no index
AFAICT the quality 'filter' column (VQSR etc.) is not indexed, probably because it has many NULL. The consequence of the index not being present is a big slowdown on queries involving that column. Do you think it would be possible to change NULL to '' or 'PASS' and index efficiently? Or I read that there are ways to index nullable columns.
[Edit] Example with 6 million rows:
sqlite> select count(*) from variants where filter is NULL;
4737271
takes 50s, while after
sqlite> update variants set filter='PASS' where filter is NULL;
sqlite> create index pass_filter_idx on variants (filter);
the equivalent query
sqlite> select count(*) from variants where filter='PASS';
4737271
takes less than 1 s. [Edit ]This is obvious because it is read directly from the index, but more generally,
$ time sqlite3 wgs_orig.db "select filter from variants where filter is NULL"
real 2m3.638s
user 0m8.821s
sys 0m34.472s
$ time sqlite3 wgs.db "select filter from variants where filter='PASS'"
real 0m50.561s
user 0m2.854s
sys 0m3.947s
[Edit] I am using GATK, which says 'PASS' in the VCF when a variant passes all VQSR filters. I'd like to understand why it got translated to NULL through Gemini.
In htslib, PASS is stored as NULL, this gets translated to None in python then back to NULL in SQL. We could explicitly store PASS which would allow users to create indicies. But I think that would break backwards compat so we will have to think about it.
Meanwhile you could probably issue an update statement set filter = 'PASS' where filter is NULL.