filter: re-implement with SQLite and add option to use with `--engine`
Description of proposed changes
- Rewrite pandas logic with a database approach.
- Split logic into smaller files:
  - `sqlite.py` contains the working class, an implementation of the abstract `base.py`.
  - Add new `io` functions in `io_support/`.
  - Rewrite date parsing: `date_parsing.py`
- Breaking changes:
  - `--query` takes a SQL expression instead of pandas syntax.
  - Error when an `--include-where`, `--exclude-where`, or `--query` column doesn't exist (partial fix for #754).
- Behavioral changes:
  - Report min/max date filters separately.
  - (internal) Less memory usage, more disk usage.
Related issue(s)
- #866
Testing
- Rewrite tests for the new implementation.
- Modify functional tests to check `--non-nucleotide` and incomplete dates.
- Add tests to improve coverage and verify functionality.
TODO
- [x] improve test coverage
- [x] write outputs without loading entire output into memory
- experiments:
  - [x] ~use SQLite for FASTA~ (corneliusroemer/fasta_zstd_sqlite?) tried in 06e56981432be868939b9d40cadb0c529e99e38d; not valuable for `augur filter` alone.
  - [x] date parsing with native SQLite string functions
  - [x] re-consider using chunked pd.read_csv/to_sql for loading files; ~parallelized writes~ seems roundabout, better to load directly using the new loader class in `io`
- [ ] figure out a good story for query syntax transitioning: #920
- [ ] figure out how to organize tests for multiple engines
- [ ] add these ambiguous date error tests back
- [ ] move `db/` -> `engines/`
- [ ] profile on `ncov` data
- [ ] update docs with breaking/behavioral changes
- [ ] merge #922
- [ ] merge #923
- [ ] merge #926
- [ ] merge #928
- [ ] merge #929
- [ ] merge #930
- [ ] merge #931
- [ ] merge #937
- [ ] drop/squash commits
For later:
- [ ] parse dates only if needed
Codecov Report
Merging #854 (8acea55) into master (2e4aa88) will decrease coverage by 20.89%. The diff coverage is 85.84%.
:exclamation: Current head 8acea55 differs from pull request most recent head 0385409. Consider uploading reports for the commit 0385409 to get more accurate results.
```diff
@@            Coverage Diff             @@
##           master     #854       +/-  ##
===========================================
- Coverage   59.64%   38.75%   -20.90%
===========================================
  Files          53       49        -4
  Lines        6317     6379       +62
  Branches     1586     1608       +22
===========================================
- Hits         3768     2472     -1296
- Misses       2286     3802     +1516
+ Partials      263      105      -158
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| augur/frequencies.py | 10.25% <33.33%> (-41.42%) | :arrow_down: |
| augur/filter_support/db/base.py | 67.94% <67.94%> (ø) | |
| augur/utils.py | 42.54% <83.33%> (-21.72%) | :arrow_down: |
| augur/filter_support/output.py | 84.61% <84.61%> (ø) | |
| augur/io_support/db/sqlite.py | 89.13% <89.13%> (ø) | |
| augur/io_support/db/__init__.py | 92.15% <92.15%> (ø) | |
| augur/filter_support/db/sqlite.py | 95.43% <95.43%> (ø) | |
| augur/dates.py | 97.61% <97.34%> (+7.76%) | :arrow_up: |
| augur/filter_support/subsample.py | 97.43% <97.43%> (ø) | |
| augur/filter.py | 100.00% <100.00%> (+4.17%) | :arrow_up: |
| ... and 52 more | | |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e4aa88...0385409. Read the comment docs.
A few quick notes I had during our call just now:
- Much of the templated SQL seems to interpolate values directly without proper quoting. If interpolation is used, quoting is essential, but using placeholders would be ideal.
- Similarly, much of the templated SQL seems to interpolate identifiers (column names, table names) without quoting. At the very least it's essential to quote any identifiers that come from the user. It's ok to skip quoting things that we know don't need it, like our constants, although I think it's often easier to blanket quote it all and then not have to worry about it in the future.
- It's worth checking how often `load_tsv()` commits the inserts. Especially during bulk inserts it can be beneficial to commit every X rows (where X is large but not too large). A quick check suggests that `load_tsv()` doesn't commit at all?
- Why register date handling functions from Python into SQLite instead of using SQLite's built-in date handling functions? (Or string handling functions, I guess, since that's all the Python is doing.)
@tsibley

> Much of the templated SQL seems to interpolate values directly without proper quoting. If interpolation is used, quoting is essential, but using placeholders would be ideal.

Could you clarify what you mean by using interpolation vs. placeholders?

> Similarly, much of the templated SQL seems to interpolate identifiers (column names, table names) without quoting. At the very least it's essential to quote any identifiers that come from the user. It's ok to skip quoting things that we know don't need it, like our constants, although I think it's often easier to blanket quote it all and then not have to worry about it in the future.

Good point! I did this for the ID and date columns here 54d6ddf572ab8fa77d96d56e7ff10194e8f81fe5...a83a17eecab4e3d23fdcdbcde2c3f81a3ca817a1, but this should at least be expanded to user input such as in `--exclude-where`:

https://github.com/nextstrain/augur/blob/1513f1e40d515085c7e776f9bcfbee17b0433dca/augur/filter_support/db/sqlite.py#L242-L244

And agreed on blanket quoting.

> It's worth checking how often `load_tsv()` commits the inserts. Especially during bulk inserts it can be beneficial to commit every X rows (where X is large but not too large). A quick check suggests that `load_tsv()` doesn't commit at all?

Yeah, I don't know where I got the false notion that a commit isn't necessary with `executemany()`. I'll do some experimenting with different data loading options as we discussed.
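To illustrate the commit-every-X-rows suggestion, here is a hedged sketch (not this PR's code; the metadata table layout and batch size are assumptions):

```python
import sqlite3
from itertools import islice

def load_in_batches(connection, rows, batch_size=100_000):
    """Bulk-insert rows, committing every batch_size rows to bound transaction size."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        connection.executemany("INSERT INTO metadata VALUES (?, ?)", batch)
        connection.commit()  # make the inserts so far durable
```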
> Why register date handling functions from Python into SQLite instead of using SQLite's built-in date handling functions? (Or string handling functions, I guess, since that's all the Python is doing.)

Mostly because I imagined the complexity of writing `get_date_max()` in SQL and thought it would be good to stick to the existing logic implemented in Python.

However, date parsing now takes up a decent amount of time (second only to loading metadata), so it might be worth trying out a SQL implementation.
> Could you clarify what you mean by using interpolation vs. placeholders?

```python
cur.execute(f"select * from x where y = '{z}'")        # interpolation (problematic)
cur.execute("select * from x where y = ?", (z,))       # placeholder
cur.execute("select * from x where y = :z", {"z": z})  # placeholder
```

Placeholders are ubiquitous in SQL interfaces, with `?` being the most common syntax. SQLite supports several forms.
> Good point! I did this for the ID and date columns here 54d6ddf...a83a17e

Ah, surrounding identifiers in double quotes is a start, but the identifier also needs its double quotes escaped by way of doubling them up, e.g. `identifier.replace('"', '""')` before interpolation.
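For illustration, a small helper along these lines could combine both steps (the name `sanitize_identifier` is hypothetical, not this PR's API):

```python
def sanitize_identifier(identifier: str) -> str:
    # Wrap in double quotes and escape embedded double quotes by doubling them.
    return '"' + identifier.replace('"', '""') + '"'

# e.g. a user-supplied column name from --exclude-where:
column = 'my "odd" column'
query = f"SELECT {sanitize_identifier(column)} FROM metadata"
```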
> Yeah, I don't know where I got the false notion that a commit isn't necessary with `executemany()`. I'll do some experimenting with different data loading options as we discussed.

:+1:

The default transaction handling of Python's sqlite3 module is a bit unusual: it's not auto-commit, which is good, but it is auto-begin for DML, which is a weird combo. IME, it is easier to reason about transactions by disabling that default with `isolation_level = None` and using explicit transaction handling in our code. It's then clear when you need to issue a commit: when you've issued a begin earlier. This can be wrapped up into a context handler too (extending the default context manager for connections just a bit).
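A minimal sketch of that pattern, assuming `isolation_level=None`; the `transaction` helper name is illustrative, not from this PR:

```python
import sqlite3
from contextlib import contextmanager

# isolation_level=None disables sqlite3's implicit BEGIN, so every transaction
# boundary below is explicit.
connection = sqlite3.connect("metadata.sqlite3", isolation_level=None)

@contextmanager
def transaction(connection):
    connection.execute("BEGIN")
    try:
        yield connection
    except BaseException:
        connection.execute("ROLLBACK")
        raise
    else:
        connection.execute("COMMIT")

# A COMMIT is issued exactly when the earlier BEGIN succeeded.
with transaction(connection) as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS metadata (strain TEXT, date TEXT)")
    conn.executemany("INSERT INTO metadata VALUES (?, ?)", [("strainA", "2018-03-25")])
```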
> Mostly because I imagined the complexity of writing `get_date_max()` in SQL and thought it would be good to stick to existing logic implemented in Python. However, date parsing now takes up a decent amount of time (2nd to loading metadata), so it might be worth trying out a SQL implementation.

Yeah, the time the date table creation takes makes me think a lot of it is spent shuttling between SQLite (C) and Python and back for those functions. I agree `get_date_max()` is not enticing to write in SQL, but (especially initially) it doesn't have to be all or nothing: we can register `get_date_max()` from Python but use built-in functions for the others.
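To sketch that mixed approach (this `get_date_max` body is a simplified stand-in for augur's real logic, with no error handling):

```python
import calendar
import sqlite3

def get_date_max(date):
    # Resolve an ambiguous ISO-ish date ("2018", "2018-03-XX") to its latest
    # possible day; simplified stand-in for augur's actual implementation.
    parts = date.split("-")
    year = int(parts[0])
    month = 12 if len(parts) < 2 or "X" in parts[1] else int(parts[1])
    day = (calendar.monthrange(year, month)[1]
           if len(parts) < 3 or "X" in parts[2]
           else int(parts[2]))
    return f"{year:04d}-{month:02d}-{day:02d}"

connection = sqlite3.connect(":memory:")
connection.create_function("get_date_max", 1, get_date_max)

# Python UDF for the hard case, built-in strftime for complete dates:
print(connection.execute(
    "SELECT get_date_max('2018-03-XX'), strftime('%m', '2018-03-25')"
).fetchone())  # ('2018-03-31', '03')
```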
Force-pushing date parsing improvements (pre-rebase d3e466da...903d92ea)
Force-pushing date parsing improvements and force-include reporting (pre-rebase 903d92e...911e1379)
@tsibley

After some more experimenting and optimization of the current code, I've decided not to pursue native SQLite for the date logic:

- Caching the Python user-defined functions brings the total run time of date table creation for current GISAID data down from 58s without caching to 12s with caching on my machine. This should be sufficient. (See the sketch after this list.)
- Implementing it in SQLite is not very straightforward given the limited built-in functionality. To get year/month/day from an ISO 8601 date string:

  ```sql
  with t1 as (
      select
          substr("2018-03-25", 0, instr("2018-03-25", '-')) as year,
          substr("2018-03-25", instr("2018-03-25", '-') + 1) as no_year
  )
  select
      t1.year,
      substr(no_year, 0, instr(no_year, '-')) as month,
      substr(no_year, instr(no_year, '-') + 1) as day
  from t1;
  ```

  and this is without error handling (returning `NULL` if not parseable).
- Python functions have more potential to be re-used in other areas.
@victorlin Ok, sounds good. These are great candidates for caching, indeed.

That example SQL is a little more tortuous than I was expecting. I had been imagining a date/time function or string handling function to extract the parts from an ISO date, e.g.:

```
sqlite> select cast(strftime('%m', '2018-03-25') as int) as month;
3
```

Though I realize now it would be a little more complicated than that with date/time funcs, thanks to our need to handle forms like `2018-03-XX` and `2018` too. String handling (like the Python versions do) would sidestep those issues, but I see now that SQLite doesn't have a `split()` or equivalent (which is a bit boggling).
Force-pushing updated tests, SQL identifier sanitizing, connection handling improvements (pre-rebase 911e1379...3eb8c987)
I investigated some data loading options today:

- `executemany` with a generator (I was doing this wrong until 6e14640c 😬)
- `isolation_level=None` with explicit `begin`/`commit`
- `PRAGMA journal_mode = WAL`
- `PRAGMA synchronous = OFF`

Testing on my local machine, `executemany` without any non-default pragmas yields the fastest run time: the difference with `synchronous = OFF` is negligible, and I was surprised that `journal_mode = WAL` was slightly but noticeably slower.
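For concreteness, a rough sketch of that winning setup (the real `load_tsv()` in this branch differs; the table name and schema handling are simplified here):

```python
import csv
import sqlite3

def load_tsv(connection, path):
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        # Quote column names taken from the file (untrusted identifiers).
        columns = ", ".join('"' + c.replace('"', '""') + '"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        connection.execute(f"CREATE TABLE metadata ({columns})")
        # Passing the reader (an iterator) lets executemany stream rows
        # without materializing the whole file in memory.
        connection.executemany(f"INSERT INTO metadata VALUES ({placeholders})", reader)
    connection.commit()
```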
Due to the single-threaded, sequential nature of `augur filter` (load data -> compute intermediate tables -> write outputs), there is no need for the concurrent readers that seem to be why people use `PRAGMA journal_mode = WAL`. I don't see any logical way to restructure things so that concurrency would make sense, since there is no need to read until everything is loaded. The only benefit would be concurrent writers for a potential speed-up, which isn't worth much given the single write lock.
For proper auto-committing, I changed the code to use the default connection context manager rather than sharing one connection across calls.
There is potential to re-use the same metadata for cases like subsampling schemes in the ncov workflow, which prompts "concurrency", but for that I think it's easier to just create copies of the database file rather than sharing a single one:

- Current SQLite logic relies on creating intermediate tables with predefined names.
- Sharing a database file would require concurrent writes to the file, which is tricky.

The alternative process would require a new pre-compute `augur` command similar to `augur index`, and might look something like:

```sh
augur load metadata.tsv db1.sqlite3
cp db1.sqlite3 db2.sqlite3
# the following can be run in parallel
augur filter --metadata-db db1.sqlite3 --query 'country = "USA"' --output-strains usa.txt
augur filter --metadata-db db2.sqlite3 --query 'country != "USA"' --output-strains non-usa.txt
```

This would come in a later PR; the point is that there should be no need to worry about supporting concurrency in this PR.
Force-pushing additional tests, styling fixes, improved type hints and docstrings (pre-rebase 3eb8c987...a73aec30)
Force-pushing rebase onto master (equivalent merge commit 42370471c87fc4fecb1d16a0a086e8208344bdf2)
> …but for that I think it's easier to just create copies of the database file rather than sharing a single one:
> - Current SQLite logic relies on creating intermediate tables with predefined names.
> - Sharing a database file would require concurrent writes to the file which is tricky.

Instead of copying the database to avoid the two problems above, `augur filter` could `ATTACH` to the input db from within the "main" temporary/in-memory db it uses to do its work. This treats the input db as something conceptually separate from the transient working db (which is, roughly speaking, an implementation detail) and more like an input tsv, rather than treating the input db as a pre-loaded snapshot of the transient working db. The input db would want to be in WAL mode to support concurrent reads.
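A rough sketch of that idea (file and table names are illustrative):

```python
import sqlite3

# Intermediate tables with predefined names live in the transient in-memory
# "main" db; the shared input db is only ever read.
connection = sqlite3.connect(":memory:")
connection.execute("ATTACH DATABASE 'metadata.sqlite3' AS input_db")
connection.execute("""
    CREATE TABLE metadata_filtered AS
    SELECT * FROM input_db.metadata
""")
```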
Update: this stalled as I shifted gears to work on other projects. The current state is that it works, but opening it for formal review is pending:

- Incorporating newer changes (e.g. weekly grouping).
- Refactoring either the existing code or the proposed changes to have a consistent "internal API" between the two engines.
- Extensive testing to make sure results are comparable to `--engine pandas`.
- A good story for maintaining multiple engines, so that new features are added to both simultaneously (at least until one is deprecated) and it is not cumbersome to do so. This applies to both implementation and automated tests.