filter: re-implement with SQLite and add option to use with `--engine`
Description of proposed changes
- Rewrite pandas logic with a database approach.
- Split logic into smaller files:
  - `sqlite.py` contains the working class, an implementation of the abstract `base.py`.
  - Add new `io` functions in `io_support/`.
  - Rewrite date parsing: `date_parsing.py`
- Breaking changes:
  - `--query` takes a SQL expression instead of pandas syntax.
  - Error when an `--include-where`, `--exclude-where`, or `--query` column doesn't exist (partial fix for #754).
- Behavioral changes:
  - Report min/max date filters separately.
  - (internal) Less memory usage, more disk usage.
Related issue(s)
- #866
Testing
- Rewrite tests for the new implementation.
- Modify functional tests to check `--non-nucleotide` and incomplete dates.
- Add tests to improve coverage and verify functionality.
TODO
- [x] improve test coverage
- [x] write outputs without loading entire output into memory
- experiments:
  - [x] ~use SQLite for FASTA~ (corneliusroemer/fasta_zstd_sqlite?) tried in 06e56981432be868939b9d40cadb0c529e99e38d; not valuable for `augur filter` alone.
  - [x] date parsing with native SQLite string functions
  - [x] re-consider using chunked pd.read_csv/to_sql for loading files; ~parallelized writes~ seems roundabout, better to load directly using the new loader class in `io`
- [ ] figure out a good story for query syntax transitioning: #920
- [ ] figure out how to organize tests for multiple engines
- [ ] add these ambiguous date error tests back
- [ ] move `db/` -> `engines/`
- [ ] profile on `ncov` data
- [ ] update docs with breaking/behavioral changes
- [ ] merge #922
- [ ] merge #923
- [ ] merge #926
- [ ] merge #928
- [ ] merge #929
- [ ] merge #930
- [ ] merge #931
- [ ] merge #937
- [ ] drop/squash commits
For later:
- [ ] parse dates only if needed
Codecov Report
Merging #854 (8acea55) into master (2e4aa88) will decrease coverage by 20.89%. The diff coverage is 85.84%.
:exclamation: Current head 8acea55 differs from pull request most recent head 0385409. Consider uploading reports for the commit 0385409 to get more accurate results.
```diff
@@            Coverage Diff             @@
##           master     #854       +/-  ##
===========================================
- Coverage   59.64%   38.75%   -20.90%
===========================================
  Files          53       49        -4
  Lines        6317     6379       +62
  Branches     1586     1608       +22
===========================================
- Hits         3768     2472     -1296
- Misses       2286     3802     +1516
+ Partials      263      105      -158
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| augur/frequencies.py | 10.25% <33.33%> (-41.42%) | :arrow_down: |
| augur/filter_support/db/base.py | 67.94% <67.94%> (ø) | |
| augur/utils.py | 42.54% <83.33%> (-21.72%) | :arrow_down: |
| augur/filter_support/output.py | 84.61% <84.61%> (ø) | |
| augur/io_support/db/sqlite.py | 89.13% <89.13%> (ø) | |
| augur/io_support/db/__init__.py | 92.15% <92.15%> (ø) | |
| augur/filter_support/db/sqlite.py | 95.43% <95.43%> (ø) | |
| augur/dates.py | 97.61% <97.34%> (+7.76%) | :arrow_up: |
| augur/filter_support/subsample.py | 97.43% <97.43%> (ø) | |
| augur/filter.py | 100.00% <100.00%> (+4.17%) | :arrow_up: |
| ... and 52 more | | |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e4aa88...0385409. Read the comment docs.
A few quick notes I had during our call just now:
- Much of the templated SQL seems to interpolate values directly without proper quoting. If interpolation is used, quoting is essential, but using placeholders would be ideal.
- Similarly, much of the templated SQL seems to interpolate identifiers (column names, table names) without quoting. At the very least it's essential to quote any identifiers that come from the user. It's ok to skip quoting things that we know don't need it, like our constants, although I think it's often easier to blanket quote it all and then not have to worry about it in the future.
- It's worth checking how often `load_tsv()` commits the inserts. Especially during bulk inserts it can be beneficial to commit every X rows (where X is large but not too large). A quick check suggests that `load_tsv()` doesn't commit at all?
- Why register date handling functions from Python into SQLite instead of using SQLite's built-in date handling functions? (Or string handling functions, I guess, since that's all the Python is doing.)
@tsibley

> Much of the templated SQL seems to interpolate values directly without proper quoting. If interpolation is used, quoting is essential, but using placeholders would be ideal.

Could you clarify what you mean by using interpolation vs. placeholders?

> Similarly, much of the templated SQL seems to interpolate identifiers (column names, table names) without quoting. At the very least it's essential to quote any identifiers that come from the user. It's ok to skip quoting things that we know don't need it, like our constants, although I think it's often easier to blanket quote it all and then not have to worry about it in the future.

Good point! I did this for the ID and date columns here 54d6ddf572ab8fa77d96d56e7ff10194e8f81fe5...a83a17eecab4e3d23fdcdbcde2c3f81a3ca817a1, but this should at least be expanded to user input such as in `--exclude-where`:

https://github.com/nextstrain/augur/blob/1513f1e40d515085c7e776f9bcfbee17b0433dca/augur/filter_support/db/sqlite.py#L242-L244

And agreed on blanket quoting.

> It's worth checking how often `load_tsv()` commits the inserts. Especially during bulk inserts it can be beneficial to commit every X rows (where X is large but not too large). A quick check suggests that `load_tsv()` doesn't commit at all?

Yeah, I don't know where I got the false notion that a commit isn't necessary with `executemany()`. I'll do some experimenting with different data loading options as we discussed.
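To illustrate the commit-every-X-rows suggestion, here is a hedged sketch (not this PR's code; the metadata table layout and batch size are assumptions):

```python
import sqlite3
from itertools import islice

def load_in_batches(connection, rows, batch_size=100_000):
    """Bulk-insert rows, committing every batch_size rows to bound transaction size."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        connection.executemany("INSERT INTO metadata VALUES (?, ?)", batch)
        connection.commit()  # make the inserts so far durable
```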
> Why register date handling functions from Python into SQLite instead of using SQLite's built-in date handling functions? (Or string handling functions, I guess, since that's all the Python is doing.)

Mostly because I imagined the complexity of writing `get_date_max()` in SQL and thought it would be good to stick to the existing logic implemented in Python.

However, date parsing now takes up a decent amount of time (second only to loading metadata), so it might be worth trying out a SQL implementation.
> Could you clarify what you mean by using interpolation vs. placeholders?

```python
cur.execute(f"select * from x where y = '{z}'")        # interpolation (problematic)
cur.execute("select * from x where y = ?", (z,))       # placeholder
cur.execute("select * from x where y = :z", {"z": z})  # placeholder
```

Placeholders are ubiquitous in SQL interfaces, with `?` being the most common syntax. SQLite supports several forms.
> Good point! I did this for the ID and date columns here 54d6ddf...a83a17e

Ah, surrounding identifiers in double quotes is a start, but the identifier also needs its double quotes escaped by way of doubling them up, e.g. `identifier.replace('"', '""')` before interpolation.
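For illustration, a small helper along these lines could combine both steps (the name `sanitize_identifier` is hypothetical, not this PR's API):

```python
def sanitize_identifier(identifier: str) -> str:
    # Wrap in double quotes and escape embedded double quotes by doubling them.
    return '"' + identifier.replace('"', '""') + '"'

# e.g. a user-supplied column name from --exclude-where:
column = 'my "odd" column'
query = f"SELECT {sanitize_identifier(column)} FROM metadata"
```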
> Yeah, I don't know where I got the false notion that a commit isn't necessary with `executemany()`. I'll do some experimenting with different data loading options as we discussed.

:+1:

The default transaction handling of Python's sqlite3 module is a bit unusual: it's not auto-commit, which is good, but it is auto-begin for DML, which is a weird combo. IME, it is easier to reason about transactions by disabling that default with `isolation_level = None` and using explicit transaction handling in our code. It's then clear when you need to issue a commit: when you've issued a begin earlier. This can be wrapped up into a context handler too (extending the default context manager for connections just a bit).
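A minimal sketch of that pattern, assuming `isolation_level=None`; the `transaction` helper name is illustrative, not from this PR:

```python
import sqlite3
from contextlib import contextmanager

# isolation_level=None disables sqlite3's implicit BEGIN, so every transaction
# boundary below is explicit.
connection = sqlite3.connect("metadata.sqlite3", isolation_level=None)

@contextmanager
def transaction(connection):
    connection.execute("BEGIN")
    try:
        yield connection
    except BaseException:
        connection.execute("ROLLBACK")
        raise
    else:
        connection.execute("COMMIT")

# A COMMIT is issued exactly when the earlier BEGIN succeeded.
with transaction(connection) as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS metadata (strain TEXT, date TEXT)")
    conn.executemany("INSERT INTO metadata VALUES (?, ?)", [("strainA", "2018-03-25")])
```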
> Mostly because I imagined the complexity of writing `get_date_max()` in SQL and thought it would be good to stick to existing logic implemented in Python. However, date parsing now takes up a decent amount of time (2nd to loading metadata), so it might be worth trying out a SQL implementation.

Yeah, the time the date table creation takes makes me think a lot of it is spent shuttling between SQLite (C) and Python and back for those functions. I agree `get_date_max()` is not enticing to write in SQL, but (especially initially) it doesn't have to be all or nothing: we can register `get_date_max()` from Python but use built-in functions for the others.
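To sketch that mixed approach (this `get_date_max` body is a simplified stand-in for augur's real logic, with no error handling):

```python
import calendar
import sqlite3

def get_date_max(date):
    # Resolve an ambiguous ISO-ish date ("2018", "2018-03-XX") to its latest
    # possible day; simplified stand-in for augur's actual implementation.
    parts = date.split("-")
    year = int(parts[0])
    month = 12 if len(parts) < 2 or "X" in parts[1] else int(parts[1])
    day = (calendar.monthrange(year, month)[1]
           if len(parts) < 3 or "X" in parts[2]
           else int(parts[2]))
    return f"{year:04d}-{month:02d}-{day:02d}"

connection = sqlite3.connect(":memory:")
connection.create_function("get_date_max", 1, get_date_max)

# Python UDF for the hard case, built-in strftime for complete dates:
print(connection.execute(
    "SELECT get_date_max('2018-03-XX'), strftime('%m', '2018-03-25')"
).fetchone())  # ('2018-03-31', '03')
```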
Force-pushing date parsing improvements (pre-rebase d3e466da...903d92ea)
Force-pushing date parsing improvements and force-include reporting (pre-rebase 903d92e...911e1379)
@tsibley

After some more experimenting and optimization of the current code, I've decided not to pursue native SQLite for the date logic:

- Caching the Python user-defined functions brings the total run time of date table creation for current GISAID data down from 58s without caching to 12s with caching on my machine. This should be sufficient. (See the sketch after this list.)
- Implementing it in SQLite is not very straightforward given the limited built-in functionality. To get year/month/day from an ISO 8601 date string:

  ```sql
  with t1 as (
      select
          substr("2018-03-25", 0, instr("2018-03-25", '-')) as year,
          substr("2018-03-25", instr("2018-03-25", '-') + 1) as no_year
  )
  select
      t1.year,
      substr(no_year, 0, instr(no_year, '-')) as month,
      substr(no_year, instr(no_year, '-') + 1) as day
  from t1;
  ```

  and this is without error handling (returning `NULL` if not parseable).
- Python functions have more potential to be re-used in other areas.
@victorlin Ok, sounds good. These are great candidates for caching, indeed.

That example SQL is a little more tortuous than I was expecting. I had been imagining a date/time function or string handling function to extract the parts from an ISO date, e.g.:

```
sqlite> select cast(strftime('%m', '2018-03-25') as int) as month;
3
```

Though I realize now it would be a little more complicated than that with date/time funcs, thanks to our need to handle forms like `2018-03-XX` and `2018` too. String handling (like the Python versions do) would sidestep those issues, but I see now that SQLite doesn't have a `split()` or equivalent (which is a bit boggling).
Force-pushing updated tests, SQL identifier sanitizing, connection handling improvements (pre-rebase 911e1379...3eb8c987)
I investigated some data loading options today:

- `executemany` with a generator (I was doing this wrong until 6e14640c 😬)
- `isolation_level=None` with explicit `begin`/`commit`
- `PRAGMA journal_mode = WAL`
- `PRAGMA synchronous = OFF`

Testing on my local machine, `executemany` without any non-default pragmas yields the fastest run time: the difference with `synchronous = OFF` is negligible, and I was surprised that `journal_mode = WAL` was slightly but noticeably slower.
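For concreteness, a rough sketch of that winning setup (the real `load_tsv()` in this branch differs; the table name and schema handling are simplified here):

```python
import csv
import sqlite3

def load_tsv(connection, path):
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        # Quote column names taken from the file (untrusted identifiers).
        columns = ", ".join('"' + c.replace('"', '""') + '"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        connection.execute(f"CREATE TABLE metadata ({columns})")
        # Passing the reader (an iterator) lets executemany stream rows
        # without materializing the whole file in memory.
        connection.executemany(f"INSERT INTO metadata VALUES ({placeholders})", reader)
    connection.commit()
```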
Due to the single-threaded, sequential nature of `augur filter` (load data -> compute intermediate tables -> write outputs), there is no need for the concurrent readers that seem to be why people use `PRAGMA journal_mode = WAL`. I don't see any logical way to restructure things so that concurrency would make sense, since there is no need to read until everything is loaded. The only benefit would be concurrent writers for a potential speed-up, which isn't worth much given the single write lock.
For proper auto-committing, I changed the code to use the default connection context manager rather than sharing one connection across calls.
There is potential to re-use the same metadata for cases like subsampling schemes in the ncov workflow, which prompts "concurrency", but for that I think it's easier to just create copies of the database file rather than sharing a single one:

- Current SQLite logic relies on creating intermediate tables with predefined names.
- Sharing a database file would require concurrent writes to the file, which is tricky.

The alternative process would require a new pre-compute `augur` command similar to `augur index`, and might look something like:

```sh
augur load metadata.tsv db1.sqlite3
cp db1.sqlite3 db2.sqlite3
# the following can be run in parallel
augur filter --metadata-db db1.sqlite3 --query 'country = "USA"' --output-strains usa.txt
augur filter --metadata-db db2.sqlite3 --query 'country != "USA"' --output-strains non-usa.txt
```

This would come in a later PR; the point is that there should be no need to worry about supporting concurrency in this PR.
Force-pushing additional tests, styling fixes, improved type hints and docstrings (pre-rebase 3eb8c987...a73aec30)
Force-pushing rebase onto master (equivalent merge commit 42370471c87fc4fecb1d16a0a086e8208344bdf2)
> …but for that I think it's easier to just create copies of the database file rather than sharing a single one:
> - Current SQLite logic relies on creating intermediate tables with predefined names.
> - Sharing a database file would require concurrent writes to the file which is tricky.

Instead of copying the database to avoid the two problems above, `augur filter` could `ATTACH` to the input db from within the "main" temporary/in-memory db it uses to do its work. This treats the input db as something conceptually separate from the transient working db (which is, roughly speaking, an implementation detail) and more like an input tsv, rather than treating the input db as a pre-loaded snapshot of the transient working db. The input db would want to be in WAL mode to support concurrent reads.
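A rough sketch of that idea (file and table names are illustrative):

```python
import sqlite3

# Intermediate tables with predefined names live in the transient in-memory
# "main" db; the shared input db is only ever read.
connection = sqlite3.connect(":memory:")
connection.execute("ATTACH DATABASE 'metadata.sqlite3' AS input_db")
connection.execute("""
    CREATE TABLE metadata_filtered AS
    SELECT * FROM input_db.metadata
""")
```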
Update: this stalled as I shifted gears to work on other projects. The current state is that it works, but opening it for formal review is pending:

- Incorporating newer changes (e.g. weekly grouping).
- Refactoring either the existing code or the proposed changes to have a consistent "internal API" between the two engines.
- Extensive testing to make sure results are comparable to `--engine pandas`.
- A good story for maintaining multiple engines, so that new features are added to both simultaneously (at least until one is deprecated) and it is not cumbersome to do so. This applies to both implementation and automated tests.