nanocompore
nanocompore copied to clipboard
Ongoing transition to DB backend (SQLite)
Changes to Whitelist.py coming up next.
Whitelist
is updated and seems to work, but the read references are now database keys, so SampComp
needs updating as well.
Hi @tleonardi!
Totally agree about moving the DB logic. In fact I've already done that, after I realised that much of the same functionality would be needed in SampComp
. I just wasn't sure about whether it should go into DataStore
; for the moment I've moved it to a new class DatabaseWrapper
. Is DataStore
intended to specifically handle storing of data? It currently creates the database if that doesn't exist and adds tables, which isn't ideal for just read access.
Do we want one class that handles general DB functionality? Or one lower-level/read access class plus DataStore
for writing? (My intention was to adapt DataStore
to use DatabaseWrapper
, but perhaps combining them into one class would indeed be cleaner.)
I'll push my updates from last night so you can have a look.
One more point: I've implemented the read-level filtering in Whitelist
via constraints in the database query. So the database and filtering are very closely linked. I think it makes sense to keep this code in Whitelist
, but it's not agnostic to the storage backend. We could move the code to the DB class, but then that class will contain quite a bit of the filtering logic. (I assume that it's more efficient to filter during the query than after it, but haven't tested that.)
I went ahead and put all the database code in DataStore
. I still need to test it and probably fix some bugs, but I wanted to avoid you looking at an outdated version.
I've made some more changes. SampComp
now works with data from the SQLite DB, including the statistical tests in txCompare
. It fails when SampCompDB
gets involved to handle the results, but the plan was anyway to replace SampCompDB
and shelve
with more SQLite. That I still need to do.
Started implementing SQLite output of SampComp
/txCompare
results. Work in progress, not yet tested.
Hi @tleonardi! Some points for discussion:
- It would be good to store the design matrix (samples/conditions) in the database. The slight hitch is that
Eventalign_collapse
currently creates the "samples" table, but it doesn't know what the conditions are (because it doesn't need to). So one option would be to pass this information toEventalign_collapse
just so it can be written out. Another option would be to update the table inSampComp
, but I'd quite like to keep the separation between input and output database files for the different steps (and "samples" would be in the input forSampComp
). - In the database design for
SampComp
results, I'm currently using one row per position (kmer) in a transcript, with columns for all the different p-values etc. Does it actually make sense to have intensity/dwell time test results for three different tests (t, KS, MW) at the same time? Or should that rather be an alternative? - You mentioned previously that ideally the models (GMM, logit) should be stored. I assume it would be possible to store pickle results in a blob, but I haven't implemented that. How often are these models actually used again? I'm thinking maybe a better option would be to make it easy to recreate an individual model on demand, if it's only going to be needed very occasionally.
Regarding the 2nd point above, I think I'll split the "test_results" table into three - for GMM, ANOVA and univariate (t/KS/MW-test) results. What do you think?
I just pushed more udpates. In principle, the whole pipeline (Eventalign_collapse -> Whitelist -> SampComp -> save_report
) should work now. Combining adjacent p-values in a sequence context and multiple testing correction are supported and (lightly) tested, but a lot of the other options aren't tested and may still need updating. In the export class (now called PostProcess
- should probably rename to DataExport
), save_to_bed
isn't updated yet. __main__.py
isn't updated yet, either.
Oh, SampCompDB.py
should be obsolete now and can probably be removed.
@tleonardi: As discussed I've updated the CLI options. Input files (e.g. "-i") must be specified as full paths, but for output files there's the option to specify a directory ("-d") and use the default filename. I've commented out the YAML input option for SampComp, but if you think it's useful just put it back in. There are some other changes (e.g. due to the simplification of tests performed by SampComp) but they should be straightforward.
I've done some light testing and it works on my example data. (Unfortunately the problem that TxComp spawns way too many threads persists for me.)