nanocompore icon indicating copy to clipboard operation
nanocompore copied to clipboard

Ongoing transition to DB backend (SQLite)

Open hendrikweisser opened this issue 3 years ago • 11 comments

Changes to Whitelist.py coming up next.

hendrikweisser avatar Mar 11 '21 18:03 hendrikweisser

Whitelist is updated and seems to work, but the read references are now database keys, so SampComp needs updating as well.

hendrikweisser avatar Mar 12 '21 19:03 hendrikweisser

Hi @tleonardi! Totally agree about moving the DB logic. In fact I've already done that, after I realised that much of the same functionality would be needed in SampComp. I just wasn't sure about whether it should go into DataStore; for the moment I've moved it to a new class DatabaseWrapper. Is DataStore intended to specifically handle storing of data? It currently creates the database if that doesn't exist and adds tables, which isn't ideal for just read access.

Do we want one class that handles general DB functionality? Or one lower-level/read access class plus DataStore for writing? (My intention was to adapt DataStore to use DatabaseWrapper, but perhaps combining them into one class would indeed be cleaner.)

I'll push my updates from last night so you can have a look.

hendrikweisser avatar Mar 17 '21 12:03 hendrikweisser

One more point: I've implemented the read-level filtering in Whitelist via constraints in the database query. So the database and filtering are very closely linked. I think it makes sense to keep this code in Whitelist, but it's not agnostic to the storage backend. We could move the code to the DB class, but then that class will contain quite a bit of the filtering logic. (I assume that it's more efficient to filter during the query than after it, but haven't tested that.)

hendrikweisser avatar Mar 17 '21 12:03 hendrikweisser

I went ahead and put all the database code in DataStore. I still need to test it and probably fix some bugs, but I wanted to avoid you looking at an outdated version.

hendrikweisser avatar Mar 17 '21 20:03 hendrikweisser

I've made some more changes. SampComp now works with data from the SQLite DB, including the statistical tests in txCompare. It fails when SampCompDB gets involved to handle the results, but the plan was anyway to replace SampCompDB and shelve with more SQLite. That I still need to do.

hendrikweisser avatar Mar 23 '21 20:03 hendrikweisser

Started implementing SQLite output of SampComp/txCompare results. Work in progress, not yet tested.

hendrikweisser avatar Mar 29 '21 19:03 hendrikweisser

Hi @tleonardi! Some points for discussion:

  • It would be good to store the design matrix (samples/conditions) in the database. The slight hitch is that Eventalign_collapse currently creates the "samples" table, but it doesn't know what the conditions are (because it doesn't need to). So one option would be to pass this information to Eventalign_collapse just so it can be written out. Another option would be to update the table in SampComp, but I'd quite like to keep the separation between input and output database files for the different steps (and "samples" would be in the input for SampComp).
  • In the database design for SampComp results, I'm currently using one row per position (kmer) in a transcript, with columns for all the different p-values etc. Does it actually make sense to have intensity/dwell time test results for three different tests (t, KS, MW) at the same time? Or should that rather be an alternative?
  • You mentioned previously that ideally the models (GMM, logit) should be stored. I assume it would be possible to store pickle results in a blob, but I haven't implemented that. How often are these models actually used again? I'm thinking maybe a better option would be to make it easy to recreate an individual model on demand, if it's only going to be needed very occasionally.

hendrikweisser avatar Mar 31 '21 14:03 hendrikweisser

Regarding the 2nd point above, I think I'll split the "test_results" table into three - for GMM, ANOVA and univariate (t/KS/MW-test) results. What do you think?

hendrikweisser avatar Mar 31 '21 17:03 hendrikweisser

I just pushed more udpates. In principle, the whole pipeline (Eventalign_collapse -> Whitelist -> SampComp -> save_report) should work now. Combining adjacent p-values in a sequence context and multiple testing correction are supported and (lightly) tested, but a lot of the other options aren't tested and may still need updating. In the export class (now called PostProcess - should probably rename to DataExport), save_to_bed isn't updated yet. __main__.py isn't updated yet, either.

hendrikweisser avatar Jul 23 '21 19:07 hendrikweisser

Oh, SampCompDB.py should be obsolete now and can probably be removed.

hendrikweisser avatar Jul 23 '21 19:07 hendrikweisser

@tleonardi: As discussed I've updated the CLI options. Input files (e.g. "-i") must be specified as full paths, but for output files there's the option to specify a directory ("-d") and use the default filename. I've commented out the YAML input option for SampComp, but if you think it's useful just put it back in. There are some other changes (e.g. due to the simplification of tests performed by SampComp) but they should be straightforward.

I've done some light testing and it works on my example data. (Unfortunately the problem that TxComp spawns way too many threads persists for me.)

hendrikweisser avatar Aug 17 '21 10:08 hendrikweisser