qurro
Add parameter for taking only the n most "extreme" features
Motivation
For the EMP dataset I'm working with right now, it takes a super long time to convert the `biom.Table` object to a dataframe (even a sparse one). Filtering out certain features would make this easier. (Thanks to Cameron for the suggestion.)
~~...That being said, it looks like getting the `matrix_data` (which is a scipy CSR sparse matrix) from the Table is actually pretty fast. If we can use that instead of a dataframe for the table, that would make things a lot faster (and might enable us to delay implementing the feature filtering option).~~ UPDATE: Yeah, that does make loading the table to a DF faster in python, but 1) matching the table to the sample metadata still takes a really long time and 2) we'll still eventually have to throw that data into JSON for the JS side of things... and I doubt that'd go over well if we throw in a sparse version of the data.
Implementation
~~I guess this should also be accompanied by another parameter that lets the user specify which ranks to care about, if there are multiple.~~ UPDATE: a better idea is to just find the extreme features for each rank. So if the user says, e.g., `-x 10`, then get the top 10 and bottom 10 features (20 in total) for each of the available ranks.
Corollaries to this:
- If a sample doesn't contain any of the remaining features, it should be removed from the data. (It wouldn't show up in the scatterplot anyway, since its log ratio of any combination of the remaining features would always be `log(0/0)`. So there's no reason not to remove it.)
- If a feature is an "extreme" for multiple ranks, count it multiple times. (I'm not 100% sure if this is the best course of action, but it should be easy-ish to change this once this has been implemented.)
- If there are fewer than (the `-x` value * 2) ranked features, don't do anything different. (This should result in a warning being printed, I guess.)
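
The behavior described above could be sketched roughly as follows. This is only an illustration, not the actual implementation: the function name `filter_extreme_features_sketch` is hypothetical, and it assumes the table is already a pandas `DataFrame` (features as rows, samples as columns) rather than a `biom.Table`. A feature that is extreme in several rankings takes up a slot in each of them, so the union of kept features can be smaller than `x * 2` times the number of rankings. The "do nothing" check uses `x * 2 >= (# of features)`, which is also an assumption.

```python
import warnings

import pandas as pd


def filter_extreme_features_sketch(table, ranks, x):
    """Keep only the x lowest- and x highest-ranked features of each ranking.

    table: DataFrame with features as rows and samples as columns
           (an assumption; the real input would be a biom.Table).
    ranks: DataFrame with features as rows and one column per ranking.
    x:     number of features to keep from each end of each ranking.
    """
    if isinstance(x, bool) or not isinstance(x, int) or x < 1:
        raise ValueError("x must be an integer >= 1")

    # Too few ranked features to filter: warn and leave the data untouched.
    if x * 2 >= len(ranks.index):
        warnings.warn("x * 2 >= number of ranked features; nothing was filtered.")
        return table, ranks

    extreme = set()
    for col in ranks.columns:
        # mergesort is stable, so ties keep their original relative order.
        ordered = ranks[col].sort_values(kind="mergesort")
        extreme.update(ordered.index[:x])   # x lowest-ranked features
        extreme.update(ordered.index[-x:])  # x highest-ranked features

    # Preserve the original feature order.
    keep = [f for f in ranks.index if f in extreme]
    filtered_table = table.loc[keep]

    # Drop samples that have no counts for any remaining feature.
    nonempty = (filtered_table != 0).any(axis=0)
    return filtered_table.loc[:, nonempty], ranks.loc[keep]
```

With a table of six features and `x = 1`, this keeps at most two features per ranking and removes any sample whose remaining counts are all zero.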
Progress
- [x] Add `-x` option to standalone rrv
- [x] Add `-x` option to Q2 rrv
- [x] Add function `filter_unextreme_features()` in `_rank_processing.py` that converts a BIOM table object to a table with the non-extreme features filtered out, and also filters out now-empty samples.
  - its parameters should be 1) the original table (as just a `biom.Table`), 2) the feature rankings (in a `DataFrame`), and 3) the `-x` value (an `int`).
- [ ] Add unit tests for `filter_unextreme_features()`:
  - [x] Standard situation: `x * 2 < (# of features)`, no empty samples, etc.
  - [x] Empty samples
  - [x] `x * 2 >= (# of features)`
  - [x] `x < 1`, or `type(x) != int`
  - [x] When feature(s) are in multiple rankings' extrema
  - [ ] When there are "ties" in extreme rankings that surpass `x` (e.g. the least extreme extreme feature on one end has the exact same rank as another feature). I think if we just represent retrieval of features as slicing off the ends of a sorted array of features this should be fine. Maybe not like perfect, but it should at least be deterministic?
- [ ] Update the HTML interface when `-x` has been used to let the user know what's going on (so they don't think the features are just weirdly binary for some reason).
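
On the "ties" item above: one way to make the slicing approach deterministic is sketched below. It is just an illustration, and the helper name `extreme_ids` is hypothetical. It breaks ties by feature ID, which is an assumption rather than anything decided in this issue; the point is only that the result no longer depends on the order in which features arrive.

```python
def extreme_ids(rank_by_feature, x):
    """Return (x lowest, x highest) feature IDs for a single ranking.

    rank_by_feature: dict mapping feature ID -> rank value.
    Ties in rank are broken by feature ID (an assumed convention), so the
    result is the same regardless of the input's insertion order.
    """
    ordered = sorted(rank_by_feature, key=lambda f: (rank_by_feature[f], f))
    # Slice off both ends of the sorted list of feature IDs.
    return ordered[:x], ordered[-x:]


# Three features share the rank 1.0, but only one of them fits in each end.
ranks = {"F3": 1.0, "F1": 1.0, "F2": -2.0, "F4": 3.0, "F5": 1.0}
low, high = extreme_ids(ranks, 2)
```

Here `low` is `["F2", "F1"]` and `high` is `["F5", "F4"]`, and shuffling the input dict gives the same answer.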