Add parameter for taking only the n most "extreme" features

Open fedarko opened this issue 5 years ago • 0 comments

Motivation

For the EMP dataset I'm working with right now, it takes a super long time to convert the biom.Table object to a dataframe (even a sparse one). Filtering out certain features would make this easier. (Thanks to Cameron for the suggestion.)

~~...That being said, it looks like getting the matrix_data (which is a scipy CSR sparse matrix) from the Table is actually pretty fast. If we can use that instead of a dataframe for the table, that would make things a lot faster (and might enable us to delay implementing the feature filtering option).~~ UPDATE: Yeah, that does make loading the table to a DF faster in Python, but 1) matching the table to the sample metadata still takes a really long time and 2) we'll still eventually have to throw that data into JSON for the JS side of things... and I doubt that'd go over well if we throw in a sparse version of the data.

Implementation

~~I guess this should also be accompanied by another parameter that lets the user specify which ranks to care about, if there are multiple.~~ UPDATE: a better idea is to just find the extreme features for each rank. So if the user says, e.g., -x 10, then get the top 10 and bottom 10 features (20 in total) for each of the ranks available.

Corollaries to this:

  • If a sample doesn't contain any of the remaining features, it should be removed from the data. (It wouldn't show up in the scatterplot anyway, since its log ratio of any combination of the remaining features would always be log(0/0). So there's no reason not to remove it.)
  • If a feature is an "extreme" for multiple ranks, count it multiple times. (I'm not 100% sure if this is the best course of action, but it should be easy-ish to change this once this has been implemented.)
  • If there are fewer than (the -x value * 2) ranked features, don't do anything different. (This should result in a warning being printed, I guess.)
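
The corollaries above can be sketched roughly as follows. This is only an illustration under some assumptions, not the actual implementation: the table is a plain pandas DataFrame (features as rows, samples as columns) standing in for the biom.Table, the function name `filter_extreme_features_sketch` is hypothetical, and "count it multiple times" is read as "a feature in several ranks' extremes uses up a slot in each of them, but is only kept once". The "not enough features" check uses x * 2 >= (# of features), since filtering would keep every feature in that case anyway.

```python
import warnings

import pandas as pd


def filter_extreme_features_sketch(table, ranks, x):
    """Keep only the x lowest- and x highest-ranked features per ranking.

    table: DataFrame of counts, features as rows and samples as columns
           (an assumed stand-in for the biom.Table in this issue).
    ranks: DataFrame of feature rankings, features as rows and one
           column per ranking.
    x:     number of features to keep from each end of each ranking.
    """
    if isinstance(x, bool) or not isinstance(x, int) or x < 1:
        raise ValueError("x must be an integer >= 1")

    if x * 2 >= len(ranks.index):
        # Filtering would keep every feature, so leave the data alone.
        warnings.warn("x * 2 >= number of ranked features; not filtering.")
        return table, ranks

    keep = set()
    for ranking in ranks.columns:
        # mergesort is stable, so tied features keep their input order.
        ordered = ranks[ranking].sort_values(kind="mergesort").index
        keep.update(ordered[:x])   # x lowest-ranked features
        keep.update(ordered[-x:])  # x highest-ranked features

    # Preserve the original feature order in the outputs.
    kept_features = [f for f in ranks.index if f in keep]
    filtered_table = table.loc[kept_features]

    # Drop samples with no counts for any of the remaining features.
    nonempty = (filtered_table != 0).any(axis=0)
    return filtered_table.loc[:, nonempty], ranks.loc[kept_features]
```

With two rankings that are mirror images of each other and x = 1, the same two features are extreme in both rankings, so only two features survive, and any sample with zero counts for both of them is dropped.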

Progress

  • [x] Add -x option to standalone rrv
  • [x] Add -x option to Q2 rrv
  • [x] Add function filter_unextreme_features() in _rank_processing.py that converts a BIOM table object to a table with the non-extreme features filtered out, and also filters out now-empty samples.
    • its parameters should be 1) the original table (as just a biom.Table), 2) the feature rankings (in a DataFrame), and 3) the -x value (an int).
  • [ ] Add unit tests for filter_unextreme_features():
    • [x] Standard situation: x * 2 < (# of features), no empty samples, etc.
    • [x] Empty samples
    • [x] x * 2 >= (# of features)
    • [x] x < 1, or type(x) != int
    • [x] When feature(s) are in multiple rankings' extrema
    • [ ] When there are "ties" in extreme rankings that surpass x (e.g. the least extreme extreme feature on one end has the exact same rank as another feature). I think if we just represent retrieval of features as slicing off the ends of a sorted array of features this should be fine. Maybe not like perfect, but it should at least be deterministic?
  • [ ] Update the HTML interface when -x has been used to let the user know what's going on (so they don't think the features are just weirdly binary for some reason).
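
For the "ties" item in the checklist above, one way to keep the slicing approach deterministic is to sort on the rank value first and the feature ID second. This is a minimal sketch, not something decided in this issue: the tie-break on feature ID is an assumption, and `extreme_features` plus the feature IDs below are hypothetical names used only for illustration.

```python
def extreme_features(rank_by_feature, x):
    """Return (x lowest, x highest) feature IDs from a {feature: rank} dict."""
    # Sort by rank, breaking ties on the feature ID, so the result does
    # not depend on the order in which features were inserted.
    ordered = sorted(rank_by_feature, key=lambda f: (rank_by_feature[f], f))
    return ordered[:x], ordered[-x:]


# F1 and F5 tie for the lowest rank; F2 and F3 tie in the middle.
ranks = {"F3": 0.5, "F1": -2.0, "F2": 0.5, "F4": 3.0, "F5": -2.0}
lowest, highest = extreme_features(ranks, 1)
```

Here `lowest` is `["F1"]` rather than `["F5"]` purely because of the ID tie-break. That is arbitrary, but it gives the same answer no matter how the input happens to be ordered.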

fedarko · Apr 30 '19 01:04