BPCells icon indicating copy to clipboard operation
BPCells copied to clipboard

[r][cpp] add pseudobulking to matrices

Open immanuelazn opened this issue 1 year ago • 4 comments

Description

Added function matrix_quantile_per_cell() to find the nth quantile value of each cell in a matrix. Allows for clipping using min_by_row() and min_by_col() Added pseudobulk_matrix() with option to clip by quantile, and to aggregate by non-zeros, mean, sum, variance.

Tests

  • For pseudobulk_matrix(), add tests for clipping, non-zeros, mean, sum, and variance in both transpose states.
  • For matrix_quantile_per_cell(), add test comparing against R quantile function across transpose states and quantile values

Details

I've iterated on pseudobulk_matrix() a few times as shown in commit history. I tried to be a little bit smarter by using matrix multiplies to utilize multi-threading. However, the solution for non-zeros and variance is probably non-optimal due to requiring the additional iterative functions.

These iterative functions are not multi-threaded, which makes me think I should have utilized a strategy like computeMatrixStats() in Concat{Cols,Rows}. I think a better way would be to create a child class inheriting from MatrixLoader that finds matrix subsets based off of cell grouping. Then utilize the default MatrixLoader<T>::computeMatrixStats() and manipulate it to fit the output matrix we're looking for. This would probably allow for use of threading, while also limiting the amount of duplicate code.

Follow-up checklist:

(Added during code review)

  • [ ] Move row/colQuantiles into a different file. Perhaps make a matrixStats.R file that collects all of the matrixStats interface functions
  • [ ] Split function documentation to have the matrixStats interfaces as a page, and then maybe split up the arithmetic functions into a few groups too

immanuelazn avatar Sep 18 '24 05:09 immanuelazn

Added in the following changes

  • R interrupt checking as well as removal of Rcpp code from physical logic has been finished for pseudobulk_matrix() and matrix_quantile_per_cell(). These functions now follow the same styling as other functions, with their own cpp and header files.
  • pseudobulk_matrix() calculates everything within a single pass, rather than separate passes for each aggregation method. Using inspiration from calculateMatrixStats()
  • pseudobulk_matrix() now returns a named list built from a struct, as we calculate the simpler aggregation methods during our pass anyways.
  • matrix_quantile_per_cell() has been restricted to usage in non transposed matrices. Should the user call this on a transposed matrix, they will be advised to call transpose_storage_order() first.
  • Simple typo fixes and docstring clarity changes

immanuelazn avatar Sep 21 '24 00:09 immanuelazn

change matrix_quantile_per_cell() to S3 generalizable colQuantile() function change colQuantile() to use type 7 quantile calculation change pseudobulk_matrix() to use a numeric clip_values representing quantile rather than boolean set to .99 change pseudobulk_matrix() to return a single matrix rather than a named list when only one method given remove matrix_quantile_per_row() fix problem with sum calculation in pseudobulk_matrix() when requesting more complex method various documentation changes

immanuelazn avatar Sep 27 '24 08:09 immanuelazn

Few notes:

  • Added rowQuantiles()
  • Change colQuantiles() to return a matrix if multiple probs are given
  • Change colQuantiles() to work with any of the continuous quantile algorithms (4-9)
  • Added tests for checking other types of quantiles, overall consolidation of quantile tests
  • Changed PseudobulkStatMethod to use bitlags
  • Changes pseudobulk_matrix() to not check the transpose conditional with every value check in the matrix. Also cleaned up the code and incorporated aforementioned bit flags

immanuelazn avatar Oct 08 '24 00:10 immanuelazn

I think I have addressed all comments, I also gave it a pass with manually checking everything. As for the two points that you have put up, I put them into an issuue within the projects page, for a new PR. Thanks for being so patient and detailed with your review Ben

immanuelazn avatar Oct 09 '24 22:10 immanuelazn

  • Changed quantile tests to use matrixStats functions, so we can ensure correct syntax and behaviour
  • Reverted PseudobulkStatsMethod bit flagging to no longer be cast as ints.
  • Transposed colQuantiles() to match matrixStats

immanuelazn avatar Oct 21 '24 22:10 immanuelazn