[r][cpp] add pseudobulking to matrices
Description
Added function matrix_quantile_per_cell() to find the nth quantile value of each cell in a matrix. Allows for clipping using min_by_row() and min_by_col()
Added pseudobulk_matrix() with option to clip by quantile, and to aggregate by non-zeros, mean, sum, variance.
Tests
- For
pseudobulk_matrix(), add tests for clipping, non-zeros, mean, sum, and variance in both transpose states. - For
matrix_quantile_per_cell(), add test comparing against R quantile function across transpose states and quantile values
Details
I've iterated on pseudobulk_matrix() a few times as shown in commit history. I tried to be a little bit smarter by using matrix multiplies to utilize multi-threading. However, the solution for non-zeros and variance is probably non-optimal due to requiring the additional iterative functions.
These iterative functions are not multi-threaded, which makes me think I should have utilized a strategy like computeMatrixStats() in Concat{Cols,Rows}. I think a better way would be to create a child class inheriting from MatrixLoader that finds matrix subsets based off of cell grouping. Then utilize the default MatrixLoader<T>::computeMatrixStats() and manipulate it to fit the output matrix we're looking for. This would probably allow for use of threading, while also limiting the amount of duplicate code.
Follow-up checklist:
(Added during code review)
- [ ] Move row/colQuantiles into a different file. Perhaps make a
matrixStats.Rfile that collects all of the matrixStats interface functions - [ ] Split function documentation to have the
matrixStatsinterfaces as a page, and then maybe split up the arithmetic functions into a few groups too
Added in the following changes
- R interrupt checking as well as removal of Rcpp code from physical logic has been finished for
pseudobulk_matrix()andmatrix_quantile_per_cell(). These functions now follow the same styling as other functions, with their own cpp and header files. pseudobulk_matrix()calculates everything within a single pass, rather than separate passes for each aggregation method. Using inspiration fromcalculateMatrixStats()pseudobulk_matrix()now returns a named list built from a struct, as we calculate the simpler aggregation methods during our pass anyways.matrix_quantile_per_cell()has been restricted to usage in non transposed matrices. Should the user call this on a transposed matrix, they will be advised to calltranspose_storage_order()first.- Simple typo fixes and docstring clarity changes
change matrix_quantile_per_cell() to S3 generalizable colQuantile() function
change colQuantile() to use type 7 quantile calculation
change pseudobulk_matrix() to use a numeric clip_values representing quantile rather than boolean set to .99
change pseudobulk_matrix() to return a single matrix rather than a named list when only one method given
remove matrix_quantile_per_row()
fix problem with sum calculation in pseudobulk_matrix() when requesting more complex method
various documentation changes
Few notes:
- Added
rowQuantiles() - Change
colQuantiles()to return a matrix if multiple probs are given - Change
colQuantiles()to work with any of the continuous quantile algorithms (4-9) - Added tests for checking other types of quantiles, overall consolidation of quantile tests
- Changed
PseudobulkStatMethodto use bitlags - Changes
pseudobulk_matrix()to not check the transpose conditional with every value check in the matrix. Also cleaned up the code and incorporated aforementioned bit flags
I think I have addressed all comments, I also gave it a pass with manually checking everything. As for the two points that you have put up, I put them into an issuue within the projects page, for a new PR. Thanks for being so patient and detailed with your review Ben
- Changed quantile tests to use
matrixStatsfunctions, so we can ensure correct syntax and behaviour - Reverted
PseudobulkStatsMethodbit flagging to no longer be cast as ints. - Transposed
colQuantiles()to matchmatrixStats