tabmat
tabmat copied to clipboard
`_split_col_subsets` ignores columns when non-monotonic
Maybe this is not the right function for what I am looking for. I expected this function to be perfect for the __getitem__ column subsetting but it's not working if the columns are not monotonic. PR #91 depends on fixing this issue.
Minimal working example:
import quantcore.matrix as mx
import numpy as np
a = mx.DenseMatrix(np.arange(16, dtype=float).reshape(4, 4))
sm = mx.SplitMatrix(matrices=[a])
sm._split_col_subsets([0, 2, 3])
# Correct - Out[6]: ([array([0, 1, 2], dtype=int32)], [array([0, 2, 3], dtype=int32)], 3)
ms._split_col_subsets([0, 3, 2])
# Incorrect - Out[9]: ([array([0, 1], dtype=int32)], [array([0, 3], dtype=int32)], 3)
# Expected: ([array([0, 1, 2], dtype=int32)], [array([0, 3, 2], dtype=int32)], 3)
@tbenthompson, if you could look into this I would like to get your opinion on whether we should spend the time fixing the _split_col_subsets method or write a new method for the column subsetting.
Yes, you're correct that _split_col_subsets currently assumes the input columns are ordered monotonically. One thing we could do is sort the columns first, then call _split_col_subsets and then "unsort" the outputs. That would probably be the easiest thing to do. In order to minimize the amount of Cython, it might be nice to implement that sort/unsort step in a python wrapper function rather than directly in the Cython code.
I could help with this tomorrow or soon. Let me know if that would be helpful.
I actually ran into this same issue from the other "end" recently when I fixed some bugs with subsets that were not monotonically specified. So this same bug is a problem if either:
- The
colsto choose aren't monotonic. Or - The
subset_cols_indicesaren't monotonic.
I fixed the subset_cols_indices problem by correcting the function that was creating the SplitMatrix and then requiring that input indices be monotonic.
Sorry about not realizing at the time that the cols problem would also be an issue!
My PR currently contains a fix.
My solution was to create a column_map attribute to the SplitMatrix object. It stores the mapping from index location to actual location in the matrices.
Then, creating a _split_col_subsets-like function that supports unordered columns and duplicate column was much more easy. I also modified the getcol method to use this.
We can discuss this in the coming days. Downside to this is that we need to store an array of size n_cols x 2, but I think it's worth it.
@MarcAntoineSchmidtQC did this get fixed or did it get dropped when you closed your PR? Is it worth reviving just this fix?