scanpy
scanpy copied to clipboard
hvg flavors seurat and cellranger with batch: bug in subset
- [x] Closes #3027
- [x] Tests included or not required because:
- [ ] Release notes not necessary because:
This PR fixes the bug reported in the linked issue.
A new test which spots the erroneous computations has been added.
I would use this chance to refactor the _highly_variable_genes.py
, rather than using the 2-lines fix suggested in the first commit:
Doing the multi-batch hvg flagging differently for seurat_v3 and seurat/cell_ranger is what made this bug hard to spot in the first place I think.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 76.31%. Comparing base (
896e249
) to head (13b1f6c
). Report is 50 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #3042 +/- ##
=======================================
Coverage 76.31% 76.31%
=======================================
Files 109 109
Lines 12513 12515 +2
=======================================
+ Hits 9549 9551 +2
Misses 2964 2964
Files with missing lines | Coverage Δ | |
---|---|---|
src/scanpy/preprocessing/_highly_variable_genes.py | 95.23% <100.00%> (+0.03%) |
:arrow_up: |
Looks simple enough! Please deduplicate the tests though, they have too many identical lines.
Please deduplicate the tests though, they have too many identical lines.
To do so did across-setting tests with for loop on top of former test...
Would you prefer one separate test with that for loop for the across-settings check?
To test in the future that this bug doesnt happen, the 4 combinations of inplace=True/False and subset=True/False need to be compared for their selected var_names
- if these arguments were be in pytest.mark.parametrize
, the different produced anndatas don't live at the same time and can't be compared...
Could
- extract this specific section into a new, dedicated smaller test
# check that the results are consistent for subset True/False: inplace True
adata_post_subset = adatas["subset_False_inplace_True"][
:, adatas["subset_False_inplace_True"].var["highly_variable"]
]
assert adata_post_subset.var_names.equals(
adatas["subset_True_inplace_True"].var_names
)
# check that the results are consistent for subset True/False: inplace False
df_post_subset = dfs["subset_False_inplace_False"][
dfs["subset_False_inplace_False"]["highly_variable"]
]
assert df_post_subset.index.equals(dfs["subset_True_inplace_False"].index)
# check that the results are consistent for inplace True/False: subset True
assert adatas["subset_True_inplace_True"].var_names.equals(
dfs["subset_True_inplace_False"].index
)
- hardcode the known-to-be-selected features and check against them
what's your preference here?
Ah, silly me, this makes sense.
Since some entries of the dicts are never used, I just removed them and replaced the string with a bool (subset=True/False). This makes it all quite a bit more compact.
Also please remember itertools
: If we’re not writing numba code, it’s always preferable to use it as opposed to nesting for loops.
Timeout and Scrublet failing in Python3.12?
Ah on main too I see
coverage decreased, I think not detected because some pytest.parametrize were removed? I can also add the new specific test separately do avoid this, else it looks good to me
coverage decreased, I think not detected because some pytest.parametrize were removed?
I think codecov just hadn’t updated its comment yet when you saw that.
What you say can’t be, it doesn’t matter how a line was hit: If a line is run, it’ll be reported as hit, if your changes would have caused it to no longer be it, it would have been reported as a miss.