scanpy hvg flavors seurat and cellranger with batch: bug in subset

[x] Closes #3027
[x] Tests included or not required because:

[ ] Release notes not necessary because:

This PR fixes the bug reported in the linked issue.

A new test which spots the erroneous computations has been added.

I would use this chance to refactor the _highly_variable_genes.py, rather than using the 2-lines fix suggested in the first commit: Doing the multi-batch hvg flagging differently for seurat_v3 and seurat/cell_ranger is what made this bug hard to spot in the first place I think.

May 02 '24 14:05 eroell

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 76.31%. Comparing base (896e249) to head (13b1f6c). Report is 50 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3042   +/-   ##
=======================================
  Coverage   76.31%   76.31%           
=======================================
  Files         109      109           
  Lines       12513    12515    +2     
=======================================
+ Hits         9549     9551    +2     
  Misses       2964     2964

Files with missing lines	Coverage Δ
src/scanpy/preprocessing/_highly_variable_genes.py	`95.23% <100.00%> (+0.03%)`	:arrow_up:

May 02 '24 14:05 codecov[bot]

Looks simple enough! Please deduplicate the tests though, they have too many identical lines.

May 03 '24 11:05 flying-sheep

Please deduplicate the tests though, they have too many identical lines.

To do so did across-setting tests with for loop on top of former test...

Would you prefer one separate test with that for loop for the across-settings check?

May 21 '24 06:05 eroell

To test in the future that this bug doesnt happen, the 4 combinations of inplace=True/False and subset=True/False need to be compared for their selected var_names - if these arguments were be in pytest.mark.parametrize, the different produced anndatas don't live at the same time and can't be compared...

Could

extract this specific section into a new, dedicated smaller test

    # check that the results are consistent for subset True/False: inplace True
    adata_post_subset = adatas["subset_False_inplace_True"][
        :, adatas["subset_False_inplace_True"].var["highly_variable"]
    ]
    assert adata_post_subset.var_names.equals(
        adatas["subset_True_inplace_True"].var_names
    )

    # check that the results are consistent for subset True/False: inplace False
    df_post_subset = dfs["subset_False_inplace_False"][
        dfs["subset_False_inplace_False"]["highly_variable"]
    ]
    assert df_post_subset.index.equals(dfs["subset_True_inplace_False"].index)

    # check that the results are consistent for inplace True/False: subset True
    assert adatas["subset_True_inplace_True"].var_names.equals(
        dfs["subset_True_inplace_False"].index
    )

hardcode the known-to-be-selected features and check against them

what's your preference here?

Jun 27 '24 07:06 eroell

Ah, silly me, this makes sense.

Since some entries of the dicts are never used, I just removed them and replaced the string with a bool (subset=True/False). This makes it all quite a bit more compact.

Also please remember itertools: If we’re not writing numba code, it’s always preferable to use it as opposed to nesting for loops.

Jun 27 '24 13:06 flying-sheep

Timeout and Scrublet failing in Python3.12?

Jun 27 '24 15:06 eroell

Ah on main too I see

Jun 27 '24 15:06 eroell

coverage decreased, I think not detected because some pytest.parametrize were removed? I can also add the new specific test separately do avoid this, else it looks good to me

Jun 27 '24 15:06 eroell

coverage decreased, I think not detected because some pytest.parametrize were removed?

I think codecov just hadn’t updated its comment yet when you saw that.

What you say can’t be, it doesn’t matter how a line was hit: If a line is run, it’ll be reported as hit, if your changes would have caused it to no longer be it, it would have been reported as a miss.

Jun 28 '24 08:06 flying-sheep

scanpy scanpy copied to clipboard

hvg flavors seurat and cellranger with batch: bug in subset

Codecov Report

scanpy
scanpy copied to clipboard