
More robust name validation

aeisenbarth opened this pull request 1 year ago · 1 comment

Closes #624

  • This pull request changes name validation rules:
    • additionally allow . (now allowing _, -, . and alphanumeric characters, which includes 0-9a-zA-Z but also other Unicode characters such as ɑ and ²)
    • forbid the full names . and ..
    • forbid the prefix __
    • forbid names that differ only in character case, like abc and Abc (only one of them is allowed, no matter which case)
  • Name validation is now also applied to AnnData tables (keys/columns in obs, obsm, obsp, var, varm, varp, uns).
    • For obs and var dataframes, _index is forbidden.
  • Validation happens at construction time when adding elements to an element type dictionary (as before).
  • Additionally, validation happens before writing to Zarr.
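The rules above could be sketched roughly as follows. This is an illustrative sketch only; the function names and exact error messages are assumptions, not the actual spatialdata implementation:

```python
def check_valid_name(name: str) -> None:
    """Sketch of the name rules listed above (assumed API, not spatialdata's)."""
    if not isinstance(name, str):
        raise TypeError(f"Name must be a string, not {type(name).__name__}.")
    if name in ("", ".", ".."):
        raise ValueError("Name cannot be empty, '.' or '..'.")
    if name.startswith("__"):
        raise ValueError("Name cannot start with the prefix '__'.")
    # str.isalnum() also accepts other Unicode alphanumerics such as 'ɑ' and '²'.
    if not all(c.isalnum() or c in "_-." for c in name):
        raise ValueError(
            "Name must contain only alphanumeric characters, underscores, "
            "dots and hyphens."
        )

def check_no_case_collisions(names: list[str]) -> None:
    """Only one of e.g. 'abc' and 'Abc' may exist, no matter which case."""
    lowered = [n.lower() for n in names]
    if len(lowered) != len(set(lowered)):
        raise ValueError("Names must not differ only in character case.")
```

Note that Python's `str.isalnum()` is Unicode-aware, which is why characters like ɑ and ² pass the check.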

aeisenbarth avatar Sep 09 '24 14:09 aeisenbarth

Codecov Report

Attention: Patch coverage is 98.27586% with 3 lines in your changes missing coverage. Please review.

Project coverage is 91.93%. Comparing base (39a10a1) to head (1dd32a5). Report is 34 commits behind head on main.

Files with missing lines Patch % Lines
src/spatialdata/_core/spatialdata.py 93.47% 3 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #703      +/-   ##
==========================================
+ Coverage   91.72%   91.93%   +0.20%     
==========================================
  Files          46       47       +1     
  Lines        7166     7290     +124     
==========================================
+ Hits         6573     6702     +129     
+ Misses        593      588       -5     
Files with missing lines Coverage Δ
src/spatialdata/_core/_elements.py 92.22% <100.00%> (-0.09%) :arrow_down:
src/spatialdata/_core/validation.py 100.00% <100.00%> (ø)
src/spatialdata/models/__init__.py 100.00% <100.00%> (ø)
src/spatialdata/models/models.py 87.76% <100.00%> (-0.08%) :arrow_down:
src/spatialdata/_core/spatialdata.py 91.26% <93.47%> (+0.65%) :arrow_up:

codecov[bot] avatar Sep 09 '24 16:09 codecov[bot]

Excellent PR @aeisenbarth, thank you!

I performed my code review and applied the code changes directly. I list them here:

  • I also added a check for the layers of tables (and updated the design docs and tests accordingly)
  • There was a bug in _validate_all_elements(): it should be element_type == 'tables' (instead of 'table')
    • That if condition was not covered by tests, so I added a test for that
  • In test_spatialdata_operations.py, some checks for tables were missing (due to old code that expected a single table); I updated that
  • Same for some code in test_readwrite.py
  • In test_writing_invalid_name(), a test for labels was commented out; I uncommented it
  • I extended test_writing_invalid_name() to consider:
    • writing of a table with a valid name but invalid "subnames" (this is the test I mentioned above that was not covering the "table" vs "tables" bug)
    • incremental writing of single elements (before, validation of table "subnames" was triggered only on write(); now also on write_element()).
  • I now trigger the name validation also on TableModel().validate() and not just on TableModel().parse(). I added tests for that in test_models.py

LucaMarconato avatar Sep 28 '24 11:09 LucaMarconato

Please double-check my code changes, and if you agree with them (or after your edits), let's merge 😊

LucaMarconato avatar Sep 28 '24 11:09 LucaMarconato

The explanation in the Discussions on how to read datasets with naming problems is great! One minor todo:

  • [ ] add a link to the discussion https://github.com/scverse/spatialdata/discussions/707 in the exception that the code raises when reading a dataset with naming problems.

LucaMarconato avatar Sep 30 '24 18:09 LucaMarconato

Thanks, the changes are good.

add a link to the discussion https://github.com/scverse/spatialdata/discussions/707 in the exception that the code raises when reading a dataset with naming problems.

The exception is not raised in a single place. These exceptions would need to be changed:

  • check_valid_name L74-84 (6 exceptions)
  • _iter_anndata_attr_keys_collect_value_errors L196

The problem with _iter_anndata_attr_keys_collect_value_errors is that it collects one or more of the above exceptions, so it would include the link multiple times.

Probably we would rather refactor the code, or create a wrapper function that adds the link to a raised exception and call that function in place of check_valid_name and validate_table_attr_keys (in Elements._check_key, SpatialData.write, SpatialData.write_element, SpatialData.write_transformations, TableModel.validate, TableModel.parse).
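The wrapper idea could look like this minimal sketch. The function name and message wording are assumptions for illustration, not a proposed final API:

```python
DISCUSSION_URL = "https://github.com/scverse/spatialdata/discussions/707"

def raise_with_renaming_hint(validator, *args, **kwargs):
    """Run a validator; if it raises ValueError, re-raise with the link appended once."""
    try:
        validator(*args, **kwargs)
    except ValueError as err:
        raise ValueError(
            f"{err}\nFor renaming, please see the discussion here {DISCUSSION_URL} ."
        ) from err
```

Because the link is appended only at this single call boundary, a collector like _iter_anndata_attr_keys_collect_value_errors would not duplicate it for every inner exception.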

aeisenbarth avatar Oct 11 '24 23:10 aeisenbarth

We just discussed the following ideas in the meeting:

  1. Add a flag to optionally skip validation on reading and maybe on model construction.
  • :heavy_plus_sign: This facilitates debugging or fixing data.
  • For reading it seems feasible.
  • For model construction, every operation adding an element to the dictionary would also need to offer the flag. This doesn't work for dict[key] = value assignment, only for add_element(name, elem, validate=False). Also, it would affect many places in the code and increase complexity.
  2. It might be better to throw validation errors only on writing, and just warnings on reading or construction.
  • :heavy_plus_sign: Allows reading old/invalid files
  • :heavy_plus_sign: Allows users to easily fix invalid names by renaming in memory, without extra tools
  • :heavy_minus_sign: In-memory representation is not guaranteed to be valid
  • :heavy_minus_sign: More complex
  • I see the possibility of invalid in-memory objects as problematic, because other functions in the code cannot trust that the data is valid. The same applies when a not-disk-backed SpatialData is passed to other libraries (scanpy etc.) that assume all data is disk-backed and validated.
  • In my view, we should expect all datasets newly created after this PR to be valid. There should be very few older datasets that violate these constraints, which would benefit from a warning instead of an error. And they need to be migrated anyway, either due to an error on reading or due to a warning.
  • The use case of reading possibly invalid data overlaps with the issues of 1) gracefully reading legacy formats into the latest in-memory representation (partially implemented for parquet) and 2) partially reading corrupted data #457.
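Option 1's opt-out flag could be sketched as below. All names here are assumptions for illustration, not the actual spatialdata API:

```python
def _check_valid_name(name: str) -> None:
    # Minimal stand-in for the validator introduced in this PR.
    if name in (".", "..") or name.startswith("__") or not all(
        c.isalnum() or c in "_-." for c in name
    ):
        raise ValueError(f"Invalid name: {name!r}")

class Elements(dict):
    """Element-type dictionary with an explicit, skippable validation step."""

    def add_element(self, name: str, element: object, validate: bool = True) -> None:
        # dict[key] = value cannot carry a flag; only this method can offer it.
        if validate:
            _check_valid_name(name)
        self[name] = element
```

This illustrates the complexity cost mentioned above: plain `elements[name] = elem` assignment bypasses the flag entirely, so every mutation path would need to go through add_element().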

Any opinions, @LucaMarconato, @giovp ?

aeisenbarth avatar Oct 17 '24 17:10 aeisenbarth

I extended the validation error message to include the link to the instructions for renaming misnamed elements. I think it is ready for a final review.

For example, read_zarr now displays (in the test test_reading_invalid_name):

Cannot construct SpatialData object, input contains invalid elements.
For renaming, please see the discussion here https://github.com/scverse/spatialdata/discussions/707 .
  shapes/non-alnum_#$%&()*+,?@: Name must contain only alphanumeric characters, underscores, dots and hyphens.
  points/has whitespace: Name must contain only alphanumeric characters, underscores, dots and hyphens.
  • There were still some redundant validations that I kept: when a given name is used to refer to an existing element (not to add a new one), we can assume the existing elements were validated at construction time, and when an invalid name is queried, no element will be found. This is the case in SpatialData.write_element(element_name), SpatialData.write_transformations(element_name), SpatialData.write_metadata(element_name).
  • I decided to remove the name validation in SpatialData.delete_element_from_disk, for the reason above, and especially because if an element somehow got an invalid name, we should still allow deleting it.
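The collect-then-raise pattern behind the combined message above could be sketched like this. The names are assumptions, not the PR's actual functions:

```python
def _check_valid_name(name: str) -> None:
    # Minimal stand-in for the per-name validator.
    if not all(c.isalnum() or c in "_-." for c in name):
        raise ValueError(
            "Name must contain only alphanumeric characters, underscores, "
            "dots and hyphens."
        )

def validate_all_names(elements: dict[str, list[str]]) -> None:
    """Check every element name and raise one combined error at the end."""
    messages: list[str] = []
    for element_type, names in elements.items():
        for name in names:
            try:
                _check_valid_name(name)
            except ValueError as err:
                messages.append(f"  {element_type}/{name}: {err}")
    if messages:
        raise ValueError(
            "Cannot construct SpatialData object, input contains invalid elements.\n"
            + "\n".join(messages)
        )
```

Collecting all failures first means the user sees every offending element in one pass instead of fixing names one error at a time.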

aeisenbarth avatar Oct 30 '24 18:10 aeisenbarth

Great addition @aeisenbarth! The mechanism to collect the exception and display a single message works great!

There were still some redundant validations that I kept: when a given name is used to refer to an existing element (not to add a new one), we can assume the existing elements were validated at construction time, and when an invalid name is queried, no element will be found. This is the case in SpatialData.write_element(element_name), SpatialData.write_transformations(element_name), SpatialData.write_metadata(element_name). I decided to remove the name validation in SpatialData.delete_element_from_disk, for the reason above, and especially because if an element somehow got an invalid name, we should still allow deleting it.

Good points!

Add a flag to optionally skip validation on reading and maybe on model construction.

I also prefer this option (option 1), but I would wait to see whether there are multiple instances of users having to correct dataset names before making such a change, so we keep the code simpler.

LucaMarconato avatar Jan 15 '25 16:01 LucaMarconato