More robust name validation
Closes #624
- This pull request changes the name validation rules:
  - additionally allow `.` (now allowing `_`, `-`, `.` and alphanumeric, which includes `0-9a-zA-Z` but also other Unicode like `ɑ` and `²`)
  - forbid the full names `.` and `..`
  - forbid the prefix `__`
  - forbid names only differing in character case, like `abc`, `Abc` (only one of them is allowed, no matter which case)
- Name validation is now also applied to AnnData tables (keys/columns in `obs`, `obsm`, `obsp`, `var`, `varm`, `varp`, `uns`).
- For `obs` and `var` dataframes, `_index` is forbidden.
- Validation happens at construction time when adding elements to an element type dictionary (as before).
- Additionally, validation happens before writing to Zarr.
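For reference, a minimal sketch of what these rules amount to in code; the helper names here are hypothetical, and the actual implementation (e.g. in `src/spatialdata/_core/validation.py`) may differ:

```python
import re
from collections.abc import Iterable


def check_valid_name(name: str) -> None:
    """Raise a ValueError if `name` violates the rules listed above."""
    if not isinstance(name, str) or len(name) == 0:
        raise ValueError("Name must be a non-empty string.")
    if name in (".", ".."):
        raise ValueError("Name cannot be '.' or '..'.")
    if name.startswith("__"):
        raise ValueError("Name must not start with double underscores.")
    # In Python 3, `\w` matches Unicode alphanumerics (including `ɑ` and `²`)
    # plus `_`; additionally allow `-` and `.`.
    if re.search(r"[^\w\-.]", name):
        raise ValueError(
            "Name must contain only alphanumeric characters, underscores, dots and hyphens."
        )


def check_unique_ignoring_case(names: Iterable[str]) -> None:
    """Forbid names differing only in character case, like `abc` and `Abc`."""
    seen: set[str] = set()
    for name in names:
        folded = name.casefold()
        if folded in seen:
            raise ValueError(f"Name `{name}` clashes with another name differing only in case.")
        seen.add(folded)
```

For example, `check_valid_name("cell²")` passes under these rules, while `check_valid_name("__private")` and `check_valid_name("has whitespace")` raise.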
Codecov Report
Attention: Patch coverage is 98.27586% with 3 lines in your changes missing coverage. Please review.
Project coverage is 91.93%. Comparing base (`39a10a1`) to head (`1dd32a5`). Report is 34 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/spatialdata/_core/spatialdata.py | 93.47% | 3 Missing :warning: |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #703      +/-   ##
==========================================
+ Coverage   91.72%   91.93%   +0.20%
==========================================
  Files          46       47       +1
  Lines        7166     7290     +124
==========================================
+ Hits         6573     6702     +129
+ Misses        593      588       -5
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/spatialdata/_core/_elements.py | 92.22% <100.00%> (-0.09%) | :arrow_down: |
| src/spatialdata/_core/validation.py | 100.00% <100.00%> (ø) | |
| src/spatialdata/models/__init__.py | 100.00% <100.00%> (ø) | |
| src/spatialdata/models/models.py | 87.76% <100.00%> (-0.08%) | :arrow_down: |
| src/spatialdata/_core/spatialdata.py | 91.26% <93.47%> (+0.65%) | :arrow_up: |
Excellent PR @aeisenbarth, thank you!
I performed my code review and applied the code changes directly. I list them here:
- I added a check for `layers` as well for tables (updated design docs and tests accordingly).
- There was a bug in `_validate_all_elements()`: it should be `element_type == 'tables'` (instead of `'table'`).
  - That if condition was not covered by tests, so I added a test for it.
- In `test_spatialdata_operations.py`, some checks for tables were missing (due to old code that expected a single table); I updated that.
- Same for some code in `test_readwrite.py`.
- In `test_writing_invalid_name()` a test for labels was commented out; I uncommented it.
- I extended `test_writing_invalid_name()` to cover:
  - writing of a table with a valid name but invalid "subnames" (this is the test I mentioned above that was not covering the "table" vs "tables" bug)
  - incremental writing of single elements (before, validation of table "subnames" was triggered only on `write()`, now also on `write_element()`).
- I now trigger the name validation also on `TableModel().validate()` and not just on `TableModel().parse()`. I added tests for that in `test_models.py`.

Please give my code changes a double check, and if you agree with them (or after your edits), let's merge 😊
The explanation in the Discussions on how to read datasets with naming problems is great! One minor todo:
- [ ] add a link to the discussion https://github.com/scverse/spatialdata/discussions/707 in the exception that the code raises when reading a dataset with naming problems.
Thanks, the changes are good.
add a link to the discussion https://github.com/scverse/spatialdata/discussions/707 in the exception that the code raises when reading a dataset with naming problems.
The exception is not raised in a single place. These exceptions would need to be changed:
- `check_valid_name` L74-84 (6 exceptions)
- `_iter_anndata_attr_keys_collect_value_errors` L196

The problem with `_iter_anndata_attr_keys_collect_value_errors` is that it collects one or more of the above exceptions, so it would include the link multiple times.
Probably we would rather refactor the code or create a wrapper function that adds the link to a raised exception, and call that function in place of `check_valid_name` and `validate_table_attr_keys` (in `Elements._check_key`, `SpatialData.write`, `SpatialData.write_element`, `SpatialData.write_transformations`, `TableModel.validate`, `TableModel.parse`).
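As a sketch of the wrapper idea (hypothetical name and placement, not the actual spatialdata API), the link could be appended exactly once at the point where the exception crosses the public boundary:

```python
from functools import wraps
from typing import Callable, TypeVar

T = TypeVar("T")

RENAMING_HINT = (
    "For renaming, please see the discussion here "
    "https://github.com/scverse/spatialdata/discussions/707 ."
)


def with_renaming_hint(func: Callable[..., T]) -> Callable[..., T]:
    """Re-raise any ValueError from `func` with the discussion link appended once."""

    @wraps(func)
    def wrapper(*args: object, **kwargs: object) -> T:
        try:
            return func(*args, **kwargs)
        except ValueError as e:
            raise ValueError(f"{e}\n{RENAMING_HINT}") from e

    return wrapper
```

Call sites such as `SpatialData.write_element` would then invoke, e.g., `with_renaming_hint(check_valid_name)(name)`, so that collected sub-exceptions keep their plain messages and the link appears only once in the final error.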
We just discussed the following ideas in the meeting:
- Add a flag to optionally skip validation on reading and maybe on model construction.
- :heavy_plus_sign: This facilitates debugging or fixing data.
- For reading it seems feasible.
  - For model construction, every operation adding an element to the dictionary would also need to offer the flag. This doesn't work for `dict[key] = ...` assignment, only for `add_element(name, elem, validate=False)` (see the toy sketch after this list). Also, it would affect many places in the code and increase complexity.
- It might be better to throw validation errors only on writing, and just warnings on reading or construction.
- :heavy_plus_sign: Allows reading old/invalid files
- :heavy_plus_sign: Allows users to easily fix invalid names by renaming in memory, without extra tools
- :heavy_minus_sign: In-memory representation is not guaranteed to be valid
- :heavy_minus_sign: More complex
- I see the possibility of invalid in-memory objects as problematic, because other functions in the code cannot trust that the data is valid. The same applies when a not-disk-backed SpatialData is passed to other libraries (scanpy etc.) that assume all data is disk-backed and validated.
- In my view, we should expect all datasets newly created after this PR to be valid. Only very few older datasets should exist that violate these constraints; those would benefit from a warning instead of an error. And they need to be migrated anyway, either due to an error on reading or due to a warning.
- The use case of reading possibly invalid data overlaps with the issues of 1) gracefully reading legacy formats into the latest in-memory representation (partially implemented for parquet) and 2) partially reading corrupted data #457.
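To make the construction-time limitation concrete, here is a toy sketch; the class is a hypothetical stand-in for spatialdata's element-type dictionary, and the `validate` flag is the proposal, not existing API:

```python
from collections import UserDict


class Elements(UserDict):
    """Toy stand-in for spatialdata's element-type dictionary (hypothetical)."""

    def __setitem__(self, key: str, value: object) -> None:
        # A mapping assignment has a fixed signature, so there is no way
        # to pass a `validate=False` flag through `dict[key] = ...`.
        self._check_key(key)
        super().__setitem__(key, value)

    def add_element(self, key: str, value: object, validate: bool = True) -> None:
        # Only an explicit method can expose the flag (proposed, not implemented).
        if validate:
            self._check_key(key)
        self.data[key] = value

    @staticmethod
    def _check_key(key: str) -> None:
        # Minimal stand-in for the real name validation.
        if " " in key:
            raise ValueError(f"Name `{key}` must not contain whitespace.")
```

With this, `elements["has space"] = img` would always raise, while `elements.add_element("has space", img, validate=False)` could opt out; having to thread such a flag through every call site is exactly the complexity concern raised above.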
Any opinions, @LucaMarconato, @giovp ?
I extended the validation error message to include the link to the instructions for renaming misnamed elements. I think it is ready for a final review.
For example, `read_zarr` now displays (in the test `test_reading_invalid_name`):
```
Cannot construct SpatialData object, input contains invalid elements.
For renaming, please see the discussion here https://github.com/scverse/spatialdata/discussions/707 .
shapes/non-alnum_#$%&()*+,?@: Name must contain only alphanumeric characters, underscores, dots and hyphens.
points/has whitespace: Name must contain only alphanumeric characters, underscores, dots and hyphens.
```
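For illustration, the collect-then-raise pattern behind a message like the one above could look roughly like this (hypothetical names; the actual code in `validation.py` differs in its details):

```python
import re


def _check_name(name: str) -> None:
    # Minimal stand-in for the full validator sketched near the top of this thread.
    if re.search(r"[^\w\-.]", name):
        raise ValueError(
            "Name must contain only alphanumeric characters, underscores, dots and hyphens."
        )


def validate_all_names(elements: dict[str, list[str]]) -> None:
    """Validate every element name, collecting all errors into a single message."""
    errors: dict[str, str] = {}
    for element_type, names in elements.items():
        for name in names:
            try:
                _check_name(name)
            except ValueError as e:
                errors[f"{element_type}/{name}"] = str(e)
    if errors:
        details = "\n".join(f"{location}: {message}" for location, message in errors.items())
        raise ValueError(
            "Cannot construct SpatialData object, input contains invalid elements.\n"
            "For renaming, please see the discussion here "
            "https://github.com/scverse/spatialdata/discussions/707 .\n" + details
        )
```

For example, `validate_all_names({"shapes": ["non-alnum_#$%&()*+,?@"], "points": ["has whitespace"]})` raises with a message like the one shown above.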
- There were still some redundant validations, which I kept: when a given name is used to refer to an existing element (not to add a new one), we can assume the existing elements were validated at construction time, and when an invalid name is queried, no element will be found. This is the case in `SpatialData.write_element(element_name)`, `SpatialData.write_transformations(element_name)`, `SpatialData.write_metadata(element_name)`.
- I decided to remove the name validation in `SpatialData.delete_element_from_disk`, for the reason above, and especially because if an element somehow got an invalid name, we should still allow deleting it.
Great addition @aeisenbarth! The mechanism to collect the exceptions and display a single message works great!
There were still some redundant validations, which I kept: when a given name is used to refer to an existing element (not to add a new one), we can assume the existing elements were validated at construction time, and when an invalid name is queried, no element will be found. This is the case in `SpatialData.write_element(element_name)`, `SpatialData.write_transformations(element_name)`, `SpatialData.write_metadata(element_name)`. I decided to remove the name validation in `SpatialData.delete_element_from_disk`, for the reason above, and especially because if an element somehow got an invalid name, we should still allow deleting it.
Good points!
Add a flag to optionally skip validation on reading and maybe on model construction.
I also prefer this option (option 1), but I would wait to see whether there are multiple instances of having to correct the names of datasets before making such a change, so we keep the code simpler.