FIX: cloud paths checking against patterns
The _check_path_matches_patterns function was failing for cloud (e.g. s3, gc3) URIs (added in #1074):
- the path.relative_to() function doesn't work for cloud URIs (it just returns the original).
- combining Path('/') with the result of relative_to() now (ie since universal_pathlib > 0.2.3) throws an error when using cloud paths.
So if using universal_pathlib >0.2.3, any cloud URIs as bids datasets would throw an error message.
This fix drops the uri prefix by using .path, so that the regex check works as expected.
Codecov Report
Attention: Patch coverage is 87.50000% with 2 lines in your changes missing coverage. Please review.
Project coverage is 89.78%. Comparing base (
eb2f926) to head (962cfe7). Report is 13 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| bids/layout/tests/test_remote_bids.py | 80.00% | 2 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #1094 +/- ##
==========================================
- Coverage 89.79% 89.78% -0.01%
==========================================
Files 63 64 +1
Lines 7141 7154 +13
Branches 1367 835 -532
==========================================
+ Hits 6412 6423 +11
- Misses 531 534 +3
+ Partials 198 197 -1
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Could you add tests with examples? That will make it easier to be explicit about what was going wrong and allow us to consider other solutions quickly.
Yes good idea, I've added one simple test using an s3 uri from openneuro, and also made another minor fix.
The test passes locally, but I see now it is failing in the CI as botocore can't find the credentials file.. I'll see if there is a good workaround for this later..
For reference: running the test on the code before this PR gives this error:
test_remote_bids.py F [100%]
============================================================= FAILURES =============================================================
____________________________ test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136] ____________________________
dataset = 's3://openneuro.org/ds000102', nb_files = 136
@pytest.mark.parametrize(
"dataset, nb_files",
[
("s3://openneuro.org/ds000102", 136),
],
)
def test_layout_on_s3_datasets_no_derivatives(dataset, nb_files):
> layout = BIDSLayout(dataset)
test_remote_bids.py:21:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../layout.py:177: in __init__
_indexer(self)
../index.py:148: in __call__
all_bfs, all_tag_dicts = self._index_dir(self._layout._root, self._config)
../index.py:239: in _index_dir
if force or self._validate_file(abs_fn):
../index.py:162: in _validate_file
matched_patt = _validate_path(
../index.py:64: in _validate_path
if _check_path_matches_patterns(path, excl_patt, root=root):
../index.py:49: in _check_path_matches_patterns
path = Path("/") / path.relative_to(root)
/usr/lib/python3.12/pathlib.py:721: in __truediv__
return self.joinpath(key)
/usr/lib/python3.12/pathlib.py:717: in joinpath
return self.with_segments(self, *pathsegments)
../../../.venv/lib/python3.12/site-packages/upath/core.py:560: in with_segments
return type(self)(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <[AttributeError('drive') raised in repr()] PosixUPath object at 0x7fca4b13de40>, protocol = ''
args = (PosixUPath('/'), S3Path('s3://openneuro.org/ds000102/.gitattributes')), storage_options = {}, base_options = {}
args0 = PosixUPath('/')
def __init__(
self, *args, protocol: str | None = None, **storage_options: Any
) -> None:
# allow subclasses to customize __init__ arg parsing
base_options = getattr(self, "_storage_options", {})
args, protocol, storage_options = type(self)._transform_init_args(
args, protocol or self._protocol, {**base_options, **storage_options}
)
if self._protocol != protocol and protocol:
self._protocol = protocol
# retrieve storage_options
if args:
args0 = args[0]
if isinstance(args0, UPath):
self._storage_options = {**args0.storage_options, **storage_options}
else:
if hasattr(args0, "__fspath__"):
_args0 = args0.__fspath__()
else:
_args0 = str(args0)
self._storage_options = type(self)._parse_storage_options(
_args0, protocol, storage_options
)
else:
self._storage_options = storage_options.copy()
# check that UPath subclasses in args are compatible
# TODO:
# Future versions of UPath could verify that storage_options
# can be combined between UPath instances. Not sure if this
# is really necessary though. A warning might be enough...
if not compatible_protocol(self._protocol, *args):
> raise ValueError("can't combine incompatible UPath protocols")
E ValueError: can't combine incompatible UPath protocols
../../../.venv/lib/python3.12/site-packages/upath/core.py:263: ValueError
========================================================= warnings summary =========================================================
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
/local/scratch/pybids/.venv/lib/python3.12/site-packages/botocore/auth.py:424: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
datetime_now = datetime.datetime.utcnow()
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================== short test summary info ======================================================
FAILED test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136] - ValueError: can't combine incompatible UPath protocols
================================================== 1 failed, 8 warnings in 1.00s ===================================================
Okay, I'm starting to get a handle on the issue. Basically, although we can specify things like anon=True for the root path, there are many places where BIDSFiles are constructed without access to layout._root to ensure that these parameters get inherited. Because BIDSFile is an ORM object, the lack of round-trip equivalence in the serialization/deserialization process is a problem.
If universal_pathlib provides an API to access these additional parameters, we could consider adding a column to the BIDSFile model and use it while reconstructing. Another approach could be to mutate BIDSFiles after creation, provided the access happens through a BIDSLayout method call that has access to ._root. Alternately, if we could somehow put this information in the SQLAlchemy session, then the BIDSFile.path property could access it on demand.
Ah I see.
I wonder if we could use the storage_options dict for this purpose? This is something inherited from fsspec so can be used with universal_pathlib. The api within storage_options is specific to the underlying filesystem (eg anon is there for s3fs but not for others), but I hoping that shouldn't prevent it from being passed along and serialized with the object?
If you can figure out how to get storage_options to serialize and deserialize, I'm inclined to take any approach you care to use unless it causes performance degradation for local datasets.
I've run out of time to poke at this this week and am unlikely to get back to it very soon. Ping @adelavega in case you have ideas on how to tackle this.
Sounds good will give it a shot, thanks for helping out with this!!
If it turns out trickier than expected, wondering if a stopgap solution is just to document that custom options are not currently passed along. Doesn't resolve the anon s3 access issue but does resolve any other use of universal paths where custom options aren't needed. Eg I didn't even know there was an issue with anon s3 access because I had a credentials file set-up already.. anyways can consider that later if/when we get to that point..
Created https://github.com/bids-standard/pybids/issues/1098 to track the lesser bug. Let's merge this now.