pybids icon indicating copy to clipboard operation
pybids copied to clipboard

FIX: cloud paths checking against patterns

Open akhanf opened this issue 1 year ago • 2 comments

The _check_path_matches_patterns function was failing for cloud (e.g. s3, gc3) URIs (added in #1074):

  1. the path.relative_to() function doesn't work for cloud URIs (it just returns the original).
  2. combining Path('/') with the result of relative_to() now (ie since universal_pathlib > 0.2.3) throws an error when using cloud paths.

So if using universal_pathlib >0.2.3, any cloud URIs as bids datasets would throw an error message.

This fix drops the uri prefix by using .path, so that the regex check works as expected.

akhanf avatar Oct 14 '24 03:10 akhanf

Codecov Report

Attention: Patch coverage is 87.50000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 89.78%. Comparing base (eb2f926) to head (962cfe7). Report is 13 commits behind head on master.

Files with missing lines Patch % Lines
bids/layout/tests/test_remote_bids.py 80.00% 2 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1094      +/-   ##
==========================================
- Coverage   89.79%   89.78%   -0.01%     
==========================================
  Files          63       64       +1     
  Lines        7141     7154      +13     
  Branches     1367      835     -532     
==========================================
+ Hits         6412     6423      +11     
- Misses        531      534       +3     
+ Partials      198      197       -1     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Oct 14 '24 04:10 codecov[bot]

Could you add tests with examples? That will make it easier to be explicit about what was going wrong and allow us to consider other solutions quickly.

effigies avatar Oct 17 '24 15:10 effigies

Yes good idea, I've added one simple test using an s3 uri from openneuro, and also made another minor fix.

The test passes locally, but I see now it is failing in the CI as botocore can't find the credentials file.. I'll see if there is a good workaround for this later..

For reference: running the test on the code before this PR gives this error:


test_remote_bids.py F                                                                                                        [100%]

============================================================= FAILURES =============================================================
____________________________ test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136] ____________________________

dataset = 's3://openneuro.org/ds000102', nb_files = 136

    @pytest.mark.parametrize(
        "dataset, nb_files",
        [
            ("s3://openneuro.org/ds000102", 136),
        ],
    )
    def test_layout_on_s3_datasets_no_derivatives(dataset, nb_files):
>       layout = BIDSLayout(dataset)

test_remote_bids.py:21: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../layout.py:177: in __init__
    _indexer(self)
../index.py:148: in __call__
    all_bfs, all_tag_dicts = self._index_dir(self._layout._root, self._config)
../index.py:239: in _index_dir
    if force or self._validate_file(abs_fn):
../index.py:162: in _validate_file
    matched_patt = _validate_path(
../index.py:64: in _validate_path
    if _check_path_matches_patterns(path, excl_patt, root=root):
../index.py:49: in _check_path_matches_patterns
    path = Path("/") / path.relative_to(root)
/usr/lib/python3.12/pathlib.py:721: in __truediv__
    return self.joinpath(key)
/usr/lib/python3.12/pathlib.py:717: in joinpath
    return self.with_segments(self, *pathsegments)
../../../.venv/lib/python3.12/site-packages/upath/core.py:560: in with_segments
    return type(self)(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <[AttributeError('drive') raised in repr()] PosixUPath object at 0x7fca4b13de40>, protocol = ''
args = (PosixUPath('/'), S3Path('s3://openneuro.org/ds000102/.gitattributes')), storage_options = {}, base_options = {}
args0 = PosixUPath('/')

    def __init__(
        self, *args, protocol: str | None = None, **storage_options: Any
    ) -> None:
        # allow subclasses to customize __init__ arg parsing
        base_options = getattr(self, "_storage_options", {})
        args, protocol, storage_options = type(self)._transform_init_args(
            args, protocol or self._protocol, {**base_options, **storage_options}
        )
        if self._protocol != protocol and protocol:
            self._protocol = protocol
    
        # retrieve storage_options
        if args:
            args0 = args[0]
            if isinstance(args0, UPath):
                self._storage_options = {**args0.storage_options, **storage_options}
            else:
                if hasattr(args0, "__fspath__"):
                    _args0 = args0.__fspath__()
                else:
                    _args0 = str(args0)
                self._storage_options = type(self)._parse_storage_options(
                    _args0, protocol, storage_options
                )
        else:
            self._storage_options = storage_options.copy()
    
        # check that UPath subclasses in args are compatible
        # TODO:
        #   Future versions of UPath could verify that storage_options
        #   can be combined between UPath instances. Not sure if this
        #   is really necessary though. A warning might be enough...
        if not compatible_protocol(self._protocol, *args):
>           raise ValueError("can't combine incompatible UPath protocols")
E           ValueError: can't combine incompatible UPath protocols

../../../.venv/lib/python3.12/site-packages/upath/core.py:263: ValueError
========================================================= warnings summary =========================================================
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
  /local/scratch/pybids/.venv/lib/python3.12/site-packages/botocore/auth.py:424: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
    datetime_now = datetime.datetime.utcnow()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================== short test summary info ======================================================
FAILED test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136] - ValueError: can't combine incompatible UPath protocols
================================================== 1 failed, 8 warnings in 1.00s ===================================================

akhanf avatar Oct 26 '24 01:10 akhanf

Okay, I'm starting to get a handle on the issue. Basically, although we can specify things like anon=True for the root path, there are many places where BIDSFiles are constructed without access to layout._root to ensure that these parameters get inherited. Because BIDSFile is an ORM object, the lack of round-trip equivalence in the serialization/deserialization process is a problem.

If universal_pathlib provides an API to access these additional parameters, we could consider adding a column to the BIDSFile model and use it while reconstructing. Another approach could be to mutate BIDSFiles after creation, provided the access happens through a BIDSLayout method call that has access to ._root. Alternately, if we could somehow put this information in the SQLAlchemy session, then the BIDSFile.path property could access it on demand.

effigies avatar Oct 30 '24 13:10 effigies

Ah I see.

I wonder if we could use the storage_options dict for this purpose? This is something inherited from fsspec so can be used with universal_pathlib. The api within storage_options is specific to the underlying filesystem (eg anon is there for s3fs but not for others), but I hoping that shouldn't prevent it from being passed along and serialized with the object?

akhanf avatar Oct 31 '24 14:10 akhanf

If you can figure out how to get storage_options to serialize and deserialize, I'm inclined to take any approach you care to use unless it causes performance degradation for local datasets.

I've run out of time to poke at this this week and am unlikely to get back to it very soon. Ping @adelavega in case you have ideas on how to tackle this.

effigies avatar Oct 31 '24 14:10 effigies

Sounds good will give it a shot, thanks for helping out with this!!

If it turns out trickier than expected, wondering if a stopgap solution is just to document that custom options are not currently passed along. Doesn't resolve the anon s3 access issue but does resolve any other use of universal paths where custom options aren't needed. Eg I didn't even know there was an issue with anon s3 access because I had a credentials file set-up already.. anyways can consider that later if/when we get to that point..

akhanf avatar Oct 31 '24 14:10 akhanf

Created https://github.com/bids-standard/pybids/issues/1098 to track the lesser bug. Let's merge this now.

effigies avatar Nov 11 '24 13:11 effigies