Cannot read datapackage from s3

Open barbuz opened this issue 2 years ago • 2 comments

Overview

I want to use Frictionless datapackages to provide metadata about some collections hosted on s3, but I'm encountering issues when trying to read these files. I can load the data fine as a Resource, and I can even validate it against a local tableschema, but if I try loading the datapackage I get the following error:

>>> pak = frictionless.Package('s3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/datapackage.json')
Traceback (most recent call last):
  File "/home/leo/miniconda3/lib/python3.10/site-packages/frictionless/metadata.py", line 306, in metadata_retrieve
    response = session.get(descriptor, stream=True)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/requests/sessions.py", line 600, in get
    return self.request("GET", url, **kwargs)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/requests/sessions.py", line 695, in send
    adapter = self.get_adapter(url=request.url)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/requests/sessions.py", line 792, in get_adapter
    raise InvalidSchema(f"No connection adapters were found for {url!r}")
requests.exceptions.InvalidSchema: No connection adapters were found for 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/datapackage.json'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/leo/miniconda3/lib/python3.10/site-packages/frictionless/package/factory.py", line 38, in __call__
    cls.from_descriptor(source, basepath=basepath, **options),  # type: ignore
  File "/home/leo/miniconda3/lib/python3.10/site-packages/frictionless/metadata.py", line 162, in from_descriptor
    descriptor = cls.metadata_retrieve(descriptor)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/frictionless/metadata.py", line 324, in metadata_retrieve
    raise FrictionlessException(Error(note=note)) from exception
frictionless.exception.FrictionlessException: [package-error] The data package has an error: cannot retrieve metadata "s3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/datapackage.json" because "No connection adapters were found for 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/datapackage.json'"

I have also tried opening a local copy of the datapackage with its resource path pointing to s3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/data.parquet/part.0.parquet, but then the validation fails with:

>>> pak.validate()
{'valid': False,
 'stats': {'tasks': 1, 'errors': 1, 'warnings': 0, 'seconds': 0.057},
 'warnings': [],
 'errors': [],
 'tasks': [{'name': 'data',
            'type': 'table',
            'valid': False,
            'place': 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/data.parquet/part.0.parquet',
            'labels': [],
            'stats': {'errors': 1, 'warnings': 0, 'seconds': 0.026},
            'warnings': [],
            'errors': [{'type': 'source-error',
                        'title': 'Source Error',
                        'description': 'Data reading error because of not '
                                       'supported or inconsistent contents.',
                        'message': 'The data source has not supported or has '
                                   'inconsistent contents: '
                                   's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/data.parquet/part.0.parquet',
                        'tags': [],
                        'note': 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/data.parquet/part.0.parquet'}]}]}

Finally, I ran some experiments with the CLI and hit the same errors there. In particular, validating the remote data against a local tableschema.json file worked, but when the tableschema was also hosted on s3 I got the error "No connection adapters were found for 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/tableschema.json'".

All the files used here should be public, so you can try replicating the issue. Please let me know if I'm doing something wrong or if this is an actual bug.

barbuz, Oct 04 '23 01:10

We have a similar use case.

I reproduced this issue and tried various combinations, but couldn't get the package to load correctly.

Does the AWS plugin expose all of the necessary parts to validate a whole data package, or is it only at the Resource level such as in the guide here? https://framework.frictionlessdata.io/docs/schemes/aws.html

PeterBaker0, Oct 12 '23 23:10

Thanks for reporting!

roll, Nov 22 '23 10:11
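
A possible workaround, not confirmed by the maintainers in this thread: the first traceback shows the descriptor being fetched with `requests`, which has no adapter for `s3://` URLs. One way around that is to download the descriptor yourself and hand frictionless a plain dict. The sketch below is only an illustration under assumptions: `split_s3_url`, `load_descriptor_from_s3` and the `fetch_bytes` hook are hypothetical helper names, and the default fetcher assumes `boto3` is installed and that the bucket allows anonymous (unsigned) reads.

```python
import json
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def load_descriptor_from_s3(url, fetch_bytes=None):
    """Return the JSON descriptor stored at an s3:// URL as a dict.

    fetch_bytes is a hypothetical hook, (bucket, key) -> bytes.
    When omitted, boto3 is used with unsigned requests, which is an
    assumption that only holds for publicly readable buckets.
    """
    bucket, key = split_s3_url(url)
    if fetch_bytes is None:
        # Imported lazily so the helper can be used without boto3
        # when a custom fetcher is supplied.
        import boto3
        from botocore import UNSIGNED
        from botocore.config import Config

        client = boto3.client("s3", config=Config(signature_version=UNSIGNED))

        def fetch_bytes(bucket, key):
            return client.get_object(Bucket=bucket, Key=key)["Body"].read()

    return json.loads(fetch_bytes(bucket, key))
```

With the descriptor in hand, `frictionless.Package(load_descriptor_from_s3(url))` should get past the metadata retrieval step, since the package is built from a dict rather than a URL. This does not address the second error above: resources whose `path` points at `s3://` still have to be read by frictionless itself, and relative resource paths in the descriptor would have no base location to resolve against.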