pudl
pudl copied to clipboard
Update from `frictionless>=4.4,<5` to `frictionless>=5,<6` across the board
Once #1420 is closed, we will be using one version of frictionless
across the board.
Unfortunately, that version will be 4.40.8, which is missing critical SQL-describing functionality.
We should upgrade to version 5! Migration guide is here.
Fortunately, our existing usage of the frictionless
libraries is very limited - just checking if it can parse a datapackage.json and then checking to see if the metadata is valid:
ferc-xbrl-extractor/src/ferc-xbrl-extractor/xbrl.py
if datapackage_path:
# Verify that datapackage descriptor is valid before outputting
frictionless_package = Package(descriptor=datapackage.dict(by_alias=True))
if not frictionless_package.metadata_valid:
raise RuntimeError(
f"Generated datapackage is invalid - {frictionless_package.metadata_errors}"
)
pudl/src/pudl/workspace/datastore.py
dp = frictionless.Package(datapackage_json)
if not dp.metadata_validate():
Which is pretty easy to change to the frictionless 5 Package.validate_descriptor()
method.
Unfortunately, this is partly an artifact of us making mirror classes that look sort of like the frictionless classes, in pudl-archiver
, ferc-xbrl-extractor
, and pudl
. These classes sort of replicate the frictionless classes, but also include lots of custom logic for our own purposes.
My proposal is:
- rename our Fake Frictionless classes to
PudlResource
,XbrlResource
, andArchiverResource
, respectively (plus the Package/Schema/etc. equivalents) - add
to_frictionless
methods that allow us to convert these to theirfrictionless
counterparts - Use real
frictionless
classes to serializedatapackage.json
s that are valid, inferc-xbrl-extractor
/pudl-archiver
- Create new, valid archives if necessary
- Use real
frictionless
classes to validate the datapackages inpudl/workspace/datastore.py
My guess for this work is ~20h.
Note - since our classes have required functionality that frictionless
classes don't have, we should not implement from_frictionless
methods. For example, every XbrlResource
has to respond to get_fact_tables()
, but frictionless
has no such notion. So then from_frictionless
can only ever make some half-functional XbrlResource
- so we shouldn't do that at all.
I'm new to this issue but why not have our Resources be subclasses of the frictionless framework classes? I think we'll have 3 types of extensions we'd like to make:
- extensions to frictionless classes that make sense to add directly to the frictionless-py package
- extensions that don't fit into ^ but we'd like to use in all of our projects. For example, we want to store additional metadata for all of our projects that don't fit into the frictionless specification. Another example would be if we want a method to convert a frictionless schema to a pandera schema but frictionless isn't interested in providing that functionality. To handle these types of extensions we'd probably need a Catalyst specific package that extends the frictionless tools.
- extensions that are project specific (XBRL, PUDL, client projects...) For this we can just extend the catalyst or frictionless package.
I think it's probably better to have our Resource contain references to frictionless classes - roughly corresponding to the idea of preferring "composition over inheritance."
The benefit of this, to me, is that this reduces the coupling between our PudlResource
interface and the implementation of the frictionless classes it depends on. You'll only need to know the public frictionless APIs instead of the implementation details. But we would still be able to use the frictionless code to reduce duplication as necessary.