Create infrastructure for publishing all raw FERC SQLite DBs extracted from XBRL data

Open zschira opened this issue 2 years ago • 4 comments

Our FERC XBRL Extractor works with FERC Forms 1, 2, 6, 60, and 714. Form 1 is the best integrated with PUDL, but even so, many of its tables have not yet been integrated into the ETL, so the ability to publish the raw SQLite DBs will be very useful. The XBRL data is also much better structured than the historical data but very hard to work with in the XBRL format, so raw SQLite versions of all of these forms could provide a lot of value.

Tasks

Ingest Metadata generated by extraction tool

  • [ ] The FERC XBRL Extractor can generate a Frictionless Data Package using metadata extracted from the FERC taxonomy. This will allow us to publish each database with column-level descriptions provided by FERC.

Enable publication

  • [ ] Integrate new sources with datasette_metadata_to_yml
  • [ ] Update datasette publication bash script
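
As a rough illustration of the metadata tasks above, here is a minimal sketch of mapping a Frictionless Data Package descriptor onto Datasette's metadata layout (`databases` → `tables` → `columns`). It is not the actual `datasette_metadata_to_yml` implementation: the function name, the `ferc1_xbrl` database name, and the example table and column names are all hypothetical, and it emits JSON (which Datasette also accepts) to stay within the standard library.

```python
import json


def datapackage_to_datasette_metadata(descriptor, db_name):
    """Map a Frictionless datapackage descriptor (dict) to Datasette metadata.

    Hypothetical helper for illustration only; not part of PUDL.
    """
    tables = {}
    for resource in descriptor.get("resources", []):
        fields = resource.get("schema", {}).get("fields", [])
        # Keep only the columns that actually carry a description.
        columns = {
            field["name"]: field["description"]
            for field in fields
            if field.get("description")
        }
        table_meta = {}
        if resource.get("description"):
            table_meta["description"] = resource["description"]
        if columns:
            table_meta["columns"] = columns
        tables[resource["name"]] = table_meta
    return {"databases": {db_name: {"tables": tables}}}


# Made-up descriptor; real table and column names come from the FERC taxonomy.
example_descriptor = {
    "name": "example-package",
    "resources": [
        {
            "name": "example_table",
            "description": "An example table.",
            "schema": {
                "fields": [
                    {
                        "name": "entity_id",
                        "type": "string",
                        "description": "Identifier of the filing entity.",
                    },
                    {"name": "report_year", "type": "integer"},
                ]
            },
        }
    ],
}

metadata = datapackage_to_datasette_metadata(example_descriptor, "ferc1_xbrl")
print(json.dumps(metadata, indent=2))
```

Fields without a description are simply omitted from the `columns` mapping, so a partially documented taxonomy would still produce valid metadata.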

zschira avatar Aug 08 '22 15:08 zschira

Is extracting the old VFP data within scope for these other forms?

bendnorman avatar Aug 08 '22 18:08 bendnorman

It seems like something we might as well do. The DBF data is going to be messier and we won't be able to provide the same level of documentation, but it would still be more accessible than in the DBF format.

zschira avatar Aug 08 '22 19:08 zschira

I suppose each Form would have to have 2 databases. How big of a lift would this be? @zaneselvans, which forms here are most valuable?

bendnorman avatar Aug 08 '22 19:08 bendnorman

The only forms we've said we would integrate historical data for are 1, 2, and 714.

Form 2 is analogous to Form 1 but for interstate natural gas utilities, so mostly transmission pipeline companies. We'd hoped there would be more state-level gas utilities in there, as there are for electric utilities, but it seems like that's not the case.

The old Form 714 is partially integrated, and provides a bunch of data about balancing and planning areas, including hourly demand. The old data is a bunch of CSVs dumped from DBF, all years in one partition.

So I think those are the highest priority, and the old 714 data will be easier to work with.

IIRC, Form 6 is like Forms 1 & 2, but for petroleum, and the old data is DBF. I think that would be the next priority. Form 60 seemed like a mysterious "other entities" category, and would be the lowest priority.

I imagine having the XBRL databases will make it easier to interpret the old data.

zaneselvans avatar Aug 08 '22 23:08 zaneselvans

PUDL is now able to construct SQLite DBs from all FERC XBRL forms, and ingest/convert the accompanying datapackage descriptors. We have not yet published these DBs on datasette, but the infrastructure is all in place on the xbrl_integration branch.
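
For anyone wanting to sanity-check one of these raw DBs before publication, a minimal sketch using Python's built-in `sqlite3` module could list each table with its row count. The `summarize_tables` helper and the in-memory example table are hypothetical stand-ins; a real run would open the extracted file with `sqlite3.connect(path)` instead.

```python
import sqlite3


def summarize_tables(conn):
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    summary = {}
    for name in names:
        # Quote the identifier, since table names cannot be bound as parameters.
        quoted = '"' + name.replace('"', '""') + '"'
        summary[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return summary


# In-memory stand-in for an extracted DB; the table and its contents are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_table (entity_id TEXT, report_year INTEGER)")
conn.executemany(
    "INSERT INTO example_table VALUES (?, ?)",
    [("C000001", 2021), ("C000002", 2021)],
)
conn.commit()

table_summary = summarize_tables(conn)
print(table_summary)
```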

zschira avatar Sep 08 '22 17:09 zschira