Integrate FERC XBRL data into PUDL
Background
After creating tools to translate the FERC XBRL filings into SQLite databases, we decided that the old Visual FoxPro DBF and new XBRL data will need their own independent extract + transform processes. The new data has much more structure and metadata, and will be quite a bit easier to work with than the historical data, so it doesn't make sense to convert the new data into the old structure just so we can run it through the same old transformations (this is discussed in catalyst-cooperative/pudl#1579).
This means a lot of new code and transformations, and has precipitated a major refactor of the FERC Form 1 transformations -- especially since we plan to take on many additional Form 1 tables beyond the ones we've already cleaned up and integrated into the PUDL database.
Now that we have access to the raw XBRL data, we've been working on two areas in parallel:
- Making the XBRL data acquisition & extraction process more robust and automated, and
- Re-writing the transform functions to accommodate both XBRL and DBF data, and to be more modular and re-usable as we expand our coverage to additional tables.
FERC XBRL Data Acquisition (@zschira)
- This includes the data scraping and archiving process, up to the point of having versioned Zenodo depositions available through the PUDL Datastore, and regularly updated with minimal human intervention.
- This work will take place in the `pudl-scrapers` and `pudl-zenodo-storage` repositories.
Issues
- [x] catalyst-cooperative/pudl#1593
- [x] Package ferc-xbrl-extractor so it's installable by PUDL
- [x] https://github.com/catalyst-cooperative/pudl-scrapers/issues/41
- [x] catalyst-cooperative/pudl-scrapers#45
- [x] https://github.com/catalyst-cooperative/pudl-scrapers/issues/42
- [x] https://github.com/catalyst-cooperative/ferc-xbrl-extractor/issues/19
- [x] https://github.com/catalyst-cooperative/pudl-scrapers/issues/39
- [x] https://github.com/catalyst-cooperative/pudl-scrapers/issues/26
- [ ] #1418 (important since XBRL RSS posts individual filings)
Pre-extract (@zschira)
- This includes everything that takes the raw inputs archived on Zenodo and turns them into coherent SQLite databases which we can archive, publish and use as standalone resources.
- We will also produce detailed machine-readable metadata comparable in detail to what is available in the XBRL.
- This work will primarily take place in the `ferc-xbrl-extractor` repository.
- Updates that impact the main PUDL repo will be reflected on the `xbrl_integration` branch.
Release Issues
- [x] catalyst-cooperative/pudl#1668
- [x] catalyst-cooperative/pudl#1861
- [x] FERC 1: Both DBF and XBRL now
- [x] FERC 714: XBRL for now, CSV later, see catalyst-cooperative/pudl#1859
- [x] FERC 2: XBRL now, DBF later, see catalyst-cooperative/pudl#1859
- [x] FERC 6 & 60: XBRL now, possibly DBF later, see catalyst-cooperative/pudl#1859
- [x] catalyst-cooperative/pudl#1667
- [x] Release Catalyst packages on conda-forge or isolate pre-extract steps from PUDL repo
- [x] Archive & Publish XBRL-derived FERC SQLite DBs catalyst-cooperative/pudl#1830
- [x] https://github.com/catalyst-cooperative/ferc-xbrl-extractor/issues/17
- [x] catalyst-cooperative/pudl#1860
- [x] Integrate XBRL to SQLite conversion of all FERC forms into the nightly builds
- [x] Automate Datasette redeployment as part of the nightly builds
- [x] #2080
Post-release Issues
- [x] catalyst-cooperative/ferc-xbrl-extractor#33
- [ ] #2081
- [x] #1420
- [x] catalyst-cooperative/pudl#1863
Update Existing ETL
Updating our transformations is a mix of software engineering and data wrangling tasks. We want to get the software somewhat stable and documented before involving lots of people working in parallel on unfamiliar data, and so we've broken this into 3 phases of work:
Phase 1: Software Design and Alpha Testing (@zaneselvans & @cmgosnell)
- Get to the point where the `fuel_ferc1` and `plants_steam_ferc1` tables are loading into the DB successfully, and provide functionality comparable to the old DBF, extending data coverage through 2021.
- Work on this phase will branch off of and be merged into `xbrl_steam`.
- Unit and integration tests should exist and pass, at which point we can merge the `xbrl_steam` branch into `xbrl_integration`.
- After this phase, documentation should be good enough, and the design stable enough, that we can bring in other people to work on additional tables that we have existing transforms for and familiarity with.
Issues
- [x] catalyst-cooperative/pudl#1739
- [x] catalyst-cooperative/pudl#1706
- [x] catalyst-cooperative/pudl#1738
- [x] catalyst-cooperative/pudl#1722
- [x] catalyst-cooperative/pudl#1707
- [x] catalyst-cooperative/pudl#1876
- [x] catalyst-cooperative/pudl#1853
- [x] catalyst-cooperative/pudl#1878
- [x] catalyst-cooperative/pudl#1877
- [x] catalyst-cooperative/pudl#1705
- [x] Merge catalyst-cooperative/pudl#1721
- [x] catalyst-cooperative/pudl#1924
- [x] Merge catalyst-cooperative/pudl#1962
Phase 2: Beta Testing w/ Familiar Tables (#1801)
- Refactor all of our existing FERC Form 1 transform functions to use the new framework, extending coverage to the 2021 XBRL data.
- This will include creating additional transform functions and parameterizations as needed to deal with more kinds of tables and data problems not encountered in the `fuel_ferc1` and `plants_steam_ferc1` tables.
- Based on feedback from this experience we may also make some changes to the transform framework.
- Work in this phase will branch off of and merge back into the `xbrl_integration` branch.
- When this phase is complete, and all unit & integration tests are passing, we will merge `xbrl_integration` into `dev` in the PUDL repository, and can make an initial release of this data publicly.
Issues
- [x] #1981 @cmgosnell
- [x] #1801
- [x] catalyst-cooperative/pudl#1802 @cmgosnell
- [x] catalyst-cooperative/pudl#1803 @cmgosnell
- [x] catalyst-cooperative/pudl#1820 @cmgosnell
- [x] catalyst-cooperative/pudl#1735 @aesharpe
- [x] catalyst-cooperative/pudl#1807 @zaneselvans & @cmgosnell
- [x] Merge catalyst-cooperative/pudl#1665
Phase 3: Integrate New FERC 1 Data & Methods
- Now that we've got the 2021 data integrated with and working as well as the 2020 data, we'll move on to expanding coverage to other tables, in both the earlier DBF data and new XBRL data.
- At this point the software design will hopefully be stable and able to deal with whatever new problems we encounter.
- We will also start integrating new data cleaning methods that we haven't previously employed in our published data.
- Work in this phase will branch off of and merge into `dev`.
- Which tables and data cleaning to prioritize will be guided by @arengel & @jrea-rmi, along the lines of #1568

Key:
- ⭐ = DBF-XBRL mapping is simple & worth delegating, table is ready to be taken on
- 🟧 = table is ready to be taken on, but may require new reshaping transforms
Issues
- [x] catalyst-cooperative/pudl#2040 @zaneselvans & @aesharpe
- [x] #2110 @zaneselvans
- [x] catalyst-cooperative/pudl#2012 @zaneselvans
- [x] catalyst-cooperative/pudl#2021 @zaneselvans
- [x] #2075
- [x] catalyst-cooperative/pudl#2014 @zaneselvans
- [x] #1804
- [x] catalyst-cooperative/pudl#1807 @zaneselvans & @cmgosnell
- [x] #1805 ⭐ @cmgosnell
- [x] #1806 @cmgosnell
- [x] #1808 (one-to-many, 4 XBRL tables) @zaneselvans
- [x] #1809
- [x] #1810 ⭐ @aesharpe
- [x] #1811 (one-to-many, 8 XBRL tables)
- [x] #1812
- [x] #1816 (one-to-one) 🟧 @zschira
- [x] #1817 (one-to-one) ⭐ @aesharpe
- [x] #1813 (many-to-one, 2 DBF tables) ⭐ 🟧 @cmgosnell
- [x] #1818
- [x] catalyst-cooperative/pudl#1820 @cmgosnell
- [x] #1819 @cmgosnell
- [x] #1822
2023 Q1 Issues
- [x] Transform other, lower priority RMI tables
- [x] #1815 (one-to-many, 2 XBRL tables)
- [x] #1821 (one-to-many, 8 XBRL tables)
- [x] #1823 (one-to-many, 9 XBRL tables)
- [x] #1824 (one-to-one) 🟧
- [ ] catalyst-cooperative/pudl#2016 @zaneselvans & @cmgosnell
- [ ] catalyst-cooperative/pudl#2015 @zaneselvans & @zschira
- [ ] #2076
- [ ] #2179
- [ ] Infrastructure Cleanup
- [ ] #2165
- [x] #2160
- [ ] #2180
- [ ] #471
- [ ] #2066
- [x] #1980
- [ ] #1825
- [ ] Metadata Documentation/Annotation
- [ ] #2074
- [ ] #2173
Icebox
- [ ] Integrate total labeler into small/hydro/pumped transform step @aesharpe
- [ ] catalyst-cooperative/pudl#1968 @zaneselvans
- [ ] Validation
- [ ] #2164
Given the need to integrate a whole second stream of data extraction and transformation into the FERC 1 process, and the fact that we have wanted to disentangle some of the extraction vs. transformation code (deferring some complicated steps until later in the process), it might be worth thinking about those tasks together, rather than shoehorning all of the XBRL data into the existing pipeline and then subsequently disassembling the now even more complicated pipeline.
It seems like after the data is in the PUDL DB, there will be another layer of work that has to happen to integrate it with the FERC-EIA record linkage (and any other downstream analysis), since some of the old columns (`record_id`) won't exist any more.
There are 4 combinations of data source and existing code, in the order we want to tackle them:
- PUDL tables from XBRL
- Hub tables from XBRL
- Hub tables from FoxPro
- PUDL tables from FoxPro (into our new transform architecture)
Transform step design thoughts:
- We make a generic table transformer w/ some `execute()` or coordinating method that has standard cleaning/renaming/reshaping/encode/etc. steps (sketched below).
- Make clean, nice lil table-specific transformers inheriting from the generic transformer for each of the new XBRL tables.
- Make a lil shell of the generic transformer for the old FoxPro tables that is mostly a wrapper for the existing transform functions.
- Make a concatenator.
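A minimal sketch of that pattern in plain Python (the class names, table name, and column mapping here are illustrative, not the actual design):

```python
import pandas as pd


class AbstractTableTransformer:
    """Generic transformer: standard cleaning steps coordinated by execute()."""

    table_name: str = ""

    def execute(self, raw_df: pd.DataFrame) -> pd.DataFrame:
        # Run the standard cleaning steps in a fixed order.
        df = self.rename_columns(raw_df)
        df = self.clean_strings(df)
        return df

    def rename_columns(self, df: pd.DataFrame) -> pd.DataFrame:
        return df  # no-op by default; subclasses override or parameterize

    def clean_strings(self, df: pd.DataFrame) -> pd.DataFrame:
        return df


class FuelFerc1(AbstractTableTransformer):
    """A lil table-specific transformer for one of the new XBRL tables."""

    table_name = "fuel_ferc1"

    def rename_columns(self, df: pd.DataFrame) -> pd.DataFrame:
        # Illustrative mapping only.
        return df.rename(columns={"fuel_qty_burned": "fuel_consumed_units"})
```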
Comments on Dagster:
I don't think it is possible for only part of our ETL to be implemented using Dagster. If the new FERC data was going to be an entirely separate pipeline then I'd say we should use dagster, but given that it will be intertwined with our existing ETL, I don't think we should.
However, we should keep some dagster concepts in mind while designing these FERC transformations. I would highly recommend reading the dagster concepts docs and our rough plan for applying these concepts to PUDL.
Each transform function will eventually be turned into a dagster "op" which is a node in the DAG. An op takes some upstream inputs, does some processing then passes the result to the next op in the DAG. Dagster is strict about what can be fed into ops:
An op only starts to execute once all of its inputs have been resolved. Inputs can be resolved in two ways:
- The upstream output that the input depends on has been successfully emitted and stored.
- The input was stubbed through config.
In other words, the input of an op has to be the output of another op OR configuration. Configuration can come in the form of Resources and Config which allow you to configure DAG wide resources in a yaml file. ETL wide abstractions like the Datastore, pudl_settings and dataset settings will likely become Resources or Config. This means ops won't accept python constants passed into functions. For example:
```python
import pandas as pd
from dagster import graph, op


@op
def extract():
    return pd.read_csv("poopy_data.csv")


@op
def fillnans(df, fill_value):
    return df.fillna(fill_value)


@graph
def run_etl():
    df = extract()
    fillnans(df, "Gas")
```
This won't work because the `fill_value` parameter is a python string, not the output of another op or a Dagster Resource or Config. You could fix this by configuring the `fill_value` parameter using Dagster Configs.
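For instance, a minimal sketch of that fix for the example above (the `config_schema` shape and key names are mine, not settled PUDL conventions):

```python
import pandas as pd
from dagster import job, op


@op
def extract():
    return pd.read_csv("poopy_data.csv")


@op(config_schema={"fill_value": str})
def fillnans(context, df):
    # The fill value now arrives via run configuration instead of a constant.
    return df.fillna(context.op_config["fill_value"])


@job
def run_etl():
    fillnans(extract())


# The constant moves out of the DAG definition and into run config:
run_etl.execute_in_process(
    run_config={"ops": {"fillnans": {"config": {"fill_value": "Gas"}}}}
)
```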
Also, these input rules only apply to functions you think should be included in the DAG. For example, helper functions probably won't be included in the DAG.
```python
def fillnans(df, fill_value):
    return df.fillna(fill_value)


@op
def transform(df):
    df = fillnans(df, "Gas")
    return df
```
The dagster refactor and xbrl stuff will probably be happening in parallel so Zane and I will probably have more guidance as we refactor!
Some more thoughts on Dagster:
You can create nested graphs in Dagster to allow for some logic hierarchy. @zaneselvans and I envision incrementally applying dagster graphs to our ETL. There are multiple levels we've identified:
1. Turn our `_etl_{dataset}()` functions into ops and construct a graph. This is a very simple DAG but would enable us to run the ETLs in separate processes (see the sketch after this list).
2. Turn our ETL functions into graphs where the ops are the extract and transform steps.
3. ETL functions are graphs, E and T steps are graphs, and individual table transforms are ops.
4. ETL functions are graphs, E and T steps are graphs, individual table transforms are graphs, and individual reusable cleaning functions are ops.
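A sketch of level 1, with hypothetical ops standing in for the real `_etl_{dataset}()` functions:

```python
from dagster import graph, op


@op
def etl_ferc1():
    """Stand-in for the existing FERC 1 ETL function."""


@op
def etl_eia():
    """Stand-in for the existing EIA ETL function."""


@graph
def pudl_etl():
    # No data dependencies between these ops, so the multiprocess
    # executor is free to run them in separate processes.
    etl_ferc1()
    etl_eia()
```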
When we're writing XBRL functions we don't need to think much about the first two levels because we are only making changes within the `ferc1.transform()` function.
Option 3
I wrote up a little prototype for option 3. It's ok but it seems awkward to have to define all of the transformed tables in the transform graph's outs param:
https://github.com/catalyst-cooperative/pudl/blob/aa36c91766e70eb3a178232655f7c79fb60b6a87/notebooks/work-in-progress/ferc_dagster_prototype.py#L25-L33
This is the recommended method for returning multiple outputs from graphs. I'm curious if it is possible for graphs to treat dictionaries as a single output instead of multiple.
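For reference, the declared-outs pattern looks roughly like this (a sketch; the op and table names are illustrative rather than taken from the prototype):

```python
from dagster import GraphOut, graph, op


@op
def transform_fuel(raw):
    return raw  # placeholder for the real fuel_ferc1 transform


@op
def transform_steam(raw):
    return raw  # placeholder for the real plants_steam_ferc1 transform


# Every transformed table has to be declared as a named GraphOut up front.
@graph(out={"fuel_ferc1": GraphOut(), "plants_steam_ferc1": GraphOut()})
def transform(raw):
    return {
        "fuel_ferc1": transform_fuel(raw),
        "plants_steam_ferc1": transform_steam(raw),
    }
```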
Satisfying option three shouldn't be that difficult because we can use the existing structure in `transform.ferc1`. I do have a couple of questions:
- Do the table transform functions depend on previously cleaned tables? It looks like `plants_steam()` depends on the `fuel_ferc1()` table for `plant_id` assignment.
- Do the transform functions depend on multiple tables in `ferc1_raw_dfs`? This might not be the case in Form 1 land but I'm pretty sure final EIA transform functions rely on multiple raw tables to produce a single cleaned table.
Option 4
Option 4 is a bit trickier because we want generic cleaning functions that can be parameterized for each table but do not have to use dagster config. Dagster recommends using Op factories for this situation. They work but feel a little kludgy. Here is an example of a generic transform op:
```python
import pandas as pd
from dagster import job, op


def rename_columns_factory(
    name="default_name",
    ins=None,
    column_mapping=None,
    **kwargs,
):
    """Create an op that renames dataframe columns.

    Args:
        name (str): The name of the new op.
        ins (Dict[str, In]): Any Ins for the new op. Default: None.
        column_mapping (dict): Mapping of old to new column names.

    Returns:
        function: The new op.
    """

    @op(name=name, ins=ins, **kwargs)
    def rename_df(context, df):
        context.log.info(f"\n The DataFrame: {df}\n")
        context.log.info(f"\n The Op Ins: {context.op_def.ins}\n")
        t_df = df.rename(columns=column_mapping)
        context.log.info(f"\n The Transformed DataFrame: {t_df}\n")
        return t_df

    return rename_df


@op
def extract():
    return pd.DataFrame([1, 2], columns=["col"])


@job()
def etl():
    df = extract()
    column_mapping = {"col": "column"}
    transformed_df = rename_columns_factory(column_mapping=column_mapping)(df)


etl.execute_in_process()
```
`rename_columns_factory()` parametrizes the inner function `rename_df()`, which is an op. It's kind of mind-bending because there is a lot of function wrapping / decorating going on here. If we like this pattern, this is what a dagster-friendly version without dagster abstractions would look like:
```python
import pandas as pd


def rename_columns_factory(
    column_mapping=None,
):
    """
    Args:
        column_mapping: Dict of column rename mappings.

    Returns:
        function: the rename_df function.
    """

    def rename_df(df):
        print(f"\n The DataFrame: {df}\n")
        t_df = df.rename(columns=column_mapping)
        print(f"\n The Transformed DataFrame: {t_df}\n")
        return t_df

    return rename_df


def extract():
    return pd.DataFrame([1, 2], columns=["col"])


def etl():
    df = extract()
    column_mapping = {"col": "column"}
    transformed_df = rename_columns_factory(column_mapping=column_mapping)(df)
    return transformed_df


etl()
```
An open question here is where we want to store the transform parameters.
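One possibility, sketched here with Pydantic (the model, table, and column names are illustrative; this anticipates the Pydantic idea discussed further down):

```python
from pydantic import BaseModel


class RenameColumns(BaseModel):
    """Parameters for a generic column-renaming cleaning function."""

    columns: dict[str, str] = {}


class TableTransformParams(BaseModel):
    """All cleaning parameters for a single table."""

    rename_columns: RenameColumns = RenameColumns()


# Hypothetical registry of static transform parameters, keyed by table ID.
TRANSFORM_PARAMS: dict[str, TableTransformParams] = {
    "fuel_ferc1": TableTransformParams(
        rename_columns=RenameColumns(
            columns={"fuel_qty_burned": "fuel_consumed_units"}
        )
    ),
}
```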
This is kind of an out-there idea, but could we develop our own op factory decorator that makes creating op factories a little less verbose? Something like:
```python
from functools import partial, wraps

from dagster import op


def op_factory(name="default_name", ins=None, **kwargs):
    def decorator(op_func):
        @wraps(op_func)
        def wrapper(**op_kwargs):
            # Bind the parameters, then wrap the partially-applied function as an op.
            return op(name=name, ins=ins, **kwargs)(partial(op_func, **op_kwargs))

        return wrapper

    return decorator
```
I'm not 100% sure that this would work as is (decorators are confusing), but the idea here is that it could be applied like:
```python
@op_factory
def rename_df_factory(context, df, column_mapping=None):
    context.log.info(f"\n The DataFrame: {df}\n")
    context.log.info(f"\n The Op Ins: {context.op_def.ins}\n")
    t_df = df.rename(columns=column_mapping)
    context.log.info(f"\n The Transformed DataFrame: {t_df}\n")
    return t_df


@op
def extract():
    return pd.DataFrame([1, 2], columns=["col"])


@job()
def etl():
    df = extract()
    column_mapping = {"col": "column"}
    transformed_df = rename_df_factory(column_mapping=column_mapping)(df)
```
Here, when you call `rename_df_factory` with `column_mapping` as a keyword arg, the decorator will create a partially specified method, then apply the `op` decorator to that method, and return the created op.
I updated the FERC cleaning dagster prototype so we don't have to pass the `transformed_dfs` dictionary between table cleaning functions:
https://github.com/catalyst-cooperative/pudl/blob/52370c2b329acd06c6f74ca04ec8f35c970ccf13/notebooks/work-in-progress/ferc_dagster_prototype.py#L59-L76
Passing individual table dependencies makes it easy to understand which tables depend on one another, instead of passing around the entire `transformed_dfs` dictionary. @cmgosnell let me know if this answers some of your questions in the `xbrl_steam` PR.
It creates this fun dagster graph:
Vocab:
- Table transform functions are responsible for cleaning one table from a data source. The functions typically accept some raw data and return a cleaned dataframe.
- Generic cleaning functions are specific steps within table transform functions, for example renaming columns, filling NA values, backfills... These functions typically accept a dataframe to clean and some parameters specifying how the data should be cleaned, for example a column rename dictionary or a fillna value.
Background
We want to develop a way to convert generic cleaning functions to dagster ops so they are documented in the DAG and could be validated using dagster-pandera.
There are two ways to configure ops: `ConfigSchema`s and op factories. Here are a couple of options for Dagster-friendly generic transform functions:
Option 1: Use Dagster `ConfigSchema`s for each generic transform function
Description of `ConfigSchema` from the dagster docs:
> Various Dagster entities (ops, assets, resources) can be individually configured. When launching a job that executes (ops), materializes (assets), or instantiates (resources) a configurable entity, you can provide run configuration for each entity. Within the function that defines the entity, you can access the passed-in configuration off of the context. Typically, the provided run configuration values correspond to a configuration schema attached to the op/asset/resource definition. Dagster validates the run configuration against the schema and proceeds only if validation is successful.
>
> A common use of configuration is for a schedule or sensor to provide configuration to the job run it is launching. For example, a daily schedule might provide the day it's running on to one of the ops as a config value, and that op might use that config value to decide what day's data to read.
Configs are mostly used for parameters that are changed between runs, for example a `pudl_out` path or dataset processing settings. We initially did not entertain this option because these generic cleaning functions will be used dozens of times and the parameters will not change for different runs. Given we need to store all of the transformation parameters somewhere, we realized it might make sense to set a default dagster config with all of the static generic transform function metadata.
You can view the pseudo code here.
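A minimal sketch of the idea, assuming a `rename_columns` op with a `config_schema` (the schema shape and mapping are illustrative):

```python
import pandas as pd
from dagster import job, op


@op
def extract():
    return pd.DataFrame({"col": [1, 2]})


@op(config_schema={"column_mapping": dict})
def rename_columns(context, df):
    return df.rename(columns=context.op_config["column_mapping"])


# The static transform metadata becomes the job's default run config,
# so nothing needs to be supplied at launch time.
default_config = {
    "ops": {"rename_columns": {"config": {"column_mapping": {"col": "column"}}}}
}


@job(config=default_config)
def transform_table():
    rename_columns(extract())


transform_table.execute_in_process()
```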
Pros
- Can overwrite transform parameters at runtime. (Is this desirable?)
- Don't have to use dagster factories, which might confuse some developers.
Cons
- Dagster doesn't allow for duplicate op names. For example, the `transform_steam` graph has an op named `rename_columns` and an op named `rename_columns_2` because `rename_columns()` is called for both the dbf and xbrl dataframes. This prevents us from understanding which config is associated with xbrl and which with dbf.
- Converting our transform metadata to a dagster config might be tricky.
- The generic cleaning functions might be hard to test because the parameters come from the dagster `context` keyword.
Option 2: Pass parameters from transform metadata classes into op factories
This option grabs the transform metadata for a given table in a table transform function and passes the relevant data to an op factory of a generic cleaning function.
You can view the pseudo-code here.
Pros
- Easier to test the generic cleaning functions because the `context` keyword isn't needed.
- Can specify the op name instead of having the auto-incrementing suffix for duplicate ops.
Cons
- Need to specify the name of the transform op. The pseudo-code manually specifies the name, but it could probably be programmatically generated, given the metadata.
- Need to create a factory for every cleaning function. We could probably create a decorator that turns a python function into a parameterizable op.
Option 3: Lookup relevant transform metadata within op factory
This option grabs the relevant transform parameters from the transform metadata structure given the table name.
You can view the pseudo-code here.
Pros
- Only need to pass the table name to the op factory.
Cons
- Increases coupling. The op factories need to know what the metadata classes look like.
Option 4: Use transformer class instead of dagster ops for generic cleaning functions
We might get diminishing returns applying dagster deeper into our call stack. Are the UI documentation and potential pandera validation worth the complexity of converting generic cleaning functions into ops? Most generic cleaning functions will be applied sequentially so there won't be complex dependencies to manage.
@bendnorman I think I understand these options and it seems like option 2 is our best bet, at least to move things forward before we fully integrate dagster as our orchestrator. Does this sound right to you?
Yeah, I think option 2 or 4 are the front runners. All options rely on a non-dagster data structure to store cleaning metadata. If we went with option 2, y'all could just create a library of generic cleaning functions that can be converted to ops. If we go with option 4, the cleaning functions would likely be methods in a transformer class that couldn't be easily converted to ops.
I'm kind of cooling off on converting cleaning functions to dagster ops:
- One reason for converting cleaning functions to dagster ops is to validate data after each cleaning operation using dagster-pandera. How important is this to us? I think unit testing cleaning functions and using dagster-pandera to validate the outputs of table transform ops should be adequate. We also haven't talked about how to store and access the hundreds of pandera schemas for each cleaning step of every PUDL table.
- Cleaning functions could be documented in the dagit UI and parallelized if they are ops. A majority of cleaning functions are applied sequentially, so there aren't many complex dependencies or opportunities to parallelize. It would be nice to view all of the cleaning functions applied to a table in Dagit, but users could also just visit the source code. I posted about this design question in the dagster community, and one engineer at dagster thought it might make more sense to keep our cleaning functions as pure python functions or methods.
- It seems like many contributors want to tweak how tables are transformed. If we kept the cleaning functions as pure python, these contributors wouldn't need to learn dagster concepts. However, people contributing new datasets or tables would need to wrap functions in dagster abstractions.
- Storing transform metadata and logic in classes could reduce coupling. This way, table transform functions, and cleaning functions don't need to know about the structure of the metadata.
I'm excited to use the integrated dataframe schemas provided by the Pandera/Dagster integration, but there are going to be a whole lot of table-level transformations and storing schemas for every one of them is probably overkill. I think we're going to be persisting more intermediate steps with the new setup than we were previously, and probably it makes sense to just check the schemas of the dataframes that we're persisting, since we'll need to think about those schemas for the database anyway.
It seems like something analogous to Option 4 could also be implemented by composing functions rather than using classes to store the transformation logic. The parameterizations of the transformations would be stored in the Pydantic classes, and then they would be consumed as arguments by individual transform functions. Only the top-level table-specific transform functions (the functions containing all of the operations to be applied between persisted versions of a given table, like `plants_steam_ferc1`) would be Dagster ops. They would know what table they pertain to, and be able to call both generic cleaning functions (looking up their parameters in the Pydantic data structure using the table ID) and table-specific functions, which could be defined within the table-level cleaning function or declared as `_private_functions()` elsewhere in the module.
I'm not sure which of these would feel cleaner, and will write some pseudo-code trying them both out.
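For illustration, a rough sketch of the function-composition version (the parameter registry, table, column, and helper names are all hypothetical):

```python
import pandas as pd
from dagster import op
from pydantic import BaseModel


class RenameColumns(BaseModel):
    columns: dict[str, str] = {}


# Hypothetical static parameter registry, keyed by table ID.
TRANSFORM_PARAMS = {
    "plants_steam_ferc1": {
        "rename_columns": RenameColumns(columns={"plant_nm": "plant_name"}),
    },
}


def rename_columns(df: pd.DataFrame, params: RenameColumns) -> pd.DataFrame:
    # Generic cleaning function: pure python, parameterized by a Pydantic object.
    return df.rename(columns=params.columns)


def _fix_steam_capacity(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical table-specific helper, private to the module.
    return df


@op
def plants_steam_ferc1(raw_df):
    # Only the top-level, table-specific transform is a Dagster op. It knows
    # its table ID, looks up its parameters, and composes the cleaning functions.
    params = TRANSFORM_PARAMS["plants_steam_ferc1"]
    df = rename_columns(raw_df, params["rename_columns"])
    return _fix_steam_capacity(df)
```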