Open · zaneselvans opened this issue 9 months ago · 3 comments
Overview
The etl_full.yml settings for the NREL ATB included the years 2019 and 2020, which aren't yet working (see #3576), and Pydantic's validation of the settings correctly failed. I went ahead and removed those years from the settings file on main so we can get another attempted build tonight.
However, this failure doesn't happen locally when I try to run the full ETL with the old settings, which is weird.
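For reference, the kind of check that catches this looks roughly like the following. This is a hypothetical, simplified sketch: the real validation lives in PUDL's settings models, and the class name, field name, and working-year values here are illustrative only.

```python
from pydantic import BaseModel, field_validator

# Hypothetical set of working partitions; in PUDL these come from the
# data source metadata, not a hard-coded constant.
WORKING_YEARS = {2021, 2022, 2023}


class NrelAtbSettings(BaseModel):
    """Illustrative settings model, not the actual PUDL class."""

    years: list[int]

    @field_validator("years")
    @classmethod
    def _years_must_be_working(cls, years: list[int]) -> list[int]:
        # Reject any requested year that isn't a known working partition.
        bad = sorted(set(years) - WORKING_YEARS)
        if bad:
            raise ValueError(f"Requested non-working NREL ATB years: {bad}")
        return years


# With the old etl_full.yml years this raises a ValidationError:
# NrelAtbSettings(years=[2019, 2020, 2021])
```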
While investigating this, I was confused by the NREL ATB extraction, which doesn't seem to make any use of the settings or datastore. So maybe this is only working because it's relying on defaults that aren't informed by the ETL settings at all?
- The raw_nrelatb__data asset claims to require the datastore and dataset_settings resources, but doesn't actually make use of them.
- The NREL ATB Extractor claims to require a Datastore as input, but doesn't receive one.

But looking at the other tabular extractors, they also don't seem to use any resources (even though they obviously must), so maybe there is a bunch of magic happening in the background? Can we document what is going on?
It looks like there's a bit of stale documentation in the extraction system, with a mix of references to Excel and CSV files in places where they are not appropriate.
```python
class Extractor(ParquetExtractor):
    """Extractor for NREL ATB."""

    def __init__(self, *args, **kwargs):
        """Initialize the module.

        Args:
            ds (:class:`datastore.Datastore`): Initialized datastore.
        """
        self.METADATA = GenericMetadata("nrelatb")
        super().__init__(*args, **kwargs)


raw_nrelatb__all_dfs = raw_df_factory(Extractor, name="nrelatb")


@asset(
    required_resource_keys={"datastore", "dataset_settings"},
)
def raw_nrelatb__data(raw_nrelatb__all_dfs):
    """Extract raw NREL ATB data from annual parquet files to one dataframe.

    Returns:
        An extracted NREL ATB dataframe.
    """
    return Output(value=raw_nrelatb__all_dfs["data"])
```
Thanks for catching the non-working partitions in the full settings! I'm also confused why the validations didn't fail for me locally. After changing the working partitions in the sources I was able to re-run the full extraction and only get the working years. That's weird for sure.
A lot of the magic happens via extract.extractor.raw_df_factory, which runs extract.extractor.partition_extractor_factory, which uses the datastore and the dataset_settings. I was mirroring the EIA 176 extract, which requires those two as inputs to the asset but doesn't pass them around; instead they are accessed within raw_df_factory.
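Roughly, the pattern looks something like this. This is a simplified sketch only: the actual code lives in pudl.extract.extractor, and the op name, parameters, and extractor calls here are illustrative assumptions rather than the real implementation.

```python
from dagster import OpExecutionContext, op


def partition_extractor_factory(extractor_cls, name: str):
    """Return a Dagster op that extracts one partition of raw data.

    Illustrative only: the real factory does more bookkeeping, but the key
    point is that the datastore and dataset_settings are pulled off the op's
    execution context rather than being passed in as function arguments.
    """

    @op(
        name=f"extract_single_{name}_partition",
        required_resource_keys={"datastore", "dataset_settings"},
    )
    def extract_single_partition(context: OpExecutionContext, part):
        ds = context.resources.datastore  # injected Dagster resource
        settings = context.resources.dataset_settings  # injected Dagster resource
        # Hypothetical usage: hand the datastore to the extractor and pull
        # only the partition that the validated settings say is working.
        extractor = extractor_cls(ds)
        return extractor.extract(part)

    return extract_single_partition
```

That's why the asset itself never appears to touch the resources it declares: the resource access happens inside the factory-built ops, not in the asset function.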
I agree in general that the extractor setup needs some documentation cleanup and maybe some higher-level explanation somewhere.
Unless I'm missing something, I can't reproduce this problem locally: trying to extract just the raw_nrelatb asset group with the following Dagster config failed as expected. As discussed, there is a fair bit happening in the pudl.extract.extractor module. I'm all for adding more documentation to clarify what's going on there, but as far as I'm concerned nothing seems fundamentally broken, so I'm inclined to close this issue.