pudl
Missing and incorrect latitude / longitude data in `plants_entity_eia`
Describe the bug
Several related issues:
- It appears that somewhere in the pudl pipeline, the latitude (but not the longitude) is being dropped from `plants_entity_eia` and other tables that contain lat/long data. As far as I can tell from manually inspecting the raw 2022 EIA-860 plants file, a small number of plants are missing both latitude and longitude, but none are missing only latitude.
- It also appears that there are several plants where the sign of the longitude is being flipped from negative to positive (which locates these plants in China).
- One plant (plant 61445) is being assigned a nonsense longitude of -188.
- A handful of plants are assigned seemingly made-up coordinates in the middle of the Atlantic Ocean, generally around (42, -42).
Bug Severity
Medium: With some effort, I can work around the bug.
To Reproduce
I downloaded data from https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/v2023.12.01/pudl.sqlite.gz, and am loading the `plants_entity_eia` table using `pd.read_sql("SELECT plant_id_eia, latitude, longitude FROM plants_entity_eia", PUDL_ENGINE)`.
213 plants are missing latitudes.
16 plants have coordinates farther east than the east coast of the US.
One plant has a nonexistent coordinate.
Spot checking these plants revealed that they all appear to have valid/non-missing coordinates in the 2022 EIA-860 plants file.
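For anyone who wants to rerun these checks, here is a minimal sketch of the three filters. It runs against a toy in-memory SQLite table rather than the real `pudl.sqlite`, so every row below is made up. The `EAST_LIMIT` cutoff, the check names, and the `flag_plants` helper are my own assumptions for illustration and are not part of PUDL.

```python
import sqlite3

# Toy stand-in for pudl.sqlite. Every row here is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plants_entity_eia "
    "(plant_id_eia INTEGER, latitude REAL, longitude REAL)"
)
conn.executemany(
    "INSERT INTO plants_entity_eia VALUES (?, ?, ?)",
    [
        (1, 40.0, -100.0),  # plausible coordinates
        (2, None, -95.0),   # latitude missing, longitude present
        (3, 35.0, 110.0),   # longitude sign looks flipped
        (4, 45.0, -188.0),  # longitude outside [-180, 180]
        (5, None, None),    # both values missing
    ],
)

# Assumed rough cutoff for "east of the US east coast"; adjust as needed.
EAST_LIMIT = -66.0

# Hypothetical check names mapped to SQL WHERE clauses.
CHECKS = {
    "latitude_only_missing": "latitude IS NULL AND longitude IS NOT NULL",
    "east_of_us": f"longitude > {EAST_LIMIT}",
    "longitude_out_of_range": "longitude < -180 OR longitude > 180",
}


def flag_plants(conn, checks=CHECKS):
    """Return {check name: sorted plant_id_eia values matching that check}."""
    flagged = {}
    for name, where in checks.items():
        rows = conn.execute(
            "SELECT plant_id_eia FROM plants_entity_eia "
            f"WHERE {where} ORDER BY plant_id_eia"
        ).fetchall()
        flagged[name] = [row[0] for row in rows]
    return flagged


flagged = flag_plants(conn)
```

To run the same checks against the real database, point `sqlite3.connect` at the unzipped `pudl.sqlite` file instead of `":memory:"` and skip the toy inserts.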
Expected behavior
I expect the lat/long data in plants_entity_eia to match the lat/long data in the most recent raw EIA-860 table for which lat/long data is available.
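To make the expectation concrete, here is a rough sketch of "take the most recent year in which both latitude and longitude are reported". The `latest_coordinates` function name and the tuple layout are assumptions for illustration, the sample records are invented, and this is not how PUDL currently picks values.

```python
def latest_coordinates(records):
    """Pick, per plant, the lat/long pair from the most recent complete report.

    `records` is an iterable of (plant_id, report_year, latitude, longitude)
    tuples, with None marking a missing value. Years that are missing either
    coordinate are skipped entirely.
    """
    best = {}
    for plant_id, year, lat, lon in records:
        if lat is None or lon is None:
            continue
        if plant_id not in best or year > best[plant_id][0]:
            best[plant_id] = (year, lat, lon)
    return {pid: (lat, lon) for pid, (_, lat, lon) in best.items()}


# Invented sample data: plant 1 is corrected in a later year,
# plant 2 stops reporting coordinates in its latest year.
sample = [
    (1, 2017, 40.0, -111.0),
    (1, 2022, 40.0, -83.0),
    (2, 2021, 30.0, -90.0),
    (2, 2022, None, None),
]
expected_coords = latest_coordinates(sample)
```

With this rule, plant 1 gets its 2022 pair and plant 2 falls back to its 2021 pair, since 2022 has no usable coordinates.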
Software Environment?
- Operating System Windows
- Python version and distribution Python 3.11.4
- How did you install PUDL? N/A
Additional context
Just wanted to bump this issue: we're having some issues with bad timezone data due to bad lat/long values (https://github.com/catalyst-cooperative/pudl/issues/1192). We're going to try and patch this on our end, but it would be helpful if this could be fixed in pudl as well!
Linking this to https://github.com/catalyst-cooperative/pudl/issues/971 and https://github.com/catalyst-cooperative/pudl/issues/402
We are also noticing this issue for Pegasus Wind (plant ID 61916), which in EIA-860 is listed with a coordinate of:
latitude 43.452003
longitude -83.50721
Which correctly places it in Michigan. However, for some reason in PUDL (v2024.5.0), the coordinates are changed to:
latitude 43.452003
longitude -111.55111
Which puts it in Idaho.
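To put a number on that shift, here is a small great-circle distance sketch using the two coordinate pairs quoted above. The `haversine_km` helper and the mean Earth radius of 6371 km are my own choices for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


eia_point = (43.452003, -83.50721)    # as listed in EIA-860
pudl_point = (43.452003, -111.55111)  # as reported by PUDL v2024.5.0
offset_km = haversine_km(*eia_point, *pudl_point)
```

The two points come out more than 2,000 km apart, so this is not a rounding or precision problem.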
Not sure why this is happening. Maybe it has to do with inconsistent coordinates being reported? @zaneselvans
That is a big difference! Not sure what's happening there either.
@ktehranchi mentioned he might be interested in taking on this issue more generally and implementing a more principled method of choosing a best lat/lon point that actually treats the lat/lon as a geopoint. See also #1280 #656
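As one illustration of what treating lat/lon as a geopoint could look like, the sketch below keeps each reported (latitude, longitude) pair together and picks the medoid: the reported point with the smallest total great-circle distance to all other reports. This is only an assumed approach for discussion. It is not PUDL's current logic and not necessarily what was being proposed, and the `haversine_km` and `medoid` names and the sample reports are made up.

```python
import math


def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlam = math.radians(q[1] - p[1])
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


def medoid(points):
    """Return the reported point closest, in total distance, to all reports."""
    if not points:
        raise ValueError("need at least one reported point")
    return min(points, key=lambda p: sum(haversine_km(p, q) for q in points))


# Invented yearly reports for one plant: one outlier, three consistent years.
reports = [
    (43.45, -111.55),
    (43.45, -83.51),
    (43.45, -83.51),
    (43.45, -83.51),
]
best_point = medoid(reports)
```

Because the pair is chosen as a unit, this can never return a latitude from one report combined with a longitude from another. It still depends on most years being right, so a plant that reports the wrong location in most years would still get the wrong point.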
After diving into the raw EIA-860 tables a bit more, it looks like in some of the earlier years (e.g. 2017) a longitude of -111 was incorrectly reported for plant 61916, so maybe pudl is taking the first value as a default?
However, even the yearly plant output table in pudl reports -111 for all years, even though the raw EIA-860 value was corrected in some of the later years.