
Missing and incorrect latitude / longitude data in `plants_entity_eia`

Open grgmiller opened this issue 1 year ago • 5 comments

Describe the bug

Several related issues:

  • It appears that somewhere in the PUDL pipeline, latitude (but not longitude) values are being dropped from plants_entity_eia and other tables that contain lat/long data. From manually inspecting the raw 2022 EIA-860 plants file, a small number of plants are missing both latitude and longitude, but none are missing only latitude.
  • It also appears that there are several plants where the sign of the longitude is being flipped from negative to positive (which locates these plants in China)
  • One plant (plant 61445) is being assigned a nonsense longitude of -188, which is outside the valid range of -180 to 180
  • There are a handful of plants that are assigned seemingly made-up coordinates in the middle of the Atlantic Ocean, generally around (42, -42)

Bug Severity

Medium: With some effort, I can work around the bug.

To Reproduce

I downloaded data from https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/v2023.12.01/pudl.sqlite.gz, and am loading the plants_entity_eia table using `pd.read_sql("SELECT plant_id_eia, latitude, longitude FROM plants_entity_eia", PUDL_ENGINE)`

213 plants are missing latitudes.

16 plants have coordinates farther east than the east coast of the US.

One plant has a nonexistent coordinate.

Spot checking these plants revealed that they all appear to have valid/non-missing coordinates in the 2022 EIA-860 plants file.
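The three checks above can be sketched roughly as follows. This is a minimal illustration, not the exact code used for the counts: the rows are made-up stand-ins for the result of the `pd.read_sql` query, and the -65 degree cutoff for "east of the east coast" is an assumed value.

```python
import pandas as pd

# Illustrative rows only. In practice `plants` would be the result of the
# pd.read_sql(...) query above; plant IDs 1-4 and their values are made up.
plants = pd.DataFrame(
    {
        "plant_id_eia": [1, 2, 3, 4],
        "latitude": [43.45, None, 35.00, 42.00],
        "longitude": [-83.51, -97.20, 101.30, -188.00],
    }
)

# Latitude missing while longitude is present.
missing_lat = plants[plants["latitude"].isna() & plants["longitude"].notna()]

# Longitude outside the valid [-180, 180] range (NaN compares as False).
out_of_range = plants[plants["longitude"].abs() > 180]

# East of an assumed cutoff, roughly the eastern edge of the contiguous US.
# Adjust the cutoff if plants in US territories are in scope.
EAST_CUTOFF = -65.0
too_far_east = plants[plants["longitude"] > EAST_CUTOFF]

print(len(missing_lat), len(out_of_range), len(too_far_east))
```

Each filter returns the offending rows, so the plant IDs can be compared directly against the raw EIA-860 file.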

Expected behavior

I expect the lat/long data in plants_entity_eia to match the lat/long data in the most recent raw EIA-860 table for which lat/long data is available.
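As a sketch of that expectation (not a description of how PUDL currently harvests entity values), the most recent row with both coordinates present could be picked per plant. The yearly table and the `report_date` column name below are assumptions made for illustration, and `latest_coordinates` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical yearly records. The layout and the `report_date` column name
# are assumptions, and the plant IDs and values are made up.
yearly = pd.DataFrame(
    {
        "plant_id_eia": [10, 10, 10, 20, 20],
        "report_date": pd.to_datetime(
            ["2020-01-01", "2021-01-01", "2022-01-01", "2021-01-01", "2022-01-01"]
        ),
        "latitude": [40.0, 40.5, None, 30.0, 30.0],
        "longitude": [-100.0, -100.5, None, -90.0, -90.0],
    }
)


def latest_coordinates(df: pd.DataFrame) -> pd.DataFrame:
    """Return each plant's most recent row that has both coordinates."""
    # Drop rows missing either value so latitude and longitude are always
    # taken from the same report rather than chosen column by column.
    complete = df.dropna(subset=["latitude", "longitude"])
    latest = complete.sort_values("report_date").groupby("plant_id_eia").tail(1)
    return latest.set_index("plant_id_eia")[["latitude", "longitude"]].sort_index()


best = latest_coordinates(yearly)
print(best)
```

In this toy data, plant 10 has no coordinates in its latest year, so the sketch falls back to the most recent year that does.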

Software Environment?

  • Operating System: Windows
  • Python version and distribution: Python 3.11.4
  • How did you install PUDL? N/A

grgmiller avatar Dec 08 '23 23:12 grgmiller

Just wanted to bump this issue: we're having some issues with bad timezone data due to bad lat/long values (https://github.com/catalyst-cooperative/pudl/issues/1192). We're going to try and patch this on our end, but it would be helpful if this could be fixed in pudl as well!

grgmiller avatar May 29 '24 18:05 grgmiller

Linking this to https://github.com/catalyst-cooperative/pudl/issues/971 and https://github.com/catalyst-cooperative/pudl/issues/402

grgmiller avatar Jul 10 '24 23:07 grgmiller

We are also noticing this issue for Pegasus Wind (plant ID 61916), which in EIA-860 is listed with a coordinate of:

latitude             43.452003
longitude            -83.50721

Which correctly places it in Michigan. However, for some reason in PUDL (v2024.5.0), the coordinates are changed to:

latitude         43.452003
longitude       -111.55111

Which puts it in Idaho.

Not sure why this is happening. Maybe it has to do with inconsistent coordinates being reported? @zaneselvans

grgmiller avatar Jul 10 '24 23:07 grgmiller

That is a big difference! Not sure what's happening there either.

@ktehranchi mentioned he might be interested in taking on this issue more generally and implementing a more principled method of choosing a best lat/lon point that actually treats the lat/lon as a geopoint. See also #1280 #656
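One way to treat the pair as a single point, sketched here under assumptions rather than as a proposed implementation, is to measure the great-circle distance between a plant's reported points and flag plants whose reports disagree by more than some tolerance. `haversine_km` and the 10 km threshold are hypothetical; the two coordinates are the ones reported above for plant 61916.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km, assuming a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(a))


# The two coordinates quoted in this thread for plant 61916.
eia_point = (43.452003, -83.50721)
pudl_point = (43.452003, -111.55111)
distance = haversine_km(*eia_point, *pudl_point)

# Hypothetical tolerance for deciding that reported points disagree.
MAX_SPREAD_KM = 10.0
inconsistent = distance > MAX_SPREAD_KM
print(round(distance), inconsistent)
```

For plant 61916 the two points are more than 2,000 km apart, so a check like this would catch the discrepancy, whereas comparing latitude and longitude independently would see a matching latitude.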

zaneselvans avatar Jul 11 '24 00:07 zaneselvans

After diving into the raw EIA-860 tables a bit more, it looks like in some of the earlier years (e.g. 2017) a longitude of -111 was incorrectly reported for plant 61916, so maybe pudl is taking the first value as a default?

However, it looks like even in the yearly plant output table in pudl, -111 is reported for all years, even though this was fixed in some of the later years.

grgmiller avatar Jul 11 '24 15:07 grgmiller