Backfill Missing Data That Wasn't Addressed by Data Pipeline Hotfix
Overview
We need to backfill data from Socrata so that there are no missing data ranges when querying a custom date range.
NOTE: It would be useful to have the data pipeline automatically update the new dataset at the beginning of the year. Currently, the new dataset needs to be added manually (see original issue).
Action Items
- [ ] See comment below
Resources/Instructions
Original Issue:
- https://github.com/hackforla/311-data/issues/1165#issue-1102870912
If I'm reading the code correctly, the fix may be quite simple. The data pipeline determines the time of the last update, asks Socrata for all the data since that update, and loads that data into the Postgres DB. When it loads the data into Postgres, it handles each row in one of two ways:
- Insert new rows (i.e., if the rowkey to load does not exist in Postgres, insert it)
- Update existing rows (i.e., if the rowkey that was to be inserted already exists in Postgres, update that row)
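To make the insert-or-update behavior concrete, here is a minimal sketch of an upsert. It is not the pipeline's actual loading code, which may well do this in separate insert and update steps. The table `requests` and its columns are hypothetical, and it runs against in-memory SQLite so it is self-contained; Postgres accepts the same `ON CONFLICT ... DO UPDATE` clause.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
UPSERT_SQL = """
INSERT INTO requests (row_key, status, updated_at)
VALUES (?, ?, ?)
ON CONFLICT (row_key) DO UPDATE SET
    status = excluded.status,
    updated_at = excluded.updated_at
"""


def upsert_rows(conn, rows):
    """Insert rows whose key is new; update rows whose key already exists."""
    with conn:  # commit on success, roll back on error
        conn.executemany(UPSERT_SQL, rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE requests ("
        "row_key TEXT PRIMARY KEY, status TEXT, updated_at TEXT)"
    )
    # First load: the key is new, so the row is inserted.
    upsert_rows(conn, [("1-100", "Open", "2022-01-01")])
    # Backfill load: one existing key (updated) and one new key (inserted).
    upsert_rows(
        conn,
        [("1-100", "Closed", "2022-01-05"), ("1-200", "Open", "2022-01-05")],
    )
    return conn.execute(
        "SELECT row_key, status FROM requests ORDER BY row_key"
    ).fetchall()
```

Because a reload of an already-present row only overwrites it, re-reading a date range that partly overlaps existing data should be harmless, which is what makes the backfill approach safe.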
So we could actually insert our custom start date into the existing data pipeline, and it should just work. We should also add a custom end date so that we don't read a ton of data unnecessarily.
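As a rough sketch of what bounding the read could look like, the helper below builds a SoQL `$where` filter for a half-open date window. The function name and the field name `updateddate` are assumptions for illustration; the pipeline may filter on a different column or build its query another way.

```python
from datetime import datetime
from typing import Optional

# Timestamp literal format used in the filter string.
SOQL_TIMESTAMP = "%Y-%m-%dT%H:%M:%S"


def build_where(field: str, start: datetime, end: Optional[datetime] = None) -> str:
    """Build a filter for rows with start <= field < end (end is optional)."""
    clause = f"{field} >= '{start.strftime(SOQL_TIMESTAMP)}'"
    if end is not None:
        clause += f" AND {field} < '{end.strftime(SOQL_TIMESTAMP)}'"
    return clause
```

Leaving `end` unset keeps the current "everything since the start" behavior, so the end date stays an optional addition.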
In terms of code changes, this means we would need to make `start_datetime` a Parameter, which is defined in `config.toml` just like `datasets` is (see code). If the parameter is not provided, then we can fall back to the data pipeline's default behavior of finding the time of the last update.
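The fallback could be sketched roughly as follows. This is plain Python rather than the pipeline's actual Parameter mechanism: `resolve_start_datetime` and `get_last_update` are hypothetical names, and `config` stands in for whatever is parsed from `config.toml`.

```python
from datetime import datetime
from typing import Callable, Mapping


def resolve_start_datetime(
    config: Mapping[str, object],
    get_last_update: Callable[[], datetime],
) -> datetime:
    """Use the configured start_datetime if present, else the last update time."""
    raw = config.get("start_datetime")
    if isinstance(raw, datetime):
        # TOML parsers typically return native datetimes for datetime values.
        return raw
    if raw:
        # Otherwise accept an ISO 8601 string such as "2022-01-01T00:00:00".
        return datetime.fromisoformat(str(raw))
    # Parameter not provided: keep the existing default behavior.
    return get_last_update()
```

Passing the last-update lookup in as a callable means it only runs when no custom start date is set, and it can be replaced with a stub in unit tests.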
We will need to figure out how to test this. Hopefully we can do this through unit tests; otherwise we will need to bring up a test server and test on that.
Note that unit testing this is blocked by #1309.