Backfill Missing Data That Wasn't Addressed by Data Pipeline Hotfix

Open EchoProject opened this issue 2 years ago • 2 comments

Overview

We need to backfill data from Socrata so that there are no missing data ranges when querying a custom date range.

NOTE: It would be useful to have the data pipeline automatically update the new dataset at the beginning of the year. Currently, the new dataset needs to be added manually (see original issue).

Action Items

Resources/Instructions

Original Issue:

  • https://github.com/hackforla/311-data/issues/1165#issue-1102870912

EchoProject avatar May 06 '22 02:05 EchoProject

If I'm reading the code correctly, the fix may be quite simple. The data pipeline determines the time of the last update, asks Socrata for all the data since that update, and then loads that data into the Postgres DB. When it loads the data into Postgres, it does so in two ways:

  1. Insert new rows (i.e., if the rowkey to load does not exist in Postgres, insert it)
  2. Update existing rows (i.e., if the rowkey to load already exists in Postgres, update that row)
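To make the insert-or-update behavior concrete, here is a minimal sketch of the same pattern. It is not the pipeline's actual code: it uses an in-memory SQLite database as a stand-in for Postgres (both accept the `ON CONFLICT ... DO UPDATE` form used here), and the `requests` table, its `status` column, and the `upsert_rows` helper are hypothetical names chosen for illustration.

```python
import sqlite3


def upsert_rows(conn, rows):
    """Insert each (rowkey, status) pair; if the rowkey exists, update its status."""
    conn.executemany(
        """
        INSERT INTO requests (rowkey, status) VALUES (?, ?)
        ON CONFLICT (rowkey) DO UPDATE SET status = excluded.status
        """,
        rows,
    )
    conn.commit()


# Hypothetical table standing in for the real Postgres schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (rowkey TEXT PRIMARY KEY, status TEXT)")

# First load: both rows are new, so both are inserted.
upsert_rows(conn, [("a1", "Open"), ("a2", "Open")])
# Second load overlaps the first: "a2" is updated, "a3" is inserted.
upsert_rows(conn, [("a2", "Closed"), ("a3", "Open")])

result = conn.execute(
    "SELECT rowkey, status FROM requests ORDER BY rowkey"
).fetchall()
```

Because reloading an overlapping range only rewrites rows that already exist, rerunning the pipeline over an earlier date range should not create duplicates, provided the real loader keys on rowkey as described.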

So we could actually insert our custom start date into the existing data pipeline, and it should just work. We should also add a custom end date so that we don't read a ton of data unnecessarily.
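As a rough sketch of what bounding the query could look like, the helper below builds a SoQL `$where` clause for a start date and an optional end date. The function name `build_where` and the `updateddate` column are assumptions for illustration; the column the pipeline actually filters on may differ.

```python
from datetime import datetime
from typing import Optional

# SoQL floating timestamps are written as ISO 8601 without a timezone.
SOQL_FORMAT = "%Y-%m-%dT%H:%M:%S"


def build_where(
    start: datetime,
    end: Optional[datetime] = None,
    column: str = "updateddate",  # hypothetical column name
) -> str:
    """Return a SoQL $where clause selecting rows in [start, end)."""
    clause = f"{column} >= '{start.strftime(SOQL_FORMAT)}'"
    if end is not None:
        if end <= start:
            raise ValueError("end must be later than start")
        clause += f" AND {column} < '{end.strftime(SOQL_FORMAT)}'"
    return clause


# Bounded backfill window for January 2022.
bounded = build_where(datetime(2022, 1, 1), datetime(2022, 2, 1))
# Open-ended query, matching the "everything since X" behavior.
open_ended = build_where(datetime(2022, 1, 1))
```

Leaving `end` unset keeps the existing "everything since the start" behavior, so the end date stays optional.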

In terms of code changes, this means we would need to make the start_datetime a Parameter, which is defined in config.toml just like datasets is (see code). If the parameter is not provided, then we can fall back to the data pipeline's default behavior of finding the time of the last update.
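The fallback could be sketched roughly as follows. This assumes the parsed config.toml is available as a plain mapping and that the existing last-update lookup can be passed in as a callable; `resolve_start_datetime` and `find_last_update` are hypothetical names, not functions from the repository.

```python
from datetime import datetime
from typing import Callable, Mapping


def resolve_start_datetime(
    config: Mapping[str, object],
    find_last_update: Callable[[], datetime],
) -> datetime:
    """Use start_datetime from the config if present, else the last update time."""
    raw = config.get("start_datetime")
    if raw is None or raw == "":
        # Parameter not provided: keep the pipeline's default behavior.
        return find_last_update()
    if isinstance(raw, datetime):
        # TOML parsers may already return a datetime for date-time values.
        return raw
    return datetime.fromisoformat(str(raw))


def fake_last_update() -> datetime:
    """Stand-in for the pipeline's real last-update lookup."""
    return datetime(2022, 5, 1, 12, 0, 0)


custom = resolve_start_datetime({"start_datetime": "2022-01-01T00:00:00"}, fake_last_update)
default = resolve_start_datetime({}, fake_last_update)
```

Passing the lookup in as a callable means the database is only queried when no custom start is configured, and it lets a test substitute a fake lookup without a running Postgres instance.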

We will need to figure out how to test this. Hopefully we can do this through unit tests; otherwise we will need to bring up a test server and test on that.

nichhk avatar May 08 '22 03:05 nichhk

Note that unit testing this is blocked by #1309.

nichhk avatar Aug 12 '22 18:08 nichhk