bdit_data-sources icon indicating copy to clipboard operation
bdit_data-sources copied to clipboard

miovision_csv schema cleanup

Open gabrielwol opened this issue 1 year ago • 6 comments

miovision_csv tables ending in 2020 can be deleted, as they now live in miovision_api schema. @Nate-Wessel noted that querying this older data from miovision_csv for the volume index project is very slow, perhaps a volumes_daily table could help with this.

gabrielwol avatar Nov 10 '23 15:11 gabrielwol

It's less that querying those old _csv tables is slow and more that the inconsistency between the two schemas causes difficulties for queries that make use of the entire time range that we have data for.

It's never been clear to me why these two schemas haven't been combined into a single miovision schema that has only one right way of doing things vs the current set up where there are different aggregate tables, no way of manually QAing anomalous ranges for older data, etc. IMO all the data is from Miovision cameras and it should all live in the same place regardless of whether it is from before or after 2019-01-01.

If the schemas were combined, the older data would naturally be included in the daily aggregation table - which would both speed things up and dramatically simplify my queries.

Nate-Wessel avatar Nov 13 '23 14:11 Nate-Wessel

Writing this down so it can now be documented, both in the README and maybe the schema comment.

They are effectively different data sources, which is why they've been kept in separate schemas.

The csv data comes from before the counts were being processed at the intersection, so counts were processed from downloaded video using a human augmented process. This is generally regarded as higher quality than the API data, which has been evolving over time. Because of the resource intensity of this, the counts were rather sparse, originally about a week per month and then only targeted periods for reporting. Because of the sparseness, the approach around aggregating was different, inferring when a camera might have been not recording data and interpolating missing values, as opposed to excluding entire hours.

Also there's an miovision_csv.exceptions table, which you can use to filter the csv data.

radumas avatar Nov 13 '23 15:11 radumas

That makes some sense, but to me that different method sounds like an algorithm change the likes of which must already be accommodated within the miovision "API" schema. It is different, yes, but we haven't chopped the data into different schemas when other significant algorithm or classification changes occurred.

IMO, it's better to have all the data in one place with a good way of noting where methods and classifications have changed over time, as they continue to do. If this early data is actually better, why make it harder to use?

Also there's an miovision_csv.exceptions table, which you can use to filter the csv data.

The problem is just that this is different from the way it's done for all the other Miovision data.

Nate-Wessel avatar Nov 13 '23 15:11 Nate-Wessel

I think the temporal sparseness of the older data is a good reason to not combine the two schemas. These sparse study periods are very different than our other permanent volume counters. Volume counts from miovision_csv.volumes_15min_tmc: image

I made some progress on this issue, primarily listing 2020 things to delete.

gabrielwol avatar Nov 24 '23 17:11 gabrielwol

temporal sparseness of the older data

It's not as though the new data is complete (non-sparse) though. There are still big gaps where sensors go down, are decommissioned, etc. We need to check that the data exists for any location/period of interest in either case.

Nate-Wessel avatar Nov 29 '23 15:11 Nate-Wessel

My two cents, but if miovision_csv is the data that was collected a couple days per month during the King Pilot, it should be treated as short term counts, not permanent miovision counts, and migrated into MOVE/ turning movement count schemas there. That does create trickiness for analyzing together with the perm count data, but maybe a transformed view could be created for that purpose. The data collection methodology is fundamentally different, they were conducted as one off turning movement counts using camera footage from the miovision cameras, and the accuracy should be higher than what we got from earlier miovision smartsense counts.

jwrcoleman avatar Nov 29 '23 16:11 jwrcoleman