model-res-avm icon indicating copy to clipboard operation
model-res-avm copied to clipboard

Update `ingest` stage to use `noctua` `unload = TRUE` option

Open dfsnow opened this issue 10 months ago • 3 comments

Now that https://github.com/DyfanJones/noctua/pull/215 is merged, we should update this repo to use the new option, preferably by setting it globally within noctua_options(unload = TRUE). This will require a bit of work to QC, as I expect it may result in minor changes to the ingested data.

For example, the loc_tax_municipality_name column is stored as an array in Athena. It gets imported as a comma-separated string using the old method (which is then parsed back to a list). The new method using unload = TRUE natively supports arrays, and so imports loc_tax_municipality_name as a list. However, we should double-check that none of the values or treatments have changed.

dfsnow avatar Apr 15 '24 20:04 dfsnow

@dfsnow - Is there a particular output that you want for the value / treatment tests? I ran the ingest script twice, once with the unload = TRUE and once without the unload = TRUE. I then sorted the columns (they were ordered differently which is something to note), and used dataCompareR to test the different outputs. The only differences were related to small decimal disparities.

image

Damonamajor avatar May 14 '24 15:05 Damonamajor

@dfsnow - Is there a particular output that you want for the value / treatment tests? I ran the ingest script twice, once with the unload = TRUE and once without the unload = TRUE. I then sorted the columns (they were ordered differently which is something to note), and used dataCompareR to test the different outputs. The only differences were related to small decimal disparities.

image

@Damonamajor As mentioned in the issue, I would specifically focus on the categorical columns that are stored in Athena as arrays. That's mostly anything with a loc_tax_ prefix. I'm worried that the string processing we set up for the previous method may interfere with the new one and give us different results.

dfsnow avatar May 14 '24 15:05 dfsnow

@dfsnow - Is there a particular output that you want for the value / treatment tests? I ran the ingest script twice, once with the unload = TRUE and once without the unload = TRUE. I then sorted the columns (they were ordered differently which is something to note), and used dataCompareR to test the different outputs. The only differences were related to small decimal disparities. image

@Damonamajor As mentioned in the issue, I would specifically focus on the categorical columns that are stored in Athena as arrays. That's mostly anything with a loc_tax_ prefix. I'm worried that the string processing we set up for the previous method may interfere with the new one and give us different results.

Those columns didn't ping anything, but I'll take a deeper dive into them.

Damonamajor avatar May 14 '24 15:05 Damonamajor