model-res-avm
model-res-avm copied to clipboard
Update `ingest` stage to use `noctua` `unload = TRUE` option
Now that https://github.com/DyfanJones/noctua/pull/215 is merged, we should update this repo to use the new option, preferably by setting it globally within noctua_options(unload = TRUE)
. This will require a bit of work to QC, as I expect it may result in minor changes to the ingested data.
For example, the loc_tax_municipality_name
column is stored as an array in Athena. It gets imported as a comma-separated string using the old method (which is then parsed back to a list). The new method using unload = TRUE
natively supports arrays, and so imports loc_tax_municipality_name
as a list. However, we should double-check that none of the values or treatments have changed.
@dfsnow - Is there a particular output that you want for the value / treatment tests? I ran the ingest script twice, once with the unload = TRUE and once without the unload = TRUE. I then sorted the columns (they were ordered differently which is something to note), and used dataCompareR
to test the different outputs. The only differences were related to small decimal disparities.
@dfsnow - Is there a particular output that you want for the value / treatment tests? I ran the ingest script twice, once with the unload = TRUE and once without the unload = TRUE. I then sorted the columns (they were ordered differently which is something to note), and used
dataCompareR
to test the different outputs. The only differences were related to small decimal disparities.
@Damonamajor As mentioned in the issue, I would specifically focus on the categorical columns that are stored in Athena as arrays. That's mostly anything with a loc_tax_
prefix. I'm worried that the string processing we set up for the previous method may interfere with the new one and give us different results.
@dfsnow - Is there a particular output that you want for the value / treatment tests? I ran the ingest script twice, once with the unload = TRUE and once without the unload = TRUE. I then sorted the columns (they were ordered differently which is something to note), and used
dataCompareR
to test the different outputs. The only differences were related to small decimal disparities.@Damonamajor As mentioned in the issue, I would specifically focus on the categorical columns that are stored in Athena as arrays. That's mostly anything with a
loc_tax_
prefix. I'm worried that the string processing we set up for the previous method may interfere with the new one and give us different results.
Those columns didn't ping anything, but I'll take a deeper dive into them.