model-res-avm Update `ingest` stage to use `noctua` `unload = TRUE` option

Now that https://github.com/DyfanJones/noctua/pull/215 is merged, we should update this repo to use the new option, preferably by setting it globally within noctua_options(unload = TRUE). This will require a bit of work to QC, as I expect it may result in minor changes to the ingested data.

For example, the loc_tax_municipality_name column is stored as an array in Athena. It gets imported as a comma-separated string using the old method (which is then parsed back to a list). The new method using unload = TRUE natively supports arrays, and so imports loc_tax_municipality_name as a list. However, we should double-check that none of the values or treatments have changed.

Apr 15 '24 20:04 dfsnow

@dfsnow - Is there a particular output that you want for the value / treatment tests? I ran the ingest script twice, once with the unload = TRUE and once without the unload = TRUE. I then sorted the columns (they were ordered differently which is something to note), and used dataCompareR to test the different outputs. The only differences were related to small decimal disparities.

May 14 '24 15:05 Damonamajor

@dfsnow - Is there a particular output that you want for the value / treatment tests? I ran the ingest script twice, once with the unload = TRUE and once without the unload = TRUE. I then sorted the columns (they were ordered differently which is something to note), and used dataCompareR to test the different outputs. The only differences were related to small decimal disparities.

@Damonamajor As mentioned in the issue, I would specifically focus on the categorical columns that are stored in Athena as arrays. That's mostly anything with a loc_tax_ prefix. I'm worried that the string processing we set up for the previous method may interfere with the new one and give us different results.

May 14 '24 15:05 dfsnow

@dfsnow - Is there a particular output that you want for the value / treatment tests? I ran the ingest script twice, once with the unload = TRUE and once without the unload = TRUE. I then sorted the columns (they were ordered differently which is something to note), and used dataCompareR to test the different outputs. The only differences were related to small decimal disparities.

@Damonamajor As mentioned in the issue, I would specifically focus on the categorical columns that are stored in Athena as arrays. That's mostly anything with a loc_tax_ prefix. I'm worried that the string processing we set up for the previous method may interfere with the new one and give us different results.

Those columns didn't ping anything, but I'll take a deeper dive into them.

May 14 '24 15:05 Damonamajor

model-res-avm model-res-avm copied to clipboard

Update `ingest` stage to use `noctua` `unload = TRUE` option

model-res-avm
model-res-avm copied to clipboard