Deduplicate PUF records
Excluding RECID and s006, there are currently 40 records in the taxcalc PUF that are identical on all other dimensions. Each set of identical records could be consolidated into a single record whose s006 is the sum of the set's s006 values.
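To make the idea concrete, here is a minimal pandas sketch of that consolidation. The function name `consolidate_duplicates` and the toy data are hypothetical, as is the choice to keep the first RECID of each group; `e00200` and `MARS` simply stand in for "all other dimensions".

```python
import pandas as pd


def consolidate_duplicates(df, id_col="RECID", weight_col="s006"):
    """Collapse rows that match on every column except id_col and weight_col.

    Hypothetical helper: keeps the first id in each group and sums the weights.
    """
    key_cols = [c for c in df.columns if c not in (id_col, weight_col)]
    # dropna=False so rows with NaN in a key column still form groups.
    grouped = df.groupby(key_cols, sort=False, dropna=False, as_index=False)
    return grouped.agg(
        **{id_col: (id_col, "first"), weight_col: (weight_col, "sum")}
    )


# Toy stand-in for the PUF: records 1, 2 and 4 match on e00200 and MARS.
puf = pd.DataFrame({
    "RECID": [1, 2, 3, 4],
    "s006": [100.0, 250.0, 50.0, 75.0],
    "e00200": [50000, 50000, 20000, 50000],
    "MARS": [1, 1, 2, 1],
})

consolidated = consolidate_duplicates(puf)
```

The total weight is unchanged by construction, so weighted aggregates computed from the consolidated file should match those from the original.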
There's more room to optimize at the raw PUF stage, where consolidating could shave off 15,000 records. This is because taxdata uses only 65 variables from the raw PUF, so many records are identical on that subset; later parts of the taxdata process then add fields that differ between them, which is why fewer exact duplicates remain in the taxcalc PUF. The most extreme case is one PUF record with 131 copies.
One catch is preserving RECID, which might involve making it a comma-separated field listing every RECID in the consolidated record. We could also add a function to "unroll" the consolidated data to produce one record per RECID for mapping back to the raw PUF.
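A rough sketch of how the comma-separated RECID and the "unroll" step could look in pandas. Both function names and the toy data are hypothetical. Note that the unrolled rows carry the group's summed s006, since the original per-record weights are not stored in this version.

```python
import pandas as pd


def consolidate_with_recids(df, id_col="RECID", weight_col="s006"):
    """Collapse duplicate rows, joining their ids into one comma-separated string."""
    key_cols = [c for c in df.columns if c not in (id_col, weight_col)]
    grouped = df.groupby(key_cols, sort=False, dropna=False, as_index=False)
    return grouped.agg(
        **{
            id_col: (id_col, lambda s: ",".join(str(v) for v in s)),
            weight_col: (weight_col, "sum"),
        }
    )


def unroll(consolidated, id_col="RECID"):
    """Expand a consolidated frame back to one row per original id."""
    out = consolidated.copy()
    out[id_col] = out[id_col].str.split(",")
    out = out.explode(id_col, ignore_index=True)
    out[id_col] = out[id_col].astype(int)
    return out


# Toy stand-in for the PUF: records 1, 2 and 4 match on e00200 and MARS.
puf = pd.DataFrame({
    "RECID": [1, 2, 3, 4],
    "s006": [100.0, 250.0, 50.0, 75.0],
    "e00200": [50000, 50000, 20000, 50000],
    "MARS": [1, 1, 2, 1],
})

consolidated = consolidate_with_recids(puf)
unrolled = unroll(consolidated)
```

If the original weights need to be recoverable, one option would be a second comma-separated field holding the per-record s006 values in the same order as the RECIDs.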
Here's my notebook looking at this: https://github.com/MaxGhenis/taxcalc-notebooks/blob/master/random/identical_puf_records.ipynb