john krauss
There should be a Docker image with a recent version of Postgres that handles our DB dependency.
Instead of iterating over text files locally, parse from text located on S3.
Instead of persisting either PDFs or text to disk, both should be persisted to S3.
Extract the bottom portions of the bill, currently ignored.
This is collected from bills, but not included in any way on joined.csv.
Abatements are only collected from PDFs right now, but we could get them from HTML too.
We don't need to keep all the PDFs immediately accessible for reading; the text version is fine for that. 1. Move all PDFs to Amazon Glacier or another low-cost long-term...
When the PLUTO lot map changes because of condo declarations, lot merges, or lot splits, we lose all the political districts metadata for the changed lots. For those unjoined lots,...
After pg_dumps are made by the build container, they should be tested somehow. Weird, hard-to-reproduce encoding errors are sometimes killing the `pg_restore`, resulting in 0-row tables.
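
One possible shape for such a test, sketched here under assumptions: restore the dump into a scratch database, collect a row count per table (for example with `SELECT count(*)`), and fail the build if any expected table is missing or empty. The function name `find_empty_tables` and the table names below are hypothetical; only the comparison logic is shown, not the restore itself.

```python
def find_empty_tables(row_counts, expected_tables):
    """Return the expected tables that are missing or have zero rows.

    row_counts: dict of table name -> row count, assumed to be gathered
                from a scratch database after running pg_restore.
    expected_tables: iterable of table names the dump should contain
                     (hypothetical; supply the real list for the build).
    """
    failures = []
    for table in expected_tables:
        # A table absent from row_counts is treated like an empty one.
        if row_counts.get(table, 0) == 0:
            failures.append(table)
    return failures


if __name__ == "__main__":
    # Illustrative counts only, not real data.
    counts = {"table_a": 1200, "table_b": 0}
    bad = find_empty_tables(counts, ["table_a", "table_b", "table_c"])
    if bad:
        raise SystemExit("restore check failed for: " + ", ".join(bad))
```

A check like this would not explain the encoding error, but it would stop a 0-row table from being published silently.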
[This dataset](https://data.ny.gov/Economic-Development/Quarterly-Census-of-Employment-and-Wages-Quarterly/cwsm-2ns3) has a "money"-typed column filled with values starting with "$", which makes `pgloader` barf (and take forever). Clearly suboptimal to default these to a text...
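
As a rough illustration of one workaround, the "$"-prefixed values could be normalized to plain decimals before the load step, so the column can be typed as numeric. This is a minimal sketch, not how the loader currently works; `parse_money` is a hypothetical helper, and it assumes values look like `$1,234.50` with an optional leading minus sign.

```python
from decimal import Decimal, InvalidOperation


def parse_money(raw):
    """Convert a string such as "$1,234.50" to a Decimal.

    Returns None for blank input. Raises ValueError when the value
    cannot be read as a number once "$" and "," are removed.
    """
    text = raw.strip()
    if not text:
        return None
    # Drop the currency symbol and thousands separators.
    cleaned = text.replace("$", "").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        raise ValueError("not a money value: %r" % raw)
```

For example, `parse_money("$1,234.50")` yields `Decimal("1234.50")`, which can then be written back out without the "$" for loading into a numeric column.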