john krauss issues

Results 11 issues of john krauss

Remove docker4data dependency, update postgres

There should be a Docker image with a recent version of postgres that handles our DB dependency.

Parse natively from S3

Instead of iterating over text files locally, parse from text located on S3.

Scrape natively to S3

Instead of persisting either PDFs or text to disk, both should be persisted to S3.

Extract amounts due etc. from bill

Extract the bottom portions of the bill, currently ignored.

owner/address year-by-year from bills

This is collected from bills, but it is not included anywhere in joined.csv.

abatements from HTML statements

Abatements are only collected from PDFs right now, but we could get them from HTML too.

long term maintenance

We don't need to keep all the PDFs immediately accessible for reading; the text version is fine for that. 1. Move all PDFs to Amazon Glacier or another low-cost long-term...

Obtain CD and political districts by join with PLUTO by block

When the PLUTO lot map changes because of condo declarations, lot merges, or lot splits, we lose all the political districts metadata for the changed lots. For those unjoined lots,...
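One possible reading of the block-level join in the issue title, sketched in Python. The issue text is truncated, so this is an assumption rather than the author's method: it relies on the standard 10-digit BBL layout (1 borough digit, 5 block digits, 4 lot digits) and only assigns a district from the block when every PLUTO lot on that block agrees. The function names and the `(bbl, district)` row shape are hypothetical.

```python
from collections import defaultdict


def block_key(bbl):
    """Return the borough+block prefix (first 6 digits) of a 10-digit BBL."""
    bbl = str(bbl)
    if len(bbl) != 10 or not bbl.isdigit():
        raise ValueError(f"not a 10-digit BBL: {bbl!r}")
    return bbl[:6]


def districts_by_block(pluto_rows):
    """Map each borough+block to the set of district values seen on its lots.

    pluto_rows is an iterable of (bbl, district) pairs (hypothetical shape).
    """
    by_block = defaultdict(set)
    for bbl, district in pluto_rows:
        by_block[block_key(bbl)].add(district)
    return by_block


def district_for(bbl, by_lot, by_block):
    """Look up a district by exact lot first, then fall back to the block.

    The fallback is used only when the block has a single distinct district;
    blocks that straddle a boundary return None instead of guessing.
    """
    if bbl in by_lot:
        return by_lot[bbl]
    candidates = by_block.get(block_key(bbl), set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None
```

A new condo lot such as `1000017501` would not match any PLUTO lot directly, but it would still pick up the district shared by the other lots on block `100001`.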

Test pg_dumps after build

After pg_dumps are made by the build container, they should be tested somehow. Weird, hard-to-reproduce encoding errors are sometimes killing the `pg_restore`, resulting in 0 row tables.
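A minimal sketch of one way such a test could work, assuming the dump is restored into a scratch database and each expected table is then checked for rows. The function names, the scratch database, and the table list are hypothetical; the `pg_restore` and `psql` flags used are standard ones (`--exit-on-error`, `--no-owner`, `--dbname`, `-A`, `-t`, `-c`).

```python
import subprocess


def restore_command(dump_path, dbname):
    """Build a pg_restore invocation that stops at the first error."""
    return ["pg_restore", "--exit-on-error", "--no-owner",
            "--dbname", dbname, dump_path]


def count_command(dbname, table):
    """Build a psql invocation that prints only the row count of a table.

    The table name is interpolated directly, so it must come from a
    trusted list, not from user input.
    """
    return ["psql", "-At", "--dbname", dbname,
            "-c", f"SELECT count(*) FROM {table}"]


def empty_tables(row_counts):
    """Return the names of tables whose row count is zero, sorted."""
    return sorted(name for name, count in row_counts.items() if count == 0)


def check_dump(dump_path, dbname, tables):
    """Restore a dump into a scratch database and report 0-row tables.

    Raises subprocess.CalledProcessError if pg_restore or psql fails.
    """
    subprocess.run(restore_command(dump_path, dbname), check=True)
    counts = {}
    for table in tables:
        result = subprocess.run(count_command(dbname, table), check=True,
                                capture_output=True, text=True)
        counts[table] = int(result.stdout.strip())
    return empty_tables(counts)
```

A build step could fail whenever `check_dump` raises or returns a non-empty list, which would catch both the failed restore and the 0-row tables described above.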

Handle "money" data type properly

[This dataset](https://data.ny.gov/Economic-Development/Quarterly-Census-of-Employment-and-Wages-Quarterly/cwsm-2ns3) has a "money"-typed column filled with values starting with "$", which makes `pgloader` barf (and take forever). Clearly suboptimal to default these to a text...
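One way to handle such values, sketched here as a pre-processing step in Python rather than as a `pgloader` setting: strip the currency symbol and thousands separators and parse the rest as a `Decimal`, so the column can be loaded as numeric. The function name `parse_money` is hypothetical, and the handling of commas and a leading minus sign is an assumption about the data, not something the issue states.

```python
from decimal import Decimal, InvalidOperation


def parse_money(value):
    """Convert a string like "$1,234.56" to a Decimal.

    Returns None for empty input and raises ValueError for anything
    that is not a number once "$" and "," are removed.
    """
    text = value.strip()
    if not text:
        return None
    negative = text.startswith("-")
    if negative:
        text = text[1:]
    text = text.replace("$", "").replace(",", "")
    try:
        amount = Decimal(text)
    except InvalidOperation:
        raise ValueError(f"not a money value: {value!r}") from None
    return -amount if negative else amount
```

`Decimal` is used instead of `float` so that cent amounts survive the round trip exactly.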