muckrock
Handle more file types
Since all the 'smarts' for analyzing records are in DocumentCloud, MuckRock should send more of its data to DocumentCloud, even when the file is not a PDF.
- ZIP files should be uncompressed and their component files sent to DocumentCloud individually
- Image files, and really any Office document, should be converted to PDF (if DocumentCloud needs that; it appears DC's docsplit can already do the conversion) and OCR-ed by DocumentCloud
- EML and MSG files should be converted to PDF so the email messages can be read
- PDFs with embedded attachments and "Portfolio" PDFs should have those components extracted and sent to DocumentCloud
In each of the above, the functionality could live on the DocumentCloud or MuckRock side as appropriate.
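One way to picture the routing described above is the following rough sketch, which uses only the Python standard library. It expands ZIP archives in memory and labels each contained file with the handling the list proposes. The function names (`classify`, `plan`), the action labels, and the extension sets are all hypothetical and are not part of MuckRock or DocumentCloud; the actual conversion, OCR, and upload steps are left out, as is extraction of PDF attachments.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Hypothetical extension groups; the real lists would need to be agreed on.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".tif", ".tiff", ".bmp"}
OFFICE_EXTS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".odt", ".rtf"}
EMAIL_EXTS = {".eml", ".msg"}


def classify(filename):
    """Map a filename to a (hypothetical) action label based on its extension."""
    ext = PurePosixPath(filename).suffix.lower()
    if ext == ".pdf":
        return "upload"
    if ext == ".zip":
        return "expand"
    if ext in IMAGE_EXTS or ext in OFFICE_EXTS:
        return "convert_and_ocr"
    if ext in EMAIL_EXTS:
        return "convert_email"
    return "skip"


def plan(filename, data, max_depth=3):
    """Return a list of (name, action) pairs for one file.

    ZIP archives are expanded recursively, up to max_depth levels, so that
    each component file gets its own entry instead of the archive itself.
    """
    action = classify(filename)
    if action != "expand":
        return [(filename, action)]
    # Guard against deeply nested archives and files that only look like ZIPs.
    if max_depth <= 0 or not zipfile.is_zipfile(io.BytesIO(data)):
        return [(filename, "skip")]
    results = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            results.extend(plan(info.filename, archive.read(info), max_depth - 1))
    return results
```

Because `plan` only decides what should happen to each file, the same routing could sit on either the MuckRock or the DocumentCloud side, with the chosen side supplying the converters.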