SQLite / Document ID Refactor for Scrapers
This is a fairly sizeable refactor target which covers a few different issues.
Warning: Before anyone starts on this, note that this issue might be made obsolete by the architectural changes proposed in #419.
General Design Notes
Refactor the scraper pipelines to store their data in SQLite databases configured with WAL and the JSON1 extension.
The idea is to move from a JSON manifest file to a relational database. This will provide the following:
- The ability to query on multiple properties of a scraped document (DocumentId, file hash, source URL, title, etc.) to decide whether we need to use that document.
- Reduced overhead, since we no longer need to load the whole manifest into memory to edit it and then flush it back to disk.
- The ability to store alternate ids for a given document using JSON1 arrays in the schema, such as `alternate_hashes`, `alternate_dids`, and `alternate_urls`. This will allow us to have multiple identifiers for a given document, reducing duplication caused by file hash changes or the lack of a DID.
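As a rough illustration of the design notes above, here is a minimal sketch of a per-scraper manifest database using Python's built-in `sqlite3` module. The table name `documents`, the columns other than the three `alternate_*` arrays, and the helpers `open_manifest` and `find_by_hash` are all hypothetical, and the query assumes the underlying SQLite build includes JSON1 (`json_each`).

```python
import json
import os
import sqlite3
import tempfile

# Hypothetical schema: only the alternate_* column names come from the
# design notes; the table name and remaining columns are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id               INTEGER PRIMARY KEY,
    document_id      TEXT,
    file_hash        TEXT NOT NULL,
    source_url       TEXT,
    title            TEXT,
    alternate_hashes TEXT NOT NULL DEFAULT '[]',  -- JSON array of old hashes
    alternate_dids   TEXT NOT NULL DEFAULT '[]',  -- JSON array of old DIDs
    alternate_urls   TEXT NOT NULL DEFAULT '[]'   -- JSON array of old URLs
);
CREATE INDEX IF NOT EXISTS idx_documents_file_hash
    ON documents (file_hash);
CREATE INDEX IF NOT EXISTS idx_documents_document_id
    ON documents (document_id);
"""


def open_manifest(path):
    """Open (or create) a manifest database with WAL journaling enabled."""
    conn = sqlite3.connect(path)
    # WAL only applies to on-disk databases, not to ":memory:".
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript(SCHEMA)
    return conn


def find_by_hash(conn, file_hash):
    """Return ids of rows whose current or alternate hash matches."""
    rows = conn.execute(
        """
        SELECT id FROM documents
        WHERE file_hash = ?
           OR EXISTS (SELECT 1 FROM json_each(documents.alternate_hashes)
                      WHERE json_each.value = ?)
        ORDER BY id
        """,
        (file_hash, file_hash),
    ).fetchall()
    return [row[0] for row in rows]
```

A scraper could then call `open_manifest("scraper.db")` once per run and use `find_by_hash` to decide whether a downloaded file is already known, even if its hash has changed since it was first recorded.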
Tasks
- [ ] Refactor pipeline in scraper to use SQLite database
- [ ] Refactor pdf_parser to accept a SQLite database
- [ ] Implement identity resolution logic and historic identifier updating against the SQLite database
- [ ] Extend the DocumentId stripping code to be more exhaustive and to handle different DTDs in the metadata definition for PDFs.
- [ ] Figure out if there is a DocumentId or similar identity in documents generated by MS Word (poppler can't find any)
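The identity resolution task could look something like the sketch below. It is only one possible approach, under stated assumptions: the table and current-identifier column names are hypothetical, the match precedence (DID, then file hash, then URL) is a guess rather than a decided policy, and `json_each` requires a SQLite build with JSON1. When a known document turns up with a changed identifier, the old value is moved into the matching `alternate_*` array.

```python
import json
import sqlite3

# (current column, JSON-array history column) pairs, in assumed match
# order. These are fixed constants, so interpolating them into SQL is safe.
IDENTIFIERS = [
    ("document_id", "alternate_dids"),
    ("file_hash", "alternate_hashes"),
    ("source_url", "alternate_urls"),
]


def create_schema(conn):
    """Create a hypothetical documents table for this sketch."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id               INTEGER PRIMARY KEY,
            document_id      TEXT,
            file_hash        TEXT,
            source_url       TEXT,
            alternate_dids   TEXT NOT NULL DEFAULT '[]',
            alternate_hashes TEXT NOT NULL DEFAULT '[]',
            alternate_urls   TEXT NOT NULL DEFAULT '[]'
        )
        """
    )


def find_match(conn, column, alt_column, value):
    """Return the id of a row whose current or historic identifier matches."""
    if value is None:
        return None
    row = conn.execute(
        f"""
        SELECT id FROM documents
        WHERE {column} = ?
           OR EXISTS (SELECT 1 FROM json_each(documents.{alt_column})
                      WHERE json_each.value = ?)
        ORDER BY id LIMIT 1
        """,
        (value, value),
    ).fetchone()
    return row[0] if row else None


def resolve(conn, document_id=None, file_hash=None, source_url=None):
    """Return the row id for a scraped document, inserting it if unseen."""
    seen = {
        "document_id": document_id,
        "file_hash": file_hash,
        "source_url": source_url,
    }
    row_id = None
    for column, alt_column in IDENTIFIERS:
        row_id = find_match(conn, column, alt_column, seen[column])
        if row_id is not None:
            break
    if row_id is None:
        cur = conn.execute(
            "INSERT INTO documents (document_id, file_hash, source_url) "
            "VALUES (?, ?, ?)",
            (document_id, file_hash, source_url),
        )
        return cur.lastrowid
    # Known document: move any identifier that changed into its history.
    for column, alt_column in IDENTIFIERS:
        new = seen[column]
        current, history = conn.execute(
            f"SELECT {column}, {alt_column} FROM documents WHERE id = ?",
            (row_id,),
        ).fetchone()
        if new is None or new == current:
            continue
        alternates = json.loads(history)
        if current is not None and current not in alternates:
            alternates.append(current)
        # The new current value should not also appear in the history.
        alternates = [value for value in alternates if value != new]
        conn.execute(
            f"UPDATE documents SET {column} = ?, {alt_column} = ? WHERE id = ?",
            (new, json.dumps(alternates), row_id),
        )
    return row_id
```

With this shape, a re-scraped PDF whose bytes changed but whose DID stayed the same resolves to the existing row instead of creating a duplicate, and a file with no DID can still be matched through a current or historic hash.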
Other Notes
- We considered using a Postgres database instead. However, with the file-passing based workflow we can guarantee that the indexes won't be left badly updated if an error occurs at a checkpoint, whereas with each task writing to Postgres the data could end up in an inconsistent state. In addition, an RDS instance capable of handling the volume of reads and writes would cost significantly more than a single SQLite database per scraper.
This should resolve https://github.com/wellcometrust/reach/issues/48
It is related to the architecture work and will be re-written