ingest-file
Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.
There are two formats of PDF metadata: the old one (a key-value dict) and XMP (introduced by Adobe in the early 2000s). These days, XMP must always be used instead...
Bumps [coverage](https://github.com/nedbat/coveragepy) from 6.4.4 to 6.5.0. Changelog Sourced from coverage's changelog. Version 6.5.0 — 2022-09-29 The JSON report now includes details of which branches were taken, and which are missing...
Bumps [servicelayer[amazon,google]](https://github.com/alephdata/servicelayer) from 1.20.4 to 1.20.5. Commits ebbd6ed Bump version: 1.20.4 → 1.20.5 162e04d Update structlog requirement from <22.0.0,>=20.2.0 to >=20.2.0,<23.0.0 (#70) 3b944f9 Bump pika from 1.2.0 to 1.3.0 (#68)...
Bumps ubuntu from 20.04 to 22.04. Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a...
When ingest-file is run on a machine with a large number of cores, SQLAlchemy's default connection pool size may not be enough. See https://github.com/alephdata/ingest-file/issues/251
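A minimal sketch of how pool sizing could follow the core count, not ingest-file's actual fix. The helper name `pool_settings`, the `per_core` multiplier and the `floor` value are all hypothetical; only the `pool_size` and `max_overflow` keyword names correspond to real SQLAlchemy `create_engine` parameters for its default queue pool, whose stock `pool_size` is 5.

```python
import os


def pool_settings(cores=None, per_core=2, floor=5):
    """Return pool keyword arguments scaled to the number of CPU cores.

    Hypothetical helper: the per_core multiplier is an assumed heuristic,
    and floor=5 mirrors SQLAlchemy's default pool_size.
    """
    if cores is None:
        # os.cpu_count() can return None, so fall back to a single core
        cores = os.cpu_count() or 1
    pool_size = max(floor, cores * per_core)
    # Allow temporary bursts of the same size on top of the steady pool
    return {"pool_size": pool_size, "max_overflow": pool_size}


if __name__ == "__main__":
    print(pool_settings(cores=64))
```

The resulting dict could be passed on as `sqlalchemy.create_engine(url, **pool_settings())`, assuming the database server accepts that many connections.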
An error while analysing an ingested document stops the document processing pipeline, and the document is not indexed and does not show up in Aleph. Example of such an error: ``` Traceback (most...
Replace chardet encoding checks for the newer stuff
### What is an ftm-bundle? An `ftm-bundle` is a zip file containing structured FtM entities and document blobs. The structure of the zip file may look something like: ``` bundle.zip/...
nosetests is dead, and pytest's better fixture handling will probably be an overall gain for the ingestors.
Ingestors don't currently support HEIC / HEVC images. refs https://github.com/alephdata/aleph/issues/1982