ingest-file
Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.
ingest-file could extract wallet addresses for popular cryptocurrencies using regular expressions, similar to how it already extracts email addresses and IBANs. While ElasticSearch and Aleph do support searching using...
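To make the idea concrete, here is a minimal sketch of regex-based wallet address extraction. It is not taken from ingest-file: the names `WALLET_PATTERNS` and `extract_wallet_addresses` are hypothetical, and the patterns only cover common Bitcoin and Ethereum address shapes without verifying checksums, so false positives are possible.

```python
import re

# Hypothetical patterns, not ingest-file's own. They match the *shape* of an
# address only; no checksum is verified.
WALLET_PATTERNS = {
    # Legacy Base58 addresses (start with 1 or 3) and Bech32 addresses (bc1...).
    "bitcoin": re.compile(
        r"\b(?:[13][a-km-zA-HJ-NP-Z1-9]{25,34}|bc1[ac-hj-np-z02-9]{11,71})\b"
    ),
    # 0x followed by 40 hexadecimal characters.
    "ethereum": re.compile(r"\b0x[a-fA-F0-9]{40}\b"),
}


def extract_wallet_addresses(text):
    """Return (currency, address) pairs in order of appearance in the text."""
    found = []
    for currency, pattern in WALLET_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), currency, match.group(0)))
    # Sort by match position so results follow the reading order of the text.
    return [(currency, address) for _, currency, address in sorted(found)]


if __name__ == "__main__":
    sample = (
        "Send to 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa or 0x" + "ab" * 20 + " today."
    )
    for currency, address in extract_wallet_addresses(sample):
        print(currency, address)
```

A real extractor would probably want a checksum step after the regex match (Base58Check or Bech32 for Bitcoin, EIP-55 for mixed-case Ethereum addresses) to cut down on false matches.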
See https://github.com/alephdata/ingest-file/pull/511
DO NOT MERGE. This is a hack, for internal review only.
TODO:
- [ ] `Mentions` seem to be missing, investigate

As per alephdata/aleph#3908 and [#2066](https://github.com/alephdata/aleph/issues/2066), this is an attempt to create `BankAccount` FTM entities out of valid IBANs. In the...
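As an illustration of what "valid IBAN" can mean here, the sketch below applies the standard ISO 13616 mod-97 check and wraps a valid IBAN in a plain dictionary shaped like a `BankAccount` entity. It is not the code from the PR: `is_valid_iban` and `bank_account_from_iban` are hypothetical names, the `iban` property name is an assumption about the FTM schema, and country-specific length rules are not checked.

```python
import re

# Two-letter country code, two check digits, then 11 to 30 alphanumerics.
# Per-country lengths are NOT enforced in this sketch.
IBAN_SHAPE = re.compile(r"^[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}$")


def normalize_iban(candidate):
    """Strip whitespace and upper-case the candidate string."""
    return re.sub(r"\s+", "", candidate).upper()


def is_valid_iban(candidate):
    """Check the shape and the mod-97 checksum of an IBAN candidate."""
    iban = normalize_iban(candidate)
    if not IBAN_SHAPE.match(iban):
        return False
    # Move the first four characters to the end, then map letters to numbers
    # (A=10 ... Z=35); base-36 parsing of one character gives that mapping.
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def bank_account_from_iban(candidate):
    """Return a BankAccount-like dict for a valid IBAN, else None.

    The dict layout and the "iban" property name are assumptions made
    for illustration only.
    """
    if not is_valid_iban(candidate):
        return None
    return {"schema": "BankAccount", "properties": {"iban": [normalize_iban(candidate)]}}
```

For example, `bank_account_from_iban("GB82 WEST 1234 5698 7654 32")` yields a dictionary carrying the normalized IBAN, while a candidate with a wrong check digit yields `None`.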
Bumps ubuntu from 20.04 to 23.04.
There are two tests currently marked as `@skip` in the `tests` dir:

- [test_olm.py](https://github.com/alephdata/ingest-file/blob/main/tests/test_olm.py)
- [test_djvu.py](https://github.com/alephdata/ingest-file/blob/main/tests/test_djvu.py)

Both fail. The root cause for the failure should be investigated. Ideally, all tests...
`3.18.2` has difficulties with PDFs with unsupported image formats when we try to get a PIL image out of a pikepdf Image. Some research suggests this might be related to...
A `ProcessingException` is thrown every time `ingest-file` isn't able to parse a file. In the current state, if Sentry support is enabled, each of these will create an event in...
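One possible way to keep such expected failures out of Sentry is a `before_send`-style filter, sketched below. This is an assumption about a workable approach, not what the issue proposes or what ingest-file does: the `ProcessingException` class here is a local stand-in for the real one, and `drop_processing_exceptions` is a hypothetical name. The function follows the `(event, hint)` callback shape that the Sentry Python SDK accepts for `before_send`, where returning `None` discards the event.

```python
class ProcessingException(Exception):
    """Local stand-in for ingest-file's exception, for illustration only."""


def drop_processing_exceptions(event, hint):
    """Discard events caused by ProcessingException, pass everything else.

    `hint` may carry an "exc_info" tuple of (type, value, traceback) when
    the event originates from an exception.
    """
    exc_info = hint.get("exc_info") if hint else None
    if exc_info is not None:
        exc_type = exc_info[0]
        if isinstance(exc_type, type) and issubclass(exc_type, ProcessingException):
            return None  # returning None tells the SDK to drop the event
    return event
```

Under these assumptions the filter would be wired in with something like `sentry_sdk.init(dsn=..., before_send=drop_processing_exceptions)`; unexpected errors would still be reported as before.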
While importing an email archive in the (IMHO cursed) .PST format, I came across a mailbox in which every message has `application/rtf` as its body type.

```
Content-Type: application/rtf
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename*=utf-8''rtf-body.rtf; filename="rtf-body.rtf"
```
...