ingest-file
Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.
https://github.com/pemistahl/lingua-py https://pemistahl.github.io/lingua-py/ Upsides: better and better-studied accuracy than the fastText lid.176 model. Downsides: 75 languages supported vs. 176 for the fastText lid.176 model.
... like https://github.com/Ousret/charset_normalizer
Hi. I am trying to upload documents (PDFs with text) to Aleph v3.11.1 using ingest-file 3.16.0. As far as I can see, the newest ingest-file includes NER support for Norwegian...
This PR might never get merged, but it explores the outcome of adopting PDFminer, following the discussion in #42. So far:
* Lots of problems loading images correctly (might want...
You have to give Microsoft credit for its consistency: instead of storing E-Mail messages in Outlook as RFC822 plain text, they came up with their own super funky file format...
At the moment, the `ingestors` will call on an HTTP service provided by `convert-document` (in this repo) to convert documents in various types (Word, Powerpoint, etc.) to PDF files, which...
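A minimal sketch of the kind of call described above: posting a document to an HTTP conversion service and reading back the PDF bytes. The service URL, port, `/convert` path, and the `file` form field name are all assumptions for illustration; the excerpt does not specify them, and this is not the actual `ingestors` client code.

```python
import uuid
import urllib.request

# Hypothetical endpoint: host, port, and path are assumptions, not taken from the source.
CONVERT_URL = "http://convert-document:3000/convert"


def build_convert_request(filename: str, payload: bytes,
                          url: str = CONVERT_URL) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying a single document."""
    boundary = uuid.uuid4().hex
    # The form field name "file" is an assumption about the service's API.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url,
        data=head + payload + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def convert_to_pdf(filename: str, payload: bytes, timeout: float = 300.0) -> bytes:
    """Send the document to the service and return the response body (PDF bytes)."""
    request = build_convert_request(filename, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping request construction separate from the network call makes the multipart body easy to inspect without a running service. Conversions of large office documents can be slow, which is why the sketch uses a generous timeout.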
What's broken?
* We're seeing incorrect text extraction out of some documents, especially those containing Arabic text.
* Text from images isn't being extracted into the right location in the...
It seems like we fail to parse files which are created in Excel with write-protection, even though they are readable without a password in the app. There has to be...
Bumps [spacy](https://github.com/explosion/spaCy) from 3.6.1 to 3.7.4. Release notes Sourced from spacy's releases. v3.7.4: New textcat layers and fo/nn language extensions ✨ New features and improvements Improve NumPy 2.0 compatibility (#13103)....
Bumps [followthemoney](https://github.com/alephdata/followthemoney) from 3.5.8 to 3.5.9. Commits 04554ad Bump version: 3.5.8 → 3.5.9 7b6a058 re-generate model 4b47761 fix naive assertion 5dddae6 remove utcnow because it is deprecated in py 3.12...