ingest-file

Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.

Results 91 ingest-file issues

https://github.com/pemistahl/lingua-py
https://pemistahl.github.io/lingua-py/
Upsides: better and more thoroughly studied accuracy than fastText's lid.176 model.
Downsides: 75 languages supported vs. 176 for the fastText lid.176 model.

... like https://github.com/Ousret/charset_normalizer

Hi. I am trying to upload documents (PDFs with text) to Aleph v3.11.1 using ingest-file 3.16.0. As far as I can see, the newest ingest-file includes NER support for Norwegian...

bug

This PR might never get merged, but it explores the outcome of adopting PDFminer, following the discussion in #42. So far: * Lots of problems loading images correctly (might want...

You have to give Microsoft credit for its consistency: instead of storing E-Mail messages in Outlook as RFC822 plain text, they came up with their own super funky file format...
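
For contrast with Outlook's own format, an RFC822 message is plain text that the Python standard library can parse directly. The following is only a minimal sketch: the sample message and the `summarize` helper are invented for illustration and are not part of ingest-file.

```python
from email import message_from_string
from email.policy import default

# A made-up RFC822 message: headers, a blank line, then the body.
RAW_MESSAGE = (
    "From: alice@example.org\r\n"
    "To: bob@example.org\r\n"
    "Subject: Quarterly report\r\n"
    "\r\n"
    "Please find the figures attached.\r\n"
)


def summarize(raw: str) -> dict:
    """Parse a plain-text RFC822 message and return its key fields."""
    message = message_from_string(raw, policy=default)
    return {
        "from": str(message["From"]),
        "to": str(message["To"]),
        "subject": str(message["Subject"]),
        "body": message.get_content().strip(),
    }


if __name__ == "__main__":
    print(summarize(RAW_MESSAGE))
```

Outlook `.msg` files cannot be read this way, which is why they need a dedicated parser or a conversion step first.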

At the moment, the `ingestors` call an HTTP service provided by `convert-document` (in this repo) to convert documents of various types (Word, PowerPoint, etc.) to PDF files, which...

question
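
The HTTP conversion call described in the `convert-document` issue above could look roughly like the sketch below, which uses only the standard library. The `/convert` path, the `file` form field name, and the helper names are assumptions made for illustration; they are not confirmed details of the actual service.

```python
import uuid
import urllib.request
from pathlib import Path


def build_convert_request(service_url: str, file_name: str, data: bytes) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying one document.

    Assumption: the service accepts uploads at '/convert' in a form
    field named 'file'. Both names are hypothetical.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=service_url.rstrip("/") + "/convert",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def convert_to_pdf(service_url: str, source: Path, timeout: float = 300.0) -> bytes:
    """Send the document to the service and return the response body (expected to be PDF bytes)."""
    request = build_convert_request(service_url, source.name, source.read_bytes())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping request construction separate from the network call makes the upload format easy to check without a running service.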

What's broken?
* We're seeing incorrect text extraction out of some documents, especially those containing Arabic text.
* Text from images isn't being extracted into the right location in the...

It seems like we fail to parse files which are created in Excel with write-protection, even though they are readable without a password in the app. There has to be...

file-types

Bumps [spacy](https://github.com/explosion/spaCy) from 3.6.1 to 3.7.4.
Release notes (sourced from spacy's releases):
v3.7.4: New textcat layers and fo/nn language extensions
✨ New features and improvements
* Improve NumPy 2.0 compatibility (#13103)....

dependencies
python

Bumps [followthemoney](https://github.com/alephdata/followthemoney) from 3.5.8 to 3.5.9.
Commits:
* 04554ad Bump version: 3.5.8 → 3.5.9
* 7b6a058 re-generate model
* 4b47761 fix naive assertion
* 5dddae6 remove utcnow because it is deprecated in py 3.12...

dependencies
python