aleph
Search and browse documents and data; find the people and companies you look for.
Ingestors don't currently support HEIC / HEVC images. refs https://github.com/alephdata/aleph/issues/1982
https://github.com/pemistahl/lingua-py https://pemistahl.github.io/lingua-py/ Upsides: better and more thoroughly studied accuracy than fastText's lid.176 model. Downsides: 75 languages supported vs. 176 for fastText's lid.176 model.
... like https://github.com/Ousret/charset_normalizer
### What is an ftm-bundle? An `ftm-bundle` is a zip file containing structured FtM entities and document blobs. The structure of the zip file may look something like: ``` bundle.zip/...
When ingest-file is run on a machine with a large number of cores, SQLAlchemy's default connection pool size may not be enough. See https://github.com/alephdata/aleph/issues/3915
For whatever reason (logs output nothing helpful), some tasks seem to be failing and this blocks all other tasks. We have millions of enqueued tasks, and tasks continuously start and...
ingest-file extracts IBANs using a rather [simple regex](https://github.com/alephdata/ingest-file/blob/main/ingestors/analysis/patterns.py). This can lead to a lot of false positives. ingest-file could add additional validation for matches in order to improve precision: *...
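One common form of additional validation is the ISO 13616 mod-97 checksum. The sketch below is only an illustration of that idea, not the ingest-file implementation: the candidate pattern `IBAN_CANDIDATE` and the helpers `is_valid_iban` and `extract_ibans` are hypothetical names, and country-specific length checks are left out.

```python
import re

# Assumed loose candidate pattern (hypothetical, not the one in patterns.py):
# two letters, two check digits, then 11 to 30 alphanumeric characters.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def is_valid_iban(candidate: str) -> bool:
    """Return True if the candidate passes the mod-97 checksum."""
    iban = re.sub(r"\s+", "", candidate).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", iban):
        return False
    # Move country code and check digits to the end.
    rearranged = iban[4:] + iban[:4]
    # Map letters to numbers (A=10 ... Z=35); digits stay as they are.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def extract_ibans(text: str) -> list[str]:
    """Keep only regex matches that also pass the checksum."""
    return [
        m.group(0)
        for m in IBAN_CANDIDATE.finditer(text)
        if is_valid_iban(m.group(0))
    ]
```

Filtering regex matches through the checksum discards most random alphanumeric strings that merely look like an IBAN, at the cost of one integer modulo per match.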
ingest-file could extract crypto wallet addresses for popular cryptocurrencies using regular expressions, similar to how it already extracts email addresses and IBANs. While Elasticsearch and Aleph do support searching using...
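A rough sketch of what regex-based wallet extraction could look like. The patterns are commonly used approximations for Bitcoin and Ethereum address shapes, not validated formats (no Base58Check, Bech32, or EIP-55 checksum verification), and `WALLET_PATTERNS` and `extract_wallets` are hypothetical names rather than anything that exists in ingest-file.

```python
import re

# Approximate address shapes only; checksums are not verified here.
WALLET_PATTERNS = {
    # Legacy Base58 addresses starting with 1 or 3.
    "bitcoin_legacy": re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b"),
    # Bech32 mainnet addresses ("bc1" prefix, Bech32 character set).
    "bitcoin_bech32": re.compile(r"\bbc1[ac-hj-np-z02-9]{39,59}\b"),
    # Ethereum addresses: "0x" followed by 40 hex characters.
    "ethereum": re.compile(r"\b0x[a-fA-F0-9]{40}\b"),
}


def extract_wallets(text: str) -> list[tuple[str, str]]:
    """Return (kind, address) pairs for every pattern match in the text."""
    found = []
    for kind, pattern in WALLET_PATTERNS.items():
        found.extend((kind, m.group(0)) for m in pattern.finditer(text))
    return found
```

As with IBANs, shape-only patterns like these are likely to produce false positives, so a checksum step per currency would be a natural follow-up.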
`3.18.2` has difficulties with PDFs with unsupported image formats when we try to get a PIL image out of a pikepdf Image. Some research suggests this might be related to...
Our current retry logic for converting documents (shelling out to LibreOffice) is based on two constants: the number of retry attempts and the timeout https://github.com/alephdata/ingest-file/blob/fca65fbb08ff37d65df3c14804ad5b1b6809b97d/ingestors/support/convert.py#L16-L17 What would be more desirable...
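For context, retry logic driven by two fixed constants can be sketched roughly as below. This is an illustration under assumptions, not the code in `convert.py`: the constant names, their values, and `run_with_retries` are hypothetical, and the command being run is a placeholder for the LibreOffice invocation.

```python
import subprocess
import sys

# Hypothetical values; the real constants live in ingestors/support/convert.py.
CONVERT_RETRIES = 3
CONVERT_TIMEOUT = 30  # seconds per attempt


def run_with_retries(cmd, retries=CONVERT_RETRIES, timeout=CONVERT_TIMEOUT):
    """Run cmd up to `retries` times; return the attempt number that succeeded."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            subprocess.run(cmd, check=True, timeout=timeout, capture_output=True)
            return attempt
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
            # Every failure is treated the same way: try again with the same timeout.
            last_error = exc
    raise RuntimeError(f"command failed after {retries} attempts") from last_error


if __name__ == "__main__":
    # Placeholder command standing in for the document conversion call.
    print(run_with_retries([sys.executable, "-c", "pass"]))
```

With this shape, the worst-case time spent on one document is roughly the retry count multiplied by the timeout, regardless of file size or failure cause.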