open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelin...

Results 42 open-semantic-etl issues

I want to install the opensemanticsearch-ner-python-django module in order to remove entities which have not been correctly identified. On the page https://opensemanticsearch.org/enhancer/named_entities_manager, it explains how to install it, but not...

Content type text/tsv;... should belong to the content type group "spreadsheet", not "text document".

Law code subcodes in text like "a b c § 123 Abs. 3 d e f" should be extracted as multiple law codes: "§ 123" and "§ 123 Abs. 3",...

enhancement
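
The requested behaviour can be illustrated with a minimal sketch. The pattern and the function name `extract_law_codes` are hypothetical and cover only the "§ N Abs. M" shape quoted in the issue; this is not the project's extractor.

```python
import re

# Assumed pattern: a section sign, a section number, and an optional "Abs. N" subcode.
LAW_CODE = re.compile(r"§\s*(\d+)(?:\s+Abs\.\s*(\d+))?")


def extract_law_codes(text):
    """Return the base law code and, if present, the more specific subcode."""
    codes = []
    for match in LAW_CODE.finditer(text):
        section, paragraph = match.group(1), match.group(2)
        base = f"§ {section}"
        if base not in codes:
            codes.append(base)
        if paragraph:
            full = f"{base} Abs. {paragraph}"
            if full not in codes:
                codes.append(full)
    return codes


print(extract_law_codes("a b c § 123 Abs. 3 d e f"))
```

Emitting both the base code and the subcode lets a search for "§ 123" still find documents that only cite the more specific paragraph.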

If Tika has a parameter for a custom Tesseract OCR dictionary, add it, as in the OCR of PDF images.

enhancement

Hi, I need to extract documents and metadata from an Alfresco repository. I have tried using Apache ManifoldCF and connected Alfresco CMIS to Solr (opensemanticsearch). But I want to connect the...

After the upgrade to Python 3 with urllib, there is a problem parsing the last modification date from the web server, e.g. Wed, 21 Jun 2017 11:35:20 +0000. The dateutil parser now in use seems not to...

bug
help wanted
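
The truncated report does not say what exactly goes wrong with dateutil, so the following is only a sketch of one possible alternative: the standard library's `email.utils.parsedate_to_datetime` handles dates in the format quoted above. The helper name `parse_last_modified` is hypothetical.

```python
from email.utils import parsedate_to_datetime


def parse_last_modified(header_value):
    """Parse an RFC 2822 style date string into a timezone-aware datetime."""
    return parsedate_to_datetime(header_value)


modified = parse_last_modified("Wed, 21 Jun 2017 11:35:20 +0000")
print(modified.isoformat())  # 2017-06-21T11:35:20+00:00
```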

Add linked websites to the indexing queue only if they are not yet in the index. This saves resources, because each page is then indexed only once even if it is linked in many tweets.

enhancement
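
A minimal sketch of the idea, assuming a lookup callable `is_indexed` (hypothetical; in practice it might query the search index) and a plain list standing in for the indexing queue:

```python
def enqueue_new_links(links, is_indexed, queue):
    """Append to the queue only those URLs that are neither indexed nor already queued in this batch."""
    seen = set()
    added = []
    for url in links:
        if url in seen or is_indexed(url):
            continue  # skip duplicates within the batch and pages already in the index
        seen.add(url)
        queue.append(url)
        added.append(url)
    return added


# Usage with an in-memory stand-in for the index.
indexed = {"https://example.org/a"}
queue = []
added = enqueue_new_links(
    ["https://example.org/a", "https://example.org/b", "https://example.org/b"],
    lambda url: url in indexed,
    queue,
)
print(added)
```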

Add options to import tweets by date (since / to)

enhancement

The message "indexing new file /media/folder/...." is misleading when the file is only added to the queue and the indexing is done later, in parallel, by the daemon.

enhancement

Implement enhanced error handling (fallback plugins and retry) for data enrichment or data analysis plugins: There should be parameters for each extraction & analysis plugin in the process chain for...

enhancement
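
One way such fallback and retry handling could look, as a sketch only: the issue text is truncated before it names the parameters, so `retries` and the function name `run_with_fallback` are assumptions, and plugins are modelled as plain callables.

```python
def run_with_fallback(plugins, data, retries=2):
    """Try each plugin in order; retry a failing plugin before falling back to the next one."""
    errors = []
    for plugin in plugins:
        for attempt in range(1 + retries):
            try:
                return plugin(data)
            except Exception as exc:  # a real pipeline would likely catch narrower exceptions
                name = getattr(plugin, "__name__", repr(plugin))
                errors.append((name, attempt, exc))
    raise RuntimeError(f"all plugins failed after {len(errors)} attempts")


def broken_extractor(data):
    raise ValueError("extraction failed")


def simple_extractor(data):
    return data.upper()


result = run_with_fallback([broken_extractor, simple_extractor], "text", retries=1)
print(result)
```

Collecting the errors instead of discarding them keeps the failure history available for logging once every plugin in the chain has been exhausted.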