open-semantic-etl
Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelin...
I want to install the opensemanticsearch-ner-python-django module in order to remove entities which have not been correctly identified. On the page https://opensemanticsearch.org/enhancer/named_entities_manager, it explains how to install it, but not...
Content type text/tsv;... should belong to the content type group "spreadsheet", not "text document".
Law code subcodes in text like "a b c § 123 Abs. 3 d e f" should be extracted to multiple law codes "§ 123" and "§ 123 Abs. 3",...
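A minimal sketch of the requested behaviour, assuming a simple regular expression is enough for the example in this issue. The pattern and the function name `extract_law_codes` are illustrative assumptions and not part of the existing law code extractor; only the "§ N" and "§ N Abs. M" forms named above are covered.

```python
import re

# Hypothetical pattern: section sign, section number, optional "Abs. N" subcode.
LAW_CODE = re.compile(r"§\s*(\d+)(?:\s+Abs\.\s*(\d+))?")


def extract_law_codes(text):
    """Return each law code plus its more specific subcode, without duplicates."""
    codes = []
    for match in LAW_CODE.finditer(text):
        section, paragraph = match.group(1), match.group(2)
        base = f"§ {section}"
        if base not in codes:
            codes.append(base)
        if paragraph is not None:
            full = f"{base} Abs. {paragraph}"
            if full not in codes:
                codes.append(full)
    return codes


print(extract_law_codes("a b c § 123 Abs. 3 d e f"))
```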
If Tika has a parameter for a custom Tesseract OCR dictionary, add it as is done for OCR of PDF images.
Hi, I need to extract documents and metadata from an Alfresco repository. I have tried using Apache ManifoldCF and connected Alfresco CMIS to Solr (opensemanticsearch). But I want to connect the...
After the upgrade to Python 3 with urllib, there is a problem parsing the last modification date from a web server, e.g. Wed, 21 Jun 2017 11:35:20 +0000. The dateutil parser now in use seems not to...
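Since the issue text is cut off, the exact failure is not known. As one possible workaround, dates in this format can be parsed with the standard library function `email.utils.parsedate_to_datetime` instead of dateutil. The helper name `parse_last_modified` is an assumption for illustration and not part of the crawler.

```python
from email.utils import parsedate_to_datetime


def parse_last_modified(header_value):
    """Parse an RFC 2822 style date string; return None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


modified = parse_last_modified("Wed, 21 Jun 2017 11:35:20 +0000")
print(modified.isoformat())
```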
Add linked websites to the indexing queue only if they are not yet in the index. This saves resources, because a page is then indexed only once even if it is linked in many tweets.
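The check described here could look roughly like the sketch below. Both the `is_indexed` callback (standing in for a lookup against the search index) and the list used as a queue are hypothetical placeholders, not the project's actual queue or index API.

```python
def enqueue_if_new(url, is_indexed, queue):
    """Append url to the queue unless it is already indexed or already queued."""
    if is_indexed(url) or url in queue:
        return False
    queue.append(url)
    return True


# Usage with an in-memory set standing in for the index.
indexed = {"https://example.org/a"}
queue = []
for link in ["https://example.org/a", "https://example.org/b", "https://example.org/b"]:
    enqueue_if_new(link, indexed.__contains__, queue)
print(queue)
```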
The message "indexing new file /media/folder/...." is misleading when the file is only added to the queue and indexed later in parallel by the daemon.
Implement enhanced error handling (fallback plugins and retry) for data enrichment or data analysis plugins: There should be parameters for each extraction & analysis plugin in the process chain for...
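One way the retry and fallback idea might be structured is sketched below. The function `run_with_fallback`, its `retries` and `delay` parameters, and the plain-callable plugin interface are assumptions for illustration; the issue text is cut off before it names the actual parameters.

```python
import time


def run_with_fallback(plugins, data, retries=2, delay=0.0):
    """Try each plugin in order, retrying on failure, then fall back to the next.

    Returns (result, errors), where errors lists every failed attempt.
    Raises RuntimeError if all plugins fail on all attempts.
    """
    errors = []
    for plugin in plugins:
        name = getattr(plugin, "__name__", repr(plugin))
        for attempt in range(retries + 1):
            try:
                return plugin(data), errors
            except Exception as exc:  # broad on purpose: any plugin failure counts
                errors.append((name, attempt, str(exc)))
                if attempt < retries:
                    time.sleep(delay)
    raise RuntimeError(f"all plugins failed: {errors}")


def broken_ocr(data):
    raise IOError("OCR service unavailable")


def plain_text(data):
    return data.strip()


result, errors = run_with_fallback([broken_ocr, plain_text], "  some text  ", retries=1)
print(result, len(errors))
```

Collecting the errors rather than discarding them keeps the failed attempts available for logging even when a fallback plugin succeeds.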