fscrawler
Elasticsearch File System Crawler (FS Crawler)
Just wanted to find out if it is possible to: i) detect strikethrough in PDF files ii) detect paragraphs in PDF files
We should not extract all the raw metadata when `fs.raw_metadata` is enabled, but only the non-standard raw metadata. See https://github.com/dadoonet/fscrawler/blob/master/tika/src/main/java/fr/pilato/elasticsearch/crawler/fs/tika/TikaDocParser.java#L148-L185
- Target feature: Provide the physical path of the file that FSCrawler failed to operate on (e.g. index in ES). - Current Situation: Currently you are...
I would like to run fscrawler on a Raspberry Pi 4, which has an arm64 architecture. Although the core is written for the JVM and should be architecture independent, the produced...
**Is your feature request related to a problem? Please describe.** We're building a crawler cluster for a local area network. It is intended to provide a convenient search service. People in there...
**Is your feature request related to a problem? Please describe.** Many users of this crawler run it as a scheduled task, in a Docker container, or 24x7. Currently you have to resort...
Hi, I am currently trying to set up a pipeline for end-to-end document upload and delete, and I have successfully managed to upload a document using fscrawler...
While performing sizing tests to check how big a file can be ingested, we noticed that any file larger than 10 MB does not go through. Even if ingestion into...
Let's make the code more generic in preparation for #263 #264. Instead of writing a `{job_name}/_status.json` file, let's write: * `{job_name}/_status-fs.json` for FS standard implementation * `{job_name}/_status-ssh.json` for SSH implementation *...
Although Tesseract is integrated with fscrawler for OCR, it fails when data is in tabular form. I found that ABBYY FineReader OCR handles that efficiently. Is there any provision...