fscrawler issues

Does fscrawler recognize lines with strikethrough in PDF files?

4

Just wanted to find out if it is possible to: i) detect strikethrough in PDF files, ii) detect paragraphs in PDF files.

Do not extract ALL raw metadata

When `fs.raw_metadata` is enabled, we should extract only the non-standard raw metadata, not all of it. See https://github.com/dadoonet/fscrawler/blob/master/tika/src/main/java/fr/pilato/elasticsearch/crawler/fs/tika/TikaDocParser.java#L148-L185

dadoonet

feature_request

component:core

documents.log to contain the physical path of the file being indexed

2

- Target feature: Provide the physical path of the file that FSCrawler failed to process (e.g. index in ES). - Current situation: Currently you are...

fsaab

feature_request

component:core

Provide arm64 Docker image

6

I would like to run fscrawler on a Raspberry Pi 4, but it has arm64 architecture. Although the core is written for JVM and should be architecture independent, the produced...

ebekebe

new

Auto detect file when it's moved into watched directory

1

**Is your feature request related to a problem? Please describe.** We're building a crawler cluster for a local area network. It is intended to provide a convenient search service. People in there...

helsonxiao

feature_request

Add Elastic Java Agent instrumentation

3

**Is your feature request related to a problem? Please describe.** Many users of this crawler run it as a scheduled task, a Docker container, or 24x7. Currently you have to resort...

philippkahr

feature_request

component:monitoring

Add support for delete in the REST API

2

Hi, I am currently trying to set up a pipeline for end-to-end document upload and delete, and I have successfully managed to upload a document using fscrawler...

Deepakparashar14

Ingestion of a single file larger than 10MB fails

10

While performing sizing tests to check how big a file can be ingested, it was noticed that any file above 10MB in size does not go through. Even if ingestion into...

coder-sa

check_for_bug

Add more abstraction depending on the implementation

Let's make the code more generic in preparation for #263 #264. Instead of writing the `{job_name}/_status.json` file, let's write: * `{job_name}/_status-fs.json` for FS standard implementation * `{job_name}/_status-ssh.json` for SSH implementation *...

dadoonet

update

Interface ABBYY FineReader OCR with fscrawler

4

Although Tesseract is integrated with fscrawler for OCR, it fails when data is in tabular form. I found that ABBYY FineReader OCR handles that efficiently. Is there any provision...

manoj4321

feature_request
