David Pilato comments

Results: 329 comments by David Pilato

Duplication of PDF content

Interesting. I tested this document today, and indeed the content extracted by Tika is: ``` \nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n ``` Could you share the OpenOffice.org 2.1 source...
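To make the duplication in that extracted string easy to confirm, here is a minimal Python sketch. It is not part of FSCrawler or Tika; the helper name `repeated_lines` is hypothetical, and the input is the Tika output quoted in the comment above.

```python
from collections import Counter


def repeated_lines(extracted: str) -> dict:
    """Return each non-blank line that occurs more than once, with its count."""
    # Strip surrounding whitespace (including tabs) so "\tDummy PDF file"
    # is counted together with "Dummy PDF file".
    lines = [line.strip() for line in extracted.splitlines()]
    counts = Counter(line for line in lines if line)
    return {line: n for line, n in counts.items() if n > 1}


# The string quoted in the comment as the content extracted by Tika.
tika_output = "\nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n"
print(repeated_lines(tika_output))  # {'Dummy PDF file': 3}
```

An empty result would mean no line was extracted more than once.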

Duplication of PDF content

I'm wondering whether the content generated by OpenOffice might be the problem here... ``` %PDF-1.4 %äüöß 2 0 obj stream [binary stream data] endstream endobj 3 0 obj...

Duplication of PDF content

Interesting. PDFBox gives the right output.

```sh
wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
```

gives:

```
Dummy PDF file
```

Digging more...

Duplication of PDF content

Tested with Tika:

```sh
wget https://downloads.apache.org/tika/tika-app-1.26.jar
java -jar tika-app-1.26.jar
```

When I upload the test file, I can see that the text is extracted twice. ```xml Dummy PDF file Dummy...

Duplication of PDF content

Reopening

Use external API for OCR, i.e. Amazon Textract or Google Vision

Thanks for proposing this change. That'd require a lot of changes though, and because we are using Tika to do the extraction, I think that this change would have more...

FScrawler 2.10 does not update index when more than one Elasticsearch node is specified.

The problem is here: `Mapping is incorrect: please set stored: true on field [file.filename].` I think you need to remove the 2 indices and restart from scratch.

FScrawler 2.10 does not update index when more than one Elasticsearch node is specified.

Can you run:

```
GET /eng_drawings*/_mapping
```

and share the result?
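Once the mapping response is available, a quick check like the following Python sketch could tell whether a field is stored. It assumes the typeless response layout (`{index: {"mappings": {"properties": ...}}}`) and the Elasticsearch `store` mapping parameter; the helper name `field_is_stored`, the index name `eng_drawings`, and the sample response are hypothetical and for illustration only.

```python
def field_is_stored(mapping_response: dict, index: str, path: str) -> bool:
    """Check whether a dotted field path has "store": true in a mapping.

    Assumes the typeless layout: {index: {"mappings": {"properties": {...}}}}.
    """
    node = mapping_response[index]["mappings"]
    for part in path.split("."):
        props = node.get("properties", {})
        if part not in props:
            return False  # field (or a parent object) is not mapped at all
        node = props[part]
    return node.get("store", False) is True


# Hypothetical response body, not taken from the issue.
sample = {
    "eng_drawings": {
        "mappings": {
            "properties": {
                "file": {
                    "properties": {
                        "filename": {"type": "keyword", "store": False}
                    }
                }
            }
        }
    }
}
print(field_is_stored(sample, "eng_drawings", "file.filename"))  # False
```

A `False` result for `file.filename` would be consistent with the "Mapping is incorrect" error reported in this issue.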

FScrawler 2.10 does not update index when more than one Elasticsearch node is specified.

> The document said I could provide multiple elastic search nodes.

Yes. The problem is not related to this, unless the nodes don't belong to the same cluster.
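One way to rule out the "different clusters" case is to compare the `cluster_uuid` that each node reports in its root (`GET /`) response. The Python sketch below assumes those JSON bodies have already been fetched and parsed; the helper name `same_cluster` and the sample values are hypothetical.

```python
def same_cluster(root_infos: list) -> bool:
    """Return True if every node's root response reports the same cluster_uuid."""
    uuids = {info["cluster_uuid"] for info in root_infos}
    return len(uuids) == 1


# Hypothetical root responses from two nodes, trimmed to the relevant fields.
node_a = {"name": "node-a", "cluster_name": "docs", "cluster_uuid": "uuid-1"}
node_b = {"name": "node-b", "cluster_name": "docs", "cluster_uuid": "uuid-2"}
print(same_cluster([node_a, node_b]))  # False: the nodes are in different clusters
```

Comparing `cluster_uuid` rather than `cluster_name` avoids a false match when two separate clusters happen to share a name.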

FScrawler 2.10 does not update index when more than one Elasticsearch node is specified.

No answer on this one. Please feel free to open [a new discussion](https://github.com/dadoonet/fscrawler/discussions) about this if you are still hitting this problem.