David Pilato

Results 329 comments of David Pilato

Interesting. I tested this document today and indeed the content extracted by Tika is:

```
\nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n
```

Could you share the OpenOffice.org 2.1 source...
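
To make the repetition easier to see, here is a minimal Python sketch that counts how often each non-empty line occurs in the extracted text. The helper name `count_repeated_lines` is hypothetical and is not part of Tika or FSCrawler; the sample string is the output quoted above.

```python
from collections import Counter


def count_repeated_lines(extracted: str) -> dict:
    """Count occurrences of each non-empty line, ignoring surrounding whitespace."""
    stripped = (line.strip() for line in extracted.splitlines())
    return dict(Counter(line for line in stripped if line))


# Content reported above for the test document.
sample = "\nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n"
print(count_repeated_lines(sample))  # {'Dummy PDF file': 3}
```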

I'm wondering if the content generated by OpenOffice is not the problem here...

```
%PDF-1.4
%äüöß
2 0 obj
stream
[binary stream data]
endstream
endobj
3 0 obj...
```

Interesting. PDFBox gives the right output.

```sh
wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
```

gives:

```
Dummy PDF file
```

Digging more...

Tested with Tika

```sh
wget https://downloads.apache.org/tika/tika-app-1.26.jar
java -jar tika-app-1.26.jar
```

When I upload the test file, I can see that the text is extracted twice.

```xml
Dummy PDF file Dummy...
```

Thanks for proposing this change. That'd require a lot of changes though and because we are using Tika to do the extraction, I think that this change would have more...

The problem is here: `Mapping is incorrect: please set stored: true on field [file.filename].` You need to remove the 2 indices and restart from scratch, I think.
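
As a rough sketch of the "remove the 2 indices" step, the Python below builds one `DELETE` request per index against the Elasticsearch REST API. Everything named here is an assumption for illustration: the helper `build_delete_requests`, the base URL `http://localhost:9200`, and the index names `my_job` and `my_job_folder` are hypothetical and must be replaced with the real values. The requests are only built, not sent.

```python
import urllib.parse
import urllib.request


def build_delete_requests(base_url, index_names):
    """Build one HTTP DELETE request per index name (nothing is sent here)."""
    base = base_url.rstrip("/")
    requests = []
    for name in index_names:
        # Percent-encode the index name so it is safe inside a URL path.
        url = base + "/" + urllib.parse.quote(name, safe="")
        requests.append(urllib.request.Request(url, method="DELETE"))
    return requests


# ASSUMPTION: hypothetical cluster URL and index names, not taken from the thread.
delete_requests = build_delete_requests(
    "http://localhost:9200", ["my_job", "my_job_folder"]
)
for req in delete_requests:
    print(req.get_method(), req.full_url)
    # To actually delete the index, uncomment the next line:
    # urllib.request.urlopen(req)
```

Deleting an index removes every document in it, so the crawler has to index the files again afterwards.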

Can you run:

```
GET /eng_drawings*/_mapping
```

And share the result?
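
Once the mapping is available, a small check like the following can show, per index, whether a given field is stored. This is a sketch under assumptions: it expects the typeless response layout (`index -> mappings -> properties`) used by recent Elasticsearch versions, the helper name `stored_flags` is hypothetical, and the sample response is invented for illustration rather than taken from a real cluster.

```python
def stored_flags(mapping_response, field_path):
    """Return {index_name: bool} telling whether field_path has "store": true."""
    flags = {}
    for index_name, body in mapping_response.items():
        node = body.get("mappings", {})
        # Walk the dotted path, e.g. "file.filename" -> file -> filename.
        for part in field_path.split("."):
            node = node.get("properties", {}).get(part, {})
        flags[index_name] = bool(node.get("store", False))
    return flags


# ASSUMPTION: invented sample shaped like a GET /<index>/_mapping response.
sample_mapping = {
    "index_a": {
        "mappings": {
            "properties": {
                "file": {
                    "properties": {
                        "filename": {"type": "keyword", "store": True}
                    }
                }
            }
        }
    },
    "index_b": {
        "mappings": {
            "properties": {
                "file": {"properties": {"filename": {"type": "keyword"}}}
            }
        }
    },
}
print(stored_flags(sample_mapping, "file.filename"))
```

An index reporting `False` for `file.filename` would be consistent with the "Mapping is incorrect" error.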

> The document said I could provide multiple elastic search nodes.

Yes. The problem is not related to this, unless the nodes don't belong to the same cluster.
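
One way to check the "same cluster" condition is to compare the `cluster_uuid` that each node reports on its root endpoint (`GET /`). The sketch below assumes those root responses have already been fetched and parsed into dictionaries; the helper name `same_cluster`, the node URLs, and the UUID values are all hypothetical.

```python
def same_cluster(root_responses):
    """Return True if every node reports the same, non-missing cluster_uuid."""
    uuids = {info.get("cluster_uuid") for info in root_responses.values()}
    return len(uuids) == 1 and None not in uuids


# ASSUMPTION: invented node URLs and UUIDs, shaped like parsed GET / responses.
nodes_ok = {
    "http://node1:9200": {"cluster_name": "docs", "cluster_uuid": "uuid-1"},
    "http://node2:9200": {"cluster_name": "docs", "cluster_uuid": "uuid-1"},
}
nodes_mixed = {
    "http://node1:9200": {"cluster_name": "docs", "cluster_uuid": "uuid-1"},
    "http://node3:9200": {"cluster_name": "docs", "cluster_uuid": "uuid-2"},
}
print(same_cluster(nodes_ok))     # True
print(same_cluster(nodes_mixed))  # False
```

Comparing the UUID rather than `cluster_name` avoids a false match when two separate clusters happen to share a name.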

No answer on this one. Please feel free to open [a new discussion](https://github.com/dadoonet/fscrawler/discussions) about this if you are still hitting this problem.