camel-quarkus Tika pdf failure after upgrade to pdfBox in Camel, requires new quarkiverse-tika

Bug description

Camel upgraded pdfbox to 3.x (https://issues.apache.org/jira/browse/CAMEL-19796).

Pdfbox 3.x is not backward compatible with 2.x, so quarkiverse-tika, which the tika extension uses, fails with the new pdfbox.

Apache tika is aware of the new version of pdfbox and the upgrade ticket is already in progress - see https://issues.apache.org/jira/browse/TIKA-3347

As soon as a new version of apache tika is released, it has to be adopted by quarkiverse-tika, and that new quarkiverse-tika version then has to be adopted by camel-quarkus.

I'm disabling the tika tests that use pdfbox until the problem is solved (on camel-main).

Aug 30 '23 08:08 JiriOndrusek

@ppalaga @jamesnetherton We have to wait for the new release of quarkiverse-tika supporting the new pdfbox. There are probably not many other options (pdfbox is involved in e.g. the fop, tika and pdf extensions, so keeping pdfbox on 2.x may be complicated). WDYT?

Aug 30 '23 08:08 JiriOndrusek

We have to wait for the new release quarkiverse-tika

Or we propose to revert the upgrade in Camel. I assume the same issue exists there if you try to bring the tika & pdf components together in the same app?

Aug 31 '23 06:08 jamesnetherton

We have to wait for the new release quarkiverse-tika

Or we propose to revert the upgrade in Camel. I assume the same issue exists there if you try to bring the tika & pdf components together in the same app?

The error occurs only when tika parses a pdf file; other functionality should work even together with the pdf extension.

Reverting the Camel change and postponing it until quarkiverse-tika supports 3.x would be an easy solution.

Aug 31 '23 08:08 JiriOndrusek

There is no point in postponing the change in Camel. Not for special reasons, but just because we cannot base the core development on what Quarkus/Quarkiverse does. It's dangerous and not healthy.

Aug 31 '23 09:08 oscerd

If errors appear in Tika on plain camel, then it makes sense to wait for a Tika release supporting pdfbox 3.x, but this is not evident through tests.

Aug 31 '23 09:08 oscerd

@oscerd I'll add some tests covering pdf to the tika component, because I cannot see any pdf file there - https://github.com/apache/camel/tree/main/components/camel-tika/src/test/resources

If any trouble emerges, we can discuss what to do next. Does that sound ok?

Aug 31 '23 09:08 JiriOndrusek

Yes, it is.

Aug 31 '23 09:08 oscerd

Also if you check in the SBOM for camel, pdfbox is used explicitly only in camel-pdf and camel-fop.

Camel-tika is using only:

    {
      "ref" : "pkg:maven/org.apache.camel/[email protected]?type=jar",
      "dependsOn" : [
        "pkg:maven/org.apache.camel/[email protected]?type=jar",
        "pkg:maven/org.apache.tika/[email protected]?type=jar",
        "pkg:maven/org.apache.tika/[email protected]?type=jar",
        "pkg:maven/org.apache.tika/[email protected]?type=jar"
      ]
    }

Neither the tika commons nor the tika text module uses pdfbox. It's only related to the quarkiverse extension: https://github.com/quarkiverse/quarkus-tika/blob/main/pom.xml#L63

Aug 31 '23 09:08 oscerd

I missed that fact, though having pdf as part of the tika test coverage makes sense nevertheless (but with lower priority).

Aug 31 '23 10:08 JiriOndrusek

@oscerd I tried to create a test which parses pdf (using plain Camel). For that purpose I had to add:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parser-pdf-module</artifactId>
    <version>${tika-version}</version> <!-- 2.8.0 -->
</dependency>

to get PDFParser for Tika. This dependency brings in PdfBox 2.x (see https://mvnrepository.com/artifact/org.apache.tika/tika-parser-pdf-module/2.9.0)

Therefore a user who would like to use the pdf parser for Tika might get a dependency conflict if, for example, camel-pdf is also part of the project. The failure is caused by the incompatibility between pdfbox 2.x and 3.x.

java.lang.NoSuchMethodError: 'org.apache.pdfbox.pdmodel.PDDocument org.apache.pdfbox.pdmodel.PDDocument.load(java.io.InputStream, java.lang.String, org.apache.pdfbox.io.MemoryUsageSetting)'
	at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:421)
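
A minimal sketch (not part of the tests discussed here) of how one might check which pdfbox generation ends up on the classpath of such a project. It relies on the assumption that `org.apache.pdfbox.Loader` is present only in pdfbox 3.x, where `Loader.loadPDF` replaced the `PDDocument.load` methods that the stack trace above fails to find. The class name `PdfBoxProbe` and its methods are hypothetical names used only for illustration.

```java
import java.util.function.Predicate;

// Hypothetical helper: guesses the pdfbox generation from marker classes.
class PdfBoxProbe {

    // Returns "3.x", "2.x" or "absent" depending on which marker classes are visible.
    static String detect(Predicate<String> classPresent) {
        // Assumption: org.apache.pdfbox.Loader was introduced with pdfbox 3.x.
        if (classPresent.test("org.apache.pdfbox.Loader")) {
            return "3.x";
        }
        // PDDocument without Loader points at the 2.x line that tika 2.8.0 expects.
        if (classPresent.test("org.apache.pdfbox.pdmodel.PDDocument")) {
            return "2.x";
        }
        return "absent";
    }

    // Checks the real classpath without initializing the class.
    static boolean onClasspath(String className) {
        try {
            Class.forName(className, false, PdfBoxProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("pdfbox on classpath: " + detect(PdfBoxProbe::onClasspath));
    }
}
```

If this reports 3.x in a project that also uses tika-parser-pdf-module 2.8.0, the `NoSuchMethodError` above is to be expected as soon as tika parses a pdf file.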

I can imagine that this use case, parsing pdf with camel-tika while also depending on camel-pdf, does not make sense.

If using camel-tika (for pdf parsing) together with camel-pdf is not supported, enforcing a similar restriction for camel-quarkus might solve the problem. (Currently we test the tika extension for parsing pdf files.)

--- edited

The tika version is 2.8.0, not 2.9.0 (the behavior is the same).

Aug 31 '23 11:08 JiriOndrusek

To me this is really a corner case. I understand the point, but it's a tika problem.

Aug 31 '23 11:08 oscerd

I agree, so the right way for camel-quarkus is not to test pdf parsing with Tika (as this is a corner case in plain Camel, which might not work in some cases). Camel-pdf should be used instead.

Aug 31 '23 11:08 JiriOndrusek

Yes, once they align we could re-use tika for parsing PDF

Aug 31 '23 11:08 oscerd
