Tika pdf failure after upgrade to pdfBox in Camel, requires new quarkiverse-tika

Open JiriOndrusek opened this issue 1 year ago • 13 comments

Bug description

Camel upgraded pdfbox to 3.x (https://issues.apache.org/jira/browse/CAMEL-19796).

Pdfbox 3.x is not backward compatible with 2.x, so quarkiverse-tika, which the tika extension uses, fails with the new pdfbox.

Apache tika is aware of the new version of pdfbox and the upgrade ticket is already in progress - see https://issues.apache.org/jira/browse/TIKA-3347

As soon as a new version of Apache Tika is released, it has to be adopted by quarkiverse-tika, and that new quarkiverse-tika version then has to be adopted by camel-quarkus.

I'm disabling the tika tests that use pdfbox until the problem is solved (on camel-main).

JiriOndrusek avatar Aug 30 '23 08:08 JiriOndrusek

@ppalaga @jamesnetherton We have to wait for the new release of quarkiverse-tika supporting the new pdfbox. There are probably not many other options (pdfbox is involved in e.g. the fop, tika and pdf extensions, so keeping pdfbox on 2.x may be complicated). WDYT?

JiriOndrusek avatar Aug 30 '23 08:08 JiriOndrusek

We have to wait for the new release quarkiverse-tika

Or we propose to revert the upgrade in Camel. I assume the same issue exists there if you try to bring the tika & pdf components together in the same app?

jamesnetherton avatar Aug 31 '23 06:08 jamesnetherton

We have to wait for the new release quarkiverse-tika

Or we propose to revert the upgrade in Camel. I assume the same issue exists there if you try to bring the tika & pdf components together in the same app?

The error occurs only when tika parses a PDF file; other functionality should work even together with the pdf extension.

Reverting the Camel change and postponing it until quarkiverse-tika supports 3.x would be an easy solution.

JiriOndrusek avatar Aug 31 '23 08:08 JiriOndrusek

There is no point in postponing the change in Camel. Not for special reasons, but just because we cannot base the core development on what Quarkus/Quarkiverse does. It's dangerous and not healthy.

oscerd avatar Aug 31 '23 09:08 oscerd

If errors appear in Tika on plain Camel, then it makes sense to wait for a Tika release supporting pdfbox 3.x, but so far the tests do not show this.

oscerd avatar Aug 31 '23 09:08 oscerd

@oscerd I'll add some tests covering PDF to the tika component, because I cannot see any PDF file there - https://github.com/apache/camel/tree/main/components/camel-tika/src/test/resources

If any trouble emerges, we can discuss what to do next. Does that sound OK?

JiriOndrusek avatar Aug 31 '23 09:08 JiriOndrusek

Yes, it does.

oscerd avatar Aug 31 '23 09:08 oscerd

Also, if you check the SBOM for Camel, pdfbox is used explicitly only in camel-pdf and camel-fop.

camel-tika depends only on:

    {
      "ref" : "pkg:maven/org.apache.camel/[email protected]?type=jar",
      "dependsOn" : [
        "pkg:maven/org.apache.camel/[email protected]?type=jar",
        "pkg:maven/org.apache.tika/[email protected]?type=jar",
        "pkg:maven/org.apache.tika/[email protected]?type=jar",
        "pkg:maven/org.apache.tika/[email protected]?type=jar"
      ]
    }

Neither the tika commons nor the tika text module uses pdfbox. It is only related to the quarkiverse extension: https://github.com/quarkiverse/quarkus-tika/blob/main/pom.xml#L63
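A minimal sketch of that kind of SBOM check, assuming the `dependsOn` entries are Maven package URLs of the form `pkg:maven/<groupId>/<artifactId>@<version>?type=jar`. The class `PurlCheck`, its methods, and the sample purls in `main` are hypothetical and only illustrate the idea; they are not the actual entries from the Camel SBOM.

```java
import java.util.List;

/** Hypothetical helper: looks for a groupId among Maven purls taken from an SBOM "dependsOn" list. */
final class PurlCheck {

    /** Extracts "groupId:artifactId" from a Maven purl, ignoring version, qualifiers and subpath. */
    static String coordinates(String purl) {
        String prefix = "pkg:maven/";
        if (!purl.startsWith(prefix)) {
            throw new IllegalArgumentException("Not a Maven purl: " + purl);
        }
        String rest = purl.substring(prefix.length());
        // Cut at the first '@' (version), '?' (qualifiers) or '#' (subpath), whichever comes first.
        int end = rest.length();
        for (char stop : new char[] {'@', '?', '#'}) {
            int i = rest.indexOf(stop);
            if (i >= 0 && i < end) {
                end = i;
            }
        }
        String path = rest.substring(0, end);
        int slash = path.indexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("Missing groupId or artifactId: " + purl);
        }
        return path.substring(0, slash) + ":" + path.substring(slash + 1);
    }

    /** Returns true if any purl in the list belongs to the given groupId. */
    static boolean dependsOnGroup(List<String> dependsOn, String groupId) {
        return dependsOn.stream()
                .map(PurlCheck::coordinates)
                .anyMatch(c -> c.startsWith(groupId + ":"));
    }

    public static void main(String[] args) {
        // Illustrative entries only, not copied from a real SBOM.
        List<String> dependsOn = List.of(
                "pkg:maven/org.apache.tika/tika-core@2.8.0?type=jar",
                "pkg:maven/org.example/some-library@1.0.0?type=jar");
        System.out.println("direct pdfbox dependency: " + dependsOnGroup(dependsOn, "org.apache.pdfbox"));
    }
}
```

Note that this only inspects the direct `dependsOn` list of one component; transitive dependencies would need a walk over the whole SBOM graph.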

oscerd avatar Aug 31 '23 09:08 oscerd

I missed that fact, though having PDF as part of the tika test coverage makes sense nevertheless (but with lower priority).

JiriOndrusek avatar Aug 31 '23 10:08 JiriOndrusek

@oscerd I tried to create a test which parses pdf (using plain Camel). For that purpose I had to add:

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parser-pdf-module</artifactId>
        <version>${tika-version}</version> <!-- 2.8.0 -->
    </dependency>

to get PDFParser for Tika. This dependency brings in PDFBox 2.x (see https://mvnrepository.com/artifact/org.apache.tika/tika-parser-pdf-module/2.9.0)

Therefore, a user who wants to use the PDF parser for Tika might hit a dependency conflict if, for example, camel-pdf is also part of the project. The failure is caused by the incompatibility between pdfbox 2.x and 3.x.

java.lang.NoSuchMethodError: 'org.apache.pdfbox.pdmodel.PDDocument org.apache.pdfbox.pdmodel.PDDocument.load(java.io.InputStream, java.lang.String, org.apache.pdfbox.io.MemoryUsageSetting)'
	at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:421)
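
The error above means that the `PDDocument.load(InputStream, String, MemoryUsageSetting)` overload Tika's `PDFParser` was compiled against is not present in the pdfbox jar that ended up on the classpath. A minimal sketch of how one could probe for that method up front, using only reflection so it compiles without pdfbox; the class `MethodProbe` and the method `hasPublicMethod` are hypothetical names, while the pdfbox class and parameter types are taken from the stack trace:

```java
/** Hypothetical helper: checks for a public method by name without a compile-time dependency. */
final class MethodProbe {

    /**
     * Returns true if the class can be loaded and exposes a public method with the given
     * name and parameter types; returns false if the class, a parameter type, or the method is missing.
     */
    static boolean hasPublicMethod(String className, String methodName, String... paramTypeNames) {
        try {
            ClassLoader cl = MethodProbe.class.getClassLoader();
            // 'false' avoids running static initializers of the probed classes.
            Class<?> owner = Class.forName(className, false, cl);
            Class<?>[] paramTypes = new Class<?>[paramTypeNames.length];
            for (int i = 0; i < paramTypeNames.length; i++) {
                paramTypes[i] = Class.forName(paramTypeNames[i], false, cl);
            }
            owner.getMethod(methodName, paramTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Signature taken from the NoSuchMethodError in the stack trace.
        boolean legacyLoad = hasPublicMethod(
                "org.apache.pdfbox.pdmodel.PDDocument", "load",
                "java.io.InputStream", "java.lang.String",
                "org.apache.pdfbox.io.MemoryUsageSetting");
        System.out.println("PDDocument.load(InputStream, String, MemoryUsageSetting) present: " + legacyLoad);
    }
}
```

Run on the application classpath, a `false` result would suggest that the resolved pdfbox does not match what the Tika PDF parser module expects (or that pdfbox is absent altogether).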

I can imagine that this use case does not make sense: parsing PDF with camel-tika while also depending on camel-pdf.

If using camel-tika (for PDF parsing) together with camel-pdf is not supported, enforcing a similar restriction in camel-quarkus might solve the problem. (Currently we are testing the tika extension for parsing PDF files.)

--- edited

The Tika version is 2.8.0, not 2.9.0 (the behavior is the same).

JiriOndrusek avatar Aug 31 '23 11:08 JiriOndrusek

To me this is really a corner case. I understand the point, but it's a tika problem.

oscerd avatar Aug 31 '23 11:08 oscerd

I agree, so the right way for camel-quarkus is to not test PDF parsing with Tika (as this is a corner case in plain Camel, which might not work in some cases); camel-pdf should be used instead.

JiriOndrusek avatar Aug 31 '23 11:08 JiriOndrusek

Yes, once they align we could reuse Tika for parsing PDFs.

oscerd avatar Aug 31 '23 11:08 oscerd