camel-quarkus
Tika pdf failure after upgrade to pdfBox in Camel, requires new quarkiverse-tika
Bug description
Camel upgraded pdfbox to 3.x (https://issues.apache.org/jira/browse/CAMEL-19796).
Pdfbox 3.x is not backward compatible with 2.x, so quarkiverse-tika, which the tika extension uses, fails with the new pdfbox.
Apache Tika is aware of the new pdfbox version, and the upgrade ticket is already in progress; see https://issues.apache.org/jira/browse/TIKA-3347
As soon as a new version of Apache Tika is released, it has to be adopted by quarkiverse-tika, and that new quarkiverse-tika version then has to be adopted by camel-quarkus.
I'm disabling the tika tests that use pdfbox (on camel-main) until the problem is solved.
@ppalaga @jamesnetherton We have to wait for the new release of quarkiverse-tika supporting the new pdfbox. There are probably not many other options: pdfbox is involved in e.g. the fop, tika and pdf extensions, so keeping pdfbox on 2.x may be complicated. WDYT?
We have to wait for the new release of quarkiverse-tika
Or we propose to revert the upgrade in Camel. I assume the same issue exists there if you try to bring the tika & pdf components together in the same app?
The error starts to happen as soon as tika parses any pdf file; other functionality should work, even together with the pdf extension.
Reverting the Camel change and postponing it until quarkiverse-tika supports 3.x would be an easy solution.
There is no point in postponing the change in Camel. Not for special reasons, but just because we cannot base the core development on what Quarkus/Quarkiverse does. It's dangerous and not healthy.
If errors appear in Tika on plain Camel, then it makes sense to wait for a Tika release supporting pdfbox 3.x, but this is not evident from the tests.
@oscerd I'll add some tests covering pdf to the tika component, because I cannot see any pdf file there - https://github.com/apache/camel/tree/main/components/camel-tika/src/test/resources
If any trouble emerges, we can discuss what to do next. Does that sound ok?
Yes, it does.
Also if you check in the SBOM for camel, pdfbox is used explicitly only in camel-pdf and camel-fop.
Camel-tika is using only:
{
  "ref" : "pkg:maven/org.apache.camel/[email protected]?type=jar",
  "dependsOn" : [
    "pkg:maven/org.apache.camel/[email protected]?type=jar",
    "pkg:maven/org.apache.tika/[email protected]?type=jar",
    "pkg:maven/org.apache.tika/[email protected]?type=jar",
    "pkg:maven/org.apache.tika/[email protected]?type=jar"
  ]
}
Neither the tika commons nor the tika text modules use pdfbox. It's only something related to the quarkiverse extension: https://github.com/quarkiverse/quarkus-tika/blob/main/pom.xml#L63
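The SBOM check described above, looking through a component's dependsOn entries for a direct pdfbox reference, could be sketched roughly as follows. This is only an illustration: the class name, the method name and the sample entries are hypothetical, and the artifact names and versions are placeholders rather than values taken from the real SBOM.

```java
import java.util.List;

public class SbomDependencyCheck {

    /** True if any package URL in the list belongs to the given Maven groupId. */
    static boolean dependsOnGroup(List<String> packageUrls, String groupId) {
        // Maven package URLs have the shape pkg:maven/<groupId>/<artifactId>@<version>
        String prefix = "pkg:maven/" + groupId + "/";
        return packageUrls.stream().anyMatch(purl -> purl.startsWith(prefix));
    }

    public static void main(String[] args) {
        // Hypothetical dependsOn entries; artifact names and versions are placeholders.
        List<String> dependsOn = List.of(
                "pkg:maven/org.apache.camel/example-core@x.y.z?type=jar",
                "pkg:maven/org.apache.tika/example-tika-module@x.y.z?type=jar");

        System.out.println("direct pdfbox dependency: "
                + dependsOnGroup(dependsOn, "org.apache.pdfbox"));
    }
}
```

Note that a check like this only sees the entries listed directly for one component. A library pulled in transitively by an optional module that the user adds would not show up in that list.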
I missed that fact, though having pdf as part of the tika test coverage makes sense nevertheless (but with lower priority).
@oscerd I tried to create a test which parses pdf (using plain Camel). For that purpose I had to add:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parser-pdf-module</artifactId>
    <version>${tika-version}</version> <!-- 2.8.0 -->
</dependency>
to get PDFParser for Tika. This dependency brings in PdfBox 2.x (see https://mvnrepository.com/artifact/org.apache.tika/tika-parser-pdf-module/2.9.0)
Therefore a user who wants to use the pdf parser for Tika might get a dependency conflict if, for example, camel-pdf is also part of the project. The failure is caused by the incompatibility between pdfbox 2.x and 3.x.
java.lang.NoSuchMethodError: 'org.apache.pdfbox.pdmodel.PDDocument org.apache.pdfbox.pdmodel.PDDocument.load(java.io.InputStream, java.lang.String, org.apache.pdfbox.io.MemoryUsageSetting)'
at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:421)
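To see which pdfbox API line actually ends up on the classpath when this error shows up, a small reflective probe can help. The sketch below is not part of Camel or Tika; the class and method names are made up for illustration. It assumes that pdfbox 3.x ships the org.apache.pdfbox.Loader class (which replaced the PDDocument.load methods) and that 2.x does not.

```java
public class PdfBoxApiProbe {

    /** Returns true if the named class can be found on the classpath. */
    static boolean classPresent(String className) {
        try {
            // initialize=false: only look the class up, do not run static initializers
            Class.forName(className, false, PdfBoxApiProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Maps the two presence checks to a rough API line label. */
    static String classify(boolean pdDocumentPresent, boolean loaderPresent) {
        if (!pdDocumentPresent) {
            return "absent";
        }
        // Assumption: org.apache.pdfbox.Loader exists only in pdfbox 3.x
        return loaderPresent ? "3.x" : "2.x";
    }

    static String detect() {
        return classify(
                classPresent("org.apache.pdfbox.pdmodel.PDDocument"),
                classPresent("org.apache.pdfbox.Loader"));
    }

    public static void main(String[] args) {
        System.out.println("pdfbox API line on classpath: " + detect());
    }
}
```

If the probe reports 3.x while tika-parser-pdf-module 2.8.0 is in use, the NoSuchMethodError above is the expected outcome, because that parser calls a PDDocument.load overload that is no longer there.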
I can imagine that this use case does not make sense: parsing pdf with camel-tika while also depending on camel-pdf.
If using camel-tika (for pdf parsing) together with camel-pdf is not supported, enforcing a similar restriction in camel-quarkus might solve the problem. (Currently we are testing the tika extension for parsing pdf files.)
--- edited
The tika version is 2.8.0, not 2.9.0 (the behavior is the same).
To me this is really a corner case. I understand the point, but it's a tika problem.
I agree, so the right way for camel-quarkus is not to test pdf parsing with Tika (as this is a corner case in plain Camel, which might not work in some situations); camel-pdf should be used instead.
Yes, once they align we could re-use tika for parsing PDF