Tilman Hausherr

Results: 127 comments by Tilman Hausherr

I didn't touch tika-grpc/pom.xml at all. Your script has "tika-pipes/tika-grpc"; however, "tika-grpc" is at the top level.

The last change makes it fail on our CI:

[INFO] --- protobuf:0.6.1:compile (default) @ tika-grpc ---
[INFO] Downloading from central: https://repo.maven.apache.org/maven2/com/google/protobuf/protoc/3.25.8/protoc-3.25.8-$%7Bos.detected.classifier%7D.exe
[ERROR] Failed to execute goal org.xolstice.maven.plugins:protobuf-maven-plugin:0.6.1:compile (default) on project...
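The `$%7B...%7D` part of the download URL in that log is the URL-encoded form of `${os.detected.classifier}`, which suggests the Maven property was never resolved before the artifact name was built. That property is usually supplied by the os-maven-plugin build extension, so this may point at the extension not being active for the module, though the log alone does not confirm it. A minimal Java sketch for checking this, where the class and method names are hypothetical and only the standard library is used:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or Maven: decodes a download URL
// and reports whether it still contains an unresolved ${...} placeholder.
public class ClassifierUrlCheck {

    // Matches a literal Maven-style placeholder such as ${os.detected.classifier}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{[^}]+\\}");

    // Percent-decodes the URL so that %7B and %7D become '{' and '}'.
    static String decode(String url) {
        return URLDecoder.decode(url, StandardCharsets.UTF_8);
    }

    // True if the decoded URL still contains a ${...} placeholder.
    static boolean hasUnresolvedProperty(String url) {
        return PLACEHOLDER.matcher(decode(url)).find();
    }

    public static void main(String[] args) {
        // URL copied from the CI log above.
        String url = "https://repo.maven.apache.org/maven2/com/google/protobuf/"
                + "protoc/3.25.8/protoc-3.25.8-$%7Bos.detected.classifier%7D.exe";
        System.out.println(decode(url));
        System.out.println(hasUnresolvedProperty(url));
    }
}
```

Run against the URL from the log, the decoded string ends in `protoc-3.25.8-${os.detected.classifier}.exe`, so the check reports an unresolved placeholder.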

Oops, I see now that the JIRA ticket already mentions that this plugin no longer works.

You can close the issue yourself.

The PDF implementation can be found in PDFTextStripper.java in the PDFBox project.

Sorry for the late reaction. I'm kinda undecided. The problem is that this will work only for very simple cases, where the text is in the top content stream.

Look at the RemoveAllText.java example in the examples subproject; it processes form objects and patterns too. But be aware that PDF is so complex that there will...