Are Rust-native libraries used at all, or does everything go through Tika?

Open darkostanimirovic opened this issue 7 months ago • 2 comments

Hi there. The README.md implies that native Rust libraries are used for common formats and that Tika is only a fallback for other formats. I can't verify this in the code or in the tests, though.

It looks like Tika is used for all formats.

Can you please clarify?

Thanks

darkostanimirovic, Sep 10 '25 15:09

I was curious as well. Here's the answer from a Gemini analysis of this repository (packed using gitingest):


Based on an analysis of the provided source code, all file formats are handled by falling back to Apache Tika.

Here's a breakdown of why:

  • Core Logic: The main Extractor struct in extractous-core/src/extractor.rs delegates all its extraction methods (like extract_file, extract_bytes, etc.) directly to functions within the internal tika module. There is no conditional logic to handle any file types differently within the Rust code.
  • Tika Integration: The extractous-core/src/tika/ directory and its modules are dedicated to communicating with a Java environment. Specifically, parse.rs makes Java Native Interface (JNI) calls to the ai.yobix.TikaNativeMain Java class.
  • Java-Side Implementation: The Java class TikaNativeMain.java uses org.apache.tika.parser.AutoDetectParser. This is Apache Tika's standard mechanism for automatically identifying a file's type and then using the appropriate parser for it.
  • Build Process: The build.rs script is responsible for a crucial step: compiling the Java code found in tika-native/ into a native shared library using GraalVM. This library, which contains Apache Tika's functionality, is then linked against the Rust code.
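The delegation pattern described in the points above can be sketched roughly as follows. This is a hypothetical illustration, not code copied from extractous: the `tika::parse_file` function, the `extract_file_to_string` signature, and the string-based error type are all assumptions, and the real JNI call is replaced by a stub so the example stands alone.

```rust
// Hypothetical sketch: names and signatures are illustrative assumptions,
// not the actual extractous source.
mod tika {
    /// Stand-in for the JNI call into the GraalVM-compiled Tika library.
    /// The real code would cross into Java here; this stub only tags its input.
    pub fn parse_file(path: &str) -> Result<String, String> {
        Ok(format!("[tika] extracted text from {}", path))
    }
}

pub struct Extractor;

impl Extractor {
    pub fn new() -> Self {
        Extractor
    }

    /// No branching on extension or MIME type happens on the Rust side:
    /// every input is handed to the same `tika` function, and format
    /// detection is left to Tika itself.
    pub fn extract_file_to_string(&self, path: &str) -> Result<String, String> {
        tika::parse_file(path)
    }
}
```

In a design with natively handled formats, one would expect a `match` on the detected type somewhere in `extract_file_to_string`, with Tika as the default arm. The analysis above reports finding no such branch.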

Conclusion

While the project's README mentions that Tika is used for formats "not natively supported in the Rust core," the provided code shows that, in its current state, no formats are handled natively in Rust. The Rust portion of the project serves as a high-performance wrapper that manages the execution of and communication with the pre-compiled Apache Tika library. This architectural choice allows the project to leverage Tika's extensive file format support while benefiting from Rust's performance and memory safety for the overall application.

Still, is it a problem?

Valian, Sep 18 '25 11:09

My impression was always that extractous essentially just embeds Tika in a Rust binary, rather than doing the extraction itself.

nickchomey, Sep 18 '25 13:09