Are Rust-native libraries used at all, or does everything go through Tika?
Hi there. The README.md implies that native Rust libraries are used for common formats and that Tika is only a fallback for other formats. I can't verify this in the code or in the tests, though.
It looks like Tika is used for all formats.
Can you please clarify?
Thanks
I was curious as well. Here's Gemini's analysis of this repository (packed using gitingest):
Based on an analysis of the provided source code, all file formats are handled by falling back to Apache Tika.
Here's a breakdown of why:
- Core Logic: The main `Extractor` struct in `extractous-core/src/extractor.rs` delegates all its extraction methods (like `extract_file`, `extract_bytes`, etc.) directly to functions within the internal `tika` module. There is no conditional logic to handle any file types differently within the Rust code.
- Tika Integration: The `extractous-core/src/tika/` directory and its modules are dedicated to communicating with a Java environment. Specifically, `parse.rs` makes Java Native Interface (JNI) calls to the `ai.yobix.TikaNativeMain` Java class.
- Java-Side Implementation: The Java class `TikaNativeMain.java` uses `org.apache.tika.parser.AutoDetectParser`. This is Apache Tika's standard mechanism for automatically identifying a file's type and then using the appropriate parser for it.
- Build Process: The `build.rs` script is responsible for a crucial step: compiling the Java code found in `tika-native/` into a native shared library using GraalVM. This library, which contains Apache Tika's functionality, is then linked against the Rust code.
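To make the delegation pattern described above concrete, here is a minimal sketch of what "everything goes through Tika" looks like structurally. This is not the actual extractous code: the `tika` module below is a hypothetical stub standing in for the JNI layer, the function names `parse_file` and `parse_bytes` are assumptions, and the signatures are simplified to return plain strings. The point is only that neither method inspects the file type before handing off.

```rust
// Hypothetical stand-in for the internal `tika` module. In the real crate this
// layer would make JNI calls into the GraalVM-compiled Tika library; here it
// just returns a marker string so the sketch runs on its own.
mod tika {
    pub fn parse_file(path: &str) -> Result<String, String> {
        Ok(format!("[tika] parsed file {}", path))
    }

    pub fn parse_bytes(bytes: &[u8]) -> Result<String, String> {
        Ok(format!("[tika] parsed {} bytes", bytes.len()))
    }
}

// Simplified illustration of an extractor that is a thin wrapper.
pub struct Extractor;

impl Extractor {
    pub fn new() -> Self {
        Extractor
    }

    // No match on extension or MIME type: every input takes the same path.
    pub fn extract_file(&self, path: &str) -> Result<String, String> {
        tika::parse_file(path)
    }

    pub fn extract_bytes(&self, bytes: &[u8]) -> Result<String, String> {
        tika::parse_bytes(bytes)
    }
}

fn main() {
    let extractor = Extractor::new();
    // A PDF and a plain-text file are handled identically by the Rust side.
    println!("{}", extractor.extract_file("report.pdf").unwrap());
    println!("{}", extractor.extract_file("notes.txt").unwrap());
}
```

A design with native Rust parsers would instead need some branch on the detected format inside `extract_file` before falling back to the `tika` call, and that branch is what the analysis says is absent.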
Conclusion
While the project's README mentions that Tika is used for formats "not natively supported in the Rust core," the provided code shows that, in its current state, no formats are handled natively in Rust. The Rust portion of the project serves as a high-performance wrapper that manages the execution of and communication with the pre-compiled Apache Tika library. This architectural choice allows the project to leverage Tika's extensive file format support while benefiting from Rust's performance and memory safety for the overall application.
Still, is it a problem?
My impression was always that extractous essentially just embeds Tika in a Rust binary, rather than doing the extraction itself.