tikaondotnet
Add Dotnetcore support
.Net Core Support
We have long wanted to add support for .NET Core, and earlier this year IKVM was finally "revived" with .NET Core support. At first I gave up because ikvmc.exe didn't seem to work at all (and it still does not for our use case). But @dylanlangston created a proof of concept that uses IkvmReference and msbuild to extract dotnet assemblies from the Tika .jar file.
Nugets
- TikaOnDotNet nuget is now multi-targeting .Net framework 4.6.2 and .Net Core 3.1.
- TikaOnDotNet.TextExtraction is now multi-targeting .Net framework 4.6.2 and .Net Core 3.1.
Do we need to target .Net 6?
Tests
All tests but one are passing. For some reason parsing our test .rtf file throws a java UnsatisfiedLinkError exception:
TikaOnDotNet.TextExtraction.TextExtractionException : Extraction of text from the file 'files/Tika.rtf' failed.
----> TikaOnDotNet.TextExtraction.TextExtractionException : Extraction failed.
----> java.lang.UnsatisfiedLinkError : sun/java2d/Disposer.initIDs()V
If anyone has an idea what this might be related to, please help!
Build / Deployment Automation
We are going to move away from Paket and the F# build automation and use GitHub Actions to build, test, and deploy the nugets. I'd like updating the version of Tika to be a simple update of a version file. We are close with what @dylanlangston started for us.
Tests are "mostly" passing with plain msbuild and me hammering this out at the command line to produce a tika nuget:
dotnet pack ./src/TikaOnDotnet/TikaOnDotnet.csproj -p:NuspecFile=package.nuspec -p:NuSpecBasePath=. --configuration=Release
Nuget Packaging
The nuget has been updated to better represent the license, readme location, project url, and finally I've added an icon.
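For SDK-style projects, the same kind of metadata can also be declared as csproj properties instead of in a nuspec. The fragment below is only a sketch: the license expression, file names, and relative paths are assumptions for illustration and are not taken from this PR. As far as I know, these properties are ignored when `NuspecFile` is passed to `dotnet pack`, so they would only apply to a project that packs without a nuspec.

```xml
<PropertyGroup>
  <!-- Assumed license expression; use whatever the project actually ships under -->
  <PackageLicenseExpression>Apache-2.0</PackageLicenseExpression>
  <PackageProjectUrl>https://github.com/KevM/tikaondotnet</PackageProjectUrl>
  <!-- Hypothetical file names -->
  <PackageReadmeFile>README.md</PackageReadmeFile>
  <PackageIcon>icon.png</PackageIcon>
</PropertyGroup>
<ItemGroup>
  <!-- The readme and icon must be packed into the nupkg for the properties above to resolve;
       the relative paths here are placeholders -->
  <None Include="..\..\README.md" Pack="true" PackagePath="\" />
  <None Include="..\..\icon.png" Pack="true" PackagePath="\" />
</ItemGroup>
```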
Icon
I spent 30 seconds creating an icon to prettify the Nuget presentation. If anyone would like to improve on what I started, please do. I am not a design person.
TBD:
- [ ] TikaOnDotnet.TextExtraction should use nuspec or have csproj properties to make the listing as nice as TikaOnDotnet
- [ ] Move deployment automation to GitHub Actions.

Hey I'm not a designer, but if you like it I can add in a commit to this branch.
Thank you!
Hey KevM,
Do we need to target .Net 6?
Yes, we do need Tika on .Net to target .Net 6 at my organization for a couple of projects.
When do you think we could expect a new release?
There is one failing test for rtf files. No idea why it is not working. I was going to work on getting a pre-release out and then let people try it out for a bit before committing to a release.
Note: I'd be willing to take a short contract to get this release out quicker. I am self-employed.
On Wed, Sep 14, 2022, at 7:01 AM, Smiechowski Nathanael wrote:
Hey KevM,
Do we need to target .Net 6?
Yes we do need Tika on .Net to target .Net 6 at my organization for a couple of projects,
When do you think we could expect a new release?
I was able to pretty easily target net6.0 throughout and pass tests (except the RTF one) without changing any other dependency versions by incrementing version numbers and adding net6.0 to the .csproj targets.
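As a rough illustration of that change, the multi-targeting line in each .csproj might look like the fragment below. The `net462` and `netcoreapp3.1` monikers correspond to the targets named in the PR description; the exact property layout in this repository's project files is an assumption.

```xml
<PropertyGroup>
  <!-- net6.0 appended to the existing .NET Framework 4.6.2 and .NET Core 3.1 targets -->
  <TargetFrameworks>net462;netcoreapp3.1;net6.0</TargetFrameworks>
</PropertyGroup>
```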
Regarding the RTF test - it seems to pass when the RTF file doesn't contain an image. Without digging into the Java side of things I can't provide much feedback beyond that.
If you'd like, I can submit a PR for the net6 support but that'll take a bit of approval on my end as I'm using this for an internal project.
Hey ya'll. I fell into this thread while following links blindly. I revived the IKVM project.
To get Core out, and because nobody really wanted to fix it, we didn't pay any attention to AWT. So, no AWT in IKVM. My guess is this is killing your attempted usage of Java2D. I don't really know though, since I haven't investigated yet beyond reading this thread.
The previous AWT default toolkit was IKVM.AWT.WinForms. An attempt to map the AWT stuff to WinForms. As ya'll know, WinForms is quite different in Core. And it's not cross platform anyways. So we just didn't get it building, and probably aren't going to spend any time on it.
Instead though, you can probably configure IKVM to run in headless mode, just as you would configure OpenJDK to do so, by setting a system property (in OpenJDK that is java.awt.headless=true).
8.2.2 will end up with headless mode enabled by default.
Somebody try that.
Also, while I'm here, I want to make ya'll aware of IKVM.Maven.Sdk.
This is a new strategy for Java-Libraries-on-DotNet.
Instead of actually distributing cross compiled JAR -> DLL files on NuGet, which is error prone, as you don't own the assembly name or code, we allow users to directly add references to Maven packages in their C# projects. That way only one authoritative source for Java libraries exists: Maven. And the author of the actual product owns those artifacts. And the end developer is the one downloading and doing the conversion, and no middle man is redistributing licensed code.
We also embed some fancy information inside produced NuGet packages describing these references to Maven. So, say you write some .NET code that uses a library from Maven. And then you Pack that. Your produced NuGet file actually has a .pom file embedded into it. When somebody installs your NuGet package, and builds their own library, IKVM.Maven.Sdk downloads the Maven artifacts and cross compiles them on the fly.
It kind of obsoletes projects like TikaOnDotNet, unless you guys provide some functionality beyond the Java library. Like extension methods, etc.
This is great! However, it's building extremely slowly for me. Is that expected for IKVM.Maven.Sdk? Any recommendations for how the Maven build process can be sped up?
Depends. The first build is definitely going to be a thing. Likely it has to download two dozen jars and convert them all. But that information is cached for subsequent builds.
Can you describe what it looks like it's doing?
Also, while I'm here, I want to make ya'll aware of IKVM.Maven.Sdk. It kind of obsoletes projects like TikaOnDotNet, unless you guys provide some functionality beyond the Java library. Like extension methods, etc.
Thanks for the information and the suggestion of IKVM.Maven.Sdk. Unfortunately, upon digging deeper I realized the IKVM Nuget package is licensed under GPL and as such won't work for me. In addition, this project likely needs a change of license to accommodate the requirements of GPL.
Here's a configuration snippet for a .NET 6 console app with no other packages or code that was slow for me:
<ItemGroup>
  <PackageReference Include="IKVM" Version="8.2.1" />
  <PackageReference Include="IKVM.Maven.Sdk" Version="1.0.2" />
  <MavenReference Include="org.apache.tika:tika-app" Version="2.5.0" />
</ItemGroup>
Here's another configuration that was slow for me:
<ItemGroup>
  <PackageReference Include="IKVM" Version="8.2.1" />
  <PackageReference Include="IKVM.Maven.Sdk" Version="1.0.2" />
  <MavenReference Include="org.apache.tika:tika-core" Version="2.5.0" />
  <MavenReference Include="org.apache.tika:tika-parsers-standard-package" Version="2.5.0" />
</ItemGroup>
Building with these configurations is extremely slow even after the initial build. If I only include tika-core, then the build is fast, but that library doesn't have the parsers I need. I skimmed through the source code for IKVM.Maven.Sdk, and as near as I can tell, the Java artifacts will be downloaded and built every single time. It may just be the case that Maven is inherently slow when there are many/large dependencies.
After searching on Google for "how to speed up maven builds", one of the suggestions is to build the artifacts in parallel using multiple threads (e.g. "mvn -T 4 install"), and another suggestion is to use "offline" mode after the initial build so that maven doesn't check the internet again.
Unfortunately, the best solution for me might be simply manually building the tika dlls and adding them to source control.
and as near as I can tell, the Java artifacts will be downloaded and built every single time
Nope. The dependency graph is cached until it's changed.
After searching on Google for "how to speed up maven builds", one of the suggestions is to build the artifacts
We don't build artifacts.
I need to know where in the process you are experiencing a slowdown. What is the output when it's slow?
On tika-app, my understanding is that's not a library you're actually supposed to use as a dependency, but a JAR file with all of the dependencies embedded into it, and a main entry point, for running it as an app. Like, all the logging stuff is copied into it. Which would break trying to use tika-app along with other Java libraries that use the same logging libraries.
Like imagine if a user tried to use both tika-app and also, say, I don't know, Apache Foobar. And both depend on commons-logging. The tika-app JAR has commons-logging copied into it. While Apache Foobar uses the version from commons-logging. They'd be using different classes and assemblies, and configuration for one wouldn't work right for the other. Which is really the reason to favor using a single source in the first place.
There is probably documentation about how Tika users in Java make use of tika-core and the parsers properly.
Yeah, from the documentation:
tika-app/target/tika-app-0.7.jar Tika application. Combines the above libraries and all the external parser libraries into a single runnable jar with a GUI and a command line interface.
They then go into details about which packages you're supposed to add for development purposes.
The first thing I tried was tika-core with tika-parsers-standard-package (as suggested here). I tried tika-app for a comparison out of curiosity mostly. In both cases, building the solution or project is painfully slow, with a couple minutes being spent at "Build started..." and a couple more minutes at "1>------ Build started: Project: IKVM_Testing, Configuration: Debug Any CPU ------" in the Output window.
I appreciate you taking a look at it for me, but this may be a case where nothing can be done about it.
Heh. Yeah. Okay, I got it reproduced on my end. Tika, with all the parsers, has 87 different dependencies. That's 87 JARs that need to be downloaded and individually converted to assemblies.
It looks like the holdup on subsequent builds is checking the cache itself. The cache is organized by hash of the transpiler information per-JAR. For instance, a JAR built with 20 different dependencies, is cached as long as the 20 dependencies themselves are cached, etc. Because any change in the graph could produce different results.
It's just taking a long time trying to even figure out if they've changed. Let alone dealing with them if they have.
There are probably some optimizations I can put in here. Will look.
So it sounds like there is still a benefit of our project doing once what people would otherwise need to do on every build.
On Mon, Oct 24, 2022, at 1:54 PM, Jerome Haltom wrote:
Heh. Yeah. Okay, I got it reproduced on my end. Tika, with all the parsers, has 87 different dependencies. That's 87 JARs that need to be downloaded and individually converted to assemblies.
It looks like the holdup on subsequent builds is checking the cache itself. The cache is organized by hash of the transpiler information per-JAR. For instance, a JAR built with 20 different dependencies, is cached as long as the 20 dependencies themselves are cached, etc. Because any change in the graph could produce different results.
It's just taking a long time trying to even figure out if they've even changed. Let alone dealing with them if it has.
There are probably some optimizations I can put in here. Will look.
Sure. For the next couple hours or so, maybe. Heh. A speed problem around a cache is hardly a terminal bug in software.
IKVM 8.2.3 much improves the cache lookup speed.
I confirm that the build time is much faster. For tika-core with parsers, it was taking about 5 minutes to build an empty .NET 6 console project before, and now it's taking about 20 seconds. Thanks for fixing it so quickly!
It's at about 3 seconds for me. And I think I can get it down more. But this should be at least usable. Remember though: jar -> dll conversions are globally cached. So the same JAR and input options (frameworks, references, etc.) will always return from the cache. Even between solutions and projects. So, if you create a new Console project, from scratch, after already having used IkvmReference, it will return from the cache, even though it's a new project.
For somebody with no cache, it's still going to have to do the JAR -> DLL conversion.
@gomep342 , I was having problems getting the Tika core to find the parsers (probably due to missing CLASSPATH), so I just built everything into one dll targeting .NET 6, and that works for me. I did not make a pull request as my solution felt like a one-off for my own situation.
Hey, this is a great project! I was trying to add some Word document processing to my latest .NET program and came across this project, which is great but unfortunately doesn't support .NET Core, and I found it difficult to get this pull request branch to compile and work in a project. I also came across GroupDocs.Parser, which is unfortunately commercial software with a very limited trial, so not very useful for a project I was only doing for fun (what a party pooper).
I noticed IKVM has been revived recently too and I threw together a small proof of concept that fits my needs, where I am able to use IKVM and tika to parse doc, docx, pdf files - https://github.com/souramoo/TikaOnDotNet - along with examples on how to use it in c#, for anyone who needs this functionality until this pull request gets merged!
I initially tried using IKVM.Maven.Sdk as suggested above, but actually this optimised a bit too much, leaving out the office and OOXml parsers, so I basically just made a quick java app that drags in these dependencies into a main function and then adding a reference to this jar in my csproj file (along with the IKVM dependency) got everything working :)
@souramoo
On the references. You need to include them exactly how you would in Maven. If, for example, Maven lists them as optional dependencies, MavenReference won't pull them in. However, if Maven does list them as optional dependencies, you can add them as MavenReferences, and they'll be ordered correctly.
If they're optional references but upstream forgot to actually add them as optional references, though, a bug should be filed upstream.
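A sketch of what that could look like in a project file, reusing the Tika references posted earlier in this thread. The third entry is a made-up placeholder coordinate, not a real Tika dependency; it stands in for whichever optional dependency Maven lists but does not pull in transitively.

```xml
<ItemGroup>
  <MavenReference Include="org.apache.tika:tika-core" Version="2.5.0" />
  <MavenReference Include="org.apache.tika:tika-parsers-standard-package" Version="2.5.0" />
  <!-- Hypothetical optional dependency, added explicitly because optional
       dependencies are not resolved transitively -->
  <MavenReference Include="com.example:some-optional-dependency" Version="1.0.0" />
</ItemGroup>
```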
@wasabii thanks for the advice and great job on everything with ikvm-revived - it's very exciting stuff!
I was including both the tika-core and tika-parsers-standard-package using MavenReference directly into my project, but for some reason intellisense was pointing out that org.apache.tika.parsers.microsoft.OfficeParser was not available (despite being present in the jar file, I opened it up and checked!) - and despite other parsers such as AutodetectParser being available.
On top of that, AutodetectParser does some weird stuff to detect which Parser classes are included in the classpath which doesn't seem to quite work in IKVM so I had to specify manually (by building a quick and dirty function based on file extension).
By making another jar that drags in the parsers I want in the main function I think I convinced IKVM not to optimise the parser classes away (presumably because the main classes in the original package jar did not use the OfficeParser or OOXMLParser class directly)
Well, it'd be good to resolve some of those issues. The goal is for it to work exactly like Maven would, as far as possible.
For instance, IkvmReference defaults to the "app domain assembly class loader" feature, where the static assemblies "believe" that they live inside a ClassLoader where their direct references are available first, followed by any other assemblies loaded into the current app domain (or available in the DependencyContext on Core).
So, if they do stuff like read resources (classLoader.getResource('')), it should scan properly: first directly referenced assemblies; second the entire loaded app domain.
Making sure the compiler knows about the direct assemblies is what's important for them to be found. Since JAR files themselves do not contain any dependency information like that (pre JDK9), IkvmReference correctly listing <References> to all the dependencies is needed. And MavenReference populates that based on its graph. Hence the requirement of "Optional" dependencies documented in Maven: they determine what references what.
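For JARs referenced by hand rather than through MavenReference, the dependency would have to be declared manually. A minimal sketch, assuming two hypothetical local JARs where `parsers.jar` uses classes from `core.jar`; the file names and paths are placeholders:

```xml
<ItemGroup>
  <IkvmReference Include="lib/core.jar" />
  <IkvmReference Include="lib/parsers.jar">
    <!-- Tells the compiler that this JAR depends on core.jar;
         further entries would be separated by semicolons -->
    <References>lib/core.jar</References>
  </IkvmReference>
</ItemGroup>
```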
@wasabii Any news on this? We would like to transition to .Net Core.
@Arextion I don't see any remaining issues for IKVM on this thread that prevent Tika from running. Is there any I'm not aware of? It's been months since I tried.
@wasabii I haven't tried running this. I was just curious whether this PR would be merged soon, or whether there would be a release to try out.