lucenenet
When will the 4.8.0 version be released?
This question was asked a few months ago and the answer hasn't likely changed much. Please see #778. In that thread you will see references to #437 Comment which contains a wealth of information on the topic.
In all of this, one thing you should keep in mind is that many people (even Microsoft!) currently use portions of Lucene 4.8 in production. So that is to say the product is already very stable.
Companies using 4.8 in production just need to be aware that there may be some minor API changes on the road to a final 4.8 release. But honestly, that's not much different from using a production release that is followed by a major new release.
In my opinion, if you know people are using it, and you feel it is stable, you might as well call it a release and remove the beta designation.
I am confident it works for most use cases. However, I can't get some developers to consider it. If it gets escalated, it will most likely get dropped.
I feel it is time to go forward.
I'm definitely in agreement with @jeffreywstevens here. I know there are some recent commits and changes done in the past several months which would warrant a new beta release, but after that I also think we should procure an RTM release. Any changes after that can just be patch versions.
Would need buy in from @rclabo + @NightOwl888
I'm totally on board with that. In the past two years I've had to make way more changes to my code base due to ASP.NET Core API changes than I have had to for Lucene.NET 4.8 API changes. I feel that LuceneNET 4.8 is super stable (hats off to @NightOwl888!!!) and is worthy of an RTM release.
Sure there are some aspects that may not be perfectly on par with Java Lucene 4.8 (OpenNLP comes to mind), but those areas tend to be auxiliary functionality that have no easy route for porting.
The core functionality seems rock solid and awesome from my perspective. And I think so many more devs will use this amazing project if it's RTM.
I know @NightOwl888 is under a lot of pressure right now due to a deadline on another project, so he might not be able to chime in for a while.
That doesn't mean he isn't interested in this discussion; it just means he is juggling a lot at the moment.
@rclabo IIRC there were quite a lot of commits and fixes since the last release, do you think we should look to ship one more beta?
@Shazwazza Probably, but honestly...I feel like that question is above my pay grade :-)
I have enormous respect for @NightOwl888 and would certainly defer to his judgment.
I have some time on my hands and can dedicate helping out with the efforts.
Just from looking over the NuGet download stats, the 4.8 beta packages outnumber the last production 3.3 release, in terms of downloads:
https://www.nuget.org/stats/packages/Lucene.Net?groupby=Version

These numbers could be misleading and inflated due to automated CI builds but still paint a good picture for 4.8 usage. Some of us are using 4.8 betas in prod without issues, and anecdotally we hear about that from other people too. If another beta release makes prod 4.8 a reality, let's go for it.
Until @NightOwl888 can chime in, I will start pulling together the changelog and see what the new beta release would look like, and we can try pushing that out and get the ball rolling.
I put together a draft for the next release; I believe people with commit access should be able to see it in the releases page:
https://github.com/apache/lucenenet/releases
It's pretty meaty.
As the next step, I will review the communication from @NightOwl888 from the previous beta build and see what we need to do to proceed. Off the top of my head, there is a PMC vote and then the publishing of the NuGet packages if the vote passes. I also want to set up local tests for handling indexes produced by previous versions to ensure the current version can open and work with them.
Just a quick status update, I am going through the steps outlined here to ensure I have all the bits correctly set up locally to do the release.
One thing I am not clear about is the Azure Pipelines and the access that is needed there to make a release. But I haven't gotten to that part yet, so I haven't explored it too deeply.
A quick update. We have sorted out access etc., and are actively working on finishing up a few remaining things that will allow us to push 4.8 beta 17. I can dedicate a decent amount of time now and have been pushing PRs with the remaining fixes. Shad has chimed in as well and has started some work too. No ETA, but we are back up and pushing to the finish line. You can observe our progress here by watching PRs coming in and out: https://github.com/apache/lucenenet/pulls?q=is%3Apr+is%3Aclosed
Our focus areas are 1) fixing the SonarCloud code scan findings that flag converted code where base class constructors call virtual methods; those methods can be overridden in subclasses and then run before the subclass state has been initialized. 2) fixing the Close/Dispose issue with the analyzers, #271.
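For readers unfamiliar with the first item, here is a minimal sketch of the pattern in Python rather than C#. The class names (`Component`, `ConfiguredComponent`) are hypothetical and not taken from the Lucene.NET code base, and the class-level `None` only stands in for a field's default value in C# or Java.

```python
class Component:
    """Base class whose constructor calls an overridable method."""

    def __init__(self):
        # Risky: if a subclass overrides describe(), the override runs
        # before the subclass constructor has set its own fields.
        self.description = self.describe()

    def describe(self):
        return "base component"


class ConfiguredComponent(Component):
    # Stands in for the default value a field has in C# or Java
    # before the subclass constructor body runs.
    settings = None

    def __init__(self, settings):
        super().__init__()        # describe() is invoked here, too early
        self.settings = settings  # state is only assigned afterwards

    def describe(self):
        return f"component with {self.settings}"


component = ConfiguredComponent(["a", "b"])
early = component.description  # captured during base construction
late = component.describe()    # computed after construction finished
```

The value captured during construction reflects the uninitialized state, while the same call made afterwards sees the real settings. That mismatch is the kind of problem the scan findings point at.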
And then we will regroup and see where we are at with 4.8 release. There is still some work/considerations to be made about ICU4N. No ETA, as I don't think we can estimate how long this will take, but we can take it as we go and make regular status updates.
We made some more progress. One of the items on the "TODO" before the release, #670 , has been addressed.
I am taking a look at what to take care of next, most likely #271. I am lacking a lot of context there, but hopefully I can find a way to clear it up a bit more.
I think we're ready. @NightOwl888, what's your opinion? Maybe an RC?
I had to go on a month+ trip but back now. I haven't heard much from @NightOwl888 recently, he must be busy with some other commitments. The last piece of work I pushed before taking off was this #852 . It's not entirely clear if I can pull into main what we have there or if Shad was considering more changes to the approach.
Having re-familiarized myself with the project and talking with Shad more about why it's been difficult to make a production release, I think I can give this explanation for it:
- The goal for Lucene.NET is to have production releases map one-to-one to Lucene releases as much as possible. It makes sense. We want Java docs or examples that people find online for version 4.8.1 to work with the .NET version with only minor language-based tweaks. The API surface and functionality should remain the same. We do break API equivalency here and there, but we try to do so as little as possible.
The difficulty lies in what to do if we release 4.8.1 and find a bug. OK, we make a patch release, 4.8.2, that fixes that bug. But now, Java Lucene does not have a 4.8.2 version. Worse, what if the issue we discover requires a breaking change? In theory we would increment the minor version and end up with a 4.9.x release whose API and changes are not compatible with the Java Lucene 4.9.x releases that exist. And with fixes to those we would once again have releases that don't exist in the Java world, e.g. 4.9.3.
And I think that's the main issue why Shad has been extremely careful and reluctant to do production releases of the project. We know that bugs are lurking in the code base, but with each pass they are more and more difficult to find, and we can't guarantee that 4.8.1 we release will not require changes.
Careful discussion and consideration are needed here, but one way forward would be for the remaining committers who still at least chime in to come together as a group and perhaps draw a line in the sand: the 4.8.1 production release is our attempt to be as close as possible to the Lucene 4.8.1 release. All releases going forward from that will attempt to stay close at the "major" version, but the minor/patch releases can and will deviate greatly.
I am not proposing this lightly, but it does seem to offer some sort of way forward with making a production release and potentially allowing for a more frequent prod update cadence without keeping ourselves accountable for those versions to be one-to-one mapped to Java world.
@laimis I think that makes a lot of sense. Given that we haven't previously had an approach for versioning when rolling bug fixes or breaking changes once Lucene.NET 4.8 is released it's very understandable that we have held a very high bar to what needs to be achieved before doing a production release.
I personally think that what you propose as a solution seems reasonable. And who knows, perhaps someone will offer up other solutions that may be even better. But I think as a dev community we need to rally around some versioning approach whatever it is. Having a versioning approach and an understanding of what versions align with Java Lucene and which ones don't will give us the freedom to get Lucene.NET released.
Doing a production release of the library will untie the hands of developers that would love to use it but who are restricted from doing so due to company policies not allowing pre-release software into production environments. Releasing the software will thus grow our developer community and hopefully our committer pool as well. Also, releasing the software will grow the use cases that are actively being utilized and provide valuable feedback on where the library can be improved.
What you propose seems reasonable; however, it's a bit challenging that this is a release of 4.8.1 rather than 4.0. As such we only have 4.9 as a potential breaking change release, then we hit a major version 5.0. This could force us to release a breaking change as a point release, say 4.9.1. This challenge of course goes away if the next major release of Lucene.NET has a low minor release number, e.g. 10.3, but we have the same issue if the next major release of Lucene.NET has a high minor release number like 9.7 (the current version of Java Lucene). It's a bit challenging I guess, but we may just have to get comfortable with the idea of a breaking change in a point release, i.e. 4.9.1. (shrug)
@rclabo thank you for chiming in. Curious about this part that you mention:
What you propose seems reasonable however it's a bit challenging that this is a release of 4.8.1 rather than 4.0. As such we only have 4.9 as a potential breaking change release, then we hit a major version 5.0
After 4.9, wouldn't we have 4.10.x as an option? 4.11.x after, etc?
@laimis That really made me laugh (at myself). You are totally right. For some reason when I wrote that it didn't even occur to me that we could have a 4.10.x! That's pretty funny. Definitely, after 4.9 we can have 4.10.x as an option, and after that 4.11.x. Thanks for being gracious in your question. ;-)
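The point that 4.10.x follows 4.9.x is easy to miss because version strings do not sort the way version numbers do. A small sketch (the helper name `parse_version` is only illustrative) shows the difference between comparing versions as text and as tuples of integers:

```python
def parse_version(text):
    """Turn '4.10.0' into (4, 10, 0) so parts compare numerically."""
    return tuple(int(part) for part in text.split("."))


versions = ["4.9.0", "4.10.0", "4.8.1", "4.11.0"]

# Plain string sorting compares character by character, so "4.10.0"
# lands before "4.8.1" because '1' sorts before '8'.
as_text = sorted(versions)

# Sorting on the parsed tuples gives the order a reader expects.
as_numbers = sorted(versions, key=parse_version)
```

With numeric comparison the minor version can keep growing past 9 without forcing a jump to the next major version.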
I've been reading your latest comments about the problems with versioning and a production release and think I have an idea to solve this problem. What if we use the Lucene version as is and then add an extra number to the end to signal the current iteration of the version? That way you still have the consistency of matching the Lucene Java version and .Net version but can still make improvements like bug fixes that weren't caught in a preview/beta phase. (See this image for an example)
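To make the proposal concrete, here is a rough sketch of how a four-part version could be split into the upstream Lucene version and a port-specific iteration. The type and function names are hypothetical, and the four-part layout is an assumption based on the description above, not the project's actual build logic.

```python
from typing import NamedTuple, Tuple


class PortVersion(NamedTuple):
    upstream: Tuple[int, int, int]  # the Java Lucene version being tracked
    iteration: int                  # port-only counter for fixes and changes


def parse_port_version(text):
    """Split 'major.minor.patch.iteration' into upstream and iteration."""
    parts = [int(part) for part in text.split(".")]
    if len(parts) != 4:
        raise ValueError(f"expected four numeric parts, got {text!r}")
    return PortVersion(upstream=tuple(parts[:3]), iteration=parts[3])


first = parse_port_version("4.8.1.0")
fixed = parse_port_version("4.8.1.2")

# Both releases still advertise the same upstream version...
same_upstream = first.upstream == fixed.upstream
# ...while the iteration tells them apart and orders them.
is_newer = fixed > first
```

Under this assumed layout the first three parts never drift from Java Lucene, and only the last part moves when the port ships a fix.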
We also have to remember that no one can be sure that they have bug-free software and that unforeseen problems do come up no matter how long we work on something. So I think it would be better to use iterations instead of neverending beta releases like there have been for a while now with the current 4.8.0 release. This also gives a better signal of when you can use Lucene in production as has been mentioned time and time again that many people do even though it's in beta.
@nikcio - I think this is a fine proposal and in some ways, I like this approach better because it makes it clearer which version LuceneNET is in rough alignment with. The one thing lost with this approach is the ability to tell, via the version number, if an iteration of a version is a breaking change. But honestly, that doesn't bother me personally a bit. In my case, if I'm upgrading to a newer version of LuceneNET for my project, then I'm probably reading the release notes to see what new goodies it includes. And in that process, I'd be made aware of any breaking changes and the nature of those changes. That's sufficient for me and probably for a lot of devs. However, I know versioning can be an opinionated topic so it will be interesting to see how others on the dev mailing list feel.
I realize this is a lot bigger topic, but I think the maintainers of this project should seriously consider breaking off from the exact version scheme of the upstream Java Lucene.
As a consumer of this library, naturally I would like to know what API version of Lucene this corresponds to, but that could easily be solved by a version mapping table in documentation.
Examples such as
The difficulty lies in what to do if we release 4.8.1 and find a bug. OK, we make a patch release, 4.8.2 that fixes that bug. But now, Java Lucene does not have 4.8.2 version.
indicate just how hard it is to keep the versions of distinct code bases the same. Especially the patch number is troublesome as that typically designates implementation and bug fixes, but I think the same applies to minor and major.
By releasing yourself from this constraint you would have the flexibility to release stable versions of the functionality that you have implemented without waiting for 100% feature parity with a given upstream Java version.
This way you may opt to never be 100% feature complete with Java Lucene 4.8 (for example), because the community is more in need for some 7.x features that can then be prioritized over the long tail of rarely used 4.8 features (just as a made up example). By following your own version scheme you can instead document version X as "compatible with Lucene 4.8 minus features Y and Z".
It would also possibly be easier to get contributors, as most consumers of a library would rather contribute a PR that just adds a feature from a later version that they need for their application. Sorry to be blunt, but it's going to be very hard to get contributors chasing the last bits of 4.8 compatibility.
The additional value is that you can now follow semantic versioning more strictly, something I would argue is an industry standard these days. It would sure make maintaining libraries that depend on Lucene.NET easier.
First of all, the versioning scheme had been decided some time ago and is in fact documented and made part of the build. At this point I don't see any reason to go back and revisit this scheme which was part of the work that was done during the first 4.8.0 beta.
By releasing yourself from this constraint you would have the flexibility to release stable versions of the functionality that you have implemented without waiting for 100% feature parity with a given upstream Java version.
This way you may opt to never be 100% feature complete with Java Lucene 4.8 (for example), because the community is more in need for some 7.x features that can then be prioritized over the long tail of rarely used 4.8 features (just as a made up example). By following your own version scheme you can instead document version X as "compatible with Lucene 4.8 minus features Y and Z".
This assumes usability and API are the entire issue, but they are not.
Lucene.NET is the most difficult application I have ever had the pleasure of debugging in my 25 years as a developer. When we go off the map like this, we literally throw away our best debugging tool, which is to run the same version of Lucene and Lucene.NET side by side to see where the execution paths diverge. I don't have an answer for how we could debug if we combine different versions of Lucene. Do you?
Furthermore, the binary structure of the index does change from one version to the next, making them incompatible and making it literally impossible to bring many Lucene 9.x features back to Lucene.NET 4.x. We had this issue with back-porting the analyzers-nori package.
We have 100% compatibility with creating an index in Lucene and opening it in Lucene.NET with the same version and plan to keep it that way going forward (and it worked once the other way around, but hasn't been tested in quite a while). The index isn't the only binary format that is kept in sync between versions.
There are other problems with disjointed versioning between Lucene and Lucene.NET. Case in point: Lucene.NET 3.0.3. There was no release of Lucene 3.0.3. Despite trying to sleuth an answer I have no idea what commit Lucene.NET 3.0.3 is a port of. I could guess that it is a port from 3.0.1 (which actually was released), but I can't be 100% sure. I didn't even know what commit in this repo corresponded to the 3.0.3 release until I found it on an obscure blog (they released 3.0.3 RC2 by renaming it, but didn't make a tag corresponding to the 3.0.3 release). Both of these issues are the primary reason we have never done a maintenance release of Lucene.NET 3.0.3. While we could incorporate the actual version number as part of the InformationalVersion and make it disjointed, it would be very confusing for users who see numbers that overlap Lucene releases that don't correspond to them or their binary formats. Strict version compatibility avoids getting into this situation again.
For usability, there are also issues. Existing Lucene blog posts may not be useful if the API is different than the major version of Lucene the post is about.
The bottom line is there is no maintenance plan for making a Frankenstein version of Lucene that incorporates features from different versions. The best way is to try to sync the entire project to a single Git commit. The story goes way beyond keeping the API in sync. It also means keeping the execution paths, binary formats, tests, and documentation in sync.
While we could simply abandon 4.8.0 and start working on the latest version of Lucene now, we would be stuck in a situation where we have all of the same work to finish we do now plus an estimated 1800 hours of upgrading work. This upgrade estimate could be off if we run into any major gaps that mean more JDK features we need to find or build replacements for. Right now, we are in a situation where our remaining work still has an undefined scope because of gaps that we may not know about. The plan is to try to close all of the gaps so when we finally do start working on the upgrade we have a mostly well-defined scope of work instead of a fuzzy "research this and figure out what we need to do here" situation, where research is often most of the work (meaning to create an issue about it, we need to do most of the work first to define the scope of the issue).
Also, it seems like a total waste to do that. Most of the work that is remaining is on ICU4N. I have almost convinced myself that we may be able to release ICU4N as stable earlier by not strictly following the ICU versioning scheme but instead allowing each major release to have breaking API changes until we stabilize it (we are 13 versions behind so we have some wiggle room, but it does mean we will have to do a full upgrade every time we make a breaking API change). But we should probably still conditionally compile out the "draft" APIs and other APIs that are considered unstable in the NuGet package or at least make them invisible to the IDE. There are still other issues to deal with, such as the fact that NuGet doesn't actually deploy resource files for cultures it doesn't recognize. There are many decisions to make like that in ICU4N where there are gaps between Java and .NET. Unfortunately, nobody here seems willing to talk about the actual work that remains. Most want to move on to the next version of Lucene and pretend that we don't need to do this work for the upgrade, anyway.
We could alternatively move on to a 4.8.0 release while keeping Lucene.Net.ICU and the components that depend on it unstable, but unfortunately that means either splitting up the lucene-cli component or releasing it as stable with unstable dependencies. I would argue we need to focus 100% on the remaining things that could break the API before we do such a thing (such as automated query parser generation), which could still be time-consuming. It also means we won't have a completely stable 4.8.0 release; the first fully stable release might be something like 4.8.0.17. Or else we would need to set up our build to make separate stable and unstable release packages to comply with the Apache release procedure. And we still wouldn't technically be able to start working on upgrading until we have a stable ICU4N, anyway. I don't see how this improves the situation; it only adds more work to do to make it stable and makes the versioning history more difficult to understand.
It really sucks for us to have to reject what would ordinarily be good ideas from the community, but unfortunately, most of these ideas never take everything into consideration when providing such advice, only the "normal stuff" that most projects deal with.
Shad, thank you for that. I feel like it just pulled me back into reality.
So I guess what you are saying is we can't have a "stable" Lucene.NET release unless its dependencies are stable and currently Lucene.NET.ICU is a work in progress with a changing API surface.
I'm reading into that, ICU4N, which Lucene.NET.ICU depends on, is also probably a work in progress. And it's certainly worth noting that ICU support is something the Java Lucene team got for free in the JDK that unfortunately isn't included in the .NET Framework (full or core). Hence the need to create ICU4N to provide that support. A nontrivial endeavor in its own right.
In using Lucene.NET to create a search index for an e-commerce marketplace, I've never hit any ICU-related functionality that was missing that I felt I needed. Unfortunately, I have no prior history with ICU so my only learnings about it have been here on the Lucene.NET project. So I guess for me, it's often an out-of-sight, out of mind, portion of Lucene.
But when I review the docs for Lucene.Net.ICU and see what's included, it feels very central to a search library and encompasses such basic functionality as finding word boundaries and line break boundaries. While this seems trivial in languages like English it's anything but trivial in languages like Chinese 要弄清楚如何分解中文單字是很困難的。or Japanese 中国語で単語を区切る方法を理解するのは難しいです.
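A tiny sketch illustrates why word boundaries are not trivial outside whitespace-delimited languages. It uses plain whitespace splitting, which is only a stand-in for naive tokenization and has nothing to do with how Lucene.Net.ICU actually segments text.

```python
english = "finding word boundaries is hard"
chinese = "要弄清楚如何分解中文單字是很困難的。"

# Whitespace splitting recovers the English words...
english_tokens = english.split()

# ...but the Chinese sentence contains no spaces, so the whole
# sentence comes back as a single "word".
chinese_tokens = chinese.split()
```

Getting usable tokens out of the second sentence needs dictionary- or rule-based segmentation, which is the kind of functionality a break iterator provides.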
Given that a great many of the developers using Lucene.NET only use it for English text, or other languages that use the Latin alphabet, it's easy to see how we can sometimes lose sight of what ICU is and why it's so important. Based on your post, I now better understand why Lucene.NET hasn't had a public release yet. Still, it seems very unfortunate that such a stable product (at least for indexing Latin languages) has a current version (beta) that doesn't indicate it's production-ready for Latin languages.
I'm with you 100% that doing a Frankenstein version of Lucene that incorporates features from different versions is a non-starter. Being able to compare execution paths with a corresponding Java version is too valuable to give up.
So I guess what you are saying is we can't have a "stable" Lucene.NET release unless its dependencies are stable and currently Lucene.NET.ICU is a work in progress with a changing API surface.
Not exactly. We could do a release if we go over the API surface of the core and other completed components to finalize it AND build a multi-release scheme so we have 2 different release labels, one for the stable components and one for the unstable components. While the API work is something we have to do anyway, changing the build, release policy, Git labeling scheme, etc. isn't exactly free.
Lucene.Net.ICU will likely change because the CharacterIterator still needs to be converted to a .NETified component and put into J2N (right now it exists in ICU4N.Support, which is meant to go away from the public API). CharacterEnumerator was made for this purpose, but it had to be commented out because I couldn't get it working on Lucene.NET components although it worked fine in ICU4N. This modification will definitely break the public API. I don't think there are any other things that will break it, though.
I'm reading into that, ICU4N, which Lucene.NET.ICU depends on, is also probably a work in progress. And it's certainly worth noting that ICU support is something the Java Lucene team got for free in the JDK that unfortunately isn't included in the .NET Framework (full or core). Hence the need to create ICU4N to provide that support. A nontrivial endeavor in its own right.
Yes, ICU4N is still a work in progress. There are several tests that still fail, often due to gaps that we haven't yet covered. There are also some concurrency bugs to track down. Since it is only a partial port, we have lots of tests to go through that might be able to be ported, as well. The intention is not to port any more of the production code (except for perhaps some of the formatters and parsers because that is where most of its funding has come from so far).
The ICU4J functionality is not in the JDK. Instead ICU4N is a port of ICU4J. But it is hard to integrate because the gap between Java and ICU4J is not the same as the gap between .NET and ICU4N. Although, it is made easier because ICU is documented pretty well.
In short, ICU4N and ICU4J extend the text processing capabilities of .NET and Java by providing rules-based versions of some of the included components (such as the CompareInfo .NET class, which corresponds to the more powerful RuleBasedCollator in ICU4N). These components allow you to control the behavior in custom ways that simply can't be done on the raw .NET or JDK platforms. There are also many other features that are super valuable, such as the UnicodeSet, which can be used like a regex character class but is much more powerful (it can even be passed a string to match all of the characters in a specific version of Unicode).
We use the ICU4N BreakIterator in all cases where the JDK BreakIterator is required because .NET is totally lacking this feature (even though it depends on ICU now, the API for this is not exposed anywhere). This has also caused some compatibility issues because of differences between how ICU4J and the JDK behave, so we had to patch the ThaiAnalyzer and basically write our own tests for some of the highlighters. Unfortunately, the highlighters won't work exactly the same unless we do the research to work out what to recommend as the "JDK format" by providing custom rules that correspond to the Java behavior.
But when I review the docs for Lucene.Net.ICU and see what's included, it feels very central to a search library and encompasses such basic functionality as finding word boundaries and line break boundaries. While this seems trivial in languages like English it's anything but trivial in languages like Chinese 要弄清楚如何分解中文單字是很困難的。or Japanese 中国語で単語を区切る方法を理解するのは難しいです.
Given that a great many of the developers using Lucene.NET only use it for English text, or other languages that use the Latin alphabet, it's easy to see how we can sometimes lose sight of what ICU is and why it's so important. Based on your post, I now better understand why Lucene.NET hasn't had a public release yet. Still, it seems very unfortunate that such a stable product (at least for indexing Latin languages) has a current version (beta) that doesn't indicate it's production-ready for Latin languages.
Actually, there are several use cases that make it valuable even for Western European languages, for example removing diacritics from words. In .NET, this cannot be done without a hack because the normalization feature is missing the case fold option that ICU has. I have seen many people post this hack in their questions about Lucene.NET even though they could just use the ICUFoldingFilter or ICUNormalizer2Filter instead.
These make it so words with accent characters such as resume, résumé, and resumé all normalize to the same root word for searches.
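As a rough illustration of what this kind of folding achieves, here is a sketch using only Python's standard library. It is not how ICUFoldingFilter works internally and covers far fewer cases; the function name `fold` is just illustrative.

```python
import unicodedata


def fold(text):
    """Strip combining accents and case-fold, as a crude folding stand-in."""
    # Decompose characters such as 'é' into 'e' plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold so that 'Resume' and 'resume' also match.
    return stripped.casefold()


variants = ["resume", "résumé", "resumé", "Résumé"]
folded = {fold(word) for word in variants}
```

All four spellings collapse to one search term, which is the effect described above. A real folding filter also handles many characters that simple decomposition does not.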
Although the components inside of the Lucene.Net.ICU assembly are indeed valuable as is, the real value is in using ICU4N to build custom analysis components.
Thank you for the really nice and transparent explanation, @NightOwl888! Ultimately, it is down to a fundamental architectural decision on whether this is a line-by-line, version-by-version port of the Java Lucene or if this is a full-text search library based on Java Lucene. This decision is one that would be made by the maintainers, and respected by the users of this library.
While we could simply abandon 4.8.0 and start working on the latest version of Lucene now, we would be stuck in a situation where we have all of the same work to finish we do now plus an estimated 1800 hours of upgrading work
If I read the entire thread correctly, there was never a suggestion to just abandon 4.8, but instead to decide the API is stable and focus on bug fixes, then release 4.8 and figure out a different way to version the library so that API changes can be done later. This way, going from beta to release would mean the current feature set is stable, but without the guarantees of implementing 100% of the APIs of the Java version.
Just to give an example, speaking only from my experience with the library, I personally was not aware of the desire to keep on-disk binary formats the same between Java and .NET. We are only using a subset of all this functionality, and we would definitely not be using the Java version, let alone on the same data. We don't care about Java Lucene at all, we just want a really good .NET full text search engine (actually we don't care about on-disk format at all as we are 100% in memory, but that's a different story).
The bottom line is there is no maintenance plan for making a Frankenstein version of Lucene that incorporates features from different versions
I respect the decision to do a line-by-line port of Java Lucene, but I would like to point out that porting the most relevant features would not necessarily lead to a "Frankenstein" version. Obviously any feature that goes into the codebase has to be well architected and any technical dependencies for this feature have to be implemented properly. But consider if the goal was just to make the best .NET full text search engine out there. Omitting the long tail of rarely used features, so as not to have to spend 1800 hours on version 4.8, and instead focusing on the most popular features (again, building on a robust foundation) may serve the community better. This could perhaps lead to higher engagement from the community (in terms of collaboration/PRs and possibly funding). You could still use Java Lucene as a blueprint for the implementation, but with the additional insight into what turned out well and what did not turn out so well there, without being burdened as they have been by keeping compatibility with less used and less well designed features.
We could alternatively move on to 4.8.0 release while keeping the Lucene.Net.ICU and components that depend on it unstable
To be blunt, and with all respect, it might get hard to find funding for hundreds or thousands of dev hours fixing the ICU library to support rare scripts and languages, until someone with a clear business case for it turns up. Just for comparison, if some company needed, say, vector valued fields (just as a random example) they might have the resources to fund the maintainers directly or devote professional developers to work with you on implementing this feature. But as I understand it, you want to go to 9.something directly after 4.8, so maybe we'll see a lot more contributions coming in as the field will be more open for new features.
but unfortunately that means either splitting up the lucene-cli component or releasing it as stable with unstable dependencies
If you have policies against pre-release libraries this is probably also a no go. I think policies like this are based on the assumption that pre-release means unstable implementation, while you mean unstable API. This is probably the core of this discussion, as it is clear that the code base is very stable from a bugs point of view.
It sounds like you have made a well-motivated and conscious decision w.r.t the versioning policy and the way to integrate new features. Your code, your versioning policy. Thank you for an awesome effort!
If I read the entire thread correctly, there was never a suggestion to just abandon 4.8, but instead to decide the API is stable and focus on bug fixes, then release 4.8 and figure out a different way to version the library so that API changes can be done later. This way, going from beta to release would mean the current feature set is stable, but without the guarantees of implementing 100% of the APIs of the Java version.
Just to give an example, speaking only from my experience with the library, I personally was not aware of the desire to keep on-disk binary formats the same between Java and .NET. We are only using a subset of all this functionality, and we would definitely not be using the Java version, let alone on the same data. We don't care about Java Lucene at all, we just want a really good .NET full text search engine (actually we don't care about on-disk format at all as we are 100% in memory, but that's a different story).
I respect the decision to do a line-by-line port of Java Lucene, but I would like to point out that porting the most relevant features would not necessarily lead to a "Frankenstein" version. Obviously any feature that goes into the codebase has to be well architected, and any technical dependencies for this feature have to be implemented properly. But consider if the goal was just to make the best .NET full text search engine out there: omitting the long tail of rarely used features, so as not to have to spend 1800 hours on version 4.8, and instead focusing on the most popular features (again, building on a robust foundation) may serve the community better. This could perhaps lead to higher engagement from the community (in terms of collaboration/PRs and possibly funding). You could still use Java Lucene as a blueprint for the implementation, but with the additional insight into what turned out well and what did not turn out so well there, without being burdened, as they have been, by keeping compatibility with less used and less well designed features.
You are making some assumptions that just aren't true here.
- You are assuming that we have the high-level knowledge of each component to make such a derivative version.
- You are assuming that we would have some way to keep the feature set in line with Lucene if it were not a line-by-line port.
- You are assuming that we know which features our users find most valuable. While it is clear that a component such as Lucene.Net.Analysis.Nori (for Korean) will have very limited scope, it isn't so clear for more generalized components such as Lucene.Net.ICU that are useful in a lot more scenarios that Lucene.Net.Analysis.Common simply doesn't cover.
- You are assuming that we could get the tests to function the same way in .NET as they do in Java without a line-by-line port. Lucene has a custom test framework that uses repeatable randomized tests. This test framework is upgraded between versions of Lucene along with the tests.
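For readers unfamiliar with the idea, here is a minimal sketch of what "repeatable randomized tests" means in general. It is written in Python purely for illustration and is not Lucene's test framework; the helper names `make_inputs` and `run_randomized` are hypothetical. The point is only that every random input derives from one seed, so a failing run can be replayed exactly by reusing that seed.

```python
import random


def make_inputs(seed, count=5):
    """Derive all random test inputs from a single seed (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.randint(0, 1_000_000) for _ in range(count)]


def run_randomized(test_fn, seed=None):
    """Run test_fn with a seeded RNG and report the seed if it fails.

    If no seed is given, a fresh one is picked, so each run explores
    different inputs. Passing the reported seed back in replays the
    exact same inputs.
    """
    if seed is None:
        seed = random.SystemRandom().randrange(2**32)
    try:
        test_fn(random.Random(seed))
    except AssertionError as exc:
        # Surface the seed so the failure can be reproduced later.
        raise AssertionError(f"seed={seed}: {exc}") from exc
    return seed


def check_sorting(rng):
    """Toy property: sorting an already sorted list changes nothing."""
    values = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]
    once = sorted(values)
    assert sorted(once) == once


used_seed = run_randomized(check_sorting)           # explore new inputs
replayed_seed = run_randomized(check_sorting, seed=used_seed)  # replay them
```

A test suite built this way only stays comparable across two implementations if both generate the same inputs from the same seed, which is one reason a structurally different port would make the tests hard to carry over.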
Without keeping the binary formats the same, we would have to recreate all of the corrupt indexes for the tests. Arguably, the index format is the one thing that the Lucene team gave the most thought to about making Lucene portable across programming languages. Granted, we could use the documented format and try to reinvent the wheel for the rest, but there are a lot of components that would have to be analyzed at a high level so they could be recreated.
In addition, Lucene also has pluggable codecs so a newer version of Lucene can read the binary format from an older version so users can upgrade the software first and then upgrade the index later. Maybe you don't use this feature, but for users of apps with high availability, this feature is a must.
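As a rough illustration of that upgrade path, the toy sketch below (Python, not Lucene's actual codec API; `READERS`, `register`, `write_segment`, and `read_segment` are all hypothetical names) shows the general pattern: each stored segment records which format wrote it, and newer software keeps the older readers registered so it can still open old data while writing only the newest format.

```python
import json

# Registry of format name -> reader function (hypothetical, for illustration).
READERS = {}


def register(name):
    """Decorator that registers a reader under a format name."""
    def decorator(fn):
        READERS[name] = fn
        return fn
    return decorator


@register("v1")
def read_v1(payload):
    # Old format: comma-separated terms.
    return payload.split(",")


@register("v2")
def read_v2(payload):
    # New format: a JSON array of terms.
    return json.loads(payload)


def write_segment(terms):
    """New software always writes the newest format, tagged in a header line."""
    return "v2\n" + json.dumps(terms)


def read_segment(data):
    """Pick the reader from the header, so old segments stay readable."""
    name, _, payload = data.partition("\n")
    if name not in READERS:
        raise ValueError(f"unknown segment format: {name!r}")
    return READERS[name](payload)


old_segment = "v1\nlucene,search"          # written by the "old" software
new_segment = write_segment(["lucene", "search"])
```

Here both `read_segment(old_segment)` and `read_segment(new_segment)` yield the same terms, which is the property that lets an application upgrade the software first and rewrite the index later.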
There are over 3000 code files in Lucene and it is not documented well - it could easily take years of analysis before we even start writing anything. We wouldn't even have much of an idea which features are important and which are not without tons of analysis and research. And when we are finished, there would be no reasonable way to incorporate features of new versions of Lucene (which is what happened on the NUnit project).
As for upgrading a single feature ahead of where it is in Lucene, this is where we run into problems. We have no idea before porting it what other patches it depends upon and whether any of those depend on binary formats that have changed. So we could start off porting to get the "future" feature in 4.8.0 only to find out later that it is incompatible and all of the work porting that one feature would go out the window. It would take much longer to port Lucene feature by feature than it would be to port the diff between 2 commits to get to a higher version. And we would always be sure to have a version that works (at least as well as it worked in Java).
We could alternatively move on to 4.8.0 release while keeping the Lucene.Net.ICU and components that depend on it unstable
To be blunt, and with all respect, it might be hard to find funding for hundreds or thousands of dev hours fixing the ICU library to support rare scripts and languages until someone with a clear business case for it turns up. Just for comparison, if some company needed, say, vector valued fields (just as a random example), they might have the resources to fund the maintainers directly or devote professional developers to work with you on implementing this feature. But since, as I understand it, you want to go to 9.something directly after 4.8, maybe we'll see a lot more contributions coming in, as the field will be more open for new features.
That is true about funding. But the fact of the matter is that ICU4N has had more funding than Lucene.NET, even though it is an alpha with unstable APIs and we are still working out how to properly package it. Maybe it is easier to get people to fund Lucene.NET if ICU4N is a done deal, but if Lucene.NET moves on without ICU4N, my fear is that ICU4N will never be released.
It is a tough sell to "release" Lucene.NET 4.8.0 and then ask for funding to "finish" it (which is basically to subsidize ICU4N). And it doesn't seem right to sell people on the idea that we are collecting funding for the upgrade only to shift that funding to finish ICU4N. It is far easier to finish ICU4N first, then release it, then release Lucene.NET, then ask for Lucene.NET funding for the 1800 hours to upgrade it (which is a pretty well defined scope).
You are right in that doing it in this order means there is less help on Lucene.NET, but that isn't really where the help is needed until the upgrade anyway. We have analyzed this pretty well and this is by far the fastest path (even though it is taking years because of limited funding and help).
but unfortunately that means either splitting up the lucene-cli component or releasing it as stable with unstable dependencies
If you have policies against pre-release libraries this is probably also a no go. I think policies like this are based on the assumption that pre-release means unstable implementation, while you mean unstable API. This is probably the core of this discussion, as it is clear that the code base is very stable from a bugs point of view.
For the most part, yes. There are a few intermittently failing tests we have yet to track down. We mostly just have several APIs that are likely to break before the release.
Since lucene-cli contains the utilities to maintain the index, it doesn't seem right to make it a prerelease when the rest of the code is a release. But it is a command line app, so it isn't like anyone will depend on it directly. Lucene.Net.ICU is another matter, though. I suspect it is the big companies that will require it most and those companies are the ones that are also most likely to have policies against pre-release libraries.
You are assuming that we have the high-level knowledge of each component to make such a derivative version. You are assuming that we would have some way to keep the feature set in line with Lucene if it were not a line-by-line port.
These are excellent points. Lucene is relatively easy to use as a library so it's easy not to realize just how sophisticated it is under the hood. It's hands down the most sophisticated software I have ever worked on. The amount of brilliant propeller head thinking that has gone into this product can't be overstated. Some of the best minds in search have contributed to Lucene. It's truly an amazing piece of software. And making changes to its internals is not for the faint-hearted. :-)
There are over 3000 code files in Lucene and it is not documented well - it could easily take years of analysis before we even start writing anything.
Points like these really sold your "line by line" approach to me. The (incorrect) assumption that I made was that most/all of the contributors and maintainers are as familiar with the (Java) Lucene codebase as the core Lucene devs, or the degree of communication between the projects. Admittedly, this was an assumption I made without looking it up. If this is not true, then any other approach would fail, agreed.
Just to clarify, Lucene has a lot of documentation, and Lucene.NET has its flavor of that documentation as well. By many standards, it's decent documentation. But it's one thing to document how developers can use an expansive library like Lucene, and quite another to document why each design choice was made the way it was and how the specific implementation details of that design enable the insanely fast overall indexing and search speeds of Lucene.
There are many small aspects of the system that use such advanced software engineering approaches that a dev could easily spend more than a month if they wanted to understand that aspect of the system deeply. Lucene's use of automata is one example. Here is a video at a conference that does a high-level overview of how and why Lucene uses automata. If a dev wants to understand automata they will need to watch videos like that one and ultimately hunt down the whitepapers. Once those whitepapers have been digested, maybe the dev will have the ability to understand that portion of the code. Maybe. We are assuming a very senior dev here.
A dev is not going to find deep documentation on automata in Lucene's source code or external documentation. (shrug) There is, of course, the Lucene dev mailing list archive, an archive of completed issues, and PR notes. All three of which contain a fantastic amount of history and insights.
As for upgrading a single feature ahead of where it is in Lucene, this is where we run into problems. We have no idea before porting it what other patches it depends upon and whether any of those depend on binary formats that have changed. So we could start off porting to get the "future" feature in 4.8.0 only to find out later that it is incompatible and all of the work porting that one feature would go out the window. It would take much longer to port Lucene feature by feature than it would be to port the diff between 2 commits to get to a higher version. And we would always be sure to have a version that works (at least as well as it worked in Java).
Great explanation. When I was working on adding the Sequence Number feature I ran into the same issue, and I am still uncertain about the Lucene.NET roadmap. @NightOwl888 / @rclabo, can either of you list the issues I could work on? I am open to contributions focusing on production-grade features.
Furthermore, the binary structure of the index does change from one version to the next, making them incompatible and making it literally impossible to bring many Lucene 9.x features back to Lucene.NET 4.x. We had this issue with back-porting the analyzers-nori package.
We have 100% compatibility with creating an index in Lucene and opening it in Lucene.NET with the same version, and we plan to keep it that way going forward (it also worked once the other way around, but that hasn't been tested in quite a while). The index isn't the only binary format that is kept in sync between versions.
@NightOwl888 I am a Lucene Java programmer myself and am happy to help in any efforts to maintain two-way compatibility between Lucene and Lucene.NET.