SOLR-17023: Use Modern NLP Models via ONNX and Apache OpenNLP with Solr

Open epugh opened this issue 2 years ago • 28 comments

https://issues.apache.org/jira/browse/SOLR-17023

Description

Code and a BATS test that demonstrate downloading a sentiment classification model from Hugging Face, converting it to an ONNX model, uploading the model to Solr's FileStore, and then configuring a collection and associated pipeline to use the model for enriching content.
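For readers unfamiliar with the upload step, here is a minimal sketch of building the HTTP request that would push a model file into the file store. It is an illustration only: the `/api/cluster/files` prefix is assumed from the Solr 9.x package store documentation and may differ in other versions, and the base URL, store path, and class name are hypothetical. The request is built but never sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class FileStoreUpload {
    // Assumed endpoint prefix (Solr 9.x package store docs); verify for your version.
    static final String FILES_ENDPOINT = "/api/cluster/files";

    /** Builds (but does not send) a PUT request that uploads the given bytes. */
    static HttpRequest buildUpload(String solrBaseUrl, String storePath, byte[] modelBytes) {
        URI target = URI.create(solrBaseUrl + FILES_ENDPOINT + storePath);
        return HttpRequest.newBuilder(target)
                .header("Content-Type", "application/octet-stream")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(modelBytes))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical base URL and store path; the byte array stands in for the ONNX file.
        HttpRequest request =
                buildUpload("http://localhost:8983", "/models/sentiment/model.onnx", new byte[] {1, 2, 3});
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending it would be a matter of passing the request to a `java.net.http.HttpClient`; a real ONNX model is large, so a file-backed body publisher would likely be preferable to a byte array.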

Solution

New DocumentCategorizerUpdateProcessorFactory that interacts with the model.

Tests

BATS test. Check this PR out and run: ./gradlew integrationTests --tests test_opennlp.bats

Checklist

Please review the following and check all that apply:

  • [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [x] I have created a Jira issue and added the issue ID to my pull request title.
  • [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • [x] I have developed this patch against the main branch.
  • [ ] I have run ./gradlew check.
  • [ ] I have added tests for my changes.
  • [ ] I have added documentation for the Reference Guide
  • [ ] Look at GPU support
  • [ ] Integrate with Films demo.
  • [ ] Can we have BATS tests that depend on external resources, or do I check in the ONNX model instead? (make a stupid one???)
  • [ ] Write a unit test?
  • [ ] Provide both a label but ALSO a score!

epugh avatar Oct 10 '23 20:10 epugh

Check it out @jzonthemtn

epugh avatar Oct 10 '23 20:10 epugh

Thanks @risdenk, your suggestions worked. One slight wrinkle that I don't know how to deal with is that the com.microsoft.onnxruntime:onnxruntime_gpu:1.14.0 file doesn't work on OSX, only on Linux and Windows. So if you look at the BATS test, you'll see some code where I replace it with com.microsoft.onnxruntime:onnxruntime:1.14.0. I don't really know how to handle that going forward. I mean, swapping a magic jar is probably a bad idea ;-).
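To make the choice explicit, here is a small sketch of picking the artifact by operating system instead of swapping jars by hand. It only encodes what is stated above (the GPU build works on Linux and Windows, not OSX); the class and method names are hypothetical, and how such a choice would be wired into the Gradle build is not shown.

```java
import java.util.Locale;

final class OnnxRuntimeArtifact {
    /** Returns the Maven coordinate to use for the given value of the os.name system property. */
    static String coordinateFor(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Per the comment above, the GPU build is only usable on Linux and Windows.
        boolean gpuBuildUsable = os.contains("linux") || os.contains("windows");
        String artifact = gpuBuildUsable ? "onnxruntime_gpu" : "onnxruntime";
        return "com.microsoft.onnxruntime:" + artifact + ":1.14.0";
    }

    public static void main(String[] args) {
        System.out.println(coordinateFor(System.getProperty("os.name")));
    }
}
```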

epugh avatar Oct 10 '23 21:10 epugh

./gradlew integrationTests --tests test_opennlp.bats

epugh avatar Oct 10 '23 21:10 epugh

> Task :solr:packaging:integrationTests
Running BATS tests with Solr base port 53065
1..1
ok 1 Check lifecycle of sentiment classification in 20000ms

This is after uploading the model to the PackageStore and referencing it there!

epugh avatar Oct 10 '23 22:10 epugh

@risdenk @jzonthemtn so, any ideas how to force the onnxruntime instead of the onnxruntime_gpu? I regenerated the licenses, and noticed that it created an "onnxruntime_gpu-1.14.0.jar.sha1".

epugh avatar Oct 11 '23 15:10 epugh

Also, I am seeing two errors when looking up licenses...

Execution failed for task ':solr:modules:analysis-extras:validateJarLicenses'.
> Certain license/ notice files are missing:
    - License file missing ('com.microsoft.onnxruntime:onnxruntime_gpu:1.14.0'), expected it at: /Users/epugh/Documents/projects/solr-epugh/solr/licenses/onnxruntime_gpu-LICENSE-[type].txt, where [type] can be any of [ASL, BSD, BSD_LIKE, CDDL, CPL, EPL, MIT, MPL, PD, SUN, COMPOUND].
    - License file missing ('org.apache.opennlp:opennlp-dl:2.2.0'), expected it at: /Users/epugh/Documents/projects/solr-epugh/solr/licenses/opennlp-dl-LICENSE-[type].txt or /Users/epugh/Documents/projects/solr-epugh/solr/licenses/opennlp-LICENSE-[type].txt, where [type] can be any of [ASL, BSD, BSD_LIKE, CDDL, CPL, EPL, MIT, MPL, PD, SUN, COMPOUND].


epugh avatar Oct 11 '23 15:10 epugh

@jzonthemtn did open https://issues.apache.org/jira/browse/OPENNLP-1515, but that will require a release of OpenNLP before we can piggyback on it.

epugh avatar Oct 11 '23 16:10 epugh

I think the first draft is done, and it "works". There are some caveats and things to be figured out, like the onnxruntime_gpu versus onnxruntime issue. Based on my work so far, we definitely need tutorials; there are too many moving pieces. The PackageStore is great, at least as far as I've worked with it, but it does expect to load the data in memory as it moves it around, so we need to figure that out.

Right now we only get sentiment labels back, and we really need sentiment scores, i.e. a number. I don't know if that can be a value from 0 to 1, or if it has to be 1,2,3,4,5 labels. I'd like to experiment with the same thing as a copyField into a text analyzer, so you copy "review" into "review_sentiment_score" and get a number there. Or, do we have a streaming expression you run that updates the review_sentiment_score for all docs that don't have it with an atomic update?
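One way to think about the label-versus-score question: if the model yields one probability per rating category, both a label and a 0 to 1 score can be derived from the same array. The sketch below is an assumption-laden illustration, not what this PR does. It assumes five ordered categories (1 to 5) and uses the probability-weighted mean rating, rescaled to [0, 1], as the score. All names are hypothetical.

```java
final class SentimentScore {
    /** Index of the highest-probability category, i.e. the "label". */
    static int bestIndex(double[] probs) {
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Probability-weighted mean of ratings 1..n, rescaled so 1 maps to 0.0 and n maps to 1.0. */
    static double normalizedScore(double[] probs) {
        if (probs.length < 2) {
            throw new IllegalArgumentException("need at least two ordered categories");
        }
        double weighted = 0.0;
        double total = 0.0;
        for (int i = 0; i < probs.length; i++) {
            weighted += (i + 1) * probs[i];
            total += probs[i];
        }
        double expectedRating = weighted / total;
        return (expectedRating - 1.0) / (probs.length - 1);
    }

    public static void main(String[] args) {
        // Made-up probabilities for ratings 1..5, for illustration only.
        double[] probs = {0.05, 0.05, 0.10, 0.30, 0.50};
        System.out.println("label index: " + bestIndex(probs));
        System.out.println("score: " + normalizedScore(probs));
    }
}
```

This only makes sense when the categories are ordered; for unordered labels, the probability of the best category is the more natural number to store.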

epugh avatar Oct 11 '23 16:10 epugh

Looks like a great first step! Glad that OpenNLP 2.3.1 helped move it along.

jzonthemtn avatar Nov 29 '23 18:11 jzonthemtn

> Looks like a great first step! Glad that OpenNLP 2.3.1 helped move it along.

I did a community demo yesterday, and it went well. Having 2.3.1 meant I could remove some ugly moving of Jars! Which made the demo more compelling.

epugh avatar Nov 30 '23 14:11 epugh

@cpoerschke when I demoed this code at the last community meetup, @gerlowskija asked why not commit it, and I didn't have a super great reason. I'd love your thoughts on this PR since you have played some with ONNX as well. Is there anything here you think needs changing before it gets merged? I'd love to get the ONNX stuff in and unblock your work...

epugh avatar Jan 09 '24 16:01 epugh

Looking at these build failures and the error message generated, it appears that it may be caused by us using Java 11 and OpenNLP being compiled with Java 17??

 /home/runner/work/solr/solr/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/DocumentCategorizerUpdateProcessorFactory.java:39: error: cannot access InferenceOptions
import opennlp.dl.InferenceOptions;
                 ^
  bad class file: /home/runner/.gradle/caches/modules-2/files-2.1/org.apache.opennlp/opennlp-dl/2.3.1/8ff28619e6a377fe467b47274f39fd1fc9b2c303/opennlp-dl-2.3.1.jar(/opennlp/dl/InferenceOptions.class)
    class file has wrong version 61.0, should be 55.0
    Please remove or make sure it appears in the correct subdirectory of the classpath.
/home/runner/work/solr/solr/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/DocumentCategorizerUpdateProcessorFactory.java:40: error: cannot access DocumentCategorizerDL

Is this a deal breaker for this PR?
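For reference, the two numbers in "wrong version 61.0, should be 55.0" are class file major versions, which map to Java releases by subtracting 44 (so 61 is Java 17 and 55 is Java 11). A minimal sketch that reads the version from a class file header follows; the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;

final class ClassFileVersion {
    /** Major version 45 corresponds to JDK 1.1; each later Java release adds one. */
    static int javaRelease(int majorVersion) {
        return majorVersion - 44;
    }

    /** Reads the major version from the first 8 bytes of a class file. */
    static int readMajorVersion(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort(); // skip the minor version
        return buf.getShort() & 0xFFFF;
    }

    public static void main(String[] args) {
        // Header bytes of a class compiled for Java 17: magic, minor 0, major 61.
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 61};
        int major = readMajorVersion(header);
        System.out.println("major " + major + " -> Java " + javaRelease(major)); // prints "major 61 -> Java 17"
    }
}
```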

epugh avatar Jan 09 '24 23:01 epugh

> Looking at these build failures and the error message generated, it appears that it may be caused by us using Java 11 and OpenNLP being compiled with Java 17?? ... Is this a deal breaker for this PR?

Hmm, interesting. So we have:

  • OpenNLP as minimum Java17 as you mention -- https://github.com/apache/opennlp/blob/opennlp-2.3.1/pom.xml#L167
  • Lucene as minimum Java11 -- https://github.com/apache/lucene/blob/releases/lucene/9.9.1/build.gradle#L75-L76
  • Solr as minimum Java11 -- https://github.com/apache/solr/blob/releases/solr/9.4.1/build.gradle#L88

So if the classes here were built as an independent plugin (with minimum Java17) and then deployed (with the relevant dependencies) into a Solr setup running Java17 with the original Solr artefacts (built with Java11) -- I wonder if that would work?

Also noting that https://github.com/apache/lucene/pull/579 bumped Lucene to Java17 on main branch i.e. presumably then a future Lucene10 will be minimum Java17 version.
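If the independent-plugin route were tried, the plugin could fail fast with a readable message when loaded on an older JVM, rather than surfacing a class version error. The following is a rough sketch only: the class name and the idea of calling it from plugin initialization are assumptions, not part of this PR.

```java
final class JavaVersionGuard {
    // Minimum Java feature release needed by OpenNLP 2.3.1, per the discussion above.
    static final int REQUIRED_FEATURE_VERSION = 17;

    static boolean isSupported(int runtimeFeatureVersion) {
        return runtimeFeatureVersion >= REQUIRED_FEATURE_VERSION;
    }

    /** Throws with a clear message if the running JVM is too old. */
    static void requireSupportedRuntime() {
        int feature = Runtime.version().feature();
        if (!isSupported(feature)) {
            throw new IllegalStateException(
                    "This plugin needs Java " + REQUIRED_FEATURE_VERSION + " or newer, but is running on Java " + feature);
        }
    }

    public static void main(String[] args) {
        int feature = Runtime.version().feature();
        System.out.println("Java " + feature + " supported: " + isSupported(feature));
    }
}
```

Note that the guard class itself would have to be compiled for the older target (Java 11 here) to be loadable there at all.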

cpoerschke avatar Jan 18 '24 16:01 cpoerschke

> So if the classes here were built as an independent plugin (with minimum Java17) and then deployed (with the relevant dependencies) into a Solr setup running Java17 with the original Solr artefacts (built with Java11) -- I wonder if that would work?

I guess this should work.

rzo1 avatar Jan 18 '24 16:01 rzo1

Feels like what we should be doing is having Solr 10 target Java 17 since Lucene 10 will require it, and then this code goes on Solr 10, but not on Solr 9. This lets us have some more time to experiment without dealing with the headaches of supporting an official release in the 9.x line (backcompat and the rest)....??

epugh avatar Jan 18 '24 16:01 epugh

> Feels like what we should be doing is having Solr 10 target Java 17 since Lucene 10 will require it, and then this code goes on Solr 10, but not on Solr 9. This lets us have some more time to experiment without dealing with the headaches of supporting an official release in the 9.x line (backcompat and the rest)....??

I concur. Also a nice motivation for targeting Java 17 i.e. specific example of functionality that it would unlock. And in the meantime "independent plugin" approaches remain a possibility in the community, perhaps even in the https://github.com/apache/solr-sandbox if someone wanted to pursue that (haven't checked how that is built, just kinda "name dropping" solr-sandbox here).

cpoerschke avatar Jan 18 '24 17:01 cpoerschke

Lucene 9 requires an older version of Java than the minimum version that OpenNLP requires. That means that this PR is pending a release of Lucene 10, and the adoption of Lucene 10 by Solr. It would be interesting to think about if there was a way for Solr main branch to somehow depend on the Lucene main branch release that jumps the minimum Java versions all around, and would allow this PR to be merged.

epugh avatar Feb 13 '24 13:02 epugh

> ... It would be interesting to think about if there was a way for Solr main branch to somehow depend on the Lucene main branch release that jumps the minimum Java versions all around, and would allow this PR to be merged.

Technically, I guess Solr main could continue to depend on whatever Lucene version, and just bumping the minimum Java version for Solr main to 17 would be sufficient? That would come with all the ups and downs of main and branch_9x having different minimums.

cpoerschke avatar Feb 13 '24 16:02 cpoerschke

@cpoerschke https://github.com/apache/solr/pull/1510 might be helpful here. I have a few WIP PRs for newer JDKs.

risdenk avatar Feb 13 '24 16:02 risdenk

This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the [email protected] mailing list. Thank you for your contribution!

github-actions[bot] avatar Apr 17 '24 00:04 github-actions[bot]

This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again.

github-actions[bot] avatar Oct 07 '24 00:10 github-actions[bot]

this is still valid ;-)

epugh avatar Oct 07 '24 10:10 epugh

I updated this PR with main, to see what happened, and we're closer. Things that I think are still holding us back:

  1. Solr main is NOT on Lucene 10, which means we have to override the version of OpenNLP that Lucene uses (maybe somehow?). However, when #3053 gets in, that should deal with it.
  2. Need to write a JUnit test for DocumentCategorizerUpdateProcessorFactory (oops!)

epugh avatar Feb 23 '25 13:02 epugh

Yep, need to wait for Lucene 10, otherwise we get some unit test failures:

gradlew test --tests TestOpenNLPExtractNamedEntitiesUpdateProcessorFactory.testExtractFieldRegexReplaceAll -Dtests.seed=F5DD0B40AC590A66 -Dtests.locale=pt-GW -Dtests.timezone=PRT -Dtests.asserts=true -Dtests.file.encoding=UTF-8
>     java.lang.NoSuchMethodError: 'opennlp.tools.util.Span[] opennlp.tools.sentdetect.SentenceDetectorME.sentPosDetect(java.lang.String)'
   >         at __randomizedtesting.SeedInfo.seed([F5DD0B40AC590A66:4351D7F53AC9F7A4]:0)
   >         at org.apache.lucene.analysis.opennlp.tools.NLPSentenceDetectorOp.splitSentences(NLPSentenceDetectorOp.java:41)
   >         at org.apache.lucene.analysis.opennlp.OpenNLPSentenceBreakIterator.setText(OpenNLPSentenceBreakIterator.java:199)
   >         at org.apache.lucene.analysis.util.SegmentingTokenizerBase.reset(SegmentingTokenizerBase.java:89)
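A `NoSuchMethodError` like this typically means bytecode compiled against one method signature is running against a library version where that exact signature no longer exists. The sketch below shows the mechanism with a made-up class: a method that takes `CharSequence` still accepts a `String` at the source level, but a lookup by the exact old parameter type fails. Whether a widened parameter type is what actually changed in OpenNLP is an assumption here; the thread does not say.

```java
final class SignatureLookup {
    /** Hypothetical stand-in for a library class whose parameter type changed from String to CharSequence. */
    static final class NewerDetector {
        public String[] detect(CharSequence text) {
            return new String[] {text.toString()};
        }
    }

    /** True only if a public method with exactly these parameter types exists. */
    static boolean hasExactMethod(Class<?> owner, String name, Class<?>... parameterTypes) {
        try {
            owner.getMethod(name, parameterTypes);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Callers compiled against the old detect(String) would link against this exact signature.
        System.out.println("detect(String): " + hasExactMethod(NewerDetector.class, "detect", String.class));
        System.out.println("detect(CharSequence): " + hasExactMethod(NewerDetector.class, "detect", CharSequence.class));
    }
}
```

Recompiling the caller against the newer library resolves this kind of mismatch, which is consistent with needing the Lucene release that was built against the newer OpenNLP.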

epugh avatar Feb 24 '25 20:02 epugh

@epugh Just curious on the status of this pull request.

jzonthemtn avatar May 27 '25 18:05 jzonthemtn

> @epugh Just curious on the status of this pull request.

Since we are giving a talk on it at C/C NA in a few months, we need to get it in!

More seriously, I think what stalled it was the Lucene 10 need... and the DocumentCategorizerUpdateProcessorFactory test...

Are you interested in giving it a bit of a run-through with main to see if it's mergeable? Then I'd be happy to start on the docs side of things.

epugh avatar May 27 '25 20:05 epugh

Looks like Lucene is on OpenNLP 2.5.4, and we are on an older version...

epugh avatar May 27 '25 20:05 epugh