User Behavior Insights implementation for Apache Solr

Open epugh opened this issue 1 year ago • 15 comments

Description

I am working with other folks, especially Stavros Macrakis ([email protected]), to come up with a solution for understanding what users do in response to search results. We have great visibility into an incoming query, what we do with it, and which docs are sent back. We do NOT have a way of tying that search to what the user does next, or of telling whether the following query is connected to the original one.

Many teams lean on GA, Snowplow, or custom code to track click-throughs, add-to-carts, etc. as signals, but there is nothing that is both drop-dead simple to use and open source.

Solution

User Behavior Insights is a shared schema for tracking search-related activities. There is a basic implementation for OpenSearch, and this is a version for Apache Solr.

Tasks to be done:

  • [x] Demonstrate providing a .expr file and using it to write to another Solr collection.
  • [ ] Look at the performance implications of every query generating a streaming expression.
  • [ ] Check we only record on the main node, not the replicas when sharding.
  • [ ] How can I load test this?
  • [ ] Write up Ref Guide Docs
  • [ ] Can we add it to techproducts as an example?
  • [x] Add UBI to Admin UI as flag
  • [ ] Add UBI to SolrJ basic client
  • [ ] Add UBI to SolrJ JSON Query client

Tests

Bats test to demonstrate the end-to-end use of UBI.

Checklist

Please review the following and check all that apply:

  • [ ] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [ ] I have created a Jira issue and added the issue ID to my pull request title.
  • [ ] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • [ ] I have developed this patch against the main branch.
  • [ ] I have run ./gradlew check.
  • [ ] I have added tests for my changes.
  • [ ] I have added documentation for the Reference Guide

epugh avatar May 09 '24 17:05 epugh

I'd love to review here, but I think I need some more starting information either in a ref guide page or a JIRA, I'm kind of lost right now...

HoustonPutman avatar May 10 '24 15:05 HoustonPutman

> I'd love to review here, but I think I need some more starting information either in a ref guide page or a JIRA, I'm kind of lost right now...

Yeah... I'll go ahead and write up some ref guide docs! And finish the demo .bats script ;-)

epugh avatar May 10 '24 15:05 epugh

Usually, features like these are discussed in the dev@ list, or in JIRA or a SIP. Most important question I have in mind is whether this needs to be in the core search engine? If not, can this not be a plugin/package, shipped outside of solr-core?

chatman avatar May 20 '24 21:05 chatman

This is definitely draft mode code... I opened it as a PR just to be able to track the work, and once it gets a bit further, I plan on opening a proper discussion about it. Module? Solr Sandbox? A Component? A full blown package? So many fun options...

epugh avatar May 20 '24 23:05 epugh

I figured out how to parse and run a streaming expression that is used to write the query analytics data to, well, anywhere we want ;-). The next area to look at is actually integrating the streaming expression INTO the component as more than just a one-off. I gotta figure out how to take the data and pass it into the streaming expression... input() maybe??? I also need to think about how to avoid building and destroying the streaming expression for every query.

Then more Ref Guide docs, a BATS integration test maybe, and then a discussion about who wants to use it first! Plus of course the ever critical, "where does the code live" conversation.

epugh avatar May 22 '24 20:05 epugh

Oh, and of course, we now have a machine readable schema via Json Schema available here https://github.com/o19s/ubi

epugh avatar May 22 '24 20:05 epugh

I am having some second thoughts about the idea of logging UBI queries to disk... Why? In any real use case you want them to go somewhere. Plus log4j is a pain to touch... So... I may just rip that part out. You want to log to disk? Just write a streaming expression ;-)

epugh avatar Aug 16 '24 21:08 epugh

Making progress.... I've added some tasks to do, and then I think I'm going to flip it from Draft to Ready for Review and email the community. I'd like to demo it at an upcoming community meeting.

BTW, UBI now has an actual website! https://www.ubisearch.dev/

epugh avatar Nov 16 '24 18:11 epugh

Something like this would be great to discuss at a Solr meetup/conf/whatever. Feel free to take half of a community meetup to show & tell when you're ready.

dsmiley avatar Nov 21 '24 22:11 dsmiley

Hey, we now have a website! http://ubisearch.dev for more info!

epugh avatar Nov 22 '24 12:11 epugh

A question for folks smarter than me: should the classes UBIQuery and UBIQueryStream be added to UBIComponent.java? UBIQuery is just a POJO, and UBIQueryStream wires the component up to a streaming expression. I don't see either ever being used elsewhere....

epugh avatar Nov 27 '24 13:11 epugh

Just stubbed my toe on "Distributed processing is harder than single core processing"! With a two-node setup, I discovered that I am logging to a SINGLE userfiles/ubi_queries.jsonl file, and I log once for each shard instead of just once at the collector step:

{"query_id":"c4e40af6-67b7-4824-8b63-5aae70a485f6","timestamp":"2024-11-27T13:42:19.121Z"}
{"query_id":"5dfedf02-fd89-4e40-b3aa-7700c162800b","timestamp":"2024-11-27T13:42:19.121Z"}

Sigh.
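One way to avoid the per-shard duplicates, sketched under stated assumptions: Solr marks the sub-requests it fans out to shards with the request parameter `isShard=true` (the `ShardParams.IS_SHARD` constant), so a component could skip recording unless it is handling the top-level request. The class and method names below are hypothetical, and a plain `Map` stands in for Solr's request parameters so that the sketch runs on its own. It is not the code in this PR.

```java
import java.util.Map;

/** Hypothetical helper: decides whether this request should write a UBI record. */
public class UbiLogGuard {

    // Parameter name Solr adds to per-shard sub-requests (ShardParams.IS_SHARD).
    static final String IS_SHARD = "isShard";

    /**
     * Returns true only for the top-level (collector) request,
     * i.e. when isShard is absent or not "true".
     */
    public static boolean shouldRecord(Map<String, String> params) {
        // Boolean.parseBoolean(null) is false, so a missing parameter means "record".
        return !Boolean.parseBoolean(params.get(IS_SHARD));
    }

    public static void main(String[] args) {
        // Top-level request from the client: record it.
        System.out.println(shouldRecord(Map.of("q", "laptop")));
        // Sub-request sent to one shard: skip it.
        System.out.println(shouldRecord(Map.of("q", "laptop", "isShard", "true")));
    }
}
```

Inside a real component the same check would presumably read the flag from the request's params rather than from a `Map`, and it would need to be verified against both the single-shard and multi-shard paths.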

epugh avatar Nov 27 '24 13:11 epugh

Argh, a bit stuck. I can't figure out how to have the UBIComponent, during a distributed query, look up the final doc IDs and record them before sending them back to the user. With a single node and a single shard it works great, but not in a distributed fashion.

I keep getting:

2024-11-28 12:36:31.368 ERROR (qtp428039780-40-localhost-11) [c:twoshard s:shard1 r:core_node4 x:twoshard_shard1_replica_n2 t:localhost-11] o.a.s.s.HttpSolrCall 500 Exception => java.lang.NullPointerException: Cannot read field "docList" because the return value of "org.apache.solr.handler.component.ResponseBuilder.getResults()" is null
	at org.apache.solr.handler.component.UBIComponent.doStuff(UBIComponent.java:315)
java.lang.NullPointerException: Cannot read field "docList" because the return value of "org.apache.solr.handler.component.ResponseBuilder.getResults()" is null
	at org.apache.solr.handler.component.UBIComponent.doStuff(UBIComponent.java:315) ~[?:?]
	at org.apache.solr.handler.component.UBIComponent.distributedProcess(UBIComponent.java:252) ~[?:?]
	at org.apache.solr.handler.component.SearchHandler.processComponents(SearchHandler.java:552) ~[?:?]
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:429) ~[?:?]
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:238) ~[?:?]
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2875) ~[?:?]
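The trace shows `getResults()` returning null inside `distributedProcess`, which suggests the per-core result object is simply not populated on the coordinating node. A defensive sketch of the idea follows, assuming the merged documents are exposed separately once shard responses have been combined (in Solr this may be `ResponseBuilder.getResponseDocs()`, but that is an assumption to verify, not something confirmed in this thread). All types below are hypothetical stand-ins, not Solr classes.

```java
import java.util.List;

/** Hypothetical sketch: pick the doc IDs to record without dereferencing a null result. */
public class UbiDocIdLookup {

    /** Stand-in for the response state a search component sees; NOT a Solr class. */
    public static class ResponseState {
        final List<String> localDocIds;   // set on a single core; null on the coordinator
        final List<String> mergedDocIds;  // set after shard responses are merged; null before

        public ResponseState(List<String> localDocIds, List<String> mergedDocIds) {
            this.localDocIds = localDocIds;
            this.mergedDocIds = mergedDocIds;
        }
    }

    /** Returns the IDs to record, or an empty list if nothing is available yet. */
    public static List<String> docIdsToRecord(ResponseState state) {
        if (state.localDocIds != null) {
            return state.localDocIds;     // non-distributed request
        }
        if (state.mergedDocIds != null) {
            return state.mergedDocIds;    // coordinator, after the merge step
        }
        return List.of();                 // too early in the request: nothing to record
    }

    public static void main(String[] args) {
        System.out.println(docIdsToRecord(new ResponseState(List.of("1", "2"), null)));
        System.out.println(docIdsToRecord(new ResponseState(null, List.of("7", "9"))));
        System.out.println(docIdsToRecord(new ResponseState(null, null)));
    }
}
```

The empty-list branch matters because a distributed request passes through several stages, and the component may be called before any documents have been merged.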

epugh avatar Nov 28 '24 12:11 epugh

> I think I'm going to flip it from Draft to Ready for Review and email the community. I'd like to demo it at an upcoming community meeting.

@epugh did you ever demo this? I don't remember it, and think it'd be an awesome thing to highlight (and a great way to garner interest and get feedback)

gerlowskija avatar May 09 '25 14:05 gerlowskija