solr
SOLR-14673: Add bin/solr stream CLI
https://issues.apache.org/jira/browse/SOLR-14673
Description
Bring in code that @joel-bernstein wrote, but using the SolrCLI infrastructure. The original code is a patch in the associated JIRA.
Solution
Another CLI client ;-)
Tests
Copied over the basic tests from the patch. I still need to write an integration-style test and ideally one that exercises basic auth.
Checklist
Please review the following and check all that apply:
- [ ] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
- [ ] I have created a Jira issue and added the issue ID to my pull request title.
- [ ] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
- [ ] I have developed this patch against the `main` branch.
- [ ] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
- [ ] I have added documentation for the Reference Guide
A few high-level questions/concerns:
- `bin/solr` already has an "api" tool, which can be used to invoke streaming expressions, e.g. `bin/solr api -get "$SOLR_URL/techproducts/stream?expr=search(techproducts)"`. I'm all for syntactic sugar, but I wonder whether this is worth the maintenance cost if the main thing it "buys" us is saving people from having to provide the full API path, as the "api" tool requires.
- If I'm reading the PR correctly, one other capability of the proposed `bin/solr stream` tool is that it can evaluate streams "locally" in some cases, i.e. without a full running Solr. Which is pretty cool - you could imagine a real super-user doing some pretty involved ETL that builds off of an expression like `update(techproducts, unique(cat(...)))`. But I'd worry about some of the documentation challenges surrounding this. For instance, how would a user know which expressions can be run locally, and which require a Solr to execute on? For expressions that mix locally and remotely-executed clauses, is there any way for a user to know which clauses are executed where?
To clarify - I think the upside here is pretty cool, I'm just worried that upside is hard to realize without some extensive work on the documentation end to make it usable by folks in practice.
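To make the "full API path" point concrete, here's a small illustration (purely a sketch: the URL layout follows the `bin/solr api` example above, and the host/port and collection are placeholders) of what the "api" route makes you spell out by hand:

```shell
# The generic "api" tool requires the complete stream endpoint URL for a
# collection. Host, port, and collection below are placeholder values.
SOLR_URL="http://localhost:8983/solr"
COLLECTION="techproducts"
EXPR="search(techproducts)"

# This is the URL a user has to assemble themselves today:
echo "$SOLR_URL/$COLLECTION/stream?expr=$EXPR"
```

A dedicated `bin/solr stream` tool would absorb the endpoint-construction step, which is the syntactic-sugar trade-off being weighed here.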
Thanks for sharing the feedback @gerlowskija! I think the value of the tool is only there if your second comment, about being able to run a streaming expression locally, is valid; having it also do what your first comment highlights then falls out easily. Otherwise it really is a thin wrapper/duplication of the `bin/solr api` call, especially without any special value-add in formatting tuples, error handling, etc.
I do believe the second part is the really cool thing, that I can run a streaming expression locally and use it to process some data.
We clearly need some way of specifying where the processing happens: in the cluster or locally. I was trying to think whether we have any other places in Solr where we define "where am I doing work" that might suggest a name for the parameter. `bin/solr stream --environment cluster BLAH`? The `search()` expression has a `qt` parameter... `bin/solr stream -qt=/stream BLAH`?
Reading through the docs more, we have `parallel()` and it refers to workers. Maybe the command should be something like `bin/solr stream --workers=local BLAH`, which would run on your laptop, and if you don't specify `--workers` then it runs on the cluster via `/stream`?
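Whatever the flag ends up being called, the dispatch it implies is simple. A toy sketch (not the real CLI; the flag values and messages here are hypothetical, following the naming discussion above):

```shell
# Hypothetical sketch of how an execution-environment flag could dispatch
# between local evaluation and the cluster's /stream endpoint.
run_stream() {
  where="$1"; expr="$2"
  case "$where" in
    local)   echo "evaluate locally: $expr" ;;
    cluster) echo "POST to /stream: $expr" ;;
    *)       echo "unknown environment: $where" >&2; return 1 ;;
  esac
}

run_stream local 'unique(cat(...))'
```

The interesting part is not the dispatch itself but which expressions are legal on each side, which is exactly the documentation question raised earlier.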
I have found that lots of streaming expressions don't require a Solr connection, especially during development. I'm just iterating on the logic, starting and ending with tuples; it's only later, when I get the mappings etc. working, that I move to adding in my `search()` or `update()` clauses.
Also, as far as docs go, we have a LONG way to go in streaming expressions. It's both the best-documented code, with all the howtos and guides, but I also find a million expressions that exist yet don't show up in our reference docs ;-).
Okay, I think this is ready for review! I've added some docs.. I especially liked being able to cat some local data right into a Solr collection!
```shell
cat example/exampledocs/books.csv | bin/solr stream -e local 'update(gettingstarted,parseCSV(stdin()))'
```
In my local playing, it's been nice to be able to write a complex streaming expression in a file and just run it from the command line....
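As a rough, Solr-free analogy of what `parseCSV(stdin())` does conceptually (purely illustrative; this is plain awk, not Solr code), here's turning CSV lines on stdin into JSON-ish tuples for a downstream stage:

```shell
# Solr-free analogy: read CSV on stdin, emit one JSON-ish "tuple" per row,
# roughly what parseCSV(stdin()) hands to the enclosing expression.
printf 'id,cat\n1,book\n2,book\n' | awk -F, '
  NR == 1 { split($0, hdr, ","); next }   # header row -> field names
  {
    printf "{"
    for (i = 1; i <= NF; i++)
      printf "%s\"%s\":\"%s\"", (i > 1 ? "," : ""), hdr[i], $i
    print "}"
  }'
```

In the real pipeline, of course, the tuples feed `update(...)` and land in a collection rather than printing to the terminal.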
@gerlowskija since you provided some early review, do you think the docs I've added etc are enough that I can merge this in?
A few more notes:
- [x] `bin/solr stream --help` does not include any reference to the stream expression argument required at position 0 (or any other position that is valid).
- [ ] When providing an invalid stream expression, the error message is not helpful. For example, `bin\solr.cmd stream "search(q=*:*)" -e local -s http://localhost:8983 -c techproducts` (missing collection in the expression) returns "Unable to construct instance of org.apache.solr.client.solrj.io.stream.SearchFacadeStream". I would like a message that at least says something like "double check your stream expression" if that is one of the likely causes.
- [x] For my understanding, is it a technical limitation that `-e local` does not require a collection (`-c`), but `-e solr` requires one?
- [x] I also noticed that providing `-s` requires a URI scheme, but `-z` does not allow one. Is this inconsistency present across all commands?
@gerlowskija @malliaridis What do you think about using the word "environment" versus "context" for the option you pass for execution? I.e., your environment is either local or solr... "Context" to me means more like "a thing that you set up that has lots of variables"... "environment" is where you run it?
Leaning towards "execution" and either "local" or "remote".
@malliaridis in reference to your question: yeah, `-z` never has a scheme, but `-s` always does. Why exactly, I'm not totally sure... I guess ZooKeeper communicates over the same port regardless of SSL or not?