
Realtime GET + repeated queries

Open · acrossen opened this issue · 2 comments

I have a use case that I hope spark-solr can address, but I'm running into a couple of issues. Background: I have a bunch of records in a DataFrame from an external system, each identified by a unique ID, and at least some of them have corresponding records in Solr. I want to look each one up in Solr using Realtime GET and fetch various field values into Spark for further processing.

I have a big SolrCloud cluster: hundreds of shards across dozens of machines. What I'm seeing is that spark-solr's default behavior, upon issuing a query, is to first reach out to each shard with a luke request to inspect the schema (or something like that, according to the logs). Given the size of my cluster this takes a long time. spark-solr also doesn't seem to be caching this schema inspection (hopefully because I'm misusing it), so each new query I issue takes just as long.

2 questions:

  1. Does spark-solr support Realtime GET "natively"? Or can I set the request_handler to "/get" and specify "solr.params" -> "ids:12345" in the options? (A sketch of what I'm attempting is below.)
  2. More importantly, is there a way to prevent this time-consuming schema inspection when issuing thousands of subsequent queries? I'm a newbie with the package, so I'm assuming I'm doing something wrong here.
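
To make question 1 concrete, here is roughly the read I have in mind. The zkhost and collection values are placeholders, and whether "/get" can be passed as the request_handler like this is exactly what I'm unsure about:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("solr-rtg-lookup").getOrCreate()

// Roughly what I'm attempting; whether /get is honored as a request handler here
// is the open question. zkhost and collection are placeholders.
val opts = Map(
  "zkhost"          -> "zk1:2181,zk2:2181,zk3:2181/solr",
  "collection"      -> "my_collection",
  "request_handler" -> "/get",
  "solr.params"     -> "ids=12345" // realtime GET takes its ids as the ids=... parameter
)

val df = spark.read.format("solr").options(opts).load()
df.show()
```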

Thanks for this great bridge between Spark & Solr! -a

acrossen, Jul 16 '18

Does spark-solr support Realtime GET "natively"? Or can I set the request_handler to "/get" and specify "solr.params" -> "ids:12345" in the options?

Currently, we don't support Realtime GET natively. If you want to try, it isn't much code to add support; you would just have to extend SolrRDD.
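
If extending SolrRDD is more than you want to take on, one workaround outside spark-solr is to call SolrJ's realtime GET (getById) directly inside mapPartitions. This is an untested sketch, assuming SolrJ 7+; the ZK hosts, collection, DataFrame name (idDF), and field names are placeholders:

```scala
import java.util.Optional
import scala.collection.JavaConverters._
import org.apache.solr.client.solrj.impl.CloudSolrClient

// Placeholders: ZK ensemble and collection name.
val zkHosts    = List("zk1:2181", "zk2:2181", "zk3:2181")
val collection = "my_collection"

// idDF is the external DataFrame with the unique ids in a column named "id".
val fetched = idDF.select("id").rdd.mapPartitions { rows =>
  // One SolrJ client per partition; it routes /get requests to the right shards itself.
  val client = new CloudSolrClient.Builder(zkHosts.asJava, Optional.empty[String]()).build()
  try {
    rows.grouped(500).flatMap { batch =>
      val ids = batch.map(_.getString(0)).asJava
      // Realtime GET via SolrJ, no spark-solr schema inference involved.
      val docs = client.getById(collection, ids)
      docs.asScala.map { d =>
        (d.getFieldValue("id").toString,
         Option(d.getFieldValue("some_field")).map(_.toString).orNull)
      }
    }.toList.iterator // materialize before the client is closed below
  } finally {
    client.close()
  }
}
```

From there the RDD of (id, field) tuples can be turned back into a DataFrame with toDF and joined to your original records.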

More importantly, is there a way to prevent this time-consuming schema inspection when issuing thousands of subsequent queries? I'm a newbie with the package, so I'm assuming I'm doing something wrong here.

Unfortunately, Solr does not have a distributed luke endpoint that gives all the populated fields for a collection. Hence, we have to resort to querying each shard, and this can be very time consuming if there are a lot of shards. Have you tried using SolrRDD to short-circuit the schema checking and all that?
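
In the meantime, one thing that may be worth trying is to hand the DataFrame reader an explicit schema plus a fields list, so there is little or nothing left to infer. Whether this fully bypasses the per-shard luke calls depends on the spark-solr version, so treat it as an untested sketch; the field names are placeholders:

```scala
import org.apache.spark.sql.types._

// Explicit schema for just the fields we need; names and types are placeholders.
val explicitSchema = StructType(Seq(
  StructField("id",         StringType, nullable = false),
  StructField("some_field", StringType, nullable = true),
  StructField("score_d",    DoubleType, nullable = true)
))

val df = spark.read.format("solr")
  .schema(explicitSchema)                              // supply the schema instead of inferring it
  .option("zkhost", "zk1:2181,zk2:2181,zk3:2181/solr") // placeholder
  .option("collection", "my_collection")               // placeholder
  .option("fields", "id,some_field,score_d")           // restrict results to the fields listed above
  .load()
```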

kiranchitturi, Jul 30 '18

We did add random sampling of shards to get the schema from luke. This should speed things up.

kiranchitturi, Mar 15 '20