
Null values for multivalued fields are converted into list objects using the export handler

Open · theoathinas opened this issue 4 years ago · 2 comments

We're seeing that when using the /export handler with the spark-solr connector and flatten_multivalued set to false, documents that lack a value for a multi-valued field defined in the schema (i.e., the value is null) come back as dataframe rows whose value is a list containing a single null/None element, instead of null.

For example, given a collection with a schema snippet like this:

<uniqueKey>video_id</uniqueKey>
<field name="id"   type="string" />
<field name="tags" type="string" indexed="false" multiValued="true" />

With two documents in that collection, one with tags and one without, a pyspark call like this:

solrdata = spark.read.format('solr').options(
    zkhost=zkhosts,
    collection=collection,
    splits=True,
    query='*:*',
    fields='id,tags',
    flatten_multivalued=False,
    request_handler='/export'
).load()
solrdata.show()

You would see something like this:

+-----------+--------------------+
|         id|                tags|
+-----------+--------------------+
|id_dfdfdfdf|[I, got, tags, tags]|
|id_fgfgfgfg|                  []|
+-----------+--------------------+

In Python at least, that [] is actually [None], which appears to be a bug in the spark-solr connector code.
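A minimal plain-Python sketch (not connector code) of why that row value is misleading: a list holding a single None is neither null nor an empty list, so downstream null checks would miss these rows:

```python
# Hypothetical value mirroring the dataframe row shown above for the
# document without tags -- what the connector currently returns:
missing_tags = [None]

# What a truly null multivalued field would look like:
expected = None

# The two are easy to confuse in show() output, but behave differently:
assert missing_tags is not None   # a real list object, not null
assert missing_tags != []         # not an empty list either
assert len(missing_tags) == 1     # it contains exactly one None element
assert missing_tags[0] is None
```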

The problem seems to start in this section: https://github.com/lucidworks/spark-solr/blob/2afaacc3fd5198b761684312f803f5a2047e83a9/src/main/scala/com/lucidworks/spark/util/SolrRelationUtil.scala#L639-L646

In particular, this line: val obj = map.get(field.name).asInstanceOf[Object]. Here map.get(field.name) returns null (because the export results contain no value for that field), and the asInstanceOf[Object] cast seems to carry that null through as an object representing None, which then gets added to the values list.
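The failure mode can be sketched in plain Python (an analogy only; the real code is the Scala linked above, and these function names are hypothetical): dict.get returns None for an absent key, and wrapping that value unconditionally produces the spurious [None], whereas checking for the key first preserves the null:

```python
def convert_multivalued(doc, field):
    """Sketch of the current behavior: get() on a missing key yields
    None, which then gets wrapped into the multivalued list as [None]."""
    obj = doc.get(field)  # None when the field is absent
    return list(obj) if isinstance(obj, list) else [obj]

def convert_multivalued_fixed(doc, field):
    """Sketch of the proposed fix: check for the key before reading it,
    so an absent field stays null instead of becoming [None]."""
    if field not in doc:
        return None
    obj = doc[field]
    return list(obj) if isinstance(obj, list) else [obj]

doc_without_tags = {"id": "id_fgfgfgfg"}
assert convert_multivalued(doc_without_tags, "tags") == [None]      # the bug
assert convert_multivalued_fixed(doc_without_tags, "tags") is None  # expected
```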

I think there should be some kind of key check before calling map.get(). I have a PR with a possible solution that I will submit as well.

theoathinas avatar May 01 '20 20:05 theoathinas

here is the PR: https://github.com/lucidworks/spark-solr/pull/295

theoathinas avatar May 01 '20 20:05 theoathinas

Thanks @theoathinas for the contribution. Can you also add a unit test? This class is the right place for it: https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/RelationTestSuite.scala

kiranchitturi avatar May 01 '20 20:05 kiranchitturi