spark-solr
Null values for multivalued fields are converted into list objects using the export handler
We're seeing that when using the `/export` handler with the spark-solr connector and `flatten_multivalued` set to false, a document that has no value for a multi-valued schema field (i.e. the value is null) comes back as a dataframe row whose value for that field is a list containing a single null/None object, instead of null.
For example, given a collection with a schema snippet like this:
<uniqueKey>video_id</uniqueKey>
<field name="id" type="string" />
<field name="tags" type="string" indexed="false" multiValued="true" />
Take two documents in that collection, one with tags and one without any. Using PySpark, if you make a call like this:
solrdata = spark.read.format('solr').options(zkhost=zkhosts, collection=collection, splits=True, query='*:*',fields='id,tags', flatten_multivalued=False, request_handler='/export').load()
solrdata.show()
You would see something like this:
+-----------+---------------------+
| id| tags|
+-----------+---------------------+
|id_dfdfdfdf|[I, got, tags, tags] |
|id_fgfgfgfg| []|
+-----------+---------------------+
In Python at least, that `[]` is actually `[None]`, and it appears to be a bug in the spark-solr connector code.
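As a downstream workaround until the connector is fixed, the spurious single-element list can be collapsed back to null after collecting the rows. This is a minimal sketch in plain Python (the `normalize_multivalued` helper is hypothetical, not part of spark-solr or PySpark):

```python
def normalize_multivalued(value):
    # Hypothetical post-processing helper: collapse the spurious
    # single-element [None] that the connector returns for an absent
    # multivalued field back to None, leaving real values untouched.
    if isinstance(value, list) and value == [None]:
        return None
    return value

rows = [
    {"id": "id_dfdfdfdf", "tags": ["I", "got", "tags", "tags"]},
    {"id": "id_fgfgfgfg", "tags": [None]},  # what the connector actually returns
]
cleaned = [{**r, "tags": normalize_multivalued(r["tags"])} for r in rows]
```

With this, null-checks on the `tags` column behave as expected again; a genuinely empty list is left alone.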
The problem seems to start in this section: https://github.com/lucidworks/spark-solr/blob/2afaacc3fd5198b761684312f803f5a2047e83a9/src/main/scala/com/lucidworks/spark/util/SolrRelationUtil.scala#L639-L646
Particularly this line:

val obj = map.get(field.name).asInstanceOf[Object]

When the export results contain no value for the field, `map.get(field.name)` returns null; the `asInstanceOf[Object]` cast leaves that null in place, so a null gets added to the values list and later surfaces in the dataframe as a single-element list containing None.
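The effect can be mimicked in plain Python as a rough analogue of that Scala code (these functions are illustrative sketches, not the connector's actual implementation): a lookup on a missing key yields None, and unconditionally wrapping the result produces `[None]`, while a key-presence guard preserves the null:

```python
def read_field_buggy(doc, field_name):
    # Rough analogue of the Scala line: a lookup on a missing key
    # returns None (null), which is then wrapped into the multivalued
    # list unconditionally -- producing [None] instead of null.
    obj = doc.get(field_name)
    return obj if isinstance(obj, list) else [obj]

def read_field_guarded(doc, field_name):
    # Hypothetical guarded variant: check key presence first, so an
    # absent field stays null rather than becoming [None].
    if field_name not in doc:
        return None
    obj = doc[field_name]
    return obj if isinstance(obj, list) else [obj]

doc_without_tags = {"id": "id_fgfgfgfg"}  # exported doc that has no tags
```

Running both on a document without tags shows the divergence: the buggy path yields a list holding a single None, the guarded path yields null.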
I think there should be some kind of a key-presence check before calling `map.get()`. I have a PR with a possible solution that I will submit as well.
Here is the PR: https://github.com/lucidworks/spark-solr/pull/295
Thanks @theoathinas for the contribution. Can you also add a unit test here? This class is the right place to add it: https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/RelationTestSuite.scala