
The problem with the case of words for identical names

Open hipp0gryph opened this issue 1 year ago • 3 comments

Hello! If I load files whose element names are identical except for letter case, I get an error. Instead, I would expect either a NULL string or two columns with different letter case in the schema. I think that would be the logical behavior.

Code:

spark = SparkSession.builder \
    .appName("Read XML") \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.18.0")\
    .getOrCreate()

df = spark.read.format("xml") \
    .option("rowTag", "Root") \
    .option("attributePrefix", "") \
    .option("mode", "PERMISSIVE") \
    .option("charset", "utf-8") \
    .option("inferSchema", False) \
    .option("ignoreNamespace", False) \
    .load("case_test/*.xml")
df.printSchema()

xml 1 for folder case_test:

<Root>
    <Element>Block for case switch</Element>
</Root>

xml 2 for folder case_test:

<Root>
    <ElemenT>Block for case switch</ElemenT>
</Root>

Error:

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-2-b867e6c5fcd7> in <module>
    348     .option("inferSchema", False) \
    349     .option("ignoreNamespace", False) \
--> 350     .load(f"case_test/*.xml")
    351 df.printSchema()
    352 init_new_spark_df_methods()

/usr/local/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    202         self.options(**options)
    203         if isinstance(path, str):
--> 204             return self._df(self._jreader.load(path))
    205         elif path is not None:
    206             if type(path) != list:

/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

AnalysisException: Found duplicate column(s) in the data schema: `element`
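The lowercased name in the error comes from Spark's analyzer, which compares column names case-insensitively by default (`spark.sql.caseSensitive` is `false`). A minimal sketch of that duplicate check, as an illustration of the behaviour rather than spark-xml's actual code:

```python
from collections import Counter

def find_case_insensitive_duplicates(names):
    # Mimics Spark's default duplicate-column check: names are
    # compared after lowercasing, so "Element" and "ElemenT" collide
    # and the error reports the lowercased name `element`.
    counts = Counter(n.lower() for n in names)
    return sorted(n for n, c in counts.items() if c > 1)

print(find_case_insensitive_duplicates(["Element", "ElemenT"]))  # -> ['element']
```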

Thank you in advance!

hipp0gryph avatar May 21 '24 10:05 hipp0gryph

Hm, I actually don't even know if that's 'correct' behavior or not. Spark is not case sensitive but XML is. You're welcome to investigate and come up with an argument about what it should do and see if the schema inference can be changed. I just don't want to break any existing behavior over this as it's operated this way forever. But making something work that never worked could be OK.

srowen avatar May 21 '24 14:05 srowen

> Hm, I actually don't even know if that's 'correct' behavior or not. Spark is not case sensitive but XML is. You're welcome to investigate and come up with an argument about what it should do and see if the schema inference can be changed. I just don't want to break any existing behavior over this as it's operated this way forever. But making something work that never worked could be OK.

Thank you for the fast answer! In the W3C XML spec (https://www.w3.org/TR/xml/#dt-entref), section 4.3.3 "Character Encoding in Entities" says: XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).

I think the right way is to read entities with different case as the same.
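The proposal above could look like this in schema inference: when merging the schemas inferred from the two files, treat names that differ only by case as one column. A sketch of that merge (an assumption about how inference could be changed, not spark-xml's current code; the first spelling encountered wins):

```python
def merge_fields_case_insensitive(per_file_fields):
    # per_file_fields: list of field-name lists, one per input file.
    # Names differing only by case are folded into a single column,
    # keyed by the lowercased name; the first spelling seen is kept.
    merged = {}
    for fields in per_file_fields:
        for name in fields:
            merged.setdefault(name.lower(), name)
    return list(merged.values())

print(merge_fields_case_insensitive([["Element"], ["ElemenT"]]))  # -> ['Element']
```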

hipp0gryph avatar May 21 '24 16:05 hipp0gryph

Though I also had doubts about whether those really are the same entities :)

hipp0gryph avatar May 21 '24 16:05 hipp0gryph