spark-xml
Duplicate column error when element names differ only in letter case
Hello! If I load files with identical element names that differ only in letter case, I get an error. Instead, I would prefer to get a NULL string, or two columns with different letter case in the schema. I think that would be logical.
Code:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read XML") \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.18.0") \
    .getOrCreate()

df = spark.read.format("xml") \
    .option("rowTag", "Root") \
    .option("attributePrefix", "") \
    .option("mode", "PERMISSIVE") \
    .option("charset", "utf-8") \
    .option("inferSchema", False) \
    .option("ignoreNamespace", False) \
    .load("case_test/*.xml")
df.printSchema()
XML file 1 in folder case_test:
<Root>
<Element>Block for case switch</Element>
</Root>
XML file 2 in folder case_test:
<Root>
<ElemenT>Block for case switch</ElemenT>
</Root>
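For illustration, here is a small pure-Python sketch (not spark-xml's actual implementation) of why the analyzer rejects the inferred schema: Spark's analysis is case-insensitive by default, so two inferred fields whose names differ only in case collide when the schema is checked for duplicates.

```python
# Sketch of a case-insensitive duplicate check, similar in spirit to what
# Spark's analyzer does when validating a data schema. Not actual Spark code.
from collections import Counter

def find_case_insensitive_duplicates(field_names):
    """Return lower-cased names that occur more than once, ignoring case."""
    counts = Counter(name.lower() for name in field_names)
    return sorted(name for name, n in counts.items() if n > 1)

# Schema inferred from the two sample files above: one field per spelling.
inferred_fields = ["Element", "ElemenT"]
print(find_case_insensitive_duplicates(inferred_fields))  # ['element']
```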
Error:
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<ipython-input-2-b867e6c5fcd7> in <module>
348 .option("inferSchema", False) \
349 .option("ignoreNamespace", False) \
--> 350 .load(f"case_test/*.xml")
351 df.printSchema()
352 init_new_spark_df_methods()
/usr/local/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
202 self.options(**options)
203 if isinstance(path, str):
--> 204 return self._df(self._jreader.load(path))
205 elif path is not None:
206 if type(path) != list:
/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
115 # Hide where the exception came from that shows a non-Pythonic
116 # JVM exception message.
--> 117 raise converted from None
118 else:
119 raise
AnalysisException: Found duplicate column(s) in the data schema: `element`
Thank you in advance!
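One thing that may be worth trying (an assumption on my part, not verified against spark-xml 0.18.0): Spark's duplicate-column check honors the `spark.sql.caseSensitive` setting, so enabling it might allow both spellings through as separate columns. A minimal configuration sketch:

```python
# Sketch, assuming spark.sql.caseSensitive is honored by the data source's
# schema validation; behavior with spark-xml 0.18.0 is not verified.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read XML") \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.18.0") \
    .config("spark.sql.caseSensitive", "true") \
    .getOrCreate()

df = spark.read.format("xml") \
    .option("rowTag", "Root") \
    .load("case_test/*.xml")
df.printSchema()  # may show both Element and ElemenT if the check passes
```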
Hm, I actually don't even know if that's 'correct' behavior or not. Spark is not case sensitive but XML is. You're welcome to investigate and come up with an argument about what it should do and see if the schema inference can be changed. I just don't want to break any existing behavior over this as it's operated this way forever. But making something work that never worked could be OK.
Thank you for the fast answer! From the W3C XML spec, https://www.w3.org/TR/xml/#dt-entref, section 4.3.3 Character Encoding in Entities: "XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings)."
I think the right way is to read entities with different case as the same.
Though I also had doubts about whether those are really the same entities :)