polars
Support XML as input format
Problem description
It would be great to be able to read XML into Polars DataFrame, similarly to what pandas offers (https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html).
I believe XML is not a hot language anymore, but it is still widely used, so I believe this would add a lot of value.
It's still massively used in production throughout the world. As of today I agree that almost no one would choose XML as a format to export or import their data. But some (most?) companies that existed before the JSON, CSV, XLSX hype didn't wait for those formats to create programs, APIs and whatnot. And since you don't fix what isn't broken, it's still very popular.
In my work there's not a single month (or dare I say week) without encountering tabular data presented as XML.
This would indeed be very helpful!
For now, I would use pandas' function df_pd = pd.read_xml() to parse the XML file and then use df_pl = pl.from_pandas(df_pd).
What are alternatives?
@MariusMerkleQC pd.read_xml doesn't appear to do too much.
https://github.com/pandas-dev/pandas/blob/49ca01ba9023b677f2b2d1c42e99f45595258b74/pandas/io/xml.py#L757-L861
It seems to be essentially a small wrapper around lxml.etree / xml.etree.ElementTree
doc = lxml.etree.parse(...)
nodes = doc.xpath(...)
df = pd.DataFrame(nodes)
Then it should be relatively easy to bring this to polars, what do you think?
As it is not supported yet, I just used the library ElementTree to parse the .xml file. I then extracted value by value and just put them into a pl.DataFrame() one by one.
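A minimal sketch of that ElementTree workaround, assuming a flat document in which every row element has simple child elements holding text. The sample XML, the tag names, and the helper xml_to_rows are hypothetical and only serve as illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the tag names are made up for illustration.
XML_TEXT = """<people>
  <person><name>Ada</name><age>36</age></person>
  <person><name>Linus</name><age>54</age></person>
</people>"""


def xml_to_rows(xml_text, row_tag):
    """Return one dict per <row_tag> element, mapping child tag to child text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in row}
        for row in root.iter(row_tag)
    ]


rows = xml_to_rows(XML_TEXT, "person")
# Every value is still a string here; casting (e.g. age to an integer)
# would be a separate step once the data is in a DataFrame.
```

With polars installed, the resulting list of dicts can be passed to pl.DataFrame(rows). Attributes, nested elements and namespaces are not handled by this sketch.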
Would definitely love to have native xml support in polars. Not hard to add but annoying when coming from pandas.
If this were to be implemented in Rust, the spark-xml data source Databricks created for Spark might be worth borrowing some ideas from. It uses a StAX approach to XML parsing, the same approach quick-xml takes.
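To illustrate the streaming idea in Python rather than Rust: the standard library's xml.etree.ElementTree.iterparse is a loose analogue of a StAX-style reader, since the caller pulls parse events one at a time instead of loading the whole tree. This is not how spark-xml or quick-xml are implemented; the helper stream_rows and the sample data are hypothetical:

```python
import io
import xml.etree.ElementTree as ET


def stream_rows(source, row_tag):
    """Yield one dict per completed <row_tag> element, read incrementally."""
    # "end" events fire once an element, including its children, is fully parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            row = {child.tag: child.text for child in elem}
            elem.clear()  # drop the children we already consumed to limit memory use
            yield row


# Hypothetical sample data; any file path or binary file object would work as source.
XML_BYTES = (
    b"<items>"
    b"<item><id>1</id><label>a</label></item>"
    b"<item><id>2</id><label>b</label></item>"
    b"</items>"
)

streamed = list(stream_rows(io.BytesIO(XML_BYTES), "item"))
```

Because rows are produced lazily, they could be collected in batches before building a DataFrame, which is roughly the benefit a streaming reader in Rust would aim for.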
polars isn't going to implement an XML reader based on Python's XML reader; it would have to be in Rust. I can't say whether or not the maintainers want the extra binary size.
It seems that calamine (used by fastexcel in the read_excel engine https://github.com/pola-rs/polars/pull/14000) uses the quick-xml library that @blackerby has mentioned.
Perhaps something could be done with quick-xml if Calamine integration happens at the Rust level.
We may be able to push the speed even further by integrating directly with Calamine down in the lower-levels of the Rust engine
Along with XML reading, I'd like to see HTML reading too (https://github.com/pola-rs/polars/issues/13063).