Support XML as input format

Open fchareyr opened this issue 1 year ago • 11 comments

Problem description

It would be great to be able to read XML into Polars DataFrame, similarly to what pandas offers (https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html).

fchareyr avatar Jun 20 '23 11:06 fchareyr

I believe XML is not a hot language anymore, but it is still widely used, so I believe this would add a lot of value.

horahh avatar Jul 31 '23 01:07 horahh

It's still massively used in production throughout the world. As of today, I agree that almost no one would choose XML as a new format to export / import their data. But some (most?) companies that existed before the JSON, CSV, XLSX hype didn't wait for those formats to create programs, APIs and whatnot. And since you don't fix what's not broken, it's still very popular.

In my work there's not a single month (or dare I say week) without encountering tabular data presented as XML.

mdeville avatar Aug 11 '23 15:08 mdeville

This would indeed be very helpful!

For now, I would use pandas' function df_pd = pd.read_xml() to parse the XML file and then use df_pl = pl.from_pandas(df_pd).

What are the alternatives?

MariusMerkleQC avatar Sep 07 '23 08:09 MariusMerkleQC

@MariusMerkleQC pd.read_xml doesn't appear to do too much.

https://github.com/pandas-dev/pandas/blob/49ca01ba9023b677f2b2d1c42e99f45595258b74/pandas/io/xml.py#L757-L861

It seems to be essentially a small wrapper around lxml.etree / xml.etree.ElementTree

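doc = lxml.etree.parse(...)   # lxml.etree is a module; parse() builds the tree
nodes = doc.xpath(...)        # select the elements that become rows

df = pd.DataFrame(nodes)      # roughly: one row per selected node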

cmdlineluser avatar Sep 07 '23 12:09 cmdlineluser

Then it should be relatively easy to bring this to polars, what do you think?

MariusMerkleQC avatar Sep 07 '23 12:09 MariusMerkleQC

As it is not supported yet, I just used the library ElementTree to parse the .xml file. I then extracted value by value and just put them into a pl.DataFrame() one by one.
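A minimal sketch of that workaround, assuming each row is one repeated element whose attributes and child elements become columns. The sample document, the `book` tag, and the helper name `xml_to_rows` are made up for illustration; only the standard-library `xml.etree.ElementTree` is required, and the Polars step is skipped if Polars is not installed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative only.
XML_TEXT = """<catalog>
  <book id="1"><title>Dune</title><price>9.99</price></book>
  <book id="2"><title>Emma</title><price>4.50</price></book>
</catalog>"""


def xml_to_rows(xml_text, row_tag):
    """Return one dict per <row_tag> element: its attributes plus child tag -> text."""
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter(row_tag):
        row = dict(elem.attrib)          # attributes become columns
        for child in elem:
            row[child.tag] = child.text  # direct children become columns
        rows.append(row)
    return rows


rows = xml_to_rows(XML_TEXT, "book")

try:
    import polars as pl

    # Every value arrives as a string; cast columns afterwards as needed.
    df = pl.DataFrame(rows)
except ImportError:
    df = None  # Polars not installed; `rows` is still usable on its own
```

Collecting all rows first and building the frame once is usually simpler than adding values one by one, but nested or irregular XML would need more handling than this sketch provides.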

MariusMerkleQC avatar Sep 25 '23 20:09 MariusMerkleQC

Would definitely love to have native xml support in polars. Not hard to add but annoying when coming from pandas.

rupurt avatar Jan 24 '24 00:01 rupurt

If this were to be implemented in Rust, the spark-xml data source Databricks created for Spark might be worth borrowing some ideas from. It uses a StAX approach to XML parsing, the same approach quick-xml takes.
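To illustrate the general streaming idea only, here is a rough Python analogy using the standard library's `ElementTree.iterparse`. It is not how spark-xml or quick-xml are implemented, and the sample data and the `stream_rows` helper are hypothetical. Rows are emitted as soon as their end tag is seen, rather than after the whole document has been loaded.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; tag names are illustrative only.
XML_BYTES = b"""<log>
  <event><level>info</level><msg>started</msg></event>
  <event><level>warn</level><msg>disk low</msg></event>
</log>"""


def stream_rows(source, row_tag):
    """Yield one dict per <row_tag> element as the parser reaches its end tag."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            yield {child.tag: child.text for child in elem}
            elem.clear()  # release the row's children once it has been emitted


rows = list(stream_rows(io.BytesIO(XML_BYTES), "event"))
```

A native reader would presumably do the equivalent in Rust and fill column builders directly instead of producing Python dicts.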

blackerby avatar Apr 22 '24 23:04 blackerby

Polars isn't going to implement an XML reader based on Python's XML reader; it'd have to be in Rust. I can't say whether or not the maintainers want the extra binary size.

deanm0000 avatar Apr 25 '24 12:04 deanm0000

It seems that calamine (used by fastexcel in the read_excel engine https://github.com/pola-rs/polars/pull/14000) uses the quick-xml library that @blackerby has mentioned.

Perhaps something could be done with quick-xml if Calamine integration happens at the Rust level.

We may be able to push the speed even further by integrating directly with Calamine down in the lower-levels of the Rust engine

cmdlineluser avatar Apr 25 '24 13:04 cmdlineluser

Along with XML reading, I'd like to see HTML reading too (https://github.com/pola-rs/polars/issues/13063).

deanm0000 avatar Apr 25 '24 14:04 deanm0000