datasets Support loading XML datasets

CC: @davanstrien

Sep 20 '22 18:09 albertvillanova

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Sep 20 '22 18:09 HuggingFaceDocBuilderDev

CC: @davanstrien

I should have some time to look at this on Friday :)

Sep 21 '22 09:09 davanstrien

@albertvillanova I've tried this with a few different XML datasets. One issue I've run into is a KeyError when the attributes of a field differ from those in the first parsed row. Unfortunately, this can come up in the ALTO XML format, for example if you want to parse the 'String' field, which contains the text in ALTO XML files.

When parsing a file, the first String elements in this instance have no 'STYLE' attribute:

<TextLine HEIGHT="39" WIDTH="295" VPOS="926" HPOS="247"><String WC="0.4600000083" CONTENT="jufqu’en" HEIGHT="39" WIDTH="117" VPOS="926" HPOS="247"/><SP WIDTH="14" VPOS="928" HPOS="365"/><String WC="0.6075000167" CONTENT="l’an" HEIGHT="26" WIDTH="50" VPOS="928" HPOS="380"/><SP WIDTH="24" VPOS="936" HPOS="431"/><String WC="0.4300000072" CONTENT="1" HEIGHT="16" WIDTH="9" VPOS="936" HPOS="456"/><String STYLE="italics" WC="0.5774999857" CONTENT="361." HEIGHT="25" WIDTH="68" VPOS="933" HPOS="474"/></TextLine>

Whereas this one, which appears later in the file, has the attribute on its very first String element:

<TextLine HEIGHT="39" WIDTH="712" VPOS="966" HPOS="297"><String STYLE="italics" WC="0.6999999881" CONTENT="I" HEIGHT="17" WIDTH="9" VPOS="977" HPOS="297"/><String WC="0.5" CONTENT="I." HEIGHT="18" WIDTH="25" VPOS="976" HPOS="318"/><SP WIDTH="24" VPOS="971" HPOS="344"/><String STYLE="italics" WC="0.3359999955" CONTENT="Crade" HEIGHT="26" WIDTH="91" VPOS="967" HPOS="369"/><SP WIDTH="31" VPOS="971" HPOS="461"/><String STYLE="italics" WC="0.6060000062" CONTENT="Pétri" HEIGHT="26" WIDTH="71" VPOS="968" HPOS="493"/><SP WIDTH="23" VPOS="968" HPOS="565"/><String STYLE="italics" WC="0.612857163" CONTENT="Candidi" HEIGHT="27" WIDTH="111" VPOS="967" HPOS="589"/><SP WIDTH="19" VPOS="967" HPOS="701"/><String STYLE="italics" WC="0.4088888764" CONTENT="Decembrii" HEIGHT="28" WIDTH="144" VPOS="966" HPOS="721"/><SP WIDTH="10" VPOS="968" HPOS="866"/><String STYLE="italics" WC="0.4600000083" CONTENT="in" HEIGHT="25" WIDTH="27" VPOS="968" HPOS="877"/><SP WIDTH="9" VPOS="967" HPOS="905"/><String STYLE="italics" WC="0.5099999905" CONTENT="funere" HEIGHT="38" WIDTH="94" VPOS="967" HPOS="915"/></TextLine>

Since the first-seen fields define what is passed to arrow_writer, a KeyError is raised as soon as a row with the extra attribute is encountered, because the writer doesn't expect this column.
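To make the failure mode concrete, here is a minimal sketch using only the standard library. It is not the actual arrow_writer code path; `collect_columns` is a hypothetical stand-in that, like the behaviour described above, fixes its columns from the first row it sees.

```python
import xml.etree.ElementTree as ET

# Two String elements: only the second one carries a STYLE attribute.
SAMPLE = (
    "<TextLine>"
    '<String WC="0.46" CONTENT="in"/>'
    '<String STYLE="italics" WC="0.57" CONTENT="361."/>'
    "</TextLine>"
)


def collect_columns(xml_text, tag="String"):
    """Hypothetical writer stand-in: columns are fixed by the first row."""
    columns = None
    for elem in ET.fromstring(xml_text).iter(tag):
        if columns is None:
            # The first element decides which columns exist.
            columns = {name: [] for name in elem.attrib}
        for name, value in elem.attrib.items():
            # Raises KeyError for any attribute the first element lacked.
            columns[name].append(value)
    return columns
```

Calling `collect_columns(SAMPLE)` raises `KeyError: 'STYLE'` on the second element, which mirrors what I see with the ALTO files.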

Since it's important to support streaming, I'm not sure there is a nice way to automatically detect the attributes for the whole file. I can see two potential ways of doing it:

1. Do an initial pass on a batch of data to have a higher chance of encountering variations in attributes before doing the arrow write.
2. Do a full pass on one file (and assume that the attributes won't change across files).
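Both options boil down to collecting the union of attribute names before writing. A rough sketch with the standard library's `iterparse` is below; `sniff_attributes` and its parameters are names I made up for illustration, not part of this PR. Real ALTO files usually declare a namespace, in which case `tag` would need the `{namespace}String` form.

```python
import io
import itertools
import xml.etree.ElementTree as ET


def sniff_attributes(source, tag="String", max_elements=1000):
    """Return attribute names seen on the first `max_elements` <tag> elements.

    `max_elements=None` scans the whole file (the "full pass" option).
    """
    seen = {}  # dict used as an ordered set of attribute names
    events = ET.iterparse(source, events=("end",))
    matching = (elem for _, elem in events if elem.tag == tag)
    for elem in itertools.islice(matching, max_elements):
        for name in elem.attrib:
            seen.setdefault(name, None)
        elem.clear()  # drop the element's contents once inspected
    return list(seen)


# Small in-memory example; a file path or open file works the same way.
example = io.BytesIO(
    b'<TextLine><String WC="0.4" CONTENT="a"/><SP WIDTH="1"/>'
    b'<String STYLE="italics" WC="0.5" CONTENT="b"/></TextLine>'
)
attribute_names = sniff_attributes(example)
```

The obvious weakness of the batch variant is that an attribute which first appears after `max_elements` elements would still be missed.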

I think the other way of doing this would be to allow users to define the expected/wanted attributes as another loading argument. This could then be used to extract only the listed attributes (and set them to None if not found). This requires a bit more work from the user but could be helpful. For example, in the XML above, most users will likely only want the WC and CONTENT attributes, so they could specify this upfront and avoid loading extra data they don't need or want. I suspect this option would make more sense than trying to handle changing attributes automatically. WDYT?
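A minimal sketch of what I mean, again using only the standard library. The function name `iter_rows` and the `tag` / `attributes` parameters are hypothetical and not existing loading arguments; the point is just that every row has the same keys, with None for anything missing.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source, tag, attributes):
    """Yield one dict per <tag> element, restricted to `attributes`.

    Attributes absent from an element are set to None, so every row
    has the same keys regardless of what the first element contained.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {name: elem.attrib.get(name) for name in attributes}
            elem.clear()  # drop the element's contents once the row is built


# Small in-memory example; only WC and CONTENT are requested.
example = io.BytesIO(
    b'<TextLine><String WC="0.46" CONTENT="in"/>'
    b'<String STYLE="italics" WC="0.57" CONTENT="361."/></TextLine>'
)
rows = list(iter_rows(example, tag="String", attributes=["WC", "CONTENT"]))
```

Because the schema is known before the first row is read, this stays compatible with streaming, and the extra STYLE attribute is simply ignored unless the user asks for it.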

Nov 01 '22 12:11 davanstrien