acl-anthology
Removed comments from XML
I removed comments from the XML and the Python code that permitted skipping them when parsing. In its place, I added a "note" attribute to <volume-id> to help sort workshops. Would this work or does it also complicate things?
Build successful. Some useful links:
- Complete site preview: https://preview.aclanthology.org/remove-xml-comments
- Potential volumes of interest:
This preview will be removed when the branch is merged.
These notes are intended to work more like groupings, right? Maybe we could do
<colocated group="unsorted">
...
</colocated>
<colocated group="acl-2023">
...
</colocated>
Alternatively, do you foresee this being used for anything other than ws volumes? Maybe this is also a good opportunity to get rid of ws altogether and refactor the way we classify something as a "workshop".
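To make the grouping idea concrete, here is a minimal sketch of how a build script might read such groups. It assumes that each `<colocated>` block contains `<volume-id>` children; the `<event>` wrapper, the volume IDs, and the helper name `volumes_by_group` are made up for illustration. Only the `group` attribute comes from the suggestion above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical event file: the wrapper element and the IDs are invented;
# only the group="..." attribute is taken from the proposal.
SAMPLE = """
<event id="ws-2023">
  <colocated group="unsorted">
    <volume-id>2023.aaa-1</volume-id>
  </colocated>
  <colocated group="acl-2023">
    <volume-id>2023.bbb-1</volume-id>
    <volume-id>2023.ccc-1</volume-id>
  </colocated>
</event>
"""


def volumes_by_group(event_xml):
    """Map each group name to the volume IDs listed under it."""
    groups = defaultdict(list)
    root = ET.fromstring(event_xml)
    for colocated in root.iter("colocated"):
        # Blocks without a group attribute fall back to "unsorted" (an assumption).
        name = colocated.get("group", "unsorted")
        for vol in colocated.iter("volume-id"):
            groups[name].append(vol.text.strip())
    return dict(groups)


if __name__ == "__main__":
    for group, ids in volumes_by_group(SAMPLE).items():
        print(group, ids)
```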
This is very clearly a better way to do this, thanks!
It is only for workshops. I've had requests from senior people to maintain the workshop listing, since it's useful for people to browse. It's a bit of a pain to update this list, and to keep it sorted, which is why I've been pushing for this grouping idea. Definitely open to re-factoring; did you have something in mind?
Firstly, rather than creating a hypothetical event called "ws-<year>", shouldn't it be that the workshop volume gets <venue>ws</venue> in its <meta> block, in addition to its other venues? That's how it's been done in the past, and I think that is clearer than going the event route.
Secondly, if that's how it's represented, it could be refactored in a number of ways, e.g. (i) replacing <venue>ws</venue> with an <is-workshop/> tag, (ii) adding an attribute workshop="true" to the <meta> or the <volume> tag, or (iii) changing <volume type="proceedings"> to <volume type="workshop">.
In any case, adapting the build so that everything marked as a workshop is compiled on its own page (which being part of the "ws" venue does now) should be a simple change.
In #1117, we discussed adding a "workshop" flag to venues, but I think it's clearer to attach it to volumes, as this both mirrors how it currently works and avoids the issue of workshop venues turning into full conferences at some point.
(On this note, there are a couple of volumes in the current "ws-2023" list that do say "conference" in the proceedings title — is this intentional?)
The distinction between workshop/conference can be fuzzy. I think, though, that *SEM, IWSLT, etc. should not be listed under the workshops event. I suspect they were just blindly copied over along with all other colocated events.
You're right in pointing out that we currently have redundant ways to add to the workshop "event". I'm not sure why I didn't see that and instead created the {yyyy}.ws.xml files. For the refactoring, I'm in favor of a static boolean tag, e.g., <is-workshop/> in the <meta> block (or maybe a variant like <add-to-workshops/> that more directly conveys the purpose). This might also be a step in the direction of no longer abusing the "event" idea to display amalgamated workshops, though I suspect we'll still have to create that HTML page or make it discoverable somehow. The downside to this is that there is no longer a single place to view all the workshops in a given year, prior to building the site out. Though I guess one could accomplish it by grepping through all files with such a tag in the current year.
So if we go through with this, I guess the proposal is to eliminate all files of the format {year}.ws.xml, moving that information to a tag?
The distinction between workshop/conference can be fuzzy. I think, though, that *SEM, IWSLT, etc. should not be listed under the workshops event. I suspect they were just blindly copied over along with all other colocated events.
Noticing these types of issues might be easier when workshops are flagged within the volume itself, I think.
This might also be a step in the direction of no longer abusing the "event" idea to display amalgamated workshops, though I suspect we'll still have to create that HTML page or make it discoverable somehow.
We can start by treating "workshop tags" the same as before during the build, i.e., creating the virtual "ws" venue and attaching these volumes to it. That should mean that on the front-end, everything stays the same. The question of how to do this better is then probably related to a redesign of the front page.
The downside to this is that there is no longer a single place to view all the workshops in a given year, prior to building the site out. Though I guess one could accomplish it by grepping through all files with such a tag in the current year.
XPath expressions work:
~/r/acl-anthology/data/xml $ cat ?19.xml | xq -x '//meta[venue="ws"]/url'
D19-51
D19-54
D19-58
D19-59
D19-60
D19-61
D19-63
D19-64
D19-66
W19-03
W19-11
W19-56
W19-68
W19-70
W19-71
W19-72
W19-73
W19-85
And we could add functionality to the new library that'll make this easy too.
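As a rough idea of what such a helper could look like, here is a sketch using only the standard library. The function name `workshop_urls` and the sample collection are hypothetical and do not reflect the new library's actual API. The query mirrors the XPath expression above, relying on ElementTree's limited XPath support for the `[child='text']` predicate.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of a collection file; titles are placeholders.
SAMPLE = """
<collection id="D19">
  <volume id="51">
    <meta>
      <booktitle>Proceedings of a Hypothetical Workshop</booktitle>
      <url>D19-51</url>
      <venue>emnlp</venue>
      <venue>ws</venue>
    </meta>
  </volume>
  <volume id="1">
    <meta>
      <booktitle>Proceedings of a Hypothetical Conference</booktitle>
      <url>D19-1</url>
      <venue>emnlp</venue>
    </meta>
  </volume>
</collection>
"""


def workshop_urls(collection_xml):
    """Return the <url> text of every volume whose <meta> lists the "ws" venue."""
    root = ET.fromstring(collection_xml)
    # Matches <meta> elements that have at least one <venue> child with text "ws".
    return [url.text for url in root.findall(".//meta[venue='ws']/url")]


if __name__ == "__main__":
    print(workshop_urls(SAMPLE))
```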
So if we go through with this, I guess the proposal is to eliminate all files of the format {year}.ws.xml, moving that information to a tag?
Yes, and also:
- Convert all instances of <venue>ws</venue> to <is-workshop/>
- Add <is-workshop/> to all volumes where the "ws" venue is currently inferred implicitly (all W*.xml files, I believe)
- Find all volumes with contains(booktitle, "Workshop") that did not already get the <is-workshop/> tag and determine if they should have it, because I suspect it was not always added in recent years
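The three steps above could be sketched roughly as follows. This is an untested-on-real-data illustration, not a finished migration script: the helper names `migrate_meta` and `needs_review` are hypothetical, the sample XML is made up, and the `implicit_ws` parameter stands in for "this volume lives in a W*.xml file", which is itself an assumption stated above.

```python
import xml.etree.ElementTree as ET


def migrate_meta(meta, implicit_ws=False):
    """Replace <venue>ws</venue> with <is-workshop/> in one <meta> block.

    implicit_ws: pass True for volumes where "ws" is only inferred
    (assumed here to be the volumes in W*.xml files).
    Returns True if the block ends up flagged as a workshop.
    """
    had_ws = False
    for venue in meta.findall("venue"):  # findall returns a list, so removal is safe
        if (venue.text or "").strip() == "ws":
            meta.remove(venue)
            had_ws = True
    if (had_ws or implicit_ws) and meta.find("is-workshop") is None:
        ET.SubElement(meta, "is-workshop")
    return meta.find("is-workshop") is not None


def needs_review(meta):
    """True if the title mentions "Workshop" but the flag is missing."""
    booktitle = meta.find("booktitle")
    title = "".join(booktitle.itertext()) if booktitle is not None else ""
    return "Workshop" in title and meta.find("is-workshop") is None


# Made-up sample: one volume with an explicit "ws" venue, one without.
SAMPLE = """
<collection id="D19">
  <volume id="51"><meta>
    <booktitle>Proceedings of a Hypothetical Workshop</booktitle>
    <venue>emnlp</venue><venue>ws</venue>
  </meta></volume>
  <volume id="52"><meta>
    <booktitle>Proceedings of Another Hypothetical Workshop</booktitle>
    <venue>emnlp</venue>
  </meta></volume>
</collection>
"""

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for meta in root.iter("meta"):
        flagged = migrate_meta(meta)
        print(flagged, needs_review(meta))
```

Volumes reported by `needs_review` would still need a manual decision, since a title containing "Workshop" does not guarantee the volume belongs on the workshop listing.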
For the build, I'd just have the respective Python class treat <is-workshop/> as giving the volume the "ws" venue; in the new library we can handle this a bit more cleanly.
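In the build, that mapping might look something like the sketch below. It is a standalone illustration rather than the actual Python class from the repository; the function name `venues_for_volume` is hypothetical.

```python
import xml.etree.ElementTree as ET


def venues_for_volume(meta):
    """Collect venue IDs from a <meta> block, adding "ws" if the volume is flagged."""
    venues = [v.text.strip() for v in meta.findall("venue") if v.text]
    # Treat the flag exactly like the old <venue>ws</venue> entry, so the
    # front-end keeps compiling these volumes onto the "ws" page.
    if meta.find("is-workshop") is not None and "ws" not in venues:
        venues.append("ws")
    return venues


if __name__ == "__main__":
    flagged = ET.fromstring("<meta><venue>emnlp</venue><is-workshop/></meta>")
    print(venues_for_volume(flagged))
```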