acl-anthology Explicit event representation

In a similar vein to #2603, I'd like to make all events explicit in the XML.

We implicitly generate events of the form "{venue}-{year}" for every venue referenced in a volume. For example, the following XML ...

<collection id="W14">
  <volume id="1" type="proceedings">
    <meta>
      ...
      <year>2014</year>
      <venue>gwc</venue>

... induces an event "gwc-2014".
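To make the implicit rule concrete, here is a minimal sketch of how such event IDs could be derived from a collection file. It is not the library's actual implementation: the function name `implicit_events` is hypothetical, and the sample XML is a completed, stripped-down version of the snippet above (the elided metadata is simply left out).

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal completion of the snippet above; real files carry more metadata.
SAMPLE = """
<collection id="W14">
  <volume id="1" type="proceedings">
    <meta>
      <year>2014</year>
      <venue>gwc</venue>
    </meta>
  </volume>
</collection>
"""


def implicit_events(xml_text):
    """Map each implicit "{venue}-{year}" event ID to the volumes that induce it."""
    collection = ET.fromstring(xml_text)
    events = {}
    for volume in collection.iter("volume"):
        year = volume.findtext("meta/year")
        full_id = f"{collection.get('id')}-{volume.get('id')}"
        # A volume may reference several venues; each one induces its own event.
        for venue in volume.iterfind("meta/venue"):
            events.setdefault(f"{venue.text}-{year}", []).append(full_id)
    return events


if __name__ == "__main__":
    print(implicit_events(SAMPLE))
```

For the sample above this yields a single event, "gwc-2014", associated with "W14-1".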

However, old-style collections like W14 contain many different workshops, and therefore many different "events". This creates some challenges for making these events explicit:

The schema currently only allows a single <event> per collection.
An explicitly defined <event> lists <colocated> volume IDs. This doesn't include the volume IDs defined in the same XML file, as they are assumed to be part of the event automatically.
Thus, if we allowed multiple <event> definitions per collection, we'd have to list all associated volume IDs explicitly. This is because an event like "gwc-2014" should only include "W14-1", but not all the other volumes defined in W14.
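The difference between the two membership rules can be sketched as follows. This is only an illustration under stated assumptions: both function names are hypothetical, and the volume list is a made-up subset standing in for "all the other volumes defined in W14".

```python
def volumes_current_rule(colocated, volumes_in_same_file):
    """Current rule as described above: volumes defined in the same file as
    the event belong to it automatically, in addition to those it lists."""
    return sorted(set(colocated) | set(volumes_in_same_file))


def volumes_explicit_rule(listed):
    """Rule needed for multiple events per collection: an event contains
    only the volume IDs it lists explicitly."""
    return sorted(set(listed))


# Illustrative subset of the volumes in W14.xml (not the real list).
W14_VOLUMES = ["W14-1", "W14-2", "W14-3"]

if __name__ == "__main__":
    # Under the current rule, "gwc-2014" defined in W14.xml would absorb every volume.
    print(volumes_current_rule([], W14_VOLUMES))
    # Under the explicit rule, it contains only what it lists.
    print(volumes_explicit_rule(["W14-1"]))
```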

In my opinion, this raises the question of whether events should be defined in the collection XML at all. In the library, if we want to find an event with ID "gwc-2014", there is no way to determine which file contains it (except for the rather loose association that it's probably in a "?14.xml" or "2014.*.xml" file). On the other hand, events can also define bibliographic items like talks, which we hope to expand in the future, and those clearly belong in the XML.

I'm not sure how best to resolve this yet, and welcome opinions.

Aug 19 '23 16:08 mbollmann

I've started with this already: for any old-style workshop you can create a file named YYYY.VENUE.xml with just an event block. It's a little ugly because it means the event is separate from the proceedings but at least it works and is explicit. The other thing we could do (this might work already) is to move the WXX-YY volume block into that file so it's colocated with the event.

Aug 20 '23 11:08 mjpost

> I've started with this already: for any old-style workshop you can create a file named YYYY.VENUE.xml with just an event block. It's a little ugly because it means the event is separate from the proceedings but at least it works and is explicit.

Ah, I didn't see that there are XML files like that already! Yeah, that should work.

> The other thing we could do (this might work already) is to move the WXX-YY volume block into that file so it's colocated with the event.

This I would probably argue against. I think it's good to keep the convention that XML files must be named after the collection they contain, which would be violated if "1994.rocling.xml" contained "O94-1". We already have to derive a lot of information by parsing all the data files; I wouldn't want to exacerbate this situation.

Aug 20 '23 13:08 mbollmann

I think the advantage of the explicit information is that we don't have to derive information from hidden sources, such as the file name. The "W" files were essentially large miscellaneous / unsorted dumping grounds for all workshops and even conferences prior to the introduction of our new IDs, and they are very large and unwieldy, getting worse the closer you get to 2019 (where we ran out of Ws and had to put EMNLP workshops under D). Moving those volumes into files that reflect their actual venue could be viewed as a restoration of information that was missing, without breaking backwards compatibility.

Aug 21 '23 14:08 mjpost

I like this from a data organization perspective, but my concern is that we already have to compute a lot of indirect information when loading Anthology data, and this would add more. If you want to find all publications by "Matt Post", you currently have to parse every single XML file. If you want to find all events for a given volume, you currently have to parse every single XML file, as the volume could, technically speaking at least, be colocated with any event defined anywhere. If we moved volumes into files not named after their collection ID, we'd also have to parse every single XML file just to know which file contains which volume.

This may be less relevant for building the website, where we have to parse all data files anyway, but it can make any other data access (say, fetching one specific paper) a lot slower. That's why I'm currently in favour of enforcing the naming convention that XML files must be named after their collection ID.
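A rough sketch of the two lookup strategies, to show where the cost comes from. Everything here is an assumption for illustration: the function names, the flat data directory, and the two tiny demo files (one of which simulates a W19 volume that has been moved into "2019.scil.xml").

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def file_by_convention(data_dir, volume_id):
    """With the naming convention, the collection ID alone names the file;
    nothing has to be parsed."""
    collection_id = volume_id.rsplit("-", 1)[0]
    return data_dir / f"{collection_id}.xml"


def file_by_scanning(data_dir, volume_id):
    """Without the convention, any file might hold the volume, so in the
    worst case every XML file is parsed."""
    collection_id, volume_number = volume_id.rsplit("-", 1)
    for path in sorted(data_dir.glob("*.xml")):
        root = ET.parse(path).getroot()
        if root.get("id") != collection_id:
            continue
        if any(v.get("id") == volume_number for v in root.iter("volume")):
            return path
    return None


def make_demo_dir():
    """Create a throwaway directory with two hypothetical data files."""
    data_dir = Path(tempfile.mkdtemp())
    (data_dir / "W19.xml").write_text(
        '<collection id="W19"><volume id="2"/></collection>'
    )
    # Simulates a volume moved out of its "collection ID file".
    (data_dir / "2019.scil.xml").write_text(
        '<collection id="W19"><volume id="1"/></collection>'
    )
    return data_dir


if __name__ == "__main__":
    demo = make_demo_dir()
    print(file_by_convention(demo, "W19-1").name)  # where the convention points
    print(file_by_scanning(demo, "W19-1").name)    # where the volume actually is
```

In the demo, the convention-based lookup points at "W19.xml" without opening anything, while the moved volume can only be found by scanning, which is the extra cost described above.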

If we moved, say, W19 workshops into different files, we'd also have multiple files with <collection id="W19">, i.e., the collection IDs wouldn't be unique across XML files anymore. That could be problematic too, I'm not sure.

Maybe one (partial) solution: when moving a volume out of its "collection ID file", leave in a placeholder that says where the volume can be found. Say, if we moved W19-01 into 2019.scil.xml, we could leave something like <volume-link id="1" file="2019.scil.xml"> in W19.xml.
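Resolving such a placeholder could look roughly like this. It is a sketch only: <volume-link> is the brainstormed element from the paragraph above and not part of the current schema, and the function name `locate_volume` and the sample file content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical W19.xml after one volume was moved out; <volume-link> is
# the proposed placeholder, not an element of the existing schema.
W19_XML = """
<collection id="W19">
  <volume-link id="1" file="2019.scil.xml"/>
  <volume id="2" type="proceedings"/>
</collection>
"""


def locate_volume(collection_xml, collection_file, volume_number):
    """Return the file that holds the volume, following a placeholder if present."""
    root = ET.fromstring(collection_xml)
    for child in root:
        if child.get("id") != volume_number:
            continue
        if child.tag == "volume-link":
            return child.get("file")  # the volume was moved elsewhere
        if child.tag == "volume":
            return collection_file    # the volume is still defined here
    return None


if __name__ == "__main__":
    print(locate_volume(W19_XML, "W19.xml", "1"))
    print(locate_volume(W19_XML, "W19.xml", "2"))
```

With this, a lookup still starts from the file named after the collection ID and needs to open at most one additional file.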

Just brainstorming here.

Aug 21 '23 15:08 mbollmann
