acl-anthology
Don't display joint volumes on venue page
Closes #1848. Venue pages should no longer show colocated volumes (e.g., ACL's venue page will only show ACL volumes, not all the colocated workshops). This is done by a bit of logic that filters colocated lists down to ones that match the parent venue ID. This isn't perfect, as can be seen below, because sometimes a volume's primary affiliation was a one-off.
This is also a start on #1164. We need to split out joint.yaml to distinguish joint versus colocated events, and to explicitly list exceptions when volumes were misnamed according to the pre-2020 ID format change. For example:
- COLING 1984 was named P84 because it was joint with ACL
- CoNLL was a workshop from 1997--2014 (then got "K")
- All EMNLP 2019 workshops used "D" for a prefix instead of "W" because we ran out of W volumes
We'll see if the preview builds.
This has worked pretty well. For example, the following venue pages now only (correctly) show their own volumes:
However, there are a few errors that persist, and can't really be fixed without either hard-coded logic or an explicit mechanism for overriding the default volumes that belong to a venue:
- EMNLP (particularly 2019, where we had to use the D19 code for workshops)
- *SEM, which has no volumes, because they all belong to SemEval. The "S" code was used for both, with one volume typically assigned to each
Making this explicit is part of the work that needs to be done for #1164, so I expect I will do this next.
It strikes me that one problem remaining since the new ID format was enacted is that we don't know the venue affiliation for every volume. We mostly infer it. I think we are going to need to make those all explicit, for every workshop prior to 2020. Many of them have been done for venues that people care about (e.g., I filled them in for WMT), but there are others. I think the next step is to come up with a list of orphaned volumes: volumes, mostly workshop ones, that have not been assigned to a venue.
Merged latest. Next steps for volume ownership:
- In the Anthology class, every Venue should know all volumes it owns. This is determined by rules (volume name is the same as the venue name, or old-style letter matches), explicit terms ("volumes" key in venue file), and subtracting out excluded volumes ("excluded-volumes" key in the venue file)
- We use these values directly when writing the YAML files for hugo
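A minimal sketch of the ownership rule described above: rule-based matches, plus explicitly claimed volumes, minus excluded ones. The function names, the letter-to-venue table, and the sample IDs are illustrative assumptions, not the Anthology's actual code; the key names follow the "volumes" and "excluded-volumes" keys mentioned above.

```python
import re

# Hypothetical subset of old-style collection letters mapped to venue slugs.
OLD_STYLE_LETTERS = {"P": "acl", "D": "emnlp", "K": "conll"}


def default_venue(volume_id):
    """Guess the venue slug implied by a volume ID, or None if no rule applies."""
    # New-style IDs look like "2022.acl-long": the venue follows the year.
    m = re.match(r"^\d{4}\.([a-z0-9]+)-", volume_id)
    if m:
        return m.group(1)
    # Old-style IDs look like "P19-1": the leading letter identifies the venue.
    m = re.match(r"^([A-Z])\d{2}-\d+$", volume_id)
    if m:
        return OLD_STYLE_LETTERS.get(m.group(1))
    return None


def owned_volumes(slug, venue_data, all_volume_ids):
    """Rule-based matches, plus explicit claims, minus excluded volumes."""
    by_rule = {v for v in all_volume_ids if default_venue(v) == slug}
    claimed = set(venue_data.get("volumes", []))
    excluded = set(venue_data.get("excluded-volumes", []))
    return sorted((by_rule | claimed) - excluded)


all_ids = ["D19-1", "D19-50", "2021.emnlp-main", "W18-52", "P19-1"]
emnlp = {"excluded-volumes": ["D19-50"]}  # disclaim a 2019 workshop volume
wmt = {"volumes": ["W18-52"]}  # claim an old-style workshop volume

print(owned_volumes("emnlp", emnlp, all_ids))
print(owned_volumes("wmt", wmt, all_ids))
```

Keeping the rule-based guess in one function means the exceptions live only in the venue data, which is what the explicit keys are meant to achieve.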
Then we can move to explicit modeling of events, following this comment and the discussion around it.
I think for events, we want a default event, which is every volume associated with that event. If there are other volumes colocated with it, they will have to be explicitly listed in the events file (e.g., data/events/acl-2021.yaml).
I have now removed joint.yaml and split out the venues into their own files under data/yaml/venues/{slug}.yaml. Each venue can now lay claim to the vast number of previously indistinguished volumes by listing them under a volumes tag. Venues can also disclaim volumes using the excluded-volumes list (EMNLP uses this to disassociate itself from workshops that had to use the D prefix in 2019).
The next step is to explicitly add events. This will allow us to restore the lost functionality of joint.yaml. For a particular event, we can add all the volumes that were colocated with that event. I plan to do this in the following way:
- Create data/events/{venue}-{YYYY}.xml, following our current event format (e.g., https://aclanthology.org/events/acl-2021).
- By default, every (venue, year) has an event page auto-generated, containing only the volumes associated with the venue.
- Volumes can be added to the event by listing them in the events file. For example, for ACL 2022, we'll list the Findings:ACL'22, along with all the workshops that appeared at ACL.
This will complete the process of separating out the conflated purposes in joint.yaml, and will also provide us with a way of representing data associated with an event, such as videos of keynotes.
Build successful. Some useful links:
- Complete site preview: https://preview.aclanthology.org/venue-listing-fix
- Potential volumes of interest: 2022.acl-long, 2022.acl-short, 2022.acl-srw, 2022.acl-demo, 2022.acl-tutorials
This preview will be removed when the branch is merged.
Building now, but currently getting some duplicate volumes (e.g., IJCLCP).
@mbollmann @akoehn I'd be interested in your feedback here, if you have time/interest (anyone else, too). I'm halfway through splitting out the venues and distinguishing joint (shared) volumes from colocated events. The current preview only displays a venue's owned volumes under /volumes/{venue} (whereas before it used to display colocated volumes). To do this, I
- Split out data/yaml/venues.yaml into individual venue files under data/yaml/venues/{slug}.yaml
- Added a volumes key to these YAML files, so that venues can claim their old-style volumes
- Added an excluded_volumes key to these YAML files, so that venues can disclaim volumes that are associated with them under normal rules (e.g., all workshops for EMNLP 2019, where we ran out of workshop IDs and were forced to use D19-50+).
Currently, pages under events/{venue} are broken. I plan to fix that next in a way that will also allow us to address #298. But the first step, which is all I plan to do in this PR, is simply to
- Establish default rules for creating the event pages (this currently exists), but allow them to be overridden if there is a file data/events/{year}.{slug}.xml for each event
- Add a <colocated> tag that links volumes that were joint with this event (one tag per joint volume, e.g., <colocated>D19-50</colocated>)
Later, this same file will be used to add keynotes and so on.
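To make the default-plus-override idea concrete, here is a rough sketch. It assumes an override file named {year}.{slug}.xml that holds one <colocated> tag per extra volume; the <event> root element, the event_volumes helper, and the sample IDs are placeholders, since the schema is not settled in this thread.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def event_volumes(events_dir, year, slug, default_volumes):
    """Return the venue's own volumes, plus any listed in an override file."""
    volumes = list(default_volumes)
    override = Path(events_dir) / f"{year}.{slug}.xml"
    if override.exists():
        root = ET.parse(str(override)).getroot()
        # One <colocated> tag per additional volume, as proposed above.
        for tag in root.iter("colocated"):
            volume_id = (tag.text or "").strip()
            if volume_id and volume_id not in volumes:
                volumes.append(volume_id)
    return volumes


with tempfile.TemporaryDirectory() as tmp:
    # The <event> root is a placeholder element name, not a confirmed schema.
    Path(tmp, "2019.emnlp.xml").write_text(
        "<event><colocated>D19-50</colocated><colocated>D19-51</colocated></event>"
    )
    with_override = event_volumes(tmp, 2019, "emnlp", ["D19-1"])
    without_override = event_volumes(tmp, 2018, "emnlp", ["D18-1"])

print(with_override)
print(without_override)
```

When no override file exists, the event falls back to the venue's own volumes, which matches the default rule.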
Thoughts? Objections?
Sorry for the super late reply, I am currently working backwards through my inbox ... is there anything else I should look at that has changed in the meantime? In general, this looks reasonable from your description, I will have a look at the data to get a better feeling for the implications.
I’ve got to update this and then merge. There’s a bit more to do to restore events, which explains the delay. This sanity check is helpful, thank you!
Good, then I know what the current state is and will have a more in depth look on Friday.
@mbollmann @akoehn if you have a minute to look at this comment (not necessarily the code), I'd be grateful.
I've been working through this issue here, trying to separate venue pages (which should just show venue volumes) from event pages (which should show all volumes associated with an event). This is a precursor to having explicit representations of events so that we can display plenaries and other items. It touches heavily on #1164. In working through this, I've tried a number of approaches, and they all run into difficulties.
As a first step here, I've done something uncontroversial, which is split the venues.yaml file out into individual files under data/yaml/venues. These are now easier to read and maintain and I don't anticipate objections.
For the next part, I have come around to Marcel's thinking in #1164: we should get rid of joint.yaml entirely, and move the information into the XML. It's quite tricky, though, to split apart the information that is found there. You can see the logic if you like in bin/split_joint.py, but the high-level ideas are:
- There are three types of items in joint.yaml: (a) joint events, (b) colocated events, and (c) "identification" events, where an old-style volume is associated with a venue tag (e.g., W18-52 with wmt).
- It is extraordinarily tricky to split these apart heuristically.
- Instead, I have split them into main-venue, which seeks to identify each pre-2020 volume as if it were a new-style volume (and had its venue in the file name), and associated-venue, which covers joint/colocated events.
- The main-venue association basically means that every volume in the Anthology has an explicit venue association. This is very useful and could also simplify the code.
- The associated-venue associations still fail to distinguish between joint or colocated events, but I'm not sure it's necessary to do that. The main thing is to have the volume show up under the right venue pages (update: I realize now this is not true: joint volumes should appear under every venue page, whereas associated volumes only appear in event pages)
- I didn't require the main venue tag, since it's redundant for new-style IDs, but it might be simpler if we just did that: exactly one main-venue tag and zero or more associated-venue tags.
You can see the results in the XML. Any thoughts before I move forward?
(I should add, I am hoping to wrap this up soon, like within the next week. It keeps getting delayed and then complicated because of new ingestions).
I updated the comment above a bit. I see that we are going to have to distinguish joint from colocated events. I think we would do this with a <joint-venue> tag, and it might have to be done by hand afterwards. I think that shouldn't be too tricky.
So the proposal would be
- Every volume would have one <main-venue> tag (maybe just <venue>?)
- Every volume would have 0 or more <joint-venue> tags
- Every volume would have 0 or more <associated-venue> (maybe <colocated-venue>?) tags
Any thoughts before I move forward?
Sounds good to me, but I will have to look at the XML a bit before I give a definitive answer.
My first question would be: what is the difference between a main-venue and a joint-venue? For example, which one would be main and which one would be joint in https://aclanthology.org/volumes/P09-1/ ?
If there is no difference between the two, maybe we only need 1-N venue tags and no distinction between main and joint.
@akoehn The main venue is the one we would have assigned if it were a new-style ID. So for example, https://aclanthology.org/volumes/2021.acl-long/ has main-venue=acl and joint-venue=ijcnlp.
But since the event is joint, maybe we shouldn't be making this primary / secondary distinction. I guess I am starting to think we should take the simpler route, and just use <venue> for main and joint venues and <colocated> for colocated events. This is Marcel's original suggestion.
Not easy to wrap my head around this after such a long pause, but also fun. :sweat_smile:
I agree that joint venues should be "equals"; if someone were to look for "IJCNLP" events, the 2021.acl-* volumes should show up exactly the same way as if someone were looking for "ACL" events. The concept of "joint venues" via joint.yaml is really more of a crutch because we currently assign exactly one venue implicitly through the ID. If we make the venue association explicit, we can just have multiple <venue> tags.
Yes, it's hard to swap it back in.
I have finished adding <venue> tags to all events. This links every volume to its venue. The tag is mandatory in every <meta> block.
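As an illustration of how explicit <venue> tags make joint venues equals, the sketch below indexes volumes by every venue they list and rejects a <meta> block with no <venue> tag. The XML is a simplified, made-up collection, and index_by_venue is a hypothetical helper rather than code from this PR.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Simplified, made-up collection; real Anthology XML carries far more metadata.
COLLECTION_XML = """
<collection id="2021.acl">
  <volume id="long">
    <meta>
      <venue>acl</venue>
      <venue>ijcnlp</venue>
    </meta>
  </volume>
  <volume id="srw">
    <meta>
      <venue>acl</venue>
    </meta>
  </volume>
</collection>
"""


def index_by_venue(xml_text):
    """Map each venue slug to the full IDs of the volumes that list it."""
    root = ET.fromstring(xml_text)
    collection_id = root.get("id")
    index = defaultdict(list)
    for volume in root.iter("volume"):
        full_id = f"{collection_id}-{volume.get('id')}"
        venues = [v.text.strip() for v in volume.findall("./meta/venue") if v.text]
        if not venues:
            # The tag is mandatory, so a missing one is treated as an error.
            raise ValueError(f"{full_id}: <meta> has no <venue> tag")
        for slug in venues:
            index[slug].append(full_id)
    return dict(index)


index = index_by_venue(COLLECTION_XML)
print(index)
```

Because every listed venue receives the volume, looking up "ijcnlp" returns the joint volume exactly as looking up "acl" does.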
I am thinking now, though, that a <colocated> tag is the wrong approach. Instead of volumes pointing to events, I am going to make an explicit file for every event under data/events/{event_name}.xml. This event will link to all the volumes in the event.
So instead of volumes pointing to events, events will specify what volumes were presented there. I think this is more direct, even if it means that a volume (in its XML) doesn't know firsthand what events it was presented in.
Sounds good @mjpost, listing the volumes in the events also has the benefit that it can directly encode the ordering in which they are supposed to be shown (or is that not a thing?).
Also want to congratulate you on this commit message: merged split_joint.py
This is my kind of humour :-)
All right, @mbollmann and @akoehn, this has been a lot of churn, and I realize you two are maybe not in the thick of it. At the same time I'm hoping to get a sign-on so that I can merge this before master drifts too much.
Here are a few outstanding things I plan to fix before merging (you can see them on the preview):
- [x] Currently, on the front page, the slug is being printed, instead of the acronym. I have to figure out how to fix that.
- [x] We need workshop event pages for all years for which there are workshops
- [x] There are a few more missing associations that were tricky to convert from the implicit representation
I expect this merge will not be perfect and that we will have to fix a few things in the coming months, as we become aware of them. But it's a lot easier to do. Here is a summary of the changes:
- Venues are now in individual files under data/yaml/venues. They contain the same info.
- Every volume now has to list its main venue association, including new-style volumes. There can be one or more associations, allowing a true / top-level notion of joint events (except for the fact that the Anthology ID has to choose one of the venue identifiers).
- Events are now explicit, represented with an <event> block at the top of a collection. All volumes with the same collection ID as the event are automatically added. In addition, the event can specify colocated volumes with the <colocated> tag.
- The Anthology class now builds an EventIndex to support this explicit representation. Much of the code is simplified, since both main venues and events are now explicit.
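A rough sketch of what such an index might look like, assuming one <colocated> tag per volume ID inside the <event> block. EventIndexSketch and the sample collections are invented for illustration and are not the Anthology's actual EventIndex.

```python
import xml.etree.ElementTree as ET

# Made-up collections; assumes one <colocated> tag per volume ID inside <event>.
COLLECTIONS = [
    """<collection id="2022.acl">
         <event id="acl-2022">
           <colocated>2022.findings-acl</colocated>
         </event>
         <volume id="long"/>
         <volume id="short"/>
       </collection>""",
    """<collection id="2022.findings">
         <volume id="acl"/>
       </collection>""",
]


class EventIndexSketch:
    """Maps event IDs to volume IDs (illustrative only)."""

    def __init__(self):
        self.events = {}

    def add_collection(self, xml_text):
        root = ET.fromstring(xml_text)
        event = root.find("event")
        if event is None:
            return  # a collection without an <event> block defines no event
        collection_id = root.get("id")
        # Volumes in the event's own collection are added automatically.
        volumes = [f"{collection_id}-{v.get('id')}" for v in root.findall("volume")]
        # Colocated volumes from other collections are listed explicitly.
        volumes += [c.text.strip() for c in event.findall("colocated") if c.text]
        self.events[event.get("id")] = volumes


event_index = EventIndexSketch()
for xml_text in COLLECTIONS:
    event_index.add_collection(xml_text)

print(event_index.events)
```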
- There is still the venue slug instead of the acronym on
  - the volume pages
  - the paper pages
  - the venue pages (in the heading)
  I think we want the acronym everywhere, right?
- Should events also be listed (in a similar way to the venues) on the paper/volume pages?
- There's a new function read_leaves() in utils.py which is imported in venues.py but not actually used there (nor anywhere else). Is this a leftover?
I've had a brief look through the code and didn't notice any major issues otherwise.
Oh, thanks, I missed these. Should all be addressed, assuming I didn't break anything.
FTR: the venue data structures are now all keyed by slug instead of acronym. So it's a simple change.
I think this is ready for review (I'll merge). I've covered all the issues I can see. There may be some stray ones, but I'd like to get this merged and then deal with them as they arise, so we can get ready for EMNLP and AACL.
Build successful. Some useful links:
- Complete site preview: https://preview.aclanthology.org/venue-listing-fix
- Potential volumes of interest: 1952.earlymt-1, 1956.earlymt-1, 1957.earlymt-1, 1960.earlymt-nsmt, 1960.earlymt-fsmtw, 1961.earlymt-1, 1962.earlymt-1, 1963.earlymt-1, 1971.earlymt-1, 1976.earlymt-1, 1977.ws-ws-1977, 1978.tc-1, 1979.ws-ws-1979, 1980.tc-1, 1981.tc-1, 1981.ws-ws-1981, 1982.tc-1, 1983.tc-1, 1983.ws-ws-1983, 1984.bcs-1, 1984.tc-1, 1985.tc-1, 1985.tmi-1, 1985.ws-ws-1985, 1986.tc-1, 1987.mtsummit-1, 1987.tc-1, 1987.ws-ws-1987, 1988.tc-1, 1988.tmi-1, 1989.mtsummit-1, 1989.tc-1, 1989.ws-ws-1989, 1990.tc-1, 1990.ws-ws-1990, 1991.iwpt-1, 1991.mtsummit-papers, 1991.mtsummit-panels, 1991.tc-1, 1991.ws-ws-1991, 1992.tc-1, 1992.tmi-1, 1993.eamt-1, 1993.iwpt-1, 1993.mtsummit-1, 1993.rocling-rocling-1993, 1993.tc-1, 1993.tmi-1, 1993.ws-ws-1993, 1994.amta-1, (plus 2696 more...)
This preview will be removed when the branch is merged.