aclpub2 Anthology build information

(Generally using this as a dumping ground for Anthology needs/wants, will sort out and clean up eventually)

While it's fresh (ACL 2023 ingestion) I want to make a few notes about the build format.

First, my understanding. I haven't used the software, but it seems that ACLPUB2 takes its input files and produces output in a build directory, similar to, say, CMake. I like this design. The output has the constructed book and watermarked PDFs, among other things. It also seems to create other directories; in addition to build, there is output, output/inputs, and sometimes some other directories, with some redundancy in files.

For ingestion, it would be helpful if you delivered to us a single build directory. It will help to know how ingestion works. We run ingest_aclpub2.py with the following syntax:

ingest_aclpub2.py -i ACLPUB2_DIR

It looks in this directory for the files papers.yml and conference_details.yml, which contain all the necessary metadata. It then reads through the papers in order, assigning Anthology IDs starting with 1 (0 is reserved for frontmatter), and copying and renaming PDFs for upload.
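
To make that numbering and renaming step concrete, here is a minimal sketch. It is not the actual ingest_aclpub2.py code. It assumes the entries from papers.yml have already been parsed into a list of dicts, and that each entry has a `file` key holding the PDF path; that key name, the function names, and the output naming scheme are all assumptions for illustration.

```python
import shutil
from pathlib import Path


def plan_ingestion(papers, build_dir, volume_prefix):
    """Pair each paper with a sequential ID, starting at 1.

    ID 0 is reserved for the front matter, so papers begin at 1.
    `papers` is a list of dicts already loaded from papers.yml;
    the "file" key is an assumed field name, not the real schema.
    """
    plan = []
    for paper_num, paper in enumerate(papers, start=1):
        source = Path(build_dir) / paper["file"]
        # Hypothetical naming scheme: <prefix>.<number>.pdf
        target_name = f"{volume_prefix}.{paper_num}.pdf"
        plan.append((paper_num, source, target_name))
    return plan


def copy_pdfs(plan, upload_dir):
    """Copy and rename the PDFs that exist; skip the ones that do not."""
    upload_dir = Path(upload_dir)
    upload_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for _paper_num, source, target_name in plan:
        if source.is_file():
            shutil.copyfile(source, upload_dir / target_name)
            copied.append(target_name)
    return copied
```

Because the IDs come purely from the order of the entries, the order of papers.yml has to be final before ingestion.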

Here is a list of problems that the format delivered to us creates.

The information we need seems actually to be spread around a number of different directories: the root directory, build, output, output/inputs, etc.
These two YAML files are not present in the build directory, but in the root directory, and also in output/inputs/.
The papers.yml that we read should have PDF and attachment links that are relative paths, so that we can read them directly with no hidden or hard-coded assumptions.
It seems there is no consistent format for the front matter or for the full proceedings (or are these in build/front_matter.pdf and build/proceedings.pdf?)

Here is what would be ideal for us:

[x] Create a build directory for the Anthology. This could be the current build/, or a new directory, say anthology/
[ ] The files papers.yml and conference_details.yml should be copied to this directory's root
[x] There should be a watermarked_pdfs directory inside it. I think this already exists. Since it is part of the build, it could also just be called pdfs/ (the idea being that this subdir contains "built" PDFs, i.e., watermarked ones)
[x] Same for attachments: there should be an attachments/ subdirectory.
[x] Ideally front_matter.pdf and proceedings.pdf (if built) will be in the build directory, too, so we can test for their presence
[ ] All paths in the copied papers.yml should be updated to be relative to the build directory, and ideally should be contained within the build directory
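
The last item (relative paths contained within the build directory) could be handled along these lines. This is only a sketch: it assumes the entries are already parsed into dicts and that the path lives under a top-level `file` key, which is a guess at the schema rather than something confirmed here. The function name `relativize_paths` is also made up.

```python
import os


def relativize_paths(papers, build_dir, keys=("file",)):
    """Return copies of the entries with paths made relative to build_dir.

    Raises ValueError if a path resolves to somewhere outside build_dir,
    since the goal is for everything to be contained in the build directory.
    The default key name "file" is an assumption about papers.yml.
    """
    build_dir = os.path.abspath(build_dir)
    updated = []
    for paper in papers:
        entry = dict(paper)  # shallow copy so the input is left untouched
        for key in keys:
            if key not in entry:
                continue
            # Relative inputs are interpreted relative to build_dir.
            absolute = os.path.abspath(os.path.join(build_dir, entry[key]))
            relative = os.path.relpath(absolute, build_dir)
            if relative == os.pardir or relative.startswith(os.pardir + os.sep):
                raise ValueError(f"{entry[key]} is outside {build_dir}")
            entry[key] = relative
        updated.append(entry)
    return updated
```

Failing loudly on paths outside the build directory is a design choice here; silently keeping them would bring back the hidden assumptions this list is trying to remove.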

Here is an example layout that would work extremely well for us:

aclpub2/
    build/
        front_matter.pdf
        proceedings.pdf
        papers.yml   # updated with relative paths for attachments and PDFs
        conference_details.yml  # this could probably be unchanged
        pdfs/
            1.pdf
            ...
        attachments/
            49_software.zip
            17_dataset.tgz
            ...

Note that if PDFs were missing, we would still ingest the metadata. So this would allow us, for example, to introduce a three-stage ingestion:

(T minus two months) Camera-readies: deliver a file, we create a live Anthology page with all papers assigned metadata, but no PDFs available yet
(T minus two weeks) PDFs and attachments: these are delivered two weeks before the conference and made available
(T minus one day) Front matter and handbook: these touch the real world and are often delayed by last-minute adjustments, so we consume them at the last minute.

This would just be for *ACL main conferences; workshops would have to stick to a single-stage ingest.

Jul 08 '23 19:07 mjpost

I just noticed the format on the main page. It seems that in fact many of these items are already checked off. The main missing components are (a) to copy over the two yml files to the build directory and (b) to make linked paths relative.

It might be nice to add a Makefile so people could just type make build or make anthology. That would create the build directory and possibly also a tarball to share.
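
The tarball half of that idea is small enough to sketch with the Python standard library. The function name and the choice of gzip compression are assumptions; nothing like this exists in aclpub2 as far as I know, and a Makefile target could simply call a script like it.

```python
import tarfile
from pathlib import Path


def make_anthology_tarball(build_dir, tarball_path):
    """Pack the whole build directory into a gzipped tarball.

    The archive keeps the directory's own name as its top-level folder,
    so unpacking it recreates e.g. build/papers.yml, build/pdfs/...
    """
    build_dir = Path(build_dir)
    with tarfile.open(tarball_path, "w:gz") as tar:
        tar.add(build_dir, arcname=build_dir.name)
    return Path(tarball_path)
```

The tarball should be written outside the build directory, otherwise the archive would try to include itself.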

Jul 13 '23 15:07 mjpost

Another issue: we need the copyright forms. Ideally these would be placed in a parallel folder, copyright/.

Jul 17 '23 14:07 mjpost

There's also a point of confusion: many workshop organizers seem to think that we do the building. We (note to self) need a validation script that tells them whether their tarball is up-to-spec, and if not, what the likely reasons are.
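
A first cut of such a validation script might look like the sketch below. It checks an unpacked directory against the layout proposed earlier in this thread (papers.yml, conference_details.yml, a PDF directory, optional front_matter.pdf and proceedings.pdf). The function name, the message wording, and the split into errors and warnings are assumptions, not an agreed spec.

```python
from pathlib import Path

REQUIRED_FILES = ("papers.yml", "conference_details.yml")
OPTIONAL_FILES = ("front_matter.pdf", "proceedings.pdf")
PDF_DIR_NAMES = ("pdfs", "watermarked_pdfs")  # either name is accepted here


def check_build_dir(build_dir):
    """Return a list of human-readable problems; an empty list means OK."""
    build_dir = Path(build_dir)
    if not build_dir.is_dir():
        return [f"error: {build_dir} is not a directory; was the build run?"]

    problems = []
    for name in REQUIRED_FILES:
        if not (build_dir / name).is_file():
            problems.append(
                f"error: {name} is missing; copy it to the build directory root"
            )
    if not any((build_dir / name).is_dir() for name in PDF_DIR_NAMES):
        problems.append(
            "error: no PDF directory found; expected one of "
            + ", ".join(PDF_DIR_NAMES)
        )
    for name in OPTIONAL_FILES:
        if not (build_dir / name).is_file():
            problems.append(f"warning: {name} not found (fine if not built yet)")
    return problems
```

Missing front matter or proceedings only produce warnings, since the metadata can still be ingested without them.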

I really do like the build format and the use of GitHub repos for each workshop! It makes it quite easy to find everything.

Jul 21 '23 13:07 mjpost

It seems many people are confused about who's going to do the building (them or us). Having a Makefile target (make anthology) would help clarify this.

Jul 21 '23 18:07 mjpost

Many workshops use "main" for the volume_name; it should be 1 by convention.

Jul 21 '23 18:07 mjpost

We need to add examples where people can list the SIG so it gets automatically back-linked.

Jul 26 '23 13:07 mjpost

We should build automatically using GitHub Actions, from a raw format (will add watermarks).

Jul 26 '23 13:07 mjpost
