Extract abstracts from PDF
The anthology currently only shows the abstracts if there is an authoritative version in the XML. It would be nice if we could scrape the PDF using some off-the-shelf software to extract the abstracts and dump them into a different file (to not tamper with handcrafted information). Having an abstract on the web pages makes quickly searching through literature much faster.
I'm currently using Tika to extract author names from PDFs. It works very well on modern PDFs, but not so well on the older PDFs (roughly, 2000 and earlier). Unfortunately, it's also the older PDFs that lack abstracts.
Here's a file with automatically extracted abstracts.
I thought I'd try extracting abstracts from the ACL Anthology Reference Corpus. Concretely, I used the March 2016 version of the ParsCit XML and:
- Looked for a `<sectionHeader genericSection="abstract">`
- Extracted all `<bodyText>` elements that come after it (until the next `sectionHeader`)
- Removed end-of-line hyphenation (e.g. `partici- pation` → `participation`) when the merged token is a recognized English word
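For reference, the extraction logic boils down to something like this (a minimal sketch, not the actual script; the hyphenation step is left out, and the element/attribute handling may need adjusting for real ParsCit output):

```python
# Minimal sketch of the ParsCit-based extraction described above.
# Assumes the March 2016 ParsCit XML layout with <sectionHeader> and
# <bodyText> elements; names may need adjusting for the real files.
import xml.etree.ElementTree as ET

def extract_abstract(parscit_xml_path, max_words=500):
    root = ET.parse(parscit_xml_path).getroot()
    parts, in_abstract = [], False
    for node in root.iter():
        if node.tag == "sectionHeader":
            if node.get("genericSection") == "abstract":
                in_abstract = True        # abstract starts here
            elif in_abstract:
                break                     # next section header ends it
        elif in_abstract and node.tag == "bodyText":
            parts.append(" ".join((node.text or "").split()))
    text = " ".join(parts)
    if not text or len(text.split()) > max_words:
        return None                       # missing or suspiciously long -> skip
    return text
```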
Some stats:
- 86 files were skipped due to already having an abstract in the Anthology XML
- 1740 files were skipped due to having no parsed "abstract" section
- 929 files were skipped due to having "abstracts" longer than 500 words, which I took to be indicative of a parsing error
- 65 files were skipped due to other errors I haven't looked into yet
- 18,627 files had an abstract successfully extracted
This process works well for many files, but also produces silly results in some cases. The most common problem appears to be the parser not correctly identifying the "Abstract" section. Still, maybe we could use this as a starting point?
Do we know how ParsCit compares with GROBID?
I don't; maybe @knmnyn knows?
I briefly tried Tika on a couple of cases that my extraction process got wrong, and it handled them better. Maybe we could combine pipelines – run Tika on all our PDFs, and if it matches what we get from the ACL ARC, treat it as trustworthy enough to add it; manually check only the remaining cases.
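The agreement check could be as simple as the following (a sketch only; the similarity threshold is an arbitrary choice, not something tuned):

```python
# Sketch of the proposed "trust it if two pipelines agree" check.
import difflib
import re

def normalize(text):
    # Collapse whitespace and case so layout differences don't count as disagreement.
    return re.sub(r"\s+", " ", text).strip().lower()

def abstracts_agree(tika_abstract, arc_abstract, threshold=0.95):
    ratio = difflib.SequenceMatcher(
        None, normalize(tika_abstract), normalize(arc_abstract)
    ).ratio()
    return ratio >= threshold  # below the threshold -> flag for manual checking
```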
@mbollmann @davidweichiang: GROBID is much more functional than my group's legacy tool, being able to ingest PDFs natively. My group is trying to catch up and build a Tika pipeline for feeding data into an NN (word+char embedder + BiLSTM + CRF) pipeline for extraction. Any hints would be welcome, and my group could definitely add some effort towards this task! Definitely of mutual interest!
Hello everyone. I am Abhinav Ramesh Kashyap, a PhD student at NUS with Prof Min. As he mentioned in the previous post, we have been working on reliable pipelines for scientific document processing, and we have a framework called SciWING. You can check out SciWING at sciwing.io.
I have developed a solution to extract abstracts from PDFs. It reads a PDF using PDFBox and classifies the lines of the document (a GloVe + ELMo + BiLSTM network for now). I have attached a screenshot here; the other screenshots are available at https://github.com/abhinavkashyap/sciwing/tree/master/screenshots/acl_anthology_abstracts.
Please let us know how we can further our efforts to help the ACL Anthology. I am trying to understand the specifics of this issue:
- Are you looking to extract the abstract for all the papers in the ACL Anthology?
- Or is there a pipeline that detects which papers have no abstract in the XML metadata, so that only those specific PDFs need to be parsed?
- Is there a specific format in which the team expects the abstracts? If you could point me to resources I should refer to in order to contribute to the ACL Anthology, please let me know.
SAMPLE OUTPUT
ACL ANTHOLOGY PAPER: https://www.aclweb.org/anthology/W19-4505/
ABSTRACT
In this work we propose to leverage resources available with discourse-level annotations to facilitate the identification of argumentative components and relations in scientific texts, which has been recognized as a particularly challenging task. In particular, we implement and evaluate a transfer learning approach in which contextualized representations learned from discourse parsing tasks are used as input of argument mining models. As a pilot application, we explore the feasibility of using automatically identified argumentative components and relations to predict the acceptance of papers in computer science venues. In order to conduct our experiments, we propose an annotation scheme for argumentative units and relations and use it to enrich an existing corpus with an argumentation layer.1
Great! We need plaintext abstracts for the papers that have no abstract in the XML file.
For example, I could provide you with a list of pdf urls that need an abstract and you could provide a plaintext abstract for each of these.
Okay sure. Please send me the files. I will run it through the system and provide the plain text abstracts 👍
To re-create this file, use this command in the `data/xml` directory:
xmlstarlet sel -t -m '//paper[not(abstract)]' -v $'concat(url, "\n")' *xml | sed '/http/! s|\(.*\)|http://www.aclweb.org/anthology/\1.pdf|' > no-abstract.txt
These are about 40k files, so you may not want to run all of them at once ... no-abstract.txt
Thank you for this! Will try this and get back to you soon.
Update
Hi @akoehn, I have just run a hundred PDFs through our system and I am attaching the abstracts here. Overall, the abstracts look okay. However, here are some problems that I encountered with these PDFs:
- With Tika and PDFBox, extra spaces are introduced between words, which will cause problems for tokenisation; for example, `abstract` appears as `a b s t r a c t`. This is a known issue, and for now I follow some heuristics to clean out the spaces in lines, which seems to work okay (a rough sketch of such a cleanup follows below). I have noticed that the older PDFs have this problem.
- Some of the PDFs are in other languages, like Russian and Arabic. The system fails to identify these.
- Sometimes there is no abstract in the PDF, e.g. A00-2010.pdf.
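A very rough sketch of the kind of space cleanup meant in the first point (illustrative only, not the exact rules used in the system):

```python
# Join long runs of single-letter tokens, e.g. "a b s t r a c t" -> "abstract",
# while leaving short runs ("a new model") untouched.
def collapse_spaced_letters(line, min_run=4):
    out, run = [], []
    for tok in line.split():
        if len(tok) == 1 and tok.isalpha():
            run.append(tok)
        else:
            out.extend(["".join(run)] if len(run) >= min_run else run)
            run = []
            out.append(tok)
    out.extend(["".join(run)] if len(run) >= min_run else run)
    return " ".join(out)

# collapse_spaced_letters("a b s t r a c t") -> "abstract"
# collapse_spaced_letters("a new model")     -> "a new model"
```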
I am attaching the extracted abstracts and a log file here. Please let me know if this is satisfactory and I will run through the other pdfs.
In the meantime, I can discuss with you and @knmnyn how to annotate more data and make the system more robust.
First of all, thanks for offering your help @abhinavkashyap! That SciWING pipeline looks really cool.
I've clicked through a few of the abstracts and observed that several of those looked better in the file I generated from ACL-ARC, e.g. A00-1002, A00-1008, A00-1013, ...
That said, the files that you picked are also among the more challenging ones I'd think. It would be interesting to look at some results for P16-* papers, for example, which should be easier to extract text from (since they're almost all LaTeX-generated) and are also missing abstracts in the Anthology.
Thanks for this, @mbollmann, and for the suggestion to run it on the P16-* series of papers. I will give that a try and let you guys know.
@abhinavkashyap, do you just run the PDFs through SciWing to get the abstracts or is there more pre-/post-processing involved? I'm asking because I have a simple pipeline now of manually written heuristics to detect the abstract in Tika output (which I started working on before you offered your help), and am wondering if there's potential to pool our resources to get the best result possible.
Hi @mbollmann. I just read the PDF and run SciWING on it. I check for a line classified as a `section header` named "abstract" and collect all the lines until I find another `section header` (see the sketch below). There is not much post-processing either: I remove hyphenation at the end of lines, in no intelligent way. It would be good to pool our methods if it helps to get the abstracts for all 40k PDFs.
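A rough sketch of that collection step (illustrative only; the label names are made up and are not SciWING's actual output format):

```python
# Collect the lines between the "abstract" header and the next section header,
# assuming each PDF line has already been labelled by the line classifier.
def collect_abstract(labelled_lines):
    """labelled_lines: iterable of (text, label) pairs in document order."""
    collected, in_abstract = [], False
    for text, label in labelled_lines:
        if label == "section-header":
            if text.strip().lower() == "abstract":
                in_abstract = True
                continue
            if in_abstract:
                break                      # next header ends the abstract
        elif in_abstract:
            collected.append(text.strip())
    return " ".join(collected)
```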
I also saw your approach of using the ACL ARC, which is labelled by Neural ParsCit. I think I will use the ACL ARC data to further train SciWING. Right now SciWING is trained on a very small amount of hand-annotated data; the ACL ARC dataset can serve as pseudo-labels to improve performance.
@mbollmann @akoehn: I ran the system on all P16-* papers as suggested by @mbollmann. I am attaching the extracted abstracts here. It took around 10 seconds on the GPU to extract one abstract and around 1 hour to extract everything. All the abstracts look pretty okay. Do let me know your thoughts. Thanks
https://drive.google.com/file/d/17lL_sh0ylj0yLr39DXUWtzHZgYe3_My3/view?usp=sharing
Thanks, Abhinav. Did you see any problems with the extraction?
- M
Hello Prof Min. I didn't encounter any computational problems with the P16-* papers. The extracted abstracts look okay for now. The problem is with the older PDFs. The machine learning model is not robust to noise in the data.
@mbollmann, @akoehn: Any input on the P16 abstracts? After talking with @abhinavkashyap, we think you'd be able to tell us whether there's a good way to combine the SciWING output and the Tika output to get the best result.
Where @abhinavkashyap needs the most help is in creating clean(er) plain text from the PDFs. Those quirks like "A B S T R A C T" and other hyphenation or mis-recognition issues may be the bottleneck behind most of the errors. We believe the abstract extraction itself is solvable.
I can look at the abstracts sometime later this week, and will also compare them to what my simple pipeline produces.
For dealing with hyphenation, I currently have a simple heuristic based on the `wordfreq` package: hyphenation is removed iff the un-hyphenated word exists (has frequency > 0) and has a higher frequency than the components on their own. It seems to work well, but I'll have to take a closer look.
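In code, that rule is roughly the following (a sketch; `dehyphenate` is just an illustrative name, and comparing against the less frequent of the two parts is one possible reading of "the components on their own"):

```python
from wordfreq import word_frequency

def dehyphenate(first, second, lang="en"):
    """Decide whether 'first-' + 'second' at a line break should be merged."""
    merged = first + second
    merged_freq = word_frequency(merged, lang)
    part_freq = min(word_frequency(first, lang), word_frequency(second, lang))
    # Merge iff the joined form is a known word and more frequent than its parts.
    return merged if merged_freq > 0 and merged_freq > part_freq else first + "-" + second

# dehyphenate("partici", "pation") -> "participation"
# dehyphenate("phrase", "based")   -> "phrase-based"
```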
For getting cleaner text from the PDFs, I don't have a ready-made solution. Have you looked at why the A00 abstracts appear to be worse than what I got from ParsCit? Would it make sense to somehow utilize the ParsCit versions (from ACL ARC) as a second signal?
Other than that, OCR post-correction is a thing, right? Surely NLP must have produced a tool somewhere that can help with this... :-) If no-one has any concrete pointers, I can also do some research here later this week or the next.
Just some background: for older PDFs in the Anthology (especially ones that were scanned in as rasters), I ran Adobe Acrobat on the original PDF sources to insert a machine-readable layer. I'm pretty sure I replaced the original documents with these enhanced ones, so the current Anthology and the ACL ARC should have exactly the same PDF files.
But the text extraction in the ACL ARC is better; that is because we used a commercial OCR system (Nuance's OmniPage, then version 15) to extract the text. It was brutal because it had to be run on a Windows Server pipeline that crashed unpredictably.
@abhinavkashyap if you want to use the text from the ACL ARC, you can just take it directly from the directory structure there. We did try to organize it well so it should be pretty transparent. The canonical v2 of the ACL ARC still sits on the VM at acl-arc.comp.nus.edu.sg .
FWIW, I don't think attaching the P16 abstracts to your issue worked, @abhinavkashyap
Yes, I see that it didn't work. Funny, I tried the link a few days back and seem to recall it working. @abhinavkashyap, perhaps you can try again or put up a link to an open GDrive file?
Re the hyphenation, I browsed through P16 and my approach currently fails for some words that `wordfreq` apparently doesn't know about:

- annota-tors
- corefer-ence
- geospa-tional
- la-belers
- reg-ularizer
- rerank-ing
- sum-marizer
- system-aticity

It also fails with proper nouns:

- MaltOpti-mizer
- Morfes-sor

But in general, the cases where the hyphenation is correctly not removed (phrase-based, context-aware, low-resource, word-level, ...) are the vast majority.
@mbollmann @knmnyn Here is the link for the abstracts https://drive.google.com/file/d/17lL_sh0ylj0yLr39DXUWtzHZgYe3_My3/view?usp=sharing
Thanks @abhinavkashyap! They generally look very good to me. I compared them with my own Tika pipeline, and they're mostly identical, and also appear to have the same problems; e.g., the footnotes in P16-1036 are interpreted as being part of the abstract.
My thoughts on how to proceed: I've now extracted 38k+ abstracts from a combination of ACL ARC and my own Tika pipeline. I think it would make sense to compile a list of volumes where this approach produces many bad results, and then focus our efforts on those and see if we can improve them with SciWING. However, this means I have 330 XML files to skim through, so it might take me a bit :)
Thanks @mbollmann . Please let me know when there is an update on the situation 👍 :)
Hi @mbollmann and @davidweichiang. Do we have any updates on this?