David Chiang
As in the fake example above (I can look for a real example later if needed), the title is in Chinese with some Latin characters mixed in. Since Chinese...
In CCL 2020, for example, “self-attention” was used.
I'm currently using [Tika](https://tika.apache.org) to extract author names from PDFs. It works very well on modern PDFs, but not so well on the older PDFs (roughly, 2000 and earlier). Unfortunately,...
Do we know how ParsCit compares with GROBID?
I just checked -- indeed, his name in START is just `Jan Hajic`.
Missing diacritics are a widespread problem that we hoped to sidestep by allowing name variants. I suppose one could write a scraper to try to detect them. If...
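To make the detection idea concrete, here is a minimal sketch of the comparison step, assuming the names have already been collected somehow (from START, the XML, or the PDFs). The function names `strip_diacritics`, `differs_only_by_diacritics`, and `find_diacritic_variants` are hypothetical and not part of any existing script; the sketch only uses Python's standard `unicodedata` module.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text: str) -> str:
    """Drop combining marks after NFD decomposition, e.g. 'Hajič' -> 'Hajic'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def differs_only_by_diacritics(a: str, b: str) -> bool:
    """True if the two spellings differ, but only in their diacritics."""
    a_nfc = unicodedata.normalize("NFC", a)
    b_nfc = unicodedata.normalize("NFC", b)
    return a_nfc != b_nfc and strip_diacritics(a_nfc) == strip_diacritics(b_nfc)


def find_diacritic_variants(names):
    """Group names by their diacritic-free form and keep groups with 2+ spellings."""
    groups = defaultdict(set)
    for name in names:
        groups[strip_diacritics(name)].add(unicodedata.normalize("NFC", name))
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Hypothetical input: the same person entered with and without the háček.
candidates = find_diacritic_variants(["Jan Hajic", "Jan Hajič", "Matt Post"])
```

Each group in `candidates` is a set of spellings that someone would still need to check by hand, since the script cannot tell which variant is correct. Letters that have no canonical decomposition (for example `ø` or `ł`) are not folded by this approach and would need an explicit mapping.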
I adapted the `auto_first_names.py` script and am running it on L18 now. It's catching quite a few errors: not just the one @nschneid pointed out, but removing _extra_ accents, decapitalizing...
In L18 (528 papers, wow), the script made 150 changes (also wow) and printed another 100+ warnings that usually indicate a typo or missing word. The automatic changes are easy...
What system does LREC use to fill metadata? Do they use START also? I'm running the script on L16 now (for #341) and seeing some PDF/XML mismatches that are the...
@mjpost what are your thoughts about editing XML to match PDF in these cases where the PDF has _less_ information than the current XML: 1. XML currently has `Matt Post`,...