David Chiang comments

Results: 88 comments by David Chiang

Fixed-caser makes mistakes with mixed-script titles

As in the fake example above (I can look for a real example later if needed), the title is in Chinese and has some Latin characters mixed in. Since Chinese...

Fixed-caser makes mistakes with mixed-script titles

In CCL 2020, for example, “self-attention” was used.

Extract abstracts from PDF

I'm currently using [Tika](https://tika.apache.org) to extract author names from PDFs. It works very well on modern PDFs, but not so well on the older PDFs (roughly, 2000 and earlier). Unfortunately,...

Extract abstracts from PDF

Do we know how ParsCit compares with GROBID?

Correction: Diacritics missing from author name though present in PDF

I just checked -- indeed, his name in START is just `Jan Hajic`.

Correction: Diacritics missing from author name though present in PDF

Missing diacritics are a widespread problem that we hoped to sidestep by allowing name variants. I suppose one could write a scraper to try to detect them. If...

Correction: Diacritics missing from author name though present in PDF

I adapted the `auto_first_names.py` script and am running it on L18 now. It's catching quite a few errors; not just the one @nschneid pointed out, but removing _extra_ accents, decapitalizing...
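
For readers unfamiliar with this kind of check, the core comparison can be sketched as follows. This is not the adapted `auto_first_names.py` script, whose contents are not shown here; the helper names `strip_diacritics` and `differs_only_in_diacritics` are hypothetical, and the sketch assumes the name has already been extracted from both the XML and the PDF.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD).

    Note: letters such as "ø" or "ł" have no canonical decomposition,
    so they pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def differs_only_in_diacritics(xml_name: str, pdf_name: str) -> bool:
    """True when the two names differ, but only in their diacritics."""
    # Compare in NFC first so that composed and decomposed spellings of
    # the same name are not reported as a mismatch.
    if unicodedata.normalize("NFC", xml_name) == unicodedata.normalize("NFC", pdf_name):
        return False
    return strip_diacritics(xml_name) == strip_diacritics(pdf_name)


# Illustrative pair: the metadata spelling versus the accented spelling.
print(differs_only_in_diacritics("Jan Hajic", "Jan Hajič"))
```

Because the comparison is symmetric, it flags both missing accents and extra accents; deciding which side is correct still needs the PDF or a human check.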

Correction: Diacritics missing from author name though present in PDF

In L18 (528 papers, wow), the script made 150 changes (also wow) and printed another 100+ warnings that usually indicate a typo or missing word. The automatic changes are easy...

Correction: Diacritics missing from author name though present in PDF

What system does LREC use to fill metadata? Do they use START also? I'm running the script on L16 now (for #341) and seeing some PDF/XML mismatches that are the...

Correction: Diacritics missing from author name though present in PDF

@mjpost what are your thoughts about editing XML to match PDF in these cases where the PDF has _less_ information than the current XML: 1. XML currently has `Matt Post`,...