David Chiang
There's already some code to match against existing author names. It could be updated and improved, and that might address this problem partly. I've suggested in the past that we...
Would it be too specific to apply your heuristic only to names that are written in Pinyin, which is very easy to check?
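For illustration, here is a rough sketch of what such a Pinyin check might look like. It assumes toneless Pinyin written without spaces or apostrophes, and it only tests whether a name part can be split into initial + final pairs. The names `looks_like_pinyin`, `INITIALS`, and `FINALS` are hypothetical, not anything in the Anthology code, and the pattern is deliberately loose: it accepts some combinations that are not real syllables (e.g. "fong") and some non-Chinese names (e.g. "Lee").

```python
import re

# Assumption: a loose approximation of Pinyin, not an exact syllable table.
# Two-letter initials come first so "zh", "ch", "sh" are tried before "z", "c", "s".
INITIALS = "zh|ch|sh|[bpmfdtnlgkhjqxrzcsyw]"
FINALS = (
    "iang|iong|uang|ang|eng|ing|ong|iao|ian|uai|uan|"
    "ai|ei|ao|ou|an|en|in|un|ia|ie|iu|ua|uo|ui|ue|er|[aeiou]"
)
SYLLABLE = f"(?:{INITIALS})?(?:{FINALS})"
PINYIN_RE = re.compile(f"(?:{SYLLABLE})+", re.IGNORECASE)


def looks_like_pinyin(name_part: str) -> bool:
    """Return True if name_part can be segmented into Pinyin-like syllables.

    Hypothetical helper: case-insensitive, so it works on all-caps input too.
    """
    return PINYIN_RE.fullmatch(name_part) is not None
```

With this sketch, `looks_like_pinyin("HONGYING")` and `looks_like_pinyin("Zan")` are true, while `looks_like_pinyin("Maxwell")` and `looks_like_pinyin("cmcc")` are false. A real filter would probably want an explicit list of valid syllables instead.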
We do have a script that scrapes from PDF. It is not run regularly, though. And sometimes authors use all caps in the PDF too. The Pinyin filter would be...
I agree that ideally this should happen earlier than ingestion into the Anthology, because names from START also appear in the conference website, handbook, etc.
I tried a simpler version of these heuristics on the EMNLP 2018 authors, and it worked perfectly except for one possible false positive (the first name "cmcc"). The heuristic is:...
FWIW, START does have a tool in the pub chair console for correcting case problems in both titles and authors. I don't know whether it is regularly used. It also...
Running this heuristic on the current Anthology authors yields [872 corrections](https://gist.github.com/davidweichiang/39767f2894709e5175190f72e9b0fad0). There are some false positives, though. Some seem fixable (MAXWELL III -> Maxwell Iii) but some seem tougher, especially...
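One way the "fixable" kind of false positive could be handled is sketched below. This is not the heuristic that produced the corrections above; it is a minimal illustration, assuming that only fully upper-case words need recasing and that a small, hand-picked set of Roman-numeral suffixes should stay upper-case. `ROMAN_SUFFIXES` and `fix_all_caps` are hypothetical names.

```python
# Assumption: a short list of generational suffixes is enough for illustration.
ROMAN_SUFFIXES = {"II", "III", "IV"}


def fix_all_caps(name: str) -> str:
    """Recase all-caps words in a name, keeping Roman-numeral suffixes upper-case.

    Hypothetical helper; words that already have mixed case are left untouched.
    """
    fixed = []
    for word in name.split():
        if word.upper() in ROMAN_SUFFIXES:
            # Keep (or restore) the suffix in upper case.
            fixed.append(word.upper())
        elif word.isupper() and len(word) > 1:
            # Lowercase everything after the first character.
            fixed.append(word.capitalize())
        else:
            fixed.append(word)
    return " ".join(fixed)
```

Here `fix_all_caps("MAXWELL III")` gives `"Maxwell III"` rather than `"Maxwell Iii"`. The sketch does not handle hyphenated or particle-bearing names (`"JEAN-PAUL"` would become `"Jean-paul"`), which is closer to the tougher cases.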
It would be a tedious process each time. I am hoping that START will incorporate something like this so we don't have to deal with it. But otherwise, it would...
I'm still trying to understand the issue, but what I think is that regardless of order, the author's id should be `hongying-yan`, their canonical name should be `Hongying Zan`, and...
If `` is another name-part alongside `` and ``, I think I agree with @mjpost that by default these should be considered different names: ``` HongyingZan HongyingZan红英昝 ``` It's exactly...