CoreNLP
parsing '`'
curl 'http://localhost:9000/?properties={%22annotators%22%3A%22lemma%22%2C%22outputFormat%22%3A%22json%22}' -d '`'
Gives me:
{
  "sentences": [
    {
      "index": 0,
      "tokens": [
        {
          "index": 1,
          "word": "`",
          "originalText": "`",
          "lemma": "`",
          "characterOffsetBegin": 0,
          "characterOffsetEnd": 1,
          "pos": "``",
          "before": "",
          "after": ""
        }
      ]
    }
  ]
}
This is with the standard English model. Is this expected? I'm particularly surprised at the POS tag.
Using 4.5.1
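For anyone who wants to reproduce this outside of curl, here is a minimal Python sketch of the same request. It assumes a CoreNLP server is already listening on localhost:9000, as in the curl command above; the helper names `build_url`, `annotate`, and `word_pos_pairs` are made up for this example and are not part of CoreNLP.

```python
import json
import urllib.parse
import urllib.request


def build_url(base="http://localhost:9000/", annotators="lemma", output_format="json"):
    """Return the server URL with the properties JSON percent-encoded."""
    props = json.dumps(
        {"annotators": annotators, "outputFormat": output_format},
        separators=(",", ":"),
    )
    return base + "?" + urllib.parse.urlencode({"properties": props})


def annotate(text, url=None):
    """POST raw text to a running CoreNLP server and return the parsed JSON."""
    request = urllib.request.Request(url or build_url(), data=text.encode("utf-8"))
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def word_pos_pairs(result):
    """Flatten a response into (word, pos) tuples, one per token."""
    return [
        (token["word"], token.get("pos"))
        for sentence in result.get("sentences", [])
        for token in sentence.get("tokens", [])
    ]


# With a server running, the request above would be:
#     print(word_pos_pairs(annotate("`")))
```

`word_pos_pairs` only reads the `sentences`, `tokens`, `word`, and `pos` fields that appear in the output above, so it can also be run on a saved copy of that JSON without a server.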
Ok, to be honest I wasn't entirely clear if this was a question about the
server interface or a question specifically about the model results, so I
just ignored it for a couple days. Looking back at it, I realize your main
question is about the POS tag, but I think this is expected behavior. The
PTB dataset turns the curvy open quotes “ into `` with the tag ``, and `
doesn't show up anywhere in the dataset but kinda looks like ``, so there it
is. EWT goes one step further and even has some single backticks tagged as
``.
So the end result is the tags we return are exactly the tags the
data teaches the model to use.
On Thu, Nov 3, 2022 at 10:46 PM Anton Melser @.***> wrote:
Using 4.5.1
Thanks for that. I'll confess I don't fully understand how the tags are set (are they model-specific? So, e.g., might the Chinese PKU and CTB models use different tags?).
I ended up doing a search in the code and found a list in what looked like a template file. It had what was supposed to be the TB tags (what I had found somewhere plus the punctuation tags) and the "web extension", if I understood correctly. There were some other tags mentioned from another set but I don't know if they are used anywhere. I guess if they are model specific it doesn't even make sense to put a list of tags on the site?
Honestly, we've had documentation of the various tagsets in the past, but I'm not sure we still have that available after updating some of the models to use the UD tagset. I will leave this issue open as a reminder to update that documentation, if you don't mind.
Agree that this performs according to spec, given the historical usage of the English Penn Treebank, etc. where ` is used for open quotes, and all open quotes including `, ``, “, etc. get POS tag ``.
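To make the convention concrete, here is a small Python sketch of the mapping described in this thread. It is an illustration only: the tagger is a trained model, not a lookup table, and the names `OPEN_QUOTE_FORMS` and `expected_open_quote_tag` are hypothetical, chosen for this example. The set holds just the three open-quote forms mentioned above.

```python
# Open-quote forms named in this thread: single backtick, double backtick,
# and the curly open quote (U+201C).
OPEN_QUOTE_FORMS = {"`", "``", "\u201c"}

# The Penn Treebank POS tag for an opening quotation mark.
PTB_OPEN_QUOTE_TAG = "``"


def expected_open_quote_tag(token):
    """Return the PTB open-quote tag for a known open-quote form, else None."""
    return PTB_OPEN_QUOTE_TAG if token in OPEN_QUOTE_FORMS else None
```

A check like this could be used to confirm that a given model's output for these tokens matches the treebank convention, for example by comparing it against the `pos` field in the server response.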