CoreNLP

parsing '`'

Open AntonOfTheWoods opened this issue 2 years ago • 5 comments

curl 'http://localhost:9000/?properties={%22annotators%22%3A%22lemma%22%2C%22outputFormat%22%3A%22json%22}' -d '`'

Gives me:

{
  "sentences": [
    {
      "index": 0,
      "tokens": [
        {
          "index": 1,
          "word": "`",
          "originalText": "`",
          "lemma": "`",
          "characterOffsetBegin": 0,
          "characterOffsetEnd": 1,
          "pos": "``",
          "before": "",
          "after": ""
        }
      ]
    }
  ]
}

With the standard English model. Is this expected? I'm particularly surprised at the POS.

AntonOfTheWoods avatar Nov 04 '22 05:11 AntonOfTheWoods
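
For anyone who wants to reproduce this from a script rather than curl, here is a minimal Python sketch of the same request. It assumes a CoreNLP server is listening on localhost:9000 as in the curl call above. The helper names (`build_url`, `word_pos_pairs`, `annotate`) are made up for illustration and are not part of CoreNLP.

```python
import json
import urllib.parse
import urllib.request

SERVER = "http://localhost:9000/"  # assumed: same local server as the curl call


def build_url(annotators="lemma", server=SERVER):
    """Build the request URL with the properties JSON percent-encoded."""
    props = {"annotators": annotators, "outputFormat": "json"}
    return server + "?properties=" + urllib.parse.quote(json.dumps(props))


def word_pos_pairs(doc):
    """Flatten a parsed JSON response into (word, pos) pairs."""
    return [(tok["word"], tok["pos"])
            for sent in doc["sentences"]
            for tok in sent["tokens"]]


def annotate(text, annotators="lemma"):
    """POST raw text to the server and return the parsed JSON response."""
    req = urllib.request.Request(build_url(annotators), data=text.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With a server running, `word_pos_pairs(annotate("`"))` lists each token next to its tag, which makes it easier to compare the POS for different quote characters.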

Using 4.5.1

AntonOfTheWoods avatar Nov 04 '22 05:11 AntonOfTheWoods

Ok, to be honest I wasn't entirely clear if this was a question about the server interface or a question specifically about the model results, so I just ignored it for a couple days. Looking back at it, I realize your main question is about the POS tag, but I think this is expected behavior. The PTB dataset turns the curvy open quotes “ into `` with the `` tag, and ` doesn't show up anywhere in the dataset but kinda looks like ``, so there it is. EWT goes one step further and even has some single backticks tagged as ``. So the end result is that the tags we return are exactly the tags the data teaches the model to use.

AngledLuffa avatar Nov 08 '22 06:11 AngledLuffa

Thanks for that. I'll confess I don't fully understand how the tags are set (are they model-specific? So, eg, the Chinese PKU and CTB might be different?).

I ended up doing a search in the code and found a list in what looked like a template file. It had what was supposed to be the TB tags (what I had found somewhere plus the punctuation tags) and the "web extension", if I understood correctly. There were some other tags mentioned from another set but I don't know if they are used anywhere. I guess if they are model specific it doesn't even make sense to put a list of tags on the site?

AntonOfTheWoods avatar Nov 08 '22 07:11 AntonOfTheWoods

Honestly, we've had documentation of the various tagsets in the past, but I'm not sure we still have that available after updating some of the models to use the UD tagset. I will leave this issue open as a reminder to update that documentation, if you don't mind.

AngledLuffa avatar Nov 08 '22 08:11 AngledLuffa

Agree that this performs according to spec, given the historical usage of the English Penn Treebank, etc. where ` is used for open quotes, and all open quotes including `, ``, “, etc. get POS tag ``.

manning avatar Nov 22 '22 18:11 manning
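
The convention described in the thread can be written down as a small lookup. This is only an illustration of the Penn Treebank open-quote convention as stated above, not CoreNLP's tokenizer or tagger code. The set below holds just the three forms named in the thread, and the names `OPEN_QUOTE_FORMS` and `expected_open_quote_tag` are hypothetical.

```python
# Illustrative only: the open-quote forms named in the thread.
# The real tagger learns this from training data; it does not use a lookup.
OPEN_QUOTE_FORMS = {"`", "``", "\u201c"}  # backtick, double backtick, curly open quote
PTB_OPEN_QUOTE_TAG = "``"


def expected_open_quote_tag(token):
    """Return the PTB open-quote tag for a known open-quote form, else None."""
    return PTB_OPEN_QUOTE_TAG if token in OPEN_QUOTE_FORMS else None
```

So a lone backtick maps to the same tag as `` and “, which matches the output in the original report.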