CommonsenseERL_EMNLP_2019

MaltParser caused NYT event extraction failure

Open ccclyu opened this issue 4 years ago • 3 comments

As mentioned in https://github.com/knowitall/ollie/issues/9, the Open IE tool uses MaltParser internally for parsing, and MaltParser cannot handle Unicode correctly.

So the extraction script in https://github.com/MagiaSN/CommonsenseERL_EMNLP_2019/blob/master/preproc/OpenExtract.scala should add the following code to handle encoding errors:

import scala.io.Codec
import java.nio.charset.CodingErrorAction

// Decode input as UTF-8 and replace bad bytes instead of throwing an exception.
implicit val codec: Codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
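
For anyone who wants to check the effect before patching the script, here is a minimal, self-contained sketch. It assumes the input is read through scala.io.Source, which picks up the implicit codec; the object name LenientRead, the helper readLines, and the temporary file are hypothetical and not taken from the repository.

```scala
import java.nio.charset.CodingErrorAction
import java.nio.file.Files
import scala.io.{Codec, Source}

// Hypothetical demo object; not part of OpenExtract.scala.
object LenientRead {
  // UTF-8 decoder that substitutes U+FFFD for bad bytes instead of throwing.
  implicit val codec: Codec = Codec("UTF-8")
  codec.onMalformedInput(CodingErrorAction.REPLACE)
  codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

  // Reads all lines of a file; Source.fromFile uses the implicit codec above.
  def readLines(path: String): List[String] = {
    val src = Source.fromFile(path)
    try src.getLines().toList finally src.close()
  }

  def main(args: Array[String]): Unit = {
    val tmp = Files.createTempFile("nyt-sample", ".txt")
    // The byte 0xFF is never valid in UTF-8, so it triggers the malformed-input path.
    Files.write(tmp, Array[Byte]('o'.toByte, 'k'.toByte, 0xFF.toByte, '\n'.toByte))
    readLines(tmp.toString).foreach(println)
    Files.delete(tmp)
  }
}
```

Without the two REPLACE settings, the same read would be expected to fail with a MalformedInputException on the invalid byte.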

ccclyu avatar Jun 03 '21 12:06 ccclyu

Thanks for your investigation, but I don't seem to encounter this issue while processing the NYT dataset.

Anyway, this enhancement looks good to me, so feel free to open a pull request if it solves your problem😄

MagiaSN avatar Jun 03 '21 13:06 MagiaSN

Thanks for the reply. I will open a pull request later. By the way, I have the following questions about the preprocessing of the ATOMIC dataset. I would appreciate your clarification!

  1. How do you generate embeddings for PersonX and PersonY?
  2. There are many placeholders in ATOMIC events, like "PersonX | reaches | another ___ ". How do you handle such cases?

ccclyu avatar Jun 28 '21 03:06 ccclyu

  1. We use the average of the embeddings of 200 common English names as the initial embedding of PersonX and PersonY. You can download GloVe with these extra embeddings from the links in the README:

The pretrained word embedding can be downloaded from google drive or baidu netdisk. We add embeddings for word "PersonX" and "PersonY" to the original Glove word embedding.

  2. We simply ignore placeholders, so they are mapped to the UNK embedding.
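
The two answers above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the repository's actual preprocessing code: the object PersonEmbeddingSketch, its functions, the in-memory Map used as the embedding table, and the choice to give PersonX and PersonY the same initial vector are all hypothetical.

```scala
// Hypothetical sketch; names and data layout are assumptions.
object PersonEmbeddingSketch {
  type Vec = Array[Float]

  // Element-wise mean of equally sized vectors.
  def average(vectors: Seq[Vec]): Vec = {
    require(vectors.nonEmpty, "need at least one vector")
    val dim = vectors.head.length
    require(vectors.forall(_.length == dim), "dimension mismatch")
    val sum = new Array[Float](dim)
    for (v <- vectors; i <- 0 until dim) sum(i) += v(i)
    sum.map(_ / vectors.size)
  }

  // Adds "PersonX" and "PersonY" to the table, both initialised to the mean
  // of the name vectors that are present in the table (missing names are skipped).
  def withPersonTokens(table: Map[String, Vec], names: Seq[String]): Map[String, Vec] = {
    val mean = average(names.flatMap(table.get))
    table + ("PersonX" -> mean) + ("PersonY" -> mean.clone())
  }

  // Tokens without an entry, such as the "___" placeholder, fall back to the UNK vector.
  def lookup(table: Map[String, Vec], unk: Vec)(token: String): Vec =
    table.getOrElse(token, unk)
}
```

In this sketch a real run would pass the list of 200 names and the loaded GloVe table; how the placeholder is tokenized in the actual pipeline is not stated in the thread.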

MagiaSN avatar Jun 28 '21 11:06 MagiaSN