CommonsenseERL_EMNLP_2019
MaltParser causes NYT event extraction failure
As mentioned in https://github.com/knowitall/ollie/issues/9, the Open IE tool uses MaltParser internally for parsing, and MaltParser cannot handle Unicode correctly.
So the extraction script at https://github.com/MagiaSN/CommonsenseERL_EMNLP_2019/blob/master/preproc/OpenExtract.scala should add the following code to handle character-coding exceptions:
import scala.io.Codec
import java.nio.charset.CodingErrorAction
implicit val codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
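To see what these settings change, here is a minimal sketch, not the code in OpenExtract.scala. It writes a few bytes that are not valid UTF-8 to a temporary file and reads them back through scala.io.Source. The object name LenientRead, the helper readLenient and the sample bytes are made up for illustration.

```scala
import java.nio.charset.{CodingErrorAction, StandardCharsets}
import java.nio.file.{Files, Path}
import scala.io.{Codec, Source}

object LenientRead {
  // Codec that substitutes a replacement character for bad bytes
  // instead of throwing java.nio.charset.MalformedInputException.
  implicit val codec: Codec = Codec("UTF-8")
  codec.onMalformedInput(CodingErrorAction.REPLACE)
  codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

  // Hypothetical helper: read a whole file as lines using the implicit codec.
  def readLenient(path: Path): List[String] = {
    val src = Source.fromFile(path.toFile)
    try src.getLines().toList finally src.close()
  }

  def main(args: Array[String]): Unit = {
    val tmp = Files.createTempFile("nyt-sample", ".txt")
    // "caf" followed by a lone 0xE9 byte (Latin-1 e-acute), which is
    // not a valid UTF-8 sequence, then a newline.
    val bytes = "caf".getBytes(StandardCharsets.UTF_8) ++ Array(0xE9.toByte, 0x0A.toByte)
    Files.write(tmp, bytes)
    readLenient(tmp).foreach(println)
    Files.delete(tmp)
  }
}
```

With the two REPLACE lines removed, the same read is expected to fail with a MalformedInputException, which matches the failure described above.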
Thanks for your investigation, but I don't seem to encounter this issue while processing the NYT dataset.
Anyway, this enhancement looks good to me, so feel free to open a pull request if it solves your problem😄
Thanks for the reply. I will open a pull request later. By the way, I have the following questions about the preprocessing of the ATOMIC dataset. I would appreciate your clarification!
- How do you generate embeddings for PersonX and PersonY?
- There are many placeholders in ATOMIC events, like "PersonX | reaches | another ___ ". How do you handle such cases?
- We use the average of the embeddings of 200 common English names as the initial embedding of PersonX and PersonY. You can download GloVe with these extra embeddings from the links in the README:
The pretrained word embedding can be downloaded from google drive or baidu netdisk. We add embeddings for word "PersonX" and "PersonY" to the original Glove word embedding.
- We do not handle placeholders specially, so they just use the UNK embedding.
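The two answers above can be illustrated with a rough sketch. It is not the repository's preprocessing code: the object PersonEmbedding, the helpers average and lookup, the toy 2-dimensional table and the all-zero UNK vector are assumptions made for the example, and it assumes PersonX and PersonY start from the same averaged vector.

```scala
object PersonEmbedding {
  type Vec = Array[Float]

  // Element-wise mean of the vectors found for `names`; names that are
  // missing from the table are skipped.
  def average(names: Seq[String], table: Map[String, Vec]): Vec = {
    val found = names.flatMap(table.get)
    require(found.nonEmpty, "no name has a pretrained embedding")
    val dim = found.head.length
    val sum = new Array[Float](dim)
    for (v <- found; i <- 0 until dim) sum(i) += v(i)
    sum.map(_ / found.size)
  }

  // Look a token up, falling back to the UNK vector when it is not in
  // the vocabulary (as happens for the "___" placeholder).
  def lookup(token: String, table: Map[String, Vec], unk: Vec): Vec =
    table.getOrElse(token, unk)

  def main(args: Array[String]): Unit = {
    // Toy table standing in for GloVe; real vectors are much larger.
    val glove = Map(
      "john" -> Array(1.0f, 0.0f),
      "mary" -> Array(0.0f, 1.0f),
      "reaches" -> Array(0.5f, 0.5f)
    )
    // Two names stand in for the 200 common English names.
    val person = average(Seq("john", "mary"), glove)
    val table = glove + ("PersonX" -> person) + ("PersonY" -> person)
    val unk = Array(0.0f, 0.0f) // assumed UNK vector
    val event = Seq("PersonX", "reaches", "another", "___")
    event.foreach { t =>
      println(t + " -> " + lookup(t, table, unk).mkString("[", ", ", "]"))
    }
  }
}
```

In this toy table "another" is also out of vocabulary, so it gets the UNK vector just like the placeholder.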