preprocess-conll05
Problems building Brown test set
The preprocessing steps seem to work for all of the WSJ data, but I'm running into some issues with the Brown test set. It might be a version issue with my Penn Treebank data and/or Stanford parser, but I'm curious whether anyone else has had the same issue. A specific example is available starting on line 743 of the file,
LDC99T42/treebank_3/parsed/mrg/brown/ck/ck02.mrg
( (SQ
(NP-SBJ (-NONE- *) )
(VP (VB Remember)
(SBAR
(WHNP-1 (WP what) )
(S
(NP-SBJ (PRP I) )
(VP (VBD said)
(NP (-NONE- *T*-1) )
(PP (IN about)
(S-NOM
(NP-SBJ (-NONE- *) )
(VP (VBG going)
(ADVP-DIR (RP out) )
(S-PRP
(NP-SBJ (-NONE- *) )
(VP (TO to)
(VP (VB get)
(NP
(NP (NN anybody) )
(VP (VBN left)
(ADVP (IN behind) ))))))))))))))
(. ?) (. ?) )
( (S
(NP-SBJ (DT That) )
(ADVP-TMP (RB still) )
(VP (VBZ holds) )
(. .) ))
( (S
(NP-SBJ (PRP We) )
(VP (VBP bring)
(ADVP-DIR (RB back) )
(NP
(NP (DT all) )
(ADJP (JJ dead)
(CC and)
(VBN wounded) )))
('' '') (. .) ))
Note that in the first sentence (which ends in double question marks) the outermost parentheses enclose the entire sentence.
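To find other trees with the same shape, a rough sketch like the following counts how many constituents sit directly inside the outer wrapper of each tree. The function name is made up for illustration, and it assumes that literal brackets in the text are encoded as -LRB-/-RRB-, so that every parenthesis character is tree structure.

```python
def top_level_child_counts(mrg_lines):
    """Return, for each tree, how many constituents sit directly inside the outer wrapper.

    Assumption: literal brackets in the text are encoded as -LRB-/-RRB-, so
    every "(" or ")" character is tree structure. Lines starting with "*x*"
    are skipped, mirroring the awk filter used in this thread.
    """
    counts = []
    depth = 0
    children = 0
    for line in mrg_lines:
        if line.startswith("*x*"):
            continue
        for ch in line:
            if ch == "(":
                depth += 1
                if depth == 2:      # a direct child of the wrapper opens
                    children += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:      # the wrapper itself just closed
                    counts.append(children)
                    children = 0
    return counts


# Shortened versions of the first two trees shown above.
sample = [
    "( (SQ (VP (VB Remember) ))",
    "  (. ?) (. ?) )",
    "( (S (NP-SBJ (DT That) ) (VP (VBZ holds) ) (. .) ))",
]
print(top_level_child_counts(sample))  # [3, 1]
```

A count above 1 marks a tree whose wrapper holds more than a single root constituent, as in the double-question-mark sentence.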
The syntax parse step here (I replaced the awk statement with awk '!/^\*x\*/ {print}'),
- https://github.com/strubell/preprocess-conll05/blob/master/bin/basic/extract_test_from_brown.sh
produces
$CONLL05/test.brown/synt/test.brown.synt.gz
VB (SQ(VP*
WP (SBAR(WHNP-1*)
PRP (S(NP-SBJ*)
VBD (VP*
IN (PP*
VBG (S-NOM(VP*
RP (ADVP-DIR*)
TO (S-PRP(VP*
VB (VP*
NN (NP(NP*)
VBN (VP*
IN (ADVP*))))))))))))))
. *
. *
DT (S(NP-SBJ*)
RB (ADVP-TMP*)
VBZ (VP*)
. *)
PRP (S(NP-SBJ*)
VBP (VP*
RB (ADVP-DIR*)
DT (NP(NP*)
JJ (ADJP*
CC *
VBN *)))
'' *
. *)
Note that the two question marks (the '. *' rows) are no longer contained within the sentence's parentheses. Next we run,
- https://github.com/strubell/preprocess-conll05/blob/master/bin/basic/make-brown-test.sh
$CONLL05/test.brown.gz
Remember VB (SQ(VP* * - remember (V*) * * * *
what WP (SBAR(WHNP-1*) * - - (A1* (R-A1*) * * *
I PRP (S(NP-SBJ*) * - - * (A0*) * * *
said VBD (VP* * - say * (V*) * * *
about IN (PP* * - - * (A3* * * *
going VBG (S-NOM(VP* * - go * * (V*) * *
out RP (ADVP-DIR*) * - - * * (AM-DIR*) * *
to TO (S-PRP(VP* * - - * * (AM-PNC* * *
get VB (VP* * - get * * * (V*) *
anybody NN (NP(NP*) * - - * * * (A1* (A0*)
left VBN (VP* * - leave * * * * (V*)
behind IN (ADVP*)))))))))))))) * - - *) *) *) *) (AM-ADV*)
? . * * - - * * * * *
? . * * - - * * * * *
That DT (S(NP-SBJ*) * - - (A1*)
still RB (ADVP-TMP*) * - - (AM-TMP*)
holds VBZ (VP*) * - hold (V*)
. . *) * - - *
We PRP (S(NP-SBJ*) * - - (A0*)
bring VBP (VP* * - bring (V*)
back RB (ADVP-DIR*) * - - (AM-DIR*)
all DT (NP(NP*) * - - (A1*
dead JJ (ADJP* * - - *
and CC * * - - *
wounded VBN *))) * - - *)
'' '' * * - - *
. . *) * - - *
When we continue and run the
- https://github.com/strubell/preprocess-conll05/blob/master/bin/preprocess_conll05_sdeps.sh
script with $CONLL05/test.brown.gz as input, we get a series of outputs like this:
$CONLL05/test.brown.gz.parse
(from applying awk and sed commands to the input file $CONLL05/test.brown.gz)
(SQ(VP(VB Remember)
(SBAR(WHNP-1(WP what))
(S(NP-SBJ(PRP I))
(VP(VBD said)
(PP(IN about)
(S-NOM(VP(VBG going)
(ADVP-DIR(RP out))
(S-PRP(VP(TO to)
(VP(VB get)
(NP(NP(NN anybody))
(VP(VBN left)
(ADVP(IN behind)))))))))))))))
(. ?)
(. ?)
(S(NP-SBJ(DT That))
(ADVP-TMP(RB still))
(VP(VBZ holds))
(. .))
(S(NP-SBJ(PRP We))
(VP(VBP bring)
(ADVP-DIR(RB back))
(NP(NP(DT all))
(ADJP(JJ dead)
(CC and)
(VBN wounded))))
('' '')
(. .))
$CONLL05/test.brown.gz.parse.sdeps
(from applying the Stanford parser to $CONLL05/test.brown.gz.parse)
1 Remember _ VERB VB _ 0 root _ _
2 what _ PRON WP _ 4 dobj _ _
3 I _ PRON PRP _ 4 nsubj _ _
4 said _ VERB VBD _ 1 ccomp _ _
5 about _ SCONJ IN _ 4 prep _ _
6 going _ VERB VBG _ 5 pcomp _ _
7 out _ ADP RP _ 6 advmod _ _
8 to _ PART TO _ 9 aux _ _
9 get _ VERB VB _ 6 xcomp _ _
10 anybody _ PRON NN _ 9 dobj _ _
11 left _ VERB VBN _ 10 vmod _ _
12 behind _ ADP IN _ 11 advmod _ _
1 ? _ PUNCT . _ 0 root _ _
1 ? _ PUNCT . _ 0 root _ _
1 That _ PRON DT _ 3 nsubj _ _
2 still _ ADV RB _ 3 advmod _ _
3 holds _ VERB VBZ _ 0 root _ _
4 . _ PUNCT . _ 3 punct _ _
1 We _ PRON PRP _ 2 nsubj _ _
2 bring _ VERB VBP _ 0 root _ _
3 back _ ADV RB _ 2 advmod _ _
4 all _ DET DT _ 2 dobj _ _
5 dead _ ADJ JJ _ 4 amod _ _
6 and _ CONJ CC _ 5 cc _ _
7 wounded _ VERB VBN _ 5 conj _ _
8 '' _ PUNCT '' _ 2 punct _ _
9 . _ PUNCT . _ 2 punct _ _
Note that the question marks have been put on their own lines here.
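One way to locate where this split first happens is to compare per-sentence token counts between the original file and the converted one. The sketch below assumes both files separate sentences with blank lines (the excerpts above do not show them), and the helper names are hypothetical.

```python
def sentence_lengths(lines):
    """Return the number of non-blank lines in each blank-line-separated block."""
    lengths = []
    current = 0
    for line in lines:
        if line.strip():
            current += 1
        elif current:
            lengths.append(current)
            current = 0
    if current:
        lengths.append(current)
    return lengths


def first_mismatch(a, b):
    """Return the index of the first sentence whose length differs, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Stand-ins with the same shape as the three sentences in this example.
original = ["tok"] * 14 + [""] + ["tok"] * 4 + [""] + ["tok"] * 9
converted = (["tok"] * 12 + [""] + ["tok"] + [""] + ["tok"] + [""]
             + ["tok"] * 4 + [""] + ["tok"] * 9)

print(sentence_lengths(original))   # [14, 4, 9]
print(sentence_lengths(converted))  # [12, 1, 1, 4, 9]
print(first_mismatch(sentence_lengths(original), sentence_lengths(converted)))  # 0
```

On real files the lines would come from the decompressed test.brown.gz and from the .sdeps output, and the returned index points at the first sentence that was split.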
$CONLL05/test.brown.gz.parse.sdeps.posonly
(from applying awk to $CONLL05/test.brown.gz.parse.sdeps)
Remember what I said about going out to get anybody left behind
?
?
That still holds .
We bring back all dead and wounded '' .
$CONLL05/test.brown.gz.parse.sdeps.pos
(from applying edu.stanford.nlp.tagger.maxent.MaxentTagger to $CONLL05/test.brown.gz.parse.sdeps.posonly)
Remember VB
what WP
I PRP
said VBD
about IN
going VBG
out RP
to TO
get VB
anybody NN
left VBD
behind IN
? .
? .
That DT
still RB
holds VBZ
. .
We PRP
bring VBP
back RP
all DT
dead JJ
and CC
wounded VBN
'' ''
. .
$CONLL05/test.brown.gz.parse.sdeps.combined
from applying the paste command to
- f_converted=$CONLL05/test.brown.gz.parse.sdeps
- f_pos=$CONLL05/test.brown.gz.parse.sdeps.pos
conll05 200 0 Remember VB VB 0 root _ - remember - - * (V*) * * * *
conll05 200 1 what WP WP 4 dobj _ - - - - * (A1* (R-A1*) * * *
conll05 200 2 I PRP PRP 4 nsubj _ - - - - * * (A0*) * * *
conll05 200 3 said VBD VBD 1 ccomp _ - say - - * * (V*) * * *
conll05 200 4 about IN IN 4 prep _ - - - - * * (A3* * * *
conll05 200 5 going VBG VBG 5 pcomp _ - go - - * * * (V*) * *
conll05 200 6 out RP RP 6 advmod _ - - - - * * * (AM-DIR*) * *
conll05 200 7 to TO TO 9 aux _ - - - - * * * (AM-PNC* * *
conll05 200 8 get VB VB 6 xcomp _ - get - - * * * * (V*) *
conll05 200 9 anybody NN NN 9 dobj _ - - - - * * * * (A1* (A0*)
conll05 200 10 left VBN VBD 10 vmod _ - leave - - * * * * * (V*)
conll05 200 11 behind IN IN 11 advmod _ - - - - * *) *) *) *) (AM-ADV*)
conll05 200 12 ? - - - - * * * * * *
conll05 200 13 ? . . 0 root _ - - - - * * * * * *
conll05 201 0 That . . 0 root _ - - - - * (A1*)
conll05 201 1 still - - - - * (AM-TMP*)
conll05 201 2 holds DT DT 3 nsubj _ - hold - - * (V*)
conll05 201 3 . RB RB 3 advmod _ - - - - * *
VBZ VBZ 0 root _
conll05 202 0 We . . 3 punct _ - - - - * (A0*)
conll05 202 1 bring - bring - - * (V*)
conll05 202 2 back PRP PRP 2 nsubj _ - - - - * (AM-DIR*)
conll05 202 3 all VBP VBP 0 root _ - - - - * (A1*
conll05 202 4 dead RB RP 2 advmod _ - - - - * *
conll05 202 5 and DT DT 2 dobj _ - - - - * *
conll05 202 6 wounded JJ JJ 4 amod _ - - - - * *)
conll05 202 7 '' CC CC 5 cc _ - - - - * *
conll05 202 8 . VBN VBN 5 conj _ - - - - * *
'' '' 2 punct _
Well, that was a long read! We can see the problem in the two lines that carry dependency columns but no token,
VBZ VBZ 0 root _
and
'' '' 2 punct _
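From sentence 200 onward the dependency columns are shifted relative to the tokens (for example, "That" is paired with the ". . 0 root" row of a question mark), and this pattern continues to cause problems further down the file. Has anyone else run into this problem and found a solution?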
@strubell ?
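The shift can be reproduced in isolation. The sketch below imitates how paste joins two files line by line, padding with empty fields once one input runs out. It is an illustration with made-up single-column inputs, not the repository's script.

```python
from itertools import zip_longest


def paste(left, right, sep="\t"):
    """Join two lists of lines row by row, the way the paste utility joins two files."""
    return [
        sep.join((l if l is not None else "", r if r is not None else ""))
        for l, r in zip_longest(left, right)
    ]


# Token column: the two question marks belong to the first sentence.
tokens = ["behind", "?", "?", "", "That", "still", "holds", "."]
# Tag column: each question mark has become its own one-line sentence.
tags = ["IN", "", ".", "", ".", "", "DT", "RB", "VBZ", "."]

for row in paste(tokens, tags):
    print(repr(row))
```

The rows come out as 'That\t.', 'still\t', 'holds\tDT', '.\tRB', followed by token-less rows such as '\tVBZ', which is the same pattern as in the combined file above.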
Soooooooooo, I think I figured out my mistake. It turns out that the syntax generated by lines like this,
cat $BROWN/$section/$section$subsection.mrg \
  | awk '!/^\*x\*/ {print}' \
  | $SRLCONLL/bin/wsj-removetraces.pl \
  | $SRLCONLL/bin/wsj-to-se.pl -w 0 \
  >> $STRUBELL18/$DATA_SEGMENT/synt/$DATA_SEGMENT.synt
is what causes the problem in the first place. If one instead uses a syntax parse provided in the CONLL05 data set (synt.cha, for example), the extra lines are not introduced. Hopefully someone out there finds this useful!
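For anyone who wants to check a synt file before running the later steps, here is a minimal sketch that flags a sentence whose bracket column closes before its last token, which is the symptom seen with the regenerated Brown syntax. It assumes the bracket tags of one sentence have already been collected into a list, and the function name is hypothetical.

```python
def early_close_index(tags):
    """Return the index of the token at which the tree closes too early, or None.

    `tags` holds the start-end bracket column for one sentence, for example
    "(S(NP-SBJ*)" or "*)". A well-formed sentence returns to depth 0 only on
    its final token.
    """
    depth = 0
    for i, tag in enumerate(tags):
        depth += tag.count("(") - tag.count(")")
        if depth == 0 and i < len(tags) - 1:
            return i
    return None


# Second sentence from the example: the final "." closes the tree.
good = ["(S(NP-SBJ*)", "(ADVP-TMP*)", "(VP*)", "*)"]
# Shortened stand-in for the first sentence: the tree closes, then two bare "*" rows follow.
bad = ["(SQ(VP*", "(NP*)))", "*", "*"]

print(early_close_index(good))  # None
print(early_close_index(bad))   # 1
```

Any sentence for which this returns an index has trailing tokens outside the tree, and those are the tokens that end up as separate one-token sentences downstream.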
Hello,
I have the same problem with the Brown test set. Did you resolve it?
Thanks