
Problems building Brown test set

Open galtay opened this issue 5 years ago • 2 comments

The preprocessing steps seem to work for all of the WSJ data, but I'm running into some issues with the Brown test set. It might be a version issue with my Penn Treebank data and/or Stanford parser, but I'm curious whether anyone else has had the same issue. A specific example starts on line 743 of the file,

LDC99T42/treebank_3/parsed/mrg/brown/ck/ck02.mrg

( (SQ                                                                                                                                                                                                                                           
    (NP-SBJ (-NONE- *) )                                                                                                                                                                                                                        
    (VP (VB Remember)                                                                                                                                                                                                                           
      (SBAR                                                                                                                                                                                                                                     
        (WHNP-1 (WP what) )                                                                                                                                                                                                                     
        (S                                                                                                                                                                                                                                      
          (NP-SBJ (PRP I) )                                                                                                                                                                                                                     
          (VP (VBD said)                                                                                                                                                                                                                        
            (NP (-NONE- *T*-1) )                                                                                                                                                                                                                
            (PP (IN about)                                                                                                                                                                                                                      
              (S-NOM                                                                                                                                                                                                                            
                (NP-SBJ (-NONE- *) )                                                                                                                                                                                                            
                (VP (VBG going)                                                                                                                                                                                                                 
                  (ADVP-DIR (RP out) )                                                                                                                                                                                                          
                  (S-PRP                                                                                                                                                                                                                        
                    (NP-SBJ (-NONE- *) )                                                                                                                                                                                                        
                    (VP (TO to)                                                                                                                                                                                                                 
                      (VP (VB get)                                                                                                                                                                                                              
                        (NP                                                                                                                                                                                                                     
                          (NP (NN anybody) )                                                                                                                                                                                                    
                          (VP (VBN left)                                                                                                                                                                                                        
                            (ADVP (IN behind) ))))))))))))))                                                                                                                                                                                    
  (. ?) (. ?) )
( (S                                                                                                                                                                         
    (NP-SBJ (DT That) )                                                                                                                                                      
    (ADVP-TMP (RB still) )                                                                                                                                                   
    (VP (VBZ holds) )                                                                                                                                                        
    (. .) ))                                                                                                                                                                 
( (S                                                                                                                                                                         
    (NP-SBJ (PRP We) )                                                                                                                                                       
    (VP (VBP bring)                                                                                                                                                          
      (ADVP-DIR (RB back) )                                                                                                                                                  
      (NP                                                                                                                                                                    
        (NP (DT all) )                                                                                                                                                       
        (ADJP (JJ dead)                                                                                                                                                      
          (CC and)                                                                                                                                                           
          (VBN wounded) )))                                                                                                                                                  
    ('' '') (. .) ))

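Note that in the first sentence (which ends in double question marks) the outermost parentheses enclose the entire sentence, but the two question marks sit outside the SQ constituent. In the other two sentences the final punctuation is inside the S constituent.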
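To make the difference between these trees concrete, here is a minimal Python sketch that lists the labels directly under the outer wrapper of a bracketed tree. The function name `top_level_labels` is hypothetical (it is not part of preprocess-conll05), and the first tree is abbreviated from the example above; the sketch assumes literal parentheses in the text are already escaped as -LRB-/-RRB-, as is usual in the Penn Treebank.

```python
def top_level_labels(tree):
    """Return the labels of the nodes directly under the outer wrapper."""
    labels = []
    depth = 0
    for i, ch in enumerate(tree):
        if ch == "(":
            depth += 1
            if depth == 2:
                # Read the label that follows this opening bracket.
                j = i + 1
                while j < len(tree) and not tree[j].isspace() and tree[j] not in "()":
                    j += 1
                labels.append(tree[i + 1:j])
        elif ch == ")":
            depth -= 1
    return labels


# Abbreviated version of the first sentence: punctuation is a sibling of SQ.
first = "( (SQ (NP-SBJ (-NONE- *) ) (VP (VB Remember) ) ) (. ?) (. ?) )"
# Second sentence: punctuation is inside S.
second = "( (S (NP-SBJ (DT That) ) (ADVP-TMP (RB still) ) (VP (VBZ holds) ) (. .) ))"

print(top_level_labels(first))
print(top_level_labels(second))
```

A tree with more than one label under the wrapper is one whose tokens may end up outside every bracket once the wrapper is dropped.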

The syntax extraction step here (in which I replaced the awk statement with awk '!/^\*x\*/ {print}'),

  • https://github.com/strubell/preprocess-conll05/blob/master/bin/basic/extract_test_from_brown.sh

produces

$CONLL05/test.brown/synt/test.brown.synt.gz

VB                         (SQ(VP*                                          
WP                   (SBAR(WHNP-1*)                                         
PRP                     (S(NP-SBJ*)                                         
VBD                           (VP*                                          
IN                            (PP*                                          
VBG                     (S-NOM(VP*                                          
RP                      (ADVP-DIR*)                                         
TO                      (S-PRP(VP*                                          
VB                            (VP*                                          
NN                         (NP(NP*)                                         
VBN                           (VP*                                          
IN                          (ADVP*))))))))))))))                            
.                                *                                          
.                                * 

DT                      (S(NP-SBJ*)                                                                                                                                          
RB                      (ADVP-TMP*)                                                                                                                                          
VBZ                           (VP*)                                                                                                                                          
.                                *)                                                                                                                                          
                                                                                                                                                                             
PRP                     (S(NP-SBJ*)                                                                                                                                          
VBP                           (VP*                                                                                                                                           
RB                      (ADVP-DIR*)                                                                                                                                          
DT                         (NP(NP*)                                                                                                                                          
JJ                          (ADJP*                                                                                                                                           
CC                               *                                                                                                                                           
VBN                              *)))                                                                                                                                        
''                               *                                                                                                                                           
.                                *)

Note that the two question marks are no longer enclosed by any parentheses. Next we run

  • https://github.com/strubell/preprocess-conll05/blob/master/bin/basic/make-brown-test.sh

which produces

$CONLL05/test.brown.gz

Remember  VB                         (SQ(VP*                            *    -  remember    (V*)         *            *        *            *                   
what      WP                   (SBAR(WHNP-1*)                           *    -  -          (A1*     (R-A1*)           *        *            *                   
I         PRP                     (S(NP-SBJ*)                           *    -  -             *       (A0*)           *        *            *                   
said      VBD                           (VP*                            *    -  say           *        (V*)           *        *            *                   
about     IN                            (PP*                            *    -  -             *       (A3*            *        *            *                   
going     VBG                     (S-NOM(VP*                            *    -  go            *          *          (V*)       *            *                   
out       RP                      (ADVP-DIR*)                           *    -  -             *          *     (AM-DIR*)       *            *                   
to        TO                      (S-PRP(VP*                            *    -  -             *          *     (AM-PNC*        *            *                   
get       VB                            (VP*                            *    -  get           *          *            *      (V*)           *                   
anybody   NN                         (NP(NP*)                           *    -  -             *          *            *     (A1*         (A0*)                  
left      VBN                           (VP*                            *    -  leave         *          *            *        *          (V*)                  
behind    IN                          (ADVP*))))))))))))))              *    -  -             *)         *)           *)       *)    (AM-ADV*)                  
?         .                                *                            *    -  -             *          *            *        *            *                   
?         .                                *                            *    -  -             *          *            *        *            *

That   DT                      (S(NP-SBJ*)                           *    -  -          (A1*)                                                                                
still  RB                      (ADVP-TMP*)                           *    -  -      (AM-TMP*)                                                                                
holds  VBZ                           (VP*)                           *    -  hold        (V*)                                                                                
.      .                                *)                           *    -  -             *                                                                                 
                                                                                                                                                                             
We       PRP                     (S(NP-SBJ*)                           *    -  -           (A0*)                                                                             
bring    VBP                           (VP*                            *    -  bring        (V*)                                                                             
back     RB                      (ADVP-DIR*)                           *    -  -       (AM-DIR*)                                                                             
all      DT                         (NP(NP*)                           *    -  -           (A1*                                                                              
dead     JJ                          (ADJP*                            *    -  -              *                                                                              
and      CC                               *                            *    -  -              *                                                                              
wounded  VBN                              *)))                         *    -  -              *)                                                                             
''       ''                               *                            *    -  -              *                                                                              
.        .                                *)                           *    -  -              * 
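The stranded question marks can be found mechanically by tracking bracket depth down the syntax column. The following is a rough Python sketch, not part of the repository's scripts; `tokens_outside_root` is a hypothetical name, and the `stranded` column is a shortened, made-up stand-in for the first sentence rather than a copy of it.

```python
def tokens_outside_root(synt_column):
    """Return 0-based indices of tokens that sit outside every bracket."""
    outside = []
    depth = 0
    for i, cell in enumerate(synt_column):
        opens = cell.count("(")
        closes = cell.count(")")
        # A token is outside if nothing is open and it opens nothing itself.
        if depth == 0 and opens == 0:
            outside.append(i)
        depth += opens - closes
    return outside


# Syntax column of "That still holds ." as shown above.
well_formed = ["(S(NP-SBJ*)", "(ADVP-TMP*)", "(VP*)", "*)"]
# Shortened illustration: the root closes before two trailing tokens.
stranded = ["(SQ(VP*", "(NP*)))", "*", "*"]

print(tokens_outside_root(well_formed))
print(tokens_outside_root(stranded))
```

Any sentence for which this returns a non-empty list is a candidate for the problem described here.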

When we continue and run the script

  • https://github.com/strubell/preprocess-conll05/blob/master/bin/preprocess_conll05_sdeps.sh

with $CONLL05/test.brown.gz as input, we get a series of outputs like these,

$CONLL05/test.brown.gz.parse

(from applying awk and sed commands to the input file $CONLL05/test.brown.gz)

(SQ(VP(VB Remember)                                                                                                                                                           
(SBAR(WHNP-1(WP what))                                                                                                                                                        
(S(NP-SBJ(PRP I))                                                                                                                                                             
(VP(VBD said)                                                                                                                                                                 
(PP(IN about)                                                                                                                                                                 
(S-NOM(VP(VBG going)                                                                                                                                                          
(ADVP-DIR(RP out))                                                                                                                                                            
(S-PRP(VP(TO to)                                                                                                                                                              
(VP(VB get)                                                                                                                                                                   
(NP(NP(NN anybody))                                                                                                                                                           
(VP(VBN left)                                                                                                                                                                 
(ADVP(IN behind)))))))))))))))                                                                                                                                                
(. ?)                                                                                                                                                                         
(. ?)

(S(NP-SBJ(DT That))                                                                                                                                                          
(ADVP-TMP(RB still))                                                                                                                                                         
(VP(VBZ holds))                                                                                                                                                              
(. .))                                                                                                                                                                       
                                                                                                                                                                             
(S(NP-SBJ(PRP We))                                                                                                                                                           
(VP(VBP bring)                                                                                                                                                               
(ADVP-DIR(RB back))                                                                                                                                                          
(NP(NP(DT all))                                                                                                                                                              
(ADJP(JJ dead)                                                                                                                                                               
(CC and)                                                                                                                                                                     
(VBN wounded))))                                                                                                                                                             
('' '')                                                                                                                                                                      
(. .))

$CONLL05/test.brown.gz.parse.sdeps

(from applying the Stanford parser to $CONLL05/test.brown.gz.parse)

1       Remember        _       VERB    VB      _       0       root    _       _                                                                               
2       what    _       PRON    WP      _       4       dobj    _       _                                                                                       
3       I       _       PRON    PRP     _       4       nsubj   _       _                                                                                       
4       said    _       VERB    VBD     _       1       ccomp   _       _                                                                                       
5       about   _       SCONJ   IN      _       4       prep    _       _                                                                                       
6       going   _       VERB    VBG     _       5       pcomp   _       _                                                                                       
7       out     _       ADP     RP      _       6       advmod  _       _                                                                                       
8       to      _       PART    TO      _       9       aux     _       _                                                                                       
9       get     _       VERB    VB      _       6       xcomp   _       _                                                                                       
10      anybody _       PRON    NN      _       9       dobj    _       _                                                                                       
11      left    _       VERB    VBN     _       10      vmod    _       _                                                                                       
12      behind  _       ADP     IN      _       11      advmod  _       _                                                                                       
                                                                                                                                                                
1       ?       _       PUNCT   .       _       0       root    _       _                                                                                       
                                                                                                                                                                
1       ?       _       PUNCT   .       _       0       root    _       _

1       That    _       PRON    DT      _       3       nsubj   _       _                                                                                                    
2       still   _       ADV     RB      _       3       advmod  _       _                                                                                                    
3       holds   _       VERB    VBZ     _       0       root    _       _                                                                                                    
4       .       _       PUNCT   .       _       3       punct   _       _                                                                                                    
                                                                                                                                                                             
1       We      _       PRON    PRP     _       2       nsubj   _       _                                                                                                    
2       bring   _       VERB    VBP     _       0       root    _       _                                                                                                    
3       back    _       ADV     RB      _       2       advmod  _       _                                                                                                    
4       all     _       DET     DT      _       2       dobj    _       _                                                                                                    
5       dead    _       ADJ     JJ      _       4       amod    _       _                                                                                                    
6       and     _       CONJ    CC      _       5       cc      _       _                                                                                                    
7       wounded _       VERB    VBN     _       5       conj    _       _                                                                                                    
8       ''      _       PUNCT   ''      _       2       punct   _       _                                                                                                    
9       .       _       PUNCT   .       _       2       punct   _       _ 

Note that each question mark has been split off into its own one-token sentence here, so the first sentence now has 12 tokens instead of 14.
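One way to locate such splits across a whole file is to compare tokens per sentence between the original file and the dependency output. Below is a minimal Python sketch under the assumption that both files use blank lines as sentence separators; the helper names are hypothetical, and the placeholder lines only mimic the sentence lengths of the three sentences shown here.

```python
def sentence_lengths(lines):
    """Count tokens per sentence, where blank lines separate sentences."""
    lengths = []
    count = 0
    for line in lines:
        if line.strip():
            count += 1
        elif count:
            lengths.append(count)
            count = 0
    if count:
        lengths.append(count)
    return lengths


def first_mismatch(a, b):
    """Return the index of the first sentence whose length differs, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


# Placeholder lines with the sentence lengths seen in this example.
props_lines = ["tok"] * 14 + [""] + ["tok"] * 4 + [""] + ["tok"] * 9
sdeps_lines = (["tok"] * 12 + [""] + ["tok"] + [""] + ["tok"] + [""]
               + ["tok"] * 4 + [""] + ["tok"] * 9)

props_lengths = sentence_lengths(props_lines)
sdeps_lengths = sentence_lengths(sdeps_lines)
print(props_lengths, sdeps_lengths, first_mismatch(props_lengths, sdeps_lengths))
```

On real data the lines would be read from the two files; the first mismatching index points at the sentence where the two outputs stop lining up.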

$CONLL05/test.brown.gz.parse.sdeps.posonly

(from applying awk to $CONLL05/test.brown.gz.parse.sdeps)

Remember what I said about going out to get anybody left behind                                                                                                              
?                                                                                                                                                                            
?                                                                                                                                                                            
That still holds .                                                                                                                                                           
We bring back all dead and wounded '' .

$CONLL05/test.brown.gz.parse.sdeps.pos

(from applying edu.stanford.nlp.tagger.maxent.MaxentTagger to $CONLL05/test.brown.gz.parse.sdeps.posonly)

Remember        VB                                                                                                                                                           
what    WP                                                                                                                                                                   
I       PRP                                                                                                                                                                  
said    VBD                                                                                                                                                                  
about   IN                                                                                                                                                                   
going   VBG                                                                                                                                                                  
out     RP                                                                                                                                                                   
to      TO                                                                                                                                                                   
get     VB                                                                                                                                                                   
anybody NN                                                                                                                                                                   
left    VBD                                                                                                                                                                  
behind  IN                                                                                                                                                                   
                                                                                                                                                                             
?       .                                                                                                                                                                    
                                                                                                                                                                             
?       .                                                                                                                                                                    
                                                                                                                                                                             
That    DT                                                                                                                                                                   
still   RB                                                                                                                                                                   
holds   VBZ                                                                                                                                                                  
.       .                                                                                                                                                                    
                                                                                                                                                                             
We      PRP                                                                                                                                                                  
bring   VBP                                                                                                                                                                  
back    RP                                                                                                                                                                   
all     DT                                                                                                                                                                   
dead    JJ                                                                                                                                                                   
and     CC                                                                                                                                                                   
wounded VBN                                                                                                                                                                  
''      ''                                                                                                                                                                   
.       .                                                                                                                                                                    

$CONLL05/test.brown.gz.parse.sdeps.combined

from applying the paste command to

  • f_converted = $CONLL05/test.brown.gz.parse.sdeps
  • f_pos = $CONLL05/test.brown.gz.parse.sdeps.pos
conll05 200     0       Remember        VB      VB      0       root    _       -       remember        -       -       *       (V*)    *       *       *       *            
conll05 200     1       what    WP      WP      4       dobj    _       -       -       -       -       *       (A1*    (R-A1*) *       *       *                            
conll05 200     2       I       PRP     PRP     4       nsubj   _       -       -       -       -       *       *       (A0*)   *       *       *                            
conll05 200     3       said    VBD     VBD     1       ccomp   _       -       say     -       -       *       *       (V*)    *       *       *                            
conll05 200     4       about   IN      IN      4       prep    _       -       -       -       -       *       *       (A3*    *       *       *                            
conll05 200     5       going   VBG     VBG     5       pcomp   _       -       go      -       -       *       *       *       (V*)    *       *                            
conll05 200     6       out     RP      RP      6       advmod  _       -       -       -       -       *       *       *       (AM-DIR*)       *       *                    
conll05 200     7       to      TO      TO      9       aux     _       -       -       -       -       *       *       *       (AM-PNC*        *       *                    
conll05 200     8       get     VB      VB      6       xcomp   _       -       get     -       -       *       *       *       *       (V*)    *                            
conll05 200     9       anybody NN      NN      9       dobj    _       -       -       -       -       *       *       *       *       (A1*    (A0*)                        
conll05 200     10      left    VBN     VBD     10      vmod    _       -       leave   -       -       *       *       *       *       *       (V*)                         
conll05 200     11      behind  IN      IN      11      advmod  _       -       -       -       -       *       *)      *)      *)      *)      (AM-ADV*)                    
conll05 200     12      ?                               -       -       -       -       *       *       *       *       *       *                                            
conll05 200     13      ?       .       .       0       root    _       -       -       -       -       *       *       *       *       *       *                            
                                                                                                                                                                             
conll05 201     0       That    .       .       0       root    _       -       -       -       -       *       (A1*)                                                        
conll05 201     1       still                           -       -       -       -       *       (AM-TMP*)                                                                    
conll05 201     2       holds   DT      DT      3       nsubj   _       -       hold    -       -       *       (V*)                                                         
conll05 201     3       .       RB      RB      3       advmod  _       -       -       -       -       *       *                                                            
        VBZ     VBZ     0       root    _                                                                                                                                    
conll05 202     0       We      .       .       3       punct   _       -       -       -       -       *       (A0*)                                                        
conll05 202     1       bring                           -       bring   -       -       *       (V*)                                                                         
conll05 202     2       back    PRP     PRP     2       nsubj   _       -       -       -       -       *       (AM-DIR*)                                                    
conll05 202     3       all     VBP     VBP     0       root    _       -       -       -       -       *       (A1*                                                         
conll05 202     4       dead    RB      RP      2       advmod  _       -       -       -       -       *       *                                                            
conll05 202     5       and     DT      DT      2       dobj    _       -       -       -       -       *       *                                                            
conll05 202     6       wounded JJ      JJ      4       amod    _       -       -       -       -       *       *)                                                           
conll05 202     7       ''      CC      CC      5       cc      _       -       -       -       -       *       *                                                            
conll05 202     8       .       VBN     VBN     5       conj    _       -       -       -       -       *       *                                                            
        ''      ''      2       punct   _

Well that was a long read! We see the problem here with the two lines containing,

VBZ     VBZ     0       root    _

and

''      ''      2       punct   _ 

This pattern continues to cause problems further down the file. Wondering if anyone else ran into this problem and found a solution?

@strubell ?

galtay, Jun 07 '19 17:06

So, I think I figured out my mistake. It turns out that the syntax generated by lines like this,

cat $BROWN/$section/$section$subsection.mrg \
    | awk '!/^\*x\*/ {print}' \
    | $SRLCONLL/bin/wsj-removetraces.pl \
    | $SRLCONLL/bin/wsj-to-se.pl -w 0 \
    >> $STRUBELL18/$DATA_SEGMENT/synt/$DATA_SEGMENT.synt

is what causes the problems in the first place. If one instead uses a syntax parse provided in the CONLL05 data set (synt.cha, for example), the extra lines are not introduced. Hopefully someone out there finds this useful!
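For anyone who wants to check whether their own files have this problem before running paste, here is a minimal sketch (not part of the preprocessing scripts). It assumes both inputs have one token per line with a blank line between sentences, and reports the first line where one input has a sentence break and the other does not. The function name `first_misalignment` and the toy data are made up for illustration.

```python
from itertools import zip_longest


def first_misalignment(lines_a, lines_b):
    """Return the 1-based line number where the two inputs first disagree
    about sentence boundaries (blank vs. non-blank), or None if they agree."""
    for lineno, (a, b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if a is None or b is None:
            return lineno  # one input ran out of lines before the other
        if (a.strip() == "") != (b.strip() == ""):
            return lineno  # blank line on one side only
    return None


# Toy data imitating the "? ?" case: the second input has an extra
# blank line between the two question marks.
props = ["behind\t*", "?\t*", "?\t*", "", "That\t(A1*)"]
parse = ["behind\tIN", "?\t.", "", "?\t.", ""]
print(first_misalignment(props, parse))
```

On real data the two arguments can be open file handles (for example from `gzip.open(path, "rt")` for the `.gz` files), since the function only iterates over lines.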

galtay, Jun 19 '19 03:06

Hello,

I have the same problem with the Brown test set. Did you resolve it?

Thanks

alirezamshi-zz, Nov 14 '20 11:11