[QUESTION] How can I make the tregex method faster?

Open 694040837 opened this issue 9 months ago • 5 comments

Hello, I have multiple Tregex patterns and want to use the CoreNLPClient.tregex() method to get the matches for each sentence. Do I need to call tregex() once per pattern, or can I pass all patterns at once? Additionally, if I provide the trees parameter to tregex(), will it skip parsing and run faster?

694040837 avatar Jul 19 '25 17:07 694040837

I would really appreciate it if someone could answer my question. Here is my code:

import stanza
from stanza.server import CoreNLPClient
from collections import defaultdict
import math

class SyntacticComplexityAnalyzer:
    def __init__(self):
        self.stats = defaultdict(int)
    
    def analyze_text(self, text, client):
        """Analyze the given text and compute all syntactic complexity measures"""
        # Reset statistics
        self.stats = defaultdict(int)
        
        # First parse the text to get the parse trees
        ann = client.annotate(text)
        
        # Count words (excluding punctuation)
        self.count_words(ann)
        
        # Count sentences
        self.count_sentences(ann)
        
        # Count clauses (C)
        self.count_clauses(client, text, ann)
        
        # Count dependent clauses (DC)
        self.count_dependent_clauses(client, text, ann)
        
        # Count T-units (T)
        self.count_tunits(client, text, ann)
        

        for measure, value in self.stats.items():
            print(f"{measure}: {value:.3f}")

    
    def count_words(self, ann):
        """Count words excluding punctuation"""
        for sentence in ann.sentence:
            for token in sentence.token:
                # Exclude punctuation
                if not any(cat in token.pos for cat in ['.', ',', ':', "''", "``", '-', '(', ')']):
                    self.stats['words'] += 1
    
    def count_sentences(self, ann):
        """Count sentences"""
        self.stats['S'] = len(ann.sentence)
    
    def count_clauses(self, client, text, ann):
        """Count clauses using Tregex pattern: S|SINV|SQ < (VP <# MD|VBD|VBP|VBZ)"""
        trees = [sen.parseTree for sen in ann.sentence]
        
        pattern = 'S|SINV|SQ < (VP <# MD|VBD|VBP|VBZ)'
        matches = client.tregex(text, pattern, trees)
        self.stats['C'] = sum(len(sent_matches) for sent_matches in matches['sentences'])

        
        # Also count sentence fragments (FRAG > ROOT !<< VP)
        frag_pattern = 'FRAG > ROOT !<< VP'
        frag_matches = client.tregex(text, frag_pattern, trees)
        self.stats['C'] += sum(len(sent_matches) for sent_matches in frag_matches['sentences'])

    
    def count_dependent_clauses(self, client, text, ann):
        """Count dependent clauses using Tregex pattern: SBAR < (S|SINV|SQ < (VP <# MD|VBD|VBP|VBZ))"""
        trees = [sen.parseTree for sen in ann.sentence]

        pattern = 'SBAR < (S|SINV|SQ < (VP <# MD|VBD|VBP|VBZ))'
        matches = client.tregex(text, pattern, trees)
        self.stats['DC'] = sum(len(sent_matches) for sent_matches in matches['sentences'])
    
    def count_tunits(self, client, text, ann):
        """Count T-units using Tregex pattern: S|SBARQ|SINV|SQ > ROOT | [ $- S|SBARQ|SINV|SQ !>> SBAR|VP ]"""
        trees = [sen.parseTree for sen in ann.sentence]
        pattern = 'S|SBARQ|SINV|SQ > ROOT | [ $- S|SBARQ|SINV|SQ !>> SBAR|VP ]'
        matches = client.tregex(text, pattern, trees)
        self.stats['T'] = sum(len(sent_matches) for sent_matches in matches['sentences'])
        
        # Also count sentence fragments (FRAG > ROOT)
        frag_pattern = 'FRAG > ROOT'
        frag_matches = client.tregex(text, frag_pattern, trees)
        self.stats['T'] += sum(len(sent_matches) for sent_matches in frag_matches['sentences'])
    
    


if __name__ == "__main__":
    text = """We use it when a girl in our dorm is acting like a spoiled child. 
    Saving energy is really important. I know you like to read."""
    
    analyzer = SyntacticComplexityAnalyzer()
    
    with CoreNLPClient(
            annotators=['tokenize','ssplit','pos','lemma','ner', 'parse', 'depparse'],
            timeout=30000,
            memory='16G',
            be_quiet=True, 
            max_char_length=100000,
            threads=8               
            ) as client:
        
        
        analyzer.analyze_text(text, client)
        

694040837 avatar Jul 19 '25 17:07 694040837

In fact it is possible to send in trees rather than raw text, in which case the CoreNLP side won't have to parse the text. You could get the trees either by reading them with stanza.models.constituency.tree_reader or by parsing raw text with Stanza's constituency annotator, which is significantly more accurate than the CoreNLP constituency annotator.

If you have a specific use case for tregex that isn't covered by the basic URL endpoint, please let us know. I've been looking for a use case for this tool; so far I've only ever needed to use semgrex, ssurgeon, or tsurgeon from Python.

AngledLuffa avatar Jul 19 '25 20:07 AngledLuffa

Thanks for your reply! I use the constituency annotator to parse the raw text, and construct trees like this:

nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency')
doc = nlp(text)
trees = [sentence.constituency for sentence in doc.sentences]

pattern = 'S|SINV|SQ < (VP <# MD|VBD|VBP|VBZ)'
matches = client.tregex(pattern, trees)

When I try to send these trees to client.tregex(), I get this error:

stanza.server.client.AnnotationException: edu.stanford.nlp.trees.tregex.TregexParseException: Could not parse (ROOT (S (NP (PRP I)) (VP (VBP know) (SBAR (S (NP (PRP you)) (VP (VBP like) (S (VP (TO to) (VP (VB read)))))))) (. .)))

Am I constructing the trees the wrong way?

694040837 avatar Jul 20 '25 11:07 694040837

The trees are fine. You need to pass them with the trees keyword:

matches = client.tregex(pattern, trees=trees)
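
For several patterns, one option is to parse once and reuse the same trees for every call. The following is only a minimal sketch: `count_measures` and `PATTERNS` are hypothetical names, the patterns are copied from the question, and the result layout (`result["sentences"]` with one collection of matches per tree) is assumed from the question's code rather than confirmed here. It still issues one request per pattern, since this thread does not establish whether several patterns can be combined in a single call.

```python
from collections import defaultdict

# Patterns copied from the question's code, grouped by the measure they feed.
PATTERNS = {
    "C": ["S|SINV|SQ < (VP <# MD|VBD|VBP|VBZ)", "FRAG > ROOT !<< VP"],
    "DC": ["SBAR < (S|SINV|SQ < (VP <# MD|VBD|VBP|VBZ))"],
    "T": [
        "S|SBARQ|SINV|SQ > ROOT | [ $- S|SBARQ|SINV|SQ !>> SBAR|VP ]",
        "FRAG > ROOT",
    ],
}


def count_measures(client, trees, patterns=PATTERNS):
    """Hypothetical helper: total Tregex matches per measure over pre-parsed trees."""
    stats = defaultdict(int)
    for measure, measure_patterns in patterns.items():
        for pattern in measure_patterns:
            # Call form taken from the reply above: trees go in by keyword,
            # so no raw text is sent for CoreNLP to parse.
            result = client.tregex(pattern, trees=trees)
            # Assumption: result["sentences"] has one entry per tree, and
            # len() of each entry is the number of matches in that tree.
            stats[measure] += sum(len(m) for m in result["sentences"])
    return dict(stats)
```

Here `client` would be an open CoreNLPClient and `trees` the list built from `sentence.constituency` in the earlier comment, so the text is parsed only once no matter how many patterns are run.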

AngledLuffa avatar Jul 20 '25 13:07 AngledLuffa

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 11 '25 18:12 stale[bot]