[QUESTION] How can I make the tregex method faster?
Hello, I have multiple Tregex patterns and want to use the CoreNLPClient.tregex() method to get matching results for sentences. Do I need to call tregex() once per pattern, or can I pass all patterns at once? Additionally, if I provide the trees parameter to the tregex() method, will it skip parsing and so run faster?
I would really appreciate it if someone could answer my question. Here is my code:
import stanza
from stanza.server import CoreNLPClient
from collections import defaultdict
import math


class SyntacticComplexityAnalyzer:
    def __init__(self):
        self.stats = defaultdict(int)

    def analyze_text(self, text, client):
        """Analyze the given text and compute all syntactic complexity measures"""
        # Reset statistics
        self.stats = defaultdict(int)
        # First parse the text to get the parse trees
        ann = client.annotate(text)
        # Count words (excluding punctuation)
        self.count_words(ann)
        # Count sentences
        self.count_sentences(ann)
        # Count clauses (C)
        self.count_clauses(client, text, ann)
        # Count dependent clauses (DC)
        self.count_dependent_clauses(client, text, ann)
        # Count T-units (T)
        self.count_tunits(client, text, ann)
        for measure, value in self.stats.items():
            print(f"{measure}: {value:.3f}")

    def count_words(self, ann):
        """Count words excluding punctuation"""
        for sentence in ann.sentence:
            for token in sentence.token:
                # Exclude punctuation
                if not any(cat in token.pos for cat in ['.', ',', ':', "''", "``", '-', '(', ')']):
                    self.stats['words'] += 1

    def count_sentences(self, ann):
        """Count sentences"""
        self.stats['S'] = len(ann.sentence)

    def count_clauses(self, client, text, ann):
        """Count clauses using Tregex pattern: S|SINV|SQ < (VP <# MD|VBD|VBP|VBZ)"""
        trees = [sen.parseTree for sen in ann.sentence]
        pattern = 'S|SINV|SQ < (VP <# MD|VBD|VBP|VBZ)'
        matches = client.tregex(text, pattern, trees)
        self.stats['C'] = sum(len(sent_matches) for sent_matches in matches['sentences'])
        # Also count sentence fragments (FRAG > ROOT !<< VP)
        frag_pattern = 'FRAG > ROOT !<< VP'
        frag_matches = client.tregex(text, frag_pattern, trees)
        self.stats['C'] += sum(len(sent_matches) for sent_matches in frag_matches['sentences'])

    def count_dependent_clauses(self, client, text, ann):
        """Count dependent clauses using Tregex pattern: SBAR < (S|SINV|SQ < (VP <# MD|VBD|VBP|VBZ))"""
        trees = [sen.parseTree for sen in ann.sentence]
        pattern = 'SBAR < (S|SINV|SQ < (VP <# MD|VBD|VBP|VBZ))'
        matches = client.tregex(text, pattern, trees)
        self.stats['DC'] = sum(len(sent_matches) for sent_matches in matches['sentences'])

    def count_tunits(self, client, text, ann):
        """Count T-units using Tregex pattern: S|SBARQ|SINV|SQ > ROOT | [ $- S|SBARQ|SINV|SQ !>> SBAR|VP ]"""
        trees = [sen.parseTree for sen in ann.sentence]
        pattern = 'S|SBARQ|SINV|SQ > ROOT | [ $- S|SBARQ|SINV|SQ !>> SBAR|VP ]'
        matches = client.tregex(text, pattern, trees)
        self.stats['T'] = sum(len(sent_matches) for sent_matches in matches['sentences'])
        # Also count sentence fragments (FRAG > ROOT)
        frag_pattern = 'FRAG > ROOT'
        frag_matches = client.tregex(text, frag_pattern, trees)
        self.stats['T'] += sum(len(sent_matches) for sent_matches in frag_matches['sentences'])


if __name__ == "__main__":
    text = """We use it when a girl in our dorm is acting like a spoiled child.
Saving energy is really important. I know you like to read."""
    analyzer = SyntacticComplexityAnalyzer()
    with CoreNLPClient(
        annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse'],
        timeout=30000,
        memory='16G',
        be_quiet=True,
        max_char_length=100000,
        threads=8
    ) as client:
        analyzer.analyze_text(text, client)
Yes, it is possible to send in trees rather than raw text, in which case the CoreNLP side won't have to parse the text. You could do this either by reading trees with stanza.models.constituency.tree_reader or by parsing raw text with the Stanza constituency annotator, which is significantly more accurate than the CoreNLP constituency annotator.
If you have a specific use case for tregex that isn't covered by the basic URL endpoint, please let us know. I've been looking for a use case for this tool; so far I've only ever needed to use semgrex, ssurgeon, or tsurgeon from Python.
Thanks for your reply! I use the constituency annotator to parse the raw text, and construct trees like this:
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency')
doc = nlp(text)
trees = [sentence.constituency for sentence in doc.sentences]
pattern = 'S|SINV|SQ < (VP <# MD|VBD|VBP|VBZ)'
matches = client.tregex(pattern, trees)
When I try to pass these trees to client.tregex(), I get this error: stanza.server.client.AnnotationException: edu.stanford.nlp.trees.tregex.TregexParseException: Could not parse (ROOT (S (NP (PRP I)) (VP (VBP know) (SBAR (S (NP (PRP you)) (VP (VBP like) (S (VP (TO to) (VP (VB read)))))))) (. .)))
Am I constructing the trees the wrong way?
The trees are fine. You need to pass them as a keyword argument:
matches = client.tregex(pattern, trees=trees)
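For the multiple-patterns part of the question, one option is to parse once and reuse the same trees for every pattern. Below is a minimal sketch of that idea, not something confirmed by the maintainers. The helper names count_matches and count_patterns are hypothetical, and run_tregex is an assumed callable that wraps whatever tregex call works for you. The response shape (a 'sentences' list with one collection of matches per sentence) is taken from the code in the question.

```python
from typing import Any, Callable, Dict, Sequence


def count_matches(response: Dict[str, Any]) -> int:
    """Sum the matches over all sentences, as the question's code does."""
    return sum(len(sent_matches) for sent_matches in response['sentences'])


def count_patterns(
    run_tregex: Callable[[str, Sequence[Any]], Dict[str, Any]],
    patterns: Dict[str, str],
    trees: Sequence[Any],
) -> Dict[str, int]:
    """Run each named pattern against the same pre-parsed trees.

    run_tregex is a hypothetical wrapper supplied by the caller; it
    receives (pattern, trees) and returns the tregex response dict.
    """
    return {
        name: count_matches(run_tregex(pattern, trees))
        for name, pattern in patterns.items()
    }
```

With a live client, the wrapper could follow the call from the reply above, for example run_tregex = lambda pattern, trees: client.tregex(pattern, trees=trees). This still makes one request per pattern, but the text is parsed only once.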