TextDescriptives
quality_test/contains doesn't function
How to reproduce the behaviour
I am trying to set new quality thresholds, following the documentation (using "set_quality_thresholds"). When I run a quality test, I see that all the fields I tried to change did change, except for the "contains" field, which remains {"lorem_ipsum": False}. I tried a few workarounds and couldn't manage to change this specific test, no matter how simple the dictionary I tried to replace it with. Moreover, I ran the default test on a text containing "lorem_ipsum" and it passed, so nothing works (for me) with this test... Am I missing something?
Your Environment
- textdescriptives Version Used: 2.8.0
- Operating System: macOS 13.6
- Python Version Used: 3.10.13
Can you add a code snippet that reproduces the behaviour?
@dvirnimrod when I try to reproduce the stated behaviour, I get the following (Python 3.8):
import textdescriptives as td
td.__version__
# 2.8.0
import spacy
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
quality_pipe = nlp.add_pipe("textdescriptives/quality")
docs = nlp.pipe(["lorem ipsum"])
doc = next(docs)
doc._.passed_quality_check
# False
doc._.quality
# QualityOutput(
# passed=False, ...
# contains={'lorem ipsum': ThresholdsOutput(value=1.0, passed=False, threshold=False)}, ...
Hi, thanks for the quick response!
Here's a code snippet for example:
import textdescriptives as td
import spacy
from spacy.cli import download
QUALITY_THRESHOLDS = td.QualityThresholds(
    n_stop_words=(None, None),
    alpha_ratio=(0.6, None),
    mean_word_length=(3, 10),
    doc_length=(1, 1000),
    symbol_to_word_ratio={"@": (None, 0.3)},
    proportion_ellipsis=(None, None),
    proportion_bullet_points=(None, 0.7),
    contains={"fake": False},
    duplicate_line_chr_fraction=(None, 0.2),
    duplicate_paragraph_chr_fraction=(None, 0.2),
    duplicate_ngram_chr_fraction={
        "5": (None, 0.15),
        "6": (None, 0.14),
        "7": (None, 0.13),
        "8": (None, 0.12),
        "9": (None, 0.11),
        "10": (None, 0.1),
    },
    top_ngram_chr_fraction={"2": (None, 0.2), "3": (None, 0.18), "4": (None, 0.16)},
    oov_ratio=(None, 0.3),
)
download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")
quality_pipe = nlp.add_pipe("textdescriptives/quality")
quality_pipe.set_quality_thresholds(QUALITY_THRESHOLDS)
text = "This is fake @@@@@"
doc = nlp(text)
print(doc._.quality)
And here's the output:
passed=True
n_stop_words=ThresholdsOutput(value=2.0, passed=True, threshold=(None, None))
alpha_ratio=ThresholdsOutput(value=0.75, passed=True, threshold=(0.6, None))
mean_word_length=ThresholdsOutput(value=3.75, passed=True, threshold=(3.0, 10.0))
doc_length=ThresholdsOutput(value=4.0, passed=True, threshold=(1.0, 1000.0))
symbol_to_word_ratio={'#': ThresholdsOutput(value=0.0, passed=True, threshold=None)}
proportion_ellipsis=ThresholdsOutput(value=0.0, passed=True, threshold=(None, None))
proportion_bullet_points=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.7))
contains={'lorem ipsum': ThresholdsOutput(value=0.0, passed=True, threshold=None)}
duplicate_line_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2))
duplicate_paragraph_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2))
duplicate_ngram_chr_fraction={'5': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.15)), '6': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.14)), '7': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.13)), '8': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.12)), '9': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.11)), '10': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.1))}
top_ngram_chr_fraction={'2': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)), '3': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.18)), '4': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.16))}
oov_ratio=ThresholdsOutput(value=0.25, passed=True, threshold=(None, 0.3))
As you can see, other attributes that I've set are updated to a new value (like "alpha_ratio" and "doc_length"), but the attributes "contains" and "symbol_to_word_ratio" haven't...
Hi @dvirnimrod. td.QualityThresholds has defaults for these. You can disable them, e.g. by setting:
...
contains = {} # nothing should be checked
symbol_to_word_ratio = {}
...
Edit: Aahh sorry, it seems like I misread the code. @HLasse caught it though.
Ah, I see. It seems that .set_quality_thresholds updates the thresholds correctly, but does not set self.contains and self.symbols (which it should). I'll take a look.
EDIT: Fixed in #353
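For anyone landing here later, the kind of bug described above can be sketched in isolation. This is only an illustration with made-up names (Thresholds, BuggyChecker, FixedChecker, set_thresholds and check are all hypothetical), not the actual textdescriptives code or the actual fix: a component caches a derived attribute at construction time, and the setter swaps the thresholds object without refreshing that cache.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Thresholds:
    # Hypothetical stand-in for a thresholds object (not the library's class).
    contains: Dict[str, bool] = field(
        default_factory=lambda: {"lorem ipsum": False}
    )


class BuggyChecker:
    def __init__(self, thresholds: Thresholds) -> None:
        self.thresholds = thresholds
        # Derived attribute, cached once when the component is created.
        self.contains: List[str] = list(thresholds.contains)

    def set_thresholds(self, thresholds: Thresholds) -> None:
        # Bug: the thresholds are replaced, but the cached list is left stale.
        self.thresholds = thresholds

    def check(self, text: str) -> Dict[str, bool]:
        # Only the strings in the cached list are ever searched for.
        return {s: (s in text) for s in self.contains}


class FixedChecker(BuggyChecker):
    def set_thresholds(self, thresholds: Thresholds) -> None:
        self.thresholds = thresholds
        # Fix: refresh the derived attribute together with the thresholds.
        self.contains = list(thresholds.contains)


new_thresholds = Thresholds(contains={"fake": False})

buggy = BuggyChecker(Thresholds())
buggy.set_thresholds(new_thresholds)
fixed = FixedChecker(Thresholds())
fixed.set_thresholds(new_thresholds)

buggy_result = buggy.check("This is fake @@@@@")
fixed_result = fixed.check("This is fake @@@@@")
print(buggy_result)  # {'lorem ipsum': False}
print(fixed_result)  # {'fake': True}
```

The buggy variant still looks for the default string after the update, which matches the symptom in the output above, where "contains" kept reporting 'lorem ipsum' while fields like "alpha_ratio" reflected the new values.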
Great! Thank you guys :)