quality_test/contains doesn't function

Open dvirnimrod opened this issue 10 months ago • 4 comments

How to reproduce the behaviour

I tried to set new quality thresholds as specified in the documentation (using "set_quality_thresholds"). When I run a quality test, I see that all the fields I tried to change did change, except for the "contains" field, which remains {"lorem_ipsum": False}. I tried several workarounds but couldn't manage to change this specific test, no matter how simple the dictionary I tried to replace it with. Moreover, I ran the default test on a text containing "lorem_ipsum" and it passed, so nothing works (for me) with this test... Am I missing something?

Your Environment

  • textdescriptives Version Used: 2.8.0
  • Operating System: macOS 13.6
  • Python Version Used: 3.10.13

dvirnimrod avatar Apr 25 '24 16:04 dvirnimrod

Can you add a code snippet that reproduces the behaviour?

HLasse avatar Apr 26 '24 06:04 HLasse

@dvirnimrod when I try to reproduce the stated behaviour, I get the following (Python 3.8):

import textdescriptives as td

td.__version__
# 2.8.0
import spacy
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
quality_pipe = nlp.add_pipe("textdescriptives/quality")
docs = nlp.pipe(["lorem ipsum"])
doc = next(docs)
doc._.passed_quality_check
# False
doc._.quality
# QualityOutput(
#     passed=False, ...
#     contains={'lorem ipsum': ThresholdsOutput(value=1.0, passed=False, threshold=False)}, ...

KennethEnevoldsen avatar Apr 29 '24 08:04 KennethEnevoldsen

Hi, thanks for the quick response!

Here's a code snippet for example:

import textdescriptives as td
import spacy
from spacy.cli import download

QUALITY_THRESHOLDS = td.QualityThresholds(
    n_stop_words=(None, None),
    alpha_ratio=(0.6, None),
    mean_word_length=(3, 10),
    doc_length=(1, 1000),
    symbol_to_word_ratio={"@": (None, 0.3)},
    proportion_ellipsis=(None, None),
    proportion_bullet_points=(None, 0.7),
    contains={"fake": False},
    duplicate_line_chr_fraction=(None, 0.2),
    duplicate_paragraph_chr_fraction=(None, 0.2),
    duplicate_ngram_chr_fraction={
        "5": (None, 0.15),
        "6": (None, 0.14),
        "7": (None, 0.13),
        "8": (None, 0.12),
        "9": (None, 0.11),
        "10": (None, 0.1),
    },
    top_ngram_chr_fraction={"2": (None, 0.2), "3": (None, 0.18), "4": (None, 0.16)},
    oov_ratio=(None, 0.3)
)

download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")
quality_pipe = nlp.add_pipe("textdescriptives/quality")
quality_pipe.set_quality_thresholds(QUALITY_THRESHOLDS)

text = "This is fake @@@@@"
doc = nlp(text)
print(doc._.quality)

And here's the output:

passed=True 
	n_stop_words=ThresholdsOutput(value=2.0, passed=True, threshold=(None, None)) 
	alpha_ratio=ThresholdsOutput(value=0.75, passed=True, threshold=(0.6, None)) 
	mean_word_length=ThresholdsOutput(value=3.75, passed=True, threshold=(3.0, 10.0)) 
	doc_length=ThresholdsOutput(value=4.0, passed=True, threshold=(1.0, 1000.0)) 
	symbol_to_word_ratio={'#': ThresholdsOutput(value=0.0, passed=True, threshold=None)} 
	proportion_ellipsis=ThresholdsOutput(value=0.0, passed=True, threshold=(None, None)) 
	proportion_bullet_points=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.7)) 
	contains={'lorem ipsum': ThresholdsOutput(value=0.0, passed=True, threshold=None)} 
	duplicate_line_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)) 
	duplicate_paragraph_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)) 
	duplicate_ngram_chr_fraction={'5': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.15)), '6': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.14)), '7': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.13)), '8': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.12)), '9': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.11)), '10': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.1))} 
	top_ngram_chr_fraction={'2': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)), '3': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.18)), '4': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.16))} 
	oov_ratio=ThresholdsOutput(value=0.25, passed=True, threshold=(None, 0.3))

As you can see, the other attributes that I set are updated to the new values (like "alpha_ratio" and "doc_length"), but "contains" and "symbol_to_word_ratio" are not: they still report the keys 'lorem ipsum' and '#' instead of "fake" and "@".
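To make the mismatch explicit, here is a small sketch that compares the keys passed to td.QualityThresholds with the keys that show up in the printed output. The two dictionaries are copied by hand from the snippet and output above, and `stale_fields` is a hypothetical helper written for this comparison, not part of textdescriptives.

```python
# Keys passed to td.QualityThresholds in the snippet above (copied by hand).
configured = {
    "contains": {"fake"},
    "symbol_to_word_ratio": {"@"},
}

# Keys that appear in the printed doc._.quality output above (copied by hand).
reported = {
    "contains": {"lorem ipsum"},
    "symbol_to_word_ratio": {"#"},
}


def stale_fields(configured, reported):
    """Return the fields whose reported keys differ from the configured ones.

    Hypothetical helper for illustration only.
    """
    return sorted(
        name for name in configured if configured[name] != reported.get(name)
    )


mismatches = stale_fields(configured, reported)
print(mismatches)
```

Both dictionary-valued fields end up in the list, which suggests the issue is specific to how the keyed checks are stored rather than to "contains" alone.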

dvirnimrod avatar May 05 '24 08:05 dvirnimrod

Hi @dvirnimrod. td.QualityThresholds has defaults for these. You can disable them, e.g. by setting:

    ...
    contains = {} # nothing should be checked
    symbol_to_word_ratio = {} 
    ...

Edit: Ah, sorry, it seems I misread the code. @HLasse caught it, though.

KennethEnevoldsen avatar May 06 '24 13:05 KennethEnevoldsen

Ah, I see. It seems that .set_quality_thresholds updates the thresholds correctly but does not update self.contains and self.symbols (which it should). I'll take a look.
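For anyone reading along, the pattern can be sketched with a toy class. This is only an illustration of the kind of bug described, not the textdescriptives source: `ToyThresholds`, `ToyQuality`, and both setter names are hypothetical, and the assumption is that the component caches the list of strings to check when it is constructed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyThresholds:
    """Hypothetical stand-in for td.QualityThresholds with a single field."""

    contains: Dict[str, bool] = field(
        default_factory=lambda: {"lorem ipsum": False}
    )


class ToyQuality:
    """Hypothetical component; not the textdescriptives implementation."""

    def __init__(self) -> None:
        self.thresholds = ToyThresholds()
        # Derived attribute, cached once from the default thresholds.
        self.contains: List[str] = list(self.thresholds.contains)

    def set_thresholds_buggy(self, thresholds: ToyThresholds) -> None:
        # Replaces the thresholds but leaves the cached key list stale.
        self.thresholds = thresholds

    def set_thresholds_fixed(self, thresholds: ToyThresholds) -> None:
        self.thresholds = thresholds
        # Refresh the derived attribute together with the thresholds.
        self.contains = list(thresholds.contains)

    def checked_strings(self) -> List[str]:
        # The strings the component would actually look for in a text.
        return self.contains


component = ToyQuality()

component.set_thresholds_buggy(ToyThresholds(contains={"fake": False}))
buggy_result = component.checked_strings()

component.set_thresholds_fixed(ToyThresholds(contains={"fake": False}))
fixed_result = component.checked_strings()

print(buggy_result, fixed_result)
```

With the buggy setter the component keeps checking for the default 'lorem ipsum' key, which matches the output reported above; refreshing the cached attribute in the setter makes the new "fake" key take effect.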

EDIT: Fixed in #353

HLasse avatar May 07 '24 09:05 HLasse

Great! Thank you guys :)

dvirnimrod avatar May 09 '24 18:05 dvirnimrod