
Add lemmatization confidence scoring (e.g., based on frequency or match level).

Open linuxscout opened this issue 3 months ago • 3 comments

linuxscout avatar Jul 29 '25 15:07 linuxscout

✅ Why Add Confidence Scoring?

When Qalsadi returns multiple possible lemmas or a single one from ranked options, adding a confidence score helps:

| Use Case | Benefit |
| --- | --- |
| 🧪 Downstream NLP (e.g., tagging) | Filter or weight based on confidence |
| 🧠 Human-in-the-loop review | Prioritize uncertain cases |
| 📈 Error analysis or benchmarking | Identify weak predictions |

🎯 How Confidence Might Be Computed

1. By Frequency Rank (if Qalsadi uses solution ordering)

  • If the lemma is chosen from solutions[0], confidence is higher.

  • Confidence could be a function of:

    • Frequency
    • Rank in list
    • Relative score compared to other solutions (if available)
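
The relative-score idea above can be sketched as the share of total frequency held by the top solution. This is only an illustration: the "freq" key is an assumption, and Qalsadi's solutions may not carry frequency data under that name or at all.

```python
def frequency_confidence(solutions):
    """Share of the total frequency mass held by the top-ranked solution.

    Assumes each solution is a dict with a hypothetical numeric "freq" key
    and that the list is already ordered best-first.
    """
    if not solutions:
        return 0.0
    # Missing or negative frequencies count as zero.
    freqs = [max(float(s.get("freq", 0)), 0.0) for s in solutions]
    total = sum(freqs)
    if total == 0:
        # No usable frequency data: fall back to a uniform share.
        return 1.0 / len(solutions)
    return freqs[0] / total


# Made-up candidates with made-up frequencies.
candidates = [
    {"lemma": "candidate_a", "freq": 900},
    {"lemma": "candidate_b", "freq": 100},
]
print(frequency_confidence(candidates))
```

A dominant top candidate scores close to 1.0, while several equally frequent candidates pull the score down, which is the behaviour the "relative score" bullet asks for.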

2. By Match Quality

  • Use heuristics like:

    • Exact surface match? ✔️ High
    • Rule-based lemma? ⚠️ Medium
    • Fallback default? ❌ Low

3. By Source

  • Lexicon match > rule > guess
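
One way to express the "lexicon > rule > guess" ordering is a small lookup table. The labels, the "source" key, and the weights below are all hypothetical; they only show the shape of the heuristic.

```python
# Hypothetical tiers: the labels and weights are illustrative only.
SOURCE_CONFIDENCE = {
    "lexicon": 1.0,  # exact dictionary match
    "rule": 0.6,     # lemma derived by a morphological rule
    "guess": 0.3,    # fallback default
}


def source_confidence(solution):
    """Map a solution's (assumed) "source" label to a confidence tier."""
    source = solution.get("source", "guess")
    # Unknown labels are treated like the weakest tier.
    return SOURCE_CONFIDENCE.get(source, SOURCE_CONFIDENCE["guess"])


print(source_confidence({"lemma": "x", "source": "rule"}))
```

Keeping the weights in one table makes them easy to tune once real error rates per source are known.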

✅ Example Output

[
  {
    "word": "المعلمين",
    "lemma": "معلم",
    "pos": "noun",
    "confidence": 0.92
  },
  {
    "word": "كتبوا",
    "lemma": "كتب",
    "pos": "verb",
    "confidence": 0.78
  }
]
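
A downstream consumer could use output of this shape to separate confident results from ones that need human review. The sketch below assumes the JSON layout shown above; the 0.8 threshold and the function name are arbitrary choices for illustration.

```python
import json

# Same shape as the example output; the values are illustrative.
raw = """
[
  {"word": "المعلمين", "lemma": "معلم", "pos": "noun", "confidence": 0.92},
  {"word": "كتبوا", "lemma": "كتب", "pos": "verb", "confidence": 0.78}
]
"""


def split_by_confidence(records, threshold=0.8):
    """Return (accepted, needs_review); needs_review is least-confident first."""
    accepted = [r for r in records if r.get("confidence", 0.0) >= threshold]
    review = sorted(
        (r for r in records if r.get("confidence", 0.0) < threshold),
        key=lambda r: r.get("confidence", 0.0),
    )
    return accepted, review


records = json.loads(raw)
accepted, review = split_by_confidence(records)
print(len(accepted), "accepted,", len(review), "queued for review")
```

Records without a "confidence" field are treated as 0.0 here, so they always land in the review queue.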

🔧 How to Add (Conceptually)

In your lemmatizer:

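solutions = self.get_solutions(word)
if not solutions:
    # No analysis found: fall back to the surface form with zero confidence.
    return {"lemma": word, "confidence": 0.0}

best = solutions[0]  # solutions are assumed to be ordered best-first
confidence = self._compute_confidence(solutions)

return {"lemma": best["lemma"], "confidence": confidence}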

Where _compute_confidence() could be something like:

def _compute_confidence(self, solutions):
    if not solutions:
        return 0.0
    # Fewer competing solutions means higher confidence: 1 / number of candidates.
    return max(0.1, 1.0 / len(solutions))

Or assign weights based on frequency info if available.


🚩 Note

Qalsadi would need either:

  • Frequency metadata in its dictionaries (or ranked solutions), or
  • Heuristics to infer confidence from rules or source

linuxscout avatar Jul 29 '25 15:07 linuxscout

✅ TODO List: Lemmatization Confidence Scoring

📦 Data & Structure

  • [ ] Review how lemmatizer.py selects lemmas from solutions.
  • [ ] Identify where frequency or ranking of solutions is available.
  • [ ] Decide where to insert confidence score in the output structure.

🔍 Confidence Strategy

  • [ ] Define a scoring model based on:

    • [ ] ✅ Solution rank (e.g. top solution = 1.0, second = 0.8, etc.)
    • [ ] Optional: Match type (lexicon match, rule, fallback)
    • [ ] Optional: Frequency (if available in dictionary)
  • [ ] Implement compute_confidence(solutions: list) -> float

🛠 Code Integration

  • [ ] Modify lemmatization result to include "confidence" field.
  • [ ] Ensure scoring does not affect lemma selection logic; it only adds metadata.
  • [ ] Add an optional flag --with-confidence (CLI) or confidence=True (API)

✅ Output

  • [ ] Update ResultFormatter to include confidence if available.
  • [ ] Add support to --profile lemmas and main to show it.
  • [ ] Ensure as_json(), as_csv() etc. handle the field.
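
For the serialization items above, a minimal sketch of JSON and CSV output that tolerates rows with or without a confidence value. The function names are hypothetical stand-ins, not Qalsadi's actual ResultFormatter methods.

```python
import csv
import io
import json


def results_to_json(results):
    # ensure_ascii=False keeps Arabic text readable in the output.
    return json.dumps(results, ensure_ascii=False, indent=2)


def results_to_csv(results, fields=("word", "lemma", "pos", "confidence")):
    # Only emit the confidence column when at least one row carries it.
    has_conf = any("confidence" in row for row in results)
    used = [f for f in fields if f != "confidence" or has_conf]
    buffer = io.StringIO()
    # restval fills missing cells; extrasaction="ignore" skips unknown keys.
    writer = csv.DictWriter(buffer, fieldnames=used, restval="",
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


rows = [
    {"word": "w1", "lemma": "l1", "pos": "noun", "confidence": 0.9},
    {"word": "w2", "lemma": "l2", "pos": "verb"},  # no score available
]
print(results_to_csv(rows))
```

Rows that lack a score get an empty cell rather than raising an error, so existing output without confidence keeps working unchanged.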

🧪 Testing

  • [ ] Write unit tests for compute_confidence()
  • [ ] Add example tests showing output with confidence scores.
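
The unit tests could look roughly like this. The module-level compute_confidence() below is a stand-in written for the example, with an assumed rank parameter; the real tests would import the method from the lemmatizer instead.

```python
import unittest


def compute_confidence(solutions, rank=0, step=0.1, floor=0.1):
    """Stand-in for the method under test: 1.0 at rank 0, minus `step` per rank."""
    if not solutions:
        return 0.0
    return max(1.0 - rank * step, floor)


class ComputeConfidenceTest(unittest.TestCase):
    def test_empty_solutions_score_zero(self):
        self.assertEqual(compute_confidence([]), 0.0)

    def test_top_rank_scores_one(self):
        self.assertEqual(compute_confidence([{"lemma": "x"}]), 1.0)

    def test_score_decreases_with_rank(self):
        solutions = [{"lemma": "x"}, {"lemma": "y"}]
        self.assertAlmostEqual(compute_confidence(solutions, rank=1), 0.9)

    def test_score_never_drops_below_floor(self):
        solutions = [{"lemma": "x"}] * 20
        self.assertEqual(compute_confidence(solutions, rank=19), 0.1)
```

Saved as a test module, this can be run with python -m unittest.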

📚 Documentation

  • [ ] Document the scoring method in README.md or CLI --help.
  • [ ] Provide example with JSON including confidence.

linuxscout avatar Jul 29 '25 15:07 linuxscout

✅ Function: compute_confidence()

Place this inside your Lemmatizer class (or wherever solutions are ranked):

def compute_confidence(self, solutions, rank=0):
    """
    Compute a confidence score from the rank of the selected solution.

    - Top-ranked solution (rank 0): 1.0
    - Next ranks: linearly decreasing (0.9, 0.8, ...)
    - Minimum threshold: 0.1
    - No solutions: 0.0
    """
    if not solutions:
        return 0.0

    max_conf = 1.0
    step = 0.1  # decrease per rank
    min_conf = 0.1

    # Round to avoid floating-point noise such as 0.7999999999999999.
    return max(round(max_conf - (rank * step), 2), min_conf)

You can also modify it to assign lower scores to fallback/rule-based results if needed.


✅ Integration (Minimal)

In your lemmatizer method (e.g., lemmatize_text or lemmatize_word):

results = []
for word in words:
    solutions = self.analyze(word)
    if solutions:
        best = solutions[0]  # top-ranked analysis
        lemma = best.get("lemma", word)
        pos = best.get("pos")
        conf = self.compute_confidence(solutions)
    else:
        # No analysis: keep the surface form and flag it with zero confidence.
        lemma, pos, conf = word, None, 0.0

    results.append({
        "word": word,
        "lemma": lemma,
        "pos": pos,
        "confidence": conf
    })

✅ CLI Option

In argparse, add:

parser.add_argument("--with-confidence", action="store_true",
                    help="Include confidence score in lemmatization output")

Then pass with_confidence=args.with_confidence to your lemmatizer.
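
Put together, the flag could be wired through roughly as follows. The lemmatize() function here is a hypothetical placeholder for the real lemmatizer call, and the positional words argument is an assumption about the CLI, not its actual interface.

```python
import argparse
import json


def lemmatize(words, with_confidence=False):
    """Hypothetical stand-in for the real lemmatizer call."""
    results = []
    for word in words:
        record = {"word": word, "lemma": word}  # placeholder lemma
        if with_confidence:
            record["confidence"] = 0.0  # placeholder score
        results.append(record)
    return results


def build_parser():
    parser = argparse.ArgumentParser(description="Lemmatize words (sketch).")
    parser.add_argument("words", nargs="*", help="Words to lemmatize")
    parser.add_argument("--with-confidence", action="store_true",
                        help="Include confidence score in lemmatization output")
    return parser


def main(argv=None):
    # argv=None makes argparse read sys.argv; a list allows direct calls.
    args = build_parser().parse_args(argv)
    results = lemmatize(args.words, with_confidence=args.with_confidence)
    print(json.dumps(results, ensure_ascii=False))
    return results
```

Calling main(["--with-confidence", "كتبوا"]) adds the field to each record, while omitting the flag leaves the output as it was before, which keeps the feature opt-in.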


✅ Formatter Update

In ResultFormatter, check:

if any("confidence" in result for result in self.results):
    self.used_fields.append("confidence")

✅ Output Example

[
  { "word": "كتبوا", "lemma": "كتب", "pos": "verb", "confidence": 1.0 },
  { "word": "المعلمين", "lemma": "معلم", "pos": "noun", "confidence": 0.9 }
]

linuxscout avatar Jul 29 '25 15:07 linuxscout