qalsadi
Add lemmatization confidence scoring (e.g., based on frequency or match level).
Why Add Confidence Scoring?
When Qalsadi returns several candidate lemmas, or picks a single one from a ranked list, a confidence score helps with:
| Use Case | Benefit |
|---|---|
| Downstream NLP (e.g., tagging) | Filter or weight based on confidence |
| Human-in-the-loop review | Prioritize uncertain cases |
| Error analysis or benchmarking | Identify weak predictions |
How Confidence Might Be Computed
1. By Frequency Rank (if Qalsadi uses solution ordering)
- If the lemma is taken from `solutions[0]`, it is more confident.
- Confidence could be a function of:
  - Frequency
  - Rank in the list
  - Relative score compared to other solutions (if available)
2. By Match Quality
- Use heuristics like:
  - Exact surface match: high
  - Rule-based lemma: medium
  - Fallback default: low
3. By Source
- Lexicon match > rule > guess
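The rank and source ideas above can be combined in a few lines. The following is only a sketch: the tier names, the weights in `SOURCE_WEIGHT`, and the helper names `rank_factor` and `combined_confidence` are assumptions made for illustration, not values or functions taken from Qalsadi.

```python
# Hypothetical tier weights; the numbers are assumptions, not Qalsadi values.
SOURCE_WEIGHT = {"lexicon": 1.0, "rule": 0.6, "guess": 0.3}


def rank_factor(rank, step=0.1, floor=0.1):
    """Linear decay by position in the solution list (rank 0 is the best)."""
    return max(1.0 - rank * step, floor)


def combined_confidence(rank, source):
    """Multiply the rank factor by the weight of the source tier."""
    # Unknown sources are treated like the weakest tier.
    weight = SOURCE_WEIGHT.get(source, SOURCE_WEIGHT["guess"])
    return round(rank_factor(rank) * weight, 3)


print(combined_confidence(0, "lexicon"))  # 1.0
print(combined_confidence(1, "rule"))     # 0.54
print(combined_confidence(0, "guess"))    # 0.3
```

Multiplying the two factors keeps the result in the 0 to 1 range and lets either a low rank or a weak source pull the score down. A weighted sum would work as well if the signals should be tuned independently.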
Example Output

    [
      {
        "word": "المعلمون",
        "lemma": "معلم",
        "pos": "noun",
        "confidence": 0.92
      },
      {
        "word": "كتبوا",
        "lemma": "كتب",
        "pos": "verb",
        "confidence": 0.78
      }
    ]

How to Add (Conceptually)
In your lemmatizer:

    solutions = self.get_solutions(word)
    if not solutions:
        # Nothing to rank: return the surface form with zero confidence.
        return {"lemma": word, "confidence": 0.0}
    best = solutions[0]
    confidence = self._compute_confidence(solutions)
    return {"lemma": best["lemma"], "confidence": confidence}

Where `_compute_confidence()` could be something like:

    def _compute_confidence(self, solutions):
        if not solutions:
            return 0.0
        # Inverse of the number of candidates: more competing solutions, lower confidence.
        return max(0.1, 1.0 / len(solutions))

Or assign weights based on frequency information, if available.
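If frequency information is available, one simple option is to use the top solution's share of the total frequency. The sketch below assumes each solution is a dict carrying a numeric `freq` key. That key and the function name `frequency_confidence` are hypothetical and not part of Qalsadi's current output.

```python
def frequency_confidence(solutions):
    """Share of the top solution's frequency among all candidates.

    Assumes each solution is a dict with a numeric "freq" key (hypothetical).
    """
    if not solutions:
        return 0.0
    # Missing or negative frequencies are treated as zero.
    freqs = [max(float(s.get("freq", 0)), 0.0) for s in solutions]
    total = sum(freqs)
    if total == 0:
        # No frequency data: fall back to a uniform split over the candidates.
        return 1.0 / len(solutions)
    return freqs[0] / total


candidates = [
    {"lemma": "lemma_a", "freq": 90},
    {"lemma": "lemma_b", "freq": 10},
]
print(frequency_confidence(candidates))  # 0.9
```

Unlike the inverse-count heuristic, this separates a dominant reading from a near tie even when both words have the same number of candidates.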
Note
Qalsadi would need either:
- Frequency metadata in its dictionaries (or ranked solutions), or
- Heuristics to infer confidence from the rule or source that produced the lemma
TODO List: Lemmatization Confidence Scoring
Data & Structure
- [ ] Review how `lemmatizer.py` selects lemmas from solutions.
- [ ] Identify where frequency or ranking of solutions is available.
- [ ] Decide where to insert the confidence score in the output structure.
Confidence Strategy
- [ ] Define a scoring model based on:
  - [ ] Solution rank (e.g. top solution = 1.0, second = 0.8, etc.)
  - [ ] Optional: Match type (lexicon match, rule, fallback)
  - [ ] Optional: Frequency (if available in dictionary)
- [ ] Implement `compute_confidence(solutions: list) -> float`
Code Integration
- [ ] Modify the lemmatization result to include a `"confidence"` field.
- [ ] Ensure scoring does not affect lemma selection logic; it only adds metadata.
- [ ] Add an optional flag `--with-confidence` (CLI) or `confidence=True` (API)
Output
- [ ] Update `ResultFormatter` to include `confidence` if available.
- [ ] Add support to `--profile lemmas` and `main` to show it.
- [ ] Ensure `as_json()`, `as_csv()` etc. handle the field.
Testing
- [ ] Write unit tests for `compute_confidence()`.
- [ ] Add example tests showing output with confidence scores.
Documentation
- [ ] Document the scoring method in `README.md` or the CLI `--help`.
- [ ] Provide an example with JSON including `confidence`.
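The Testing items in the list above could start from a small `unittest` case like the sketch below. The standalone `compute_confidence` is a stand-in so the example runs on its own. Its `rank`, `step` and `floor` parameters are assumptions, and the real tests would target the method on the lemmatizer instead.

```python
import unittest


def compute_confidence(solutions, rank=0, step=0.1, floor=0.1):
    """Stand-in for the method under test: linear decay by rank."""
    if not solutions:
        return 0.0
    return max(1.0 - rank * step, floor)


class ComputeConfidenceTest(unittest.TestCase):
    def test_empty_solutions_score_zero(self):
        self.assertEqual(compute_confidence([]), 0.0)

    def test_top_solution_scores_one(self):
        self.assertEqual(compute_confidence([{"lemma": "x"}]), 1.0)

    def test_score_never_drops_below_floor(self):
        self.assertEqual(compute_confidence([{"lemma": "x"}], rank=50), 0.1)

    def test_lower_rank_scores_lower(self):
        sols = [{"lemma": "x"}, {"lemma": "y"}]
        self.assertAlmostEqual(compute_confidence(sols, rank=1), 0.9)


# Run the suite in-process so the script does not call sys.exit().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ComputeConfidenceTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```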
Function: compute_confidence()
Place this inside your `Lemmatizer` class (or wherever solutions are ranked):

    def compute_confidence(self, solutions, rank=0):
        """
        Compute a confidence score based on solution rank.

        - Top-ranked solution (rank 0): 1.0
        - Next ranks: linearly decreasing (e.g., 0.9, 0.8, ...)
        - Minimum threshold: 0.1
        - No solutions: 0.0
        """
        if not solutions:
            return 0.0
        max_conf = 1.0
        step = 0.1  # decrease per rank
        return max(max_conf - (rank * step), 0.1)

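For the top solution (rank 0) this returns 1.0 whenever any solution exists, so rank alone does not separate strong matches from weak ones. To differentiate, you can also modify it to assign lower scores to fallback or rule-based results.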
Integration (Minimal)
In your lemmatizer method (e.g., `lemmatize_text` or `lemmatize_word`):

    results = []
    for word in words:
        solutions = self.analyze(word)
        if not solutions:
            # No analysis: keep the surface form and score it 0.0.
            lemma, pos, conf = word, None, 0.0
        else:
            best = solutions[0]
            lemma = best.get("lemma", word)
            pos = best.get("pos")
            conf = self.compute_confidence(solutions)
        results.append({
            "word": word,
            "lemma": lemma,
            "pos": pos,
            "confidence": conf,
        })

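To try the integration loop without a full analyzer, it can be run against a stub. In the sketch below, `fake_analyze` and `lemmatize_words` are hypothetical names, and the lookup table stands in for whatever the real analysis step returns.

```python
def fake_analyze(word):
    """Hypothetical stand-in for the analyzer; returns ranked solution dicts."""
    table = {
        "كتبوا": [{"lemma": "كتب", "pos": "verb"}],
    }
    return table.get(word, [])


def lemmatize_words(words, analyze=fake_analyze):
    """Attach lemma, POS and a rank-only confidence to each word."""
    results = []
    for word in words:
        solutions = analyze(word)
        best = solutions[0] if solutions else {}
        results.append({
            "word": word,
            "lemma": best.get("lemma", word),
            "pos": best.get("pos"),
            # Rank-only scoring: 1.0 for any hit, 0.0 when nothing was found.
            "confidence": 1.0 if solutions else 0.0,
        })
    return results


for row in lemmatize_words(["كتبوا", "xyz"]):
    print(row)
```

The unknown token falls back to its surface form with a confidence of 0.0, which is the case a human reviewer would want to see first.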
CLI Option
In `argparse`, add:

    parser.add_argument("--with-confidence", action="store_true",
                        help="Include confidence score in lemmatization output")

Then pass `with_confidence=args.with_confidence` to your lemmatizer.
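A self-contained sketch of that wiring follows. The `lemmatize` function here is a placeholder that echoes tokens, and the positional `text` argument is an assumption about the CLI. Only the `--with-confidence` flag comes from the proposal above.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Lemmatize text (sketch).")
    parser.add_argument("text", help="Text to lemmatize")
    parser.add_argument("--with-confidence", action="store_true",
                        help="Include confidence score in lemmatization output")
    return parser


def lemmatize(text, with_confidence=False):
    """Placeholder lemmatizer: echoes each token as its own lemma."""
    rows = [{"word": w, "lemma": w} for w in text.split()]
    if with_confidence:
        for row in rows:
            row["confidence"] = 0.0  # placeholder score
    return rows


# argparse exposes --with-confidence as args.with_confidence.
args = build_parser().parse_args(["--with-confidence", "some text"])
print(lemmatize(args.text, with_confidence=args.with_confidence))
```

Because the flag uses `store_true`, it defaults to `False`, so existing invocations keep their current output.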
Formatter Update
In `ResultFormatter`, check:

    if any("confidence" in result for result in self.results):
        self.used_fields.append("confidence")

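The sketch below shows how that check could feed `as_json()` and `as_csv()`. This `ResultFormatter` is a hypothetical minimal class written for illustration, not Qalsadi's actual implementation. The `BASE_FIELDS` list and the method bodies are assumptions.

```python
import csv
import io
import json


class ResultFormatter:
    """Hypothetical minimal formatter; not the actual Qalsadi class."""

    BASE_FIELDS = ["word", "lemma", "pos"]

    def __init__(self, results):
        self.results = results
        self.used_fields = list(self.BASE_FIELDS)
        # Only expose the column when at least one row carries a score.
        if any("confidence" in result for result in self.results):
            self.used_fields.append("confidence")

    def as_json(self):
        rows = [{f: r.get(f) for f in self.used_fields} for r in self.results]
        return json.dumps(rows, ensure_ascii=False)

    def as_csv(self):
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=self.used_fields,
                                extrasaction="ignore", lineterminator="\n")
        writer.writeheader()
        writer.writerows(self.results)
        return buffer.getvalue()


fmt = ResultFormatter([{"word": "w", "lemma": "l", "pos": "noun", "confidence": 1.0}])
print(fmt.as_csv())
```

Results without a `confidence` key produce the same columns as before, so the field stays pure metadata.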
Output Example

    [
      { "word": "كتبوا", "lemma": "كتب", "pos": "verb", "confidence": 1.0 },
      { "word": "المعلمون", "lemma": "معلم", "pos": "noun", "confidence": 0.9 }
    ]