mishkal
Improve tashkeel speed
To improve Mishkal's speed, we profiled the source code. Profiling showed that a large share of the runtime was spent in Qalsadi, so Qalsadi was optimized in 2023 and released as version 0.5.0.
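The "Function calls ... (... primitive)" figures reported below have the shape of Python `cProfile` output. As a minimal sketch of how such a profile can be collected, the helper below wraps any callable in `cProfile` and sorts the report by cumulative time. The names `profile_call` and `fake_vocalize` are hypothetical and introduced only for illustration; `fake_vocalize` is a stand-in workload, not Mishkal's API, and should be replaced with the real diacritization call.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, stats, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stats, buffer.getvalue()


def fake_vocalize(text):
    # Hypothetical stand-in workload; swap in the real diacritization call.
    return " ".join(word[::-1] for word in text.split())


if __name__ == "__main__":
    _, stats, report = profile_call(fake_vocalize, "a few sample words " * 1000)
    # The report header lists total function calls and elapsed seconds.
    print(report)
```

Sorting by cumulative time is one reasonable choice for finding a slow dependency, since it surfaces the call chains that account for most of the runtime instead of only the hottest leaf functions.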
Mishkal Diacritization Evaluation Report
Date: 19 July 2025, CET
Evaluation on samples/vocalized/aljazeera.txt
Using Qalsadi 0.4.5
Function calls: 426,394,179 (426,391,468 primitive)
Time: 225.037 seconds
| ID | Fully Correct | Strip Correct | Fully WER | Strip WER | LER | Total | Line Fully Correct | Line Strip Correct |
|---|---|---|---|---|---|---|---|---|
| 1447 | 76.98% | 89.80% | 9601 | 4256 | 12168 | 41716 | 90.00% | 90.00% |
| 1448 | 76.98% | 89.79% | 9614 | 4264 | 12187 | 41756 | 67.50% | 80.00% |
| 1449 | 76.98% | 89.79% | 9614 | 4264 | 12187 | 41756 | n/a | n/a |
Using Qalsadi 0.5.0 (speedup), improved version
Function calls: 333,277,281 (333,274,551 primitive)
Time: 173.994 seconds
| ID | Fully Correct | Strip Correct | Fully WER | Strip WER | LER | Total | Line Fully Correct | Line Strip Correct |
|---|---|---|---|---|---|---|---|---|
| 1447 | 77.62% | 90.38% | 9336 | 4011 | 11792 | 41716 | 90.00% | 90.00% |
| 1448 | 77.61% | 90.38% | 9348 | 4019 | 11809 | 41756 | 70.00% | 80.00% |
| 1449 | 77.61% | 90.38% | 9348 | 4019 | 11809 | 41756 | n/a | n/a |
Evaluation on samples/vocalized/rndlines.txt
Using Qalsadi 0.5.0 (speedup)
Function calls: 7,342,645,429 (7,342,642,699 primitive)
Time: 3428.477 seconds
| ID | Fully Correct | Strip Correct | Fully WER | Strip WER | LER | Total | Line Fully Correct | Line Strip Correct |
|---|---|---|---|---|---|---|---|---|
| 9996 | 73.88% | 87.74% | 252,311 | 127,245 | 357,698 | 1,037,636 | 77.59% | 87.93% |
| 9997 | 73.88% | 87.74% | 252,337 | 127,259 | 357,736 | 1,037,746 | 76.36% | 87.27% |
Using Qalsadi 0.4.5
Function calls: 10,561,428,832 (10,561,426,121 primitive)
Time: 5091.545 seconds
| ID | Fully Correct | Strip Correct | Fully WER | Strip WER | LER | Total | Line Fully Correct | Line Strip Correct |
|---|---|---|---|---|---|---|---|---|
| 9996 | 71.98% | 86.77% | 272,017 | 137,320 | 380,440 | 1,037,636 | 74.14% | 84.48% |
| 9997 | 71.98% | 86.77% | 272,044 | 137,335 | 380,479 | 1,037,746 | 75.45% | 86.36% |
Summary
- Accuracy: Qalsadi 0.5.0 matches or outperforms 0.4.5 on every reported metric across both datasets.
- Speed: Execution time is reduced by about 23% on aljazeera.txt and about 33% on rndlines.txt.
- Efficiency: Function calls drop by about 22% on aljazeera.txt and about 30% on rndlines.txt, in line with the runtime gains.
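The speed and efficiency percentages can be rechecked from the runtimes and call counts reported above. The short sketch below does that arithmetic; the helper name `relative_reduction` and the `RUNS` dictionary are illustrative, and the numbers are copied from the profiling lines of this report.

```python
def relative_reduction(old, new):
    """Percentage by which `new` is lower than `old`."""
    return 100.0 * (old - new) / old


# (Qalsadi 0.4.5, Qalsadi 0.5.0) values copied from the report above.
RUNS = {
    "aljazeera.txt": {
        "time_s": (225.037, 173.994),
        "calls": (426_394_179, 333_277_281),
    },
    "rndlines.txt": {
        "time_s": (5091.545, 3428.477),
        "calls": (10_561_428_832, 7_342_645_429),
    },
}

if __name__ == "__main__":
    for dataset, metrics in RUNS.items():
        for name, (old, new) in metrics.items():
            print(f"{dataset} {name}: -{relative_reduction(old, new):.1f}%")
```

This gives roughly 22.7% and 32.7% for runtime, and roughly 21.8% and 30.5% for function calls.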
Analysis
| Metric | Description |
|---|---|
| Fully Correct | % of tokens where full diacritization matches exactly. |
| Strip Correct | % of tokens correct ignoring short vowels (i.e., stripped tashkeel). |
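| WER | Word Error Rate, reported here as the number of incorrect tokens. |
| LER | Letter Error Rate, reported here as the number of incorrect letters; finer-grained than WER. |
| Total | Total number of tokens evaluated (listed as Total Tokens in the comparison tables). |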
| Function Calls | Profiling data (used for performance analysis). |
| Time (s) | Total runtime in seconds. |
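To show how the count columns relate to the percentage columns, the sketch below recomputes the Fully Correct and Strip Correct percentages for aljazeera.txt from the WER counts and Total. The relation `100 * (1 - errors / total)` is an observation from the reported numbers, not a documented definition of the evaluation script; it reproduces the aljazeera.txt rows, while the rndlines.txt rows do not reproduce exactly under it. The function name `correct_percent` is hypothetical.

```python
def correct_percent(errors, total):
    """Share of tokens without an error, as a percentage."""
    return 100.0 * (1 - errors / total)


if __name__ == "__main__":
    total = 41756  # Total column, aljazeera.txt, last row
    # Error counts for Qalsadi 0.4.5 and 0.5.0, copied from the tables.
    for version, fully_wer, strip_wer in [("0.4.5", 9614, 4264), ("0.5.0", 9348, 4019)]:
        fully = correct_percent(fully_wer, total)
        strip = correct_percent(strip_wer, total)
        print(f"Qalsadi {version}: fully {fully:.2f}%, strip {strip:.2f}%")
```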
Results Summary & Comparison
Dataset: aljazeera.txt
| Metric | Qalsadi 0.4.5 | Qalsadi 0.5.0 | Improvement |
|---|---|---|---|
| Fully Correct (%) | 76.98% | 77.61% | +0.63 points |
| Strip Correct (%) | 89.79% | 90.38% | +0.59 points |
| Fully WER | 9614 | 9348 | lower (-266) |
| Strip WER | 4264 | 4019 | lower (-245) |
| LER | 12187 | 11809 | lower (-378) |
| Total Tokens | 41756 | 41756 | same |
| Runtime (sec) | 225.037 | 173.994 | faster (-22.7%) |
| Function Calls | 426M | 333M | fewer (-21.8%) |
Insight:
Qalsadi 0.5.0 provides both higher accuracy and better performance on aljazeera.txt, showing it is a clear improvement.
Dataset: rndlines.txt
| Metric | Qalsadi 0.4.5 | Qalsadi 0.5.0 | Improvement |
|---|---|---|---|
| Fully Correct (%) | 71.98% | 73.88% | +1.90 points |
| Strip Correct (%) | 86.77% | 87.74% | +0.97 points |
| Fully WER | 272044 | 252337 | lower (-19707) |
| Strip WER | 137335 | 127259 | lower (-10076) |
| LER | 380479 | 357736 | lower (-22743) |
| Total Tokens | 1037746 | 1037746 | same |
| Runtime (sec) | 5091.545 | 3428.477 | faster (-32.7%) |
| Function Calls | 10.56B | 7.34B | fewer (-30.5%) |
Insight: Qalsadi 0.5.0 shows clear improvements in both speed and quality, particularly reducing word and letter errors.
Overall Conclusion
| Aspect | Result |
|---|---|
| Accuracy | Qalsadi 0.5.0 improved Fully Correct by 0.6 to 1.9 percentage points. |
| Speed | Processing time reduced by ~23% on aljazeera.txt and ~33% on rndlines.txt. |
| Efficiency | Function calls reduced by about 22% on aljazeera.txt and about 30% on rndlines.txt, consistent with the shorter runtimes. |
| Stability | Improvements point in the same direction on both datasets. |
Recommendations
- Adopt Qalsadi 0.5.0 as the default, since it is better in both accuracy and speed.
- Visualize metrics across versions to track progress.
- Break down WER/LER by error type for fine-grained evaluation.
- Add a benchmark suite to the repo to automate these comparisons.
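As one possible starting point for automating such comparisons, the sketch below builds a markdown comparison table from two result dictionaries. It is a minimal illustration, not part of Mishkal or Qalsadi: the function `comparison_table`, its parameters, and the dictionary layout are all assumptions made for this example.

```python
def comparison_table(old, new, old_label, new_label, higher_is_better):
    """Return a markdown table comparing two runs metric by metric.

    `old` and `new` map metric names to numbers; `higher_is_better` is the
    set of metric names where an increase counts as an improvement.
    """
    lines = [
        f"| Metric | {old_label} | {new_label} | Change |",
        "|---|---|---|---|",
    ]
    for metric, old_value in old.items():
        new_value = new[metric]
        delta = new_value - old_value
        if delta == 0:
            verdict = "same"
        else:
            improved = (delta > 0) == (metric in higher_is_better)
            verdict = "better" if improved else "worse"
        lines.append(
            f"| {metric} | {old_value} | {new_value} | {delta:+g} ({verdict}) |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Values copied from the aljazeera.txt comparison in this report.
    old = {"Fully Correct (%)": 76.98, "Fully WER": 9614, "Total Tokens": 41756}
    new = {"Fully Correct (%)": 77.61, "Fully WER": 9348, "Total Tokens": 41756}
    print(comparison_table(old, new, "Qalsadi 0.4.5", "Qalsadi 0.5.0",
                           higher_is_better={"Fully Correct (%)"}))
```

Passing the direction of each metric explicitly keeps the verdict column correct for both accuracy percentages (higher is better) and error counts or runtimes (lower is better).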