
Improve tashkeel speed

Open linuxscout opened this issue 5 months ago • 2 comments

The goal is to improve Mishkal's speed by profiling the source code. Profiling showed that Qalsadi was a time-consuming component, so we improved Qalsadi in 2023 and released the result as version 0.5.0.
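The "Function calls: N (M primitive)" lines in the report below look like output from Python's cProfile. As a minimal sketch of how such a profile could be collected, the following uses only the standard library. `diacritize_stub` is a hypothetical stand-in, not Mishkal's API; the real diacritization call would be swapped in.

```python
import cProfile
import io
import pstats


def diacritize_stub(text: str) -> str:
    """Hypothetical stand-in for the real diacritization call (not Mishkal's API)."""
    return " ".join(word.strip() for word in text.split())


def profile_call(func, *args, top: int = 10) -> str:
    """Run func under cProfile and return a text report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "cumulative" puts functions that dominate total time (including callees) first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_call(diacritize_stub, "sample text " * 1000))
```

The first line of the report gives the total and primitive call counts plus elapsed time. Sorting by cumulative time is one way to spot a dependency that dominates the run.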

linuxscout · Jul 19 '25 15:07

📊 Mishkal Diacritization Evaluation Report

🗓️ Date: July 19, 2025 – CET


🧪 Evaluation on samples/vocalized/aljazeera.txt

🔹 Using Qalsadi 0.4.5

Function calls: 426,394,179 (426,391,468 primitive)
Time: 225.037 seconds

| ID | Fully Correct | Strip Correct | Fully WER | Strip WER | LER | Total | Line Fully Correct | Line Strip Correct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1447 | 76.98% | 89.80% | 9601 | 4256 | 12168 | 41716 | 90.00% | 90.00% |
| 1448 | 76.98% | 89.79% | 9614 | 4264 | 12187 | 41756 | 67.50% | 80.00% |
| 1449 | 76.98% | 89.79% | 9614 | 4264 | 12187 | 41756 | n/a | n/a |

🔹 Using Qalsadi 0.5.0 (speedup) – Improved Version

Function calls: 333,277,281 (333,274,551 primitive)
Time: 173.994 seconds

| ID | Fully Correct | Strip Correct | Fully WER | Strip WER | LER | Total | Line Fully Correct | Line Strip Correct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1447 | 77.62% | 90.38% | 9336 | 4011 | 11792 | 41716 | 90.00% | 90.00% |
| 1448 | 77.61% | 90.38% | 9348 | 4019 | 11809 | 41756 | 70.00% | 80.00% |
| 1449 | 77.61% | 90.38% | 9348 | 4019 | 11809 | 41756 | n/a | n/a |

🧪 Evaluation on samples/vocalized/rndlines.txt

🔹 Using Qalsadi 0.5.0 (speedup)

Function calls: 7,342,645,429 (7,342,642,699 primitive)
Time: 3428.477 seconds

| ID | Fully Correct | Strip Correct | Fully WER | Strip WER | LER | Total | Line Fully Correct | Line Strip Correct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9996 | 73.88% | 87.74% | 252,311 | 127,245 | 357,698 | 1,037,636 | 77.59% | 87.93% |
| 9997 | 73.88% | 87.74% | 252,337 | 127,259 | 357,736 | 1,037,746 | 76.36% | 87.27% |

🔹 Using Qalsadi 0.4.5

Function calls: 10,561,428,832 (10,561,426,121 primitive)
Time: 5091.545 seconds

| ID | Fully Correct | Strip Correct | Fully WER | Strip WER | LER | Total | Line Fully Correct | Line Strip Correct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9996 | 71.98% | 86.77% | 272,017 | 137,320 | 380,440 | 1,037,636 | 74.14% | 84.48% |
| 9997 | 71.98% | 86.77% | 272,044 | 137,335 | 380,479 | 1,037,746 | 75.45% | 86.36% |

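✅ Summary

  • Accuracy: Qalsadi 0.5.0 outperforms 0.4.5 on every token-level metric across both datasets; line-level scores are equal or better.
  • Speed: Execution time reduced by about 23% on aljazeera.txt and about 33% on rndlines.txt.
  • Efficiency: Function calls dropped by about 22% on aljazeera.txt and about 30% on rndlines.txt, in line with the shorter runtimes.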
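The reductions in this summary can be recomputed from the raw figures reported above. This is only a worked-arithmetic sketch: the numbers are copied from the report, while the function and dictionary names are illustrative.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, as a percentage of before."""
    return (before - after) / before * 100.0


# (Qalsadi 0.4.5, Qalsadi 0.5.0) pairs copied from the report above.
RUNS = {
    "aljazeera.txt": {
        "time_s": (225.037, 173.994),
        "calls": (426_394_179, 333_277_281),
    },
    "rndlines.txt": {
        "time_s": (5091.545, 3428.477),
        "calls": (10_561_428_832, 7_342_645_429),
    },
}

if __name__ == "__main__":
    for dataset, metrics in RUNS.items():
        for name, (old, new) in metrics.items():
            print(f"{dataset:14s} {name:7s} {percent_reduction(old, new):5.1f}% lower")
```

Runtime comes out roughly 22.7% lower on aljazeera.txt and 32.7% lower on rndlines.txt. Function calls come out roughly 21.8% and 30.5% lower.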

linuxscout · Jul 19 '25 15:07

Analysis

| Metric | Description |
| --- | --- |
| Fully Correct | % of tokens where full diacritization matches exactly. |
| Strip Correct | % of tokens correct ignoring short vowels (i.e., stripped tashkeel). |
| WER | Word Error Rate – number of incorrect tokens. |
| LER | Letter Error Rate – more fine-grained than WER. |
| Total | Total number of characters or tokens. |
| Function Calls | Profiling data (used for performance analysis). |
| Time (s) | Total runtime in seconds. |
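To make the token-level and letter-level counts concrete, here is a minimal sketch of how Fully Correct, WER, and LER could be computed for one reference/output pair. It is an assumption-laden illustration, not Mishkal's evaluation code. It assumes whitespace tokenization, the same base letters on both sides, and diacritics limited to U+064B through U+0652. The Strip variants would apply some normalization first, whose exact rule is not specified here, so they are left out.

```python
# Arabic diacritics (tashkeel) from fathatan (U+064B) to sukun (U+0652).
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def split_letters(word: str) -> list[tuple[str, str]]:
    """Split a word into (base letter, attached diacritics) units."""
    units: list[tuple[str, str]] = []
    for ch in word:
        if ch in DIACRITICS and units:
            base, marks = units[-1]
            # Sort marks so shadda+fatha and fatha+shadda compare equal.
            units[-1] = (base, "".join(sorted(marks + ch)))
        else:
            units.append((ch, ""))
    return units


def evaluate(reference: str, hypothesis: str) -> dict[str, float]:
    """Count token errors (WER) and letter errors (LER) between two texts."""
    ref_tokens, hyp_tokens = reference.split(), hypothesis.split()
    if not ref_tokens or len(ref_tokens) != len(hyp_tokens):
        raise ValueError("texts must be non-empty and have the same token count")
    word_errors = letter_errors = 0
    for ref, hyp in zip(ref_tokens, hyp_tokens):
        ref_units, hyp_units = split_letters(ref), split_letters(hyp)
        if ref_units != hyp_units:
            word_errors += 1
        letter_errors += sum(r != h for r, h in zip(ref_units, hyp_units))
        letter_errors += abs(len(ref_units) - len(hyp_units))
    total = len(ref_tokens)
    return {
        "total": total,
        "wer": word_errors,
        "ler": letter_errors,
        "fully_correct_pct": 100.0 * (total - word_errors) / total,
    }


if __name__ == "__main__":
    # The second word differs only in its final vowel (damma vs. fatha).
    print(evaluate("كَتَبَ الوَلَدُ", "كَتَبَ الوَلَدَ"))
```

In this example one of two tokens is wrong, and only one letter within it is wrong. That is why LER is described as more fine-grained than WER.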

✅ Results Summary & Comparison

📰 Dataset: aljazeera.txt

| Metric | Qalsadi 0.4.5 | Qalsadi 0.5.0 | Improvement |
| --- | --- | --- | --- |
| Fully Correct (%) | 76.98% | 77.61% | ✅ +0.63 points |
| Strip Correct (%) | 89.79% | 90.38% | ✅ +0.59 points |
| Fully WER | 9614 | 9348 | ✅ lower |
| Strip WER | 4264 | 4019 | ✅ lower |
| LER | 12187 | 11809 | ✅ lower |
| Total Tokens | 41756 | 41756 | ⏸️ same |
| Runtime (sec) | 225.037 | 173.994 | ✅ faster |
| Function Calls | 426M | 333M | ✅ fewer |

🧠 Insight: Qalsadi 0.5.0 provides both higher accuracy and better performance on aljazeera.txt, showing it is a clear improvement.


📄 Dataset: rndlines.txt

| Metric | Qalsadi 0.4.5 | Qalsadi 0.5.0 | Improvement |
| --- | --- | --- | --- |
| Fully Correct (%) | 71.98% | 73.88% | ✅ +1.90 points |
| Strip Correct (%) | 86.77% | 87.74% | ✅ +0.97 points |
| Fully WER | 272044 | 252337 | ✅ lower |
| Strip WER | 137335 | 127259 | ✅ lower |
| LER | 380479 | 357736 | ✅ lower |
| Total Tokens | 1037746 | 1037746 | ⏸️ same |
| Runtime (sec) | 5091.545 | 3428.477 | ✅ faster |
| Function Calls | 10.56B | 7.34B | ✅ fewer |

🧠 Insight: Qalsadi 0.5.0 shows clear improvements in both speed and quality, particularly reducing word and letter errors.


🧾 Overall Conclusion

| Aspect | Result |
| --- | --- |
| Accuracy | ✅ Qalsadi 0.5.0 improved Fully Correct by 0.6–1.9 percentage points. |
| Speed | ✅ Reduced processing time by ~23% on aljazeera.txt and ~33% on rndlines.txt. |
| Efficiency | ✅ Function calls reduced by ~22% on aljazeera.txt and ~30% on rndlines.txt, indicating better optimization. |
| Stability | ✅ Results are consistent across datasets. |

🔧 Recommendations

  1. Adopt Qalsadi 0.5.0 as default: it's better in both accuracy and speed.
  2. Visualize metrics over versions to track progress.
  3. Break down WER/LER by error types for fine-grained evaluation.
  4. Add benchmark suite to the repo to automate these comparisons.
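As a rough sketch of what an automated comparison (recommendation 4) might look like, the helper below turns two metric dictionaries into a comparison table. The function name, metric keys, and output layout are hypothetical and not part of Mishkal or Qalsadi; the sample values are the aljazeera.txt figures reported above.

```python
def compare_runs(baseline: dict, candidate: dict, lower_is_better: set) -> str:
    """Render a pipe-table comparing two runs metric by metric."""
    rows = [
        "| Metric | Baseline | Candidate | Change |",
        "| --- | --- | --- | --- |",
    ]
    for metric, old in baseline.items():
        new = candidate[metric]
        delta = round(new - old, 3)  # rounding hides float noise such as 0.6299999
        if delta == 0:
            verdict = "same"
        else:
            # A drop is good only for metrics where lower is better, and vice versa.
            improved = (delta < 0) == (metric in lower_is_better)
            verdict = "better" if improved else "worse"
        rows.append(f"| {metric} | {old} | {new} | {delta:+} ({verdict}) |")
    return "\n".join(rows)


if __name__ == "__main__":
    qalsadi_045 = {"fully_correct_pct": 76.98, "fully_wer": 9614, "runtime_s": 225.037}
    qalsadi_050 = {"fully_correct_pct": 77.61, "fully_wer": 9348, "runtime_s": 173.994}
    print(compare_runs(qalsadi_045, qalsadi_050, lower_is_better={"fully_wer", "runtime_s"}))
```

A script like this could run after each benchmark and fail the build when any row is marked "worse", though the threshold for failure is a design choice left open here.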

linuxscout · Jul 19 '25 15:07