Improve Exception handling messages and stat counter

Open dhrubo-os opened this issue 9 months ago • 1 comments

Summary

ML Commons primarily relies on MLException and its derived classes:

ExecuteException
MLLimitExceededException
MLResourceNotFoundException
MLValidationException

These exceptions define log severity (reference) and contribute to stats updates in the following places:

Stats update code reference 1
Stats update code reference 2

However, this does not cover all exceptions used in ML Commons. The project also uses OpenSearchStatusException in many places.

This creates an inconsistency: "not found" errors raised as OpenSearchStatusException are included in failure stats, even though they should not count as failures.

Problem Statement

Incomplete Exception Categorization

MLException is well-structured for handling ML-specific failures, but OpenSearchStatusException is used without proper categorization. This leads to misclassification of errors, particularly 404 Not Found, which should not contribute to failure metrics.

Inconsistent Logging Severity

ML Commons determines log severity from MLException, but errors raised as OpenSearchStatusException do not follow the same severity rules. This results in inconsistent error reporting and makes debugging harder.

Misclassified Stats Updates

The system updates failure stats when an MLException occurs, but some errors from OpenSearchStatusException should be excluded. Example: a 404 Not Found from OpenSearchStatusException should not be counted as a failure, but it currently is.

Proposed Solution

Enhance MLExceptionUtils to:

Map OpenSearchStatusException properly based on HTTP status codes:
  404 Not Found → should not update failure stats.
  500 Internal Server Error → should still be counted as a failure.

Ensure consistent logging levels:
  Align OpenSearchStatusException with MLException log severity rules.

Refactor all OpenSearchStatusException handling through MLExceptionUtils:
  Centralize exception handling for unified processing.

Expected Impact

✅ More accurate failure statistics: No longer miscounting expected errors (e.g., 404) as system failures.
✅ Consistent log severity levels: Easier debugging and monitoring.
✅ Unified exception handling: Clearer classification between OpenSearch errors and ML Commons-specific errors.

This will improve system reliability, ensure consistent failure tracking, and reduce unnecessary alerts in logs.
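To make the proposal concrete, here is a minimal sketch of the status-based classification. It is not the existing MLExceptionUtils code. `StatusException`, `Severity`, `failureCount`, and all method names are hypothetical stand-ins so the example compiles without OpenSearch on the classpath. The severity mapping (WARN for errors excluded from stats, ERROR otherwise) and the treatment of status codes other than 404 and 500 are assumptions that would need to be agreed on.

```java
import java.util.concurrent.atomic.AtomicLong;

public class StatusExceptionSketch {

    /** Hypothetical stand-in for OpenSearchStatusException; it only carries an HTTP status code. */
    public static class StatusException extends RuntimeException {
        public final int status;

        public StatusException(String message, int status) {
            super(message);
            this.status = status;
        }
    }

    /** Hypothetical severity levels; the real rules are the ones already applied to MLException. */
    public enum Severity { WARN, ERROR }

    /** Stand-in for the failure stat counter. */
    public static final AtomicLong failureCount = new AtomicLong();

    /** 404 is an expected outcome and is excluded; every other error is counted (assumption). */
    public static boolean countsAsFailure(Throwable e) {
        if (e instanceof StatusException) {
            return ((StatusException) e).status != 404;
        }
        return true;
    }

    /** Assumed rule: errors excluded from stats log at WARN, counted errors at ERROR. */
    public static Severity severityOf(Throwable e) {
        return countsAsFailure(e) ? Severity.ERROR : Severity.WARN;
    }

    /** Single entry point: updates the counter if needed and returns the severity to log at. */
    public static Severity handle(Throwable e) {
        if (countsAsFailure(e)) {
            failureCount.incrementAndGet();
        }
        return severityOf(e);
    }

    public static void main(String[] args) {
        Severity notFound = handle(new StatusException("model not found", 404));
        Severity serverError = handle(new StatusException("internal error", 500));
        System.out.println(notFound + " " + serverError + " failures=" + failureCount.get());
    }
}
```

Routing every handler through one method like `handle` keeps the stats decision and the log level derived from the same check, so the two cannot drift apart.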

Would love feedback before moving forward with implementation!

dhrubo-os, Feb 06 '25 21:02