
More detailed error messages for KNN model training

Open simonhessner opened this issue 1 year ago • 3 comments

When training a KNN model using the POST /_plugins/_knn/models/{model_name}/_train endpoint, the error message returned by GET /_plugins/_knn/models/{model_name} often just says:

Failed to execute training. May be caused by an invalid method definition or not enough memory to perform training.
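
For readers hitting the same thing, here is a minimal sketch of how a client could tell the generic message apart from a more specific one. It assumes the parsed JSON body of the GET response exposes `state` and `error` fields and that a failed training job is reported as `state == "failed"`; the helper name `summarize_model_status` is hypothetical and not part of the plugin.

```python
from typing import Any, Mapping

# The generic message quoted in this issue.
GENERIC_TRAINING_ERROR = (
    "Failed to execute training. May be caused by an invalid method "
    "definition or not enough memory to perform training."
)


def summarize_model_status(model_doc: Mapping[str, Any]) -> str:
    """Summarize a parsed GET /_plugins/_knn/models/{model_name} response.

    Assumption: the response carries "state" and "error" fields.
    """
    state = str(model_doc.get("state", "unknown"))
    error = str(model_doc.get("error") or "")
    if state != "failed":
        return f"state={state}"
    if error.strip() == GENERIC_TRAINING_ERROR:
        # Nothing actionable in the API response; the root cause is only in the server log.
        return "state=failed (generic error only; check the OpenSearch server log)"
    return f"state=failed: {error}"
```

Calling it with `{"state": "failed", "error": GENERIC_TRAINING_ERROR}` flags the response as carrying only the generic message, which is the situation described here.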

It took me a while to figure out what exactly the problem was. I found a more detailed error message in the output of the opensearch process running in my docker container:

opensearch | [2023-11-28T10:58:37,331][ERROR][o.o.k.t.TrainingJob ] [7747b1d9c94f] Failed to run training job for model "my_model_name": Error in void faiss::Clustering::train_encoded(faiss::idx_t, const uint8_t*, const faiss::Index*, faiss::Index&, const float*) at /tmp/tmp61nvdqz7/k-NN/jni/external/faiss/faiss/Clustering.cpp:281: Error: 'nx >= k' failed: Number of training points (100) should be at least as large as number of clusters (256)

With this error message I was able to fix the issue very easily. But often I don't have direct access to the opensearch process. For that reason it would be very helpful to have the GET endpoint return error messages that contain all the available information about why the training failed.
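
As an illustration of how much actionable information the log line carries, the sketch below pulls the two numbers out of it. The regular expression is written against the exact wording of the log line quoted above; the function name `parse_cluster_error` is hypothetical, and the wording may differ in other faiss or k-NN versions.

```python
import re
from typing import Optional, Tuple

# Matches the faiss assertion text quoted in the log line above.
_CLUSTER_ERROR = re.compile(
    r"Number of training points \((\d+)\) should be at least as large as "
    r"number of clusters \((\d+)\)"
)


def parse_cluster_error(log_line: str) -> Optional[Tuple[int, int]]:
    """Return (training_points, clusters) if the line contains the faiss
    'nx >= k' failure, otherwise None."""
    match = _CLUSTER_ERROR.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Applied to the log line above, this yields 100 training points against 256 clusters, which is exactly the detail missing from the API response.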

Documentation: Currently the documentation doesn't mention a minimum number of data points required for training. It would be great to clarify this in the docs in addition to having better error messages.

simonhessner, Nov 29 '23 11:11

@simonhessner thanks for creating the issue. Rather than returning the full message, I think the problem you encountered can easily be caught up front by adding a validation step before training starts that checks that the number of documents in the training index is > 0.

Please let me know if you think otherwise. Exposing internal details in the API response is not a best practice, but I agree that the error message can be improved.

navneet1v, Dec 14 '23 05:12

@navneet1v That would work, but testing for number of docs > 0 is not sufficient. I think the bare minimum (according to the error message I got from the opensearch service) is 256 training docs. I'm not sure whether that depends on any other hyperparameter. And even with 256 I got a warning that only went away once I had around 10k.

simonhessner, Dec 14 '23 11:12
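
A rough sketch of what a stricter pre-training validation could look like, under stated assumptions. The hard limit mirrors the faiss assertion quoted in the log (`nx >= k`). The soft limit assumes faiss's default `min_points_per_centroid` of 39, which would put the warning threshold for 256 clusters at 9984 points and would match the "around 10k" observation; that default is an assumption here, not something confirmed in this thread. The function name `check_training_set` and its return shape are hypothetical.

```python
from typing import Tuple

# Assumption: faiss's default ClusteringParameters.min_points_per_centroid.
ASSUMED_MIN_POINTS_PER_CENTROID = 39


def check_training_set(
    n_points: int,
    n_clusters: int,
    min_points_per_centroid: int = ASSUMED_MIN_POINTS_PER_CENTROID,
) -> Tuple[str, str]:
    """Classify a training set size as "error", "warning", or "ok"."""
    if n_points < n_clusters:
        # Same condition as the faiss assertion 'nx >= k' in the log.
        return (
            "error",
            f"Number of training points ({n_points}) should be at least as "
            f"large as number of clusters ({n_clusters})",
        )
    recommended = n_clusters * min_points_per_centroid
    if n_points < recommended:
        return (
            "warning",
            f"{n_points} training points for {n_clusters} clusters; "
            f"at least {recommended} are recommended",
        )
    return ("ok", "")
```

With these assumptions, 100 points against 256 clusters is rejected, exactly 256 points passes the hard check but still triggers the warning, and 9984 points or more is accepted. Where the cluster count comes from (for example an IVF `nlist` or a product quantizer's code size) depends on the method definition and would need to be derived by the plugin.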

@simonhessner thanks for the details. We will prioritize this issue and keep you posted on the timeline.

vamshin, Dec 18 '23 19:12