Non-Serializable classes

Open valera7979 opened this issue 6 years ago • 3 comments

It would be nice to add serialization support to the classes, in particular so that cluster models can be saved.

valera7979, Mar 27 '18 15:03

Actually, I do not think there is much use in serializing cluster models. They are not predictive models that you would "deploy" to a "production pipeline", like a classifier.

I agree that, in general, it would be nice to have efficient serialization support. But this is a lot of very boring work, and we do not have volunteers to do it. So it is of very low priority and is unlikely to happen.

kno10, Mar 27 '18 15:03

Thanks. Regarding serialization of cluster models: I worked on a task where I had to train a model, and then in another task compare new data against the model created earlier. Because there was no serialization, I had to save the points belonging to each cluster and then rebuild the model from them for outlier detection. So I think serialization of cluster models is also useful.

valera7979, Apr 13 '18 13:04

The difficulty with a general solution is that the clusterings do not contain the data. They only have the object IDs, and these are not persistent. So any serializer would likely have to "join" the clusters with the original data, at which point it becomes a huge blob to serialize. For many applications you are much better off using your own serialization with exactly the format and data parts that you need (coordinates, labels, identifiers such as file names; there could be arbitrarily complex data associated with each object ID). For many clustering algorithms, you do not have much more than the object IDs (k-means, where you also have cluster means, is an exception). This variability makes any generic serialization a real pain to design, and likely to break all the time.

kno10, Apr 13 '18 13:04
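
For readers who want to try the "own serialization" route described above, here is a minimal sketch in plain Java. It is not ELKI API: the class `ClusterExport`, its methods, and the tab-separated line format are all hypothetical, and it assumes you have already copied the cluster membership and the vectors out of your result into ordinary maps keyed by an identifier you control (for example a file name), since the object IDs themselves are not persistent.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of ELKI): joins cluster membership with the data. */
final class ClusterExport {
  /**
   * Writes one line per object: clusterLabel TAB stableKey TAB x1,x2,...
   * Assumes labels and keys contain neither tabs nor newlines.
   */
  static String write(Map<String, List<String>> clusters, Map<String, double[]> data) {
    StringBuilder out = new StringBuilder();
    for (Map.Entry<String, List<String>> cluster : clusters.entrySet()) {
      for (String key : cluster.getValue()) {
        double[] v = data.get(key); // the "join" with the original data
        if (v == null) {
          throw new IllegalArgumentException("No data for key: " + key);
        }
        out.append(cluster.getKey()).append('\t').append(key).append('\t');
        for (int i = 0; i < v.length; i++) {
          if (i > 0) {
            out.append(',');
          }
          out.append(v[i]);
        }
        out.append('\n');
      }
    }
    return out.toString();
  }

  /** Reads the format produced by write() back into clusterLabel -> member vectors. */
  static Map<String, List<double[]>> read(String text) {
    Map<String, List<double[]>> clusters = new LinkedHashMap<>();
    for (String line : text.split("\n")) {
      if (line.isEmpty()) {
        continue;
      }
      String[] cols = line.split("\t");
      String[] nums = cols[2].split(",");
      double[] v = new double[nums.length];
      for (int i = 0; i < nums.length; i++) {
        v[i] = Double.parseDouble(nums[i]);
      }
      clusters.computeIfAbsent(cols[0], k -> new ArrayList<>()).add(v);
    }
    return clusters;
  }
}
```

The restored map holds only what was written, here the member coordinates per cluster, which is enough to rebuild something like per-cluster means for a later outlier check. Anything else you need (labels, extra attributes, k-means centers) would have to be added to the format explicitly, which illustrates why a generic serializer is hard to design.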