
Tests for model (un)pickle

mdymczyk opened this issue May 04 '18 · 7 comments

We should add tests for saving and loading pickled models (similar to what we already have for XGBoost) for all algorithms (POGS-based, KMeans, TSVD, PCA) to verify that we can actually save and load all of our models.
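For reference, such a round-trip test could look roughly like the sketch below. It reuses the KMeans API shown later in this thread; the TSVD, PCA and POGS-based cases would need their actual class names and parameters plugged in.

```python
import pickle

import numpy as np
import h2o4gpu


def test_kmeans_pickle_roundtrip():
    X = np.array([[1., 1.], [1., 4.], [1., 0.]])
    model = h2o4gpu.KMeans(n_clusters=2, random_state=1234).fit(X)
    # Serialize and restore via an in-memory round trip.
    restored = pickle.loads(pickle.dumps(model))
    # Fitted state and predictions should survive the round trip.
    np.testing.assert_array_equal(model.cluster_centers_, restored.cluster_centers_)
    np.testing.assert_array_equal(model.predict(X), restored.predict(X))
```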

mdymczyk avatar May 04 '18 14:05 mdymczyk

We don't have such tests for GPU KMeans and GPU SVD. XGBoost has a special hook that copies over (in C) any references held in Python; we can do the same.
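The general shape of such a hook, as a rough sketch only (the `_copy_from_backend`/`_rebuild_backend` helpers are hypothetical placeholders, not existing h2o4gpu functions):

```python
class GPUSolverMixin:
    def __getstate__(self):
        state = self.__dict__.copy()
        # Hypothetical helper: serialize the C/GPU-side state into a plain,
        # picklable Python buffer, loosely mirroring the XGBoost hook described above.
        state['_native_state'] = self._copy_from_backend()
        return state

    def __setstate__(self, state):
        native_state = state.pop('_native_state', None)
        self.__dict__.update(state)
        if native_state is not None:
            # Hypothetical helper: push the buffer back into a fresh native handle.
            self._rebuild_backend(native_state)
```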

pseudotensor avatar May 04 '18 16:05 pseudotensor

@pseudotensor do we really need that, though? At least for KMeans the only thing we need is the centroids, which we already have in Python as a numpy array, so we could just pickle and unpickle that. After unpickling we can pass it (as we do now) to the C backend, no?
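A minimal sketch of that "centroids only" idea, assuming that assigning `cluster_centers_` on a fresh estimator is enough for `predict()` to work (the session below suggests pickling the whole object already works anyway):

```python
import pickle

import numpy as np
import h2o4gpu

X = np.array([[1., 1.], [1., 4.], [1., 0.]])
model = h2o4gpu.KMeans(n_clusters=2, random_state=1234).fit(X)

blob = pickle.dumps(model.cluster_centers_)      # persist just the numpy array

restored = h2o4gpu.KMeans(n_clusters=2)
restored.cluster_centers_ = pickle.loads(blob)   # assumption: enough for predict()
print(restored.predict(X))
```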

mdymczyk avatar May 06 '18 22:05 mdymczyk

For example, it seems to work out of the box for KMeans (verbose logs left in to show it's running on GPUs):

>>> import pickle
>>> import h2o4gpu
>>> import numpy as np
>>>
>>> X = np.array([[1.,1.], [1.,4.], [1.,0.]])
>>>
>>> model = h2o4gpu.KMeans(verbose=100, n_clusters=2,random_state=1234).fit(X)

Using GPU KMeans solver with 2 GPUs.

Using h2o4gpu backend.

Using GPU KMeans solver with 2 GPUs.

Detected np.float64 data
2 gpus.
Copying data to device: 1
Copying data to device: 0
Threshold triggered. Terminating early.
  Time fit: 0.00288296 s
Timetransfer: 0.0531921 Timefit: 0.00288296 Timecleanup: 0.00114107
>>> model.cluster_centers_
array([[1., 1.],
       [1., 4.]])
>>>
>>> pickle.dump( model, open( "save.p", "wb" ) )
>>> unpickled_model = pickle.load( open( "save.p", "rb" ) )
>>> unpickled_model.cluster_centers_
array([[1., 1.],
       [1., 4.]])
>>> model.predict(X)

Using GPU KMeans solver with 2 GPUs.

Detected np.float64 data
Detected np.float64 data
2 gpus.
array([1, 0, 0], dtype=int32)
>>> unpickled_model.predict(X)

Using GPU KMeans solver with 2 GPUs.

Detected np.float64 data
Detected np.float64 data
2 gpus.
array([1, 0, 0], dtype=int32)

mdymczyk avatar May 06 '18 22:05 mdymczyk

Yes, it should be easy (or already work) for KMeans, since the only thing fit does is find the centroids.

pseudotensor avatar May 06 '18 23:05 pseudotensor

@pseudotensor yes, I thought it would work out of the box for all our models since we copy all the necessary data from C to Python, but @wenphan noticed that POGS-based models were having problems pickling (an ask from a potential user). From the log it had something to do with CDLL and/or ctypes, so for POGS we may need to do some more work, but hopefully KMeans and SVD are already fine.
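If the ctypes handle is the culprit, one common remedy is to exclude it from the pickled state and re-open the shared library on load. A rough sketch, not the actual h2o4gpu solver code (class and library names are illustrative):

```python
import ctypes


class PogsLikeSolver:
    """Illustrative only; shows the CDLL drop/reload pattern."""

    _LIB_PATH = "libsolver.so"  # illustrative library name

    def __init__(self):
        self._lib = ctypes.CDLL(self._LIB_PATH)

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop('_lib', None)  # CDLL handles cannot be pickled
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Re-open the shared library when the object is loaded again.
        self._lib = ctypes.CDLL(self._LIB_PATH)
```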

mdymczyk avatar May 06 '18 23:05 mdymczyk

We should move forward on dropping POGS anyway. I have a gblinear wrapper we can use as a baseline that does lambda search with warm start. We can keep the rest of the CV-fold logic but do it in Python instead of C. That's probably easiest.
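For reference, the warm-started lambda search can be sketched with plain xgboost from Python roughly like this (parameters and the lambda grid are illustrative, not the actual wrapper):

```python
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = X @ rng.rand(10) + 0.1 * rng.rand(200)
dtrain = xgb.DMatrix(X, label=y)

booster = None
for lam in [10.0, 1.0, 0.1, 0.01]:  # decreasing regularization path
    params = {"booster": "gblinear", "lambda": lam, "objective": "reg:squarederror"}
    # Passing the previous booster via xgb_model warm-starts the next lambda.
    booster = xgb.train(params, dtrain, num_boost_round=20, xgb_model=booster)
    preds = booster.predict(dtrain)
    rmse = float(np.sqrt(np.mean((preds - y) ** 2)))
    print("lambda=%g  train RMSE=%.4f" % (lam, rmse))
```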

pseudotensor avatar May 06 '18 23:05 pseudotensor

@pseudotensor yes, once @RAMitchell's implementation is stable enough I'm 100% for removing POGS from the codebase altogether.

mdymczyk avatar May 06 '18 23:05 mdymczyk