h2o4gpu
[WIP] Field-aware factorization machines
Initial implementation of field-aware factorization machines.
Based on these 2 whitepapers:
- https://arxiv.org/pdf/1701.04099.pdf
- https://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf
And the following repositories:
- https://github.com/guestwalk/libffm (original impl)
- https://github.com/alexeygrigorev/libffm-python (Python interface for it)
- https://github.com/RTBHOUSE/cuda-ffm (CUDA implementation of a simplified method)
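For reference, the model in both papers combines every pair of features through latent vectors that are specific to the *other* feature's field. Below is a minimal, pure-Python sketch of that prediction (my own illustration of the formula from the papers, not the code in this PR):

```python
# Sketch of the FFM prediction for one row of (field, feature, value) triples.
# w has shape (n_features, n_fields, k): w[j, f] is the latent vector of
# feature j used when it interacts with a feature from field f.
import numpy as np

def ffm_phi(row, w):
    phi = 0.0
    for i in range(len(row)):
        for j in range(i + 1, len(row)):
            f1, j1, v1 = row[i]
            f2, j2, v2 = row[j]
            phi += np.dot(w[j1, f2], w[j2, f1]) * v1 * v2
    return phi  # feed through a sigmoid to get the predicted probability
```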
Currently this is only the initial GPU implementation; the CPU version will most probably just be a copy of the original implementation (without the SSE alignment for now).
No benchmarks so far, as there is still something wrong (we are getting different results).
Things still to be done:
- add validation set option and early stopping (FFM seems to need this a lot as it tends to overfit)
- add multi GPU support
- review the data structures used - the object-oriented Dataset/Row/Node hierarchy is convenient for development, but it probably adds a lot of overhead when copying data to the device; refactoring it into 3 (or more) contiguous arrays might give a significant speedup (see the sketch after this list)
- review the main method wTx (in trainer.cu) - it can probably be rewritten in a more GPU-friendly manner
- probably something else I'm forgetting
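As a rough idea of what the flattening mentioned above could look like: a CSR-like layout with one offset array plus one contiguous array per Node member, so each array can be copied to the device in a single transfer. The names and layout below are just a suggestion, not the current code:

```python
# Possible flattening of the Dataset/Row/Node hierarchy into contiguous arrays.
# row_ptr[i]:row_ptr[i+1] bounds the nodes that belong to row i.
import numpy as np

def flatten_rows(X):
    sizes = [len(row) for row in X]
    row_ptr = np.zeros(len(X) + 1, dtype=np.int32)
    row_ptr[1:] = np.cumsum(sizes)
    fields   = np.fromiter((f for row in X for f, _, _ in row), dtype=np.int32)
    features = np.fromiter((j for row in X for _, j, _ in row), dtype=np.int32)
    values   = np.fromiter((v for row in X for _, _, v in row), dtype=np.float32)
    return row_ptr, fields, features, values
```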
If anyone wants to take it for a spin:
>>> from h2o4gpu.solvers.ffm import FFMH2O
>>> import numpy as np
>>> X = [ [(1, 2, 1), (2, 3, 1), (3, 5, 1)],
... [(1, 0, 1), (2, 3, 1), (3, 7, 1)],
... [(1, 1, 1), (2, 3, 1), (3, 7, 1), (3, 9, 1)] ]
>>>
>>> y = [1, 1, 0]
>>> ffmh2o = FFMH2O(n_gpus=1)
>>> ffmh2o.fit(X,y)
<h2o4gpu.solvers.ffm.FFMH2O object at 0x7f2d30319fd0>
>>> ffmh2o.predict(X)
array([0.7611223 , 0.6475924 , 0.88890105], dtype=float32)
The input format is a list of rows, each row being a list of (fieldIdx, featureIdx, value) tuples, plus a corresponding list of labels (0 or 1), one per row.
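In case it helps, here is a hypothetical helper (not part of the h2o4gpu API) showing one way to build that format from raw categorical records; assigning feature indices globally as values are first seen is just an assumption for the example:

```python
# Hypothetical encoder: turns records of raw categorical values into the
# (fieldIdx, featureIdx, value) triples expected by FFMH2O.fit().
def encode_rows(records, fields):
    feature_ids = {}
    X = []
    for rec in records:
        row = []
        for field_idx, field in enumerate(fields):
            key = (field, rec[field])
            feat_idx = feature_ids.setdefault(key, len(feature_ids))
            row.append((field_idx, feat_idx, 1))  # value 1 for one-hot categoricals
        X.append(row)
    return X

records = [{"user": "u1", "ad": "a7"}, {"user": "u2", "ad": "a7"}]
X = encode_rows(records, fields=["user", "ad"])
```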
So both CPU and GPU implementations are now in place and working. The only remaining issue is that GPU batch mode gives slightly different results for the same number of iterations (or converges in a much larger number of iterations) than GPU mode with batch_size=1 and the CPU modes. My guess is that this is because we are using HOGWILD!, so the order of computations during the gradient update differs (and might not be 100% correct?).
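As a tiny illustration of that order sensitivity (not the actual kernel): float32 accumulation is not associative, so applying the same per-sample contributions in a different order already shifts the result slightly, and with lock-free HOGWILD!-style updates each thread also reads weights that other threads may or may not have written yet.

```python
# Summing the same float32 values in two different orders gives slightly
# different totals; in SGD the effect compounds, because each update reads
# weights written by earlier updates.
import numpy as np

grads = np.random.RandomState(0).rand(100_000).astype(np.float32)

def accumulate(xs):
    total = np.float32(0.0)
    for x in xs:
        total += x
    return total

print(accumulate(grads))         # sequential order
print(accumulate(grads[::-1]))   # reverse order: the last digits differ
```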
One more thing: this needs to be compared on bigger data (libffm_toy.zip) against the original C++ implementation (https://github.com/guestwalk/libffm - not the Python API). I think the GPU version was getting slightly different results, so it needs double-checking before merging.