spark-sklearn
[WIP] Converts dataframe to/from named numpy arrays
I found this incredibly convenient for creating small dataframes. Here is how you can use it:
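import numpy.random as rd  # assumed: rd refers to numpy.random
# conv is assumed to be an existing Converter instance
n = 5
A = rd.rand(n, 4)
C = rd.randint(10, size=n)
df = conv.pack_DataFrame(a=A, c=C)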
DataFrame[a: vector, c: bigint]
And the other conversion, which recovers the correct shape for vectors, matrices, etc.:
Z = Converter.df_to_numpy(df)
# Each column is strictly equal to the original.
Z['a'] == A
Z['c'] == C
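One caveat about the checks above: `==` on numpy arrays compares elementwise and returns a boolean array, not a single True/False. For a test, something like `np.array_equal` is probably more convenient. Here is a minimal sketch of that, where a plain dict stands in for the result of `df_to_numpy` (that stand-in is an assumption for illustration, no Spark involved):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 5
A = rng.rand(n, 4)
C = rng.randint(10, size=n)

# Hypothetical stand-in for the dict-like result of Converter.df_to_numpy(df);
# the real conversion goes through a Spark DataFrame.
Z = {"a": A.copy(), "c": C.copy()}

# Elementwise comparison: a boolean array with the same shape as A.
elementwise = Z["a"] == A

# Whole-array comparison: a single bool, True only if shapes and values match.
same_a = np.array_equal(Z["a"], A)
same_c = np.array_equal(Z["c"], C)

# The shape should survive the round trip as well.
same_shape = Z["a"].shape == A.shape
```

In a unit test, the `same_*` values are the ones to assert on.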
Currently missing are more tests, better names, and support for sparse vectors. I am not sure how easy it is to support sparse vectors, because their shape is irregular across rows. It is probably easier to disallow them and force users to use the CSC conversion that you already wrote.
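To illustrate the irregular-shape problem, here is a rough numpy-only sketch. The `(size, indices, values)` tuples are a hypothetical representation chosen for illustration, not the actual Spark sparse vector class, and `to_dense` is a made-up helper:

```python
import numpy as np

# Hypothetical sparse rows as (size, indices, values): only non-zeros are stored.
sparse_rows = [
    (4, [0, 3], [1.0, 2.0]),
    (4, [1], [5.0]),
    (4, [], []),
]

# The number of stored entries differs from row to row, so the stored values
# cannot be stacked directly into one rectangular array.
nnz_per_row = [len(indices) for _, indices, _ in sparse_rows]


def to_dense(size, indices, values):
    """Expand one sparse row into a dense 1-D array of length `size`."""
    out = np.zeros(size)
    out[np.asarray(indices, dtype=int)] = values
    return out


# After densifying, every row has the same length and stacking works.
dense = np.vstack([to_dense(*row) for row in sparse_rows])
```

Densifying like this gives a regular `(3, 4)` array but throws away the memory savings, which is why pointing users to the CSC conversion seems like the simpler option.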
Just had a couple more comments.
@jkbradley comments addressed
This PR should unskip the following test: test_cv_lasso_with_mllib_featurization (spark_sklearn.tests.test_grid_search_2.CVTests) ... SKIP: disable this test until we have numpy <-> dataframe conversion
I'm starting to look through the open PRs to see if we can merge them or whether they're stale -- @thunterdb is this one too old to resurrect?