spark-sklearn

[WIP] Converts dataframe to/from named numpy arrays

Open thunterdb opened this issue 10 years ago • 4 comments

I found this incredibly convenient for creating small dataframes. Here is how you can use it:

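import numpy.random as rd  # assumption: rd refers to numpy.random

n = 5
A = rd.rand(n, 4)            # n x 4 array of floats in [0, 1)
C = rd.randint(10, size=n)   # n integers in [0, 10)
# conv is the converter module added by this PR
df = conv.pack_DataFrame(a=A, c=C)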

DataFrame[a: vector, c: bigint]

The reverse conversion works as well. It recovers the correct shape for vectors, matrices, etc.

Z = Converter.df_to_numpy(df)
# Each column is strictly equal to the original.
Z['a'] == A
Z['c'] == C
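
For readers without a Spark session at hand, the round trip described above can be imitated with plain NumPy. This is only a sketch of the idea, not the PR's implementation: `pack_rows` and `unpack_rows` are hypothetical stand-ins for `pack_DataFrame` and `df_to_numpy`, and the "dataframe" is just a list of row dicts. It also uses `np.array_equal`, since `==` on arrays compares elementwise and returns a boolean array rather than a single value.

```python
import numpy as np

def pack_rows(**columns):
    """Hypothetical stand-in for packing: one dict per row, one key per named array."""
    lengths = {len(col) for col in columns.values()}
    if len(lengths) != 1:
        raise ValueError("all arrays must share the same first dimension")
    n = lengths.pop()
    return [{name: col[i] for name, col in columns.items()} for i in range(n)]

def unpack_rows(rows):
    """Hypothetical inverse: stack each field back into one array per name.

    Assumes at least one row, and that every row has the same keys.
    """
    names = rows[0].keys()
    return {name: np.stack([row[name] for row in rows]) for name in names}

rng = np.random.RandomState(0)
n = 5
A = rng.rand(n, 4)            # each row holds a length-4 vector
C = rng.randint(10, size=n)   # each row holds a scalar

rows = pack_rows(a=A, c=C)
Z = unpack_rows(rows)

# `Z["a"] == A` would give an elementwise boolean array;
# np.array_equal collapses the comparison into a single bool.
roundtrip_ok = np.array_equal(Z["a"], A) and np.array_equal(Z["c"], C)
shapes = {name: arr.shape for name, arr in Z.items()}
```

Stacking only works here because every row contributes a value of the same shape for a given name, which is what makes dense vectors and matrices straightforward to recover.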

Still missing: more tests, better names, and support for sparse vectors. I'm not sure how easy sparse vectors are to support, because their shape is irregular from row to row. It is probably easier to disallow them and have users go through the CSC conversion that you already wrote.
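
To illustrate why sparse rows are awkward to stack, here is a minimal pure-Python sketch. It is not the CSC conversion mentioned above: `sparse_rows_to_csc` is a hypothetical helper, and the rows are assumed to be lists of `(column_index, value)` pairs. Each row can hold a different number of entries, so there is no common per-row shape, but the whole collection can still be gathered into the three flat arrays of the compressed sparse column layout (`data`, `indices`, `indptr`).

```python
def sparse_rows_to_csc(rows, n_cols):
    """Build CSC arrays (data, indices, indptr) from per-row sparse vectors.

    Each row is a list of (column_index, value) pairs. Rows may have
    different numbers of entries, which is the irregularity in question.
    """
    # Group the nonzero entries by column, remembering their row index.
    per_column = [[] for _ in range(n_cols)]
    for row_idx, row in enumerate(rows):
        for col_idx, value in row:
            per_column[col_idx].append((row_idx, value))

    # Flatten column by column; indptr[j]:indptr[j + 1] delimits column j.
    data, indices, indptr = [], [], [0]
    for entries in per_column:
        for row_idx, value in entries:
            indices.append(row_idx)
            data.append(value)
        indptr.append(len(data))
    return data, indices, indptr

rows = [
    [(0, 1.0), (3, 2.0)],             # two nonzeros
    [],                               # no nonzeros
    [(1, 3.0), (2, 4.0), (3, 5.0)],   # three nonzeros
]
data, indices, indptr = sparse_rows_to_csc(rows, n_cols=4)
```

The per-row lengths here are 2, 0, and 3, so the rows cannot be stacked into one rectangular array the way dense vectors can, whereas the column-compressed form stores them without padding.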

thunterdb Dec 02 '15 00:12

Just had a couple more comments.

jkbradley Dec 16 '15 18:12

@jkbradley comments addressed

thunterdb Dec 21 '15 21:12

This PR should unskip the following: test_cv_lasso_with_mllib_featurization (spark_sklearn.tests.test_grid_search_2.CVTests) ... SKIP: disable this test until we have numpy <-> dataframe conversion

vlad17 Jun 28 '16 00:06

I'm starting to look through the open PRs to see if we can merge them or whether they're stale -- @thunterdb is this one too old to resurrect?

srowen Dec 07 '18 21:12