
Add TypedOneHotEncoder

Open manuzhang opened this issue 5 years ago • 6 comments

This adds a typed API over Spark ML's OneHotEncoderEstimator, which has been available since Spark 2.3.0.

manuzhang avatar Sep 09 '18 08:09 manuzhang

Codecov Report

Merging #322 into master will increase coverage by 1.53%. The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #322      +/-   ##
==========================================
+ Coverage   94.83%   96.36%   +1.53%     
==========================================
  Files          53       56       +3     
  Lines         968      991      +23     
  Branches        9        9              
==========================================
+ Hits          918      955      +37     
+ Misses         50       36      -14
Impacted Files                                           Coverage Δ
...cala/frameless/ml/feature/TypedOneHotEncoder.scala    100% <100%> (ø)
...main/scala/frameless/ops/RelationalGroupsOps.scala    79.16% <0%> (-18.46%) :arrow_down:
...la/frameless/functions/NonAggregateFunctions.scala    100% <0%> (ø) :arrow_up:
...re/src/main/scala/frameless/CatalystAbsolute.scala
...re/src/main/scala/frameless/CatalystBitShift.scala    100% <0%> (ø)
.../frameless/CatalystNumericWithJavaBigDecimal.scala    100% <0%> (ø)
core/src/main/scala/frameless/CatalystRound.scala        100% <0%> (ø)
...aset/src/main/scala/frameless/ops/GroupByOps.scala    98.36% <0%> (+31.69%) :arrow_up:

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 012f1a1...b51b337.

codecov-io avatar Sep 09 '18 11:09 codecov-io

@manuzhang thanks for the PR! Will be looking at this shortly (I hope!).

imarios avatar Sep 11 '18 21:09 imarios

@manuzhang I had a quick look at the diff and it looks good to me, but this is an area of frameless I'm not very familiar with.

@atamborrino Would you have time for a review?

OlivierBlanvillain avatar Sep 17 '18 16:09 OlivierBlanvillain

@manuzhang I am taking a closer look at the PR. I think this implementation can be made better and more type-safe. A nice API would be:

val ds: TypedDataset[(String, Int, Long)] = ...
ds.transformOneHot(ds('_1)): TypedDataset[(String, Int, Long, Vector)]

The above will not compile if you try to one-hot encode any column that is not a String or Char, so you can add further type-safety restrictions on the type of column you want to transform. Also, note how the type of the resulting dataset correctly has a new column of type Vector appended to the end.
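
One way such a compile-time restriction could be expressed is with a type class that only has instances for the allowed column types. The following is a minimal, self-contained sketch and not frameless code: Column, CanOneHot, and this transformOneHot are hypothetical stand-ins, and appending the Vector column to the result type is left out.

```scala
// Hypothetical stand-ins for illustration; none of these names come from frameless or Spark.
object OneHotColumnSketch {
  // A column reference tagged with the type of the values it holds.
  final case class Column[A](name: String)

  // Evidence that values of type A may be one-hot encoded.
  sealed trait CanOneHot[A]
  object CanOneHot {
    implicit val stringCanOneHot: CanOneHot[String] = new CanOneHot[String] {}
    implicit val charCanOneHot: CanOneHot[Char] = new CanOneHot[Char] {}
  }

  // Compiles only when a CanOneHot[A] instance is in implicit scope.
  def transformOneHot[A](col: Column[A])(implicit ev: CanOneHot[A]): String =
    s"one-hot(${col.name})"

  def main(args: Array[String]): Unit = {
    println(transformOneHot(Column[String]("_1")))
    // transformOneHot(Column[Int]("_2")) // rejected by the compiler: no CanOneHot[Int]
  }
}
```

Adding a new permitted column type under this design would only require one more implicit instance in the CanOneHot companion.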

imarios avatar Sep 22 '18 12:09 imarios

@imarios TypedOneHotEncoder is a TypedEstimator whose fit method produces a type-safe TypedTransformer. That transformer requires an Int input column and appends a Vector column to the output, and both are checked at compile time.
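
The estimator/transformer relationship described here can be illustrated without Spark. The sketch below is an assumption-laden simplification, not the PR's code: Estimator, Transformer, Inputs, Outputs, and OneHot are hypothetical names, rows are plain Seq values rather than a TypedDataset, Scala's Vector[Double] stands in for Spark ML's Vector, and every category keeps its own slot (no last-category dropping).

```scala
// Hypothetical simplified interfaces; the real frameless types wrap Spark ML stages.
object TypedEstimatorSketch {
  trait Transformer[In, Out] { def transform(rows: Seq[In]): Seq[Out] }
  trait Estimator[In, Out] { def fit(rows: Seq[In]): Transformer[In, Out] }

  // The input and output schemas are ordinary case classes, so a mismatch is a compile error.
  final case class Inputs(category: Int)
  final case class Outputs(category: Int, encoded: Vector[Double])

  object OneHot extends Estimator[Inputs, Outputs] {
    def fit(rows: Seq[Inputs]): Transformer[Inputs, Outputs] = {
      // Learn the encoding width from the largest category index seen during fitting.
      val size = if (rows.isEmpty) 0 else rows.map(_.category).max + 1
      new Transformer[Inputs, Outputs] {
        def transform(in: Seq[Inputs]): Seq[Outputs] =
          in.map { r =>
            // Set a 1.0 at the category's index; indices outside the learned range yield all zeros.
            val encoded = Vector.tabulate(size)(i => if (i == r.category) 1.0 else 0.0)
            Outputs(r.category, encoded)
          }
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val model = OneHot.fit(Seq(Inputs(0), Inputs(2)))
    println(model.transform(Seq(Inputs(1))))
  }
}
```

Because fit returns a Transformer[Inputs, Outputs], passing rows of any other type to transform fails to compile, which is the property the comment above refers to.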

manuzhang avatar Sep 28 '18 02:09 manuzhang

@manuzhang I totally dropped the ball on this PR. Let me go over this one more time and merge the PR.

imarios avatar Dec 05 '18 06:12 imarios