frameless
Add TypedOneHotEncoder
This adds a typed API over Spark ML's OneHotEncoderEstimator, available since Spark 2.3.0.
Codecov Report
Merging #322 into master will increase coverage by 1.53%. The diff coverage is 100%.
@@ Coverage Diff @@
## master #322 +/- ##
==========================================
+ Coverage 94.83% 96.36% +1.53%
==========================================
Files 53 56 +3
Lines 968 991 +23
Branches 9 9
==========================================
+ Hits 918 955 +37
+ Misses 50 36 -14
Impacted Files | Coverage Δ | |
---|---|---|
...cala/frameless/ml/feature/TypedOneHotEncoder.scala | 100% <100%> (ø) | |
...main/scala/frameless/ops/RelationalGroupsOps.scala | 79.16% <0%> (-18.46%) | :arrow_down: |
...la/frameless/functions/NonAggregateFunctions.scala | 100% <0%> (ø) | :arrow_up: |
...re/src/main/scala/frameless/CatalystAbsolute.scala | | |
...re/src/main/scala/frameless/CatalystBitShift.scala | 100% <0%> (ø) | |
.../frameless/CatalystNumericWithJavaBigDecimal.scala | 100% <0%> (ø) | |
core/src/main/scala/frameless/CatalystRound.scala | 100% <0%> (ø) | |
...aset/src/main/scala/frameless/ops/GroupByOps.scala | 98.36% <0%> (+31.69%) | :arrow_up: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 012f1a1...b51b337.
@manuzhang thanks for the PR! Will be looking at this shortly (I hope!).
@manuzhang I had a quick look at the diff and it looks good to me, but this is an area of frameless I'm not very familiar with.
@atamborrino Would you have time for a review?
@manuzhang I am taking a closer look at the PR. I think this implementation can be made better and more type-safe. A nice API would be:
val ds: TypedDataset[(String, Int, Long)] = ...
ds.transformOneHot(ds('_1)): TypedDataset[(String, Int, Long, Vector)]
The above will not compile if you try to one-hot encode any column that is not String or Char, so you can add further type-safety restrictions on the type of column you want to transform. Also, note how the type of the resulting dataset correctly has a new column of type Vector appended to the end.
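For illustration only, the compile-time restriction described above could be sketched with a small type class. This is plain Scala with no Spark or frameless dependency; `CanOneHot` and this `transformOneHot` signature are hypothetical names rather than frameless's API, and the result uses Scala's collection `Vector[Double]` as a stand-in for Spark ML's `Vector`. Unlike Spark's encoder defaults, the sketch keeps one slot per category.

```scala
// Hypothetical sketch, not frameless's API: plain Scala, no Spark dependency.
object OneHotColumnSketch {
  // Evidence that a column of type A may be one-hot encoded.
  sealed trait CanOneHot[A]
  object CanOneHot {
    implicit val stringColumn: CanOneHot[String] = new CanOneHot[String] {}
    implicit val charColumn: CanOneHot[Char] = new CanOneHot[Char] {}
  }

  // Pairs each value with its one-hot vector. The call compiles only when
  // a CanOneHot[A] instance is in implicit scope.
  def transformOneHot[A: CanOneHot](column: Seq[A]): Seq[(A, Vector[Double])] = {
    val categories = column.distinct // first-occurrence order
    column.map { value =>
      val hot = categories.indexOf(value)
      (value, Vector.tabulate(categories.size)(i => if (i == hot) 1.0 else 0.0))
    }
  }

  def main(args: Array[String]): Unit = {
    println(transformOneHot(Seq("a", "b", "a")))
    // transformOneHot(Seq(1L, 2L)) // rejected at compile time: no CanOneHot[Long]
  }
}
```

The commented-out call shows the intended failure mode: with no `CanOneHot[Long]` instance available, the compiler rejects it before anything runs.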
@imarios TypedOneHotEncoder is a TypedEstimator, which generates a type-safe TypedTransformer with fit. The transformer from TypedOneHotEncoder requires Int inputs and appends a Vector to the outputs, both of which are checked at compile time.
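A minimal sketch of the estimator/transformer shape described in this comment, assuming nothing about the PR's actual code. It is dependency-free Scala: `Inputs`, `Outputs`, `OneHotEstimator` and `OneHotTransformer` are hypothetical names, and Scala's `Vector[Double]` stands in for Spark ML's `Vector`. The point is only that `fit` yields a transformer whose input and output types are fixed at compile time.

```scala
// Hypothetical sketch of the estimator -> fit -> transformer pattern.
// None of these names come from frameless or Spark.
object TypedEstimatorSketch {
  // Typed input and output shapes: the transformer accepts only Int indices.
  final case class Inputs(categoryIndex: Int)
  final case class Outputs(encoded: Vector[Double])

  // The fitted model: maps typed inputs to typed outputs.
  final class OneHotTransformer(size: Int) {
    def transform(rows: Seq[Inputs]): Seq[Outputs] =
      rows.map { row =>
        Outputs(Vector.tabulate(size)(i => if (i == row.categoryIndex) 1.0 else 0.0))
      }
  }

  // The estimator: learns the number of categories from the training rows.
  object OneHotEstimator {
    def fit(rows: Seq[Inputs]): OneHotTransformer = {
      val size = if (rows.isEmpty) 0 else rows.map(_.categoryIndex).max + 1
      new OneHotTransformer(size)
    }
  }

  def main(args: Array[String]): Unit = {
    val model = OneHotEstimator.fit(Seq(Inputs(0), Inputs(2)))
    println(model.transform(Seq(Inputs(1))))
    // model.transform(Seq("b")) // rejected at compile time: String is not Inputs
  }
}
```

In this sketch the transformer returns only the encoded vectors; appending them to the original row, as the comment describes, would be a further step.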
@manuzhang I totally dropped the ball on this PR. Let me go over this one more time and merge the PR.