Complete implementation of remaining Spark transformers

Open MLnick opened this issue 7 years ago • 3 comments

  • [ ] OneHotEncoder
  • [ ] RFormula
  • [ ] PolynomialExpansion
  • [ ] Interaction
  • [ ] Imputer
  • [ ] VectorIndexer
  • [ ] Word2Vec

These require MurmurHash3 to be added as a built-in PFA function (refer to the related Hadrian issue):

  • [ ] HashingTF
  • [ ] FeatureHasher
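
For context on why a hash function is the blocker here: HashingTF and FeatureHasher map each term (or feature name) to a vector index by hashing it and taking the result modulo the vector size, so a PFA scoring engine has to compute the same hash that was used at training time. Below is a minimal Python sketch of that hashing trick using the standard MurmurHash3 x86_32 algorithm. The function names, the seed of 42, and the UTF-8 encoding are assumptions for illustration; Spark's own Murmur3 variant may differ in details such as tail-byte handling, so these indices should not be expected to match Spark's output.

```python
def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Standard MurmurHash3 x86_32; returns an unsigned 32-bit integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF

    def rotl(x: int, r: int) -> int:
        # 32-bit left rotation
        return ((x << r) | (x >> (32 - r))) & mask

    h = seed & mask
    n_blocks = len(data) // 4

    # Body: mix each complete 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k
        h = rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: the remaining 0 to 3 bytes, mixed without the final rotation.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k

    # Finalization: fold in the length, then avalanche the bits.
    h ^= len(data) & mask
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def hashing_tf(terms, num_features: int, seed: int = 42):
    """Hypothetical helper: term-frequency vector via the hashing trick."""
    vec = [0.0] * num_features
    for term in terms:
        # Seed and encoding are illustrative assumptions, not Spark's contract.
        index = murmur3_x86_32(term.encode("utf-8"), seed) % num_features
        vec[index] += 1.0
    return vec


counts = hashing_tf(["spark", "pfa", "spark"], num_features=16)
```

Because the index depends only on the term and the hash, no vocabulary has to be stored in the exported model, which is why the hash itself must be available on the scoring side.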

MLnick • Jun 08 '18 17:06

Hey @MLnick, I'm looking into picking up one of these transformers to start learning more about aardpfark, starting with OneHotEncoder. It looks like OneHotEncoder relies on a StringIndexer to determine the length of its output, but the transformer itself doesn't require one in Spark (i.e. the data tells the OneHotEncoder how to transform it, as opposed to the encoder being fit).

As of Spark 2.3 it seems this has been addressed with OneHotEncoderEstimator, which has a fit method and returns a OneHotEncoderModel with categorySizes: https://spark.apache.org/docs/latest/ml-features.html#onehotencoderestimator
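
To illustrate the difference, here is a rough Python sketch (not Spark code; `one_hot` and `fit_category_size` are hypothetical helpers, and `drop_last` is only meant to mimic Spark's `dropLast` option). When the size is inferred from each batch, the output length can change from batch to batch; when it is fixed once at fit time, every batch produces vectors of the same length, which is what an exported scoring engine needs.

```python
def one_hot(index: int, category_size: int, drop_last: bool = True) -> list:
    """Encode a category index as a dense 0/1 list of fixed length."""
    if not 0 <= index < category_size:
        raise ValueError(f"index {index} is outside [0, {category_size})")
    length = category_size - 1 if drop_last else category_size
    vec = [0.0] * length
    if index < length:  # with drop_last, the last category is all zeros
        vec[index] = 1.0
    return vec


def fit_category_size(indices) -> int:
    """Derive the number of categories from the largest index seen."""
    return max(indices) + 1


train = [0, 2, 1, 2]
category_size = fit_category_size(train)  # fixed once, at fit time

batch = [0, 1]  # a later batch that happens not to contain category 2

# Fitted size: vectors keep the length learned from the training data.
fixed = [one_hot(i, category_size) for i in batch]

# Size inferred from the batch itself: vectors come out shorter.
inferred = [one_hot(i, fit_category_size(batch)) for i in batch]

print(fixed)     # [[1.0, 0.0], [0.0, 1.0]]
print(inferred)  # [[1.0], [0.0]]
```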

Should support for Spark 2.3 be added (I can try to upgrade) so that OneHotEncoderEstimator can be used instead?

Paxanator • Nov 18 '18 23:11

Hi @Paxanator, thanks for your interest in Aardpfark!

Yes, I agree: OneHotEncoder as of Spark 2.3 would be the best way forward for this transformer. Let me know if you need any assistance.

I'll take a look at upgrading the Spark version; hopefully it shouldn't be much of a problem.

MLnick • Nov 20 '18 20:11

Thank you for putting the library together! I'll wait for the Spark version bump before trying to tackle it.

Paxanator • Nov 21 '18 02:11