aardpfark
Complete implementation of remaining Spark transformers
- [ ] OneHotEncoder
- [ ] RFormula
- [ ] PolynomialExpansion
- [ ] Interaction
- [ ] Imputer
- [ ] VectorIndexer
- [ ] Word2Vec
These require MurmurHash3 to be added as a built-in PFA function (refer to related Hadrian issue):
- [ ] HashingTF
- [ ] FeatureHasher
Hey @MLnick, I'm looking into picking up one of these transformers to start learning more about aardpfark, starting with OneHotEncoder. It looks like OneHotEncoder relies on a StringIndexer to determine the length of its output, but the transformer itself doesn't require one in Spark (i.e. the data tells the OneHotEncoder how to transform it, as opposed to being fit).
As of Spark 2.3 this seems to have been addressed by OneHotEncoderEstimator, which has a fit method and returns a OneHotEncoderModel with categorySizes:
https://spark.apache.org/docs/latest/ml-features.html#onehotencoderestimator
Should support for Spark 2.3 be added (I can try to upgrade) so that OneHotEncoderEstimator can be used instead?
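To make the distinction above concrete, here is a minimal sketch in plain Python (no Spark dependency) of an encoder whose output length is fixed at fit time rather than inferred per dataset. The names `one_hot` and `FittedOneHot` are hypothetical and are not part of Spark or aardpfark; the drop-last behaviour is assumed to mirror Spark's default, where the last category is encoded as an all-zeros vector.

```python
from typing import List, Sequence


def one_hot(index: int, size: int, drop_last: bool = True) -> List[float]:
    """Encode a category index as a dense one-hot vector of fixed length."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} outside [0, {size})")
    # With drop_last, the final category is represented by all zeros.
    length = size - 1 if drop_last else size
    vec = [0.0] * length
    if index < length:
        vec[index] = 1.0
    return vec


class FittedOneHot:
    """Hypothetical stand-in for a fitted model that stores its category size."""

    def __init__(self, category_size: int, drop_last: bool = True):
        self.category_size = category_size
        self.drop_last = drop_last

    @classmethod
    def fit(cls, indices: Sequence[int], drop_last: bool = True) -> "FittedOneHot":
        # The number of categories is learned once, from the training data.
        return cls(max(indices) + 1, drop_last)

    def transform(self, index: int) -> List[float]:
        # Output length depends only on the stored size, not on the input batch.
        return one_hot(index, self.category_size, self.drop_last)


model = FittedOneHot.fit([0, 1, 2, 1, 0])
print(model.category_size)  # 3
print(model.transform(1))   # [0.0, 1.0]
print(model.transform(2))   # [0.0, 0.0]
```

Because the size is stored on the fitted object, an exporter would presumably be able to read it directly instead of looking for an upstream StringIndexer.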
Hi @Paxanator, thanks for your interest in Aardpfark!
Yes, I agree: OneHotEncoder as of Spark 2.3 would be the best way forward for this transformer. Let me know if you need some assistance.
I'll take a look at upgrading the Spark version. Hopefully it shouldn't be much of a problem.
Thank you for putting the library together! I'll wait for the Spark version bump before trying to tackle it.