SynapseML
VowpalWabbit - Train with SynapseML on Databricks and run inference via the native VW CLI
Hi,
I am training my VW CB model using SynapseML on Databricks.
The code is the same as in: https://microsoft.github.io/SynapseML/docs/features/vw/Vowpal%20Wabbit%20-%20Overview/#vw-contextual-bandit
The model and the featurizer/zipper pipeline are saved once training is complete.
Now I want to run inference using the native VW CLI, because the inference environment does not support Spark.
Can someone please shed light on how this can be done?
Adding @jackgerrits and @eisber for visibility here
You should be able to save the binary model: https://github.com/microsoft/SynapseML/blob/master/vw/src/main/scala/com/microsoft/azure/synapse/ml/vw/VowpalWabbitBaseModel.scala#L111
The tricky piece is getting the featurization right. Did you use the VW featurizer or MLSpark featurize?
I used the VW featurizer as detailed here: https://microsoft.github.io/SynapseML/docs/features/vw/Vowpal%20Wabbit%20-%20Overview/#vw-contextual-bandit
I am able to save the binary model, but the issue lies with the featurizer (as was mentioned above as well).
The input for the featurizer is the table below (except the target column). Data types are either int or float, thus the NumericFeaturizer is used.
The VW featurizer's output column is used as the namespace, column names are used as feature names, and column values are used as feature values. Thus, for the example you reference, the first row should result in
|features age:63 sex:1 cp:3 trestbps:145 chol:233 ...
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
From an ML perspective this featurization isn't ideal for VW. Especially for low-cardinality categorical columns, I'd suggest stringifying them so that each value gets its own weight.
Assuming you change the data type of cp to string, the first two rows become
|features age:63 sex:1 cp3:1 trestbps:145 ...
|features age:37 sex:1 cp2:1 trestbps:130 ...
If you want to make use of VW's interaction feature, you'll have to produce multiple feature vectors using different targetCols.
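To make the mapping described above concrete, here is a minimal Python sketch (no Spark needed) that renders a row as a VW text-format line. The helper names `format_namespace` and `to_vw_line` are hypothetical, and the sketch only reproduces the textual layout shown in this thread. Whether the CLI then hashes these names to the same weights as the Spark featurizer is an assumption that should be checked by comparing predictions on a few rows.

```python
from typing import Mapping, Union

Value = Union[int, float, str]


def format_namespace(name: str, features: Mapping[str, Value]) -> str:
    """Render one namespace as '|name feat:value ...' in VW text format.

    Names must not contain spaces, '|' or ':' since those are VW separators.
    """
    tokens = []
    for column, value in features.items():
        if isinstance(value, str):
            # String column: name and value form one indicator feature
            # (cp="3" -> cp3:1), following the lines shown in the thread.
            tokens.append(f"{column}{value}:1")
        else:
            # Numeric column: column name is the feature name,
            # the cell value is the feature value.
            tokens.append(f"{column}:{value}")
    return f"|{name} " + " ".join(tokens)


def to_vw_line(namespaces: Mapping[str, Mapping[str, Value]]) -> str:
    """Join several namespaces (one per featurizer output column) into one line."""
    return " ".join(format_namespace(n, f) for n, f in namespaces.items())


# First row of the table, with cp stringified as suggested above.
row = {"age": 63, "sex": 1, "cp": "3", "trestbps": 145, "chol": 233, "oldpeak": 2.3}
line = to_vw_line({"features": row})
print(line)

# Two namespaces (hypothetical names "a" and "b"), as needed for interactions.
two = to_vw_line({"a": {"age": 63}, "b": {"cp": "3"}})
print(two)
```

With separate namespaces in place, VW's `-q` option can pair them by their first letters (for example `-q ab`), provided the model was trained with the same interaction settings.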
I understand. But how should I port a model trained on Databricks using SynapseML (Python), together with its featurizers and zippers, to the VW CLI? To restate the setup: the VW contextual bandit model is trained with SynapseML on Databricks using the code linked above, the model and the featurizer/zipper pipeline are saved after training, and inference has to run through the native VW CLI because the inference environment does not support Spark.
I'm not sure I follow how you plan to flow your data to the VW CLI. It only accepts text and JSON as input.
As of today, the VW Spark featurizers don't support a serialization format outside of the Spark ecosystem.
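For reference, a prediction-only CLI call over a text or JSON file could be assembled as in the sketch below. The helper `vw_predict_command` and the file names are hypothetical. It assumes `vw` is on the PATH and that `model.vw` is the native binary model exported from SynapseML. The flags used are standard VW options: `-t` (test only, no learning), `-i` (load model), `-d` (input data), `-p` (write predictions), and `--json` for JSON input.

```python
from typing import List


def vw_predict_command(model_path: str, data_path: str,
                       predictions_path: str,
                       json_input: bool = False) -> List[str]:
    """Build the argument list for a prediction-only VW run."""
    cmd = ["vw", "-t", "-i", model_path, "-d", data_path, "-p", predictions_path]
    if json_input:
        # Input file contains one JSON example per line instead of VW text.
        cmd.append("--json")
    return cmd


cmd = vw_predict_command("model.vw", "examples.txt", "predictions.txt")
print(" ".join(cmd))

# To actually run it (requires the vw binary to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

The featurization still has to be reproduced outside Spark to create the input file, so this only covers the invocation side.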