VowpalWabbit - Train with Synapse on databricks and inference via native VW CLI

Open harresbintariq opened this issue 3 years ago • 7 comments

Hi,

I am training my VW CB model using SynapseML on Databricks.

The code is the same as in: https://microsoft.github.io/SynapseML/docs/features/vw/Vowpal%20Wabbit%20-%20Overview/#vw-contextual-bandit

The model and the featurizer/zipper pipeline are saved once training is complete.

Now I want to run inference using the native VW CLI, because the inference environment does not support Spark.

Can someone please shed light on how this can be done?

harresbintariq avatar Feb 03 '22 15:02 harresbintariq

Adding @jackgerrits and @eisber for visibility here

mhamilton723 avatar Feb 24 '22 18:02 mhamilton723

You should be able to get the binary model saved: https://github.com/microsoft/SynapseML/blob/master/vw/src/main/scala/com/microsoft/azure/synapse/ml/vw/VowpalWabbitBaseModel.scala#L111

The tricky piece is getting the featurization right. Did you use the VW featurizer or MLSpark featurize?

eisber avatar Feb 24 '22 21:02 eisber

I used the VW featurizer as detailed here: https://microsoft.github.io/SynapseML/docs/features/vw/Vowpal%20Wabbit%20-%20Overview/#vw-contextual-bandit

harresbintariq avatar Feb 28 '22 09:02 harresbintariq

I am able to save the binary model, but the issue lies with the featurizer (as mentioned above).

harresbintariq avatar Feb 28 '22 09:02 harresbintariq

The input for the featurizer is the following table (except the target column). Data types are either int or float, thus the NumericFeaturizer is used.

age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
63   1    3   145       233   1    0        150      0      2.3      0      0   1     1
37   1    2   130       250   0    1        187      0      3.5      0      0   2     1
41   0    1   130       204   0    0        172      0      1.4      2      0   2     1
56   1    1   120       236   0    1        178      0      0.8      2      0   2     1
57   0    0   120       354   0    1        163      1      0.6      2      0   2     1

For the VW Featurizer, the outputColumn is used as the namespace, column names are feature names, and values are used as feature values. Thus, for the example you reference, the first row should result in

|features age:63 sex:1 cp:3 trestbps:145 ...
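
The column-to-feature mapping described here can be reproduced outside Spark with a few lines of plain Python. This is only a minimal sketch: the function name `to_vw_line` and the `skip` parameter are hypothetical, and it assumes every column is numeric and that the namespace is the featurizer's output column name (`features`).

```python
def to_vw_line(row, namespace="features", skip=("target",)):
    """Render one row (dict of column name -> number) as a VW text example.

    Assumption: all values are int or float; the label column is skipped
    because an unlabeled example is enough for inference.
    """
    tokens = [f"{name}:{value:g}" for name, value in row.items() if name not in skip]
    return f"|{namespace} " + " ".join(tokens)


# First row of the table above (truncated to a few columns for brevity).
row = {"age": 63, "sex": 1, "cp": 3, "trestbps": 145, "chol": 233, "oldpeak": 2.3, "target": 1}
print(to_vw_line(row))
```

Whether the hashes VW computes from this text match the ones the Spark featurizer produced during training should be verified against the trained model before relying on it.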

From an ML perspective, this featurization isn't ideal for VW. Especially for low-cardinality categorical columns, I'd suggest stringifying them so that each value gets its own weight.

Assuming you change the data type of cp to string, the first two rows become:

|features age:63 sex:1 cp3:1 trestbps:145 ...
|features age:37 sex:1 cp2:1 trestbps:130 ...
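
A rough Python sketch of the same idea: string values become a feature named column-plus-value with weight 1, while numbers stay as name:value. The helper names `feature_token` and `to_vw_line_mixed` are hypothetical, and the rejection of whitespace, `|` and `:` in string values is a precaution added here because those characters are delimiters in VW's text format.

```python
def feature_token(name, value):
    """Return one VW feature token for a column value.

    Strings are treated as categorical: "cp" with value "3" -> "cp3:1".
    Numbers are kept as "name:value".
    """
    if isinstance(value, str):
        if any(ch in value for ch in " \t\n|:"):
            raise ValueError(f"value {value!r} contains a VW delimiter character")
        return f"{name}{value}:1"
    return f"{name}:{value:g}"


def to_vw_line_mixed(row, namespace="features"):
    """Render a row with mixed numeric and string columns as a VW text example."""
    return f"|{namespace} " + " ".join(feature_token(n, v) for n, v in row.items())


# Second row of the table, with cp stringified.
print(to_vw_line_mixed({"age": 37, "sex": 1, "cp": "2", "trestbps": 130}))
```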

If you want to make use of VW's interaction feature, you'll have to produce multiple feature vectors using different targetCols.
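
On the text side, several feature vectors correspond to several namespaces on one line. The sketch below is illustrative only: the function name and the namespace names `demo` and `clinical` are made up, and all values are assumed numeric. With such input, the VW CLI option `-q dc` would cross the namespaces whose names start with `d` and `c`.

```python
def to_vw_multi_namespace(namespaces):
    """Render {namespace: {feature: number}} as a single VW text example."""
    parts = []
    for ns, feats in namespaces.items():
        tokens = " ".join(f"{name}:{value:g}" for name, value in feats.items())
        parts.append(f"|{ns} {tokens}")
    return " ".join(parts)


example = {
    "demo": {"age": 63, "sex": 1},           # hypothetical grouping of columns
    "clinical": {"cp": 3, "trestbps": 145},  # hypothetical grouping of columns
}
print(to_vw_multi_namespace(example))
```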

eisber avatar Mar 01 '22 07:03 eisber

I understand. But how should I port a model trained on Databricks using SynapseML (Python), with featurizers and zippers, to the VW CLI? Please see below for further explanation:

I am training my VW Contextual Bandits model using SynapseML on Databricks.

The code is the same as in: https://microsoft.github.io/SynapseML/docs/features/vw/Vowpal%20Wabbit%20-%20Overview/#vw-contextual-bandit

The model and the featurizer/zipper pipeline are saved once training is complete.

Now I want to run inference using the native VW CLI, because the inference environment does not support Spark.

Can someone please shed light on how this can be done?

harresbintariq avatar Mar 01 '22 10:03 harresbintariq

I'm not sure I follow how you plan to flow your data to the VW CLI - it only accepts text and JSON as input.

As of today, the VW Spark featurizers don't support a serialization format outside of the Spark ecosystem.
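
In practice this means the featurization has to be re-implemented on the inference side, with the resulting text examples fed to the CLI together with the saved binary model. A minimal sketch of the invocation, assuming `vw` is on the PATH and using placeholder file names; `-t` (test only), `-i` (load model), `-d` (data file) and `-p` (predictions file) are standard VW options. For a contextual bandit model, the data file would also have to follow VW's contextual bandit input layout, which this sketch does not cover.

```python
def vw_predict_command(model_path, data_path, preds_path):
    """Build the argument list for a test-only VW run against a saved model."""
    return [
        "vw",
        "-t",                   # test only: do not update the model
        "-i", str(model_path),  # binary model saved from SynapseML
        "-d", str(data_path),   # text examples produced by your own featurization
        "-p", str(preds_path),  # where VW writes the predictions
    ]


# File names are placeholders for illustration.
cmd = vw_predict_command("model.bin", "examples.txt", "preds.txt")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```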

eisber avatar Mar 01 '22 12:03 eisber