
Vowpal Wabbit - Large cardinality

AllardJM opened this issue 10 months ago · 0 comments

Spark: 3.2, SynapseML: com.microsoft.azure:synapseml_2.12:0.11.3

I have a large dataset (~19 billion rows). If I run VW on the data using 8-10 columns (all but one non-numeric), the process completes in about 9 minutes, even with multiple quadratic terms (not shown) in the pass-through args.

from synapse.ml.vw import VowpalWabbitGeneric

model = VowpalWabbitGeneric(
    numPasses=1,
    numBits=18,  # 2^18 hash buckets
    useBarrierExecutionMode=False,
    passThroughArgs="--loss_function logistic --link logistic --l1 0.000001"
).fit(sdf)

However, if I take the same data and hash the 8-10 columns so that the resulting feature has ~5.5 million distinct values, then run the same code, it runs forever (I killed the process after 10 hours).
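For context, the hashing step looks roughly like the sketch below (the column names col_a ... col_h and the output column hashed_feature are placeholders, not my actual schema):

from pyspark.sql import functions as F

# Combine the 8-10 raw columns into a single hashed feature;
# the result has ~5.5 million distinct values.
hashed_sdf = sdf.withColumn(
    "hashed_feature",
    F.hash("col_a", "col_b", "col_c", "col_d",
           "col_e", "col_f", "col_g", "col_h")
)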

Is there anything to know about running VW on Spark when a namespace has very high, and potentially sparse, cardinality?

AllardJM · Jan 15 '25 13:01