hummingbird icon indicating copy to clipboard operation
hummingbird copied to clipboard

Should SKLearn operators be assumed to produce a single output?

Open stillmatic opened this issue 3 years ago • 7 comments

See https://github.com/microsoft/hummingbird/blob/main/hummingbird/ml/_parse.py#L256

Consider models which implement predict and predict_proba functions. These return both label and probabilities as outputs. The current logic means that we cannot name the outputs in the hummingbird conversion step (ie with output_names argument to extra_config) and instead have to perform some ONNX graph surgery afterwards.

stillmatic avatar Nov 25 '22 18:11 stillmatic

took an extremely silly hard coded approach here: https://github.com/stillmatic/hummingbird/commit/0a2fc96bdb1112607ae61719555424ce1ba80255

stillmatic avatar Jan 17 '23 17:01 stillmatic

Hey let me take a stab at this. It requires several changes in order to make it generic, but I agree that it could be the source of the error behind #676.

interesaaat avatar Jan 17 '23 17:01 interesaaat

Of course unless you want to try to fix this yourself!

interesaaat avatar Jan 17 '23 17:01 interesaaat

yeah, the generic case is quite tricky, hard coding it is much easier ;)

tried to do a cleaner implementation by adding XGBClassifier to the supported sklearn operator map, but that requires changes to onnxmltools upstream, probably. for right now, happy with https://github.com/stillmatic/hummingbird/commit/0a2fc96bdb1112607ae61719555424ce1ba80255 for myself, it's probably too hacky to upstream though.

I don't think it's the problem behind #676, after debugging more, that problem appears to be specifically in the GEMM -> ONNX conversion.

stillmatic avatar Jan 17 '23 18:01 stillmatic

@stillmatic can you check if this repo will solve your problem?

interesaaat avatar Jan 17 '23 23:01 interesaaat

interesting, this definitely looks like it is properly generating the two variables. on one of my internal models, I'm seeing this output

In [10]: hmclf.model.graph.output
Out[10]:
[name: "variable1"
type {
  tensor_type {
    elem_type: 7
    shape {
      dim {
        dim_param: "sym"
      }
    }
  }
}
, name: "variable2"
type {
  tensor_type {
    elem_type: 11
    shape {
      dim {
        dim_param: "sym"
      }
      dim {
        dim_param: "Concatvariable2_dim_1"
      }
    }
  }
}
]

though I can't generate a reproducible example. it doesn't seem to affect prediction, onnx is happy with it. let me dig a bit more over the next few days!

stillmatic avatar Jan 18 '23 22:01 stillmatic

@interesaaat -- sorry for delayed response. I have tested your branch with internal models and everything seems to be working, so think it could be upstreamed. Thanks for the help!

stillmatic avatar Feb 02 '23 23:02 stillmatic