Document the internal JSON representation of the pipelines

Open mitar opened this issue 4 years ago • 8 comments

Is your feature request related to a problem? Please describe.

As described in issue #334, we are building an AutoML system in Python. We have our own pipeline representation and would like to use NimbusML operators as operators in our pipelines.

Describe the solution you'd like

To me it looks like the best approach is for you to document the JSON representation of pipelines so that we can use it directly. We could then write a converter from our pipeline language to that JSON representation and execute a pipeline by passing you the JSON directly.

So, two feature requests:

  • Document JSON representation.
  • Provide a public API that accepts the JSON representation and runs it, ideally in two phases, fit and predict. Optionally it would return the output values after every operator, but at least the final output value(s) of the pipeline.
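To make the converter idea concrete, here is a minimal sketch of what such a translation could look like. It assumes the `Nodes` / `Name` / `Inputs` / `Outputs` layout with `$`-prefixed variable names from ML.NET's entrypoint graph documentation. The `to_graph` helper, the `Data` / `OutputData` port names, and the operator names are hypothetical placeholders, not part of NimbusML.

```python
import json


def to_graph(steps, input_var="$input_data"):
    """Chain (name, params) steps linearly into an entrypoint-style graph.

    Hypothetical helper: each node reads the previous node's output variable.
    The "Data"/"OutputData" port names are assumptions for illustration.
    """
    nodes = []
    current = input_var
    for index, (name, params) in enumerate(steps):
        output_var = f"$data{index}"
        inputs = dict(params)        # copy so the caller's dict is not mutated
        inputs["Data"] = current     # wire this node to the previous output
        nodes.append({
            "Name": name,
            "Inputs": inputs,
            "Outputs": {"OutputData": output_var},
        })
        current = output_var
    return {"Nodes": nodes}


# Placeholder operator names, not taken from the real entrypoint catalog.
graph = to_graph([
    ("Transforms.ExampleNormalizer", {"Column": "x"}),
    ("Trainers.ExampleClusterer", {"K": 3}),
])
print(json.dumps(graph, indent=2))
```

A real converter would also have to handle branching pipelines and operators with several inputs or outputs, which a linear chain like this cannot express.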

Describe alternatives you've considered

One way to do that is to use the sklearn-compatibility classes directly, but that goes through multiple levels of abstraction: we have to convert our pipelines into sklearn pipelines so that NimbusML can convert them to JSON and then send that over to .NET to run it. I think it would be much easier if we could provide the JSON directly.

mitar avatar Oct 16 '19 11:10 mitar

Hi,

Please see the entrypoint graph documentation for ML.NET (https://github.com/dotnet/machinelearning/blob/master/docs/code/EntryPoints.md) and also some brief discussion here (https://github.com/microsoft/NimbusML/blob/341e01ab8d97af2ca8408dacf0b169f6d219d4c0/docs/developers/entrypoints.md) in this repo.

The usual way that I examine this entrypoint graph is to look at the input of px_call (see here https://github.com/microsoft/NimbusML/blob/15f12859273ea0b38bccbf5e7699bfb51c997013/src/python/nimbusml/internal/utils/entrypoints.py#L269).
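For readers who capture such a graph, a small script can make the wiring between nodes easier to read. The following is only a sketch: it assumes the graph is a JSON object with a `Nodes` list whose entries have `Name`, `Inputs`, and `Outputs`, and that values starting with `$` are variable references, as in the ML.NET document linked above. The sample graph, its operator names, and the `describe` helper are made up for illustration.

```python
import json

# Hypothetical sample graph; operator and port names are placeholders.
GRAPH_JSON = """
{
  "Nodes": [
    {
      "Name": "Transforms.ExampleNormalizer",
      "Inputs": {"Column": "x", "Data": "$input_data"},
      "Outputs": {"OutputData": "$data0", "Model": "$model0"}
    },
    {
      "Name": "Trainers.ExampleClusterer",
      "Inputs": {"K": 3, "TrainingData": "$data0"},
      "Outputs": {"PredictorModel": "$model1"}
    }
  ]
}
"""


def variables(ports):
    """Return the sorted '$'-prefixed variable references among port values."""
    return sorted(
        value for value in ports.values()
        if isinstance(value, str) and value.startswith("$")
    )


def describe(graph):
    """Summarize each node as 'Name: consumed variables -> produced variables'."""
    lines = []
    for node in graph["Nodes"]:
        consumed = ", ".join(variables(node.get("Inputs", {})))
        produced = ", ".join(variables(node.get("Outputs", {})))
        lines.append(f"{node['Name']}: {consumed} -> {produced}")
    return lines


summary = describe(json.loads(GRAPH_JSON))
for line in summary:
    print(line)
```

Plain parameter values such as `"Column": "x"` are skipped, so only the data flow between nodes is shown.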

Hope this helps.

zyw400 avatar Oct 17 '19 12:10 zyw400

I see. Thanks. So if I create a JSON to represent the pipeline, there is no public API yet to run that? Or am I missing anything?

mitar avatar Oct 17 '19 12:10 mitar

Right. It is not exposed in Python yet.

zyw400 avatar Oct 17 '19 12:10 zyw400

One more question. In this documentation, KMeansPlusPlus is used as an example. What I do not understand is why the normalize parameter is defined in the docstring and not auto-generated. Other parameters seem to be auto-generated.

mitar avatar Oct 17 '19 13:10 mitar

Most likely we didn't like the autogenerated description and wanted to add more detail to it. For an example where we keep the auto-generated text for the same parameter, see: https://github.com/microsoft/NimbusML/blob/master/src/python/nimbusml/ensemble/fastforestbinaryclassifier.py

The content in the docstring is "patched" into the generated Python classes by the auto-gen program (https://github.com/microsoft/NimbusML/blob/46a14e6ddb921a243f269cdc56bc3fda05e13fa1/src/python/tools/entrypoint_compiler.py#L417).

zyw400 avatar Oct 17 '19 13:10 zyw400

Hm, but then why is the normalize parameter not present twice in KMeansPlusPlus?

mitar avatar Oct 17 '19 13:10 mitar

If a parameter description already exists in the autogenerated files, the old content is replaced by the one from the docstring. https://github.com/microsoft/NimbusML/blob/46a14e6ddb921a243f269cdc56bc3fda05e13fa1/src/python/tools/doc_builder.py#L124
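The override behavior described here can be pictured with a small sketch. This is not the code in `doc_builder.py`; it only assumes that parameter descriptions can be treated as a mapping from parameter name to text, and the function names and sample descriptions are hypothetical.

```python
def merge_param_docs(autogenerated, handwritten):
    """Combine descriptions so that handwritten text wins on a name clash.

    Hypothetical helper: because both sources are keyed by parameter name,
    a parameter documented in both places appears only once in the result.
    """
    merged = dict(autogenerated)   # start from the generated descriptions
    merged.update(handwritten)     # handwritten entries replace generated ones
    return merged


def render(params):
    """Render the merged descriptions as ':param name: text' lines."""
    return "\n".join(f":param {name}: {text}" for name, text in params.items())


# Made-up descriptions for illustration only.
autogenerated = {
    "normalize": "Generated description.",
    "n_clusters": "Generated description of the cluster count.",
}
handwritten = {
    "normalize": "Longer, hand-edited description of normalization.",
}

docstring = render(merge_param_docs(autogenerated, handwritten))
print(docstring)
```

Keying by parameter name is what keeps `normalize` from showing up twice: the handwritten entry replaces the generated one instead of being appended next to it.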

zyw400 avatar Oct 17 '19 13:10 zyw400

I see, nice. Thanks for pointing this out.

mitar avatar Oct 17 '19 13:10 mitar