NimbusML
Document the internal JSON representation of the pipelines
Is your feature request related to a problem? Please describe.
As described in issue #334, we are building an AutoML system in Python. We have our own pipeline representation and would like to use NimbusML operators as operators in our pipelines.
Describe the solution you'd like
To me, the best approach would be for you to document the JSON representation of pipelines so that we can use it directly. We could then write a converter from our pipeline language to that JSON representation and execute a pipeline by passing you the JSON.
So, two feature requests:
- Document JSON representation.
- Provide a public API that accepts the JSON representation and runs it. Ideally in two phases, fit and predict. Optionally returning the output values after every operator, but at least the final output value(s) of the pipeline.
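To make the converter idea concrete, here is a minimal sketch of what I have in mind on our side. It assumes the graph is a JSON object with a "Nodes" list where each node has "Name", "Inputs" and "Outputs", and that nodes are chained through "$"-prefixed variable names, which is how I read the ML.NET entry point documentation. The step format, the function name `to_entrypoint_graph`, the variable names, and the entry point and port names in the example are all my own placeholders, not a confirmed schema.

```python
import json


def to_entrypoint_graph(steps, input_var="$input_data"):
    """Convert a linear list of steps into an entry-point-style graph dict.

    Each step is a dict with (hypothetical) keys:
      name        - entry point name, e.g. "Transforms.MinMaxNormalizer"
      input_port  - name of the input that receives the previous output
      output_port - name of the output that feeds the next step
      params      - remaining inputs passed through unchanged
    """
    nodes = []
    current = input_var  # variable carrying data into the next node
    for index, step in enumerate(steps):
        output_var = "$var_{}".format(index)
        inputs = dict(step.get("params", {}))
        inputs[step["input_port"]] = current
        nodes.append({
            "Name": step["name"],
            "Inputs": inputs,
            "Outputs": {step["output_port"]: output_var},
        })
        current = output_var
    return {"Nodes": nodes}


# Example pipeline in our own (simplified) representation.
# Entry point and port names are illustrative and need checking
# against the real entry point manifest.
steps = [
    {"name": "Transforms.MinMaxNormalizer",
     "input_port": "Data", "output_port": "OutputData",
     "params": {"Column": ["Features"]}},
    {"name": "Trainers.KMeansPlusPlusClusterer",
     "input_port": "TrainingData", "output_port": "PredictorModel",
     "params": {"K": 3}},
]

graph = to_entrypoint_graph(steps)
graph_json = json.dumps(graph, indent=2)
```

With a documented schema and a public call that accepts `graph_json`, this would be all we need to plug NimbusML operators into our pipelines.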
Describe alternatives you've considered
One way to do that is to use the sklearn-compatibility classes directly, but that goes through multiple levels of abstraction: we have to convert our pipelines into sklearn pipelines so that NimbusML can convert them to JSON and then send that over to .NET to run it. I think it would be much easier if we could provide the JSON directly.
Hi,
Please see the entrypoint graph documentation for ML.NET (https://github.com/dotnet/machinelearning/blob/master/docs/code/EntryPoints.md) and also some brief discussion here (https://github.com/microsoft/NimbusML/blob/341e01ab8d97af2ca8408dacf0b169f6d219d4c0/docs/developers/entrypoints.md) in this repo.
The usual way that I examine this entrypoint graph is to look at the input of px_call (see here https://github.com/microsoft/NimbusML/blob/15f12859273ea0b38bccbf5e7699bfb51c997013/src/python/nimbusml/internal/utils/entrypoints.py#L269).
Hope this helps.
I see. Thanks. So if I create JSON to represent the pipeline, there is no public API yet to run it? Or am I missing something?
Right. It is not exposed in Python yet.
One more question. In this documentation, KMeansPlusPlus is used as an example. What I do not understand is why the normalize parameter is defined in the docstring rather than auto-generated. The other parameters seem to be auto-generated.
Most likely we did not like the autogenerated description and wanted to add more detail to it. For an example where we keep the autogenerated text for the same parameter, see: https://github.com/microsoft/NimbusML/blob/master/src/python/nimbusml/ensemble/fastforestbinaryclassifier.py
The content in the docstring is "patched" to the generated python classes by the auto-gen program ( https://github.com/microsoft/NimbusML/blob/46a14e6ddb921a243f269cdc56bc3fda05e13fa1/src/python/tools/entrypoint_compiler.py#L417).
Hm, but then why is the normalize parameter not present twice in KMeansPlusPlus?
If a description for the parameter already exists in the autogenerated files, it is replaced by the one from the docstring rather than added a second time. https://github.com/microsoft/NimbusML/blob/46a14e6ddb921a243f269cdc56bc3fda05e13fa1/src/python/tools/doc_builder.py#L124
I see, nice. Thanks for pointing me to this.
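As a rough illustration of that override behavior (this is not the code in doc_builder.py, just a sketch with made-up names and descriptions), it works like a dictionary update keyed on the parameter name:

```python
def merge_param_docs(autogenerated, docstring_overrides):
    """Return parameter docs where hand-written entries replace generated ones.

    Both arguments map parameter name -> description. A parameter present
    in both appears once in the result, with the hand-written description.
    """
    merged = dict(autogenerated)        # start from the generated text
    merged.update(docstring_overrides)  # hand-written text wins on conflict
    return merged


# Hypothetical descriptions, for illustration only.
autogenerated = {
    "normalize": "Normalize option for the feature column.",
    "n_clusters": "The number of clusters.",
}
docstring_overrides = {
    "normalize": "Hand-written, more detailed description of normalization.",
}

merged = merge_param_docs(autogenerated, docstring_overrides)
```

So `normalize` shows up once, with the docstring text, and parameters without a hand-written entry keep their autogenerated text.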