
Defining the behavior of CREATE MODEL while using MLflow as both catalog and storage.

Open Renkai opened this issue 3 years ago • 8 comments

While using an in-memory catalog and MLflow as storage, we can define a model like this:

CREATE OR REPLACE MODEL yolov5s
FLAVOR yolov5
PREPROCESSOR 'rikai.contrib.yolov5.transforms.pre_processing'
POSTPROCESSOR 'rikai.contrib.yolov5.transforms.post_processing'
USING 'mlflow:///da-yolov5s-model';

This works because the model name is stored in two different places in Rikai and in MLflow: in Rikai it's yolov5s, while in MLflow it's da-yolov5s-model.

But when we use MLflow for both roles, the situation changes. We only have one place to store names, so the model name in Rikai should always be the same as the model name in MLflow.

Given that, I suggest accepting both of the following forms as valid definitions:

One:

CREATE OR REPLACE MODEL yolov5s
FLAVOR yolov5
PREPROCESSOR 'rikai.contrib.yolov5.transforms.pre_processing'
POSTPROCESSOR 'rikai.contrib.yolov5.transforms.post_processing'
USING 'mlflow:///yolov5s';

Two:

CREATE OR REPLACE MODEL yolov5s
FLAVOR yolov5
PREPROCESSOR 'rikai.contrib.yolov5.transforms.pre_processing'
POSTPROCESSOR 'rikai.contrib.yolov5.transforms.post_processing'
-- Fill the mlflow URI automatically when using the MLflow catalog

If the model name conflicts with the model URI, it should throw an exception:

CREATE OR REPLACE MODEL yolov5s
FLAVOR yolov5
PREPROCESSOR 'rikai.contrib.yolov5.transforms.pre_processing'
POSTPROCESSOR 'rikai.contrib.yolov5.transforms.post_processing'
USING 'mlflow:///da-yolov5s-model'; -- Throw a name-conflict exception.
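The three cases above could be sketched roughly as follows (function and exception names are hypothetical illustrations, not Rikai's actual API):

```python
from typing import Optional
from urllib.parse import urlparse


class ModelNameConflictError(Exception):
    """Raised when the Rikai model name disagrees with the mlflow URI."""


def resolve_mlflow_uri(model_name: str, uri: Optional[str]) -> str:
    """Validate or derive the mlflow URI for CREATE MODEL under an MLflow catalog."""
    if uri is None:
        # Form two: no USING clause, fill the mlflow URI automatically.
        return f"mlflow:///{model_name}"
    registered = urlparse(uri).path.lstrip("/")
    if registered != model_name:
        # Form three: the URI names a different mlflow model.
        raise ModelNameConflictError(
            f"model name {model_name!r} conflicts with mlflow model {registered!r}"
        )
    # Form one: the URI path matches the model name.
    return uri
```

So `resolve_mlflow_uri("yolov5s", None)` and `resolve_mlflow_uri("yolov5s", "mlflow:///yolov5s")` both yield `mlflow:///yolov5s`, while passing `mlflow:///da-yolov5s-model` raises.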

Renkai avatar Jan 10 '22 02:01 Renkai

If you're using the mlflow catalog, one use case for an explicit CREATE MODEL is wanting to use a previous version. In that case, the model name should be different to distinguish it from the latest version, no?

changhiskhan avatar Jan 10 '22 03:01 changhiskhan

If you're using the mlflow catalog, one use case for an explicit CREATE MODEL is wanting to use a previous version. In that case, the model name should be different to distinguish it from the latest version, no?

That's a reasonable requirement, but if we need this, we have to save the Rikai model name somewhere other than mlflow. What would be the best candidate?

Renkai avatar Jan 10 '22 03:01 Renkai

If we are using mlflow as the catalog, presumably there should be no need to use mlflow as a registry?

i.e., SHOW MODELS should actually list all the compatible models in mlflow, so once a Spark cluster starts and is pointed at mlflow as the catalog service, users don't need CREATE MODEL ... USING 'mlflow://...' anymore.

eddyxu avatar Jan 10 '22 04:01 eddyxu

If we are using mlflow as the catalog, presumably there should be no need to use mlflow as a registry?

i.e., SHOW MODELS should actually list all the compatible models in mlflow, so once a Spark cluster starts and is pointed at mlflow as the catalog service, users don't need CREATE MODEL ... USING 'mlflow://...' anymore.

In the current implementation, CREATE MODEL is the action that turns a non-compatible model into a compatible one (e.g., by adding Rikai-related tags to it) so that it shows up in the SHOW MODELS output, so I guess we still need it?
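That tagging behavior could be sketched like this, with an in-memory dict standing in for the mlflow registry and a hypothetical tag key (not necessarily what Rikai actually writes):

```python
# Hypothetical tag key -- the real Rikai implementation may use different ones.
RIKAI_FLAVOR_TAG = "rikai.model.flavor"


def create_model(registry: dict, name: str, flavor: str) -> None:
    """CREATE MODEL as a tagging action: mark an mlflow model Rikai-compatible."""
    registry[name].setdefault("tags", {})[RIKAI_FLAVOR_TAG] = flavor


def show_models(registry: dict) -> list:
    """SHOW MODELS lists only the models that carry the Rikai tag."""
    return sorted(
        name for name, meta in registry.items()
        if RIKAI_FLAVOR_TAG in meta.get("tags", {})
    )
```

Under this reading, an untagged model is invisible to SHOW MODELS until someone runs CREATE MODEL against it once.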

Renkai avatar Jan 10 '22 05:01 Renkai

CREATE OR REPLACE MODEL yolov5s_1
OPTIONS (threshold_a 1)
FLAVOR yolov5
USING 'mlflow:///da-yolov5s-model';

CREATE OR REPLACE MODEL yolov5s_2
OPTIONS (threshold_a 2)
FLAVOR yolov5
USING 'mlflow:///da-yolov5s-model';

What if we want to derive models from a registered model using different options, as shown above?

da-liii avatar Jan 10 '22 08:01 da-liii

Isn't that the difference between a Catalog and a Registry?

A Catalog maintains a discoverable namespace for a DBMS. In this case, ModelCatalog maintains the namespace of models that can be used by ML_PREDICT, which the SQL parser relies on to build the logical plan for ML_PREDICT(foo, bar).

To this extent, it sounds wrong to me if MLflowCatalog is configured as the catalog for rikai but the names in mlflow are not directly visible to SQL.

The relationship between Catalog and Registry is very similar to the relationship between a Table and the Location of its actual data. A Table is a logical concept that is visible to the SQL parser and managed by the Catalog service in any database, while the Location (e.g., table foo.bar's location is s3://bucket/warehouse/foo/bar) is where the table physically lives, and s3:// is the instruction that tells Spark how to load the data (reading via hadoop-aws.jar => aws-sdk.jar => aws-s3.jar).

In your example above, those are just three models in mlflow (similar to three tables in a Hive metastore?). If the purpose of registering three different models is merely to change the configuration, we can have other means to do it, which would be an improvement over the current implementation.

For example, if we just want to dynamically change the runtime parameters, we can use SQL's SET:

https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-conf-mgmt-set.html
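A rough sketch of what that SET-based approach could resolve to: overlay session configuration on top of the OPTIONS stored at CREATE MODEL time. The `rikai.model.<name>.<option>` key scheme here is a hypothetical illustration, not an existing convention:

```python
def effective_options(model_options: dict, session_conf: dict,
                      model_name: str) -> dict:
    """Merge a model's stored OPTIONS with session-level SET overrides.

    Hypothetical scheme: SET rikai.model.<name>.<option> = <value>
    overrides the option value stored at CREATE MODEL time.
    """
    prefix = f"rikai.model.{model_name}."
    merged = dict(model_options)
    for key, value in session_conf.items():
        if key.startswith(prefix):
            # Session-level setting wins over the stored default.
            merged[key[len(prefix):]] = value
    return merged
```

With this, one registered model could serve both thresholds from the example above, switched per session instead of per registered name.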

eddyxu avatar Jan 10 '22 16:01 eddyxu

I think MLflowCatalog should be removed!

The same behavior can be achieved differently: if we are using the MLflow registry, we could automatically register all models found in the MLflow registry into the in-memory SimpleCatalog.

In this way, there will be no scans of MLflow when we run:

show models
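A minimal sketch of that sync, assuming SimpleCatalog is a dict-backed in-memory catalog and compatibility is marked by a hypothetical Rikai tag:

```python
class SimpleCatalog:
    """Minimal in-memory catalog (a stand-in for Rikai's SimpleCatalog)."""

    def __init__(self):
        self._models = {}

    def register(self, name: str, spec: dict) -> None:
        self._models[name] = spec

    def list_models(self) -> list:
        # SHOW MODELS reads from memory only -- no MLflow round trip.
        return sorted(self._models)


def sync_from_mlflow(catalog: SimpleCatalog, mlflow_models: dict) -> None:
    """One-time sync at session start: copy every compatible model into memory."""
    for name, meta in mlflow_models.items():
        if "rikai.model.flavor" in meta.get("tags", {}):  # hypothetical tag key
            catalog.register(name, meta)
```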

da-liii avatar Jan 11 '22 02:01 da-liii

Shouldn't there be a persistent model catalog (whether backed by mlflow or not) in Rikai, one that can offer a consistent model namespace across users, sessions, and departments?

In this way, there will be no scans of MLflow when we run: show models

Is your objection to a persistent catalog a performance concern? What is the performance difference between SHOW MODELS and SHOW TABLES in production?

The same behavior can be achieved differently: if we are using the MLflow registry, we could automatically register all models found in the MLflow registry into the in-memory SimpleCatalog.

Isn't that a guaranteed scan every time a SparkSession starts? And in that case, say we have a bunch of long-running Spark sessions (e.g., ThriftServer) for interactive queries: does that mean every time a user registers a model, all Spark clusters need to restart to use the new model? Or does someone need to manually apply CREATE MODEL on every cluster?
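One possible middle ground between restart-to-refresh and a scan per query, sketched here with entirely hypothetical names, is a TTL-bounded re-scan: long-running sessions pick up newly registered models after at most `ttl` seconds, without a restart:

```python
import time


class RefreshingModelList:
    """Re-scan the backing registry at most once per `ttl` seconds."""

    def __init__(self, scan_fn, ttl=60.0, clock=time.monotonic):
        self._scan_fn = scan_fn  # callable returning the current model list
        self._ttl = ttl
        self._clock = clock      # injectable for testing
        self._cached = None
        self._stamp = float("-inf")

    def models(self):
        now = self._clock()
        if self._cached is None or now - self._stamp >= self._ttl:
            # Stale or never scanned: hit the registry once, then cache.
            self._cached = self._scan_fn()
            self._stamp = now
        return self._cached
```

This trades a bounded staleness window for avoiding both per-query scans and cluster restarts.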

eddyxu avatar Jan 11 '22 20:01 eddyxu