Defining the behavior of CREATE MODEL when using MLflow as both catalog and storage.
When using local memory as the catalog and MLflow as storage, we can define a model like this:
CREATE OR REPLACE MODEL yolov5s
FLAVOR yolov5
PREPROCESSOR 'rikai.contrib.yolov5.transforms.pre_processing'
POSTPROCESSOR 'rikai.contrib.yolov5.transforms.post_processing'
USING 'mlflow:///da-yolov5s-model';
This works because Rikai and MLflow store the model name in different places: in Rikai it is yolov5s, while in MLflow it is da-yolov5s-model.
But when we use MLflow for both roles, the situation changes. There is only one place left to store names, so the model name in Rikai should always be the same as the model name in MLflow.
Under this condition, I suggest accepting both of the following forms as valid definitions:
One:
CREATE OR REPLACE MODEL yolov5s
FLAVOR yolov5
PREPROCESSOR 'rikai.contrib.yolov5.transforms.pre_processing'
POSTPROCESSOR 'rikai.contrib.yolov5.transforms.post_processing'
USING 'mlflow:///yolov5s';
Two:
CREATE OR REPLACE MODEL yolov5s
FLAVOR yolov5
PREPROCESSOR 'rikai.contrib.yolov5.transforms.pre_processing'
POSTPROCESSOR 'rikai.contrib.yolov5.transforms.post_processing'
-- The mlflow URI is filled in automatically when using the MLflow catalog
If the model name conflicts with the name in the model URI, it should throw an exception:
CREATE OR REPLACE MODEL yolov5s
FLAVOR yolov5
PREPROCESSOR 'rikai.contrib.yolov5.transforms.pre_processing'
POSTPROCESSOR 'rikai.contrib.yolov5.transforms.post_processing'
USING 'mlflow:///da-yolov5s-model'; -- Throws a name-conflict exception.
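The three forms above boil down to one resolution rule. Here is a minimal sketch of that rule in Python; the function name and error message are hypothetical, not part of the real Rikai implementation:

```python
# Hypothetical sketch of the proposed name-resolution rule for the MLflow
# catalog: no URI means "fill it in from the model name"; a URI whose
# registered name differs from the Rikai name is a conflict.
from typing import Optional
from urllib.parse import urlparse

def resolve_mlflow_uri(model_name: str, uri: Optional[str]) -> str:
    """Return the mlflow URI for a model created under the MLflow catalog."""
    if uri is None:
        # Form two: fill the mlflow URI automatically from the model name.
        return f"mlflow:///{model_name}"
    registered_name = urlparse(uri).path.lstrip("/")
    if registered_name != model_name:
        # Form three: the Rikai name and the mlflow name disagree.
        raise ValueError(
            f"model name {model_name!r} conflicts with "
            f"mlflow model {registered_name!r}"
        )
    # Form one: the names already match.
    return uri
```

With this rule, `CREATE OR REPLACE MODEL yolov5s` with no USING clause and with `USING 'mlflow:///yolov5s'` are equivalent, while `USING 'mlflow:///da-yolov5s-model'` fails fast.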
If you're using the mlflow catalog, one use case for explicit "create model" is if you wanted to use a previous version. In that case, the model name should be different to distinguish from the latest version no?
That's a reasonable requirement, but to support it we would need to store the Rikai model name somewhere other than MLflow. Where would be the best candidate?
If we are using mlflow as the catalog, supposedly there should be no need to use mlflow as the registry? I.e., SHOW MODELS should list all the compatible models in mlflow directly, so once a Spark cluster starts and is pointed at mlflow as the catalog service, users don't need to run CREATE MODEL ... USING 'mlflow://...' anymore.
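The pass-through behavior suggested here can be sketched with a toy catalog that delegates listing straight to the registry. All class and method names below are stand-ins, not the real Rikai or mlflow API:

```python
# Toy sketch of a catalog that treats the registry as the source of truth:
# SHOW MODELS queries the registry directly, so CREATE MODEL is unnecessary
# for models that are already compatible.
class FakeRegistry:
    """Stand-in for an mlflow model registry."""
    def __init__(self, models):
        self._models = models

    def list_registered(self):
        return list(self._models)

class MlflowBackedCatalog:
    """Catalog that delegates name listing to the registry, not its own store."""
    def __init__(self, registry):
        self.registry = registry

    def show_models(self):
        # Every compatible model in the registry is visible to SQL directly.
        return sorted(self.registry.list_registered())

catalog = MlflowBackedCatalog(FakeRegistry(["yolov5s", "resnet50"]))
print(catalog.show_models())  # both registry models, no CREATE MODEL needed
```

The trade-off, picked up later in the thread, is that every listing is a live call to the registry.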
In the current implementation, CREATE MODEL is the action that turns a non-compatible model into a compatible one, e.g. by adding Rikai-related tags to the model so that it shows up in the list-models command, so I guess we still need it?
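That reading of CREATE MODEL as a "compatibility action" can be sketched like this; the tag key and registry shape are invented placeholders, not the actual tags Rikai writes:

```python
# Toy sketch of "CREATE MODEL adds Rikai-related tags": only tagged models
# are considered compatible, and SHOW MODELS filters on the tag.
RIKAI_TAG = "rikai.model.flavor"  # hypothetical tag key

registry = {  # stand-in for mlflow: registered name -> tag dict
    "da-yolov5s-model": {},
    "some-other-model": {},
}

def create_model(name: str, flavor: str) -> None:
    """Make an existing registry model Rikai-compatible by tagging it."""
    registry[name][RIKAI_TAG] = flavor

def show_models():
    """Only tagged (i.e. compatible) models are listed."""
    return sorted(n for n, tags in registry.items() if RIKAI_TAG in tags)

create_model("da-yolov5s-model", "yolov5")
print(show_models())  # only the tagged model appears
```

Under this view, CREATE MODEL is less about naming and more about marking a model as usable by ML_PREDICT.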
CREATE OR REPLACE MODEL yolov5s_1
OPTIONS (threshold_a 1)
FLAVOR yolov5
USING 'mlflow:///da-yolov5s-model';
CREATE OR REPLACE MODEL yolov5s_2
OPTIONS (threshold_a 2)
FLAVOR yolov5
USING 'mlflow:///da-yolov5s-model';
What if we want to derive models from a registered model using different options, as shown above?
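The derivation in the two statements above only needs the catalog entry to carry the URI and the option set together. A minimal sketch, with a plain dict standing in for the catalog:

```python
# Sketch of deriving two logical models from one registered mlflow model:
# each catalog entry is (uri, options), so two names can point at the same
# physical model with different option sets. Names are illustrative only.
catalog = {}

def create_model(name, uri, **options):
    """Register a logical model: same URI may appear under many names."""
    catalog[name] = {"uri": uri, "options": options}

create_model("yolov5s_1", "mlflow:///da-yolov5s-model", threshold_a=1)
create_model("yolov5s_2", "mlflow:///da-yolov5s-model", threshold_a=2)

# Both logical models resolve to the same physical mlflow model:
assert catalog["yolov5s_1"]["uri"] == catalog["yolov5s_2"]["uri"]
print(catalog["yolov5s_1"]["options"], catalog["yolov5s_2"]["options"])
```

This is exactly the case where the Rikai name cannot equal the mlflow name for both entries, which motivates the question.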
Isn't that the difference between a Catalog and a Registry?
A Catalog maintains a discoverable namespace for a DBMS; in this case, ModelCatalog maintains the namespace of models that can be used by ML_PREDICT, which the SQL parser relies on to build the logical plan for ML_PREDICT(foo, bar).
To this extent, it sounds wrong to me if MLflowCatalog is configured as the catalog for Rikai but the names in mlflow are not directly visible to SQL.
The relationship between a Catalog and a Registry is very similar to the relationship between a Table and the Location of its actual data. A Table is a logical concept that is visible to the SQL parser and managed by the catalog service in any DB, while the Location (e.g., table foo.bar's location is s3://bucket/warehouse/foo/bar) is where the table physically lives, and s3:// is the instruction that tells Spark how to load the data (reading via hadoop-aws.jar => aws-sdk.jar => aws-s3.jar).
In your example above, those are just three models in mlflow (similar to three tables in a Hive metastore?). If the purpose of registering three different models is merely to change the configuration, we can achieve that by other means, which would be an improvement over the current implementation.
For example, if we just want to change runtime parameters dynamically, we can use SQL's SET:
https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-conf-mgmt-set.html
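The SET-based alternative could look like the following sketch, where a session-level key/value store overrides the options saved with the model. The `rikai.model.<name>.<option>` key scheme is invented here purely for illustration:

```python
# Sketch of using session configuration (SQL SET) to override model options
# at ML_PREDICT time, instead of registering one model per option value.
session_conf = {}                    # stands in for Spark's SET key/value store
model_options = {"threshold_a": 1}   # options saved at CREATE MODEL time

def set_conf(key, value):
    """Equivalent of SQL: SET key = value."""
    session_conf[key] = value

def effective_options(model_name):
    """Session conf wins over the options stored with the model."""
    out = dict(model_options)
    prefix = f"rikai.model.{model_name}."  # hypothetical key scheme
    for key, value in session_conf.items():
        if key.startswith(prefix):
            out[key[len(prefix):]] = value
    return out

# SQL analogue: SET rikai.model.yolov5s_1.threshold_a = 2;
set_conf("rikai.model.yolov5s_1.threshold_a", 2)
print(effective_options("yolov5s_1"))
```

This keeps one registered model in mlflow while still letting each session tune its parameters.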
I think MLflowCatalog should be removed!
The same behavior can be achieved differently: if we are using the MLflow registry, we could automatically register every model found in the MLflow registry into the in-memory SimpleCatalog.
This way, there will be no scans of MLflow when we run SHOW MODELS.
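The startup sync proposed here can be sketched as a one-time copy from the registry into an in-memory catalog; the class names mirror the discussion but are stand-ins, not the real Rikai classes:

```python
# Sketch of the one-time sync: at session start, every registry model is
# copied into the in-memory SimpleCatalog, so later SHOW MODELS calls are
# pure in-memory lookups that never touch MLflow.
class SimpleCatalog:
    def __init__(self):
        self._models = {}

    def register(self, name, uri):
        self._models[name] = uri

    def show_models(self):
        return sorted(self._models)

def sync_from_registry(catalog, registry_models):
    """One registry scan at startup; SHOW MODELS is in-memory afterwards."""
    for name in registry_models:
        catalog.register(name, f"mlflow:///{name}")

catalog = SimpleCatalog()
sync_from_registry(catalog, ["yolov5s", "resnet50"])
print(catalog.show_models())
```

The next replies point out the cost of this design: the catalog is only as fresh as the last sync.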
Shouldn't there be a persistent model catalog (whether backed by mlflow or not) in Rikai that can offer a consistent model namespace across users, sessions, and departments?
Is your objection to a persistent catalog a performance concern? What is the performance difference between SHOW MODELS and SHOW TABLES in production?
The same behavior can be achieved differently: if we are using the MLflow registry, we could automatically register every model found in the MLflow registry into the in-memory SimpleCatalog.
Isn't a scan then guaranteed each time a SparkSession starts? And in that case, say we have a bunch of long-running Spark sessions (e.g., a ThriftServer for interactive queries): does that mean every time a user registers a model, all Spark clusters need to restart to use the new model? Or does someone need to manually apply CREATE MODEL on every cluster?
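One hedged way to split the difference between "scan on every SHOW MODELS" and "scan only at startup" is a time-bounded cache, so a long-running session re-scans the registry at most once per TTL and picks up new models without a restart. This is purely illustrative, not the Rikai implementation:

```python
# Sketch of a TTL-cached catalog: long-running sessions see newly registered
# models after at most `ttl_seconds`, without restarting or re-running
# CREATE MODEL on every cluster.
import time

class CachedCatalog:
    def __init__(self, fetch_models, ttl_seconds=60.0):
        self._fetch = fetch_models      # callable that scans the registry
        self._ttl = ttl_seconds
        self._cache = None
        self._stamp = 0.0

    def show_models(self):
        expired = time.monotonic() - self._stamp >= self._ttl
        if self._cache is None or expired:
            self._cache = sorted(self._fetch())  # refresh: one registry scan
            self._stamp = time.monotonic()
        return self._cache

registry = ["yolov5s"]
catalog = CachedCatalog(lambda: registry, ttl_seconds=0.0)  # ttl=0: always fresh
print(catalog.show_models())
registry.append("resnet50")      # a user registers a new model...
print(catalog.show_models())     # ...visible without restarting the session
```

With a non-zero TTL this bounds registry load per session while avoiding the restart problem raised above.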