dask-sql
Improve ML implementation
The ML implementation is still a bit experimental - we can improve on this:
- [x] `SHOW MODELS` and `DESCRIBE MODEL`
- [x] Hyperparameter optimizations, AutoML-like behaviour
- [x] @romainr brought up the idea of exporting models (#191, still missing: onnx - see discussion in the PR by @rajagurunath)
- [ ] and some more showcases and examples
Hey @nils-braun, can you give some more explanation regarding what to do?
Definitely! Which of the points are you most interested in?
Show and describe models
Ok! The method `SHOW MODELS` should give back a list of all registered models. It can be quite similar to the `SHOW TABLES` function and just give the names. The method `DESCRIBE MODEL <NAME>` should list some parameters of the model. It might be ok to just give back the string representation of the Python model (which typically already has the parameters listed) or to do something more advanced; both would be fine.
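Once implemented, usage from Python could look roughly like this (just a sketch of the intended behaviour; `my_model` is a placeholder name and the exact output columns are up to the implementation):

```python
from dask_sql import Context

c = Context()
# ... register some tables and train a model first, e.g. via CREATE MODEL ...

# similar to SHOW TABLES: a one-column dataframe with all registered model names
c.sql("SHOW MODELS").compute()

# the parameters of a single model, e.g. based on its string representation
c.sql("DESCRIBE MODEL my_model").compute()
```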
In terms of implementation, three things need to be done:
- Add it to the parser files (in planner/codegen/models) similar to the other SHOW functions
- Add a new Java class to handle the case, again similar to the other show and describe classes (in planner/java)
- And then make a conversion plugin for those new classes
I guess one can copy a lot from the other show functionality. Feel free to open a partial PR if you need more help!
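For illustration, the Python-side conversion plugin could look roughly like this (a minimal sketch modelled after the existing SHOW plugins; the exact `class_name` string and the way `context.models` stores the registered models are assumptions, so compare with e.g. the SHOW TABLES plugin):

```python
import dask.dataframe as dd
import pandas as pd

from dask_sql.datacontainer import ColumnContainer, DataContainer
from dask_sql.physical.rel.base import BaseRelPlugin


class ShowModelsPlugin(BaseRelPlugin):
    """Turn a parsed SHOW MODELS statement into a one-column dask dataframe."""

    # assumed name of the new Java parser class (see the other SHOW plugins)
    class_name = "com.dask.sql.parser.SqlShowModels"

    def convert(self, sql, context):
        # context.models is assumed to hold the registered models keyed by name
        df = pd.DataFrame({"Models": list(context.models.keys())})
        cc = ColumnContainer(df.columns)
        return DataContainer(dd.from_pandas(df, npartitions=1), cc)
```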
I am very happy about your support.
Thanks for the explanation! It could take some time for me to go through the code and implement this. I'll try my best to do this. Thanks!
Is this okay for SHOW MODELS? Also, I'm having some difficulty understanding how the code parsing works, i.e. how each component of the query is parsed and how the corresponding action is taken.
Sorry if my question is silly; I really want to learn and contribute to dask-sql.
Thanks for asking! Two comments on your code:
- You probably want to do this anyway, but you need to rename the class to something like `SqlShowModels`.
- You do not need the `schemaName` (actually it would be good to not have it at all). So it would be easier if you derive from `SqlCall` (have a look into `SqlShowSchemas`). You might need to add:
```java
public SqlOperator getOperator() {
    throw new UnsupportedOperationException();
}

public List<SqlNode> getOperandList() {
    ArrayList<SqlNode> operandList = new ArrayList<SqlNode>();
    return operandList;
}
```
Could you try to re-formulate your question on the parsing part? I think I did not quite understand your question... I am happy to help!
By the way, your question is really not silly - the documentation on this is really sparse right now (help is welcome!)
Yeah, thanks, I just forgot to change the class name. Can I ask why we use the SqlOperator here, whereas in the case of SHOW TABLES we just use the schema class?
The other thing about the query is: what do leftPrec and rightPrec do in each case? Only if I understand that will I be able to do the describe part.
For example, for a query like `CREATE TABLE Persons (PersonID int, LastName varchar(255), FirstName varchar(255), Address varchar(255), City varchar(255));`, what happens internally and which part of the query goes to which part of the create table schema?
Any help would be really appreciated; I have some difficulty understanding this part.
Just as a side note, I have pushed some more documentation in https://github.com/nils-braun/dask-sql/pull/132 - maybe that also helps a bit. Maybe you want to have a look there before getting back to my answers (once it is merged, this will also be shown on the main documentation page).
Can I know why we use SQL operator here rather in the case of show tables we just use schemas
In general, during parsing, Apache Calcite will generate `SqlNode` objects out of the different parts of the SQL statement (similar to a syntax tree). The Apache Calcite class `SqlNode` is quite "basic", so you basically always want to use one of the provided derived classes for a bit more functionality. Which class to choose depends on the SQL statement (part) you would like to represent with it. `SqlDescribeSchema` is a class (you can have a look into the Apache Calcite source code) which has a member variable to store a schema identifier. In your SQL statement `SHOW MODELS` there is no schema identifier, so it would not make sense to choose a base class that wants to store one. Therefore, I proposed to use the more basic `SqlCall`, which does not store any additional variables (in principle, `SqlDescribeSchema` is a `SqlCall` with the additional `schema` member and some predefined methods). Please note that these things are all defined in the Apache Calcite library (in this case `org.apache.calcite.sql`), so you will need to have a look there. As a start, it might be enough to look at how dask-sql uses those base classes.
What happens internally and which part of the query goes to which part of the create table schema
Please note that the example you gave here is not a valid SQL statement for dask-sql, so nothing of this is parsed (and it is hard to explain it using this example). Please check out the SQL reference for the statements dask-sql understands.
But let me give you another example that is also easier for your use case: `SHOW SCHEMAS`.
Please also note that the next descriptions are javacc and ftl descriptions and are not by themselves dask-sql specific (so you might find more information outside of dask-sql).
First, Apache Calcite will try to parse the statement using any of its registered `statementParserMethods` (see `config.fmpp` for the custom parsers in dask-sql, in addition to Apache Calcite's own methods). It will find out that the parser method `SqlShowSchemas()` is the only one that fits (attention: that is not the Java class `SqlShowSchemas`, but the parser function defined in `show.ftl`). I have copied it here for reference:
```
SqlNode SqlShowSchemas() :
{
    final Span s;
    SqlIdentifier catalog = null;
    SqlNode like = null;
}
{
    <SHOW> { s = span(); } <SCHEMAS>
    (
        <FROM> catalog = SimpleIdentifier()
    )?
    (
        <LIKE> like = Literal()
    )?
    {
        return new SqlShowSchemas(s.end(this), catalog, like);
    }
}
```
I do not know if you understand javacc well, but it basically describes some SQL language feature (in this case SHOW SCHEMAS with an optional FROM and an optional LIKE). It will return an instance of the Java class `SqlShowSchemas` with the parser position for reference (`s.end(this)`), the catalog and the like. You can compare with the constructor of the Java class `SqlShowSchemas` (which is now defined in dask-sql).
The `SqlShowSchemas` Java class needs some boilerplate code, including a `getOperator` (always not implemented, you can ignore it) and a `getOperandList` (always empty, you can ignore it), as well as an `unparse` function. This function is used to re-produce the SQL string from the extracted information (needed e.g. for error messages). For reference, I include the one from `SqlShowSchemas`:
```java
public void unparse(SqlWriter writer, int leftPrec, int rightPrec) {
    writer.keyword("SHOW");
    writer.keyword("SCHEMAS");
    if (this.catalog != null) {
        writer.keyword("FROM");
        this.catalog.unparse(writer, leftPrec, rightPrec);
    }
    if (this.like != null) {
        writer.keyword("LIKE");
        this.like.unparse(writer, leftPrec, rightPrec);
    }
}
```
We want to get back "SHOW SCHEMAS FROM catalog LIKE something" and this function does exactly that. The `leftPrec` and `rightPrec` are internals of Apache Calcite, which are used to describe the precedence of the statements right and left of this SQL operation (so parentheses might be needed). In 99% of the cases, and especially for your example, you can ignore them (there can not be any operation right or left of your `SHOW MODELS`; the precedence is only interesting for operations such as "*" or "-").
Hope that helped!
Thank you so much @nils-braun! This is a perfect explanation, exactly what I wanted to know. I'll try to work on the solution and let you know as soon as I can. Again, thanks for taking the time to explain this. This is very helpful :)
Hi @nils-braun and @ckmganesh, thanks for the detailed discussion, it really helped me understand the functionality.
I have tried SHOW MODELS and DESCRIBE MODEL and was able to achieve the functionality, but I'm not sure if everything is in the correct place and adheres to best practice. Can you please review the forked repository / commit (shared below) and guide me on how to open a PR? (Since this is my very first commit to open source, help/guidance will be much appreciated.)
Forked Repo: https://github.com/rajagurunath/dask-sql/commit/4a9ee68cc121a61406f909d5dd9ca6c18f8a827b
Thanks.
Hi @rajagurunath. First of all let me thank you for taking the challenge and doing your first code contribution to an open-source project. Welcome, I hope many more will follow!
I had a look into the code. It is really well done and everything is at its right place. I might have some detailed comments here and there, but they are easier to give once you have created your first PR. Which brings me to the next point:
If you want to create a pull request, all you need to do is to go to the GitHub page of the Dask-SQL repository, open the PR tab and create a new one. GitHub will let you choose your forked repository and your branch. You should fill in a title and some description and that's it! The rest will be taken care of by the maintainers.
Happy to have you on board!
Thanks a lot @nils-braun, I have created a pull request here: https://github.com/nils-braun/dask-sql/pull/185
I am very much interested to understand the framework in more detail and contribute some meaningful work.
And I must say, I enjoyed my weekend working on this functionality. The developer workflow was straightforward: a developer just needs to add their logic, and the rest of the build and testing is already in place; switching between the Java build and Python is also seamless and works fine.
Thanks a lot for such amazing work.
Hi @nils-braun @romainr
Currently I am working on the EXPORT MODEL functionality and have the following doubts (since there are a lot of possibilities for implementing this functionality):
- Initial SQL syntax: `EXPORT MODEL <model_name> INTO filename.pkl`
- Current SQL syntax: `EXPORT MODEL <model_name> WITH (model_serde = pickle|joblib|mlflow|onnx, location = "filename.pkl")`

Let me know if this SQL syntax is good to proceed with.
- What are all the serializations (model SerDe) that need to be supported? Currently pickle and joblib are implemented.
- Can we integrate with mlflow for serialization? It has lots of flavours to save and load ML models across frameworks and has the capability to replicate the same (conda) environment while loading the model. I have partially implemented some mlflow functionality and need some input here, because it may break with an import error; any suggestions to avoid the import error or to improve this functionality would be welcome. For more info about mlflow models: https://mlflow.org/docs/latest/models.html
- Is ONNX format serialization needed? It currently needs the model along with some example data for conversion; this is not implemented yet.
- Do we need a `LOAD MODEL` command as well, which loads an already saved model into `context.models`?
I have worked on some parts of this functionality in this branch: https://github.com/rajagurunath/dask-sql/tree/feature/export-model. Please have a look and let me know whether this is the right way to proceed for this functionality, and let me know your thoughts on it.
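For reference, the core of the pickle/joblib part of such an EXPORT MODEL implementation could boil down to something like this (a minimal sketch, not the exact code in the branch; `model`, `location` and `model_serde` would come from the parsed statement):

```python
import pickle

import joblib


def export_model(model, location, model_serde="pickle"):
    """Serialize a trained model to disk (sketch of the EXPORT MODEL backends)."""
    if model_serde == "pickle":
        with open(location, "wb") as f:
            pickle.dump(model, f)
    elif model_serde == "joblib":
        joblib.dump(model, location)
    else:
        raise NotImplementedError(f"model_serde '{model_serde}' is not supported yet")
```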
Thanks
Hi @rajagurunath sorry for the late response. Thank you very much for your hard work and your thoughts on this!
You are raising very valid questions here. I think the best thing would be to start with a small subset of capabilities. Serialisation into pickle and joblib is already very good. mlflow is a very nice addition, but if we can not make it work, I would propose to move it to a second PR.
I like your changes already quite a lot. Would you be willing to do a PR (maybe, if the mlflow part is not working, without it at first)? Could you also include a few short sentences about the new functionality in the `ml.rst` documentation?
PS: I did not test your new export class so far (will do once you open a PR), but could you maybe describe where exactly you would need some input for the mlflow part? In principle, your code looks reasonable.
Hi @nils-braun ,
Thanks for your guidance/inputs. Sorry, I was a little bit occupied last week.
As you suggested, we can divide it into two PRs:
- PR 1: introduces EXPORT MODEL using pickle, joblib and mlflow.
- PR 2: implements loading a saved model, improves the mlflow model serialization and (maybe) implements the ONNX format.
And regarding the doubt: exporting a model into the mlflow format is actually working, but where I felt a little less confident is that the user may get an ImportError if the conda environment (the one dask-sql uses) does not have the mlflow, lightgbm or catboost dependencies. But I think in the future we can add more frameworks.
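One common pattern to avoid a hard ImportError at module import time is to import mlflow lazily and only fail (with a clear message) when the user actually asks for that format. A sketch under that assumption, using the sklearn flavour purely as an example:

```python
def export_with_mlflow(model, location):
    """Export a model via mlflow, failing gracefully if mlflow is not installed."""
    try:
        import mlflow.sklearn
    except ImportError:
        raise ImportError(
            "model_serde = 'mlflow' requires the mlflow package (and the "
            "respective framework, e.g. lightgbm) to be installed"
        )

    # save_model writes an MLmodel directory including the conda environment
    mlflow.sklearn.save_model(model, location)
```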
I have added comments to `machine learning.rst`; please let me know if anything needs to be changed in the added sentences.
I tried to make a PR, but it says it is not able to merge. Is this okay? Please let me know if something needs to be changed from my side: https://github.com/nils-braun/dask-sql/compare/main...rajagurunath:main
Thanks
Hi @rajagurunath - the message does not prevent you from opening a PR, but it will prevent you (or me) from merging it. To get rid of it, you need to merge the current main branch from the main repository into your branch, fix any occurring merge conflicts and then push again.
Sure, thanks for the input @nils-braun.
I have merged the main branch and resolved the conflicts, and raised the PR here: https://github.com/nils-braun/dask-sql/pull/191. Please review and let me know if any changes are needed.
Thanks
@rajagurunath - looking at the new tests for mlflow and model export, I am wondering if we should maybe use the `xgboost.dask.DaskXGBClassifier` instead of the normal one (as mentioned here) in the test. What do you think?
And it would be super cool to have the `EXPORT MODEL` mentioned in the jupyter notebook also. If you like, you can add it to the Feature Overview notebook (and we can port it to [here](https://github.com/nils-braun/dask-sql-binder) later).
Hi @nils-braun,
Seems like a good idea. I tried using `xgboost.dask.DaskXGBClassifier` following the documentation mentioned above, but the problem is that we need to specify/provide a `dask.distributed.Client` for the `DaskXGBClassifier` model to distribute the work across the Dask workers.
I attempted something in `create_model.py` with the following code:
if "dask_scheduler_address" in kwargs:
ipaddress = kwargs.pop("dask_scheduler_address")
from dask.distributed import Client
dask_client = Client(ipaddress)
model.client = dask_client
model.fit(X, y, **fit_kwargs)
This creates the client based on the user-provided `dask_scheduler_address`, which I am getting from the CREATE MODEL statement.
But I am facing the error `socket.gaierror: [Errno 8] nodename nor servname provided, or not known` while running with pytest, and I have not been able to resolve it so far (the error is discussed on the dask issue board here: https://github.com/dask/distributed/issues/3916). Any idea how to proceed with this? And does this implementation look decent, or do you have any other suggestions?
Hi @nils-braun
I have tried implementing the hyperparameter tuning and AutoML-like behaviour functionality here: https://github.com/nils-braun/dask-sql/compare/main...rajagurunath:feature/automl. Please review and let me know if these changes are good enough to raise a PR; I will then change the code if necessary, update the docs and raise the PR.
Thanks in Advance
Your changes are getting better and better with each PR you are doing, @rajagurunath - congratulations! I am not an expert in these automl packages, but this already looks really good. You can open a PR with those changes; I think I will not have many comments this time (as it already looks quite good). Good work!
Any idea how to proceed with this?
Sorry, I missed sending an answer to this. You can create a dask client by yourself (as you have pointed out in your comment), or you can rely on dask's "auto-client" feature. If there is a client set up before, it will automatically pick it up. So basically all you need to do in your tests is to use the "client" pytest fixture (which is from the dask.distributed package, which is already imported). This will set up a valid dask client for you and the XGBoost implementation will pick it up automatically.
I think we should not create our own parameter for this and better use this "auto-client" feature, what do you think? (if you are wondering how this will work when users run dask-sql in SQL-only mode: the sql-server automatically sets up a dask client already, so this is not a problem).
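For illustration, a test relying on the auto-client feature could look roughly like this (a sketch; the `c` and `client` fixture names, the CREATE MODEL parameters and the `timeseries` table are assumptions based on the existing tests):

```python
def test_dask_xgboost_training(c, client):
    # `client` is the pytest fixture from dask.distributed's testing utilities;
    # it starts a scheduler and workers, and xgboost.dask picks up the active
    # client automatically, so no dask_scheduler_address parameter is needed.
    c.sql(
        """
        CREATE MODEL my_model WITH (
            model_class = 'xgboost.dask.DaskXGBClassifier',
            target_column = 'target'
        ) AS (
            SELECT x, y, x*y > 0 AS target FROM timeseries
        )
        """
    )

    # predict with the freshly trained model
    result = c.sql(
        "SELECT * FROM PREDICT (MODEL my_model, SELECT x, y FROM timeseries)"
    )
    result.compute()
```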
Thanks a lot @nils-braun, it's all because of your valuable feedback each time (you tuned my hyperparameters 😄).
Sorry, last week I was a little bit occupied. I have created a PR and detailed some dependency problems I faced along the way; please review and let me know your feedback.
Sorry, I missed sending an answer to this. You can create a dask client by yourself (as you have pointed out in your comment), or you can rely on dask's "auto-client" feature. If there is a client set up before, it will automatically pick it up. So basically all you need to do in your tests is to use the "client" pytest fixture (which is from the dask.distributed package, which is already imported). This will set up a valid dask client for you and the XGBoost implementation will pick it up automatically.
I think we should not create our own parameter for this and better use this "auto-client" feature, what do you think? (if you are wondering how this will work when users run dask-sql in SQL-only mode: the sql-server automatically sets up a dask client already, so this is not a problem).
Got it, the "auto-client" feature seems really interesting; it makes sense to use the auto-client feature itself instead of creating a new parameter 👍.
But there is a catch as of now: I tried changing the test case to make use of `xgboost.dask.DaskXGBClassifier`, but this class is part of the latest xgboost package (1.1.0) and we have a dependency conflict with dask_ml, as mentioned in https://github.com/nils-braun/dask-sql/pull/199. So I am currently not able to implement this; any thoughts on how to avoid this?
Thanks for trying that out. That might be something to report directly to the dask-ml repository. Honestly, I do not think we need dask-xgboost anymore - as we are now using xgboost directly (without dask-ml).
One question though: where did you find the dependency of dask-ml on dask-xgboost? I tried to find it in pip (via `pipdeptree -p dask-ml`) or conda (here), but could not find it. The only dependency that I see in the setup.py of dask-ml is an extra dependency on xgboost (which we do not necessarily need). Maybe I missed something?