sqlflow
sqlflow copied to clipboard
[Discussion] Remove some global variables in attribute.go
This issue further addresses: https://github.com/sql-machine-learning/playground/issues/42#issuecomment-645666536
Where are the global variables
-
Exported global map which stores the parameter docs of models, including https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/attribute.go#L219-L226 https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/model_parameters.go#L18-L20 https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/model_parameters.go#L227-L229 https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/xgboost_objective_params.go#L18-L20
-
Exported parameter type definitions, including https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/attribute.go#L32-L45
How to remove the exported global map which stores the parameter docs of models
Some of these global variables share the same information. For example:
-
PremadeModelParamsDocsis deserialized fromModelParameterJSON. So we can removeModelParameterJSON. Furthermore,ModelParameterJSONis a variable which can be auto generated bypython extract_docstring.py > model_parameters.gocommand. We can do this auto generation usinggo generateto generatePremadeModelParamsDocsautomatically too. -
OptimizerParameterDocsis deserialized fromOptimizerParameterJSON. So we can removeOptimizerParameterJSON. Furthermore,OptimizerParameterJSONcan also be auto generated bypython extract_docstring.py > model_parameters.gousinggo generate. -
XGBoostObjectiveDocsis deserialized fromXGBoostObjectiveJSON. So we can removeXGBoostObjectiveJSONand keepXGBoostObjectiveDocs.
Note that PremadeModelParamsDocs, OptimizerParameterDocs and XGBoostObjectiveDocs are also used in cli prompt suggestion (see 1, 2, 3). So we cannot remove these 3 variables or hide them.
How to remove exported parameter type definitions
https://github.com/sql-machine-learning/playground/issues/42#issue-640787739 suggests to enhance compile time data checking using the following ways.
var distributedTrainingAttributes = attribute.Dictionary{}.
Int("train.num_ps", 0, "", nil).
Int("train.num_workers", 1, "", nil).
Int("train.worker_cpu", 400, "", nil)
In this way, attribute.Description, attribute.Int, attribute.Float, etc can be hidden. The signature of Dictionary.Int would be:
func (d Dictionary) Int(name string, value int, doc string, checker func(int) error)
The only concern of this method is that we cannot support nil default value. Some of the models may have attributes with nil default values. For example, the default value of num_class in XGBoost model is nil(see here), because only multi-class (>2) classification models need num_class while the other models do not need num_class. And once num_class is provided in SQL WITH statement, it must be an integer number, so the data type of num_class attribute should be int. The meaning of nil default value in SQLFlow is:
- If the attribute is provided in
WITHstatement, SQLFlow would check whether its type is right and callDescription.Checker()to check whether it is valid. For example, ifnum_classis provided in SQLWITHstatement, SQLFlow would first check whether the value ofnum_classis an integer, and callDescription.Checker()to check whether it is a positive number. - If the attribute is not provided in
WITHstatement, nothing would be checked.
We can enhance this method to support nil default value.
- Add
Defaultmethod to set default value. The code is something like: https://github.com/sql-machine-learning/playground/issues/42#issuecomment-645853937 .
NewDictionary().Int("num_class", "Number of classes").Default(5)
Since the Default method can be used both after Int(...) and Float(...), the input parameter type of Default must be interface{}. Therefore, the signature of Default method should be
func (d Dictionary) Default(interface{}) Dictionary
In this way, we cannot check whether the default value is of the right type in compile time.
- Add both
Dictionary.IntandDictionary.IntOrNilmethod. The signature ofDictionary.IntOrNilmethod is like:
func (d Dictionary) IntOrNil(name string, doc string, checker func(int) error).
In this way, we would double the APIs to Dictionary.
- Use optional package to make
Dictionary.Intmethod accepts both int default value and nil default value. The signature ofDictionary.Intwould be
func (d Dictionary) Int(name string, defaultValue optional.Int,
doc string, checker func(int) error) {
...
}
var dict = Dictionary{}.
Int("num_class", optional.Int{}, "doc1", checker1). // default value is nil
Int("attr_with_default_value", optional.NewInt(0), "doc2", checker2) // default value is 0
@sneaxiy This issue mixed up two topics -- (1) removing global variables and (2) redesign the API. Let us keep (2) at https://github.com/sql-machine-learning/playground/issues/42 and make this issue focusing on (1).
For (1), I don't see a conclusion -- can we move some of the global variables or not? and why?
I guess the lack of a conclusion is (partly) due to an incomplete overview of the source code. For example, ModelParameterJSON contains attribute names and explanations. attribute.Dictionary also contains such information. Why do we need redundant form of the same piece of information in our code? And, what is the relationship between extract_docstring.py, the Go global variables, and cmd/docgen.go?
@wangkuiyi The conclusion is that we should:
- keep
PremadeModelParamsDocs,OptimizerParameterDocsandXGBoostObjectiveDocs. - remove
ModelParameterJSON,OptimizerParameterJSONandXGBoostObjectiveJSON. - (optional) after we have removed
*JSONobjects, we can generatePremadeModelParamsDocsandOptimizerParameterDocsautomatically instead of hard coding in*.gofiles.
The reasons are as follows:
-
Why should we keep
PremadeModelParamsDocs,OptimizerParameterDocsandXGBoostObjectiveDocs? Theseattribute.Dictionaryobjects are not only used inattributepackage but also cli prompt suggestion (see 1, 2, 3). So we cannot remove theseattribute.Dictionaryobjects. -
Why should we remove
ModelParameterJSON,OptimizerParameterJSONandXGBoostObjectiveJSON? As you have mentioned, these JSON strings would be only used to be deserialized to beattribute.Dictionary(PremadeModelParamsDocs,OptimizerParameterDocsandXGBoostObjectiveDocs) insideattribute package.xxxDocs>=xxxJSON, becausexxxJSONonly contains docs of TensorFlow and XGBoost models, butxxxDocsmay contain models from TensorFlow, XGBoost andsqlflow_models. We do not need redundant information. -
[Optional] Why should we generate
PremadeModelParamsDocsandOptimizerParameterDocsautomatically? As discussed above,PremadeModelParamsDocsandOptimizerParameterDocsare deserialized fromModelParameterJSONandOptimizerParameterJSON.ModelParameterJSONandOptimizerParameterJSONare the doc strings of TensorFlow and XGBoost model APIs, which can be extracted byextract_docstring.py.extract_docstring.pywould scan the__doc__of constructor (or callable model method) of each models from Python source code automatically, and print them into a file (we have printed them intomodel_parameters.gobeforehand). Please see the snapshot ofmodel_parameters.go. That is to say, the hard codingModelParameterJSONandOptimizerParameterJSON(or say, the hard codingPremadeModelParamsDocsandOptimizerParameterDocsif we have removedModelParameterJSONandOptimizerParameterJSON) are also redundant codes because they can be generated automatically byextract_docstring.py.
https://github.com/sql-machine-learning/sqlflow/blob/56868d3cfe274519be8b045e76966f943b031dd0/pkg/attribute/model_parameters.go#L14-L20