sqlflow
sqlflow copied to clipboard
[Discussion] Remove some global variables in attribute.go
This issue further addresses: https://github.com/sql-machine-learning/playground/issues/42#issuecomment-645666536
Where are the global variables
-
Exported global map which stores the parameter docs of models, including https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/attribute.go#L219-L226 https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/model_parameters.go#L18-L20 https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/model_parameters.go#L227-L229 https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/xgboost_objective_params.go#L18-L20
-
Exported parameter type definitions, including https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/attribute.go#L32-L45
How to remove the exported global map which stores the parameter docs of models
Some of these global variables share the same information. For example:
-
PremadeModelParamsDocs
is deserialized fromModelParameterJSON
. So we can removeModelParameterJSON
. Furthermore,ModelParameterJSON
is a variable which can be auto generated bypython extract_docstring.py > model_parameters.go
command. We can do this auto generation usinggo generate
to generatePremadeModelParamsDocs
automatically too. -
OptimizerParameterDocs
is deserialized fromOptimizerParameterJSON
. So we can removeOptimizerParameterJSON
. Furthermore,OptimizerParameterJSON
can also be auto generated bypython extract_docstring.py > model_parameters.go
usinggo generate
. -
XGBoostObjectiveDocs
is deserialized fromXGBoostObjectiveJSON
. So we can removeXGBoostObjectiveJSON
and keepXGBoostObjectiveDocs
.
Note that PremadeModelParamsDocs
, OptimizerParameterDocs
and XGBoostObjectiveDocs
are also used in cli prompt suggestion (see 1, 2, 3). So we cannot remove these 3 variables or hide them.
How to remove exported parameter type definitions
https://github.com/sql-machine-learning/playground/issues/42#issue-640787739 suggests to enhance compile time data checking using the following ways.
var distributedTrainingAttributes = attribute.Dictionary{}.
Int("train.num_ps", 0, "", nil).
Int("train.num_workers", 1, "", nil).
Int("train.worker_cpu", 400, "", nil)
In this way, attribute.Description
, attribute.Int
, attribute.Float
, etc can be hidden. The signature of Dictionary.Int
would be:
func (d Dictionary) Int(name string, value int, doc string, checker func(int) error)
The only concern of this method is that we cannot support nil
default value. Some of the models may have attributes with nil
default values. For example, the default value of num_class
in XGBoost model is nil
(see here), because only multi-class (>2) classification models need num_class
while the other models do not need num_class
. And once num_class
is provided in SQL WITH
statement, it must be an integer number, so the data type of num_class
attribute should be int
. The meaning of nil
default value in SQLFlow is:
- If the attribute is provided in
WITH
statement, SQLFlow would check whether its type is right and callDescription.Checker()
to check whether it is valid. For example, ifnum_class
is provided in SQLWITH
statement, SQLFlow would first check whether the value ofnum_class
is an integer, and callDescription.Checker()
to check whether it is a positive number. - If the attribute is not provided in
WITH
statement, nothing would be checked.
We can enhance this method to support nil
default value.
- Add
Default
method to set default value. The code is something like: https://github.com/sql-machine-learning/playground/issues/42#issuecomment-645853937 .
NewDictionary().Int("num_class", "Number of classes").Default(5)
Since the Default
method can be used both after Int(...)
and Float(...)
, the input parameter type of Default
must be interface{}
. Therefore, the signature of Default
method should be
func (d Dictionary) Default(interface{}) Dictionary
In this way, we cannot check whether the default value is of the right type in compile time.
- Add both
Dictionary.Int
andDictionary.IntOrNil
method. The signature ofDictionary.IntOrNil
method is like:
func (d Dictionary) IntOrNil(name string, doc string, checker func(int) error).
In this way, we would double the APIs to Dictionary
.
- Use optional package to make
Dictionary.Int
method accepts both int default value and nil default value. The signature ofDictionary.Int
would be
func (d Dictionary) Int(name string, defaultValue optional.Int,
doc string, checker func(int) error) {
...
}
var dict = Dictionary{}.
Int("num_class", optional.Int{}, "doc1", checker1). // default value is nil
Int("attr_with_default_value", optional.NewInt(0), "doc2", checker2) // default value is 0
@sneaxiy This issue mixed up two topics -- (1) removing global variables and (2) redesign the API. Let us keep (2) at https://github.com/sql-machine-learning/playground/issues/42 and make this issue focusing on (1).
For (1), I don't see a conclusion -- can we move some of the global variables or not? and why?
I guess the lack of a conclusion is (partly) due to an incomplete overview of the source code. For example, ModelParameterJSON
contains attribute names and explanations. attribute.Dictionary
also contains such information. Why do we need redundant form of the same piece of information in our code? And, what is the relationship between extract_docstring.py
, the Go global variables, and cmd/docgen.go
?
@wangkuiyi The conclusion is that we should:
- keep
PremadeModelParamsDocs
,OptimizerParameterDocs
andXGBoostObjectiveDocs
. - remove
ModelParameterJSON
,OptimizerParameterJSON
andXGBoostObjectiveJSON
. - (optional) after we have removed
*JSON
objects, we can generatePremadeModelParamsDocs
andOptimizerParameterDocs
automatically instead of hard coding in*.go
files.
The reasons are as follows:
-
Why should we keep
PremadeModelParamsDocs
,OptimizerParameterDocs
andXGBoostObjectiveDocs
? Theseattribute.Dictionary
objects are not only used inattribute
package but also cli prompt suggestion (see 1, 2, 3). So we cannot remove theseattribute.Dictionary
objects. -
Why should we remove
ModelParameterJSON
,OptimizerParameterJSON
andXGBoostObjectiveJSON
? As you have mentioned, these JSON strings would be only used to be deserialized to beattribute.Dictionary
(PremadeModelParamsDocs
,OptimizerParameterDocs
andXGBoostObjectiveDocs
) insideattribute package
.xxxDocs
>=xxxJSON
, becausexxxJSON
only contains docs of TensorFlow and XGBoost models, butxxxDocs
may contain models from TensorFlow, XGBoost andsqlflow_models
. We do not need redundant information. -
[Optional] Why should we generate
PremadeModelParamsDocs
andOptimizerParameterDocs
automatically? As discussed above,PremadeModelParamsDocs
andOptimizerParameterDocs
are deserialized fromModelParameterJSON
andOptimizerParameterJSON
.ModelParameterJSON
andOptimizerParameterJSON
are the doc strings of TensorFlow and XGBoost model APIs, which can be extracted byextract_docstring.py
.extract_docstring.py
would scan the__doc__
of constructor (or callable model method) of each models from Python source code automatically, and print them into a file (we have printed them intomodel_parameters.go
beforehand). Please see the snapshot ofmodel_parameters.go
. That is to say, the hard codingModelParameterJSON
andOptimizerParameterJSON
(or say, the hard codingPremadeModelParamsDocs
andOptimizerParameterDocs
if we have removedModelParameterJSON
andOptimizerParameterJSON
) are also redundant codes because they can be generated automatically byextract_docstring.py
.
https://github.com/sql-machine-learning/sqlflow/blob/56868d3cfe274519be8b045e76966f943b031dd0/pkg/attribute/model_parameters.go#L14-L20