Database error - on upload of a run from OpenML-Python
Description
An OpenMLServerException occurs in a particular unit test that executes a scikit-learn model on a task and tries to publish the result. The raised exception:
openml.exceptions.OpenMLServerException: Database error. Setup search query failed - None
A POST request is made that receives a 412 response, so an exception is raised at the check that requires a 200 status code.
Steps/Code to Reproduce
Since the unit test invokes a series of locally defined functions, the simplest way to run this single unit test is to clone the OpenML-Python repo and run:
pytest -k 'test_run_and_upload_decision_tree_pipeline'
The execution stack for the unit test appears to run without errors; the actual POST request made during publish can be found here. The data and file_elements packaged in the call are attached as pickle files for reference, and the URL used for the POST request is: https://test.openml.org/api/v1/xml/run/
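For reference, the same publish flow can be reproduced outside the test suite with a minimal script along these lines (a sketch, not the unit test itself: the task id, API key, and the simple pipeline are placeholders, while the actual test builds the much larger pipeline discussed further below):

import openml
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Point the client at the test server, where runs can be uploaded freely.
openml.config.server = "https://test.openml.org/api/v1/xml"
openml.config.apikey = "YOUR_TEST_SERVER_API_KEY"  # placeholder

task = openml.tasks.get_task(119)  # placeholder task id on the test server
model = Pipeline([("imputer", SimpleImputer()),
                  ("estimator", DecisionTreeClassifier())])

run = openml.runs.run_model_on_task(model, task)
run.publish()  # issues the POST to .../run/ that fails with the 412 here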
I did a clean install, but still get this error:
ModuleNotFoundError: No module named 'pytest_remotedata.plugin'
Missing dependency?
Did you install the test dependencies? This can be done with pip install -e '.[test]'
and I think the missing dependencies can also be installed with pip install pytest pytest-xdist pytest-timeout flaky.
Yes, I did pip install -e '.[test]'
I also just did pip install pytest pytest-xdist pytest-timeout flaky
But I'm still getting the error
I also tried 'pip install pytest-remotedata' and 'pip install -U pytest-remotedata', but still get the error. Weird.
I installed everything from scratch again; now I get:
ModuleNotFoundError: No module named 'pytest_openfiles.plugin'
Upgraded pytest-openfiles to version 0.5.0. I also had to pip install babel.
Running now.
The test passes. Weirdly, I'm not seeing the run on the test server. What am I missing?
$ pytest -k 'test_run_and_upload_decision_tree_pipeline'
/Users/joa/anaconda3/lib/python3.7/site-packages/dash/testing/plugin.py:17: UserWarning:
run `pip install dash[testing]` if you need dash.testing
============================================= test session starts =============================================
platform darwin -- Python 3.7.3, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /Users/joa/Code/openml-python, inifile: setup.cfg
plugins: Flask-Dance-3.0.0, timeout-1.4.1, forked-1.2.0, typeguard-2.7.1, cov-2.10.0, flaky-3.6.1, openfiles-0.5.0, xdist-1.32.0, dash-1.3.0
collected 310 items / 309 deselected / 1 selected
tests/test_runs/test_run_functions.py . [100%]
INFO:unit_tests:Deleted (run, 76815)
INFO:unit_tests:Deleting datas...
INFO:unit_tests:Deleting flows...
INFO:unit_tests:Deleted (flow, 201746)
INFO:unit_tests:Deleting tasks...
INFO:unit_tests:Deleting studys...
INFO:unit_tests:Collecting file lists from: /Users/joa/Code/openml-python/tests/../tests/files/
INFO:unit_tests:Local files deleted
INFO:unit_tests:master is killed
===================================== 1 passed, 309 deselected in 11.25s =====================================
Did you install the development branch or this branch? Also, the error appears to depend on the python version and/or scikit-learn version. The tests fail with scikit-learn 0.21.X (and we don't test 0.22.X yet).
What do you mean by 'install the development branch'? I cloned from GitHub, checked out develop, and installed via pip install -e '.[test]'
Currently installed version is openml-0.11.0.dev0
I had scikit-learn 0.23.X installed, but now reverted to 0.21.3. The test still passes.
I am more concerned about the fact that, even though the test passes, the run is not appearing on the test server, and I'm not seeing any errors. Am I testing using some kind of mocking setup?
Is this the right test to run?
pytest -k 'test_run_and_upload_decision_tree_pipeline'
The run itself should appear on the test server and be deleted afterwards to keep the test server clean.
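If it gets cleaned up before you can see it, one way to check for your uploads manually is to list your own runs on the test server (a sketch; the uploader id is a placeholder for your test-server account id):

import openml

openml.config.server = "https://test.openml.org/api/v1/xml"
# list_runs can filter by uploader; substitute your own test-server user id.
my_runs = openml.runs.list_runs(uploader=[1234])
print(my_runs)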
You need a branch from @amueller's fork of OpenML-Python which you can get with:
git checkout -b amueller-dataframe_run_on_task develop
git pull https://github.com/amueller/openml-python.git dataframe_run_on_task
Sorry for not having that in the original description.
Oh yeah, it does! Previously it was deleted too fast for me to see it :). What are the 'Test Test' runs I see appearing on the test server?
I tried pulling the PR, but get a lot of merge conflicts:
Successfully installed openml
$ git checkout -b amueller-dataframe_run_on_task develop
Switched to a new branch 'amueller-dataframe_run_on_task'
$ git pull https://github.com/amueller/openml-python.git dataframe_run_on_task
remote: Enumerating objects: 183, done.
remote: Counting objects: 100% (183/183), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 206 (delta 146), reused 179 (delta 142), pack-reused 23
Receiving objects: 100% (206/206), 85.67 KiB | 731.00 KiB/s, done.
Resolving deltas: 100% (147/147), completed with 43 local objects.
From https://github.com/amueller/openml-python
* branch dataframe_run_on_task -> FETCH_HEAD
Auto-merging tests/test_study/test_study_examples.py
CONFLICT (content): Merge conflict in tests/test_study/test_study_examples.py
Auto-merging tests/test_runs/test_run_functions.py
CONFLICT (content): Merge conflict in tests/test_runs/test_run_functions.py
Auto-merging tests/test_extensions/test_sklearn_extension/test_sklearn_extension.py
CONFLICT (content): Merge conflict in tests/test_extensions/test_sklearn_extension/test_sklearn_extension.py
Auto-merging openml/testing.py
CONFLICT (content): Merge conflict in openml/testing.py
Auto-merging openml/runs/functions.py
CONFLICT (content): Merge conflict in openml/runs/functions.py
Auto-merging openml/extensions/sklearn/extension.py
CONFLICT (content): Merge conflict in openml/extensions/sklearn/extension.py
Auto-merging openml/__init__.py
CONFLICT (content): Merge conflict in openml/__init__.py
Auto-merging examples/30_extended/datasets_tutorial.py
Automatic merge failed; fix conflicts and then commit the result.
Hm, I'm not sure how to merge this easily. Could you clone the repo again from github.com/amueller/openml-python, switch to the branch dataframe_run_on_task, and go on from there?
What are the 'Test Test' runs I see appearing on the test server?
I don't know off the top of my head, sorry.
I could try, but is that what you do in the unit tests? I was trying to reproduce the error you got in the unit tests.
The unit test only appears on that specific branch.
I did the following:
git clone https://www.github.com/amueller/openml-python
cd openml-python
git checkout develop
pip install -e '.[test]'
pytest -k 'test_run_and_upload_decision_tree_pipeline'
And now it can't find that test:
$ pytest -k 'test_run_and_upload_decision_tree_pipeline'
======================================= test session starts
platform darwin -- Python 3.7.3, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /Users/joa/Code/openml-python
plugins: Flask-Dance-3.0.0, timeout-1.4.1, typeguard-2.9.1, forked-1.2.0, mock-3.1.1, cov-2.10.0, flaky-3.6.1, openfiles-0.5.0, xdist-1.32.0
collected 72 items / 72 deselected
If I run all the tests in that repo, 20 of them fail.
I'm still not sure why you want me to run the tests in the other repo. Is this still relevant to the server error you're getting?
Yes, due to how that PR was created, it's located in Andy's repository and we're working on it there. You need to switch to the feature branch via git checkout dataframe_run_on_task
to have access to the test. Hopefully it fails then.
Yep, it fails:
Too many tables; MySQL can only use 61 tables in a join.
What kind of setup are you guys trying to store? This is the query. Does that make sense?
SELECT `sid`,`implementation_id`, `i1416459`.`value` AS `i1416459`, `i1416460`.`value` AS `i1416460`,
`i1416461`.`value` AS `i1416461`, `i1416462`.`value` AS `i1416462`, `i1416463`.`value` AS `i1416463`,
`i1416464`.`value` AS `i1416464`, `i1416465`.`value` AS `i1416465`, `i1416466`.`value` AS `i1416466`,
`i1416467`.`value` AS `i1416467`, `i1416468`.`value` AS `i1416468`, `i1416469`.`value` AS `i1416469`,
`i1416470`.`value` AS `i1416470`, `i1416471`.`value` AS `i1416471`, `i1416472`.`value` AS `i1416472`, `i1416473`.`value`
AS `i1416473`, `i1416474`.`value` AS `i1416474`, `i1416475`.`value` AS `i1416475`, `i1416476`.`value` AS `i1416476`,
`i1416477`.`value` AS `i1416477`, `i1416478`.`value` AS `i1416478`, `i1416479`.`value` AS `i1416479`,
`i1416480`.`value` AS `i1416480`, `i1416481`.`value` AS `i1416481`, `i1416482`.`value` AS `i1416482`,
`i1416483`.`value` AS `i1416483`, `i1416484`.`value` AS `i1416484`, `i1416485`.`value` AS `i1416485`,
`i1416486`.`value` AS `i1416486`, `i1416487`.`value` AS `i1416487`, `i1416488`.`value` AS `i1416488`,
`i1416489`.`value` AS `i1416489`, `i1416490`.`value` AS `i1416490`, `i1416491`.`value` AS `i1416491`,
`i1416492`.`value` AS `i1416492`, `i1416493`.`value` AS `i1416493`, `i1416494`.`value` AS `i1416494`,
`i1416495`.`value` AS `i1416495`, `i1416496`.`value` AS `i1416496`, `i1416497`.`value` AS `i1416497`,
`i1416498`.`value` AS `i1416498`, `i1416500`.`value` AS `i1416500`, `i1416501`.`value` AS `i1416501`,
`i1416502`.`value` AS `i1416502`, `i1416503`.`value` AS `i1416503`, `i1416504`.`value` AS `i1416504`,
`i1416505`.`value` AS `i1416505`, `i1416506`.`value` AS `i1416506`, `i1416507`.`value` AS `i1416507`,
`i1416508`.`value` AS `i1416508`, `i1416509`.`value` AS `i1416509`, `i1416510`.`value` AS `i1416510`, `i1416511`.`value`
AS `i1416511`, `i1416512`.`value` AS `i1416512`, `i1416513`.`value` AS `i1416513`, `i1416514`.`value` AS `i1416514`,
`i1416515`.`value` AS `i1416515`, `i1416516`.`value` AS `i1416516`, `i1416517`.`value` AS `i1416517`, `i1416518`.`value`
AS `i1416518`, `i1416519`.`value` AS `i1416519`, `i1416520`.`value` AS `i1416520`, `i1416521`.`value` AS `i1416521`,
`i1416522`.`value` AS `i1416522`FROM `algorithm_setup` AS `s` LEFT JOIN `input_setting` `i1416459` ON
`i1416459`.`setup` = `s`.`sid` AND `i1416459`.`input_id` = "1416459" LEFT JOIN `input_setting` `i1416460` ON
`i1416460`.`setup` = `s`.`sid` AND `i1416460`.`input_id` = "1416460" LEFT JOIN `input_setting` `i1416461` ON
`i1416461`.`setup` = `s`.`sid` AND `i1416461`.`input_id` = "1416461" LEFT JOIN `input_setting` `i1416462` ON
`i1416462`.`setup` = `s`.`sid` AND `i1416462`.`input_id` = "1416462" LEFT JOIN `input_setting` `i1416463` ON
`i1416463`.`setup` = `s`.`sid` AND `i1416463`.`input_id` = "1416463" LEFT JOIN `input_setting` `i1416464` ON
`i1416464`.`setup` = `s`.`sid` AND `i1416464`.`input_id` = "1416464" LEFT JOIN `input_setting` `i1416465` ON
`i1416465`.`setup` = `s`.`sid` AND `i1416465`.`input_id` = "1416465" LEFT JOIN `input_setting` `i1416466` ON
`i1416466`.`setup` = `s`.`sid` AND `i1416466`.`input_id` = "1416466" LEFT JOIN `input_setting` `i1416467` ON
`i1416467`.`setup` = `s`.`sid` AND `i1416467`.`input_id` = "1416467" LEFT JOIN `input_setting` `i1416468` ON
`i1416468`.`setup` = `s`.`sid` AND `i1416468`.`input_id` = "1416468" LEFT JOIN `input_setting` `i1416469` ON
`i1416469`.`setup` = `s`.`sid` AND `i1416469`.`input_id` = "1416469" LEFT JOIN `input_setting` `i1416470` ON
`i1416470`.`setup` = `s`.`sid` AND `i1416470`.`input_id` = "1416470" LEFT JOIN `input_setting` `i1416471` ON
`i1416471`.`setup` = `s`.`sid` AND `i1416471`.`input_id` = "1416471" LEFT JOIN `input_setting` `i1416472` ON
`i1416472`.`setup` = `s`.`sid` AND `i1416472`.`input_id` = "1416472" LEFT JOIN `input_setting` `i1416473` ON
`i1416473`.`setup` = `s`.`sid` AND `i1416473`.`input_id` = "1416473" LEFT JOIN `input_setting` `i1416474` ON
`i1416474`.`setup` = `s`.`sid` AND `i1416474`.`input_id` = "1416474" LEFT JOIN `input_setting` `i1416475` ON
`i1416475`.`setup` = `s`.`sid` AND `i1416475`.`input_id` = "1416475" LEFT JOIN `input_setting` `i1416476` ON
`i1416476`.`setup` = `s`.`sid` AND `i1416476`.`input_id` = "1416476" LEFT JOIN `input_setting` `i1416477` ON
`i1416477`.`setup` = `s`.`sid` AND `i1416477`.`input_id` = "1416477" LEFT JOIN `input_setting` `i1416478` ON
`i1416478`.`setup` = `s`.`sid` AND `i1416478`.`input_id` = "1416478" LEFT JOIN `input_setting` `i1416479` ON
`i1416479`.`setup` = `s`.`sid` AND `i1416479`.`input_id` = "1416479" LEFT JOIN `input_setting` `i1416480` ON
`i1416480`.`setup` = `s`.`sid` AND `i1416480`.`input_id` = "1416480" LEFT JOIN `input_setting` `i1416481` ON
`i1416481`.`setup` = `s`.`sid` AND `i1416481`.`input_id` = "1416481" LEFT JOIN `input_setting` `i1416482` ON
`i1416482`.`setup` = `s`.`sid` AND `i1416482`.`input_id` = "1416482" LEFT JOIN `input_setting` `i1416483` ON
`i1416483`.`setup` = `s`.`sid` AND `i1416483`.`input_id` = "1416483" LEFT JOIN `input_setting` `i1416484` ON
`i1416484`.`setup` = `s`.`sid` AND `i1416484`.`input_id` = "1416484" LEFT JOIN `input_setting` `i1416485` ON
`i1416485`.`setup` = `s`.`sid` AND `i1416485`.`input_id` = "1416485" LEFT JOIN `input_setting` `i1416486` ON
`i1416486`.`setup` = `s`.`sid` AND `i1416486`.`input_id` = "1416486" LEFT JOIN `input_setting` `i1416487` ON
`i1416487`.`setup` = `s`.`sid` AND `i1416487`.`input_id` = "1416487" LEFT JOIN `input_setting` `i1416488` ON
`i1416488`.`setup` = `s`.`sid` AND `i1416488`.`input_id` = "1416488" LEFT JOIN `input_setting` `i1416489` ON
`i1416489`.`setup` = `s`.`sid` AND `i1416489`.`input_id` = "1416489" LEFT JOIN `input_setting` `i1416490` ON
`i1416490`.`setup` = `s`.`sid` AND `i1416490`.`input_id` = "1416490" LEFT JOIN `input_setting` `i1416491` ON
`i1416491`.`setup` = `s`.`sid` AND `i1416491`.`input_id` = "1416491" LEFT JOIN `input_setting` `i1416492` ON
`i1416492`.`setup` = `s`.`sid` AND `i1416492`.`input_id` = "1416492" LEFT JOIN `input_setting` `i1416493` ON
`i1416493`.`setup` = `s`.`sid` AND `i1416493`.`input_id` = "1416493" LEFT JOIN `input_setting` `i1416494` ON
`i1416494`.`setup` = `s`.`sid` AND `i1416494`.`input_id` = "1416494" LEFT JOIN `input_setting` `i1416495` ON
`i1416495`.`setup` = `s`.`sid` AND `i1416495`.`input_id` = "1416495" LEFT JOIN `input_setting` `i1416496` ON
`i1416496`.`setup` = `s`.`sid` AND `i1416496`.`input_id` = "1416496" LEFT JOIN `input_setting` `i1416497` ON
`i1416497`.`setup` = `s`.`sid` AND `i1416497`.`input_id` = "1416497" LEFT JOIN `input_setting` `i1416498` ON
`i1416498`.`setup` = `s`.`sid` AND `i1416498`.`input_id` = "1416498" LEFT JOIN `input_setting` `i1416500` ON
`i1416500`.`setup` = `s`.`sid` AND `i1416500`.`input_id` = "1416500" LEFT JOIN `input_setting` `i1416501` ON
`i1416501`.`setup` = `s`.`sid` AND `i1416501`.`input_id` = "1416501" LEFT JOIN `input_setting` `i1416502` ON
`i1416502`.`setup` = `s`.`sid` AND `i1416502`.`input_id` = "1416502" LEFT JOIN `input_setting` `i1416503` ON
`i1416503`.`setup` = `s`.`sid` AND `i1416503`.`input_id` = "1416503" LEFT JOIN `input_setting` `i1416504` ON
`i1416504`.`setup` = `s`.`sid` AND `i1416504`.`input_id` = "1416504" LEFT JOIN `input_setting` `i1416505` ON
`i1416505`.`setup` = `s`.`sid` AND `i1416505`.`input_id` = "1416505" LEFT JOIN `input_setting` `i1416506` ON
`i1416506`.`setup` = `s`.`sid` AND `i1416506`.`input_id` = "1416506" LEFT JOIN `input_setting` `i1416507` ON
`i1416507`.`setup` = `s`.`sid` AND `i1416507`.`input_id` = "1416507" LEFT JOIN `input_setting` `i1416508` ON
`i1416508`.`setup` = `s`.`sid` AND `i1416508`.`input_id` = "1416508" LEFT JOIN `input_setting` `i1416509` ON
`i1416509`.`setup` = `s`.`sid` AND `i1416509`.`input_id` = "1416509" LEFT JOIN `input_setting` `i1416510` ON
`i1416510`.`setup` = `s`.`sid` AND `i1416510`.`input_id` = "1416510" LEFT JOIN `input_setting` `i1416511` ON
`i1416511`.`setup` = `s`.`sid` AND `i1416511`.`input_id` = "1416511" LEFT JOIN `input_setting` `i1416512` ON
`i1416512`.`setup` = `s`.`sid` AND `i1416512`.`input_id` = "1416512" LEFT JOIN `input_setting` `i1416513` ON
`i1416513`.`setup` = `s`.`sid` AND `i1416513`.`input_id` = "1416513" LEFT JOIN `input_setting` `i1416514` ON
`i1416514`.`setup` = `s`.`sid` AND `i1416514`.`input_id` = "1416514" LEFT JOIN `input_setting` `i1416515` ON
`i1416515`.`setup` = `s`.`sid` AND `i1416515`.`input_id` = "1416515" LEFT JOIN `input_setting` `i1416516` ON
`i1416516`.`setup` = `s`.`sid` AND `i1416516`.`input_id` = "1416516" LEFT JOIN `input_setting` `i1416517` ON
`i1416517`.`setup` = `s`.`sid` AND `i1416517`.`input_id` = "1416517" LEFT JOIN `input_setting` `i1416518` ON
`i1416518`.`setup` = `s`.`sid` AND `i1416518`.`input_id` = "1416518" LEFT JOIN `input_setting` `i1416519` ON
`i1416519`.`setup` = `s`.`sid` AND `i1416519`.`input_id` = "1416519" LEFT JOIN `input_setting` `i1416520` ON
`i1416520`.`setup` = `s`.`sid` AND `i1416520`.`input_id` = "1416520" LEFT JOIN `input_setting` `i1416521` ON
`i1416521`.`setup` = `s`.`sid` AND `i1416521`.`input_id` = "1416521" LEFT JOIN `input_setting` `i1416522` ON
`i1416522`.`setup` = `s`.`sid` AND `i1416522`.`input_id` = "1416522" WHERE `implementation_id` = "202633" AND
`i1416459`.`value` = 'null' AND `i1416460`.`value` = '[{\"oml-python:serialized_object\": \"component_reference\", \"value\":
{\"key\": \"Imputer\", \"step_name\": \"Imputer\"}}, {\"oml-python:serialized_object\": \"component_reference\", \"value\":
{\"key\": \"VarianceThreshold\", \"step_name\": \"VarianceThreshold\"}}, {\"oml-python:serialized_object\":
\"component_reference\", \"value\": {\"key\": \"Estimator\", \"step_name\": \"Estimator\"}}]' AND `i1416461`.`value` = 'false'
AND `i1416462`.`value` = 'null' AND `i1416463`.`value` = '\"drop\"' AND `i1416464`.`value` = '0.3' AND
`i1416465`.`value` = 'null' AND `i1416466`.`value` = '[{\"oml-python:serialized_object\": \"component_reference\", \"value\":
{\"key\": \"cat\", \"step_name\": \"cat\", \"argument_1\": {\"oml-python:serialized_object\": \"function\", \"value\":
\"openml.testing.cat\"}}}, {\"oml-python:serialized_object\": \"component_reference\", \"value\": {\"key\": \"cont\",
\"step_name\": \"cont\", \"argument_1\": {\"oml-python:serialized_object\": \"function\", \"value\": \"openml.testing.cont\"}}}]'
AND `i1416467`.`value` = 'false' AND `i1416468`.`value` = 'null' AND `i1416469`.`value` = '[{\"oml-
python:serialized_object\": \"component_reference\", \"value\": {\"key\": \"simpleimputer\", \"step_name\": \"simpleimputer\"}},
{\"oml-python:serialized_object\": \"component_reference\", \"value\": {\"key\": \"onehotencoder\", \"step_name\":
\"onehotencoder\"}}]' AND `i1416470`.`value` = 'false' AND `i1416471`.`value` = 'false' AND `i1416472`.`value` = 'true'
AND `i1416473`.`value` = 'null' AND `i1416474`.`value` = 'NaN' AND `i1416475`.`value` = '\"most_frequent\"' AND
`i1416476`.`value` = '0' AND `i1416477`.`value` = 'null' AND `i1416478`.`value` = 'null' AND `i1416479`.`value` = 'null'
AND `i1416480`.`value` = '{\"oml-python:serialized_object\": \"type\", \"value\": \"np.float64\"}' AND `i1416481`.`value` =
'\"ignore\"' AND `i1416482`.`value` = 'null' AND `i1416483`.`value` = 'true' AND `i1416484`.`value` = 'null' AND
`i1416485`.`value` = '[{\"oml-python:serialized_object\": \"component_reference\", \"value\": {\"key\": \"customimputer\",
\"step_name\": \"customimputer\"}}, {\"oml-python:serialized_object\": \"component_reference\", \"value\": {\"key\":
\"standardscaler\", \"step_name\": \"standardscaler\"}}]' AND `i1416486`.`value` = 'false' AND `i1416487`.`value` = 'false'
AND `i1416488`.`value` = 'true' AND `i1416489`.`value` = 'null' AND `i1416490`.`value` = 'NaN' AND `i1416491`.`value`
= '\"mean\"' AND `i1416492`.`value` = '0' AND `i1416493`.`value` = 'true' AND `i1416494`.`value` = 'true' AND
`i1416495`.`value` = 'true' AND `i1416496`.`value` = '0.0' AND `i1416497`.`value` = '3' AND `i1416498`.`value` =
'\"raise-deprecating\"' AND `i1416500`.`value` = '\"warn\"' AND `i1416501`.`value` = '10' AND `i1416502`.`value` = 'null'
AND `i1416503`.`value` = '{\"min_samples_leaf\": [1, 2, 4, 8, 16, 32, 64], \"min_samples_split\": [2, 4, 8, 16, 32, 64, 128]}'
AND `i1416504`.`value` = '\"2*n_jobs\"' AND `i1416505`.`value` = '33003' AND `i1416506`.`value` = 'true' AND
`i1416507`.`value` = 'false' AND `i1416508`.`value` = 'null' AND `i1416509`.`value` = '0' AND `i1416510`.`value` = 'null'
AND `i1416511`.`value` = '\"gini\"' AND `i1416512`.`value` = 'null' AND `i1416513`.`value` = 'null' AND `i1416514`.`value`
= 'null' AND `i1416515`.`value` = '0.0' AND `i1416516`.`value` = 'null' AND `i1416517`.`value` = '1' AND
`i1416518`.`value` = '2' AND `i1416519`.`value` = '0.0' AND `i1416520`.`value` = 'false' AND `i1416521`.`value` = '62501'
AND `i1416522`.`value` = '\"best\"' AND (SELECT COUNT(*) FROM `input_setting` WHERE `setup` = `s`.`sid`) = 63 LIMIT
0,1;
It seems this query breaks once a setup has more than 60 hyperparameter settings: the join uses one input_setting table per hyperparameter on top of the base algorithm_setup table, so this setup's 63 hyperparameters need 64 tables, well past MySQL's 61-table limit.
@janvanrijn what do you think? Shall we rewrite this query using IN subqueries? Then we don’t need joins.
Interesting. I was aware of this limitation, but I have not yet seen a way to solve this in a 'clean' way.
When rewriting the query, one has to be extremely aware of NULL values for hyperparameters. You don't want to match a setup that has specified all hyperparameters to a setup that has only specified a subset. All my previous designs to do so failed on this point.
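To make the pitfall concrete, here is a toy Python sketch (hypothetical values, not the server code): matching only on the specified hyperparameters would wrongly treat a setup with extra settings as equal, and a count guard is what rules that out.

# Each setup maps input_id -> value.
existing = {1: "a", 2: "b", 3: "c"}   # stored setup with 3 hyperparameters
candidate = {1: "a", 2: "b"}          # new setup specifying only a subset

# Matching only the specified values wrongly declares them equal:
subset_match = all(existing.get(k) == v for k, v in candidate.items())
assert subset_match  # True -- the unwanted match

# Adding a count guard (the SQL COUNT(*) = N clause below) fixes it:
exact_match = subset_match and len(existing) == len(candidate)
assert not exact_match  # correctly rejected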
Two questions:
- this limitation seems to only affect very complex flows. How come a simple flow like the one shown is affected by this?
- can't we change the MySQL settings to accommodate this (without causing an extreme load on the server)?
We can change the MySQL setting, but that just moves the problem until a larger pipeline comes along. It will also hurt performance...
Here is the pipeline and all hyperparameters:
Pipeline(Imputer=ColumnTransformer(cat=sklearn.pipeline.Pipeline,cont=sklearn.pipeline.Pipeline),VarianceThreshold=VarianceThreshold,Estimator=RandomizedSearchCV(estimator=DecisionTreeClassifier))
Pipeline - memory : null
Pipeline - steps : [{"oml-python:serialized_object": "component_reference", "value": {"key": "Imputer", "step_name": "Imputer"}}, {"oml-python:serialized_object": "component_reference", "value": {"key": "VarianceThreshold", "step_name": "VarianceThreshold"}}, {"oml-python:serialized_object": "component_reference", "value": {"key": "Estimator", "step_name": "Estimator"}}]
Pipeline - verbose : false
ColumnTransformer - n_jobs : null
ColumnTransformer - remainder : "drop"
ColumnTransformer - sparse_threshold : 0.3
ColumnTransformer - transformer_weights : null
ColumnTransformer - transformers : [{"oml-python:serialized_object": "component_reference", "value": {"key": "cat", "step_name": "cat", "argument_1": {"oml-python:serialized_object": "function", "value": "openml.testing.cat"}}}, {"oml-python:serialized_object": "component_reference", "value": {"key": "cont", "step_name": "cont", "argument_1": {"oml-python:serialized_object": "function", "value": "openml.testing.cont"}}}]
ColumnTransformer - verbose : false
Pipeline - memory : null
Pipeline - steps : [{"oml-python:serialized_object": "component_reference", "value": {"key": "simpleimputer", "step_name": "simpleimputer"}}, {"oml-python:serialized_object": "component_reference", "value": {"key": "onehotencoder", "step_name": "onehotencoder"}}]
Pipeline - verbose : false
SimpleImputer - add_indicator : false
SimpleImputer - copy : true
SimpleImputer - fill_value : null
SimpleImputer - missing_values : NaN
SimpleImputer - strategy : "most_frequent"
SimpleImputer - verbose : 0
OneHotEncoder - categorical_features : null
OneHotEncoder - categories : null
OneHotEncoder - drop : null
OneHotEncoder - dtype : {"oml-python:serialized_object": "type", "value": "np.float64"}
OneHotEncoder - handle_unknown : "ignore"
OneHotEncoder - n_values : null
OneHotEncoder - sparse : true
Pipeline - memory : null
Pipeline - steps : [{"oml-python:serialized_object": "component_reference", "value": {"key": "customimputer", "step_name": "customimputer"}}, {"oml-python:serialized_object": "component_reference", "value": {"key": "standardscaler", "step_name": "standardscaler"}}]
Pipeline - verbose : false
CustomImputer - add_indicator : false
CustomImputer - copy : true
CustomImputer - fill_value : null
CustomImputer - missing_values : NaN
CustomImputer - strategy : "mean"
CustomImputer - verbose : 0
StandardScaler - copy : true
StandardScaler - with_mean : true
StandardScaler - with_std : true
VarianceThreshold - threshold : 0.0
RandomizedSearchCV - cv : 3
RandomizedSearchCV - error_score : "raise-deprecating"
RandomizedSearchCV - iid : "warn"
RandomizedSearchCV - n_iter : 10
RandomizedSearchCV - n_jobs : null
RandomizedSearchCV - param_distributions : {"min_samples_leaf": [1, 2, 4, 8, 16, 32, 64], "min_samples_split": [2, 4, 8, 16, 32, 64, 128]}
RandomizedSearchCV - pre_dispatch : "2*n_jobs"
RandomizedSearchCV - random_state : 33003
RandomizedSearchCV - refit : true
RandomizedSearchCV - return_train_score : false
RandomizedSearchCV - scoring : null
RandomizedSearchCV - verbose : 0
DecisionTreeClassifier - class_weight : null
DecisionTreeClassifier - criterion : "gini"
DecisionTreeClassifier - max_depth : null
DecisionTreeClassifier - max_features : null
DecisionTreeClassifier - max_leaf_nodes : null
DecisionTreeClassifier - min_impurity_decrease : 0.0
DecisionTreeClassifier - min_impurity_split : null
DecisionTreeClassifier - min_samples_leaf : 1
DecisionTreeClassifier - min_samples_split : 2
DecisionTreeClassifier - min_weight_fraction_leaf : 0.0
DecisionTreeClassifier - presort : false
DecisionTreeClassifier - random_state : 62501
DecisionTreeClassifier - splitter : "best"
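For orientation, the flow corresponds roughly to the following scikit-learn construction (an approximate sketch: openml.testing.cat and openml.testing.cont are the column-selector functions referenced in the serialized parameters above and only exist on that branch, and the test's CustomImputer is approximated here by SimpleImputer):

from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from openml.testing import cat, cont  # column selectors, branch-only

cat_pipe = Pipeline([("simpleimputer", SimpleImputer(strategy="most_frequent")),
                     ("onehotencoder", OneHotEncoder(handle_unknown="ignore"))])
cont_pipe = Pipeline([("customimputer", SimpleImputer(strategy="mean")),
                      ("standardscaler", StandardScaler())])

pipe = Pipeline([
    ("Imputer", ColumnTransformer([("cat", cat_pipe, cat),
                                   ("cont", cont_pipe, cont)])),
    ("VarianceThreshold", VarianceThreshold()),
    ("Estimator", RandomizedSearchCV(
        DecisionTreeClassifier(),
        {"min_samples_leaf": [1, 2, 4, 8, 16, 32, 64],
         "min_samples_split": [2, 4, 8, 16, 32, 64, 128]},
        n_iter=10, cv=3)),
])

# Every hyperparameter of every nested component becomes one input_setting
# row -- 63 in total here, which is what overflows the 61-table join.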
Agreed that we need a durable solution for this, as this is a relatively simple pipeline. However, I think it's hard to work around the join structure.
Still thinking about this.
This is the structure of the join-based query:
SELECT `sid`,`implementation_id`,`i1`.`value` AS `i1`, `i2`.`value` AS `i2`
FROM `algorithm_setup` AS `s`
LEFT JOIN `input_setting` `i1` ON `i1`.`setup` = `s`.`sid` AND `i1`.`input_id` = "1"
LEFT JOIN `input_setting` `i2` ON `i2`.`setup` = `s`.`sid` AND `i2`.`input_id` = "2"
WHERE `implementation_id` = "1" AND `i1`.`value` = 'a' AND `i2`.`value` = 'b'
Do we need to return the values in the select query? We only really need the setup ID, right?
Would this yield the same result?
SELECT `sid`
FROM `algorithm_setup`
WHERE `implementation_id` = "1" AND
`sid` in (SELECT `setup` FROM `input_setting` `i` WHERE `i`.`input_id` = "1" AND `i`.`value` = 'a') AND
`sid` in (SELECT `setup` FROM `input_setting` `i` WHERE `i`.`input_id` = "2" AND `i`.`value` = 'b') AND
(SELECT COUNT(*) FROM `input_setting` WHERE `setup` = `sid`) = 2
This should only return the setup ID that matches all of the constraints, right? The last constraint additionally checks that the number of hyperparameters matches (to avoid matching a setup that has all of these hyperparameter settings plus some more).