
Database error - on upload of a run from OpenML-Python

Open Neeratyoy opened this issue 4 years ago • 25 comments

Description

An OpenMLServerException occurs for a particular unit test which executes a scikit-learn model on a task and tries to publish. The raised exception: openml.exceptions.OpenMLServerException: Database error. Setup search query failed - None

A POST request is made which receives a 412 response, and an exception is therefore raised by the check that the status code is 200.
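The raise-on-non-200 behaviour can be sketched as follows. This is a hypothetical minimal reproduction, not openml-python's actual code: the names `FakeResponse` and `check_response` are illustrative, and only the exception name mirrors `openml.exceptions.OpenMLServerException`.

```python
class OpenMLServerException(Exception):
    """Illustrative stand-in for openml.exceptions.OpenMLServerException."""
    def __init__(self, message, code):
        super().__init__(f"{message} - {code}")
        self.code = code


class FakeResponse:
    """Stand-in for the HTTP response returned by the POST to .../run/."""
    def __init__(self, status_code, message):
        self.status_code = status_code
        self.message = message


def check_response(response):
    # Any non-200 status (such as the 412 observed here) becomes an exception.
    if response.status_code != 200:
        raise OpenMLServerException(response.message, response.status_code)
    return response


try:
    check_response(FakeResponse(412, "Database error. Setup search query failed"))
except OpenMLServerException as e:
    print(e)  # Database error. Setup search query failed - 412
```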

Steps/Code to Reproduce

Since the unit test invokes a series of locally defined functions, one simple way to run this single test is to clone the OpenML-Python repo and run: pytest -k 'test_run_and_upload_decision_tree_pipeline'

Given that the execution stack for the unit test case appears to run without errors, the actual post request made during publish can be found here. The data and file_elements packaged in the call are attached as pickle files for reference, while the URL used for the post request is: https://test.openml.org/api/v1/xml/run/

Neeratyoy avatar Jun 17 '20 10:06 Neeratyoy

I did a clean install, but still get this error: ModuleNotFoundError: No module named 'pytest_remotedata.plugin'. Missing dependency?

joaquinvanschoren avatar Jul 02 '20 16:07 joaquinvanschoren

Did you install the test dependencies? This can be done with pip install -e '.[test]', and I think the missing dependencies can also be installed with pip install pytest pytest-xdist pytest-timeout flaky.

mfeurer avatar Jul 02 '20 16:07 mfeurer

Yes, I did pip install -e '.[test]'. I also just did pip install pytest pytest-xdist pytest-timeout flaky, but I'm still getting the error.

joaquinvanschoren avatar Jul 02 '20 16:07 joaquinvanschoren

I also tried 'pip install pytest-remotedata' and 'pip install -U pytest-remotedata', but still get the error. Weird.

joaquinvanschoren avatar Jul 02 '20 16:07 joaquinvanschoren

I installed everything from scratch again, now I get: ModuleNotFoundError: No module named 'pytest_openfiles.plugin'

joaquinvanschoren avatar Jul 02 '20 16:07 joaquinvanschoren

Upgraded pytest-openfiles to version 0.5.0. Also had to pip install babel.

running now

joaquinvanschoren avatar Jul 02 '20 16:07 joaquinvanschoren

The test passes. Weirdly, I'm not seeing the run on the test server. What am I missing?

$ pytest -k 'test_run_and_upload_decision_tree_pipeline'
/Users/joa/anaconda3/lib/python3.7/site-packages/dash/testing/plugin.py:17: UserWarning:

run `pip install dash[testing]` if you need dash.testing

============================================= test session starts =============================================
platform darwin -- Python 3.7.3, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /Users/joa/Code/openml-python, inifile: setup.cfg
plugins: Flask-Dance-3.0.0, timeout-1.4.1, forked-1.2.0, typeguard-2.7.1, cov-2.10.0, flaky-3.6.1, openfiles-0.5.0, xdist-1.32.0, dash-1.3.0
collected 310 items / 309 deselected / 1 selected

tests/test_runs/test_run_functions.py .                                                                 [100%]INFO:unit_tests:Deleted (run, 76815)
INFO:unit_tests:Deleting datas...
INFO:unit_tests:Deleting flows...
INFO:unit_tests:Deleted (flow, 201746)
INFO:unit_tests:Deleting tasks...
INFO:unit_tests:Deleting studys...
INFO:unit_tests:Collecting file lists from: /Users/joa/Code/openml-python/tests/../tests/files/
INFO:unit_tests:Local files deleted
INFO:unit_tests:master is killed


===================================== 1 passed, 309 deselected in 11.25s

joaquinvanschoren avatar Jul 02 '20 16:07 joaquinvanschoren

Did you install the development branch or this branch? Also, the error appears to depend on the python version and/or scikit-learn version. The tests fail with scikit-learn 0.21.X (and we don't test 0.22.X yet).

mfeurer avatar Jul 03 '20 07:07 mfeurer

What do you mean by 'install the development branch'? I cloned from GitHub, did checkout develop, and installed via pip install -e '.[test]'. The currently installed version is openml-0.11.0.dev0.

I had scikit-learn 0.23.X installed, but now reverted to 0.21.3. The test still passes.

I am more concerned about the fact that, even though the test passes, the run is not appearing on the test server, and I'm not seeing any errors. Am I testing using some kind of mocking setup?

Is this the right test to run? pytest -k 'test_run_and_upload_decision_tree_pipeline'

joaquinvanschoren avatar Jul 03 '20 12:07 joaquinvanschoren

The run itself should appear on the test server and be deleted afterwards to keep the test server clean.

You need a branch from @amueller's fork of OpenML-Python which you can get with:

git checkout -b amueller-dataframe_run_on_task develop
git pull https://github.com/amueller/openml-python.git dataframe_run_on_task

Sorry for not having that in the original description.

mfeurer avatar Jul 03 '20 12:07 mfeurer

Oh yeah, it does! Previously it was deleted too fast for me to see it :). What are the 'Test Test' runs I see appearing on the test server?

I tried pulling the PR, but get a lot of merge conflicts:

Successfully installed openml
$ git checkout -b amueller-dataframe_run_on_task develop
Switched to a new branch 'amueller-dataframe_run_on_task'
$ git pull https://github.com/amueller/openml-python.git dataframe_run_on_task
remote: Enumerating objects: 183, done.
remote: Counting objects: 100% (183/183), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 206 (delta 146), reused 179 (delta 142), pack-reused 23
Receiving objects: 100% (206/206), 85.67 KiB | 731.00 KiB/s, done.
Resolving deltas: 100% (147/147), completed with 43 local objects.
From https://github.com/amueller/openml-python
 * branch              dataframe_run_on_task -> FETCH_HEAD
Auto-merging tests/test_study/test_study_examples.py
CONFLICT (content): Merge conflict in tests/test_study/test_study_examples.py
Auto-merging tests/test_runs/test_run_functions.py
CONFLICT (content): Merge conflict in tests/test_runs/test_run_functions.py
Auto-merging tests/test_extensions/test_sklearn_extension/test_sklearn_extension.py
CONFLICT (content): Merge conflict in tests/test_extensions/test_sklearn_extension/test_sklearn_extension.py
Auto-merging openml/testing.py
CONFLICT (content): Merge conflict in openml/testing.py
Auto-merging openml/runs/functions.py
CONFLICT (content): Merge conflict in openml/runs/functions.py
Auto-merging openml/extensions/sklearn/extension.py
CONFLICT (content): Merge conflict in openml/extensions/sklearn/extension.py
Auto-merging openml/__init__.py
CONFLICT (content): Merge conflict in openml/__init__.py
Auto-merging examples/30_extended/datasets_tutorial.py
Automatic merge failed; fix conflicts and then commit the result.

joaquinvanschoren avatar Jul 03 '20 13:07 joaquinvanschoren

Hm, not sure how to merge this easily. Could you clone the repo again from github.com/amueller/openml-python, switch to the branch dataframe_run_on_task, and go on from there?

What are the 'Test Test' runs I see appearing on the test server?

I don't know off the top of my head, sorry.

mfeurer avatar Jul 03 '20 13:07 mfeurer

I could try, but is that what you do in the unit tests? I was trying to reproduce the error you got in the unit tests.

joaquinvanschoren avatar Jul 03 '20 13:07 joaquinvanschoren

The unit test only appears on that specific branch.

mfeurer avatar Jul 03 '20 13:07 mfeurer

I did the following:

git clone https://www.github.com/amueller/openml-python
cd openml-python
git checkout develop
pip install -e '.[test]'
pytest -k 'test_run_and_upload_decision_tree_pipeline'

And now it can't find that test:

$ pytest -k 'test_run_and_upload_decision_tree_pipeline'
======================================= test session starts 
platform darwin -- Python 3.7.3, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /Users/joa/Code/openml-python
plugins: Flask-Dance-3.0.0, timeout-1.4.1, typeguard-2.9.1, forked-1.2.0, mock-3.1.1, cov-2.10.0, flaky-3.6.1, openfiles-0.5.0, xdist-1.32.0
collected 72 items / 72 deselected

joaquinvanschoren avatar Jul 03 '20 13:07 joaquinvanschoren

If I run all the tests in that repo, 20 of them fail.

I'm still not sure why you want me to run the tests in the other repo. Is this still relevant to the server error you're getting?

joaquinvanschoren avatar Jul 03 '20 13:07 joaquinvanschoren

Yes, due to how that PR was created, it lives in Andy's repository and we're working on it there. You need to switch to the feature branch via git checkout dataframe_run_on_task to have access to the test. Hopefully it fails then.

mfeurer avatar Jul 03 '20 17:07 mfeurer

Yep, it fails: Too many tables; MySQL can only use 61 tables in a join.

What kind of setup are you guys trying to store? This is the query. Does that make sense?

SELECT `sid`,`implementation_id`, `i1416459`.`value` AS `i1416459`, `i1416460`.`value` AS `i1416460`, 
`i1416461`.`value` AS `i1416461`, `i1416462`.`value` AS `i1416462`, `i1416463`.`value` AS `i1416463`, 
`i1416464`.`value` AS `i1416464`, `i1416465`.`value` AS `i1416465`, `i1416466`.`value` AS `i1416466`, 
`i1416467`.`value` AS `i1416467`, `i1416468`.`value` AS `i1416468`, `i1416469`.`value` AS `i1416469`, 
`i1416470`.`value` AS `i1416470`, `i1416471`.`value` AS `i1416471`, `i1416472`.`value` AS `i1416472`, `i1416473`.`value`
 AS `i1416473`, `i1416474`.`value` AS `i1416474`, `i1416475`.`value` AS `i1416475`, `i1416476`.`value` AS `i1416476`, 
`i1416477`.`value` AS `i1416477`, `i1416478`.`value` AS `i1416478`, `i1416479`.`value` AS `i1416479`, 
`i1416480`.`value` AS `i1416480`, `i1416481`.`value` AS `i1416481`, `i1416482`.`value` AS `i1416482`, 
`i1416483`.`value` AS `i1416483`, `i1416484`.`value` AS `i1416484`, `i1416485`.`value` AS `i1416485`, 
`i1416486`.`value` AS `i1416486`, `i1416487`.`value` AS `i1416487`, `i1416488`.`value` AS `i1416488`, 
`i1416489`.`value` AS `i1416489`, `i1416490`.`value` AS `i1416490`, `i1416491`.`value` AS `i1416491`, 
`i1416492`.`value` AS `i1416492`, `i1416493`.`value` AS `i1416493`, `i1416494`.`value` AS `i1416494`, 
`i1416495`.`value` AS `i1416495`, `i1416496`.`value` AS `i1416496`, `i1416497`.`value` AS `i1416497`, 
`i1416498`.`value` AS `i1416498`, `i1416500`.`value` AS `i1416500`, `i1416501`.`value` AS `i1416501`, 
`i1416502`.`value` AS `i1416502`, `i1416503`.`value` AS `i1416503`, `i1416504`.`value` AS `i1416504`, 
`i1416505`.`value` AS `i1416505`, `i1416506`.`value` AS `i1416506`, `i1416507`.`value` AS `i1416507`, 
`i1416508`.`value` AS `i1416508`, `i1416509`.`value` AS `i1416509`, `i1416510`.`value` AS `i1416510`, `i1416511`.`value`
 AS `i1416511`, `i1416512`.`value` AS `i1416512`, `i1416513`.`value` AS `i1416513`, `i1416514`.`value` AS `i1416514`, 
`i1416515`.`value` AS `i1416515`, `i1416516`.`value` AS `i1416516`, `i1416517`.`value` AS `i1416517`, `i1416518`.`value`
 AS `i1416518`, `i1416519`.`value` AS `i1416519`, `i1416520`.`value` AS `i1416520`, `i1416521`.`value` AS `i1416521`,
 `i1416522`.`value` AS `i1416522`FROM `algorithm_setup` AS `s` LEFT JOIN `input_setting` `i1416459` ON 
`i1416459`.`setup` = `s`.`sid` AND `i1416459`.`input_id` = "1416459" LEFT JOIN `input_setting` `i1416460` ON 
`i1416460`.`setup` = `s`.`sid` AND `i1416460`.`input_id` = "1416460" LEFT JOIN `input_setting` `i1416461` ON 
`i1416461`.`setup` = `s`.`sid` AND `i1416461`.`input_id` = "1416461" LEFT JOIN `input_setting` `i1416462` ON 
`i1416462`.`setup` = `s`.`sid` AND `i1416462`.`input_id` = "1416462" LEFT JOIN `input_setting` `i1416463` ON 
`i1416463`.`setup` = `s`.`sid` AND `i1416463`.`input_id` = "1416463" LEFT JOIN `input_setting` `i1416464` ON 
`i1416464`.`setup` = `s`.`sid` AND `i1416464`.`input_id` = "1416464" LEFT JOIN `input_setting` `i1416465` ON 
`i1416465`.`setup` = `s`.`sid` AND `i1416465`.`input_id` = "1416465" LEFT JOIN `input_setting` `i1416466` ON 
`i1416466`.`setup` = `s`.`sid` AND `i1416466`.`input_id` = "1416466" LEFT JOIN `input_setting` `i1416467` ON 
`i1416467`.`setup` = `s`.`sid` AND `i1416467`.`input_id` = "1416467" LEFT JOIN `input_setting` `i1416468` ON 
`i1416468`.`setup` = `s`.`sid` AND `i1416468`.`input_id` = "1416468" LEFT JOIN `input_setting` `i1416469` ON 
`i1416469`.`setup` = `s`.`sid` AND `i1416469`.`input_id` = "1416469" LEFT JOIN `input_setting` `i1416470` ON 
`i1416470`.`setup` = `s`.`sid` AND `i1416470`.`input_id` = "1416470" LEFT JOIN `input_setting` `i1416471` ON 
`i1416471`.`setup` = `s`.`sid` AND `i1416471`.`input_id` = "1416471" LEFT JOIN `input_setting` `i1416472` ON 
`i1416472`.`setup` = `s`.`sid` AND `i1416472`.`input_id` = "1416472" LEFT JOIN `input_setting` `i1416473` ON 
`i1416473`.`setup` = `s`.`sid` AND `i1416473`.`input_id` = "1416473" LEFT JOIN `input_setting` `i1416474` ON 
`i1416474`.`setup` = `s`.`sid` AND `i1416474`.`input_id` = "1416474" LEFT JOIN `input_setting` `i1416475` ON 
`i1416475`.`setup` = `s`.`sid` AND `i1416475`.`input_id` = "1416475" LEFT JOIN `input_setting` `i1416476` ON 
`i1416476`.`setup` = `s`.`sid` AND `i1416476`.`input_id` = "1416476" LEFT JOIN `input_setting` `i1416477` ON 
`i1416477`.`setup` = `s`.`sid` AND `i1416477`.`input_id` = "1416477" LEFT JOIN `input_setting` `i1416478` ON 
`i1416478`.`setup` = `s`.`sid` AND `i1416478`.`input_id` = "1416478" LEFT JOIN `input_setting` `i1416479` ON 
`i1416479`.`setup` = `s`.`sid` AND `i1416479`.`input_id` = "1416479" LEFT JOIN `input_setting` `i1416480` ON 
`i1416480`.`setup` = `s`.`sid` AND `i1416480`.`input_id` = "1416480" LEFT JOIN `input_setting` `i1416481` ON 
`i1416481`.`setup` = `s`.`sid` AND `i1416481`.`input_id` = "1416481" LEFT JOIN `input_setting` `i1416482` ON 
`i1416482`.`setup` = `s`.`sid` AND `i1416482`.`input_id` = "1416482" LEFT JOIN `input_setting` `i1416483` ON 
`i1416483`.`setup` = `s`.`sid` AND `i1416483`.`input_id` = "1416483" LEFT JOIN `input_setting` `i1416484` ON 
`i1416484`.`setup` = `s`.`sid` AND `i1416484`.`input_id` = "1416484" LEFT JOIN `input_setting` `i1416485` ON 
`i1416485`.`setup` = `s`.`sid` AND `i1416485`.`input_id` = "1416485" LEFT JOIN `input_setting` `i1416486` ON 
`i1416486`.`setup` = `s`.`sid` AND `i1416486`.`input_id` = "1416486" LEFT JOIN `input_setting` `i1416487` ON 
`i1416487`.`setup` = `s`.`sid` AND `i1416487`.`input_id` = "1416487" LEFT JOIN `input_setting` `i1416488` ON 
`i1416488`.`setup` = `s`.`sid` AND `i1416488`.`input_id` = "1416488" LEFT JOIN `input_setting` `i1416489` ON 
`i1416489`.`setup` = `s`.`sid` AND `i1416489`.`input_id` = "1416489" LEFT JOIN `input_setting` `i1416490` ON 
`i1416490`.`setup` = `s`.`sid` AND `i1416490`.`input_id` = "1416490" LEFT JOIN `input_setting` `i1416491` ON 
`i1416491`.`setup` = `s`.`sid` AND `i1416491`.`input_id` = "1416491" LEFT JOIN `input_setting` `i1416492` ON 
`i1416492`.`setup` = `s`.`sid` AND `i1416492`.`input_id` = "1416492" LEFT JOIN `input_setting` `i1416493` ON 
`i1416493`.`setup` = `s`.`sid` AND `i1416493`.`input_id` = "1416493" LEFT JOIN `input_setting` `i1416494` ON 
`i1416494`.`setup` = `s`.`sid` AND `i1416494`.`input_id` = "1416494" LEFT JOIN `input_setting` `i1416495` ON 
`i1416495`.`setup` = `s`.`sid` AND `i1416495`.`input_id` = "1416495" LEFT JOIN `input_setting` `i1416496` ON 
`i1416496`.`setup` = `s`.`sid` AND `i1416496`.`input_id` = "1416496" LEFT JOIN `input_setting` `i1416497` ON 
`i1416497`.`setup` = `s`.`sid` AND `i1416497`.`input_id` = "1416497" LEFT JOIN `input_setting` `i1416498` ON 
`i1416498`.`setup` = `s`.`sid` AND `i1416498`.`input_id` = "1416498" LEFT JOIN `input_setting` `i1416500` ON 
`i1416500`.`setup` = `s`.`sid` AND `i1416500`.`input_id` = "1416500" LEFT JOIN `input_setting` `i1416501` ON 
`i1416501`.`setup` = `s`.`sid` AND `i1416501`.`input_id` = "1416501" LEFT JOIN `input_setting` `i1416502` ON 
`i1416502`.`setup` = `s`.`sid` AND `i1416502`.`input_id` = "1416502" LEFT JOIN `input_setting` `i1416503` ON 
`i1416503`.`setup` = `s`.`sid` AND `i1416503`.`input_id` = "1416503" LEFT JOIN `input_setting` `i1416504` ON 
`i1416504`.`setup` = `s`.`sid` AND `i1416504`.`input_id` = "1416504" LEFT JOIN `input_setting` `i1416505` ON 
`i1416505`.`setup` = `s`.`sid` AND `i1416505`.`input_id` = "1416505" LEFT JOIN `input_setting` `i1416506` ON 
`i1416506`.`setup` = `s`.`sid` AND `i1416506`.`input_id` = "1416506" LEFT JOIN `input_setting` `i1416507` ON 
`i1416507`.`setup` = `s`.`sid` AND `i1416507`.`input_id` = "1416507" LEFT JOIN `input_setting` `i1416508` ON 
`i1416508`.`setup` = `s`.`sid` AND `i1416508`.`input_id` = "1416508" LEFT JOIN `input_setting` `i1416509` ON 
`i1416509`.`setup` = `s`.`sid` AND `i1416509`.`input_id` = "1416509" LEFT JOIN `input_setting` `i1416510` ON 
`i1416510`.`setup` = `s`.`sid` AND `i1416510`.`input_id` = "1416510" LEFT JOIN `input_setting` `i1416511` ON 
`i1416511`.`setup` = `s`.`sid` AND `i1416511`.`input_id` = "1416511" LEFT JOIN `input_setting` `i1416512` ON 
`i1416512`.`setup` = `s`.`sid` AND `i1416512`.`input_id` = "1416512" LEFT JOIN `input_setting` `i1416513` ON 
`i1416513`.`setup` = `s`.`sid` AND `i1416513`.`input_id` = "1416513" LEFT JOIN `input_setting` `i1416514` ON 
`i1416514`.`setup` = `s`.`sid` AND `i1416514`.`input_id` = "1416514" LEFT JOIN `input_setting` `i1416515` ON 
`i1416515`.`setup` = `s`.`sid` AND `i1416515`.`input_id` = "1416515" LEFT JOIN `input_setting` `i1416516` ON 
`i1416516`.`setup` = `s`.`sid` AND `i1416516`.`input_id` = "1416516" LEFT JOIN `input_setting` `i1416517` ON 
`i1416517`.`setup` = `s`.`sid` AND `i1416517`.`input_id` = "1416517" LEFT JOIN `input_setting` `i1416518` ON 
`i1416518`.`setup` = `s`.`sid` AND `i1416518`.`input_id` = "1416518" LEFT JOIN `input_setting` `i1416519` ON 
`i1416519`.`setup` = `s`.`sid` AND `i1416519`.`input_id` = "1416519" LEFT JOIN `input_setting` `i1416520` ON 
`i1416520`.`setup` = `s`.`sid` AND `i1416520`.`input_id` = "1416520" LEFT JOIN `input_setting` `i1416521` ON 
`i1416521`.`setup` = `s`.`sid` AND `i1416521`.`input_id` = "1416521" LEFT JOIN `input_setting` `i1416522` ON 
`i1416522`.`setup` = `s`.`sid` AND `i1416522`.`input_id` = "1416522"  WHERE `implementation_id` = "202633"  AND 
`i1416459`.`value` = 'null'  AND `i1416460`.`value` = '[{\"oml-python:serialized_object\": \"component_reference\", \"value\":
 {\"key\": \"Imputer\", \"step_name\": \"Imputer\"}}, {\"oml-python:serialized_object\": \"component_reference\", \"value\": 
{\"key\": \"VarianceThreshold\", \"step_name\": \"VarianceThreshold\"}}, {\"oml-python:serialized_object\": 
\"component_reference\", \"value\": {\"key\": \"Estimator\", \"step_name\": \"Estimator\"}}]'  AND `i1416461`.`value` = 'false'  
AND `i1416462`.`value` = 'null'  AND `i1416463`.`value` = '\"drop\"'  AND `i1416464`.`value` = '0.3'  AND 
`i1416465`.`value` = 'null'  AND `i1416466`.`value` = '[{\"oml-python:serialized_object\": \"component_reference\", \"value\":
 {\"key\": \"cat\", \"step_name\": \"cat\", \"argument_1\": {\"oml-python:serialized_object\": \"function\", \"value\": 
\"openml.testing.cat\"}}}, {\"oml-python:serialized_object\": \"component_reference\", \"value\": {\"key\": \"cont\", 
\"step_name\": \"cont\", \"argument_1\": {\"oml-python:serialized_object\": \"function\", \"value\": \"openml.testing.cont\"}}}]'  
AND `i1416467`.`value` = 'false'  AND `i1416468`.`value` = 'null'  AND `i1416469`.`value` = '[{\"oml-
python:serialized_object\": \"component_reference\", \"value\": {\"key\": \"simpleimputer\", \"step_name\": \"simpleimputer\"}},
 {\"oml-python:serialized_object\": \"component_reference\", \"value\": {\"key\": \"onehotencoder\", \"step_name\": 
\"onehotencoder\"}}]'  AND `i1416470`.`value` = 'false'  AND `i1416471`.`value` = 'false'  AND `i1416472`.`value` = 'true'  
AND `i1416473`.`value` = 'null'  AND `i1416474`.`value` = 'NaN'  AND `i1416475`.`value` = '\"most_frequent\"'  AND 
`i1416476`.`value` = '0'  AND `i1416477`.`value` = 'null'  AND `i1416478`.`value` = 'null'  AND `i1416479`.`value` = 'null'  
AND `i1416480`.`value` = '{\"oml-python:serialized_object\": \"type\", \"value\": \"np.float64\"}'  AND `i1416481`.`value` = 
'\"ignore\"'  AND `i1416482`.`value` = 'null'  AND `i1416483`.`value` = 'true'  AND `i1416484`.`value` = 'null'  AND 
`i1416485`.`value` = '[{\"oml-python:serialized_object\": \"component_reference\", \"value\": {\"key\": \"customimputer\",
 \"step_name\": \"customimputer\"}}, {\"oml-python:serialized_object\": \"component_reference\", \"value\": {\"key\": 
\"standardscaler\", \"step_name\": \"standardscaler\"}}]'  AND `i1416486`.`value` = 'false'  AND `i1416487`.`value` = 'false'
  AND `i1416488`.`value` = 'true'  AND `i1416489`.`value` = 'null'  AND `i1416490`.`value` = 'NaN'  AND `i1416491`.`value`
 = '\"mean\"'  AND `i1416492`.`value` = '0'  AND `i1416493`.`value` = 'true'  AND `i1416494`.`value` = 'true'  AND 
`i1416495`.`value` = 'true'  AND `i1416496`.`value` = '0.0'  AND `i1416497`.`value` = '3'  AND `i1416498`.`value` = 
'\"raise-deprecating\"'  AND `i1416500`.`value` = '\"warn\"'  AND `i1416501`.`value` = '10'  AND `i1416502`.`value` = 'null' 
 AND `i1416503`.`value` = '{\"min_samples_leaf\": [1, 2, 4, 8, 16, 32, 64], \"min_samples_split\": [2, 4, 8, 16, 32, 64, 128]}'  
AND `i1416504`.`value` = '\"2*n_jobs\"'  AND `i1416505`.`value` = '33003'  AND `i1416506`.`value` = 'true'  AND 
`i1416507`.`value` = 'false'  AND `i1416508`.`value` = 'null'  AND `i1416509`.`value` = '0'  AND `i1416510`.`value` = 'null'
  AND `i1416511`.`value` = '\"gini\"'  AND `i1416512`.`value` = 'null'  AND `i1416513`.`value` = 'null'  AND `i1416514`.`value`
 = 'null'  AND `i1416515`.`value` = '0.0'  AND `i1416516`.`value` = 'null'  AND `i1416517`.`value` = '1'  AND 
`i1416518`.`value` = '2'  AND `i1416519`.`value` = '0.0'  AND `i1416520`.`value` = 'false'  AND `i1416521`.`value` = '62501' 
 AND `i1416522`.`value` = '\"best\"'  AND (SELECT COUNT(*) FROM `input_setting` WHERE `setup` = `s`.`sid`) = 63 LIMIT
 0,1;

joaquinvanschoren avatar Jul 03 '20 23:07 joaquinvanschoren

It seems this query breaks once the joins exceed MySQL's 61-table limit, and this setup has 63 hyperparameter settings.

@janvanrijn what do you think? Shall we rewrite this query using IN subqueries? Then we don’t need joins.

joaquinvanschoren avatar Jul 04 '20 08:07 joaquinvanschoren

Interesting. I was aware of this limitation, but have not yet seen a 'clean' way to solve it.

When rewriting the query, one has to be extremely aware of NULL values for hyperparameters. You don't want to match a setup that has specified all hyperparameters to a setup that has only specified a subset. All my previous designs to do so failed on this point.

Two questions:

  1. This limitation should only affect very complex flows. How come a simple flow like the one shown is affected?
  2. Can't we change the MySQL settings to accommodate this (without causing extreme load on the server)?

janvanrijn avatar Jul 04 '20 14:07 janvanrijn

We can change the MySQL setting, but that just moves the problem until a larger pipeline comes along. It will also hurt performance...

joaquinvanschoren avatar Jul 04 '20 20:07 joaquinvanschoren

Here is the pipeline and all hyperparameters:

Pipeline(Imputer=ColumnTransformer(cat=sklearn.pipeline.Pipeline, cont=sklearn.pipeline.Pipeline), VarianceThreshold=VarianceThreshold, Estimator=RandomizedSearchCV(estimator=DecisionTreeClassifier))

Pipeline - memory : null
Pipeline - steps : [{"oml-python:serialized_object": "component_reference", "value": {"key": "Imputer", "step_name": "Imputer"}}, {"oml-python:serialized_object": "component_reference", "value": {"key": "VarianceThreshold", "step_name": "VarianceThreshold"}}, {"oml-python:serialized_object": "component_reference", "value": {"key": "Estimator", "step_name": "Estimator"}}]
Pipeline - verbose : false
ColumnTransformer - n_jobs : null
ColumnTransformer - remainder : "drop"
ColumnTransformer - sparse_threshold : 0.3
ColumnTransformer - transformer_weights : null
ColumnTransformer - transformers : [{"oml-python:serialized_object": "component_reference", "value": {"key": "cat", "step_name": "cat", "argument_1": {"oml-python:serialized_object": "function", "value": "openml.testing.cat"}}}, {"oml-python:serialized_object": "component_reference", "value": {"key": "cont", "step_name": "cont", "argument_1": {"oml-python:serialized_object": "function", "value": "openml.testing.cont"}}}]
ColumnTransformer - verbose : false
Pipeline - memory : null
Pipeline - steps : [{"oml-python:serialized_object": "component_reference", "value": {"key": "simpleimputer", "step_name": "simpleimputer"}}, {"oml-python:serialized_object": "component_reference", "value": {"key": "onehotencoder", "step_name": "onehotencoder"}}]
Pipeline - verbose : false
SimpleImputer - add_indicator : false
SimpleImputer - copy : true
SimpleImputer - fill_value : null
SimpleImputer - missing_values : NaN
SimpleImputer - strategy : "most_frequent"
SimpleImputer - verbose : 0
OneHotEncoder - categorical_features : null
OneHotEncoder - categories : null
OneHotEncoder - drop : null
OneHotEncoder - dtype : {"oml-python:serialized_object": "type", "value": "np.float64"}
OneHotEncoder - handle_unknown : "ignore"
OneHotEncoder - n_values : null
OneHotEncoder - sparse : true
Pipeline - memory : null
Pipeline - steps : [{"oml-python:serialized_object": "component_reference", "value": {"key": "customimputer", "step_name": "customimputer"}}, {"oml-python:serialized_object": "component_reference", "value": {"key": "standardscaler", "step_name": "standardscaler"}}]
Pipeline - verbose : false
CustomImputer - add_indicator : false
CustomImputer - copy : true
CustomImputer - fill_value : null
CustomImputer - missing_values : NaN
CustomImputer - strategy : "mean"
CustomImputer - verbose : 0
StandardScaler - copy : true
StandardScaler - with_mean : true
StandardScaler - with_std : true
VarianceThreshold - threshold : 0.0
RandomizedSearchCV - cv : 3
RandomizedSearchCV - error_score : "raise-deprecating"
RandomizedSearchCV - iid : "warn"
RandomizedSearchCV - n_iter : 10
RandomizedSearchCV - n_jobs : null
RandomizedSearchCV - param_distributions : {"min_samples_leaf": [1, 2, 4, 8, 16, 32, 64], "min_samples_split": [2, 4, 8, 16, 32, 64, 128]}
RandomizedSearchCV - pre_dispatch : "2*n_jobs"
RandomizedSearchCV - random_state : 33003
RandomizedSearchCV - refit : true
RandomizedSearchCV - return_train_score : false
RandomizedSearchCV - scoring : null
RandomizedSearchCV - verbose : 0
DecisionTreeClassifier - class_weight : null
DecisionTreeClassifier - criterion : "gini"
DecisionTreeClassifier - max_depth : null
DecisionTreeClassifier - max_features : null
DecisionTreeClassifier - max_leaf_nodes : null
DecisionTreeClassifier - min_impurity_decrease : 0.0
DecisionTreeClassifier - min_impurity_split : null
DecisionTreeClassifier - min_samples_leaf : 1
DecisionTreeClassifier - min_samples_split : 2
DecisionTreeClassifier - min_weight_fraction_leaf : 0.0
DecisionTreeClassifier - presort : false
DecisionTreeClassifier - random_state : 62501
DecisionTreeClassifier - splitter : "best"

joaquinvanschoren avatar Jul 04 '20 20:07 joaquinvanschoren

Agreed that we need a durable solution for this, as this is a relatively simple pipeline. However, I think it's hard to work around the join structure.

janvanrijn avatar Jul 04 '20 20:07 janvanrijn

Still thinking about this.

This is the structure of the join-based query:

SELECT `sid`,`implementation_id`,`i1`.`value` AS `i1`, `i2`.`value` AS `i2` 
FROM `algorithm_setup` AS `s` 
LEFT JOIN `input_setting` `i1` ON `i1`.`setup` = `s`.`sid` AND `i1`.`input_id` = "1" 
LEFT JOIN `input_setting` `i2` ON `i2`.`setup` = `s`.`sid` AND `i2`.`input_id` = "2"
WHERE `implementation_id` = "1"  AND `i1`.`value` = 'a'  AND `i2`.`value` = 'b'

Do we need to return the values in the select query? We only really need the setup ID, right?

joaquinvanschoren avatar Jul 04 '20 20:07 joaquinvanschoren

Would this yield the same result?

SELECT `sid` 
FROM `algorithm_setup`
WHERE `implementation_id` = "1"  AND 
`sid` in (SELECT `setup` FROM `input_setting` `i` WHERE `i`.`input_id` = "1" AND `i`.`value` = 'a') AND
`sid` in (SELECT `setup` FROM `input_setting` `i` WHERE `i`.`input_id` = "2" AND `i`.`value` = 'b') AND
 (SELECT COUNT(*) FROM `input_setting` WHERE `setup` = `sid`) = 2

This should only return the setup ID that matches all of the constraints, right? The last constraint also checks that the number of hyperparameters matches, to avoid matching a setup that has all of these hyperparameter settings plus some more.
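For what it's worth, the two query shapes can be compared on a toy version of the schema. The sketch below uses sqlite3 rather than MySQL (so it says nothing about the 61-table limit itself, only about result equivalence), with simplified `algorithm_setup` and `input_setting` tables and made-up data: setup 1 matches exactly, while setup 2 has the same two values plus an extra hyperparameter, which the COUNT(*) guard must exclude.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE algorithm_setup (sid INTEGER, implementation_id TEXT)")
cur.execute("CREATE TABLE input_setting (setup INTEGER, input_id TEXT, value TEXT)")

cur.executemany("INSERT INTO algorithm_setup VALUES (?, ?)", [(1, "1"), (2, "1")])
cur.executemany("INSERT INTO input_setting VALUES (?, ?, ?)",
                [(1, "1", "a"), (1, "2", "b"),          # setup 1: exact match
                 (2, "1", "a"), (2, "2", "b"), (2, "3", "c")])  # setup 2: superset

# Join-based shape (as in the original query, including the COUNT(*) guard).
join_query = """
SELECT sid FROM algorithm_setup AS s
LEFT JOIN input_setting i1 ON i1.setup = s.sid AND i1.input_id = '1'
LEFT JOIN input_setting i2 ON i2.setup = s.sid AND i2.input_id = '2'
WHERE implementation_id = '1' AND i1.value = 'a' AND i2.value = 'b'
AND (SELECT COUNT(*) FROM input_setting WHERE setup = s.sid) = 2
"""

# Proposed IN-subquery rewrite: one subquery per hyperparameter, no joins.
in_query = """
SELECT sid FROM algorithm_setup
WHERE implementation_id = '1'
AND sid IN (SELECT setup FROM input_setting i WHERE i.input_id = '1' AND i.value = 'a')
AND sid IN (SELECT setup FROM input_setting i WHERE i.input_id = '2' AND i.value = 'b')
AND (SELECT COUNT(*) FROM input_setting WHERE setup = sid) = 2
"""

# Both return only setup 1; setup 2 is excluded by the COUNT(*) guard.
print(cur.execute(join_query).fetchall())  # [(1,)]
print(cur.execute(in_query).fetchall())    # [(1,)]
```

The subquery version still needs one subquery per hyperparameter, but each is against a single table, so the per-join table limit no longer applies.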

joaquinvanschoren avatar Jul 04 '20 20:07 joaquinvanschoren