sqlite3.OperationalError with multiple workers

Open: jeiros opened this issue 6 years ago • 1 comment

I'm running Osprey at an HPC facility that uses the PBS Pro queue system. I'm launching the jobs as array jobs, so my understanding is that multiple workers are accessing the database file at the same time, which may be the cause of the issue I'm reporting here.

Here is the Osprey config file, and here is the PBS submission file.

Some of the jobs run without problems, but most (>60%) are giving the following error:

======================================================================
= osprey is a tool for machine learning hyperparameter optimization. =
======================================================================

osprey version:      1.2.0dev
time:                January 16, 2018  2:46 PM
hostname:            cx1-138-2-3.cx1.hpc.ic.ac.uk
cwd:                 /tmp/pbs.1108144[7].cx1
pid:                 15308

Loading config file:     /work/je714/cross-validations/ef-hand/cv_cx1.yaml...

msmbuilder version:  3.7.0
mdtraj version:      1.8.0


Loading dataset...

Dataset contains 145 element(s) with out labels
The elements have shape: [(7250, 263), (7250, 263), (1500, 263), (2500, 263), (7871, 263), (5625, 263), (4500, 263), (4277, 263), (4725, 263), (4568, 263), (8100, 263), (7425, 263), (5690, 263), (1000, 263), (2500, 263), (2500, 263), (2500, 263), (2500, 263), (2500, 263), (2500, 263), ...]
Instantiated estimator:
  Pipeline(steps=[('tica', tICA(commute_mapping=False, kinetic_mapping=False, lag_time=1,
   n_components=None, shrinkage=None)), ('cluster', MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++',
        init_size=None, max_iter=100, max_no_improvement=10, n_clusters=8,
        n_init=3, rando...les=5,
         prior_counts=0, reversible_type='mle', sliding_window=True,
         verbose=True))])
Hyperparameter search space:
  tica__lag_time           	(int)          1 <= x <= 200
  tica__commute_mapping    	(enum)    choices = (True, False)
  cluster__n_clusters      	(int)         50 <= x <= 5000
  tica__n_components       	(int)          1 <= x <= 20

----------------------------------------------------------------------
Beginning iteration                                              1 / 1
----------------------------------------------------------------------
Loading trials database: sqlite:////work/je714/cross-validations/ef-hand/osprey-trials.db...
History contains: 178 trials
Choosing next hyperparameters with random...
  {'tica__lag_time': 125, 'tica__commute_mapping': False, 'cluster__n_clusters': 708, 'tica__n_components': 2}
(random took 0.006 s)

/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
An unexpected error has occurred with osprey (version 1.2.0dev), please
consider sending the following traceback to the osprey GitHub issue tracker at:
        https://github.com/msmbuilder/osprey/issues

Traceback (most recent call last):
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 721, in _commit_impl
    self.engine.dialect.do_commit(self.connection)
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/engine/default.py", line 443, in do_commit
    dbapi_connection.commit()
sqlite3.OperationalError: disk I/O error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/je714/.conda/envs/osprey/bin/osprey", line 11, in <module>
    load_entry_point('osprey', 'console_scripts', 'osprey')()
  File "/export131/home/je714/osprey/osprey/cli/main.py", line 37, in main
    args_func(args, p)
  File "/export131/home/je714/osprey/osprey/cli/main.py", line 42, in args_func
    args.func(args, p)
  File "/export131/home/je714/osprey/osprey/cli/parser_worker.py", line 8, in func
    execute(args, parser)
  File "/export131/home/je714/osprey/osprey/execute_worker.py", line 89, in execute
    max_param_suggestion_retries=max_param_suggestion_retries)
  File "/export131/home/je714/osprey/osprey/execute_worker.py", line 149, in initialize_trial
    session.commit()
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/orm/session.py", line 874, in commit
    self.transaction.commit()
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/orm/session.py", line 465, in commit
    t[1].commit()
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 1623, in commit
    self._do_commit()
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 1654, in _do_commit
    self.connection._commit_impl()
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 723, in _commit_impl
    self._handle_dbapi_exception(e, None, None, None, None)
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 1393, in _handle_dbapi_exception
    exc_info
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/util/compat.py", line 186, in reraise
    raise value.with_traceback(tb)
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 721, in _commit_impl
    self.engine.dialect.do_commit(self.connection)
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/engine/default.py", line 443, in do_commit
    dbapi_connection.commit()
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) disk I/O error
Exception during reset or similar
Traceback (most recent call last):
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/pool.py", line 687, in _finalize_fairy
    fairy._reset(pool)
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/pool.py", line 829, in _reset
    pool._dialect.do_rollback(self)
  File "/home/je714/.conda/envs/osprey/lib/python3.4/site-packages/sqlalchemy/engine/default.py", line 440, in do_rollback
    dbapi_connection.rollback()
sqlite3.OperationalError: cannot rollback - no transaction is active

I've seen issue #6 from a while ago, but I'm not sure it is related. Any idea what is going on here?

Also, I'm using the latest copy of the Osprey code from GitHub.

Thanks for any help!

jeiros, Jan 16 '18 15:01

Thanks for the report! I think it's a limitation of SQLite: https://stackoverflow.com/a/9018525

It might be worth looking into extending Osprey to use a MySQL database (or some other client/server database).

cxhernandez, Jan 16 '18 17:01
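
For anyone hitting the same traceback, here is a minimal sketch of one possible stopgap: retrying the whole write transaction with backoff when `sqlite3.OperationalError` is raised. It is not part of Osprey or SQLAlchemy. The helper name `run_with_retry`, the `trials` table, and the retry parameters are all hypothetical and only there for illustration. It also assumes the failures are transient contention between workers. SQLite's documentation warns that file locking is unreliable on many network filesystems, so if the database sits on such a filesystem (an assumption about `/work` here), retries may only reduce the failure rate, and a client/server database as suggested above would be the more robust route.

```python
import random
import sqlite3
import time


def run_with_retry(work, retries=5, base_delay=0.1, sleep=time.sleep):
    """Run work(), retrying on sqlite3.OperationalError with backoff.

    Hypothetical helper for illustration; `work` should perform one
    complete transaction (statements plus commit) so that each retry
    starts from a clean state.
    """
    for attempt in range(retries):
        try:
            return work()
        except sqlite3.OperationalError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            # Exponential backoff with jitter, so that array-job workers
            # that collided once do not retry in lockstep.
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))


# In-memory database so the sketch runs anywhere; a real worker would
# open the shared trials file instead. `timeout` makes SQLite wait for
# a lock for up to 30 s before raising "database is locked".
conn = sqlite3.connect(":memory:", timeout=30)
conn.execute("CREATE TABLE trials (id INTEGER PRIMARY KEY, status TEXT)")


def record_trial():
    # The connection context manager commits on success and rolls back
    # if an exception is raised inside the block.
    with conn:
        conn.execute("INSERT INTO trials (status) VALUES (?)", ("PENDING",))


run_with_retry(record_trial)
count = conn.execute("SELECT COUNT(*) FROM trials").fetchone()[0]
```

The whole transaction is retried rather than just the commit because, after some errors, SQLite has already rolled the transaction back (the "cannot rollback - no transaction is active" message in the log above is consistent with that), so a bare second `commit()` could succeed without writing anything.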