Flap on initing Run

Open Alexponomarev7 opened this issue 1 year ago • 11 comments

🐛 Bug

I have a very strange, intermittent bug: sometimes everything is OK, and sometimes I get this:

AttributeError: 'str' object has no attribute 'class_'
  return "<%s at 0x%x>" % (state.class_.__name__, id(state.obj()))
  "row is otherwise not present." % base.state_str(state)
 File "/usr/python3.8/lib/python3.8/site-packages/sqlalchemy/orm/base.py", line 264, in state_str
 File "/usr/python3.8/lib/python3.8/site-packages/sqlalchemy/orm/exc.py", line 138, in __init__
  raise exception(*args) if args else exception()
 File "/usr/python3.8/lib/python3.8/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
  raise_exception(response.exception)
 File "/usr/python3.8/lib/python3.8/site-packages/aim/ext/transport/client.py", line 225, in get_resource_handler
  handler = self._rpc_client.get_resource_handler(self, self.resource_type, args=self.init_args)
 File "/usr/python3.8/lib/python3.8/site-packages/aim/storage/structured/proxy.py", line 28, in __init__
  return StructuredRunProxy(self._client, hash_, read_only)
 File "/usr/python3.8/lib/python3.8/site-packages/aim/sdk/repo.py", line 351, in request_props
  self._props = self.repo.request_props(self.hash, self.read_only)
 File "/usr/python3.8/lib/python3.8/site-packages/aim/sdk/run.py", line 440, in props
  self.props
 File "/usr/python3.8/lib/python3.8/site-packages/aim/sdk/run.py", line 325, in __init__
  super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
 File "/usr/python3.8/lib/python3.8/site-packages/aim/sdk/run.py", line 828, in __init__
  return func(*args, **kwargs)
 File "/usr/python3.8/lib/python3.8/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
  raise e
 File "/usr/python3.8/lib/python3.8/site-packages/aim/ext/exception_resistant.py", line 47, in reraise_exception
  _SafeModeConfig.exception_callback(e, func)
 File "/usr/python3.8/lib/python3.8/site-packages/aim/ext/exception_resistant.py", line 70, in wrapper
  run = Run(
 File "run.py", line 71, in _init_aim
  self.run = self._init_aim()

To reproduce

I initialise the run like this:

from aim import Run

# Connect to the remote tracking server over the aim:// transport.
run = Run(
    repo="aim://127.0.0.1:53800",
    experiment=project_name,
)

Expected behavior

No error or another :)

Environment

  • Aim Version 3.16.2
  • Python 3.8
  • 3.8
  • Ubuntu 20.04

Alexponomarev7 avatar Mar 13 '23 16:03 Alexponomarev7

Hey @Alexponomarev7! Thanks a lot for the report. Can I ask which version of sqlalchemy you are using on the server side?

mihran113 avatar Mar 13 '23 16:03 mihran113

Also, is your server-side aim on version 3.16.2 as well?

mihran113 avatar Mar 13 '23 16:03 mihran113

@mihran113 Thank you for the fast response!

>>> sqlalchemy.__version__
'1.4.46'
 ~ $ docker run --network host --entrypoint aim aimstack/aim version
Aim v3.16.2

Alexponomarev7 avatar Mar 13 '23 16:03 Alexponomarev7

It seems you're running the aim server in docker? Could you please provide some more details about your setup, so I can reproduce it on my end?

And one more question: did this start with version 3.16.2, or was it happening earlier as well?

mihran113 avatar Mar 13 '23 17:03 mihran113

Yes, sorry for leaving that out of the environment information. We haven't tried earlier versions, because we are only now moving our infra from the old system to Aim. All I can say about our environment is that I use docker to bring up the UI and the server. They use a custom repo path, something like /home/aim, though I guess that doesn't matter. Over the last hour I made about 10 runs with simple training, all identical. 2 of the 10 ended with the error I described here.

Alexponomarev7 avatar Mar 13 '23 17:03 Alexponomarev7

Hmm, pretty strange. I'll try to reproduce it on my end; it doesn't seem to be something obvious or setup-related. I'll ping you once there are any updates.

mihran113 avatar Mar 13 '23 17:03 mihran113

@mihran113 Hi! We have been using Aim for 2 days and we now have more context on what happens when this problem occurs.

On the client side we got this:

sqlalchemy.exc.InvalidRequestError: This session is in 'prepared' state; no further SQL can be emitted within this transaction.

On the backend side we got this: (image)

Alexponomarev7 avatar Mar 16 '23 12:03 Alexponomarev7

This one is not about initializing the run; it seems to be about using track. But it is also an sqlalchemy problem, so the two may be connected.

Alexponomarev7 avatar Mar 16 '23 12:03 Alexponomarev7

I'm experiencing the same issue. I'm using the lightning adapter, and this happens both when runs fail and when they succeed!

The only way I've managed to consistently overcome the issue is by running:

pip uninstall sqlalchemy
pip install "sqlalchemy<2.0.0" -U

If I had to guess, this could be an sqlalchemy caching issue?
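A quick way to confirm that an environment actually satisfies that pin is to check the installed version at startup. The following is only a sketch: the helper names `parse_major` and `sqlalchemy_is_pre_2` are made up for illustration, and passing the check does not show that the pin resolves the error, only that the pin is in effect.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_major(version_string: str) -> int:
    """Return the leading integer of a version string such as '1.4.46'."""
    return int(version_string.split(".")[0])


def sqlalchemy_is_pre_2(installed=None) -> bool:
    """Return True if the given (or installed) sqlalchemy version is below 2.0.0.

    `installed` is a hypothetical override used for testing; when it is None,
    the version is read from the current environment's package metadata.
    """
    if installed is None:
        try:
            installed = version("sqlalchemy")
        except PackageNotFoundError:
            return False  # sqlalchemy is not installed in this environment
    return parse_major(installed) < 2


if __name__ == "__main__":
    print("sqlalchemy < 2.0.0:", sqlalchemy_is_pre_2())
```

Since the aim server and the client may run in different environments (for example, the server in docker), the check would need to be run on both sides.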

constd avatar Mar 17 '23 22:03 constd

Hey @Alexponomarev7! I wasn't able to reproduce it on my end. If possible could you please share an example script that might help to reproduce it? The warnings on server side indicate that it might have something to do with adding tags to runs.

mihran113 avatar Mar 20 '23 14:03 mihran113

I haven't been able to create a simple example script but I'm running into the same issue with sqlalchemy 1.4.39 when adding tags to a run immediately after creation.

run = Run(
    repo=repo_path,
    experiment=experiment,
    **self._aim_run_kwargs,
)
for t in self._tags:
    run.add_tag(t)

However, it doesn't occur every time, and I haven't been able to determine the exact conditions under which it does. If I do, I'll post an update here.

Here is the relevant portion of the stacktrace:

Traceback (most recent call last):
  File "/py_env/quickstart/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 853, in _on_result
    on_result(trial, *args, **kwargs)
  File "/py_env/quickstart/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 1187, in _on_trial_reset
    self._actor_started(tracked_actor, log="REUSED")
  File "/py_env/quickstart/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 764, in _actor_started
    self._callbacks.on_trial_start(
  File "/py_env/quickstart/lib/python3.11/site-packages/ray/tune/callback.py", line 384, in on_trial_start
    callback.on_trial_start(**info)
  File "/py_env/quickstart/lib/python3.11/site-packages/ray/tune/logger/logger.py", line 145, in on_trial_start
    self.log_trial_start(trial)
  File "/py_env/quickstart/lib/python3.11/site-packages/pln/fitting/tune_callbacks.py", line 96, in log_trial_start
    self._trial_to_run[trial] = self._create_run(trial)
                                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/py_env/quickstart/lib/python3.11/site-packages/pln/fitting/tune_callbacks.py", line 75, in _create_run
    run.add_tag(t)
  File "/py_env/quickstart/lib/python3.11/site-packages/aim/sdk/run.py", line 254, in add_tag
    return self.props.add_tag(value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/py_env/quickstart/lib/python3.11/site-packages/aim/storage/structured/proxy.py", line 86, in add_tag
    return self._rpc_client.run_instruction(self._hash, self._handler, 'add_tag', (value,))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/py_env/quickstart/lib/python3.11/site-packages/aim/ext/transport/client.py", line 260, in run_instruction
    return self._run_read_instructions(queue_id, resource, method, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/py_env/quickstart/lib/python3.11/site-packages/aim/ext/transport/client.py", line 285, in _run_read_instructions
    raise_exception(status_msg.header.exception)
  File "/py_env/quickstart/lib/python3.11/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
    raise exception(*args) if args else exception()
sqlalchemy.exc.InvalidRequestError: This session is in 'prepared' state; no further SQL can be emitted within this transaction.
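Because the failure is intermittent, one possible stopgap on the client is to retry the failing call a few times. This is a sketch under assumptions, not a fix: `call_with_retry` is a hypothetical helper, and it is not known whether the server-side session recovers between attempts, so a retry may hit the same error again.

```python
import time


def call_with_retry(func, *args, attempts=3, delay=0.5, retry_on=(Exception,)):
    """Call func(*args), retrying on the exception types listed in retry_on.

    Hypothetical helper: re-raises the last error once `attempts` calls
    have all failed, so persistent failures are still visible.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)  # brief pause before the next attempt


# Hypothetical usage with the tagging loop above (not executed here):
# for t in tags:
#     call_with_retry(run.add_tag, t)
```

To avoid masking unrelated bugs, `retry_on` can be narrowed to the error seen in the traceback, for example `retry_on=(sqlalchemy.exc.InvalidRequestError,)`.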

nateyoder avatar Aug 18 '23 18:08 nateyoder