`cluster.wait_ready()` fails with `'MissingModel' object is not callable`

Open • kpouget opened this issue 2 years ago • 1 comment

As part of my automated CodeFlare testing, I'm hitting this exception in `cluster.wait_ready()`:

Traceback (most recent call last):
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 180, in <module>
    sys.exit(main())
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 175, in main
    fire.Fire(Entrypoint())
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 49, in wrapper
    fct(*args, **kwargs)
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 148, in sdk_user_run_one
    test_sdk_user.run_one()
  File "/opt/ci-artifacts/src/testing/codeflare/test_sdk_user.py", line 165, in run_one
    timeout(entrypoint.main,
  File "/opt/ci-artifacts/src/testing/codeflare/test_sdk_user.py", line 148, in timeout
    return func(*args, **kwargs)
  File "/mnt/logs/002__run_one/sample.py", line 28, in main
    cluster.wait_ready()
  File "/opt/venv/lib/python3.9/site-packages/codeflare_sdk/cluster/cluster.py", line 221, in wait_ready
    status, ready = self.status(print_to_console=False)
  File "/opt/venv/lib/python3.9/site-packages/codeflare_sdk/cluster/cluster.py", line 160, in status
    appwrapper = _app_wrapper_status(self.config.name, self.config.namespace)
  File "/opt/venv/lib/python3.9/site-packages/codeflare_sdk/cluster/cluster.py", line 345, in _app_wrapper_status
    return _map_to_app_wrapper(cluster)
  File "/opt/venv/lib/python3.9/site-packages/codeflare_sdk/cluster/cluster.py", line 469, in _map_to_app_wrapper
    status=AppWrapperStatus(cluster_model.status.state.lower()),
TypeError: 'MissingModel' object is not callable
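
For readers unfamiliar with this error shape, here is a minimal sketch of how such a `TypeError` can arise. It assumes (this is not confirmed in the report) that the AppWrapper had no `status.state` field yet and that the client library returns a placeholder object for absent fields instead of raising. The `MissingModel`, `Model`, and `state_of` definitions below are hypothetical stand-ins written for illustration, not the SDK's or the client library's actual code.

```python
class MissingModel:
    """Hypothetical stand-in for a placeholder returned for absent fields."""

    def __getattr__(self, name):
        # Any attribute of a missing value is itself missing,
        # so `.lower` on a missing `state` is also a MissingModel.
        return self

    def __bool__(self):
        return False


MISSING = MissingModel()


class Model:
    """Hypothetical dict wrapper returning MISSING for absent keys."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        value = self._data.get(name, MISSING)
        return Model(value) if isinstance(value, dict) else value


def state_of(app_wrapper):
    """Return the lower-cased state, or None when no state is set yet."""
    state = app_wrapper.status.state
    if isinstance(state, MissingModel):
        return None
    return state.lower()


# An AppWrapper whose status has not been populated (assumed scenario).
pending = Model({"status": {}})

try:
    # Same call shape as the failing line in the traceback.
    pending.status.state.lower()
except TypeError as exc:
    error = str(exc)

print(error)
print(state_of(pending))
print(state_of(Model({"status": {"state": "Running"}})))
```

Under these assumptions, checking for the placeholder before calling `.lower()` avoids the exception and lets the caller treat "no state yet" as "not ready".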

This Python code is being executed (excerpt from `sample.py`):

    # Create our cluster and submit appwrapper
    cluster = Cluster(ClusterConfiguration(
        namespace=namespace, name=f"mnisttest-user{user_idx}",
        min_worker=2, max_worker=2,
        min_cpus=2, max_cpus=2,
        min_memory=4, max_memory=4,
        gpu=0,
        instascale=False))
    # Bring up the cluster
    cluster.up()
    cluster.wait_ready() # <-- this line raises the exception
    cluster.status()
    cluster.details()

    job_def = DDPJobDefinition(name="mnisttest", script="mnist.py", workspace=".", scheduler_args={"requirements": "./requirements.txt"})
    job = job_def.submit(cluster)

The RayCluster Pods are stuck in Pending because of https://github.com/project-codeflare/multi-cluster-app-dispatcher/issues/512, but codeflare-sdk shouldn't raise an exception because of it:

codeflare-sdk-user-test-user-1                     mnisttest-user1-head-v7fn8                                            0/1     Pending     0               6m43s   <none>         <none>                                       <none>           <none>
codeflare-sdk-user-test-user-1                     nisttest-user1-worker-small-group-mnisttest-user1-dwhb4               0/1     Pending     0               6m43s   <none>         <none>                                       <none>           <none>
codeflare-sdk-user-test-user-1                     nisttest-user1-worker-small-group-mnisttest-user1-xccpd               0/1     Pending     0               6m43s   <none>         <none>                                       <none>           <none>
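
Until the SDK handles this case, a test harness could tolerate the error on its side. The sketch below is a hypothetical workaround, not a fix: `wait_ready_tolerant` is a name introduced here, and it assumes the `TypeError` is transient (for example, that it disappears once the AppWrapper status is populated), which this report does not confirm.

```python
import time


def wait_ready_tolerant(wait_fn, attempts=10, delay=5.0, sleep=time.sleep):
    """Call wait_fn, retrying while it raises the 'not callable' TypeError.

    Any other exception, or the TypeError on the last attempt, propagates.
    """
    for attempt in range(1, attempts + 1):
        try:
            return wait_fn()
        except TypeError as exc:
            if "not callable" not in str(exc) or attempt == attempts:
                raise
            sleep(delay)


# Demonstration with a fake wait function that fails twice, then succeeds.
calls = {"count": 0}


def fake_wait_ready():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TypeError("'MissingModel' object is not callable")
    return "ready"


result = wait_ready_tolerant(fake_wait_ready, attempts=5, sleep=lambda _: None)
print(result, calls["count"])
```

In the test script above, this would be invoked as `wait_ready_tolerant(cluster.wait_ready)`. If the Pods stay Pending indefinitely, the retries only delay the failure, so the attempt limit should stay small.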

Here is the state of the AppWrapper (captured manually after the test): appwrapper.yaml.log


  • Codeflare SDK is installed from pip (latest version)
    • I'll remove the --quiet flag to capture the exact version being installed
  • Codeflare stack is installed from ODH + OpenShift Codeflare operator

kpouget • Jul 26 '23 08:07

This seems to be the same issue as https://github.com/project-codeflare/codeflare-sdk/issues/226

kpouget • Jul 26 '23 19:07