mars icon indicating copy to clipboard operation
mars copied to clipboard

[BUG] test_ownership_when_scale_in hang

Open chaokunyang opened this issue 3 years ago • 3 comments

Describe the bug When running test case DEBUG_OSCAR=1 pytest -v -s mars/deploy/oscar/tests/test_ray.py::test_ownership_when_scale_in, it hangs occasionally.

To Reproduce To help us reproducing this bug, please provide information below:

  1. Your Python version: 3.7.9

  2. The version of Mars you use: master

  3. Versions of crucial packages, such as numpy, scipy and pandas

  4. Full stack of the error. image image image image

  5. Minimized code to reproduce the error.

Expected behavior test_ownership_when_scale_in should finish in less than 120 seconds

chaokunyang avatar May 17 '22 12:05 chaokunyang

image Seems mars hang at ray call.

chaokunyang avatar May 17 '22 13:05 chaokunyang

image core owker log:

(mars-py3.8-dev) chaokunyangdeMacBook-Pro:mars chaokunyang$ grep -nr "580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000" /tmp/ray/session_latest/logs/
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3016:[2022-05-17 21:55:33,621 D 12183 6422828] reference_count.cc:181: Adding owned object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3123:[2022-05-17 21:58:15,824 D 12183 6422227] reference_count.cc:1365: Published message for 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000, 0 locations, spilled url: [], spilled node ID: NIL_ID, and object size: 33624, and primary node ID: NIL_ID, pending creation? 1
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3124:[2022-05-17 21:58:15,824 D 12183 6422227] task_manager.cc:240: Task return object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000 has size 33624
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3125:[2022-05-17 21:58:15,824 D 12183 6422227] memory_store.cc:201: Putting object into memory store. objectid is 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3126:[2022-05-17 21:58:15,824 D 12183 6422227] reference_count.cc:240: Add local reference 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3127:[2022-05-17 21:58:15,824 D 12183 6422227] reference_count.cc:241: REF 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000 borrowers: 0 local_ref_count: 2 submitted_count: 0 contained_in_owned: 0 contained_in_borrowed: 0 contains: 0 stored_in: 0 lineage_ref_count: 0
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3128:[2022-05-17 21:58:15,837 D 12183 6422227] reference_count.cc:297: Remove local reference 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3129:[2022-05-17 21:58:15,837 D 12183 6422227] reference_count.cc:298: REF 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000 borrowers: 0 local_ref_count: 1 submitted_count: 0 contained_in_owned: 0 contained_in_borrowed: 0 contains: 0 stored_in: 0 lineage_ref_count: 0
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3130:[2022-05-17 21:58:15,837 D 12183 6422227] reference_count.cc:302: REF 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000 borrowers: 0 local_ref_count: 1 submitted_count: 0 contained_in_owned: 0 contained_in_borrowed: 0 contains: 0 stored_in: 0 lineage_ref_count: 0
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3131:[2022-05-17 21:58:15,837 D 12183 6422227] task_manager.cc:305: Task 580d375e7573d16eb11bba3846f2bc9f44b924b401000000 returned direct object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000, now has 1 plasma returns in scope
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3133:[2022-05-17 21:58:15,837 D 12183 6422227] reference_count.cc:1365: Published message for 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000, 0 locations, spilled url: [], spilled node ID: NIL_ID, and object size: 33624, and primary node ID: NIL_ID, pending creation? 0
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3136:[2022-05-17 21:58:15,837 D 12183 6422828] reference_count.cc:297: Remove local reference 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3137:[2022-05-17 21:58:15,837 D 12183 6422828] reference_count.cc:298: REF 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000 borrowers: 0 local_ref_count: 0 submitted_count: 0 contained_in_owned: 0 contained_in_borrowed: 0 contains: 0 stored_in: 0 lineage_ref_count: 0
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3138:[2022-05-17 21:58:15,837 D 12183 6422828] reference_count.cc:523: Attempting to delete object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3139:[2022-05-17 21:58:15,837 D 12183 6422828] reference_count.cc:529: REF 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000 borrowers: 0 local_ref_count: 0 submitted_count: 0 contained_in_owned: 0 contained_in_borrowed: 0 contains: 0 stored_in: 0 lineage_ref_count: 0
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3140:[2022-05-17 21:58:15,837 D 12183 6422828] reference_count.cc:562: Deleting Reference to object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3141:[2022-05-17 21:58:15,837 D 12183 6422828] reference_count.cc:385: Releasing lineage for object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3142:[2022-05-17 21:58:15,837 D 12183 6422828] task_manager.cc:534: No lineage for object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-worker-ec515be87c40bb2a049838dc504f82c51676eb1298b6efc1bd8ac8fe_12359.log:3978:[2022-05-17 21:58:15,818 D 12359 6433770] core_worker.cc:2219: Creating return object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-worker-ec515be87c40bb2a049838dc504f82c51676eb1298b6efc1bd8ac8fe_12359.log:4008:[2022-05-17 21:58:15,821 D 12359 6433770] core_worker.cc:2411: Sealing return object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000

chaokunyang avatar May 17 '22 14:05 chaokunyang

Seems

image core owker log:

(mars-py3.8-dev) chaokunyangdeMacBook-Pro:mars chaokunyang$ grep -nr "580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000" /tmp/ray/session_latest/logs/
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3016:[2022-05-17 21:55:33,621 D 12183 6422828] reference_count.cc:181: Adding owned object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3123:[2022-05-17 21:58:15,824 D 12183 6422227] reference_count.cc:1365: Published message for 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000, 0 locations, spilled url: [], spilled node ID: NIL_ID, and object size: 33624, and primary node ID: NIL_ID, pending creation? 1
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3124:[2022-05-17 21:58:15,824 D 12183 6422227] task_manager.cc:240: Task return object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000 has size 33624
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3125:[2022-05-17 21:58:15,824 D 12183 6422227] memory_store.cc:201: Putting object into memory store. objectid is 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3126:[2022-05-17 21:58:15,824 D 12183 6422227] reference_count.cc:240: Add local reference 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3127:[2022-05-17 21:58:15,824 D 12183 6422227] reference_count.cc:241: REF 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000 borrowers: 0 local_ref_count: 2 submitted_count: 0 contained_in_owned: 0 contained_in_borrowed: 0 contains: 0 stored_in: 0 lineage_ref_count: 0
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3128:[2022-05-17 21:58:15,837 D 12183 6422227] reference_count.cc:297: Remove local reference 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3129:[2022-05-17 21:58:15,837 D 12183 6422227] reference_count.cc:298: REF 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000 borrowers: 0 local_ref_count: 1 submitted_count: 0 contained_in_owned: 0 contained_in_borrowed: 0 contains: 0 stored_in: 0 lineage_ref_count: 0
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3130:[2022-05-17 21:58:15,837 D 12183 6422227] reference_count.cc:302: REF 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000 borrowers: 0 local_ref_count: 1 submitted_count: 0 contained_in_owned: 0 contained_in_borrowed: 0 contains: 0 stored_in: 0 lineage_ref_count: 0
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3131:[2022-05-17 21:58:15,837 D 12183 6422227] task_manager.cc:305: Task 580d375e7573d16eb11bba3846f2bc9f44b924b401000000 returned direct object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000, now has 1 plasma returns in scope
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3133:[2022-05-17 21:58:15,837 D 12183 6422227] reference_count.cc:1365: Published message for 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000, 0 locations, spilled url: [], spilled node ID: NIL_ID, and object size: 33624, and primary node ID: NIL_ID, pending creation? 0
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3136:[2022-05-17 21:58:15,837 D 12183 6422828] reference_count.cc:297: Remove local reference 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3137:[2022-05-17 21:58:15,837 D 12183 6422828] reference_count.cc:298: REF 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000 borrowers: 0 local_ref_count: 0 submitted_count: 0 contained_in_owned: 0 contained_in_borrowed: 0 contains: 0 stored_in: 0 lineage_ref_count: 0
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3138:[2022-05-17 21:58:15,837 D 12183 6422828] reference_count.cc:523: Attempting to delete object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3139:[2022-05-17 21:58:15,837 D 12183 6422828] reference_count.cc:529: REF 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000 borrowers: 0 local_ref_count: 0 submitted_count: 0 contained_in_owned: 0 contained_in_borrowed: 0 contains: 0 stored_in: 0 lineage_ref_count: 0
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3140:[2022-05-17 21:58:15,837 D 12183 6422828] reference_count.cc:562: Deleting Reference to object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3141:[2022-05-17 21:58:15,837 D 12183 6422828] reference_count.cc:385: Releasing lineage for object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_12183.log:3142:[2022-05-17 21:58:15,837 D 12183 6422828] task_manager.cc:534: No lineage for object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-worker-ec515be87c40bb2a049838dc504f82c51676eb1298b6efc1bd8ac8fe_12359.log:3978:[2022-05-17 21:58:15,818 D 12359 6433770] core_worker.cc:2219: Creating return object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000
/tmp/ray/session_latest/logs//python-core-worker-ec515be87c40bb2a049838dc504f82c51676eb1298b6efc1bd8ac8fe_12359.log:4008:[2022-05-17 21:58:15,821 D 12359 6433770] core_worker.cc:2411: Sealing return object 580d375e7573d16eb11bba3846f2bc9f44b924b40100000001000000

Seems executed ray call didn't return in 3 minutes

chaokunyang avatar May 17 '22 14:05 chaokunyang