
DAOS-11149 pool: bump GC ULT stack size

Open bfaccini opened this issue 2 years ago • 4 comments

It looks like the GC ULT stack can overflow in certain situations, possibly due to recent PMDK/SPDK/DPDK changes. So bump its stack size from the default (16K) to DSS_DEEP_STACK_SZ.

Signed-off-by: Bruno Faccini [email protected]

bfaccini avatar Aug 08 '22 14:08 bfaccini

Bug-tracker data:
Ticket title is 'osa/offline_drain.py:OSAOfflineDrain.test_osa_offline_drain_during_aggregation - ERROR: daos_engine:1 *** Process 341899 received signal 11 ***'
Status is 'Resolved'
Labels: 'ci_impact,daily_test,triaged'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-11149

github-actions[bot] avatar Aug 08 '22 15:08 github-actions[bot]

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-9924/3/display/redirect

daosbuild1 avatar Aug 11 '22 00:08 daosbuild1

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-9924/3/display/redirect

daosbuild1 avatar Aug 11 '22 00:08 daosbuild1

Test stage Functional Hardware Small completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-9924/3/display/redirect

daosbuild1 avatar Aug 11 '22 02:08 daosbuild1

@NiuYawei and @gnailzenh would be cool if you could review when you have some time ;-)

bfaccini avatar Aug 20 '22 12:08 bfaccini

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-9924/4/display/redirect

daosbuild1 avatar Aug 21 '22 15:08 daosbuild1

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-9924/4/display/redirect

daosbuild1 avatar Aug 21 '22 18:08 daosbuild1

You might want to consider merging master to improve the chances of the tests passing:

This branch is 2 commits ahead, 54 commits behind master.

brianjmurrell avatar Aug 22 '22 14:08 brianjmurrell

Abandoning this PR now that the problem has been identified and is suspected to be fixed by PR-9667, which has landed.

bfaccini avatar Oct 12 '22 09:10 bfaccini

Looks like these changes need to be pushed after all, as shown by the coredumps associated with DAOS-11832.

bfaccini avatar Oct 17 '22 14:10 bfaccini

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-9924/2/execution/node/1100/log

daosbuild1 avatar Oct 17 '22 20:10 daosbuild1

The only 2 test failures in the last CI session:

Existing failures - 2
Test Hardware / Functional Hardware Medium / 3-./osa/online_extend.py:OSAOnlineExtend.test_osa_online_extend_oclass;run-aggregation-checksum-container-daos_racer-extra_servers-hosts-ior-client_processes-iorflags-job_manager-loop_test-mdtest-wr_size-32K-pool-rebuild-server_config-servers-0-1-setup-test_obj_class-test_ranks-9fd5 – FTEST_osa.OSAOnlineExtend

Test Hardware / Functional Hardware Medium / 4-./osa/online_extend.py:OSAOnlineExtend.test_osa_online_extend_mdtest;run-aggregation-checksum-container-daos_racer-extra_servers-hosts-ior-client_processes-iorflags-job_manager-loop_test-mdtest-wr_size-32K-pool-rebuild-server_config-servers-0-1-setup-test_obj_class-test_ranks-9fd5 – FTEST_osa.OSAOnlineExtend

appear to be definitely related to DAOS-11936, whose associated fix (PR-10604) is currently under review.

bfaccini avatar Oct 19 '22 09:10 bfaccini