gks: zero-initialize ref_cntx before use
Previously, ref_cntx was left uninitialized in bli_gks_cntx_ukr_is_ref() and bli_gks_cntx_ukr2_is_ref(). This lead to undefined behavior. The changes explicitly zero the structure to ensure deterministic behavior.
This fixes the issue mention at #882
Hi @jerryyiransun nice debugging work, but I'm concerned that your fix works as the context is supposed to be fully initialized by bli_cntx_init. I do see that this function returns a possible error code which is not checked. Could you modify ref_kernels/bli_cntx_ref.c at line 366 as follows:
err_t e_val = bli_cntx_init( cntx );
bli_check_error_code( e_val );
I am interested to see if this detects an error in the initialization.
~Alternatively, the error could be in getting the reference initialization function pointer which is also not checked. Please also modify frame/base/bli_gks.c line 357 as:~
err_t e_val = bli_gks_init_ref_cntx( &ref_cntx );
bli_check_error_code( e_val );
EDIT: never mind, the code IS checked there.
Would it be more appropriate to put the memset inside the function bli_gks_init_ref_cntx() ?
Hi @jerryyiransun nice debugging work, but I'm concerned that your fix works as the context is supposed to be fully initialized by
bli_cntx_init. I do see that this function returns a possible error code which is not checked. Could you modifyref_kernels/bli_cntx_ref.cat line 366 as follows:err_t e_val = bli_cntx_init( cntx ); bli_check_error_code( e_val );I am interested to see if this detects an error in the initialization.
I got a slightly different error from bli_check_error_code
clang ./obj/generic/testsuite/test_addm.o ./obj/generic/testsuite/test_addv.o ./obj/generic/testsuite/test_amaxv.o ./obj/generic/testsuite/test_axpbyv.o ./obj/generic/testsuite/test_axpy2v.o ./obj/generic/testsuite/test_axpyf.o ./obj/generic/testsuite/test_axpym.o ./obj/generic/testsuite/test_axpyv.o ./obj/generic/testsuite/test_copym.o ./obj/generic/testsuite/test_copyv.o ./obj/generic/testsuite/test_dotaxpyv.o ./obj/generic/testsuite/test_dotv.o ./obj/generic/testsuite/test_dotxaxpyf.o ./obj/generic/testsuite/test_dotxf.o ./obj/generic/testsuite/test_dotxv.o ./obj/generic/testsuite/test_gemm.o ./obj/generic/testsuite/test_gemm_ukr.o ./obj/generic/testsuite/test_gemmt.o ./obj/generic/testsuite/test_gemmtrsm_ukr.o ./obj/generic/testsuite/test_gemv.o ./obj/generic/testsuite/test_ger.o ./obj/generic/testsuite/test_hemm.o ./obj/generic/testsuite/test_hemv.o ./obj/generic/testsuite/test_her.o ./obj/generic/testsuite/test_her2.o ./obj/generic/testsuite/test_her2k.o ./obj/generic/testsuite/test_herk.o ./obj/generic/testsuite/test_invscalm.o ./obj/generic/testsuite/test_invscalv.o ./obj/generic/testsuite/test_libblis.o ./obj/generic/testsuite/test_normfm.o ./obj/generic/testsuite/test_normfv.o ./obj/generic/testsuite/test_randm.o ./obj/generic/testsuite/test_randv.o ./obj/generic/testsuite/test_scal2m.o ./obj/generic/testsuite/test_scal2v.o ./obj/generic/testsuite/test_scalm.o ./obj/generic/testsuite/test_scalv.o ./obj/generic/testsuite/test_setm.o ./obj/generic/testsuite/test_setv.o ./obj/generic/testsuite/test_subm.o ./obj/generic/testsuite/test_subv.o ./obj/generic/testsuite/test_symm.o ./obj/generic/testsuite/test_symv.o ./obj/generic/testsuite/test_syr.o ./obj/generic/testsuite/test_syr2.o ./obj/generic/testsuite/test_syr2k.o ./obj/generic/testsuite/test_syrk.o ./obj/generic/testsuite/test_trmm.o ./obj/generic/testsuite/test_trmm3.o ./obj/generic/testsuite/test_trmv.o ./obj/generic/testsuite/test_trsm.o ./obj/generic/testsuite/test_trsm_ukr.o ./obj/generic/testsuite/test_trsv.o ./obj/generic/testsuite/test_xpbym.o ./obj/generic/testsuite/test_xpbyv.o lib/generic/libblis.a -Wl,-bedit=no -m64 -L/data/zopen/usr/local/zopen/curl/curl-8.15.0.20250728_192752.zos/lib -L/data/zopen/usr/local/zopen/libpsl/libpsl-master.20250811_092633.zos/lib -L/data/zopen/usr/local/zopen/libssh2/libssh2-1.11.1.20250828_054328.zos/lib -L/data/zopen/usr/local/zopen/openssl/openssl-3.5.2.20250903_120438.zos/lib -L/data/zopen/usr/local/zopen/ncurses/ncurses-6.5.20250819_155443.zos/lib -L/data/zopen/usr/local/zopen/zoslib/zoslib-zopen2.20250825_142938.zos/lib -lcurl -lpsl -lssh2 -lssl -lcrypto -lncurses -lzoslib /data/zopen/usr/local/zopen/zoslib/zoslib-zopen2.20250825_142938.zos/lib/celquopt.s.o -lzoslib-supp /data/jerry/blisport/.zoslib_hooks/zoslib_env_hook.c.o -o test_libblis.x
./test_libblis.x -g ./testsuite/input.general.fast \
-o ./testsuite/input.operations.fast \
> output.testsuite
libblis: ref_kernels/bli_cntx_ref.c (line 367):
libblis: Failed to obtain lock.
libblis: Aborting.
1: 0x1A78B75C bli_abort+0x5c
2: 0x1A78F1C4 bli_cntx_init_generic_ref+0x134
3: 0x1A770A3A bli_gks_cntx_ukr_is_ref+0x98
4: 0x1A7717B4 bli_ind_init+0x5e
5: 0x1A776DCA bli_pthread_switch_on+0x68
6: 0x1A76D00E bli_init_once+0x46
7: 0x1A731A46 bli_thread_get_thread_impl+0x1a
8: 0x1A72EB72 libblis_test_output_params_struct+0x11a
9: 0x1A72D5FC libblis_test_read_params_file+0xea6
10: 0x1A945ADE main+0xe0
11: 0x1A945EFA CELQINIT+0x1aca [CELQINIT:]
l□terminatingl□bash: line 3: 16843084 SIGABRT ./test_libblis.x -g ./testsuite/input.general.fast -o ./testsuite/input.operations.fast > output.testsuite
Would it be more appropriate to put the memset inside the function
bli_gks_init_ref_cntx()?
good point, let me make that update
So there is definitely some UB going on. I would be very interested to see what happens if you do memset with 0xff instead of 0x00, and then print out the raw bytes of the structure after the call to f. The code is written such that all elements of the struct should be initialized and I can't see any codepath that would leave something unitialized---but at least I would like to know what field(s) it is.
While I know you're probably just trying to move forward with the z/OS port, I would like to make sure the bug is fixed "for good"...