blis icon indicating copy to clipboard operation
blis copied to clipboard

gks: zero-initialize ref_cntx before use

Open jerryyiransun opened this issue 4 months ago • 6 comments

Previously, ref_cntx was left uninitialized in bli_gks_cntx_ukr_is_ref() and bli_gks_cntx_ukr2_is_ref(). This lead to undefined behavior. The changes explicitly zero the structure to ensure deterministic behavior.

This fixes the issue mention at #882

jerryyiransun avatar Aug 26 '25 19:08 jerryyiransun

Hi @jerryyiransun nice debugging work, but I'm concerned that your fix works as the context is supposed to be fully initialized by bli_cntx_init. I do see that this function returns a possible error code which is not checked. Could you modify ref_kernels/bli_cntx_ref.c at line 366 as follows:

	err_t e_val = bli_cntx_init( cntx );
	bli_check_error_code( e_val );

I am interested to see if this detects an error in the initialization.

devinamatthews avatar Aug 28 '25 16:08 devinamatthews

~Alternatively, the error could be in getting the reference initialization function pointer which is also not checked. Please also modify frame/base/bli_gks.c line 357 as:~

    err_t e_val = bli_gks_init_ref_cntx( &ref_cntx );
    bli_check_error_code( e_val );

EDIT: never mind, the code IS checked there.

devinamatthews avatar Aug 28 '25 16:08 devinamatthews

Would it be more appropriate to put the memset inside the function bli_gks_init_ref_cntx() ?

hominhquan avatar Aug 29 '25 00:08 hominhquan

Hi @jerryyiransun nice debugging work, but I'm concerned that your fix works as the context is supposed to be fully initialized by bli_cntx_init. I do see that this function returns a possible error code which is not checked. Could you modify ref_kernels/bli_cntx_ref.c at line 366 as follows:

	err_t e_val = bli_cntx_init( cntx );
	bli_check_error_code( e_val );

I am interested to see if this detects an error in the initialization.

I got a slightly different error from bli_check_error_code

clang ./obj/generic/testsuite/test_addm.o ./obj/generic/testsuite/test_addv.o ./obj/generic/testsuite/test_amaxv.o ./obj/generic/testsuite/test_axpbyv.o ./obj/generic/testsuite/test_axpy2v.o ./obj/generic/testsuite/test_axpyf.o ./obj/generic/testsuite/test_axpym.o ./obj/generic/testsuite/test_axpyv.o ./obj/generic/testsuite/test_copym.o ./obj/generic/testsuite/test_copyv.o ./obj/generic/testsuite/test_dotaxpyv.o ./obj/generic/testsuite/test_dotv.o ./obj/generic/testsuite/test_dotxaxpyf.o ./obj/generic/testsuite/test_dotxf.o ./obj/generic/testsuite/test_dotxv.o ./obj/generic/testsuite/test_gemm.o ./obj/generic/testsuite/test_gemm_ukr.o ./obj/generic/testsuite/test_gemmt.o ./obj/generic/testsuite/test_gemmtrsm_ukr.o ./obj/generic/testsuite/test_gemv.o ./obj/generic/testsuite/test_ger.o ./obj/generic/testsuite/test_hemm.o ./obj/generic/testsuite/test_hemv.o ./obj/generic/testsuite/test_her.o ./obj/generic/testsuite/test_her2.o ./obj/generic/testsuite/test_her2k.o ./obj/generic/testsuite/test_herk.o ./obj/generic/testsuite/test_invscalm.o ./obj/generic/testsuite/test_invscalv.o ./obj/generic/testsuite/test_libblis.o ./obj/generic/testsuite/test_normfm.o ./obj/generic/testsuite/test_normfv.o ./obj/generic/testsuite/test_randm.o ./obj/generic/testsuite/test_randv.o ./obj/generic/testsuite/test_scal2m.o ./obj/generic/testsuite/test_scal2v.o ./obj/generic/testsuite/test_scalm.o ./obj/generic/testsuite/test_scalv.o ./obj/generic/testsuite/test_setm.o ./obj/generic/testsuite/test_setv.o ./obj/generic/testsuite/test_subm.o ./obj/generic/testsuite/test_subv.o ./obj/generic/testsuite/test_symm.o ./obj/generic/testsuite/test_symv.o ./obj/generic/testsuite/test_syr.o ./obj/generic/testsuite/test_syr2.o ./obj/generic/testsuite/test_syr2k.o ./obj/generic/testsuite/test_syrk.o ./obj/generic/testsuite/test_trmm.o ./obj/generic/testsuite/test_trmm3.o ./obj/generic/testsuite/test_trmv.o ./obj/generic/testsuite/test_trsm.o ./obj/generic/testsuite/test_trsm_ukr.o ./obj/generic/testsuite/test_trsv.o ./obj/generic/testsuite/test_xpbym.o ./obj/generic/testsuite/test_xpbyv.o lib/generic/libblis.a -Wl,-bedit=no -m64  -L/data/zopen/usr/local/zopen/curl/curl-8.15.0.20250728_192752.zos/lib -L/data/zopen/usr/local/zopen/libpsl/libpsl-master.20250811_092633.zos/lib -L/data/zopen/usr/local/zopen/libssh2/libssh2-1.11.1.20250828_054328.zos/lib -L/data/zopen/usr/local/zopen/openssl/openssl-3.5.2.20250903_120438.zos/lib -L/data/zopen/usr/local/zopen/ncurses/ncurses-6.5.20250819_155443.zos/lib -L/data/zopen/usr/local/zopen/zoslib/zoslib-zopen2.20250825_142938.zos/lib -lcurl -lpsl -lssh2 -lssl -lcrypto -lncurses -lzoslib /data/zopen/usr/local/zopen/zoslib/zoslib-zopen2.20250825_142938.zos/lib/celquopt.s.o -lzoslib-supp /data/jerry/blisport/.zoslib_hooks/zoslib_env_hook.c.o -o test_libblis.x
./test_libblis.x -g ./testsuite/input.general.fast \
                   -o ./testsuite/input.operations.fast \
                    > output.testsuite

libblis: ref_kernels/bli_cntx_ref.c (line 367):
libblis: Failed to obtain lock.
libblis: Aborting.
 1: 0x1A78B75C bli_abort+0x5c
 2: 0x1A78F1C4 bli_cntx_init_generic_ref+0x134
 3: 0x1A770A3A bli_gks_cntx_ukr_is_ref+0x98
 4: 0x1A7717B4 bli_ind_init+0x5e
 5: 0x1A776DCA bli_pthread_switch_on+0x68
 6: 0x1A76D00E bli_init_once+0x46
 7: 0x1A731A46 bli_thread_get_thread_impl+0x1a
 8: 0x1A72EB72 libblis_test_output_params_struct+0x11a
 9: 0x1A72D5FC libblis_test_read_params_file+0xea6
 10: 0x1A945ADE main+0xe0
 11: 0x1A945EFA CELQINIT+0x1aca [CELQINIT:]
l□terminatingl□bash: line 3: 16843084 SIGABRT                    ./test_libblis.x -g ./testsuite/input.general.fast -o ./testsuite/input.operations.fast > output.testsuite

jerryyiransun avatar Sep 06 '25 01:09 jerryyiransun

Would it be more appropriate to put the memset inside the function bli_gks_init_ref_cntx() ?

good point, let me make that update

jerryyiransun avatar Sep 06 '25 01:09 jerryyiransun

So there is definitely some UB going on. I would be very interested to see what happens if you do memset with 0xff instead of 0x00, and then print out the raw bytes of the structure after the call to f. The code is written such that all elements of the struct should be initialized and I can't see any codepath that would leave something unitialized---but at least I would like to know what field(s) it is.

While I know you're probably just trying to move forward with the z/OS port, I would like to make sure the bug is fixed "for good"...

devinamatthews avatar Sep 12 '25 00:09 devinamatthews