ARM64 CentOS7 compatibility issues with djl/pytorch due to glibc requirements

Open peterzhuamazon opened this issue 1 year ago • 4 comments

We are seeing failures in ml-commons on arm64, where the libstdc++.so.6 bundled with pytorch requires glibc >= 2.18:

/opt/java/openjdk-21/bin/java: relocation error: /tmp/tmpfv4oghde/1/local-test-cluster/opensearch-2.15.0/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-aarch64/libstdc++.so.6: symbol __cxa_thread_atexit_impl, version GLIBC_2.18 not defined in file libc.so.6 with link time reference

https://ci.opensearch.org/ci/dbc/integ-test/2.15.0/9970/linux/arm64/tar/test-results/8297/integ-test/neural-search/without-security/local-cluster-logs/id-1/stderr.txt
https://ci.opensearch.org/ci/dbc/integ-test/2.15.0/9970/linux/arm64/tar/test-results/8297/integ-test/neural-search/without-security/stderr.txt
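For anyone triaging similar logs, here is a minimal sketch of pulling the offending library, symbol, and required version out of a relocation error line. The regex and the `parse_relocation_error` helper are my own illustration, not part of any OpenSearch or DJL tooling, and the pattern assumes the message format shown in the log above.

```python
import re

# Assumed pattern, modeled on the single loader error quoted in this issue.
RELOCATION_RE = re.compile(
    r"relocation error: (?P<library>\S+): "
    r"symbol (?P<symbol>\S+), version (?P<version>\S+) "
    r"not defined in file (?P<provider>\S+)"
)


def parse_relocation_error(line):
    """Return a dict with library, symbol, version, provider, or None if no match."""
    match = RELOCATION_RE.search(line)
    return match.groupdict() if match else None


# The error line from the stderr log in this issue.
line = (
    "/opt/java/openjdk-21/bin/java: relocation error: "
    "/tmp/tmpfv4oghde/1/local-test-cluster/opensearch-2.15.0/data/ml_cache/pytorch/"
    "1.13.1-cpu-precxx11-linux-aarch64/libstdc++.so.6: "
    "symbol __cxa_thread_atexit_impl, version GLIBC_2.18 "
    "not defined in file libc.so.6 with link time reference"
)

info = parse_relocation_error(line)
print(info["symbol"], info["version"], info["provider"])
```

Here the parsed fields show that the bundled libstdc++.so.6 wants `__cxa_thread_atexit_impl` at version `GLIBC_2.18` from the host's libc.so.6.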

Note that we use CentOS 7, which ships glibc 2.17, to build and test OpenSearch plugins. This issue crashes the cluster, which leaves integTest stuck midway with a connection reset. As a result, ML and ML-related plugins (ml-commons, neural-search, flow-framework) fail their tests. Tracing the logs back, this has been an issue on the arm64 TAR since 2.12.

CentOS 7 reaches end of life on 06/30, and this shouldn't be a problem on AL2, which has glibc 2.26.
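To make the version mismatch concrete, below is a rough sketch of a pre-flight check that compares a host's glibc against the `GLIBC_2.18` requirement from the error. The helper names (`parse_version`, `satisfies`, `host_glibc`) are hypothetical, and `os.confstr("CS_GNU_LIBC_VERSION")` is only expected to work on glibc-based Linux, so the sketch returns `None` elsewhere.

```python
import os


def parse_version(text):
    """Turn '2.17', 'GLIBC_2.18', or 'glibc 2.17' into a tuple such as (2, 17)."""
    number = text.replace("GLIBC_", "").split()[-1]
    return tuple(int(part) for part in number.split("."))


def satisfies(installed, required):
    """True if the installed glibc version is at least the required one."""
    # Compare as integer tuples: as plain strings, "2.9" would sort after "2.18".
    return parse_version(installed) >= parse_version(required)


def host_glibc():
    """Best-effort lookup of the running host's glibc, e.g. 'glibc 2.17'."""
    try:
        return os.confstr("CS_GNU_LIBC_VERSION")
    except (AttributeError, ValueError, OSError):
        return None  # not a glibc-based system, or the name is unsupported


# CentOS 7 (glibc 2.17) against the requirement from the relocation error.
print(satisfies("2.17", "GLIBC_2.18"))

current = host_glibc()
if current is not None:
    print(current, "ok" if satisfies(current, "GLIBC_2.18") else "too old")
```

A check like this could run before the integ tests start, so an unsupported image fails fast instead of hanging on a connection reset.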

We will switch to AL2 in 2.16 anyway due to k-NN: https://github.com/opensearch-project/opensearch-build/issues/4379

Note: This has affected ML, Flow-Framework, Neural-Search.

Thanks.

peterzhuamazon avatar Jun 17 '24 23:06 peterzhuamazon