
Fix runtime lib loading logic

Open • ksivaman opened this issue 2 months ago • 1 comment

Description

This is a small refactor of the runtime library-loading logic that makes it more consistent and removes duplication. The main change is to prioritize system installations and to search Python packages for a library only as a last resort.

Fixes a bug where the wrong shared object (with a mismatched version) could be loaded because of PyPI packages installed as dependencies of PyTorch, JAX, etc.
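One Linux-only way to spot the symptom described above, a process that has mapped more than one copy of the same library, is to inspect /proc/self/maps. The sketch below assumes a Linux /proc filesystem; loaded_copies is a hypothetical helper for illustration and is not part of Transformer Engine.

```python
import os
from typing import List, Optional


def loaded_copies(lib_prefix: str, maps_text: Optional[str] = None) -> List[str]:
    """Return the distinct file paths of mapped objects whose basename starts
    with lib_prefix. Hypothetical diagnostic helper; Linux only."""
    if maps_text is None:
        # Each line: address perms offset dev inode [pathname]
        with open("/proc/self/maps") as f:
            maps_text = f.read()
    paths = set()
    for line in maps_text.splitlines():
        parts = line.split(maxsplit=5)
        # Anonymous mappings have no 6th field; pseudo entries like [heap]
        # do not start with "/", so both are skipped.
        if len(parts) == 6:
            path = parts[5].strip()
            if path.startswith("/") and os.path.basename(path).startswith(lib_prefix):
                paths.add(path)
    return sorted(paths)
```

If loaded_copies("libcudnn") returns both a system path and a path under site-packages, two versions of the library are mapped into the same process.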

Type of change

  • [ ] Documentation change (change only to the documentation, either a fix or new content)
  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [x] Infra/Build change
  • [x] Code refactoring

Changes

  • Remove duplicated loading logic for libraries such as cuRAND, cuDNN, etc.
  • Prioritize loading libraries from system locations (e.g. LD_LIBRARY_PATH) before checking Python packages.
  • Remove the ldconfig-based search, which is redundant and brute force.
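The search order listed above can be sketched as one shared helper that every library goes through. This is a minimal sketch, not the code in this PR: load_library, its parameters, and the nvidia/*/lib directory layout assumed for NVIDIA's PyPI wheels are all hypothetical choices for illustration.

```python
import ctypes
import site
from pathlib import Path
from typing import Callable, Iterable, List, Optional, Tuple

Loader = Callable[[str], object]


def _default_python_roots() -> List[Path]:
    """site-packages directories, searched only as a last resort."""
    roots = [Path(p) for p in site.getsitepackages()]
    roots.append(Path(site.getusersitepackages()))
    return roots


def load_library(
    name: str,
    python_roots: Optional[Iterable[Path]] = None,
    package_glob: str = "nvidia/*/lib",  # assumed wheel layout
    loader: Loader = ctypes.CDLL,
) -> Tuple[object, str]:
    """Hypothetical helper: load `name` (e.g. "libcudnn.so.9"), system first.

    Returns (handle, what_was_passed_to_loader).
    """
    # 1) System resolution: dlopen() with a bare name searches LD_LIBRARY_PATH,
    #    the ld.so cache (maintained by ldconfig) and the default system
    #    directories, so a separate `ldconfig -p` scan adds nothing.
    try:
        return loader(name), name
    except OSError:
        pass

    # 2) Last resort: copies shipped inside Python packages, such as wheels
    #    pulled in as dependencies of PyTorch or JAX.
    roots = _default_python_roots() if python_roots is None else python_roots
    for root in roots:
        for candidate in sorted(Path(root).glob(f"{package_glob}/{name}*")):
            try:
                return loader(str(candidate)), str(candidate)
            except OSError:
                continue

    raise OSError(f"Could not find {name} on the system or in Python packages")
```

Under these assumptions, load_library("libcudnn.so.9") would return the system copy whenever the dynamic loader can resolve it and fall back to a wheel-provided copy otherwise. Taking loader as a parameter keeps the search order testable without any CUDA libraries installed.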

Checklist:

  • [x] I have read and followed the contributing guidelines
  • [x] The functionality is complete
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [x] New and existing unit tests pass locally with my changes

ksivaman • Oct 23 '25 16:10

/te-ci

ksivaman • Oct 28 '25 20:10