
[WASI-NN] tensorrt: Add support for TensorRT-LLM

Open · ibmibmibm opened this issue on Nov 18, 2024 · 2 comments

ibmibmibm · Nov 18, 2024

Hello, I am a code review agent on flows.network. Here are my reviews of changed source code files in this PR.


.github/workflows/matrix-extensions.json

Potential issues

  1. Duplicate bin and target for TensorRT-LLM: The wasi_nn-tensorrt plugin uses the same bin and target as other backends (libwasmedgePluginWasiNN), which could lead to build conflicts or unexpected behavior if all backends are compiled together.

  2. Missing CUDA Architecture Spec for TensorRT-LLM: The wasi_nn-tensorrt plugin specifies a CUDA compiler but lacks CMAKE_CUDA_ARCHITECTURES, unlike similar plugins like wasmedge_stablediffusion-cuda-12. This may lead to suboptimal or incompatible GPU code generation.

  3. Platform Specificity for TensorRT-LLM: The platforms field for wasi_nn-tensorrt includes only one platform (ubuntu2004_tensorrt_cuda12). If TensorRT-LLM is intended to support more platforms, this list should be expanded to match other backend plugins.

Summary of changes

  • Added support for TensorRT-LLM in the wasi_nn-tensorrt plugin.
  • Configured the build options to include -DWASMEDGE_PLUGIN_WASI_NN_BACKEND=TensorRT and to specify the CUDA compiler path.
  • Targeted the platform ubuntu2004_tensorrt_cuda12 specifically for this configuration.
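
Issue 2 could be addressed by adding an explicit architecture list to the build options. The following is a minimal sketch of what such a matrix entry might look like; the field names and values here are guesses at the file's schema, not copied from the PR:

```json
{
  "plugin": "wasi_nn-tensorrt",
  "bin": "libwasmedgePluginWasiNN.so",
  "options": "-DWASMEDGE_PLUGIN_WASI_NN_BACKEND=TensorRT -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_CUDA_ARCHITECTURES=86",
  "platforms": ["ubuntu2004_tensorrt_cuda12"]
}
```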

.github/workflows/reusable-build-extensions.yml

Potential issues

  1. Matrix Duplication Issue: The ubuntu2004_x86_64 asset tag is duplicated in multiple matrix entries within the build_on_ubuntu job, which could lead to confusion or unintended behavior.

  2. Hardcoded Docker Tags: The Docker tags (ubuntu-20.04-build-clang-plugins-deps, etc.) are hardcoded in the workflow YAML. This can make maintenance and updates more difficult if these tags change.

  3. Conditional Logic for Release Builds: The conditional logic for release builds in build_on_ubuntu_latest (if: ${{ !inputs.release }}) might unintentionally exclude valid scenarios where a specific combination of inputs should trigger this job, leading to incomplete build coverage.

Summary of changes

  • Added support for a new CI job targeting ubuntu2004_tensorrt_cuda12.
  • Included ubuntu2004_tensorrt_cuda12 in the list of asset tags to be processed.
  • Configured a matrix strategy to build and test with TensorRT-LLM using CUDA 12 on Ubuntu 20.04.
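
As a rough illustration of the matrix addition described above, the new entry could take a shape like the following; the key names and the docker_tag value are assumptions about the workflow's matrix layout, not the PR's exact YAML:

```yaml
strategy:
  matrix:
    include:
      - name: TensorRT-LLM (CUDA 12, Ubuntu 20.04)
        asset_tag: ubuntu2004_tensorrt_cuda12
        docker_tag: ubuntu-20.04-tensorrt-cuda-12
```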

cmake/WASINNDeps.cmake

Potential issues

  1. Missing Architecture Check for TensorRT: The wasmedge_setup_tensorrt_target function does not include architecture checks similar to those in the TensorFlow setup functions, which can lead to build failures on unsupported architectures.

  2. No Dependency Handling for TensorRT Plugins: The wasmedge_setup_tensorrt_target function links against nvinfer_plugin_tensorrt_llm, but it does not ensure that this plugin is available or handle cases where it might not be found, leading to potential linking errors.

  3. Hardcoded Tag in TensorRT-LLM Setup: The wasmedge_setup_tensorrt_target function uses a hardcoded tag (v0.14.0) for fetching the TensorRT-LLM repository. This should be parameterized or configurable to allow updates without modifying the code directly.

Summary of changes

  • Added TensorRT backend support: Introduced a new backend option for TensorRT in the wasmedge_setup_wasinn_target function.
  • Included TensorRT-LLM library: Defined a new function wasmedge_setup_tensorrt_target to fetch, build, and link the TensorRT-LLM library.
  • Configured CMake options for TensorRT-LLM: Set internal CMake variables to disable unnecessary builds (PyTorch, tests, benchmarks) when preparing the TensorRT-LLM target.
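
Issue 3 (the hardcoded v0.14.0 tag) could be fixed by exposing the tag as a cache variable. A minimal sketch, assuming the function fetches the repository with FetchContent; the variable name WASMEDGE_TENSORRT_LLM_TAG is hypothetical:

```cmake
# Expose the TensorRT-LLM tag as a user-settable cache variable instead of
# hardcoding it inside wasmedge_setup_tensorrt_target.
set(WASMEDGE_TENSORRT_LLM_TAG "v0.14.0" CACHE STRING
    "Git tag of the TensorRT-LLM repository to fetch")

include(FetchContent)
FetchContent_Declare(
  tensorrt_llm
  GIT_REPOSITORY https://github.com/NVIDIA/TensorRT-LLM.git
  GIT_TAG        ${WASMEDGE_TENSORRT_LLM_TAG}
)
FetchContent_MakeAvailable(tensorrt_llm)
```

Builders can then override the tag on the command line without touching the CMake code.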

plugins/wasi_nn/CMakeLists.txt

Potential issues

  1. Case Sensitivity of Backend Names: The BACKEND variable is converted to lowercase using string(TOLOWER ${BACKEND} BACKEND), which may lead to unexpected behavior if the backend names are case-sensitive elsewhere in the code.

  2. Hardcoded MLX Sources: The inclusion of MLX-specific source files is hardcoded within the foreach loop, which could make it difficult to manage or add new backends without modifying this section of the code.

  3. No Backend Name Validation: There is no validation to ensure that the WASMEDGE_PLUGIN_WASI_NN_BACKEND list contains valid backend names before they are processed in the foreach loop, potentially leading to build errors if an invalid name is included.

Summary of changes

  • Added wasinn_tensorrt.cpp to the list of source files in the wasmedgePluginWasiNN library.
  • This change compiles the TensorRT-LLM backend into the WASI-NN plugin, adding its inference capabilities to WasmEdge.

plugins/wasi_nn/wasinn_tensorrt.cpp

Potential issues

  1. Incorrect Error Handling in Tokenizer::load: The function returns ErrNo::InvalidArgument on a std::bad_alloc, which is incorrect since std::bad_alloc indicates an allocation failure rather than an invalid argument. It should return a different error code or handle the exception differently to reflect the memory allocation issue.

  2. Potential Buffer Overflow in Tokenizer::parse: The function uses reinterpret_cast<const char *>(NormalizedString.data()) + I without checking if I is within bounds, which could lead to accessing memory out of the NormalizedString buffer if I exceeds the string's size minus the length of a special token. This can cause undefined behavior.

  3. Inconsistent Error Handling in load: When parsing metadata, if an error occurs (Parser.parse(Metadata).get(Doc) fails), it logs an error and pops back the last added graph from Env.NNGraph. However, for subsequent errors within the same block (e.g., missing keys or invalid values), it does not remove the graph, leaving the environment in a potentially inconsistent state. All error paths should revert partial state changes.

Summary of changes

  • Added TensorRT-LLM Support: Introduced comprehensive support for TensorRT-LLM, including tokenizer initialization and parsing/unparsing logic.
  • Plugin Initialization: Added a LoggerProxy class to handle TensorRT logging and initialized TensorRT-LLM plugins using initTrtLlmPlugins.
  • Graph Loading Logic: Enhanced the graph loading function to handle required files (rank0.engine, config.json, tokenizer.json) and additional metadata options, ensuring proper initialization of the TensorRT executor.
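
For issue 2, the special-token scan needs an explicit bounds check before each comparison. A minimal sketch of such a guard, reusing the names NormalizedString and Token from the review; the helper function itself is hypothetical:

```cpp
#include <cstddef>
#include <string_view>

// Returns true only if Token occurs at byte offset I of NormalizedString.
// The guard ensures the comparison never reads past the end of the buffer.
bool matchesSpecialTokenAt(std::string_view NormalizedString, std::size_t I,
                           std::string_view Token) {
  if (I > NormalizedString.size() ||
      Token.size() > NormalizedString.size() - I) {
    return false; // Not enough bytes left to hold the token.
  }
  return NormalizedString.compare(I, Token.size(), Token) == 0;
}
```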

plugins/wasi_nn/wasinn_tensorrt.h

Potential issues

  1. Issue with Tokenizer Initialization: The Tokenizer class has static members Byte2U8Char and U8Char2Byte that are never initialized, which could lead to undefined behavior if used before initialization.

  2. Potential Memory Leak in Graph Structure: The Graph struct contains a std::unique_ptr<tensorrt_llm::executor::Executor> but does not have a custom destructor or move semantics defined. If an exception occurs during the construction of Executor, it could lead to resource leaks.

  3. Inconsistent Initialization of Special Token IDs: In the Tokenizer class, several special token IDs (e.g., SEPId, PADId) are initialized to -1, which might not be a valid or intended value for these identifiers in the context of the application. This could cause issues during token parsing and unparsing if expected values differ.

Summary of changes

  • Added TensorRT-LLM support: Introduced new classes and structures (Tokenizer, Graph, Context) to support TensorRT-LLM in the WASI-NN backend.
  • Conditional compilation with WASMEDGE_PLUGIN_WASI_NN_BACKEND_TENSORRT: Included TensorRT-specific code only when the corresponding plugin is enabled, ensuring modularity.
  • Enhanced tokenizer functionality: Implemented methods for loading token mappings from a JSON file and parsing/unparsing strings into tokens, facilitating interaction with LLM models.
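
One way to resolve issue 1 is to replace the uninitialized static members with function-local statics, which are guaranteed to be initialized on first use. A sketch under the assumption that Byte2U8Char is a byte-to-UTF-8 lookup table; the construction logic is omitted because the PR's code is not shown here:

```cpp
#include <cstdint>
#include <map>
#include <string>

class Tokenizer {
public:
  // A function-local static is initialized exactly once, on first call,
  // so no caller can ever observe the map in an uninitialized state.
  static const std::map<uint8_t, std::string> &byte2U8Char() {
    static const std::map<uint8_t, std::string> Map = [] {
      std::map<uint8_t, std::string> M;
      // Build the byte-to-unicode table here; the exact entries depend on
      // the tokenizer scheme and are omitted in this sketch.
      return M;
    }();
    return Map;
  }
};
```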

plugins/wasi_nn/wasinnenv.cpp

Potential issues

  1. Duplicate Backend Entry: The BackendMap contains duplicate entries for "pytorchaoti"sv and "tensorflow", which might cause confusion or unexpected behavior when resolving backends.

  2. Error Handling in Model Loading: In the load function, if FileSize >= Data.max_size(), it logs an error but does not explicitly close the file, leading to potential resource leakage.

  3. Magic String in Code: The code uses magic strings like "unix://" for URI validation without defining constants, which can lead to maintenance issues and errors if these values need to change.

Summary of changes

  • Added TensorRT backend to the BackendMap with key "tensorrt".
  • Introduced file size validation in the load function, checking if the file is too big before reading its content.
  • Corrected the formatting for the BackendMap initialization by adding a missing comma.
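
The file-size validation mentioned in the summary, combined with issue 2's concern about closing the file, might look roughly like this; the function shape and return type are assumptions rather than the PR's code. Note that std::ifstream closes itself through RAII, so no explicit close call is required:

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

bool readModelFile(const std::string &Path, std::vector<uint8_t> &Data) {
  std::ifstream File(Path, std::ios::binary | std::ios::ate);
  if (!File.is_open()) {
    return false;
  }
  const std::streamsize FileSize = File.tellg();
  // Reject files the container cannot hold before attempting to read.
  if (FileSize < 0 ||
      static_cast<std::uint64_t>(FileSize) >= Data.max_size()) {
    return false; // The stream is closed automatically on scope exit.
  }
  Data.resize(static_cast<std::size_t>(FileSize));
  File.seekg(0, std::ios::beg);
  return static_cast<bool>(
      File.read(reinterpret_cast<char *>(Data.data()), FileSize));
}
```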

plugins/wasi_nn/wasinnenv.h

Potential issues

  1. Issue with __builtin_unreachable() Usage: The __builtin_unreachable() function is used in switch statements without ensuring that all cases are handled, which can lead to undefined behavior if an unexpected backend is encountered. Ensure that the default case handles or logs unknown backends.

  2. Potential Data Race in Context::init: The Graph::getBackend() call within Context::init could lead to a data race if Graph's state is modified concurrently. Consider adding locks or ensuring thread safety for Graph operations accessed from multiple threads.

  3. Lack of Error Handling in WasiNNEnvironment::mdBuild: The Load callback's return value is checked, but there is no handling or logging if an error occurs. Implement proper error handling and logging to diagnose issues when Load fails.

Summary of changes

  • Added TensorRT Support: Included wasinn_tensorrt.h to integrate support for TensorRT in the project.

  • Enhanced Model Deployment Options: Extended the backend options by adding TensorRT, which allows for potentially more efficient inference on NVIDIA GPUs.

  • Prepared for LLM Integration: The TensorRT integration specifically references "TensorRT-LLM," indicating preparation for Large Language Model support via this backend.
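
For issue 1 above, a defensive alternative to a bare __builtin_unreachable() is to fail loudly on out-of-range enum values. A minimal sketch; the enum and function are illustrative, not the header's actual code:

```cpp
#include <cstdio>
#include <cstdlib>

enum class Backend { GGML, TensorRT };

const char *backendName(Backend B) {
  switch (B) {
  case Backend::GGML:
    return "GGML";
  case Backend::TensorRT:
    return "TensorRT";
  }
  // Control reaches here only for an out-of-range enum value; abort with a
  // diagnostic instead of invoking undefined behavior.
  std::fprintf(stderr, "unknown WASI-NN backend value\n");
  std::abort();
}
```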

plugins/wasi_nn/wasinntypes.h

Potential issues

  1. Missing Autodetect in FOR_EACH_BACKEND: The macro lists all backends except Autodetect, which could lead to errors or inconsistencies when iterating over backends using this macro.

  2. AUTO Device Not Handled in Formatter: The switch statement in fmt::formatter<WasmEdge::Host::WASINN::Device> lacks a case for AUTO, which will result in "Unknown" being printed when AUTO is encountered.

  3. Missing TensorRT-LLM Backend Definition: The patch title indicates support for TensorRT-LLM, but there is no corresponding entry in the Backend enum or the FOR_EACH_BACKEND macro, which would prevent using it.

Summary of changes

  • Added TensorRT to the Backend Enum: Introduced TensorRT as a new backend option in the Backend enum.
  • Updated FOR_EACH_BACKEND Macro: Included F(TensorRT) in the macro to ensure it processes TensorRT alongside other backends.
  • Modified Backend Count Implicitly: Increased the number of supported backends by adding TensorRT, which affects any logic relying on backend enumeration.
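
The change described above follows the X-macro pattern: adding F(TensorRT) makes every FOR_EACH_BACKEND expansion pick up the new backend automatically. A simplified sketch; the backend list here is abbreviated and may not match the real header:

```cpp
#include <cstdint>

// Each F(X) entry is expanded wherever per-backend code is generated.
#define FOR_EACH_BACKEND(F)                                                    \
  F(OpenVINO)                                                                  \
  F(PyTorch)                                                                   \
  F(TensorFlowLite)                                                            \
  F(GGML)                                                                      \
  F(TensorRT)

enum class Backend : uint32_t {
#define DEFINE_BACKEND(B) B,
  FOR_EACH_BACKEND(DEFINE_BACKEND)
#undef DEFINE_BACKEND
};
```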

utils/docker/Dockerfile.ubuntu-cuda

Potential issues

  1. The TENSORRT ARG does not actually gate the conditional installation of TensorRT packages when set to "ON": the case statement's syntax is incorrect, and the branch should be written as a shell pattern match, ON), rather than directly referencing the variable (see the sketch after this file's summary).
  2. The CUDA_KEYRING filename is hardcoded for Ubuntu 20.04, which may not be suitable if the base image changes to a different version of Ubuntu or another Linux distribution.
  3. Setting CXXFLAGS="-Wno-error" globally suppresses all warnings as errors, which can hide important issues during compilation; consider using more specific flags or removing it if unnecessary.

Summary of changes

  • Added an argument TENSORRT with a default value of OFF.
  • Introduced a conditional installation block to install TensorRT and related dependencies when TENSORRT=ON.
  • Updated the list of packages installed in the Dockerfile to include TensorRT-specific dependencies.
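
The fix for issue 1 is standard shell case syntax: each branch is a pattern ending in ), closed with ;;. A sketch of the conditional block; the package names are placeholders, not the Dockerfile's actual list:

```dockerfile
ARG TENSORRT=OFF
# Install TensorRT packages only when the build arg is ON; the value of
# $TENSORRT is matched against the ON) pattern.
RUN case "$TENSORRT" in \
      ON) \
        apt-get update && \
        apt-get install -y --no-install-recommends tensorrt-dev && \
        rm -rf /var/lib/apt/lists/* ;; \
      *) \
        echo "Skipping TensorRT installation" ;; \
    esac
```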

utils/docker/docker-bake.ubuntu.hcl

Potential issues

  1. Inconsistent Dockerfile Paths: The dockerfile path for cuda-tensorrt is set to "Dockerfile.ubuntu-cuda", which might be incorrect if it should use a different Dockerfile specifically for TensorRT, such as "Dockerfile.ubuntu-cuda-tensorrt".

  2. Hardcoded Base Image Tags: In the cuda and cuda-tensorrt targets, the base image tag is hardcoded to "wasmedge/wasmedge:ubuntu-20.04-build-gcc". This could lead to maintenance issues if the base image needs to be updated or generalized.

  3. Typo in Contexts Key: The target "plugins" has a typo in the contexts key, using contexts instead of context, which will likely cause an error.

Summary of changes

  • Added a new target "cuda-tensorrt" to the targets list.
  • Introduced a new Dockerfile configuration for the "cuda-tensorrt" target with TensorRT support, specifying CUDA version 12.6 and enabling TensorRT through build arguments.
  • Configured specific Docker contexts and tags for the new "cuda-tensorrt" target to ensure proper image building and tagging.
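
The new target's shape might look roughly like the following; the attribute names follow docker-bake conventions and the notes above, while the build-arg names (NVCC_VER, TENSORRT) and the tag are assumptions rather than the PR's exact HCL:

```hcl
target "cuda-tensorrt" {
  dockerfile = "Dockerfile.ubuntu-cuda"
  args = {
    BASE_IMAGE = "wasmedge/wasmedge:ubuntu-20.04-build-gcc"
    NVCC_VER   = "12.6"
    TENSORRT   = "ON"
  }
  tags = ["wasmedge/wasmedge:ubuntu-20.04-build-gcc-tensorrt-cuda-12.6"]
}
```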

juntao · Nov 18, 2024

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 80.32%. Comparing base (0ad562d) to head (73f490c).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3878   +/-   ##
=======================================
  Coverage   80.32%   80.32%           
=======================================
  Files         280      280           
  Lines       38920    38920           
  Branches     6800     6800           
=======================================
  Hits        31264    31264           
  Misses       6050     6050           
  Partials     1606     1606           

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] · Nov 18, 2024