FBGEMM icon indicating copy to clipboard operation
FBGEMM copied to clipboard

Improve robustness: Add PackAMatrix warning and VBE metadata validation

Open Jitterx69 opened this issue 3 weeks ago • 4 comments

Summary

This PR addresses two identified technical debt items (TODOs) in the codebase to improve library robustness and developer experience. It introduces a runtime warning for unoptimized matrix packing paths in the C++ core and adds critical input validation for Variable Batch Size Embedding (VBE) metadata generation in the Python GPU extensions.

Changes

  1. C++ Core (src/PackAMatrix.cc):

    • Implemented a "warn-once" mechanism for the matrix_op_t::Transpose path in PackAMatrix.
    • Previously, this unoptimized path executed silently. Now, it prints a warning to stderr on the first use, alerting users to potential performance implications without polluting logs.
  2. Python GPU Extensions (fbgemm_gpu/.../split_table_batched_embeddings_ops_training_common.py):

    • Added comprehensive input validation to generate_vbe_metadata.
    • The function now validates that batch_size_per_feature_per_rank:
      • Is not empty.
      • Matches the expected number of features (feature_dims_cpu).
      • Has a consistent number of ranks across all features.
    • Raises descriptive ValueErrors to help developers debug shape mismatch issues earlier in the stack.

Motivation

These changes were motivated by existing TODO comments in the codebase requesting these specific improvements. By addressing them, we prevent undefined behavior and provide clearer feedback to users and developers, aligning with best practices for robust library design.

Verification

New tests were added to verify these changes:

  • C++: test/PackAMatrixTransposeTest.cc - Verifies that the warning is correctly printed to stderr when the transpose path is triggered.
  • Python: fbgemm_gpu/test/vbe_metadata_test.py - Verifies that generate_vbe_metadata raises the correct ValueErrors for various malformed inputs.

Contributor

Mohit Ranjan

Jitterx69 avatar Dec 05 '25 08:12 Jitterx69

Hi @Jitterx69!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

meta-cla[bot] avatar Dec 05 '25 08:12 meta-cla[bot]

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

meta-cla[bot] avatar Dec 05 '25 11:12 meta-cla[bot]

@Jitterx69 Thanks for making this contribution! It looks like the PackAMatrixTest is failing on ARM, could you take a look into this?

q10 avatar Dec 05 '25 23:12 q10

@Jitterx69 Thanks for making this contribution! It looks like the PackAMatrixTest is failing on ARM, could you take a look into this?

On it

Jitterx69 avatar Dec 06 '25 16:12 Jitterx69