Improve robustness: Add PackAMatrix warning and VBE metadata validation
Summary
This PR addresses two identified technical debt items (TODOs) in the codebase to improve library robustness and developer experience. It introduces a runtime warning for unoptimized matrix packing paths in the C++ core and adds critical input validation for Variable Batch Size Embedding (VBE) metadata generation in the Python GPU extensions.
Changes
-
C++ Core (src/PackAMatrix.cc):
- Implemented a "warn-once" mechanism for the
matrix_op_t::Transposepath in PackAMatrix. - Previously, this unoptimized path executed silently. Now, it prints a warning to
stderron the first use, alerting users to potential performance implications without polluting logs.
- Implemented a "warn-once" mechanism for the
-
Python GPU Extensions (
fbgemm_gpu/.../split_table_batched_embeddings_ops_training_common.py):- Added comprehensive input validation to generate_vbe_metadata.
- The function now validates that
batch_size_per_feature_per_rank:- Is not empty.
- Matches the expected number of features (
feature_dims_cpu). - Has a consistent number of ranks across all features.
- Raises descriptive
ValueErrors to help developers debug shape mismatch issues earlier in the stack.
Motivation
These changes were motivated by existing TODO comments in the codebase requesting these specific improvements. By addressing them, we prevent undefined behavior and provide clearer feedback to users and developers, aligning with best practices for robust library design.
Verification
New tests were added to verify these changes:
- C++: test/PackAMatrixTransposeTest.cc - Verifies that the warning is correctly printed to
stderrwhen the transpose path is triggered. - Python: fbgemm_gpu/test/vbe_metadata_test.py - Verifies that generate_vbe_metadata raises the correct
ValueErrors for various malformed inputs.
Contributor
Mohit Ranjan
Hi @Jitterx69!
Thank you for your pull request and welcome to our community.
Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
@Jitterx69 Thanks for making this contribution! It looks like the PackAMatrixTest is failing on ARM, could you take a look into this?
@Jitterx69 Thanks for making this contribution! It looks like the PackAMatrixTest is failing on ARM, could you take a look into this?
On it