feat(sdk): Add dict parameter access for granular value extraction
Add Dict Parameter Access Feature for Pipeline Parameters
Summary
This PR adds support for extracting individual values from dictionary pipeline parameters using Pythonic syntax, eliminating the need to pass entire dictionaries to components that only need specific values.
Motivation
Previously, when a pipeline had a dictionary parameter, the entire dictionary had to be passed to every component, even if the component only needed a single value or subset. This led to:
- Unnecessary data exposure to components
- Reduced code clarity
- Potential security concerns with passing sensitive data unnecessarily
Changes
1. New DictSubvariable Class (sdk/python/kfp/dsl/pipeline_channel.py)
- Introduced
DictSubvariableclass to represent values extracted from dict parameters - Added
__getitem__method toPipelineParameterChannelto enable dict-style access - Supports both single-level and nested dict access
- Supports passing sub-dictionaries to components
2. Compiler Support (sdk/python/kfp/compiler/pipeline_spec_builder.py)
- Extended compiler to recognize
DictSubvariableinstances - Generates appropriate CEL (Common Expression Language)
parameter_expression_selectorexpressions - Supports nested access by building chained CEL expressions
- Works in both task and group contexts
3. Type Checking (sdk/python/kfp/dsl/pipeline_task.py)
- Modified type compatibility checking to skip strict validation for
DictSubvariable - Actual types are resolved at runtime by the CEL evaluator
4. Tests
- Added comprehensive unit tests in
pipeline_channel_test.py - Added integration test pipeline in
test_data/pipeline_files/valid/pipeline_with_dict_parameter_access.py
Usage Examples
Single-Level Access
@dsl.pipeline
def my_pipeline(config: dict):
# Extract a single value
component1(db_host=config['db_host'])
Nested Access
@dsl.pipeline
def my_pipeline(config: dict):
# Access nested values
component2(host=config['database']['host'])
component3(username=config['database']['credentials']['username'])
Sub-Dictionary Passing
@dsl.pipeline
def my_pipeline(config: dict):
# Pass an entire sub-dictionary
component4(db_config=config['database'])
Technical Details
Compile-Time vs Runtime
- Compile-time: The SDK creates
DictSubvariableobjects and the compiler generates CEL expressions - Runtime: The backend driver evaluates CEL expressions to extract values from the actual JSON data
CEL Expression Generation
- Single:
parseJson(string_value)["key"] - Nested:
parseJson(string_value)["database"]["host"] - Backend already supports these expressions (no backend changes required)
Benefits
- Improved Security: Components only receive the data they need
- Better Code Clarity: Clear intent about what data each component uses
- Reduced Coupling: Components don't depend on entire config structures
- Type Safety: Type checking happens at runtime based on actual values
- Backward Compatible: Existing code continues to work unchanged
Testing
Run unit tests:
pytest -v sdk/python/kfp/dsl/pipeline_channel_test.py::DictSubvariableTest
Run integration test:
python test_data/pipeline_files/valid/pipeline_with_dict_parameter_access.py
Checklist
- [x] Added unit tests
- [x] Added integration test
- [x] Code follows project formatting standards (isort)
- [x] All tests pass
- [x] No breaking changes
- [x] Documentation (docstrings) added
Related Issues
Closes #12418
Additional Notes
This feature leverages existing backend CEL expression evaluation capabilities, so no server-side changes are required. All changes are SDK-side and applied at compile time.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign james-jwu for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Hi @wassimbensalem. Thanks for your PR.
I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign
/ok-to-test
/ok-to-test
/ok-to-test
Is anything still missing here?
@hbelmiro can we test here?
@wassimbensalem I just asked my AI Code assistant to do some analysis on the PR, and as expected it provided some security risks:
Risks
- CEL Expression Injection Risks ⚠️ HIGH PRIORITY
● The PR introduces CEL expression generation like parseJson(string_value)["key"]. This presents several injection vectors:
- Dynamic Key Injection: If user-controlled data becomes part of the CEL expression keys, malicious actors could potentially:
- Access unintended data: config['../../sensitive_data']
- Cause DoS through complex expressions: config[repeat("x", 1000000)]
- JSON Parsing Vulnerabilities: The parseJson() function could be exploited with:
- Malformed JSON leading to parser exploits
- Deeply nested objects causing stack overflow
- Large payloads causing memory exhaustion
- Input Validation and Sanitization Issues ⚠️ MEDIUM PRIORITY
From pipeline_channel.py:92-95, I see basic regex validation for names: valid_name_regex = r'^[A-Za-z][A-Za-z0-9\s_-]*$'
Concerns:
- Key Name Validation: The PR doesn't specify validation for dict keys used in access patterns
- Type Confusion: Runtime resolution could allow type confusion attacks
- Path Traversal: Dict access patterns like config['../../../secret'] could bypass intended boundaries
- Data Exposure and Access Control Issues ⚠️ HIGH PRIORITY
● The PR claims "Improved Security: Components only receive the data they need," but introduces new risks:
- Privilege Escalation: Components could potentially access more data than intended through:
- Nested access patterns: config['system']['admin_password']
- Array/object traversal: config['users'][0]['credentials']
- Information Disclosure: Error messages from CEL evaluation could leak:
- Dict structure information
- Sensitive key names
- Value type information
- Serialization and Deserialization Vulnerabilities ⚠️ MEDIUM PRIORITY
From pipeline_spec_builder.py:73-85, dict values are converted to protobuf. The new feature adds:
- Deserialization Attacks: JSON parsing in CEL could be exploited with malicious payloads
- Memory Exhaustion: Large nested dicts could cause resource exhaustion
Security Recommendations
Immediate Actions Required:
- Implement Strict Key Validation: - Whitelist allowed characters in dict keys - Prevent path traversal patterns (../, absolute paths) - Limit key length and nesting depth
- Add CEL Expression Sanitization: - Validate generated CEL expressions before evaluation - Implement expression complexity limits - Add timeout protection for CEL evaluation
- Enhance Access Control: - Document which dict keys should be accessible to which components - Consider implementing a permission model for dict access - Add audit logging for dict access patterns
Long-term Considerations:
- Security Testing: Add specific test cases for injection attempts and edge cases
- Documentation: Clearly document security boundaries and recommended usage patterns
- Monitoring: Implement runtime monitoring for suspicious access patterns
Verdict
While the feature improves functionality, the current implementation introduces significant security risks that need addressing before production deployment. The CEL injection vectors and lack of proper input validation are particularly concerning in multi-tenant environments.
Recommendation: Request security review and implementation of the above mitigations before merging PR #12419.
fyi @HumairAK @droctothorpe @mprahl @zazulam