pipelines icon indicating copy to clipboard operation
pipelines copied to clipboard

feat(sdk): Add dict parameter access for granular value extraction

Open wassimbensalem opened this issue 1 month ago • 9 comments

Add Dict Parameter Access Feature for Pipeline Parameters

Summary

This PR adds support for extracting individual values from dictionary pipeline parameters using Pythonic syntax, eliminating the need to pass entire dictionaries to components that only need specific values.

Motivation

Previously, when a pipeline had a dictionary parameter, the entire dictionary had to be passed to every component, even if the component only needed a single value or subset. This led to:

  • Unnecessary data exposure to components
  • Reduced code clarity
  • Potential security concerns with passing sensitive data unnecessarily

Changes

1. New DictSubvariable Class (sdk/python/kfp/dsl/pipeline_channel.py)

  • Introduced DictSubvariable class to represent values extracted from dict parameters
  • Added __getitem__ method to PipelineParameterChannel to enable dict-style access
  • Supports both single-level and nested dict access
  • Supports passing sub-dictionaries to components

2. Compiler Support (sdk/python/kfp/compiler/pipeline_spec_builder.py)

  • Extended compiler to recognize DictSubvariable instances
  • Generates appropriate CEL (Common Expression Language) parameter_expression_selector expressions
  • Supports nested access by building chained CEL expressions
  • Works in both task and group contexts

3. Type Checking (sdk/python/kfp/dsl/pipeline_task.py)

  • Modified type compatibility checking to skip strict validation for DictSubvariable
  • Actual types are resolved at runtime by the CEL evaluator

4. Tests

  • Added comprehensive unit tests in pipeline_channel_test.py
  • Added integration test pipeline in test_data/pipeline_files/valid/pipeline_with_dict_parameter_access.py

Usage Examples

Single-Level Access

@dsl.pipeline
def my_pipeline(config: dict):
    # Extract a single value
    component1(db_host=config['db_host'])

Nested Access

@dsl.pipeline
def my_pipeline(config: dict):
    # Access nested values
    component2(host=config['database']['host'])
    component3(username=config['database']['credentials']['username'])

Sub-Dictionary Passing

@dsl.pipeline
def my_pipeline(config: dict):
    # Pass an entire sub-dictionary
    component4(db_config=config['database'])

Technical Details

Compile-Time vs Runtime

  • Compile-time: The SDK creates DictSubvariable objects and the compiler generates CEL expressions
  • Runtime: The backend driver evaluates CEL expressions to extract values from the actual JSON data

CEL Expression Generation

  • Single: parseJson(string_value)["key"]
  • Nested: parseJson(string_value)["database"]["host"]
  • Backend already supports these expressions (no backend changes required)

Benefits

  1. Improved Security: Components only receive the data they need
  2. Better Code Clarity: Clear intent about what data each component uses
  3. Reduced Coupling: Components don't depend on entire config structures
  4. Type Safety: Type checking happens at runtime based on actual values
  5. Backward Compatible: Existing code continues to work unchanged

Testing

Run unit tests:

pytest -v sdk/python/kfp/dsl/pipeline_channel_test.py::DictSubvariableTest

Run integration test:

python test_data/pipeline_files/valid/pipeline_with_dict_parameter_access.py

Checklist

  • [x] Added unit tests
  • [x] Added integration test
  • [x] Code follows project formatting standards (isort)
  • [x] All tests pass
  • [x] No breaking changes
  • [x] Documentation (docstrings) added

Related Issues

Closes #12418

Additional Notes

This feature leverages existing backend CEL expression evaluation capabilities, so no server-side changes are required. All changes are SDK-side and applied at compile time.

wassimbensalem avatar Nov 07 '25 10:11 wassimbensalem

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign james-jwu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] avatar Nov 07 '25 10:11 google-oss-prow[bot]

Hi @wassimbensalem. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Nov 07 '25 10:11 google-oss-prow[bot]

/assign

wassimbensalem avatar Nov 07 '25 10:11 wassimbensalem

/ok-to-test

hbelmiro avatar Nov 07 '25 17:11 hbelmiro

/ok-to-test

hbelmiro avatar Nov 10 '25 12:11 hbelmiro

/ok-to-test

hbelmiro avatar Nov 10 '25 14:11 hbelmiro

Is anything still missing here?

wassimbensalem avatar Nov 11 '25 14:11 wassimbensalem

@hbelmiro can we test here?

wassimbensalem avatar Nov 24 '25 16:11 wassimbensalem

@wassimbensalem I just asked my AI Code assistant to do some analysis on the PR, and as expected it provided some security risks:

Risks

  1. CEL Expression Injection Risks ⚠️ HIGH PRIORITY

● The PR introduces CEL expression generation like parseJson(string_value)["key"]. This presents several injection vectors:

  • Dynamic Key Injection: If user-controlled data becomes part of the CEL expression keys, malicious actors could potentially:
    • Access unintended data: config['../../sensitive_data']
    • Cause DoS through complex expressions: config[repeat("x", 1000000)]
  • JSON Parsing Vulnerabilities: The parseJson() function could be exploited with:
    • Malformed JSON leading to parser exploits
    • Deeply nested objects causing stack overflow
    • Large payloads causing memory exhaustion
  1. Input Validation and Sanitization Issues ⚠️ MEDIUM PRIORITY

From pipeline_channel.py:92-95, I see basic regex validation for names: valid_name_regex = r'^[A-Za-z][A-Za-z0-9\s_-]*$'

Concerns:

  • Key Name Validation: The PR doesn't specify validation for dict keys used in access patterns
  • Type Confusion: Runtime resolution could allow type confusion attacks
  • Path Traversal: Dict access patterns like config['../../../secret'] could bypass intended boundaries
  1. Data Exposure and Access Control Issues ⚠️ HIGH PRIORITY

● The PR claims "Improved Security: Components only receive the data they need," but introduces new risks:

  • Privilege Escalation: Components could potentially access more data than intended through:
    • Nested access patterns: config['system']['admin_password']
    • Array/object traversal: config['users'][0]['credentials']
  • Information Disclosure: Error messages from CEL evaluation could leak:
    • Dict structure information
    • Sensitive key names
    • Value type information
  1. Serialization and Deserialization Vulnerabilities ⚠️ MEDIUM PRIORITY

From pipeline_spec_builder.py:73-85, dict values are converted to protobuf. The new feature adds:

  • Deserialization Attacks: JSON parsing in CEL could be exploited with malicious payloads
  • Memory Exhaustion: Large nested dicts could cause resource exhaustion

Security Recommendations

Immediate Actions Required:

  1. Implement Strict Key Validation: - Whitelist allowed characters in dict keys - Prevent path traversal patterns (../, absolute paths) - Limit key length and nesting depth
  2. Add CEL Expression Sanitization: - Validate generated CEL expressions before evaluation - Implement expression complexity limits - Add timeout protection for CEL evaluation
  3. Enhance Access Control: - Document which dict keys should be accessible to which components - Consider implementing a permission model for dict access - Add audit logging for dict access patterns

Long-term Considerations:

  1. Security Testing: Add specific test cases for injection attempts and edge cases
  2. Documentation: Clearly document security boundaries and recommended usage patterns
  3. Monitoring: Implement runtime monitoring for suspicious access patterns

Verdict

While the feature improves functionality, the current implementation introduces significant security risks that need addressing before production deployment. The CEL injection vectors and lack of proper input validation are particularly concerning in multi-tenant environments.

Recommendation: Request security review and implementation of the above mitigations before merging PR #12419.

fyi @HumairAK @droctothorpe @mprahl @zazulam

nsingla avatar Dec 04 '25 21:12 nsingla