
feat(agents-api, integrations-service): tool calls in tasks and sessions

Open Ahmad-mtos opened this issue 5 months ago • 1 comment

  • feat(agents-api): add convert_to_openai_tool function for integration tools
  • refactor: Lint agents-api (CI)
  • fix(integrations-service): refactor integration tool naming on conversion
  • feat(agents-api): add tool_runner.py that handles tool calling
  • chore(agents-api): refactor prompt_step.py to use tool_runner functionalities
  • lint
  • refactor: Lint agents-api (CI)

Ahmad-mtos avatar May 24 '25 09:05 Ahmad-mtos

Claude finished @Ahmad-mtos's task —— View job


Changelog Bot Analysis ✅

Analyzed the PR changes and created a changelog entry for user-visible features.

Todo List:

  • [x] Read diff.patch file to understand changes
  • [x] Determine if changes are user-visible
  • [x] Create changelog entry if needed
  • [x] Commit and push changes

Analysis:

This PR introduces significant user-visible changes related to tool calling functionality:

New Features Added:

  • Tool calling functionality in tasks and sessions
  • Integration tools conversion to OpenAI-compatible format
  • New tool runner system for handling tool execution
  • Enhanced prompt steps with tool runner capabilities

Actions Taken:

Created changelog entry: changelog/dev/2025-06-30.md
Committed changes: 226bee8

The changelog entry documents these user-facing improvements to tool calling capabilities in the Julep platform.


github-actions[bot] avatar Jun 30 '25 12:06 github-actions[bot]

Claude finished @Ahmad-mtos's task —— View job


Pull Request Review: Tool Calls in Tasks and Sessions

I've completed a comprehensive review of this PR that adds tool calling functionality to tasks and sessions. Here's my detailed analysis:

Review Checklist:

  • [x] Analyze core tool calling implementation (tool_runner.py, tool_executor.py)
  • [x] Review integration service changes and tool conversion logic
  • [x] Examine security considerations for tool execution
  • [x] Check error handling and logging practices
  • [x] Assess performance implications
  • [x] Review feature flags integration
  • [x] Analyze API changes and backwards compatibility
  • [x] Check test coverage and missing tests
  • [x] Address existing review comments from whiterabbit1983

Strengths

1. Architecture & Design

  • Clean separation of concerns: tool_runner.py handles orchestration, tool_executor.py handles specific execution
  • Type safety: Extensive use of Pydantic models and beartype decorators throughout
  • Feature flag integration: Proper OpenFeature implementation for safe gradual rollout
  • Modular design: Tools are properly abstracted with clear interfaces

2. Error Handling

  • Proper exception wrapping with context preservation (IntegrationExecutionException)
  • Graceful fallbacks for unknown tools (creates dummy error responses)
  • Activity logging integration for debugging

3. Integration Points

  • Seamless integration with existing workflow execution system
  • Proper Temporal activity patterns maintained
  • LLM provider abstraction through LiteLLM

⚠️ Issues & Concerns

1. Critical Issues

Potential Index Error in Workflow (workflow/task_execution/__init__.py:451)

if step.unwrap or not step.auto_run_tools or messages[-1]["tool_calls"] is None:

Issue: messages[-1] assumes messages list is non-empty. If len(messages) == 0, this will raise an IndexError.
Fix: Add length check before accessing last element.
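
A minimal guard, echoing the one-liner suggested later in this thread (the early-return body follows the surrounding workflow code):

# short-circuit before indexing a possibly-empty list
if step.unwrap or not step.auto_run_tools or not messages or messages[-1]["tool_calls"] is None:
    return WorkflowResult(state=PartialTransition(output=messages))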

Error Re-Raised Without Added Context (tool_executor.py:66-69)

except Exception as e:
    if activity.in_activity():
        activity.logger.error(f"Error in execute_web_search_tool: {e}")
    raise

Issue: Error is logged but then re-raised without additional context.
Recommendation: Consider whether to add structured error handling or wrap in custom exception.
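
One possible shape, assuming a project-level ToolExecutionError (a hypothetical name) so callers can tell which tool failed while the original traceback is preserved:

from temporalio import activity

class ToolExecutionError(Exception):
    """Hypothetical wrapper recording which tool failed."""

    def __init__(self, tool_name: str, original: Exception) -> None:
        super().__init__(f"{tool_name} failed: {original}")
        self.tool_name = tool_name

async def run_search(arguments: dict):  # stand-in for the real search call
    ...

async def execute_web_search_tool(arguments: dict):
    try:
        return await run_search(arguments)
    except Exception as e:
        if activity.in_activity():
            activity.logger.error(f"Error in execute_web_search_tool: {e}")
        raise ToolExecutionError("web_search", e) from e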

Excessive HTTP Timeout (integrations.py:24)

async with AsyncClient(timeout=600) as client:

Issue: Code is correct but timeout of 600 seconds (10 minutes) is very long.
Recommendation: Consider shorter timeout with retry logic.
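
A sketch of that retry shape; the 60-second timeout and three attempts are assumptions, not project settings:

import asyncio
from httpx import AsyncClient, HTTPError

async def call_integration(url: str, payload: dict, retries: int = 3) -> dict:
    for attempt in range(retries):
        try:
            async with AsyncClient(timeout=60) as client:
                resp = await client.post(url, json=payload)
                resp.raise_for_status()
                return resp.json()
        except HTTPError:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # back off 1s, then 2s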

2. Security Concerns

Tool Execution Security

  • Input validation: Tool arguments are passed through JSON parsing but need additional validation
  • Sandbox concerns: simpleeval mentioned in comments but not applied to all tool inputs
  • Setup/arguments merging: Multiple sources merged without clear precedence (tool_runner.py:136-140)

Integration Service Calls

  • Timeout vulnerability: 600-second timeout could be exploited for DoS
  • Error information leakage: Full error messages from integrations might expose sensitive info

3. Performance Issues

Tool Resolution Inefficiency (tool_runner.py:166-172)

tool_map = {}
for t in tools:
    if t.type == "integration" and t.integration is not None:
        tool_map[t.integration.provider] = t
    else:
        tool_map[t.name] = t

Issue: Tool map recreation on every LLM call instead of caching.
Impact: O(n) lookup cost for each tool resolution.
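
The simplest fix is to hoist the construction out of the per-call path and reuse the dict across LLM iterations; a sketch using the same field names as the snippet above:

def build_tool_map(tools) -> dict:
    # built once per run, not once per LLM call
    tool_map = {}
    for t in tools:
        if t.type == "integration" and t.integration is not None:
            tool_map[t.integration.provider] = t
        else:
            tool_map[t.name] = t
    return tool_map

tool_map = build_tool_map(tools)  # pass this into the tool-call loop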

Synchronous Tool Processing (tool_runner.py:185-208)

Issue: Tools are processed sequentially in the loop.
Recommendation: Consider parallel execution for independent tool calls.
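
Assuming the calls are independent and an async execute_tool_call helper exists (a hypothetical name standing in for the per-tool dispatch), the fan-out could look like:

import asyncio

async def execute_tool_call(call, tool_map):  # hypothetical per-tool dispatch
    ...

async def run_tool_calls(tool_calls, tool_map):
    # results come back in input order; exceptions are returned,
    # not raised, so one failing tool does not cancel the rest
    return await asyncio.gather(
        *(execute_tool_call(call, tool_map) for call in tool_calls),
        return_exceptions=True,
    )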

4. Code Quality Issues

Inconsistent Error Handling

  • Some functions use custom exceptions, others use generic exceptions
  • Error context not always preserved in chain

String Formatting Consistency (integrations/get_integration_tool.py:39)

if m.method == method:

Issue: Should use f-string for consistency (addresses whiterabbit1983's comment).

Type Annotation Gaps

  • tool_runner.py:137: setup variable type could be more specific
  • Return types in some functions could be more precise

🔍 Test Coverage Analysis

Missing Tests

  1. No tests found for tool_runner.py functionality
  2. No tests for run_llm_with_tools tool loop logic
  3. No tests for feature flag integration with tools
  4. No tests for error scenarios in tool execution
  5. No integration tests for the full tool calling pipeline

Test Recommendations

# Suggested test areas:
- test_tool_runner.py: Unit tests for format_tool, run_llm_with_tools
- test_tool_execution_integration.py: End-to-end tool calling tests
- test_feature_flags_tools.py: Feature flag behavior with tools
- test_tool_error_handling.py: Error scenarios and recovery

📝 Addressing Review Comments

1. whiterabbit1983's Comments

Error logging in tool_executor.py - Addressed Above

Status: Valid concern - error is logged but context could be improved

Async client in integrations.py - Code is Correct

Status: Code already uses AsyncClient correctly

F-string in integration service - Minor Issue

Fix: Change to f-string for consistency

len(messages) == 0 check - Critical Issue

Status: Major issue that needs immediate fix


🚀 Recommendations

High Priority

  1. Fix IndexError: Add length check before messages[-1] access
  2. Add comprehensive tests: At minimum, unit tests for core tool runner logic
  3. Security review: Validate all tool inputs and implement proper sandboxing
  4. Performance optimization: Cache tool maps and consider parallel execution

Medium Priority

  1. Error handling standardization: Use consistent exception types across modules
  2. Timeout configuration: Make integration timeouts configurable
  3. Logging improvements: Add structured logging with correlation IDs

Low Priority

  1. Code style: Convert remaining string formatting to f-strings
  2. Type annotations: Improve specificity of return types
  3. Documentation: Add docstrings for public methods

📊 Overall Assessment

Score: 7/10

This PR introduces valuable tool calling functionality with good architectural principles. However, it has several critical issues that need addressing before merge:

  • Critical bugs (IndexError) must be fixed
  • Test coverage is insufficient for such a complex feature
  • Security implications need careful review
  • Performance optimizations would benefit production usage

The feature flag approach for gradual rollout is excellent and shows good production awareness.

Recommendation: Address critical issues and add basic test coverage before merging.


claude[bot] avatar Jun 30 '25 12:06 claude[bot]


Claude finished @Ahmad-mtos's task —— View job


Pull Request Review: Tool Calls in Tasks and Sessions

I've completed a comprehensive review of this PR that adds tool calling functionality to tasks and sessions. Here's my detailed analysis:

Review Checklist:

  • [x] Analyze core tool calling implementation (tool_runner.py, tool_executor.py)
  • [x] Review integration service changes and tool conversion logic
  • [x] Examine security considerations for tool execution
  • [x] Check error handling and logging practices
  • [x] Assess performance implications
  • [x] Review feature flags integration
  • [x] Analyze API changes and backwards compatibility
  • [x] Check test coverage and missing tests
  • [x] Address existing review comments from whiterabbit1983

Strengths

1. Architecture & Design

  • Clean separation of concerns: tool_runner.py handles orchestration, tool_executor.py handles specific execution
  • Type safety: Extensive use of Pydantic models and beartype decorators throughout
  • Feature flag integration: Proper OpenFeature implementation for safe gradual rollout
  • Modular design: Tools are properly abstracted with clear interfaces

2. Error Handling

  • Proper exception wrapping with context preservation (IntegrationExecutionException)
  • Graceful fallbacks for unknown tools (creates dummy error responses)
  • Activity logging integration for debugging

3. Integration Points

  • Seamless integration with existing workflow execution system
  • Proper Temporal activity patterns maintained
  • LLM provider abstraction through LiteLLM

⚠️ Critical Issues

1. IndexError Vulnerability (workflow/task_execution/__init__.py:451)

if step.unwrap or not step.auto_run_tools or messages[-1]["tool_calls"] is None:

Issue: messages[-1] assumes messages list is non-empty. If len(messages) == 0, this will raise an IndexError.
Fix: Add length check before accessing last element:

if step.unwrap or not step.auto_run_tools or not messages or messages[-1]["tool_calls"] is None:

2. Security Vulnerabilities

Critical: Direct eval() Usage (expressions.py:76-90)

extra_lambdas[k] = eval(v)  # Bypasses simpleeval sandbox!

Risk: Allows arbitrary code execution, completely undermining the sandbox.
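
One possible replacement, assuming the configured strings are single-argument lambdas whose bodies are plain expressions; the body is evaluated inside the simpleeval sandbox instead of the interpreter (a sketch, not the project's implementation):

from simpleeval import simple_eval

def make_safe_lambda(source: str):
    # keep only the expression body of "lambda x: <expr>" and bind
    # the single argument by name inside the sandbox
    _, _, body = source.partition(":")
    def fn(x):
        return simple_eval(body.strip(), names={"x": x})
    return fn

# usage at the call site shown above:
# extra_lambdas[k] = make_safe_lambda(v)   instead of eval(v)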

JSON Bomb Vulnerability (tool_runner.py:131)

arguments_str = json.loads(call.function.arguments)

Risk: No size limits on JSON parsing could lead to memory exhaustion attacks.
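
A cheap guard before parsing; the 64 KB cap is an arbitrary assumption to tune against real payload sizes:

import json

MAX_ARGS_BYTES = 64 * 1024  # assumed cap

def parse_tool_arguments(raw: str) -> dict:
    if len(raw.encode("utf-8")) > MAX_ARGS_BYTES:
        raise ValueError(f"tool arguments exceed {MAX_ARGS_BYTES} bytes")
    return json.loads(raw)

# arguments = parse_tool_arguments(call.function.arguments)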

Argument Merging Without Validation (execute_integration.py:53,63)

arguments = arguments | (integration.arguments or {}) | merged_tool_args.get(tool_name, {})

Risk: Malicious arguments could override legitimate ones without validation.

3. Performance Issues

Tool Resolution Inefficiency (tool_runner.py:166-172)

tool_map = {}
for t in tools:
    if t.type == "integration" and t.integration is not None:
        tool_map[t.integration.provider] = t
    else:
        tool_map[t.name] = t

Issue: Tool map recreation on every LLM call instead of caching.

Sequential Tool Processing (tool_runner.py:185-208)

Issue: Tools are processed sequentially in the loop.
Recommendation: Consider parallel execution for independent tool calls.


🔍 Addressing Review Comments

whiterabbit1983's Comments

1. Error logging in tool_executor.py - Valid Concern

except Exception as e:
    if activity.in_activity():
        activity.logger.error(f"Error in execute_web_search_tool: {e}")
    raise

Issue: Error is logged but context could be improved with structured logging.

2. Async client in integrations.py - Code is Correct

The code correctly uses AsyncClient, but the 600-second timeout is excessive:

async with AsyncClient(timeout=600) as client:  # 10 minutes!

Recommendation: Reduce to 30-60 seconds with retry logic.

3. F-string in integration service - Minor Style Issue

Line 39 was flagged for f-string conversion, but the comparison is a plain equality check and is already clear:

# Current
if m.method == method:

No conversion is needed; wrapping either side in an f-string would add nothing.

4. len(messages) == 0 check - Critical Issue (Addressed above)


🚨 Additional Security Concerns

Network Security

  • Excessive HTTP timeouts (10 minutes) could enable DoS attacks
  • No response size limits could cause memory exhaustion (see the sketch after this list)
  • Error information disclosure may leak sensitive details
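
For the response-size point, a sketch that streams the body through httpx and aborts past a hard cap (the 1 MB limit is an assumption):

from httpx import AsyncClient

MAX_BODY = 1024 * 1024  # assumed 1 MB cap

async def fetch_bounded(client: AsyncClient, url: str) -> bytes:
    async with client.stream("GET", url) as resp:
        resp.raise_for_status()
        body = b""
        async for chunk in resp.aiter_bytes():
            body += chunk
            if len(body) > MAX_BODY:
                raise ValueError("integration response exceeded size limit")
        return body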

Input Validation

  • Missing size limits on tool arguments and setup parameters
  • Insufficient sanitization of integration responses
  • Weak timeout enforcement in expression evaluation

📊 Test Coverage Analysis

Existing Tests

  • test_tool_queries.py: CRUD operations for tool management
  • test_tool_call_step.py: Basic tool call construction and formatting

Missing Critical Tests

  1. No tests for tool_runner.py functionality
  2. No tests for run_llm_with_tools tool loop logic
  3. No tests for feature flag integration with tools
  4. No tests for error scenarios in tool execution
  5. No integration tests for the full tool calling pipeline
  6. No security tests for malicious input handling

Test Recommendations

# Suggested new test files:
- tests/test_tool_runner.py: Core tool runner functionality
- tests/test_tool_security.py: Security vulnerability tests
- tests/test_tool_integration.py: End-to-end tool calling tests
- tests/test_tool_error_handling.py: Error scenarios and recovery

🔧 Recommendations

Must Fix Before Merge (Blockers)

  1. Fix IndexError: Add messages length check in workflow execution
  2. Remove eval() usage: Replace direct eval with safe alternatives
  3. Add JSON size limits: Prevent JSON bomb attacks
  4. Add basic test coverage: At minimum, unit tests for core tool runner logic

Should Fix Soon (High Priority)

  1. Reduce HTTP timeouts: Configure reasonable timeout values
  2. Implement response size limits: Prevent memory exhaustion
  3. Add input validation: Validate tool arguments before processing
  4. Improve error handling: Sanitize error messages to prevent information disclosure

Nice to Have (Medium Priority)

  1. Performance optimization: Cache tool maps and consider parallel execution
  2. Enhanced logging: Add structured logging with correlation IDs
  3. Rate limiting: Implement per-user/agent tool execution limits

📈 Overall Assessment

Score: 6/10

This PR introduces valuable tool calling functionality with good architectural foundations, but has several critical security vulnerabilities and bugs that must be addressed before merging.

Positives:

  • Well-structured architecture with clear separation of concerns
  • Proper use of feature flags for gradual rollout
  • Good integration with existing systems
  • Type safety through Pydantic models

Critical Issues:

  • Security vulnerabilities that could lead to arbitrary code execution
  • IndexError bug that could crash the system
  • Insufficient test coverage for such a complex feature
  • Performance inefficiencies that could impact production

Recommendation: DO NOT MERGE until critical security vulnerabilities and the IndexError bug are fixed, and basic test coverage is added.


claude[bot] avatar Jun 30 '25 15:06 claude[bot]

Claude finished @Ahmad-mtos's task —— View job


Changelog Bot Analysis ✅

Analyzed the PR changes and updated the changelog entry with comprehensive details about new user-visible features.

Todo List:

  • [x] Read diff.patch file to understand changes
  • [x] Determine if changes are user-visible
  • [x] Update changelog entry with detailed features
  • [x] Commit and push changes

Analysis:

This PR introduces significant user-visible changes related to tool calling functionality:

New Features Added:

  • New prompt step parameters: tool_choice, unwrap, auto_run_tools, disable_cache
  • Session parameters: auto_run_tools and forward_tool_calls
  • Support for Anthropic tool types: computer, text editor, and bash tools
  • Feature flags integration using Unleash for controlled rollouts
  • Enhanced tool runner system and OpenAI-compatible tool conversion

Actions Taken:

Updated changelog entry: changelog/dev/2025-06-30.md
Committed detailed changes: 0e4d1e9

The updated changelog entry now provides users with specific details about all the new tool calling parameters and capabilities introduced in this PR.


github-actions[bot] avatar Jun 30 '25 15:06 github-actions[bot]

Claude finished @Ahmad-mtos's task —— View job


Pull Request Review: Tool Calls in Tasks and Sessions

I've completed a comprehensive review of this PR that adds tool calling functionality to tasks and sessions. Here's my detailed analysis:

Review Checklist:

  • [x] Read repository guidelines (CLAUDE.md/AGENTS.md)
  • [x] Analyze core tool calling implementation (tool_runner.py, tool_executor.py)
  • [x] Review integration service changes and tool conversion logic
  • [x] Examine security considerations for tool execution
  • [x] Check error handling and logging practices
  • [x] Assess performance implications
  • [x] Review feature flags integration
  • [x] Analyze API changes and backwards compatibility
  • [x] Check test coverage and missing tests
  • [x] Address existing review comments from whiterabbit1983

Strengths

1. Architecture & Design

  • Clean separation of concerns: tool_runner.py handles orchestration, tool_executor.py handles specific execution
  • Type safety: Extensive use of Pydantic models and beartype decorators throughout
  • Feature flag integration: Proper OpenFeature implementation for safe gradual rollout
  • Modular design: Tools are properly abstracted with clear interfaces

2. Error Handling

  • Proper exception wrapping with context preservation (IntegrationExecutionException)
  • Graceful fallbacks for unknown tools (creates dummy error responses)
  • Activity logging integration for debugging

3. Integration Points

  • Seamless integration with existing workflow execution system
  • Proper Temporal activity patterns maintained
  • LLM provider abstraction through LiteLLM

🚨 Critical Security Vulnerabilities

1. CRITICAL: Direct eval() Usage (expressions.py:90)

extra_lambdas[k] = eval(v)  # Bypasses simpleeval sandbox!

Risk: IMMEDIATE SECURITY THREAT - Allows arbitrary code execution, completely undermining the sandbox.
Impact: Full system compromise, data exfiltration, service disruption
Status: BLOCKS MERGE - This must be fixed before deployment

2. IndexError Vulnerability (workflow/task_execution/__init__.py:451)

if step.unwrap or not step.auto_run_tools or messages[-1]["tool_calls"] is None:

Issue: messages[-1] assumes messages list is non-empty. If len(messages) == 0, this will raise an IndexError.
Fix: Add length check: if step.unwrap or not step.auto_run_tools or not messages or messages[-1]["tool_calls"] is None:

3. JSON Bomb Vulnerability (tool_runner.py:131)

arguments_str = json.loads(call.function.arguments)

Risk: No size limits on JSON parsing could lead to memory exhaustion attacks.


⚠️ Additional Security Concerns

Argument Merging Without Validation (execute_integration.py:53,63)

arguments = arguments | (integration.arguments or {}) | merged_tool_args.get(tool_name, {})
setup = setup | integration_setup | merged_tool_setup.get(tool_name, {})

Risk: Malicious arguments could override legitimate ones without validation.

Excessive Timeouts (integrations.py:24,43)

async with AsyncClient(timeout=600) as client:  # 10 minutes!

Risk: Could enable DoS attacks; should be reduced to 30-60 seconds with retry logic.


📝 Addressing Review Comments

whiterabbit1983's Comments

1. Error logging in tool_executor.py - Valid Concern

Status: Error is logged but context could be improved with structured logging.

2. Async client in integrations.py - Code is Correct

Status: Code correctly uses AsyncClient, but timeout is excessive (600s → 30-60s recommended).

3. F-string in integration service - Minor Style Issue

Line 38: Code is already clear and doesn't require f-string conversion.

4. len(messages) == 0 check - Critical Issue

Status: Confirmed critical IndexError vulnerability that could crash the system.


🔍 Performance Issues

Tool Resolution Inefficiency (tool_runner.py:166-172)

tool_map = {}
for t in tools:
    if t.type == "integration" and t.integration is not None:
        tool_map[t.integration.provider] = t
    else:
        tool_map[t.name] = t

Issue: Tool map recreation on every LLM call instead of caching.
Impact: O(n) lookup cost for each tool resolution.

Sequential Tool Processing (tool_runner.py:185-208)

Issue: Tools are processed sequentially in the loop.
Recommendation: Consider parallel execution for independent tool calls.


📊 Test Coverage Analysis

Existing Tests

  • test_tool_queries.py: CRUD operations for tool management
  • test_tool_call_step.py: Basic tool call construction and formatting

Missing Critical Tests

  1. No tests for tool_runner.py functionality
  2. No tests for run_llm_with_tools tool loop logic
  3. No tests for feature flag integration with tools
  4. No tests for error scenarios in tool execution
  5. No integration tests for the full tool calling pipeline
  6. No security tests for malicious input handling

Test Recommendations

# Suggested new test files:
- tests/test_tool_runner.py: Core tool runner functionality
- tests/test_tool_security.py: Security vulnerability tests
- tests/test_tool_integration.py: End-to-end tool calling tests
- tests/test_tool_error_handling.py: Error scenarios and recovery

🔧 Recommendations

Must Fix Before Merge (Blockers)

  1. 🚨 CRITICAL: Remove eval() usage - Replace direct eval with safe alternatives in expressions.py:90
  2. Fix IndexError - Add messages length check in workflow execution
  3. Add JSON size limits - Prevent JSON bomb attacks
  4. Add basic test coverage - At minimum, unit tests for core tool runner logic

Should Fix Soon (High Priority)

  1. Reduce HTTP timeouts - Configure reasonable timeout values (30-60s)
  2. Implement response size limits - Prevent memory exhaustion
  3. Add input validation - Validate tool arguments before processing
  4. Improve error handling - Sanitize error messages to prevent information disclosure

Nice to Have (Medium Priority)

  1. Performance optimization - Cache tool maps and consider parallel execution
  2. Enhanced logging - Add structured logging with correlation IDs
  3. Rate limiting - Implement per-user/agent tool execution limits

📈 Overall Assessment

Score: 4/10 (Updated from previous reviews)

This PR introduces valuable tool calling functionality with good architectural foundations, but has several critical security vulnerabilities that make it unsafe for production deployment.

Positives:

  • Well-structured architecture with clear separation of concerns
  • Proper use of feature flags for gradual rollout
  • Good integration with existing systems
  • Type safety through Pydantic models

Critical Issues:

  • 🚨 CRITICAL SECURITY VULNERABILITY: Direct eval() usage allows arbitrary code execution
  • IndexError bug that could crash the system
  • Insufficient test coverage for such a complex feature
  • Performance inefficiencies that could impact production

Recommendation: ❌ DO NOT MERGE until the critical security vulnerability in expressions.py is fixed, the IndexError bug is resolved, and basic test coverage is added.

The direct eval() usage represents an immediate security threat that could lead to full system compromise. This must be addressed before any deployment considerations.


claude[bot] avatar Jun 30 '25 15:06 claude[bot]

Claude finished @Ahmad-mtos's task —— View job


Changelog Bot Analysis ✅

Analyzed the PR changes and updated the changelog entry with comprehensive details about new user-visible features.

Todo List:

  • [x] Read diff.patch file to understand changes
  • [x] Determine if changes are user-visible
  • [x] Update changelog entry with detailed features
  • [x] Commit and push changes

Analysis:

This PR introduces significant user-visible changes related to tool calling functionality:

New Features Added:

  • New prompt step parameters: tool_choice, unwrap, auto_run_tools, disable_cache
  • Session parameters: auto_run_tools and forward_tool_calls
  • Support for Anthropic tool types: computer, text editor, and bash tools
  • Feature flags integration using Unleash for controlled rollouts
  • Enhanced tool runner system and OpenAI-compatible tool conversion

Actions Taken:

Updated changelog entry: changelog/dev/2025-06-30.md
Committed detailed changes: 7244770

The updated changelog entry now provides users with specific details about all the new tool calling parameters and capabilities introduced in this PR.


github-actions[bot] avatar Jun 30 '25 15:06 github-actions[bot]

Claude finished @Ahmad-mtos's task —— View job


Pull Request Review: Tool Calls in Tasks and Sessions

Conducting comprehensive review of PR #1453 that adds tool calling functionality to tasks and sessions.

Review Checklist:

  • [x] Read repository guidelines (CLAUDE.md/AGENTS.md)
  • [x] Analyze core tool calling implementation
  • [x] Address CI type checking failures
  • [x] Review security vulnerabilities (especially eval() usage)
  • [x] Check error handling and logging practices
  • [x] Assess performance implications
  • [x] Review test coverage
  • [x] Address existing reviewer comments from whiterabbit1983
  • [x] Provide comprehensive feedback and recommendations

Strengths

1. Architecture & Design

  • Clean separation of concerns: tool_runner.py handles orchestration, tool_executor.py handles specific execution
  • Type safety: Extensive use of Pydantic models and beartype decorators throughout
  • Feature flag integration: Proper OpenFeature implementation for safe gradual rollout
  • Modular design: Tools are properly abstracted with clear interfaces

2. Error Handling

  • Proper exception wrapping with context preservation (IntegrationExecutionException)
  • Graceful fallbacks for unknown tools (creates dummy error responses)
  • Activity logging integration for debugging

3. Integration Points

  • Seamless integration with existing workflow execution system
  • Proper Temporal activity patterns maintained
  • LLM provider abstraction through LiteLLM

🚨 Critical Issues

1. CI Type Checking Failures

The build is currently failing with several type errors that must be fixed:

feature_flags.py:329, 340

# Line 329
result = self._client.get_boolean_value(...)  # _client can be None
# Line 340
return default_value  # Expected bool, got Optional[bool]

Fix: Add null checks and ensure return type is always bool.
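
A minimal shape covering both errors, using the names from the snippet above (the surrounding class and client API are assumed):

def get_boolean_flag(self, flag_name: str, default_value: bool) -> bool:
    if self._client is None:  # initialization failed; fall back
        return default_value
    result = self._client.get_boolean_value(flag_name, default_value)
    return default_value if result is None else bool(result)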

tool_runner.py:54, 82

# Line 54
"parameters": tool.api_call.params_schema.model_dump(exclude_none=True),  # params_schema can be None
# Line 82  
"parameters": tool.function.parameters,  # function can be None

Fix: Add null checks before accessing attributes.
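
Guarded versions of the two expressions, assuming an empty schema is an acceptable fallback when the optional attribute is absent:

# line 54
"parameters": (
    tool.api_call.params_schema.model_dump(exclude_none=True)
    if tool.api_call and tool.api_call.params_schema
    else {}
),
# line 82
"parameters": tool.function.parameters if tool.function else {},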

2. Critical Security Vulnerability (expressions.py:90)

extra_lambdas[k] = eval(v)  # Direct eval() bypasses simpleeval sandbox!

Risk: IMMEDIATE SECURITY THREAT - Allows arbitrary code execution
Impact: Full system compromise, data exfiltration, service disruption
Status: BLOCKS MERGE

3. IndexError Vulnerability (workflow/task_execution/__init__.py:451)

if step.unwrap or not step.auto_run_tools or messages[-1]["tool_calls"] is None:

Issue: messages[-1] assumes non-empty list. If len(messages) == 0, raises IndexError
Fix: if step.unwrap or not step.auto_run_tools or not messages or messages[-1]["tool_calls"] is None:


⚠️ Additional Security & Performance Concerns

Network Security

  • Excessive HTTP timeouts (600s) in integrations.py:24,43 could enable DoS attacks
  • No response size limits could cause memory exhaustion
  • No input validation on JSON payloads (tool_runner.py:131)

Performance Issues

  • Tool map recreation on every LLM call (tool_runner.py:166-172) - should cache
  • Sequential tool processing - could benefit from parallel execution for independent tools
  • No rate limiting on tool execution

📝 Addressing Review Comments

whiterabbit1983's Comments

1. Error logging in tool_executor.py:66-69 - Valid Concern

except Exception as e:
    if activity.in_activity():
        activity.logger.error(f"Error in execute_web_search_tool: {e}")
    raise  # Error logged but re-raised without additional context

Recommendation: Consider wrapping in custom exception with more context.

2. Async client in integrations.py:24 - Code is Correct

Code correctly uses AsyncClient, but 600-second timeout is excessive.
Recommendation: Reduce to 30-60 seconds with retry logic.

3. F-string in integration service:38 - Minor Style Issue

The current code if m.method == method: is clear and doesn't require f-string conversion.

4. len(messages) == 0 check - Critical Issue (Addressed above)


📊 Test Coverage Analysis

Existing Tests

  • test_tool_runner.py: Good coverage of format_tool function with various tool types
  • test_tool_queries.py: CRUD operations for tool management
  • test_tool_call_step.py: Basic tool call construction

Missing Critical Tests

  1. No tests for run_llm_with_tools tool loop logic
  2. No tests for feature flag integration with tools
  3. No tests for error scenarios in tool execution
  4. No tests for security boundaries (malicious inputs)
  5. No integration tests for the full tool calling pipeline
  6. No tests for tool resolution and mapping logic

🔧 Recommendations

Must Fix Before Merge (Blockers)

  1. 🚨 CRITICAL: Fix eval() usage - Replace with safe alternatives in expressions.py:90
  2. Fix CI type errors - Add null checks in feature_flags.py and tool_runner.py
  3. Fix IndexError - Add length check in workflow execution
  4. Add JSON size limits - Prevent JSON bomb attacks

Should Fix Soon (High Priority)

  1. Reduce HTTP timeouts - Configure reasonable timeout values (30-60s)
  2. Add input validation - Validate tool arguments before processing
  3. Implement response size limits - Prevent memory exhaustion
  4. Add security tests - Test malicious input handling

Nice to Have (Medium Priority)

  1. Performance optimization - Cache tool maps and consider parallel execution
  2. Enhanced logging - Add structured logging with correlation IDs
  3. Rate limiting - Implement per-user/agent tool execution limits

📈 Overall Assessment

Score: 4/10

This PR introduces valuable tool calling functionality with good architectural foundations, but has several critical security vulnerabilities and type safety issues that make it unsafe for production deployment.

Positives:

  • Well-structured architecture with clear separation of concerns
  • Proper use of feature flags for gradual rollout
  • Good integration with existing systems
  • Comprehensive test coverage for tool formatting

Critical Issues:

  • 🚨 CRITICAL SECURITY VULNERABILITY: Direct eval() usage allows arbitrary code execution
  • CI failures due to type checking errors that break the build
  • IndexError bug that could crash the system
  • Excessive timeouts that could be exploited

Recommendation: ❌ DO NOT MERGE until:

  1. Critical security vulnerability in expressions.py is fixed
  2. CI type checking errors are resolved
  3. IndexError bug is patched
  4. Basic input validation is added

The direct eval() usage represents an immediate security threat that could lead to full system compromise. This must be addressed before any deployment considerations.


claude[bot] avatar Jul 01 '25 18:07 claude[bot]

Claude finished @Ahmad-mtos's task —— View job


Changelog Bot Analysis ✅

Analyzed the PR changes and determined that no new changelog entry is needed.

Todo List:

  • [x] Read diff.patch file to understand changes
  • [x] Determine if changes are user-visible
  • [x] Create or update changelog entry if needed
  • [x] Commit and push changes

Analysis:

This PR contains no new user-visible changes that require a changelog entry beyond what's already documented in the existing changelog/dev/2025-06-30.md file.

Changes in this update:

  • Added test files (test_prompt_step_auto_tools.py, test_tool_runner.py)
  • Added feature flag client implementation
  • Minor fixes and refactoring

User-visible features already documented:

  • ✅ Tool calling functionality
  • ✅ New prompt step parameters (tool_choice, unwrap, auto_run_tools, disable_cache)
  • ✅ New session parameters (auto_run_tools, forward_tool_calls)
  • ✅ Anthropic tool support (computer_20241022, text_editor_20241022, bash_20241022)
  • ✅ Feature flags integration using Unleash

Conclusion: The existing changelog entry from June 30, 2025 comprehensively covers all user-visible changes in this PR. No additional changelog entry is needed.


github-actions[bot] avatar Jul 01 '25 18:07 github-actions[bot]

Claude finished @Ahmad-mtos's task —— View job


Pull Request Review: Tool Calls in Tasks and Sessions

Conducting comprehensive review of PR #1453 that adds tool calling functionality to tasks and sessions.

Review Checklist:

  • [x] Read repository guidelines (AGENTS.md/CLAUDE.md)
  • [x] Analyze core tool calling implementation
  • [x] Address existing reviewer comments from whiterabbit1983
  • [x] Review security vulnerabilities (especially eval() usage)
  • [x] Check error handling and logging practices
  • [x] Assess performance implications
  • [x] Review test coverage
  • [x] Provide comprehensive feedback and recommendations

Strengths

1. Architecture & Design

  • Clean separation of concerns: tool_runner.py handles orchestration, tool_executor.py handles specific execution
  • Type safety: Extensive use of Pydantic models and beartype decorators throughout
  • Feature flag integration: Proper OpenFeature implementation for safe gradual rollout
  • Modular design: Tools are properly abstracted with clear interfaces

2. Error Handling

  • Proper exception wrapping with context preservation (IntegrationExecutionException)
  • Graceful fallbacks for unknown tools (creates dummy error responses)
  • Activity logging integration for debugging

3. Integration Points

  • Seamless integration with existing workflow execution system
  • Proper Temporal activity patterns maintained
  • LLM provider abstraction through LiteLLM

🚨 Critical Issues

1. Critical Security Vulnerability (expressions.py:90)

extra_lambdas[k] = eval(v)  # Direct eval() bypasses simpleeval sandbox!

Risk: IMMEDIATE SECURITY THREAT - Allows arbitrary code execution
Impact: Full system compromise, data exfiltration, service disruption
Status: BLOCKS MERGE

2. IndexError Vulnerability (workflow/task_execution/__init__.py:451)

if step.unwrap or not step.auto_run_tools or messages[-1]["tool_calls"] is None:

Issue: messages[-1] assumes non-empty list. If len(messages) == 0, raises IndexError
Fix: if step.unwrap or not step.auto_run_tools or not messages or messages[-1]["tool_calls"] is None:

3. Type Safety Issues (tool_runner.py)

Lines 54, 82: Missing null checks before accessing optional attributes

# Line 54
"parameters": tool.api_call.params_schema.model_dump(exclude_none=True),  # params_schema can be None
# Line 82  
"parameters": tool.function.parameters,  # function can be None

⚠️ Additional Security & Performance Concerns

Network Security

  • Excessive HTTP timeouts (600s) in integrations.py:24,43 could enable DoS attacks
  • No response size limits could cause memory exhaustion
  • No input validation on JSON payloads (tool_runner.py:131)

Performance Issues

  • Tool map recreation on every LLM call (tool_runner.py:166-172) - should cache
  • Sequential tool processing - could benefit from parallel execution for independent tools
  • No rate limiting on tool execution

📝 Addressing Review Comments

whiterabbit1983's Comments

1. Error logging in tool_executor.py:66-69 - Valid Concern

except Exception as e:
    if activity.in_activity():
        activity.logger.error(f"Error in execute_web_search_tool: {e}")
    raise  # Error logged but re-raised without additional context

Recommendation: Consider wrapping in custom exception with more context.

2. Async client in integrations.py:24 - Code is Correct

Code correctly uses AsyncClient, but 600-second timeout is excessive.
Recommendation: Reduce to 30-60 seconds with retry logic.

3. F-string in integration service:38 - Minor Style Issue

The current code if m.method == method: is clear and doesn't require f-string conversion.

4. len(messages) == 0 check - Critical Issue (Addressed above)


📊 Test Coverage Analysis

Existing Tests

  • test_tool_runner.py: Good coverage of format_tool function with various tool types
  • test_prompt_step_auto_tools.py: Feature flag integration testing
  • Basic tool call construction tests

Missing Critical Tests

  1. No tests for run_llm_with_tools tool loop logic
  2. No tests for error scenarios in tool execution
  3. No tests for security boundaries (malicious inputs)
  4. No integration tests for the full tool calling pipeline
  5. No tests for tool resolution and mapping logic

🔧 Recommendations

Must Fix Before Merge (Blockers)

  1. 🚨 CRITICAL: Fix eval() usage - Replace with safe alternatives in expressions.py:90
  2. Fix IndexError - Add length check in workflow execution
  3. Add null checks - Fix type safety issues in tool_runner.py:54,82
  4. Add JSON size limits - Prevent JSON bomb attacks

Should Fix Soon (High Priority)

  1. Reduce HTTP timeouts - Configure reasonable timeout values (30-60s)
  2. Add input validation - Validate tool arguments before processing
  3. Implement response size limits - Prevent memory exhaustion
  4. Add security tests - Test malicious input handling

Nice to Have (Medium Priority)

  1. Performance optimization - Cache tool maps and consider parallel execution
  2. Enhanced logging - Add structured logging with correlation IDs
  3. Rate limiting - Implement per-user/agent tool execution limits

📈 Overall Assessment

Score: 4/10

This PR introduces valuable tool calling functionality with good architectural foundations, but has several critical security vulnerabilities and type safety issues that make it unsafe for production deployment.

Positives:

  • Well-structured architecture with clear separation of concerns
  • Proper use of feature flags for gradual rollout
  • Good integration with existing systems
  • Comprehensive test coverage for tool formatting

Critical Issues:

  • 🚨 CRITICAL SECURITY VULNERABILITY: Direct eval() usage allows arbitrary code execution
  • IndexError bug that could crash the system
  • Type safety issues that could cause runtime errors
  • Excessive timeouts that could be exploited

Recommendation: ❌ DO NOT MERGE until:

  1. Critical security vulnerability in expressions.py is fixed
  2. IndexError bug is patched
  3. Type safety issues are resolved
  4. Basic input validation is added

The direct eval() usage represents an immediate security threat that could lead to full system compromise. This must be addressed before any deployment considerations.


claude[bot] avatar Jul 01 '25 18:07 claude[bot]

Claude encountered an error —— View job


Changelog Bot Analysis

Analyzing the PR changes to determine if a changelog entry is needed.

Todo List:

  • [ ] Read diff.patch file to understand changes
  • [ ] Determine if changes are user-visible
  • [ ] Create or update changelog entry if needed
  • [ ] Commit and push changes

github-actions[bot] avatar Jul 01 '25 18:07 github-actions[bot]


PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 Security concerns

Sensitive information exposure:
The feature flags configuration includes API tokens and URLs that are passed through environment variables. While this is standard practice, ensure that the unleash_api_token and other credentials are properly secured and not logged in plain text. The custom headers built in the UnleashProvider._initialize_unleash() method include the API token directly, so it could be exposed in debug logs.

⚡ Recommended focus areas for review

Error Handling

The feature flag client has extensive error handling but may silently fall back to defaults in production scenarios where proper error visibility is needed. The initialization errors are logged but the client continues operating with potentially unexpected behavior.

except ImportError:
    logger.warning(
        "⚠️ UnleashClient not available, feature flags will use default values"
    )
    self._unleash_client = None
except Exception as e:
    logger.error(f"❌ Failed to initialize Unleash client: {e}")
    logger.warning("⚠️ Feature flags will use default values")
    self._unleash_client = None

Type Safety

The convert_litellm_to_chosen_tool_call function performs JSON parsing and dictionary manipulation that could fail with malformed input. The function assumes specific structure in tool_spec without proper validation.

arguments_str = json.loads(call.function.arguments)

tool_spec = tool.model_dump()
if "id" in tool_spec:
    tool_spec.pop("id")

setup = getattr(tool_spec, f"{tool.type}", {}).get("setup", {})
if setup:
    setup = setup.model_dump()
tool_spec[f"{tool.type}"]["setup"] = setup

# TODO: add computer_20241022, text_editor_20241022, bash_20241022
tool_spec[f"{tool.type}"]["arguments"] = arguments_str

return BaseChosenToolCall(
    id=call.id,
    **tool_spec,
)
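
A defensive version of the parsing step, assuming malformed arguments should surface as one clear error instead of a KeyError further down (a sketch, not the PR's code):

import json

try:
    arguments = json.loads(call.function.arguments)
except json.JSONDecodeError as e:
    raise ValueError(f"malformed tool arguments for call {call.id}: {e}") from e

tool_spec = tool.model_dump()
tool_spec.pop("id", None)

if tool_spec.get(tool.type) is None:
    raise ValueError(f"tool spec has no section for type {tool.type!r}")
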
Logic Complexity

The prompt step handling logic has become complex with feature flag branching and different response structure handling. The nested try-catch blocks and multiple conditional paths increase the risk of edge case bugs.

try:
    if get_feature_flag_value(
        "auto_tool_calls_prompt_step",
        developer_id=self.context.execution_input.developer_id,
    ):
        if step.unwrap or not step.auto_run_tools or messages[-1]["tool_calls"] is None:
            workflow.logger.debug(f"Prompt step: Received response: {messages}")
            return WorkflowResult(state=PartialTransition(output=messages))

        # TODO: make sure to include filtered function tool calls in the last message, or filter them from 2nd message
        tool_calls_input = messages[-1]["tool_calls"]
    else:
        message = messages
        if (
            step.unwrap
            or not step.auto_run_tools
            or message["choices"][0]["finish_reason"] != "tool_calls"
        ):
            workflow.logger.debug(f"Prompt step: Received response: {message}")
            return WorkflowResult(state=PartialTransition(output=message))

        choice = message["choices"][0]
        tool_calls_input = choice["message"]["tool_calls"]
except (KeyError, IndexError, TypeError) as e:
    workflow.logger.error(f"Prompt step: Error parsing response structure: {e}")
    msg = f"Invalid response structure in prompt step: {e}"
    raise ApplicationError(msg) from e
except Exception as e:
    workflow.logger.error(f"Prompt step: Unexpected error: {e}")
    msg = f"Unexpected error in prompt step: {e}"
    raise ApplicationError(msg) from e
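
One way to flatten the branching is to pull the response parsing into a single helper that normalizes both shapes. A sketch, assuming both paths only need either the raw output or the tool calls (names are illustrative):

def extract_tool_calls(response, auto_tool_calls_enabled: bool):
    # Returns the tool calls to execute, or None when the step should
    # simply return the raw output unchanged.
    if auto_tool_calls_enabled:
        # New shape: a list of messages; the last may carry tool_calls.
        if not response:
            return None
        return response[-1].get("tool_calls")
    # Legacy shape: a completion object with a choices list.
    choice = response["choices"][0]
    if choice["finish_reason"] != "tool_calls":
        return None
    return choice["message"]["tool_calls"]

This also folds in the empty-list guard that the current messages[-1] access is missing.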

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Fix incorrect response and request schemas

The response schema for creating a tool should return a proper tool object, not
a string enum. The request body schema should also reference a tool creation
schema, not an agent creation schema.

src/openapi.yaml [593-617]

 /agents/{id}/tools:
   post:
     operationId: AgentToolsRoute_create
     description: Create a new resource owned by the given parent
     parameters:
       - name: id
         in: path
         required: true
         description: ID of parent resource
         schema:
           $ref: '#/components/schemas/Common.uuid'
     responses:
       '201':
         description: The request has succeeded and a new resource has been created as a result.
         content:
           application/json:
             schema:
-              type: string
-              enum:
-                - Create a new tool for this agent
+              $ref: '#/components/schemas/Tools.Tool'
     requestBody:
       required: true
       content:
         application/json:
           schema:
-            $ref: '#/components/schemas/Agents.CreateAgentRequest'
+            $ref: '#/components/schemas/Tools.CreateToolRequest'

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9

Why: The suggestion correctly identifies that the POST /agents/{id}/tools endpoint uses an incorrect request body schema (Agents.CreateAgentRequest instead of one for tools) and an incorrect response schema (a string instead of the created tool object), which are critical API definition errors.

High
Return proper tool objects in responses

The response schemas for PUT and PATCH operations should return the actual tool
object after modification, not string enums. This ensures consistent API
behavior and proper response handling.

src/openapi.yaml [619-680]

 /agents/{id}/tools/{child_id}:
   put:
     operationId: AgentToolsRoute_update
     description: Update a resource owned by the given parent using its id
     parameters: ...
     responses:
       '200':
         description: The request has succeeded.
         content:
           application/json:
             schema:
-              type: string
-              enum:
-                - Update an existing tool (overwrite existing values)
+              $ref: '#/components/schemas/Tools.Tool'
   patch:
     operationId: AgentToolsRoute_patch
     description: Patch a resource owned by the given parent using its id
     parameters: ...
     responses:
       '200':
         description: The request has succeeded.
         content:
           application/json:
             schema:
-              type: string
-              enum:
-                - Update an existing tool (merges with existing values)
+              $ref: '#/components/schemas/Tools.Tool'

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9

Why: The suggestion correctly identifies that the PUT and PATCH endpoints for tools return a placeholder string instead of the updated tool object, which is a critical error in the API definition that needs to be fixed for consistency and usability.

High
Add null check for params_schema
Suggestion Impact: The suggestion was implemented but with a different approach. Instead of using a ternary operator inline, the commit extracts the null check into separate lines with a params variable, achieving the same goal of preventing AttributeError when params_schema is None.

code diff:

+        params = {}
+        if tool.api_call.params_schema is not None:
+            params = tool.api_call.params_schema.model_dump(exclude_none=True)
         return {
             "type": "function",
             "function": {
                 "name": tool.name,
                 "description": tool.description,
-                "parameters": tool.api_call.params_schema.model_dump(exclude_none=True),
+                "parameters": params,

Add null check for params_schema before calling model_dump() to prevent
AttributeError when the schema is None. This ensures the function handles cases
where API call tools don't have parameter schemas defined.

src/agents-api/agents_api/common/utils/tool_runner.py [48-56]

 if tool.type == "api_call" and tool.api_call is not None:
     return {
         "type": "function",
         "function": {
             "name": tool.name,
             "description": tool.description,
-            "parameters": tool.api_call.params_schema.model_dump(exclude_none=True),
+            "parameters": tool.api_call.params_schema.model_dump(exclude_none=True) if tool.api_call.params_schema else {},
         },
     }

[Suggestion processed]

Suggestion importance[1-10]: 8

Why: The suggestion correctly identifies that tool.api_call.params_schema can be None, and calling model_dump() on it would cause an AttributeError, so adding a check prevents a potential crash.

Medium
Add None check before UUID conversion

Add validation to ensure the field value is not None before attempting UUID
conversion; UUID(None) raises a TypeError. This prevents runtime errors when
optional UUID fields are present but have None values.

src/agents-api/agents_api/activities/execute_system.py [62-64]

 for field in uuid_fields:
-    if field in arguments and not isinstance(arguments[field], UUID):
+    if field in arguments and arguments[field] is not None and not isinstance(arguments[field], UUID):
         arguments[field] = UUID(arguments[field])
Suggestion importance[1-10]: 7

Why: The suggestion correctly points out that arguments[field] could be None, which would cause UUID(None) to raise a TypeError, and adding a None check improves the code's robustness.

Medium
Fix external volume dependency issue

The external volume unleash_db_data may not exist on first run, causing
container startup failures. Consider making it a regular named volume or
adding initialization instructions.

src/feature-flags/docker-compose.yml [48-50]

 volumes:
   unleash_db_data:
-    external: true
+    driver: local
Suggestion importance[1-10]: 7

Why: This suggestion correctly identifies that using an external volume requires manual setup, and removing this dependency improves the ease of use for developers setting up the project.

Medium

Claude finished @Ahmad-mtos's task —— View job


Pull Request Review: Tool Calls in Tasks and Sessions

Conducting comprehensive review of PR #1453 that adds tool calling functionality to tasks and sessions.

Review Checklist:

  • [x] Read repository guidelines (CLAUDE.md/AGENTS.md)
  • [x] Analyze core tool calling implementation
  • [x] Address existing reviewer comments from whiterabbit1983
  • [x] Review security vulnerabilities (especially eval() usage)
  • [x] Check error handling and logging practices
  • [x] Assess performance implications
  • [x] Review test coverage
  • [x] Provide comprehensive feedback and recommendations

Strengths

1. Architecture & Design

  • Clean separation of concerns: tool_runner.py handles orchestration, tool_executor.py handles specific execution
  • Type safety: Extensive use of Pydantic models and beartype decorators throughout
  • Feature flag integration: Proper OpenFeature implementation for safe gradual rollout
  • Modular design: Tools are properly abstracted with clear interfaces

2. Error Handling

  • Proper exception wrapping with context preservation (IntegrationExecutionException)
  • Graceful fallbacks for unknown tools (creates dummy error responses)
  • Activity logging integration for debugging

3. Integration Points

  • Seamless integration with existing workflow execution system
  • Proper Temporal activity patterns maintained
  • LLM provider abstraction through LiteLLM

🚨 Critical Issues

1. Critical Security Vulnerability (expressions.py:90)

extra_lambdas[k] = eval(v)  # Direct eval() bypasses simpleeval sandbox!

Risk: IMMEDIATE SECURITY THREAT - Allows arbitrary code execution
Impact: Full system compromise, data exfiltration, service disruption
Status: BLOCKS MERGE
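
One possible mitigation, sketched under the assumption that the values being evaluated are lambda strings: reject any expression whose AST contains more than a small whitelist of node types, then compile it with builtins stripped. This narrows the attack surface rather than eliminating eval entirely, and the whitelist below is an assumption, not a vetted policy.

import ast

_ALLOWED = (
    ast.Expression, ast.Lambda, ast.arguments, ast.arg, ast.Name,
    ast.Load, ast.Constant, ast.BinOp, ast.Add, ast.Sub, ast.Mult,
    ast.Div, ast.Mod, ast.Compare, ast.Eq, ast.NotEq, ast.Lt, ast.LtE,
    ast.Gt, ast.GtE, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not,
    ast.USub, ast.IfExp, ast.Subscript, ast.Tuple, ast.List, ast.Dict,
)

def safe_lambda(source: str):
    # Compile a lambda string only if every AST node is whitelisted.
    tree = ast.parse(source, mode="eval")
    if not isinstance(tree.body, ast.Lambda):
        raise ValueError("expression must be a single lambda")
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
    # Empty builtins so names like __import__ cannot resolve.
    return eval(compile(tree, "<lambda>", "eval"), {"__builtins__": {}})

Note that ast.Call is deliberately absent from the whitelist, so function calls inside the lambda body are rejected.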

2. IndexError Vulnerability (workflow/task_execution/__init__.py:451)

if step.unwrap or not step.auto_run_tools or messages[-1]["tool_calls"] is None:

Issue: messages[-1] assumes non-empty list. If len(messages) == 0, raises IndexError
Fix: if step.unwrap or not step.auto_run_tools or not messages or messages[-1]["tool_calls"] is None:

3. Type Safety Issues (tool_runner.py)

Lines 54, 82: Missing null checks before accessing optional attributes

# Line 54
"parameters": tool.api_call.params_schema.model_dump(exclude_none=True),  # params_schema can be None
# Line 82  
"parameters": tool.function.parameters,  # function can be None

⚠️ Additional Security & Performance Concerns

Network Security

  • Excessive HTTP timeouts (600s) in integrations.py:24,43 could enable DoS attacks (a mitigation sketch follows this list)
  • No response size limits could cause memory exhaustion
  • No input validation on JSON payloads (tool_runner.py:131)
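
A sketch of both the timeout and size-cap mitigations using httpx; the 60-second read timeout and 10 MiB cap are assumed values, not project settings:

import httpx

MAX_RESPONSE_BYTES = 10 * 1024 * 1024  # assumed 10 MiB cap

TIMEOUT = httpx.Timeout(connect=5.0, read=60.0, write=10.0, pool=5.0)

async def call_integration(url: str, payload: dict) -> bytes:
    async with httpx.AsyncClient(timeout=TIMEOUT) as client:
        async with client.stream("POST", url, json=payload) as resp:
            resp.raise_for_status()
            body = b""
            async for chunk in resp.aiter_bytes():
                body += chunk
                if len(body) > MAX_RESPONSE_BYTES:
                    msg = "integration response exceeded size limit"
                    raise ValueError(msg)
            return body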

Performance Issues

  • Tool map recreation on every LLM call (tool_runner.py:166-172) - should cache (see the sketch after this list)
  • Sequential tool processing - could benefit from parallel execution for independent tools
  • No rate limiting on tool execution
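
For the tool map, a simple name-keyed cache is enough as long as tool definitions are stable for the lifetime of an execution; a sketch (the module-level dict and name-based key are assumptions):

from typing import Any

_TOOL_MAP_CACHE: dict[tuple[str, ...], dict[str, Any]] = {}

def get_tool_map(tools: list[Any]) -> dict[str, Any]:
    # Keyed by tool names; assumes names are unique and stable.
    key = tuple(t.name for t in tools)
    if key not in _TOOL_MAP_CACHE:
        _TOOL_MAP_CACHE[key] = {t.name: t for t in tools}
    return _TOOL_MAP_CACHE[key]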

📝 Addressing Review Comments

whiterabbit1983's Comments

1. Error logging in tool_executor.py:66-69 - Valid Concern

except Exception as e:
    if activity.in_activity():
        activity.logger.error(f"Error in execute_web_search_tool: {e}")
    raise  # Error logged but re-raised without additional context

Recommendation: Consider wrapping in custom exception with more context.
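
A sketch of that wrapping pattern; ToolExecutionError and the call-site shape are illustrative, not the PR's actual types:

from temporalio import activity

class ToolExecutionError(Exception):
    # Preserves which tool failed and chains the original exception.
    def __init__(self, tool_name: str, original: Exception):
        super().__init__(f"tool '{tool_name}' failed: {original}")
        self.tool_name = tool_name

async def run_web_search(search, query: str):
    # `search` stands in for the PR's actual web search callable.
    try:
        return await search(query)
    except Exception as e:
        if activity.in_activity():
            activity.logger.error(f"Error in execute_web_search_tool: {e}")
        raise ToolExecutionError("web_search", e) from e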

2. Async client in integrations.py:24 - Code is Correct

The code correctly uses AsyncClient, but the 600-second timeout is excessive.
Recommendation: Reduce to 30-60 seconds with retry logic.

3. F-string in integration service:38 - Minor Style Issue

The current code if m.method == method: is clear and doesn't require f-string conversion.

4. len(messages) == 0 check - Critical Issue (Addressed above)


📊 Test Coverage Analysis

Existing Tests

  • test_tool_runner.py: Good coverage of format_tool function with various tool types
  • test_prompt_step_auto_tools.py: Feature flag integration testing
  • Basic tool call construction tests

Missing Critical Tests

  1. No tests for run_llm_with_tools tool loop logic
  2. No tests for error scenarios in tool execution
  3. No tests for security boundaries (malicious inputs)
  4. No integration tests for the full tool calling pipeline
  5. No tests for tool resolution and mapping logic

🔧 Recommendations

Must Fix Before Merge (Blockers)

  1. 🚨 CRITICAL: Fix eval() usage - Replace with safe alternatives in expressions.py:90
  2. Fix IndexError - Add length check in workflow execution
  3. Add null checks - Fix type safety issues in tool_runner.py:54,82
  4. Add JSON size limits - Prevent JSON bomb attacks

Should Fix Soon (High Priority)

  1. Reduce HTTP timeouts - Configure reasonable timeout values (30-60s)
  2. Add input validation - Validate tool arguments before processing
  3. Implement response size limits - Prevent memory exhaustion
  4. Add security tests - Test malicious input handling

Nice to Have (Medium Priority)

  1. Performance optimization - Cache tool maps and consider parallel execution
  2. Enhanced logging - Add structured logging with correlation IDs
  3. Rate limiting - Implement per-user/agent tool execution limits

📈 Overall Assessment

Score: 4/10

This PR introduces valuable tool calling functionality with good architectural foundations, but has several critical security vulnerabilities and type safety issues that make it unsafe for production deployment.

Positives:

  • Well-structured architecture with clear separation of concerns
  • Proper use of feature flags for gradual rollout
  • Good integration with existing systems
  • Comprehensive test coverage for tool formatting

Critical Issues:

  • 🚨 CRITICAL SECURITY VULNERABILITY: Direct eval() usage allows arbitrary code execution
  • IndexError bug that could crash the system
  • Type safety issues that could cause runtime errors
  • Excessive timeouts that could be exploited

Recommendation: ❌ DO NOT MERGE until:

  1. Critical security vulnerability in expressions.py is fixed
  2. IndexError bug is patched
  3. Type safety issues are resolved
  4. Basic input validation is added

The direct eval() usage represents an immediate security threat that could lead to full system compromise. This must be addressed before any deployment considerations.


claude[bot] avatar Jul 01 '25 19:07 claude[bot]

Claude finished @Ahmad-mtos's task —— View job


Changelog Bot Analysis ✅

Analyzed the PR changes and determined that no new changelog entry is needed.

Todo List:

  • [x] Read diff.patch file to understand changes
  • [x] Determine if changes are user-visible
  • [x] Create or update changelog entry if needed
  • [x] Commit and push changes

Analysis:

This PR contains no new user-visible changes that require a changelog entry beyond what's already documented in the existing changelog/dev/2025-06-30.md file.

Changes in this update:

  • Added test files (test_prompt_step_auto_tools.py, test_tool_runner.py)
  • Added feature flag client implementation
  • Minor fixes and refactoring

User-visible features already documented:

  • ✅ Tool calling functionality
  • ✅ New prompt step parameters (tool_choice, unwrap, auto_run_tools, disable_cache)
  • ✅ New session parameters (auto_run_tools, forward_tool_calls)
  • ✅ Anthropic tool support (computer_20241022, text_editor_20241022, bash_20241022)
  • ✅ Feature flags integration using Unleash

Conclusion: The existing changelog entry from June 30, 2025 comprehensively covers all user-visible changes in this PR. No additional changelog entry is needed.

github-actions[bot] avatar Jul 01 '25 19:07 github-actions[bot]
