
feat(agents-api, integrations-service): tool calls in tasks and sessions

Open Ahmad-mtos opened this issue 5 months ago • 1 comment

  • feat(agents-api): add convert_to_openai_tool function for integration tools
  • refactor: Lint agents-api (CI)
  • fix(integrations-service): refactor integration tool naming on conversion
  • feat(agents-api): add tool_runner.py that handles tool calling
  • chore(agents-api): refactor prompt_step.py to use tool_runner functionalities
  • lint
  • refactor: Lint agents-api (CI)

Ahmad-mtos avatar May 24 '25 09:05 Ahmad-mtos

Claude finished @Ahmad-mtos's task —— View job


Changelog Bot Analysis ✅

Analyzed the PR changes and created a changelog entry for user-visible features.

Todo List:

  • [x] Read diff.patch file to understand changes
  • [x] Determine if changes are user-visible
  • [x] Create changelog entry if needed
  • [x] Commit and push changes

Analysis:

This PR introduces significant user-visible changes related to tool calling functionality:

New Features Added:

  • Tool calling functionality in tasks and sessions
  • Integration tools conversion to OpenAI-compatible format
  • New tool runner system for handling tool execution
  • Enhanced prompt steps with tool runner capabilities

Actions Taken:

Created changelog entry: changelog/dev/2025-06-30.md
Committed changes: 226bee8

The changelog entry documents these user-facing improvements to tool calling capabilities in the Julep platform.


github-actions[bot] avatar Jun 30 '25 12:06 github-actions[bot]

Claude finished @Ahmad-mtos's task —— View job


Pull Request Review: Tool Calls in Tasks and Sessions

I've completed a comprehensive review of this PR that adds tool calling functionality to tasks and sessions. Here's my detailed analysis:

Review Checklist:

  • [x] Analyze core tool calling implementation (tool_runner.py, tool_executor.py)
  • [x] Review integration service changes and tool conversion logic
  • [x] Examine security considerations for tool execution
  • [x] Check error handling and logging practices
  • [x] Assess performance implications
  • [x] Review feature flags integration
  • [x] Analyze API changes and backwards compatibility
  • [x] Check test coverage and missing tests
  • [x] Address existing review comments from whiterabbit1983

Strengths

1. Architecture & Design

  • Clean separation of concerns: tool_runner.py handles orchestration, tool_executor.py handles specific execution
  • Type safety: Extensive use of Pydantic models and beartype decorators throughout
  • Feature flag integration: Proper OpenFeature implementation for safe gradual rollout
  • Modular design: Tools are properly abstracted with clear interfaces

2. Error Handling

  • Proper exception wrapping with context preservation (IntegrationExecutionException)
  • Graceful fallbacks for unknown tools (creates dummy error responses)
  • Activity logging integration for debugging

3. Integration Points

  • Seamless integration with existing workflow execution system
  • Proper Temporal activity patterns maintained
  • LLM provider abstraction through LiteLLM

⚠️ Issues & Concerns

1. Critical Issues

Potential Index Error in Workflow (workflow/task_execution/__init__.py:451)

if step.unwrap or not step.auto_run_tools or messages[-1]["tool_calls"] is None:

Issue: messages[-1] assumes messages list is non-empty. If len(messages) == 0, this will raise an IndexError.
Fix: Add length check before accessing last element.
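
A minimal guard, echoing the one-liner suggested later in this thread (the early-return body follows the surrounding workflow code):

# short-circuit before indexing a possibly-empty list
if step.unwrap or not step.auto_run_tools or not messages or messages[-1]["tool_calls"] is None:
    return WorkflowResult(state=PartialTransition(output=messages))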

Error Re-Raised Without Added Context (tool_executor.py:66-69)

except Exception as e:
    if activity.in_activity():
        activity.logger.error(f"Error in execute_web_search_tool: {e}")
    raise

Issue: Error is logged but then re-raised without additional context.
Recommendation: Consider whether to add structured error handling or wrap in custom exception.
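
One possible shape, assuming a project-level ToolExecutionError (a hypothetical name) so callers can tell which tool failed while the original traceback is preserved:

from temporalio import activity

class ToolExecutionError(Exception):
    """Hypothetical wrapper recording which tool failed."""

    def __init__(self, tool_name: str, original: Exception) -> None:
        super().__init__(f"{tool_name} failed: {original}")
        self.tool_name = tool_name

async def run_search(arguments: dict):  # stand-in for the real search call
    ...

async def execute_web_search_tool(arguments: dict):
    try:
        return await run_search(arguments)
    except Exception as e:
        if activity.in_activity():
            activity.logger.error(f"Error in execute_web_search_tool: {e}")
        raise ToolExecutionError("web_search", e) from e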

Excessive HTTP Timeout (integrations.py:24)

async with AsyncClient(timeout=600) as client:

Issue: Code is correct but timeout of 600 seconds (10 minutes) is very long.
Recommendation: Consider shorter timeout with retry logic.
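
A sketch of that retry shape; the 60-second timeout and three attempts are assumptions, not project settings:

import asyncio
from httpx import AsyncClient, HTTPError

async def call_integration(url: str, payload: dict, retries: int = 3) -> dict:
    for attempt in range(retries):
        try:
            async with AsyncClient(timeout=60) as client:
                resp = await client.post(url, json=payload)
                resp.raise_for_status()
                return resp.json()
        except HTTPError:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # back off 1s, then 2s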

2. Security Concerns

Tool Execution Security

  • Input validation: Tool arguments are passed through JSON parsing but need additional validation
  • Sandbox concerns: simpleeval mentioned in comments but not applied to all tool inputs
  • Setup/arguments merging: Multiple sources merged without clear precedence (tool_runner.py:136-140)

Integration Service Calls

  • Timeout vulnerability: 600-second timeout could be exploited for DoS
  • Error information leakage: Full error messages from integrations might expose sensitive info

3. Performance Issues

Tool Resolution Inefficiency (tool_runner.py:166-172)

tool_map = {}
for t in tools:
    if t.type == "integration" and t.integration is not None:
        tool_map[t.integration.provider] = t
    else:
        tool_map[t.name] = t

Issue: Tool map recreation on every LLM call instead of caching.
Impact: O(n) lookup cost for each tool resolution.
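
The simplest fix is to hoist the construction out of the per-call path and reuse the dict across LLM iterations; a sketch using the same field names as the snippet above:

def build_tool_map(tools) -> dict:
    # built once per run, not once per LLM call
    tool_map = {}
    for t in tools:
        if t.type == "integration" and t.integration is not None:
            tool_map[t.integration.provider] = t
        else:
            tool_map[t.name] = t
    return tool_map

tool_map = build_tool_map(tools)  # pass this into the tool-call loop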

Synchronous Tool Processing (tool_runner.py:185-208)

Issue: Tools are processed sequentially in the loop.
Recommendation: Consider parallel execution for independent tool calls.
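
Assuming the calls are independent and an async execute_tool_call helper exists (a hypothetical name standing in for the per-tool dispatch), the fan-out could look like:

import asyncio

async def execute_tool_call(call, tool_map):  # hypothetical per-tool dispatch
    ...

async def run_tool_calls(tool_calls, tool_map):
    # results come back in input order; exceptions are returned,
    # not raised, so one failing tool does not cancel the rest
    return await asyncio.gather(
        *(execute_tool_call(call, tool_map) for call in tool_calls),
        return_exceptions=True,
    )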

4. Code Quality Issues

Inconsistent Error Handling

  • Some functions use custom exceptions, others use generic exceptions
  • Error context not always preserved in chain

String Formatting Consistency (integrations/get_integration_tool.py:39)

if m.method == method:

Issue: Should use f-string for consistency (addresses whiterabbit1983's comment).

Type Annotation Gaps

  • tool_runner.py:137: setup variable type could be more specific
  • Return types in some functions could be more precise

🔍 Test Coverage Analysis

Missing Tests

  1. No tests found for tool_runner.py functionality
  2. No tests for run_llm_with_tools tool loop logic
  3. No tests for feature flag integration with tools
  4. No tests for error scenarios in tool execution
  5. No integration tests for the full tool calling pipeline

Test Recommendations

# Suggested test areas:
- test_tool_runner.py: Unit tests for format_tool, run_llm_with_tools
- test_tool_execution_integration.py: End-to-end tool calling tests
- test_feature_flags_tools.py: Feature flag behavior with tools
- test_tool_error_handling.py: Error scenarios and recovery

📝 Addressing Review Comments

1. whiterabbit1983's Comments

Error logging in tool_executor.py - Addressed Above

Status: Valid concern - error is logged but context could be improved

Async client in integrations.py - Code is Correct

Status: Code already uses AsyncClient correctly

F-string in integration service - Minor Issue

Fix: Change to f-string for consistency

len(messages) == 0 check - Critical Issue

Status: Major issue that needs immediate fix


🚀 Recommendations

High Priority

  1. Fix IndexError: Add length check before messages[-1] access
  2. Add comprehensive tests: At minimum, unit tests for core tool runner logic
  3. Security review: Validate all tool inputs and implement proper sandboxing
  4. Performance optimization: Cache tool maps and consider parallel execution

Medium Priority

  1. Error handling standardization: Use consistent exception types across modules
  2. Timeout configuration: Make integration timeouts configurable
  3. Logging improvements: Add structured logging with correlation IDs

Low Priority

  1. Code style: Convert remaining string formatting to f-strings
  2. Type annotations: Improve specificity of return types
  3. Documentation: Add docstrings for public methods

📊 Overall Assessment

Score: 7/10

This PR introduces valuable tool calling functionality with good architectural principles. However, it has several critical issues that need addressing before merge:

  • Critical bugs (IndexError) must be fixed
  • Test coverage is insufficient for such a complex feature
  • Security implications need careful review
  • Performance optimizations would benefit production usage

The feature flag approach for gradual rollout is excellent and shows good production awareness.

Recommendation: Address critical issues and add basic test coverage before merging.


claude[bot] avatar Jun 30 '25 12:06 claude[bot]


Claude finished @Ahmad-mtos's task —— View job


Pull Request Review: Tool Calls in Tasks and Sessions

I've completed a comprehensive review of this PR that adds tool calling functionality to tasks and sessions. Here's my detailed analysis:

Review Checklist:

  • [x] Analyze core tool calling implementation (tool_runner.py, tool_executor.py)
  • [x] Review integration service changes and tool conversion logic
  • [x] Examine security considerations for tool execution
  • [x] Check error handling and logging practices
  • [x] Assess performance implications
  • [x] Review feature flags integration
  • [x] Analyze API changes and backwards compatibility
  • [x] Check test coverage and missing tests
  • [x] Address existing review comments from whiterabbit1983

Strengths

1. Architecture & Design

  • Clean separation of concerns: tool_runner.py handles orchestration, tool_executor.py handles specific execution
  • Type safety: Extensive use of Pydantic models and beartype decorators throughout
  • Feature flag integration: Proper OpenFeature implementation for safe gradual rollout
  • Modular design: Tools are properly abstracted with clear interfaces

2. Error Handling

  • Proper exception wrapping with context preservation (IntegrationExecutionException)
  • Graceful fallbacks for unknown tools (creates dummy error responses)
  • Activity logging integration for debugging

3. Integration Points

  • Seamless integration with existing workflow execution system
  • Proper Temporal activity patterns maintained
  • LLM provider abstraction through LiteLLM

⚠️ Critical Issues

1. IndexError Vulnerability (workflow/task_execution/__init__.py:451)

if step.unwrap or not step.auto_run_tools or messages[-1]["tool_calls"] is None:

Issue: messages[-1] assumes messages list is non-empty. If len(messages) == 0, this will raise an IndexError.
Fix: Add length check before accessing last element:

if step.unwrap or not step.auto_run_tools or not messages or messages[-1]["tool_calls"] is None:

2. Security Vulnerabilities

Critical: Direct eval() Usage (expressions.py:76-90)

extra_lambdas[k] = eval(v)  # Bypasses simpleeval sandbox!

Risk: Allows arbitrary code execution, completely undermining the sandbox.
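
One possible replacement, assuming the configured strings are single-argument lambdas whose bodies are plain expressions; the body is evaluated inside the simpleeval sandbox instead of the interpreter (a sketch, not the project's implementation):

from simpleeval import simple_eval

def make_safe_lambda(source: str):
    # keep only the expression body of "lambda x: <expr>" and bind
    # the single argument by name inside the sandbox
    _, _, body = source.partition(":")
    def fn(x):
        return simple_eval(body.strip(), names={"x": x})
    return fn

# usage at the call site shown above:
# extra_lambdas[k] = make_safe_lambda(v)   instead of eval(v)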

JSON Bomb Vulnerability (tool_runner.py:131)

arguments_str = json.loads(call.function.arguments)

Risk: No size limits on JSON parsing could lead to memory exhaustion attacks.
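
A cheap guard before parsing; the 64 KB cap is an arbitrary assumption to tune against real payload sizes:

import json

MAX_ARGS_BYTES = 64 * 1024  # assumed cap

def parse_tool_arguments(raw: str) -> dict:
    if len(raw.encode("utf-8")) > MAX_ARGS_BYTES:
        raise ValueError(f"tool arguments exceed {MAX_ARGS_BYTES} bytes")
    return json.loads(raw)

# arguments = parse_tool_arguments(call.function.arguments)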

Argument Merging Without Validation (execute_integration.py:53,63)

arguments = arguments | (integration.arguments or {}) | merged_tool_args.get(tool_name, {})

Risk: Malicious arguments could override legitimate ones without validation.

3. Performance Issues

Tool Resolution Inefficiency (tool_runner.py:166-172)

tool_map = {}
for t in tools:
    if t.type == "integration" and t.integration is not None:
        tool_map[t.integration.provider] = t
    else:
        tool_map[t.name] = t

Issue: Tool map recreation on every LLM call instead of caching.

Sequential Tool Processing (tool_runner.py:185-208)

Issue: Tools are processed sequentially in the loop.
Recommendation: Consider parallel execution for independent tool calls.


🔍 Addressing Review Comments

whiterabbit1983's Comments

1. Error logging in tool_executor.py - Valid Concern

except Exception as e:
    if activity.in_activity():
        activity.logger.error(f"Error in execute_web_search_tool: {e}")
    raise

Issue: Error is logged but context could be improved with structured logging.

2. Async client in integrations.py - Code is Correct

The code correctly uses AsyncClient, but the 600-second timeout is excessive:

async with AsyncClient(timeout=600) as client:  # 10 minutes!

Recommendation: Reduce to 30-60 seconds with retry logic.

3. F-string in integration service - Minor Style Issue

Line 39 was flagged for f-string conversion, but the comparison is a plain equality check and is already clear:

# Current
if m.method == method:

No conversion is needed; wrapping either side in an f-string would add nothing.

4. len(messages) == 0 check - Critical Issue (Addressed above)


🚨 Additional Security Concerns

Network Security

  • Excessive HTTP timeouts (10 minutes) could enable DoS attacks
  • No response size limits could cause memory exhaustion (see the sketch after this list)
  • Error information disclosure may leak sensitive details
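
For the response-size point, a sketch that streams the body through httpx and aborts past a hard cap (the 1 MB limit is an assumption):

from httpx import AsyncClient

MAX_BODY = 1024 * 1024  # assumed 1 MB cap

async def fetch_bounded(client: AsyncClient, url: str) -> bytes:
    async with client.stream("GET", url) as resp:
        resp.raise_for_status()
        body = b""
        async for chunk in resp.aiter_bytes():
            body += chunk
            if len(body) > MAX_BODY:
                raise ValueError("integration response exceeded size limit")
        return body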

Input Validation

  • Missing size limits on tool arguments and setup parameters
  • Insufficient sanitization of integration responses
  • Weak timeout enforcement in expression evaluation

📊 Test Coverage Analysis

Existing Tests

  • test_tool_queries.py: CRUD operations for tool management
  • test_tool_call_step.py: Basic tool call construction and formatting

Missing Critical Tests

  1. No tests for tool_runner.py functionality
  2. No tests for run_llm_with_tools tool loop logic
  3. No tests for feature flag integration with tools
  4. No tests for error scenarios in tool execution
  5. No integration tests for the full tool calling pipeline
  6. No security tests for malicious input handling

Test Recommendations

# Suggested new test files:
- tests/test_tool_runner.py: Core tool runner functionality
- tests/test_tool_security.py: Security vulnerability tests
- tests/test_tool_integration.py: End-to-end tool calling tests
- tests/test_tool_error_handling.py: Error scenarios and recovery

🔧 Recommendations

Must Fix Before Merge (Blockers)

  1. Fix IndexError: Add messages length check in workflow execution
  2. Remove eval() usage: Replace direct eval with safe alternatives
  3. Add JSON size limits: Prevent JSON bomb attacks
  4. Add basic test coverage: At minimum, unit tests for core tool runner logic

Should Fix Soon (High Priority)

  1. Reduce HTTP timeouts: Configure reasonable timeout values
  2. Implement response size limits: Prevent memory exhaustion
  3. Add input validation: Validate tool arguments before processing
  4. Improve error handling: Sanitize error messages to prevent information disclosure

Nice to Have (Medium Priority)

  1. Performance optimization: Cache tool maps and consider parallel execution
  2. Enhanced logging: Add structured logging with correlation IDs
  3. Rate limiting: Implement per-user/agent tool execution limits

📈 Overall Assessment

Score: 6/10

This PR introduces valuable tool calling functionality with good architectural foundations, but has several critical security vulnerabilities and bugs that must be addressed before merging.

Positives:

  • Well-structured architecture with clear separation of concerns
  • Proper use of feature flags for gradual rollout
  • Good integration with existing systems
  • Type safety through Pydantic models

Critical Issues:

  • Security vulnerabilities that could lead to arbitrary code execution
  • IndexError bug that could crash the system
  • Insufficient test coverage for such a complex feature
  • Performance inefficiencies that could impact production

Recommendation: DO NOT MERGE until critical security vulnerabilities and the IndexError bug are fixed, and basic test coverage is added.


claude[bot] avatar Jun 30 '25 15:06 claude[bot]

Claude finished @Ahmad-mtos's task —— View job


Changelog Bot Analysis ✅

Analyzed the PR changes and updated the changelog entry with comprehensive details about new user-visible features.

Todo List:

  • [x] Read diff.patch file to understand changes
  • [x] Determine if changes are user-visible
  • [x] Update changelog entry with detailed features
  • [x] Commit and push changes

Analysis:

This PR introduces significant user-visible changes related to tool calling functionality:

New Features Added:

  • New prompt step parameters: tool_choice, unwrap, auto_run_tools, disable_cache
  • Session parameters: auto_run_tools and forward_tool_calls
  • Support for Anthropic tool types: computer, text editor, and bash tools
  • Feature flags integration using Unleash for controlled rollouts
  • Enhanced tool runner system and OpenAI-compatible tool conversion

Actions Taken:

Updated changelog entry: changelog/dev/2025-06-30.md
Committed detailed changes: 0e4d1e9

The updated changelog entry now provides users with specific details about all the new tool calling parameters and capabilities introduced in this PR.


github-actions[bot] avatar Jun 30 '25 15:06 github-actions[bot]

Claude finished @Ahmad-mtos's task —— View job


Pull Request Review: Tool Calls in Tasks and Sessions

I've completed a comprehensive review of this PR that adds tool calling functionality to tasks and sessions. Here's my detailed analysis:

Review Checklist:

  • [x] Read repository guidelines (CLAUDE.md/AGENTS.md)
  • [x] Analyze core tool calling implementation (tool_runner.py, tool_executor.py)
  • [x] Review integration service changes and tool conversion logic
  • [x] Examine security considerations for tool execution
  • [x] Check error handling and logging practices
  • [x] Assess performance implications
  • [x] Review feature flags integration
  • [x] Analyze API changes and backwards compatibility
  • [x] Check test coverage and missing tests
  • [x] Address existing review comments from whiterabbit1983

Strengths

1. Architecture & Design

  • Clean separation of concerns: tool_runner.py handles orchestration, tool_executor.py handles specific execution
  • Type safety: Extensive use of Pydantic models and beartype decorators throughout
  • Feature flag integration: Proper OpenFeature implementation for safe gradual rollout
  • Modular design: Tools are properly abstracted with clear interfaces

2. Error Handling

  • Proper exception wrapping with context preservation (IntegrationExecutionException)
  • Graceful fallbacks for unknown tools (creates dummy error responses)
  • Activity logging integration for debugging

3. Integration Points

  • Seamless integration with existing workflow execution system
  • Proper Temporal activity patterns maintained
  • LLM provider abstraction through LiteLLM

🚨 Critical Security Vulnerabilities

1. CRITICAL: Direct eval() Usage (expressions.py:90)

extra_lambdas[k] = eval(v)  # Bypasses simpleeval sandbox!

Risk: IMMEDIATE SECURITY THREAT - Allows arbitrary code execution, completely undermining the sandbox.
Impact: Full system compromise, data exfiltration, service disruption
Status: BLOCKS MERGE - This must be fixed before deployment

2. IndexError Vulnerability (workflow/task_execution/__init__.py:451)

if step.unwrap or not step.auto_run_tools or messages[-1]["tool_calls"] is None:

Issue: messages[-1] assumes messages list is non-empty. If len(messages) == 0, this will raise an IndexError.
Fix: Add length check: if step.unwrap or not step.auto_run_tools or not messages or messages[-1]["tool_calls"] is None:

3. JSON Bomb Vulnerability (tool_runner.py:131)

arguments_str = json.loads(call.function.arguments)

Risk: No size limits on JSON parsing could lead to memory exhaustion attacks.


⚠️ Additional Security Concerns

Argument Merging Without Validation (execute_integration.py:53,63)

arguments = arguments | (integration.arguments or {}) | merged_tool_args.get(tool_name, {})
setup = setup | integration_setup | merged_tool_setup.get(tool_name, {})

Risk: Malicious arguments could override legitimate ones without validation.

Excessive Timeouts (integrations.py:24,43)

async with AsyncClient(timeout=600) as client:  # 10 minutes!

Risk: Could enable DoS attacks; should be reduced to 30-60 seconds with retry logic.


📝 Addressing Review Comments

whiterabbit1983's Comments

1. Error logging in tool_executor.py - Valid Concern

Status: Error is logged but context could be improved with structured logging.

2. Async client in integrations.py - Code is Correct

Status: Code correctly uses AsyncClient, but timeout is excessive (600s → 30-60s recommended).

3. F-string in integration service - Minor Style Issue

Line 38: Code is already clear and doesn't require f-string conversion.

4. len(messages) == 0 check - Critical Issue

Status: Confirmed critical IndexError vulnerability that could crash the system.


🔍 Performance Issues

Tool Resolution Inefficiency (tool_runner.py:166-172)

tool_map = {}
for t in tools:
    if t.type == "integration" and t.integration is not None:
        tool_map[t.integration.provider] = t
    else:
        tool_map[t.name] = t

Issue: Tool map recreation on every LLM call instead of caching.
Impact: O(n) lookup cost for each tool resolution.

Sequential Tool Processing (tool_runner.py:185-208)

Issue: Tools are processed sequentially in the loop.
Recommendation: Consider parallel execution for independent tool calls.


📊 Test Coverage Analysis

Existing Tests

  • test_tool_queries.py: CRUD operations for tool management
  • test_tool_call_step.py: Basic tool call construction and formatting

Missing Critical Tests

  1. No tests for tool_runner.py functionality
  2. No tests for run_llm_with_tools tool loop logic
  3. No tests for feature flag integration with tools
  4. No tests for error scenarios in tool execution
  5. No integration tests for the full tool calling pipeline
  6. No security tests for malicious input handling

Test Recommendations

# Suggested new test files:
- tests/test_tool_runner.py: Core tool runner functionality
- tests/test_tool_security.py: Security vulnerability tests
- tests/test_tool_integration.py: End-to-end tool calling tests
- tests/test_tool_error_handling.py: Error scenarios and recovery

🔧 Recommendations

Must Fix Before Merge (Blockers)

  1. 🚨 CRITICAL: Remove eval() usage - Replace direct eval with safe alternatives in expressions.py:90
  2. Fix IndexError - Add messages length check in workflow execution
  3. Add JSON size limits - Prevent JSON bomb attacks
  4. Add basic test coverage - At minimum, unit tests for core tool runner logic

Should Fix Soon (High Priority)

  1. Reduce HTTP timeouts - Configure reasonable timeout values (30-60s)
  2. Implement response size limits - Prevent memory exhaustion
  3. Add input validation - Validate tool arguments before processing
  4. Improve error handling - Sanitize error messages to prevent information disclosure

Nice to Have (Medium Priority)

  1. Performance optimization - Cache tool maps and consider parallel execution
  2. Enhanced logging - Add structured logging with correlation IDs
  3. Rate limiting - Implement per-user/agent tool execution limits

📈 Overall Assessment

Score: 4/10 (Updated from previous reviews)

This PR introduces valuable tool calling functionality with good architectural foundations, but has several critical security vulnerabilities that make it unsafe for production deployment.

Positives:

  • Well-structured architecture with clear separation of concerns
  • Proper use of feature flags for gradual rollout
  • Good integration with existing systems
  • Type safety through Pydantic models

Critical Issues:

  • 🚨 CRITICAL SECURITY VULNERABILITY: Direct eval() usage allows arbitrary code execution
  • IndexError bug that could crash the system
  • Insufficient test coverage for such a complex feature
  • Performance inefficiencies that could impact production

Recommendation: ❌ DO NOT MERGE until the critical security vulnerability in expressions.py is fixed, the IndexError bug is resolved, and basic test coverage is added.

The direct eval() usage represents an immediate security threat that could lead to full system compromise. This must be addressed before any deployment considerations.


claude[bot] avatar Jun 30 '25 15:06 claude[bot]

Claude finished @Ahmad-mtos's task —— View job


Changelog Bot Analysis ✅

Analyzed the PR changes and updated the changelog entry with comprehensive details about new user-visible features.

Todo List:

  • [x] Read diff.patch file to understand changes
  • [x] Determine if changes are user-visible
  • [x] Update changelog entry with detailed features
  • [x] Commit and push changes

Analysis:

This PR introduces significant user-visible changes related to tool calling functionality:

New Features Added:

  • New prompt step parameters: tool_choice, unwrap, auto_run_tools, disable_cache
  • Session parameters: auto_run_tools and forward_tool_calls
  • Support for Anthropic tool types: computer, text editor, and bash tools
  • Feature flags integration using Unleash for controlled rollouts
  • Enhanced tool runner system and OpenAI-compatible tool conversion

Actions Taken:

Updated changelog entry: changelog/dev/2025-06-30.md
Committed detailed changes: 7244770

The updated changelog entry now provides users with specific details about all the new tool calling parameters and capabilities introduced in this PR.


github-actions[bot] avatar Jun 30 '25 15:06 github-actions[bot]

Claude finished @Ahmad-mtos's task —— View job


Pull Request Review: Tool Calls in Tasks and Sessions

Conducting comprehensive review of PR #1453 that adds tool calling functionality to tasks and sessions.

Review Checklist:

  • [x] Read repository guidelines (CLAUDE.md/AGENTS.md)
  • [x] Analyze core tool calling implementation
  • [x] Address CI type checking failures
  • [x] Review security vulnerabilities (especially eval() usage)
  • [x] Check error handling and logging practices
  • [x] Assess performance implications
  • [x] Review test coverage
  • [x] Address existing reviewer comments from whiterabbit1983
  • [x] Provide comprehensive feedback and recommendations

Strengths

1. Architecture & Design

  • Clean separation of concerns: tool_runner.py handles orchestration, tool_executor.py handles specific execution
  • Type safety: Extensive use of Pydantic models and beartype decorators throughout
  • Feature flag integration: Proper OpenFeature implementation for safe gradual rollout
  • Modular design: Tools are properly abstracted with clear interfaces

2. Error Handling

  • Proper exception wrapping with context preservation (IntegrationExecutionException)
  • Graceful fallbacks for unknown tools (creates dummy error responses)
  • Activity logging integration for debugging

3. Integration Points

  • Seamless integration with existing workflow execution system
  • Proper Temporal activity patterns maintained
  • LLM provider abstraction through LiteLLM

🚨 Critical Issues

1. CI Type Checking Failures

The build is currently failing with several type errors that must be fixed:

feature_flags.py:329, 340

# Line 329
result = self._client.get_boolean_value(...)  # _client can be None
# Line 340
return default_value  # Expected bool, got Optional[bool]

Fix: Add null checks and ensure return type is always bool.
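
A minimal shape covering both errors, using the names from the snippet above (the surrounding class and client API are assumed):

def get_boolean_flag(self, flag_name: str, default_value: bool) -> bool:
    if self._client is None:  # initialization failed; fall back
        return default_value
    result = self._client.get_boolean_value(flag_name, default_value)
    return default_value if result is None else bool(result)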

tool_runner.py:54, 82

# Line 54
"parameters": tool.api_call.params_schema.model_dump(exclude_none=True),  # params_schema can be None
# Line 82  
"parameters": tool.function.parameters,  # function can be None

Fix: Add null checks before accessing attributes.
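
Guarded versions of the two expressions, assuming an empty schema is an acceptable fallback when the optional attribute is absent:

# line 54
"parameters": (
    tool.api_call.params_schema.model_dump(exclude_none=True)
    if tool.api_call and tool.api_call.params_schema
    else {}
),
# line 82
"parameters": tool.function.parameters if tool.function else {},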

2. Critical Security Vulnerability (expressions.py:90)

extra_lambdas[k] = eval(v)  # Direct eval() bypasses simpleeval sandbox!

Risk: IMMEDIATE SECURITY THREAT - Allows arbitrary code execution
Impact: Full system compromise, data exfiltration, service disruption
Status: BLOCKS MERGE

3. IndexError Vulnerability (workflow/task_execution/__init__.py:451)

if step.unwrap or not step.auto_run_tools or messages[-1]["tool_calls"] is None:

Issue: messages[-1] assumes non-empty list. If len(messages) == 0, raises IndexError
Fix: if step.unwrap or not step.auto_run_tools or not messages or messages[-1]["tool_calls"] is None:


⚠️ Additional Security & Performance Concerns

Network Security

  • Excessive HTTP timeouts (600s) in integrations.py:24,43 could enable DoS attacks
  • No response size limits could cause memory exhaustion
  • No input validation on JSON payloads (tool_runner.py:131)

Performance Issues

  • Tool map recreation on every LLM call (tool_runner.py:166-172) - should cache
  • Sequential tool processing - could benefit from parallel execution for independent tools
  • No rate limiting on tool execution

📝 Addressing Review Comments

whiterabbit1983's Comments

1. Error logging in tool_executor.py:66-69 - Valid Concern

except Exception as e:
    if activity.in_activity():
        activity.logger.error(f"Error in execute_web_search_tool: {e}")
    raise  # Error logged but re-raised without additional context

Recommendation: Consider wrapping in custom exception with more context.

2. Async client in integrations.py:24 - Code is Correct

Code correctly uses AsyncClient, but 600-second timeout is excessive.
Recommendation: Reduce to 30-60 seconds with retry logic.

3. F-string in integration service:38 - Minor Style Issue

The current code if m.method == method: is clear and doesn't require f-string conversion.

4. len(messages) == 0 check - Critical Issue (Addressed above)


📊 Test Coverage Analysis

Existing Tests

  • test_tool_runner.py: Good coverage of format_tool function with various tool types
  • test_tool_queries.py: CRUD operations for tool management
  • test_tool_call_step.py: Basic tool call construction

Missing Critical Tests

  1. No tests for run_llm_with_tools tool loop logic
  2. No tests for feature flag integration with tools
  3. No tests for error scenarios in tool execution
  4. No tests for security boundaries (malicious inputs)
  5. No integration tests for the full tool calling pipeline
  6. No tests for tool resolution and mapping logic

🔧 Recommendations

Must Fix Before Merge (Blockers)

  1. 🚨 CRITICAL: Fix eval() usage - Replace with safe alternatives in expressions.py:90
  2. Fix CI type errors - Add null checks in feature_flags.py and tool_runner.py
  3. Fix IndexError - Add length check in workflow execution
  4. Add JSON size limits - Prevent JSON bomb attacks

Should Fix Soon (High Priority)

  1. Reduce HTTP timeouts - Configure reasonable timeout values (30-60s)
  2. Add input validation - Validate tool arguments before processing
  3. Implement response size limits - Prevent memory exhaustion
  4. Add security tests - Test malicious input handling

Nice to Have (Medium Priority)

  1. Performance optimization - Cache tool maps and consider parallel execution
  2. Enhanced logging - Add structured logging with correlation IDs
  3. Rate limiting - Implement per-user/agent tool execution limits

📈 Overall Assessment

Score: 4/10

This PR introduces valuable tool calling functionality with good architectural foundations, but has several critical security vulnerabilities and type safety issues that make it unsafe for production deployment.

Positives:

  • Well-structured architecture with clear separation of concerns
  • Proper use of feature flags for gradual rollout
  • Good integration with existing systems
  • Comprehensive test coverage for tool formatting

Critical Issues:

  • 🚨 CRITICAL SECURITY VULNERABILITY: Direct eval() usage allows arbitrary code execution
  • CI failures due to type checking errors that break the build
  • IndexError bug that could crash the system
  • Excessive timeouts that could be exploited

Recommendation: ❌ DO NOT MERGE until:

  1. Critical security vulnerability in expressions.py is fixed
  2. CI type checking errors are resolved
  3. IndexError bug is patched
  4. Basic input validation is added

The direct eval() usage represents an immediate security threat that could lead to full system compromise. This must be addressed before any deployment considerations.


claude[bot] avatar Jul 01 '25 18:07 claude[bot]

Claude finished @Ahmad-mtos's task —— View job


Changelog Bot Analysis ✅

Analyzed the PR changes and determined that no new changelog entry is needed.

Todo List:

  • [x] Read diff.patch file to understand changes
  • [x] Determine if changes are user-visible
  • [x] Create or update changelog entry if needed
  • [x] Commit and push changes

Analysis:

This PR contains no new user-visible changes that require a changelog entry beyond what's already documented in the existing changelog/dev/2025-06-30.md file.

Changes in this update:

  • Added test files (test_prompt_step_auto_tools.py, test_tool_runner.py)
  • Added feature flag client implementation
  • Minor fixes and refactoring

User-visible features already documented:

  • ✅ Tool calling functionality
  • ✅ New prompt step parameters (tool_choice, unwrap, auto_run_tools, disable_cache)
  • ✅ New session parameters (auto_run_tools, forward_tool_calls)
  • ✅ Anthropic tool support (computer_20241022, text_editor_20241022, bash_20241022)
  • ✅ Feature flags integration using Unleash

Conclusion: The existing changelog entry from June 30, 2025 comprehensively covers all user-visible changes in this PR. No additional changelog entry is needed.


github-actions[bot] avatar Jul 01 '25 18:07 github-actions[bot]

Claude finished @Ahmad-mtos's task —— View job


Pull Request Review: Tool Calls in Tasks and Sessions

Conducting comprehensive review of PR #1453 that adds tool calling functionality to tasks and sessions.

Review Checklist:

  • [x] Read repository guidelines (AGENTS.md/CLAUDE.md)
  • [x] Analyze core tool calling implementation
  • [x] Address existing reviewer comments from whiterabbit1983
  • [x] Review security vulnerabilities (especially eval() usage)
  • [x] Check error handling and logging practices
  • [x] Assess performance implications
  • [x] Review test coverage
  • [x] Provide comprehensive feedback and recommendations

Strengths

1. Architecture & Design

  • Clean separation of concerns: tool_runner.py handles orchestration, tool_executor.py handles specific execution
  • Type safety: Extensive use of Pydantic models and beartype decorators throughout
  • Feature flag integration: Proper OpenFeature implementation for safe gradual rollout
  • Modular design: Tools are properly abstracted with clear interfaces

2. Error Handling

  • Proper exception wrapping with context preservation (IntegrationExecutionException)
  • Graceful fallbacks for unknown tools (creates dummy error responses)
  • Activity logging integration for debugging

3. Integration Points

  • Seamless integration with existing workflow execution system
  • Proper Temporal activity patterns maintained
  • LLM provider abstraction through LiteLLM

🚨 Critical Issues

1. Critical Security Vulnerability (expressions.py:90)

extra_lambdas[k] = eval(v)  # Direct eval() bypasses simpleeval sandbox!

Risk: IMMEDIATE SECURITY THREAT - Allows arbitrary code execution
Impact: Full system compromise, data exfiltration, service disruption
Status: BLOCKS MERGE

2. IndexError Vulnerability (workflow/task_execution/__init__.py:451)

if step.unwrap or not step.auto_run_tools or messages[-1]["tool_calls"] is None:

Issue: messages[-1] assumes non-empty list. If len(messages) == 0, raises IndexError
Fix: if step.unwrap or not step.auto_run_tools or not messages or messages[-1]["tool_calls"] is None:

3. Type Safety Issues (tool_runner.py)

Lines 54, 82: Missing null checks before accessing optional attributes

# Line 54
"parameters": tool.api_call.params_schema.model_dump(exclude_none=True),  # params_schema can be None
# Line 82  
"parameters": tool.function.parameters,  # function can be None

⚠️ Additional Security & Performance Concerns

Network Security

  • Excessive HTTP timeouts (600s) in integrations.py:24,43 could enable DoS attacks
  • No response size limits could cause memory exhaustion
  • No input validation on JSON payloads (tool_runner.py:131)

Performance Issues

  • Tool map recreation on every LLM call (tool_runner.py:166-172) - should cache
  • Sequential tool processing - could benefit from parallel execution for independent tools
  • No rate limiting on tool execution

📝 Addressing Review Comments

whiterabbit1983's Comments

1. Error logging in tool_executor.py:66-69 - Valid Concern

except Exception as e:
    if activity.in_activity():
        activity.logger.error(f"Error in execute_web_search_tool: {e}")
    raise  # Error logged but re-raised without additional context

Recommendation: Consider wrapping in custom exception with more context.

2. Async client in integrations.py:24 - Code is Correct

Code correctly uses AsyncClient, but 600-second timeout is excessive.
Recommendation: Reduce to 30-60 seconds with retry logic.

3. F-string in integration service:38 - Minor Style Issue

The current code if m.method == method: is clear and doesn't require f-string conversion.

4. len(messages) == 0 check - Critical Issue (Addressed above)


📊 Test Coverage Analysis

Existing Tests

  • test_tool_runner.py: Good coverage of format_tool function with various tool types
  • test_prompt_step_auto_tools.py: Feature flag integration testing
  • Basic tool call construction tests

Missing Critical Tests

  1. No tests for run_llm_with_tools tool loop logic
  2. No tests for error scenarios in tool execution
  3. No tests for security boundaries (malicious inputs)
  4. No integration tests for the full tool calling pipeline
  5. No tests for tool resolution and mapping logic

🔧 Recommendations

Must Fix Before Merge (Blockers)

  1. 🚨 CRITICAL: Fix eval() usage - Replace with safe alternatives in expressions.py:90
  2. Fix IndexError - Add length check in workflow execution
  3. Add null checks - Fix type safety issues in tool_runner.py:54,82
  4. Add JSON size limits - Prevent JSON bomb attacks

Should Fix Soon (High Priority)

  1. Reduce HTTP timeouts - Configure reasonable timeout values (30-60s)
  2. Add input validation - Validate tool arguments before processing
  3. Implement response size limits - Prevent memory exhaustion
  4. Add security tests - Test malicious input handling

Nice to Have (Medium Priority)

  1. Performance optimization - Cache tool maps and consider parallel execution
  2. Enhanced logging - Add structured logging with correlation IDs
  3. Rate limiting - Implement per-user/agent tool execution limits

📈 Overall Assessment

Score: 4/10

This PR introduces valuable tool calling functionality with good architectural foundations, but has several critical security vulnerabilities and type safety issues that make it unsafe for production deployment.

Positives:

  • Well-structured architecture with clear separation of concerns
  • Proper use of feature flags for gradual rollout
  • Good integration with existing systems
  • Comprehensive test coverage for tool formatting

Critical Issues:

  • 🚨 CRITICAL SECURITY VULNERABILITY: Direct eval() usage allows arbitrary code execution
  • IndexError bug that could crash the system
  • Type safety issues that could cause runtime errors
  • Excessive timeouts that could be exploited

Recommendation: ❌ DO NOT MERGE until:

  1. Critical security vulnerability in expressions.py is fixed
  2. IndexError bug is patched
  3. Type safety issues are resolved
  4. Basic input validation is added

The direct eval() usage represents an immediate security threat that could lead to full system compromise. This must be addressed before any deployment considerations.


claude[bot] avatar Jul 01 '25 18:07 claude[bot]

Claude encountered an error —— View job


Changelog Bot Analysis

Analyzing the PR changes to determine if a changelog entry is needed.

Todo List:

  • [ ] Read diff.patch file to understand changes
  • [ ] Determine if changes are user-visible
  • [ ] Create or update changelog entry if needed
  • [ ] Commit and push changes

github-actions[bot] avatar Jul 01 '25 18:07 github-actions[bot]


PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 Security concerns

Sensitive information exposure:
The feature flags configuration includes API tokens and URLs that are passed through environment variables. While this is standard practice, ensure that the unleash_api_token and other credentials are properly secured and not logged in plain text. The custom headers built in the UnleashProvider._initialize_unleash() method include the API token directly, so it could be exposed in debug logs.

⚡ Recommended focus areas for review

Error Handling

The feature flag client has extensive error handling but may silently fall back to defaults in production scenarios where proper error visibility is needed. The initialization errors are logged but the client continues operating with potentially unexpected behavior.

except ImportError:
    logger.warning(
        "⚠️ UnleashClient not available, feature flags will use default values"
    )
    self._unleash_client = None
except Exception as e:
    logger.error(f"❌ Failed to initialize Unleash client: {e}")
    logger.warning("⚠️ Feature flags will use default values")
    self._unleash_client = None

Type Safety

The convert_litellm_to_chosen_tool_call function performs JSON parsing and dictionary manipulation that could fail with malformed input. The function assumes specific structure in tool_spec without proper validation.

arguments_str = json.loads(call.function.arguments)

tool_spec = tool.model_dump()
if "id" in tool_spec:
    tool_spec.pop("id")

setup = getattr(tool_spec, f"{tool.type}", {}).get("setup", {})
if setup:
    setup = setup.model_dump()
tool_spec[f"{tool.type}"]["setup"] = setup

# TODO: add computer_20241022, text_editor_20241022, bash_20241022
tool_spec[f"{tool.type}"]["arguments"] = arguments_str

return BaseChosenToolCall(
    id=call.id,
    **tool_spec,
)
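
A defensive version of the parsing step, assuming malformed arguments should surface as one clear error instead of a KeyError further down (a sketch, not the PR's code):

import json

try:
    arguments = json.loads(call.function.arguments)
except json.JSONDecodeError as e:
    raise ValueError(f"malformed tool arguments for call {call.id}: {e}") from e

tool_spec = tool.model_dump()
tool_spec.pop("id", None)

if tool_spec.get(tool.type) is None:
    raise ValueError(f"tool spec has no section for type {tool.type!r}")
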
Logic Complexity

The prompt step handling logic has become complex with feature flag branching and different response structure handling. The nested try-catch blocks and multiple conditional paths increase the risk of edge case bugs.

try:
    if get_feature_flag_value(
        "auto_tool_calls_prompt_step",
        developer_id=self.context.execution_input.developer_id,
    ):
        if step.unwrap or not step.auto_run_tools or messages[-1]["tool_calls"] is None:
            workflow.logger.debug(f"Prompt step: Received response: {messages}")
            return WorkflowResult(state=PartialTransition(output=messages))

        # TODO: make sure to include filtered function tool calls in the last message, or filter them from 2nd message
        tool_calls_input = messages[-1]["tool_calls"]
    else:
        message = messages
        if (
            step.unwrap
            or not step.auto_run_tools
            or message["choices"][0]["finish_reason"] != "tool_calls"
        ):
            workflow.logger.debug(f"Prompt step: Received response: {message}")
            return WorkflowResult(state=PartialTransition(output=message))

        choice = message["choices"][0]
        tool_calls_input = choice["message"]["tool_calls"]
except (KeyError, IndexError, TypeError) as e:
    workflow.logger.error(f"Prompt step: Error parsing response structure: {e}")
    msg = f"Invalid response structure in prompt step: {e}"
    raise ApplicationError(msg) from e
except Exception as e:
    workflow.logger.error(f"Prompt step: Unexpected error: {e}")
    msg = f"Unexpected error in prompt step: {e}"
    raise ApplicationError(msg) from e
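
One way to flatten the branching is to pull the response parsing into a single helper that normalizes both shapes. A sketch, assuming both paths only need either the raw output or the tool calls (names are illustrative):

def extract_tool_calls(response, auto_tool_calls_enabled: bool):
    # Returns the tool calls to execute, or None when the step should
    # simply return the raw output unchanged.
    if auto_tool_calls_enabled:
        # New shape: a list of messages; the last may carry tool_calls.
        if not response:
            return None
        return response[-1].get("tool_calls")
    # Legacy shape: a completion object with a choices list.
    choice = response["choices"][0]
    if choice["finish_reason"] != "tool_calls":
        return None
    return choice["message"]["tool_calls"]

This also folds in the empty-list guard that the current messages[-1] access is missing.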

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Fix incorrect response and request schemas

The response schema for creating a tool should return a proper tool object, not
a string enum. The request body schema should also reference a tool creation
schema, not an agent creation schema.

src/openapi.yaml [593-617]

 /agents/{id}/tools:
   post:
     operationId: AgentToolsRoute_create
     description: Create a new resource owned by the given parent
     parameters:
       - name: id
         in: path
         required: true
         description: ID of parent resource
         schema:
           $ref: '#/components/schemas/Common.uuid'
     responses:
       '201':
         description: The request has succeeded and a new resource has been created as a result.
         content:
           application/json:
             schema:
-              type: string
-              enum:
-                - Create a new tool for this agent
+              $ref: '#/components/schemas/Tools.Tool'
     requestBody:
       required: true
       content:
         application/json:
           schema:
-            $ref: '#/components/schemas/Agents.CreateAgentRequest'
+            $ref: '#/components/schemas/Tools.CreateToolRequest'

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9

Why: The suggestion correctly identifies that the POST /agents/{id}/tools endpoint uses an incorrect request body schema (Agents.CreateAgentRequest instead of one for tools) and an incorrect response schema (a string instead of the created tool object), which are critical API definition errors.

High
Return proper tool objects in responses

The response schemas for PUT and PATCH operations should return the actual tool
object after modification, not string enums. This ensures consistent API
behavior and proper response handling.

src/openapi.yaml [619-680]

 /agents/{id}/tools/{child_id}:
   put:
     operationId: AgentToolsRoute_update
     description: Update a resource owned by the given parent using its id
     parameters: ...
     responses:
       '200':
         description: The request has succeeded.
         content:
           application/json:
             schema:
-              type: string
-              enum:
-                - Update an existing tool (overwrite existing values)
+              $ref: '#/components/schemas/Tools.Tool'
   patch:
     operationId: AgentToolsRoute_patch
     description: Patch a resource owned by the given parent using its id
     parameters: ...
     responses:
       '200':
         description: The request has succeeded.
         content:
           application/json:
             schema:
-              type: string
-              enum:
-                - Update an existing tool (merges with existing values)
+              $ref: '#/components/schemas/Tools.Tool'

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9

Why: The suggestion correctly identifies that the PUT and PATCH endpoints for tools return a placeholder string instead of the updated tool object, which is a critical error in the API definition that needs to be fixed for consistency and usability.

High
Add null check for params_schema
Suggestion Impact: The suggestion was implemented but with a different approach. Instead of using a ternary operator inline, the commit extracts the null check into separate lines with a params variable, achieving the same goal of preventing AttributeError when params_schema is None.

code diff:

+        params = {}
+        if tool.api_call.params_schema is not None:
+            params = tool.api_call.params_schema.model_dump(exclude_none=True)
         return {
             "type": "function",
             "function": {
                 "name": tool.name,
                 "description": tool.description,
-                "parameters": tool.api_call.params_schema.model_dump(exclude_none=True),
+                "parameters": params,

Add null check for params_schema before calling model_dump() to prevent
AttributeError when the schema is None. This ensures the function handles cases
where API call tools don't have parameter schemas defined.

src/agents-api/agents_api/common/utils/tool_runner.py [48-56]

 if tool.type == "api_call" and tool.api_call is not None:
     return {
         "type": "function",
         "function": {
             "name": tool.name,
             "description": tool.description,
-            "parameters": tool.api_call.params_schema.model_dump(exclude_none=True),
+            "parameters": tool.api_call.params_schema.model_dump(exclude_none=True) if tool.api_call.params_schema else {},
         },
     }

[Suggestion processed]

Suggestion importance[1-10]: 8

Why: The suggestion correctly identifies that tool.api_call.params_schema can be None, and calling model_dump() on it would cause an AttributeError, so adding a check prevents a potential crash.

Medium
Add None check before UUID conversion

Add validation to ensure the field value is not None before attempting UUID
conversion; UUID(None) raises a TypeError. This prevents runtime errors when
optional UUID fields are present but have None values.

src/agents-api/agents_api/activities/execute_system.py [62-64]

 for field in uuid_fields:
-    if field in arguments and not isinstance(arguments[field], UUID):
+    if field in arguments and arguments[field] is not None and not isinstance(arguments[field], UUID):
         arguments[field] = UUID(arguments[field])
Suggestion importance[1-10]: 7

Why: The suggestion correctly points out that arguments[field] could be None, which would cause UUID(None) to raise a TypeError, and adding a None check improves the code's robustness.

Medium
Fix external volume dependency issue

The external volume unleash_db_data may not exist on first run, causing
container startup failures. Consider making it a regular named volume or
adding initialization instructions.

src/feature-flags/docker-compose.yml [48-50]

 volumes:
   unleash_db_data:
-    external: true
+    driver: local
Suggestion importance[1-10]: 7

Why: This suggestion correctly identifies that using an external volume requires manual setup, and removing this dependency improves the ease of use for developers setting up the project.

Medium

Claude finished @Ahmad-mtos's task —— View job


Pull Request Review: Tool Calls in Tasks and Sessions

Conducting comprehensive review of PR #1453 that adds tool calling functionality to tasks and sessions.

Review Checklist:

  • [x] Read repository guidelines (CLAUDE.md/AGENTS.md)
  • [x] Analyze core tool calling implementation
  • [x] Address existing reviewer comments from whiterabbit1983
  • [x] Review security vulnerabilities (especially eval() usage)
  • [x] Check error handling and logging practices
  • [x] Assess performance implications
  • [x] Review test coverage
  • [x] Provide comprehensive feedback and recommendations

Strengths

1. Architecture & Design

  • Clean separation of concerns: tool_runner.py handles orchestration, tool_executor.py handles specific execution
  • Type safety: Extensive use of Pydantic models and beartype decorators throughout
  • Feature flag integration: Proper OpenFeature implementation for safe gradual rollout
  • Modular design: Tools are properly abstracted with clear interfaces

2. Error Handling

  • Proper exception wrapping with context preservation (IntegrationExecutionException)
  • Graceful fallbacks for unknown tools (creates dummy error responses)
  • Activity logging integration for debugging

3. Integration Points

  • Seamless integration with existing workflow execution system
  • Proper Temporal activity patterns maintained
  • LLM provider abstraction through LiteLLM

🚨 Critical Issues

1. Critical Security Vulnerability (expressions.py:90)

extra_lambdas[k] = eval(v)  # Direct eval() bypasses simpleeval sandbox!

Risk: IMMEDIATE SECURITY THREAT - Allows arbitrary code execution
Impact: Full system compromise, data exfiltration, service disruption
Status: BLOCKS MERGE
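
One possible mitigation, sketched under the assumption that the values being evaluated are lambda strings: reject any expression whose AST contains more than a small whitelist of node types, then compile it with builtins stripped. This narrows the attack surface rather than eliminating eval entirely, and the whitelist below is an assumption, not a vetted policy.

import ast

_ALLOWED = (
    ast.Expression, ast.Lambda, ast.arguments, ast.arg, ast.Name,
    ast.Load, ast.Constant, ast.BinOp, ast.Add, ast.Sub, ast.Mult,
    ast.Div, ast.Mod, ast.Compare, ast.Eq, ast.NotEq, ast.Lt, ast.LtE,
    ast.Gt, ast.GtE, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not,
    ast.USub, ast.IfExp, ast.Subscript, ast.Tuple, ast.List, ast.Dict,
)

def safe_lambda(source: str):
    # Compile a lambda string only if every AST node is whitelisted.
    tree = ast.parse(source, mode="eval")
    if not isinstance(tree.body, ast.Lambda):
        raise ValueError("expression must be a single lambda")
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
    # Empty builtins so names like __import__ cannot resolve.
    return eval(compile(tree, "<lambda>", "eval"), {"__builtins__": {}})

Note that ast.Call is deliberately absent from the whitelist, so function calls inside the lambda body are rejected.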

2. IndexError Vulnerability (workflow/task_execution/__init__.py:451)

if step.unwrap or not step.auto_run_tools or messages[-1]["tool_calls"] is None:

Issue: messages[-1] assumes non-empty list. If len(messages) == 0, raises IndexError
Fix: if step.unwrap or not step.auto_run_tools or not messages or messages[-1]["tool_calls"] is None:

3. Type Safety Issues (tool_runner.py)

Lines 54, 82: Missing null checks before accessing optional attributes

# Line 54
"parameters": tool.api_call.params_schema.model_dump(exclude_none=True),  # params_schema can be None
# Line 82  
"parameters": tool.function.parameters,  # function can be None

⚠️ Additional Security & Performance Concerns

Network Security

  • Excessive HTTP timeouts (600s) in integrations.py:24,43 could enable DoS attacks (a mitigation sketch follows this list)
  • No response size limits could cause memory exhaustion
  • No input validation on JSON payloads (tool_runner.py:131)
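
A sketch of both the timeout and size-cap mitigations using httpx; the 60-second read timeout and 10 MiB cap are assumed values, not project settings:

import httpx

MAX_RESPONSE_BYTES = 10 * 1024 * 1024  # assumed 10 MiB cap

TIMEOUT = httpx.Timeout(connect=5.0, read=60.0, write=10.0, pool=5.0)

async def call_integration(url: str, payload: dict) -> bytes:
    async with httpx.AsyncClient(timeout=TIMEOUT) as client:
        async with client.stream("POST", url, json=payload) as resp:
            resp.raise_for_status()
            body = b""
            async for chunk in resp.aiter_bytes():
                body += chunk
                if len(body) > MAX_RESPONSE_BYTES:
                    msg = "integration response exceeded size limit"
                    raise ValueError(msg)
            return body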

Performance Issues

  • Tool map recreation on every LLM call (tool_runner.py:166-172) - should cache (see the sketch after this list)
  • Sequential tool processing - could benefit from parallel execution for independent tools
  • No rate limiting on tool execution
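
For the tool map, a simple name-keyed cache is enough as long as tool definitions are stable for the lifetime of an execution; a sketch (the module-level dict and name-based key are assumptions):

from typing import Any

_TOOL_MAP_CACHE: dict[tuple[str, ...], dict[str, Any]] = {}

def get_tool_map(tools: list[Any]) -> dict[str, Any]:
    # Keyed by tool names; assumes names are unique and stable.
    key = tuple(t.name for t in tools)
    if key not in _TOOL_MAP_CACHE:
        _TOOL_MAP_CACHE[key] = {t.name: t for t in tools}
    return _TOOL_MAP_CACHE[key]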

📝 Addressing Review Comments

whiterabbit1983's Comments

1. Error logging in tool_executor.py:66-69 - Valid Concern

except Exception as e:
    if activity.in_activity():
        activity.logger.error(f"Error in execute_web_search_tool: {e}")
    raise  # Error logged but re-raised without additional context

Recommendation: Consider wrapping in custom exception with more context.
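
A sketch of that wrapping pattern; ToolExecutionError and the call-site shape are illustrative, not the PR's actual types:

from temporalio import activity

class ToolExecutionError(Exception):
    # Preserves which tool failed and chains the original exception.
    def __init__(self, tool_name: str, original: Exception):
        super().__init__(f"tool '{tool_name}' failed: {original}")
        self.tool_name = tool_name

async def run_web_search(search, query: str):
    # `search` stands in for the PR's actual web search callable.
    try:
        return await search(query)
    except Exception as e:
        if activity.in_activity():
            activity.logger.error(f"Error in execute_web_search_tool: {e}")
        raise ToolExecutionError("web_search", e) from e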

2. Async client in integrations.py:24 - Code is Correct

The code correctly uses AsyncClient, but the 600-second timeout is excessive.
Recommendation: Reduce to 30-60 seconds with retry logic.

3. F-string in integration service:38 - Minor Style Issue

The current code if m.method == method: is clear and doesn't require f-string conversion.

4. len(messages) == 0 check - Critical Issue (Addressed above)


📊 Test Coverage Analysis

Existing Tests

  • test_tool_runner.py: Good coverage of format_tool function with various tool types
  • test_prompt_step_auto_tools.py: Feature flag integration testing
  • Basic tool call construction tests

Missing Critical Tests

  1. No tests for run_llm_with_tools tool loop logic
  2. No tests for error scenarios in tool execution
  3. No tests for security boundaries (malicious inputs)
  4. No integration tests for the full tool calling pipeline
  5. No tests for tool resolution and mapping logic

🔧 Recommendations

Must Fix Before Merge (Blockers)

  1. 🚨 CRITICAL: Fix eval() usage - Replace with safe alternatives in expressions.py:90
  2. Fix IndexError - Add length check in workflow execution
  3. Add null checks - Fix type safety issues in tool_runner.py:54,82
  4. Add JSON size limits - Prevent JSON bomb attacks

Should Fix Soon (High Priority)

  1. Reduce HTTP timeouts - Configure reasonable timeout values (30-60s)
  2. Add input validation - Validate tool arguments before processing
  3. Implement response size limits - Prevent memory exhaustion
  4. Add security tests - Test malicious input handling

Nice to Have (Medium Priority)

  1. Performance optimization - Cache tool maps and consider parallel execution
  2. Enhanced logging - Add structured logging with correlation IDs
  3. Rate limiting - Implement per-user/agent tool execution limits

📈 Overall Assessment

Score: 4/10

This PR introduces valuable tool calling functionality with good architectural foundations, but has several critical security vulnerabilities and type safety issues that make it unsafe for production deployment.

Positives:

  • Well-structured architecture with clear separation of concerns
  • Proper use of feature flags for gradual rollout
  • Good integration with existing systems
  • Comprehensive test coverage for tool formatting

Critical Issues:

  • 🚨 CRITICAL SECURITY VULNERABILITY: Direct eval() usage allows arbitrary code execution
  • IndexError bug that could crash the system
  • Type safety issues that could cause runtime errors
  • Excessive timeouts that could be exploited

Recommendation: ❌ DO NOT MERGE until:

  1. Critical security vulnerability in expressions.py is fixed
  2. IndexError bug is patched
  3. Type safety issues are resolved
  4. Basic input validation is added

The direct eval() usage represents an immediate security threat that could lead to full system compromise. This must be addressed before any deployment considerations.


claude[bot] avatar Jul 01 '25 19:07 claude[bot]

Claude finished @Ahmad-mtos's task —— View job


Changelog Bot Analysis ✅

Analyzed the PR changes and determined that no new changelog entry is needed.

Todo List:

  • [x] Read diff.patch file to understand changes
  • [x] Determine if changes are user-visible
  • [x] Create or update changelog entry if needed
  • [x] Commit and push changes

Analysis:

This PR contains no new user-visible changes that require a changelog entry beyond what's already documented in the existing changelog/dev/2025-06-30.md file.

Changes in this update:

  • Added test files (test_prompt_step_auto_tools.py, test_tool_runner.py)
  • Added feature flag client implementation
  • Minor fixes and refactoring

User-visible features already documented:

  • ✅ Tool calling functionality
  • ✅ New prompt step parameters (tool_choice, unwrap, auto_run_tools, disable_cache)
  • ✅ New session parameters (auto_run_tools, forward_tool_calls)
  • ✅ Anthropic tool support (computer_20241022, text_editor_20241022, bash_20241022)
  • ✅ Feature flags integration using Unleash

Conclusion: The existing changelog entry from June 30, 2025 comprehensively covers all user-visible changes in this PR. No additional changelog entry is needed.

github-actions[bot] avatar Jul 01 '25 19:07 github-actions[bot]
