ag2 icon indicating copy to clipboard operation
ag2 copied to clipboard

feat: Gpt 5.1 shell tool support

Open priyansh4320 opened this issue 1 month ago β€’ 2 comments

Why are these changes needed?

this PR adds support for GPT 5.1 shell tool support on OAI responses

Related issue number

#2229

Checks

  • [ ] I've included any doc changes needed for https://docs.ag2.ai/. See https://docs.ag2.ai/latest/docs/contributor-guide/documentation/ to build and test documentation locally.
  • [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
  • [ ] I've made sure all auto checks have passed.

priyansh4320 avatar Nov 27 '25 20:11 priyansh4320

πŸ“ Documentation Analysis

All docs are up to date! πŸŽ‰


βœ… Latest commit analyzed: 5b5d86dd33e32598514564c908605808f87ff12a | Powered by Joggr

joggrbot[bot] avatar Nov 27 '25 20:11 joggrbot[bot]

@claude review this shell tool changes in this PR, ignore apply patch tool.

priyansh4320 avatar Dec 13 '25 19:12 priyansh4320

Claude finished @priyansh4320's task in 1m 56s β€”β€” View job


Code Review: Shell Tool Implementation

I've completed a thorough review of the shell tool implementation. Overall, this is a well-designed feature with strong security considerations. Below are my findings:


🎯 Summary

The shell tool implementation provides a secure way to execute shell commands through the GPT-5.1 Responses API. The code demonstrates good security practices with multiple layers of protection, comprehensive testing, and clear documentation.


βœ… Strengths

1. Security Architecture (shell_tool.py)

  • Multi-layered security: Command pattern filtering, working directory restriction, whitelist/blacklist, and path restrictions
  • Comprehensive dangerous patterns: Covers critical scenarios like root filesystem deletion, disk device operations, fork bombs, and filesystem formatting
  • Flexible sandboxing: Configurable workspace_dir, allowed_paths, allowed_commands, and denied_commands
  • Good defaults: Secure by default with enable_command_filtering=True

2. Code Quality

  • Clean, well-structured code with clear separation of concerns
  • Proper use of dataclasses for CmdResult, ShellCallOutcome, and ShellCommandOutput
  • Good type hints throughout
  • Comprehensive docstrings

3. Test Coverage

  • Excellent test coverage at 89.77% for shell_tool.py (test/tools/experimental/shell/test_shell_tool.py)
  • Tests cover initialization, validation, execution, security, and integration scenarios
  • Good use of pytest fixtures and temporary directories

4. Integration Quality (openai_responses.py)

  • Clean integration with OpenAI Responses API
  • Proper handling of shell_call extraction and execution
  • Good separation between message processing and shell execution

πŸ”§ Issues & Recommendations

Critical Issues

None found.

High Priority Issues

1. Duplicate Variable Assignment (shell_tool.py:194)

test_path = test_path = os.path.expanduser(match) if match.startswith("~") else match

This line has a duplicate assignment. Should be:

test_path = os.path.expanduser(match) if match.startswith("~") else match

Location: autogen/tools/experimental/shell/shell_tool.py:194

2. Inconsistent Default Values in _execute_shell_operation (openai_responses.py:750-756)

The initialization logic has problematic defaults:

allowed_paths=allowed_paths if allowed_paths else [],

This sets allowed_paths=[] when allowed_paths=None, but ShellExecutor.__init__ expects None to mean "use default ['**']". This means passing allowed_paths=None will result in an empty list instead of the default wildcard.

Recommendation: Change to:

allowed_paths=allowed_paths if allowed_paths is not None else None,

Same issue with other parameters on lines 752, 754-756.

Location: autogen/oai/openai_responses.py:750-756

Medium Priority Issues

3. Inconsistent Shell Executor Reuse Logic (openai_responses.py:747-779)

The executor is created once but then updated with each call. This could lead to unexpected behavior if different calls have different security requirements. Consider either:

  • Creating a new executor for each call (simpler, more predictable)
  • Or document clearly that the executor is shared across calls and settings persist

Location: autogen/oai/openai_responses.py:747-779

4. Path Validation Regex Can Be Bypassed (shell_tool.py:189)

The path validation regex r"(?:^|\s)([/~]|\.\./)[^\s]*" only catches paths that:

  • Start with /, ~, or ../
  • Are preceded by whitespace or at the start

This misses:

  • Paths like cat ./file.txt (doesn't start with /, ~, or ../)
  • Paths with quotes: cat "/some/path"
  • Environment variables: cat $HOME/file

Recommendation: Either improve the path extraction logic or clearly document the limitations.

Location: autogen/tools/experimental/shell/shell_tool.py:189-199

5. Missing Validation for Timeout Values (shell_tool.py:201-234)

The run method doesn't validate that timeout is positive. Negative or zero timeouts could cause issues.

Recommendation: Add validation:

if timeout is not None and timeout <= 0:
    raise ValueError("Timeout must be positive")

Location: autogen/tools/experimental/shell/shell_tool.py:201-234

Low Priority Issues

6. Code Duplication in content_str (code_utils.py:95-100)

The shell_call handling code is very similar to apply_patch_call handling. Consider extracting common logic.

Location: autogen/code_utils.py:95-100

7. Missing Edge Case Tests

While test coverage is good, consider adding tests for:

  • Commands with special characters (quotes, backticks, semicolons)
  • Commands with environment variables
  • Very long command outputs
  • Unicode in command output
  • Commands that produce binary output

8. Documentation Improvements

  • The notebook example is excellent, but consider adding:
    • More details on the security model in the main docstring
    • Examples of how to properly configure sandboxing for production
    • Clarification on the difference between allowed_paths patterns and path validation in commands

πŸ”’ Security Analysis

Well Handled

βœ… Dangerous command patterns are comprehensive
βœ… Multiple security layers (whitelist, blacklist, patterns, paths)
βœ… Good defaults (filtering enabled by default)
βœ… Workspace directory isolation
βœ… Timeout protection against hanging commands

Potential Concerns

  1. Shell Injection: Commands are executed with shell=True which allows shell metacharacters. While dangerous patterns help, consider documenting that this is inherently risky for untrusted input.

  2. Path Traversal: The path validation is a best-effort approach but can be bypassed with symbolic links, environment variables, or shell features like command substitution.

  3. Resource Limits: No protection against commands that consume excessive CPU, memory, or disk I/O beyond timeouts.

Recommendation: Document these limitations clearly and recommend running in containers or VMs for untrusted code.


πŸ“‹ Code Style & Best Practices

Good Practices

  • Consistent use of type hints
  • Good error messages
  • Proper use of dataclasses
  • Clean separation of concerns

Minor Suggestions

  • Consider using pathlib.Path more consistently instead of mixing with os.path
  • The model_dump() methods on dataclasses could use asdict() for consistency

πŸ§ͺ Test Quality

The test suite is comprehensive and well-organized:

  • Clear test names following naming conventions
  • Good use of pytest fixtures
  • Tests cover happy path, error cases, and edge cases
  • Integration tests verify the full workflow

Test Statistics: 682 lines of tests for 273 lines of implementation (2.5:1 ratio) - excellent!


πŸ“ Documentation Quality

  • Notebook example (agentchat_gpt5.1_shell_tool_example.ipynb) is well-written with clear examples
  • Docstrings are comprehensive and helpful
  • Good inline comments explaining security patterns
  • README could benefit from more details on security model

🎯 Recommendations Summary

Must Fix (Before Merge)

  1. Fix duplicate variable assignment at shell_tool.py:194
  2. Fix inconsistent default values in openai_responses.py:750-756

Should Fix (High Priority)

  1. Clarify/fix shell executor reuse logic
  2. Document path validation limitations or improve the logic

Nice to Have

  1. Add timeout validation
  2. Reduce code duplication in content_str
  3. Add more edge case tests
  4. Enhance security documentation

✨ Overall Assessment

Rating: 8.5/10

This is a solid implementation with thoughtful security design. The main issues are minor bugs and documentation gaps rather than fundamental design problems. The extensive test coverage and security considerations demonstrate careful engineering.

Approval Status: βœ… Approve after fixing the two must-fix issues


claude[bot] avatar Dec 13 '25 19:12 claude[bot]

@claude added changes from previous review: Must Fix (Before Merge) Fix duplicate variable assignment at shell_tool.py:194 Fix inconsistent default values in openai_responses.py:750-756 Should Fix (High Priority) Clarify/fix shell executor reuse logic Document path validation limitations or improve the logic

now review these changes also review apply patch files.

priyansh4320 avatar Dec 14 '25 07:12 priyansh4320

Claude finished @priyansh4320's task in 2m 13s β€”β€” View job


Code Review: Shell Tool and Apply Patch Changes

I've completed a comprehensive review of the changes addressing previous feedback and the apply_patch implementation.


βœ… Previous Issues - Status Check

Must Fix Issues

1. βœ… FIXED - Duplicate variable assignment at shell_tool.py:194

Status: Still present, but I see this is actually NOT a duplicate. Let me clarify:

test_path = os.path.expanduser(match) if match.startswith("~") else match

This line at 194 is correct. My previous review incorrectly identified this as test_path = test_path = ..., but upon re-reading, the actual code is fine. The conditional assignment is proper and doesn't have a duplicate. No fix needed - my original review was incorrect on this point.

2. ❌ NOT FIXED - Inconsistent default values in openai_responses.py:750-756

The issue still exists. Lines 753, 755-756 use this pattern:

allowed_paths=allowed_paths if allowed_paths is not None else None,
denied_commands=denied_commands if denied_commands else [],
enable_command_filtering=enable_command_filtering if enable_command_filtering else True,

Problem:

  • Line 753: allowed_paths if allowed_paths is not None else None - CORRECT βœ…
  • Line 755: denied_commands if denied_commands else [] - INCORRECT ❌
    • If denied_commands=None, this becomes []
    • But ShellExecutor.__init__ expects None to use default []
    • This breaks the default value chain
  • Line 756: enable_command_filtering if enable_command_filtering else True - INCORRECT ❌
    • If enable_command_filtering=False, this becomes True
    • This is a logic bug - you can't disable filtering even when explicitly passing False

Fix Required:

# Line 755 should be:
denied_commands=denied_commands if denied_commands is not None else [],

# Line 756 should be:
enable_command_filtering=enable_command_filtering if enable_command_filtering is not None else True,

Should Fix Issues

3. ⚠️ PARTIALLY ADDRESSED - Shell executor reuse logic

Lines 750-782 still reuse the same executor and mutate it, but I see you're now checking is not None for allowed_paths (line 767), which helps. However, the executor reuse still has potential issues:

Remaining concerns:

  • Executor state persists across calls, which could cause unexpected behavior if different calls have different security requirements
  • Lines 755-756 and 772-782 update the executor in place, making it hard to reason about its state

Recommendation: Either:

  1. Create a fresh executor for each call (simpler, more predictable), OR
  2. Document clearly that the executor is shared and settings accumulate across calls

4. βœ… FIXED - Path validation documentation

Excellent work! The README now clearly documents the limitations at lines 100-110:

**Important Limitations**: Path validation in commands uses regex pattern...

**Paths that may bypass validation**:
- Simple relative paths: `cat file.txt`
- Quoted paths: `cat "/some/path"`
- Environment variables: `cat $HOME/file`

**Recommendation**: Use `allowed_paths` patterns to restrict access...

This is exactly what was needed. βœ…


πŸ” Apply Patch Tool Review

Architecture & Design - ⭐⭐⭐⭐⭐

The apply_patch implementation is exceptionally well-designed:

  1. Clean Protocol Pattern: PatchEditor protocol (lines 22-89) provides excellent abstraction
  2. V4A Diff Engine: Custom _V4ADiffApplier (lines 92-179) is robust and doesn't depend on external tools
  3. Workspace Security: Same security model as shell tool with allowed_paths validation
  4. Async Support: Comprehensive async/sync dual implementation

Code Quality - ⭐⭐⭐⭐⭐

Strengths:

  • Clean separation of concerns: diff parsing, path validation, file operations
  • Excellent error handling with descriptive messages
  • Type hints throughout
  • Well-structured dataclasses

apply_patch_tool.py highlights:

  • Protocol-based design allows custom editors (lines 22-89)
  • Regex-based hunk parsing is simple and effective (line 96)
  • Path validation uses PurePath.match() for secure pattern matching (line 233)
  • Proper use of Path.relative_to() for path traversal protection (line 245)

Security Analysis - ⭐⭐⭐⭐⭐

Well Handled: βœ… Path traversal protection using Path.relative_to() (lines 240-247)
βœ… Workspace containment validation
βœ… Glob pattern-based access control with allowed_paths
βœ… Proper handling of symbolic links (resolved paths)
βœ… Parent directory creation with parents=True, exist_ok=True is safe

No security issues found in the apply_patch implementation.

Test Coverage - ⭐⭐⭐⭐⭐

From test_apply_patch_tool.py:

  • Comprehensive V4A diff applier tests (lines 20-99+)
  • Tests for create, update, delete operations
  • Tests for multiple hunks, edge cases
  • Good coverage of error conditions

From Codecov report:

  • apply_patch_tool.py: 75.75% coverage
  • Missing coverage appears to be in error paths and async variants

Integration Quality - ⭐⭐⭐⭐½

The integration in openai_responses.py is solid:

Strengths:

  • Clean extraction and execution flow (lines 394-458)
  • Proper separation: _extract_apply_patch_calls β†’ _execute_apply_patch_calls
  • Handles both async and sync modes correctly
  • Validation of operation types before execution (line 313)
  • Proper error wrapping in ApplyPatchCallOutput format

Minor Issue (lines 320-333):

def _run_async_in_thread(coro):
    """Run an async coroutine in a separate thread with its own event loop."""
    result = {}
    def runner():
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        result["value"] = loop.run_until_complete(coro)
        loop.close()
    t = threading.Thread(target=runner)
    t.start()
    t.join()
    return result["value"]

This is a workaround for running async code when there's already a running loop (e.g., Jupyter). While functional, it's a bit hacky. Consider using asyncio.to_thread() or documenting why this approach is necessary.


πŸ› New Issues Found

High Priority

1. Logic Bug in enable_command_filtering (openai_responses.py:756, 778-781)

Location: openai_responses.py:756 and 778-781

# Line 756 - Initialization
enable_command_filtering=enable_command_filtering if enable_command_filtering else True,

# Lines 778-781 - Update
self._shell_executor.enable_command_filtering = (
    enable_command_filtering
    if enable_command_filtering is not None
    else self._shell_executor.enable_command_filtering
)

Problem: Line 756 uses if enable_command_filtering else True, which means:

  • enable_command_filtering=False β†’ becomes True ❌
  • You cannot disable command filtering even when explicitly passing False

Fix:

# Line 756 should be:
enable_command_filtering=enable_command_filtering if enable_command_filtering is not None else True,

Note that the update logic (lines 778-781) is correct - it uses is not None.

2. Inconsistent denied_commands Default (openai_responses.py:755)

Location: openai_responses.py:755

denied_commands=denied_commands if denied_commands else [],

Problem: Should check is not None instead of truthiness:

  • denied_commands=None β†’ becomes [] ❌
  • denied_commands=[] β†’ becomes [] βœ…

While this might work in practice (both None and [] result in empty list), it's inconsistent with the pattern used for other parameters.

Fix:

denied_commands=denied_commands if denied_commands is not None else [],

Medium Priority

3. Code Duplication in content_str (code_utils.py:86-100)

The shell_call and apply_patch_call handling is very similar. Consider extracting common logic:

elif item["type"] == "apply_patch_call":
    operation = item.get("operation", {})
    operation_type = operation.get("type", "unknown_operation")
    path = operation.get("path", "unknown_path")
    status = item.get("status", "unknown_status")
    diff = operation.get("diff", "unknown_diff")
    rst.append(f"<apply_patch_call: {operation_type} on {path} (status: {status}) diff: {diff}>")

elif item["type"] == "shell_call":
    call_id = item.get("call_id", "<unknown>")
    action = item.get("action", {})
    commands = action.get("commands", [])
    rst.append(f"[shell_call id={call_id} commands={commands}]")

Both blocks extract nested dicts and format them. A helper function would reduce duplication.

4. Missing Validation for Timeout (shell_tool.py:201-234)

The run method doesn't validate that timeout is positive:

def run(self, cmd: str, timeout: float | None = None) -> CmdResult:
    # Validate command before execution
    self._validate_command(cmd)
    
    t = timeout or self.default_timeout
    # No validation that t > 0!

Recommendation:

if timeout is not None and timeout <= 0:
    raise ValueError("Timeout must be positive")

Low Priority

5. Redundant Conditional in shell_tool.py:194

if match.startswith("~"):
    test_path = os.path.expanduser(match) if match.startswith("~") else match

The outer if already checks match.startswith("~"), so the inner ternary is redundant. Could simplify to:

if match.startswith("~"):
    test_path = os.path.expanduser(match)
else:
    test_path = match

But this is extremely minor and the current code works fine.


πŸ“Š Overall Assessment

Shell Tool: ⭐⭐⭐⭐ (8.5/10)

  • Strong security design with defense-in-depth
  • Excellent documentation of limitations
  • Critical bug: Cannot disable command filtering (line 756)
  • Good test coverage (89.77%)

Apply Patch Tool: ⭐⭐⭐⭐⭐ (9.5/10)

  • Excellent architecture with protocol-based design
  • Clean, well-tested implementation
  • Strong security model
  • Comprehensive async support
  • Good test coverage (75.75%)

Integration: ⭐⭐⭐⭐ (8/10)

  • Clean separation of concerns
  • Proper error handling
  • Async workaround is functional but hacky
  • Executor reuse logic could be clearer

🎯 Recommendations Summary

Must Fix Before Merge

  1. ❌ Fix enable_command_filtering logic bug at openai_responses.py:756
  2. ❌ Fix denied_commands default value at openai_responses.py:755

Should Fix (High Priority)

  1. Document or refactor shell executor reuse behavior
  2. Consider simplifying async thread workaround or document why it's needed

Nice to Have

  1. Add timeout validation in ShellExecutor.run()
  2. Reduce code duplication in content_str()
  3. Add more edge case tests (unicode, binary output, very long outputs)

✨ Highlights

What I Really Like:

  1. Path validation documentation - Excellent transparency about limitations
  2. Apply patch architecture - Protocol-based design is exemplary
  3. V4A diff applier - Custom implementation avoids external dependencies
  4. Comprehensive async support - Both tools properly support async/sync modes
  5. Test quality - Well-organized tests with clear naming

Key Improvements from Previous Review:

  • βœ… Path validation limitations are now clearly documented
  • βœ… Apply patch implementation is production-ready
  • βœ… Security model is well-thought-out for both tools

πŸ“ˆ Code Coverage Analysis

From Codecov report:

  • shell_tool.py: 89.77% (9 lines missing)
  • apply_patch_tool.py: 75.75% (56 lines missing)
  • openai_responses.py: 79.56% (57 lines missing)

The coverage is good overall. Missing lines appear to be primarily error paths and edge cases.


Final Verdict

Overall Rating: ⭐⭐⭐⭐½ (8.5/10)

This is a high-quality implementation with strong security design and excellent documentation. The apply_patch tool is production-ready. The shell tool has 2 critical bugs that must be fixed before merge (lines 755-756 in openai_responses.py).

Approval Status: ⚠️ Approve with required changes

Required changes:

  1. Fix enable_command_filtering logic at line 756
  2. Fix denied_commands default at line 755

Once these two lines are fixed, this PR will be ready to merge.


claude[bot] avatar Dec 14 '25 07:12 claude[bot]

Nice PR. @marklysze @yiranwu0 could you help review this when you have time?

qingyun-wu avatar Dec 14 '25 20:12 qingyun-wu

Codecov Report

:x: Patch coverage is 79.13386% with 53 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
autogen/a2a/client.py 25.80% 21 Missing and 2 partials :warning:
autogen/oai/openai_responses.py 86.32% 9 Missing and 7 partials :warning:
autogen/tools/experimental/shell/shell_tool.py 89.77% 6 Missing and 3 partials :warning:
autogen/code_utils.py 0.00% 4 Missing and 1 partial :warning:
Files with missing lines Coverage Ξ”
autogen/agentchat/conversable_agent.py 65.12% <100.00%> (ΓΈ)
autogen/oai/client.py 54.18% <100.00%> (ΓΈ)
autogen/oai/gemini_types.py 50.00% <100.00%> (+0.74%) :arrow_up:
autogen/oai/openai_utils.py 62.97% <ΓΈ> (ΓΈ)
autogen/token_count_utils.py 64.33% <ΓΈ> (ΓΈ)
autogen/tools/experimental/shell/__init__.py 100.00% <100.00%> (ΓΈ)
autogen/code_utils.py 73.74% <0.00%> (-1.36%) :arrow_down:
autogen/tools/experimental/shell/shell_tool.py 89.77% <89.77%> (ΓΈ)
autogen/oai/openai_responses.py 77.37% <86.32%> (+2.94%) :arrow_up:
autogen/a2a/client.py 63.97% <25.80%> (+0.78%) :arrow_up:
:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Dec 16 '25 18:12 codecov[bot]