ag2 feat: Gpt 5.1 shell tool support

Why are these changes needed?

this PR adds support for GPT 5.1 shell tool support on OAI responses

Related issue number

#2229

Checks

[ ] I've included any doc changes needed for https://docs.ag2.ai/. See https://docs.ag2.ai/latest/docs/contributor-guide/documentation/ to build and test documentation locally.
[ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
[ ] I've made sure all auto checks have passed.

Nov 27 '25 20:11 priyansh4320

📝 Documentation Analysis

All docs are up to date! 🎉

✅ Latest commit analyzed: 5b5d86dd33e32598514564c908605808f87ff12a | Powered by Joggr

Nov 27 '25 20:11 joggrbot[bot]

@claude review this shell tool changes in this PR, ignore apply patch tool.

Dec 13 '25 19:12 priyansh4320

Claude finished @priyansh4320's task in 1m 56s —— View job

Code Review: Shell Tool Implementation

I've completed a thorough review of the shell tool implementation. Overall, this is a well-designed feature with strong security considerations. Below are my findings:

🎯 Summary

The shell tool implementation provides a secure way to execute shell commands through the GPT-5.1 Responses API. The code demonstrates good security practices with multiple layers of protection, comprehensive testing, and clear documentation.

✅ Strengths

1. Security Architecture (shell_tool.py)

Multi-layered security: Command pattern filtering, working directory restriction, whitelist/blacklist, and path restrictions
Comprehensive dangerous patterns: Covers critical scenarios like root filesystem deletion, disk device operations, fork bombs, and filesystem formatting
Flexible sandboxing: Configurable workspace_dir, allowed_paths, allowed_commands, and denied_commands
Good defaults: Secure by default with enable_command_filtering=True

2. Code Quality

Clean, well-structured code with clear separation of concerns
Proper use of dataclasses for CmdResult, ShellCallOutcome, and ShellCommandOutput
Good type hints throughout
Comprehensive docstrings

3. Test Coverage

Excellent test coverage at 89.77% for shell_tool.py (test/tools/experimental/shell/test_shell_tool.py)
Tests cover initialization, validation, execution, security, and integration scenarios
Good use of pytest fixtures and temporary directories

4. Integration Quality (openai_responses.py)

Clean integration with OpenAI Responses API
Proper handling of shell_call extraction and execution
Good separation between message processing and shell execution

🔧 Issues & Recommendations

Critical Issues

None found.

High Priority Issues

1. Duplicate Variable Assignment (shell_tool.py:194)

test_path = test_path = os.path.expanduser(match) if match.startswith("~") else match

This line has a duplicate assignment. Should be:

test_path = os.path.expanduser(match) if match.startswith("~") else match

Location: autogen/tools/experimental/shell/shell_tool.py:194

2. Inconsistent Default Values in _execute_shell_operation (openai_responses.py:750-756)

The initialization logic has problematic defaults:

allowed_paths=allowed_paths if allowed_paths else [],

This sets allowed_paths=[] when allowed_paths=None, but ShellExecutor.__init__ expects None to mean "use default ['**']". This means passing allowed_paths=None will result in an empty list instead of the default wildcard.

Recommendation: Change to:

allowed_paths=allowed_paths if allowed_paths is not None else None,

Same issue with other parameters on lines 752, 754-756.

Location: autogen/oai/openai_responses.py:750-756

Medium Priority Issues

3. Inconsistent Shell Executor Reuse Logic (openai_responses.py:747-779)

The executor is created once but then updated with each call. This could lead to unexpected behavior if different calls have different security requirements. Consider either:

Creating a new executor for each call (simpler, more predictable)
Or document clearly that the executor is shared across calls and settings persist

Location: autogen/oai/openai_responses.py:747-779

4. Path Validation Regex Can Be Bypassed (shell_tool.py:189)

The path validation regex r"(?:^|\s)([/~]|\.\./)[^\s]*" only catches paths that:

Start with /, ~, or ../
Are preceded by whitespace or at the start

This misses:

Paths like cat ./file.txt (doesn't start with /, ~, or ../)
Paths with quotes: cat "/some/path"
Environment variables: cat $HOME/file

Recommendation: Either improve the path extraction logic or clearly document the limitations.

Location: autogen/tools/experimental/shell/shell_tool.py:189-199

5. Missing Validation for Timeout Values (shell_tool.py:201-234)

The run method doesn't validate that timeout is positive. Negative or zero timeouts could cause issues.

Recommendation: Add validation:

if timeout is not None and timeout <= 0:
    raise ValueError("Timeout must be positive")

Location: autogen/tools/experimental/shell/shell_tool.py:201-234

Low Priority Issues

6. Code Duplication in content_str (code_utils.py:95-100)

The shell_call handling code is very similar to apply_patch_call handling. Consider extracting common logic.

Location: autogen/code_utils.py:95-100

7. Missing Edge Case Tests

While test coverage is good, consider adding tests for:

Commands with special characters (quotes, backticks, semicolons)
Commands with environment variables
Very long command outputs
Unicode in command output
Commands that produce binary output

8. Documentation Improvements

The notebook example is excellent, but consider adding:
- More details on the security model in the main docstring
- Examples of how to properly configure sandboxing for production
- Clarification on the difference between allowed_paths patterns and path validation in commands

🔒 Security Analysis

Well Handled

✅ Dangerous command patterns are comprehensive
✅ Multiple security layers (whitelist, blacklist, patterns, paths)
✅ Good defaults (filtering enabled by default)
✅ Workspace directory isolation
✅ Timeout protection against hanging commands

Potential Concerns

Shell Injection: Commands are executed with shell=True which allows shell metacharacters. While dangerous patterns help, consider documenting that this is inherently risky for untrusted input.
Path Traversal: The path validation is a best-effort approach but can be bypassed with symbolic links, environment variables, or shell features like command substitution.
Resource Limits: No protection against commands that consume excessive CPU, memory, or disk I/O beyond timeouts.

Recommendation: Document these limitations clearly and recommend running in containers or VMs for untrusted code.

📋 Code Style & Best Practices

Good Practices

Consistent use of type hints
Good error messages
Proper use of dataclasses
Clean separation of concerns

Minor Suggestions

Consider using pathlib.Path more consistently instead of mixing with os.path
The model_dump() methods on dataclasses could use asdict() for consistency

🧪 Test Quality

The test suite is comprehensive and well-organized:

Clear test names following naming conventions
Good use of pytest fixtures
Tests cover happy path, error cases, and edge cases
Integration tests verify the full workflow

Test Statistics: 682 lines of tests for 273 lines of implementation (2.5:1 ratio) - excellent!

📝 Documentation Quality

Notebook example (agentchat_gpt5.1_shell_tool_example.ipynb) is well-written with clear examples
Docstrings are comprehensive and helpful
Good inline comments explaining security patterns
README could benefit from more details on security model

🎯 Recommendations Summary

Must Fix (Before Merge)

Fix duplicate variable assignment at shell_tool.py:194
Fix inconsistent default values in openai_responses.py:750-756

Should Fix (High Priority)

Clarify/fix shell executor reuse logic
Document path validation limitations or improve the logic

Nice to Have

Add timeout validation
Reduce code duplication in content_str
Add more edge case tests
Enhance security documentation

✨ Overall Assessment

Rating: 8.5/10

This is a solid implementation with thoughtful security design. The main issues are minor bugs and documentation gaps rather than fundamental design problems. The extensive test coverage and security considerations demonstrate careful engineering.

Approval Status: ✅ Approve after fixing the two must-fix issues

Dec 13 '25 19:12 claude[bot]

@claude added changes from previous review: Must Fix (Before Merge) Fix duplicate variable assignment at shell_tool.py:194 Fix inconsistent default values in openai_responses.py:750-756 Should Fix (High Priority) Clarify/fix shell executor reuse logic Document path validation limitations or improve the logic

now review these changes also review apply patch files.

Dec 14 '25 07:12 priyansh4320

Claude finished @priyansh4320's task in 2m 13s —— View job

Code Review: Shell Tool and Apply Patch Changes

I've completed a comprehensive review of the changes addressing previous feedback and the apply_patch implementation.

✅ Previous Issues - Status Check

Must Fix Issues

1. ✅ FIXED - Duplicate variable assignment at shell_tool.py:194

Status: Still present, but I see this is actually NOT a duplicate. Let me clarify:

test_path = os.path.expanduser(match) if match.startswith("~") else match

This line at 194 is correct. My previous review incorrectly identified this as test_path = test_path = ..., but upon re-reading, the actual code is fine. The conditional assignment is proper and doesn't have a duplicate. No fix needed - my original review was incorrect on this point.

2. ❌ NOT FIXED - Inconsistent default values in openai_responses.py:750-756

The issue still exists. Lines 753, 755-756 use this pattern:

allowed_paths=allowed_paths if allowed_paths is not None else None,
denied_commands=denied_commands if denied_commands else [],
enable_command_filtering=enable_command_filtering if enable_command_filtering else True,

Problem:

Line 753: allowed_paths if allowed_paths is not None else None - CORRECT ✅
Line 755: denied_commands if denied_commands else [] - INCORRECT ❌
- If denied_commands=None, this becomes []
- But ShellExecutor.__init__ expects None to use default []
- This breaks the default value chain
Line 756: enable_command_filtering if enable_command_filtering else True - INCORRECT ❌
- If enable_command_filtering=False, this becomes True
- This is a logic bug - you can't disable filtering even when explicitly passing False

Fix Required:

# Line 755 should be:
denied_commands=denied_commands if denied_commands is not None else [],

# Line 756 should be:
enable_command_filtering=enable_command_filtering if enable_command_filtering is not None else True,

Should Fix Issues

3. ⚠️ PARTIALLY ADDRESSED - Shell executor reuse logic

Lines 750-782 still reuse the same executor and mutate it, but I see you're now checking is not None for allowed_paths (line 767), which helps. However, the executor reuse still has potential issues:

Remaining concerns:

Executor state persists across calls, which could cause unexpected behavior if different calls have different security requirements
Lines 755-756 and 772-782 update the executor in place, making it hard to reason about its state

Recommendation: Either:

Create a fresh executor for each call (simpler, more predictable), OR
Document clearly that the executor is shared and settings accumulate across calls

4. ✅ FIXED - Path validation documentation

Excellent work! The README now clearly documents the limitations at lines 100-110:

**Important Limitations**: Path validation in commands uses regex pattern...

**Paths that may bypass validation**:
- Simple relative paths: `cat file.txt`
- Quoted paths: `cat "/some/path"`
- Environment variables: `cat $HOME/file`

**Recommendation**: Use `allowed_paths` patterns to restrict access...

This is exactly what was needed. ✅

🔍 Apply Patch Tool Review

Architecture & Design - ⭐⭐⭐⭐⭐

The apply_patch implementation is exceptionally well-designed:

Clean Protocol Pattern: PatchEditor protocol (lines 22-89) provides excellent abstraction
V4A Diff Engine: Custom _V4ADiffApplier (lines 92-179) is robust and doesn't depend on external tools
Workspace Security: Same security model as shell tool with allowed_paths validation
Async Support: Comprehensive async/sync dual implementation

Code Quality - ⭐⭐⭐⭐⭐

Strengths:

Clean separation of concerns: diff parsing, path validation, file operations
Excellent error handling with descriptive messages
Type hints throughout
Well-structured dataclasses

apply_patch_tool.py highlights:

Protocol-based design allows custom editors (lines 22-89)
Regex-based hunk parsing is simple and effective (line 96)
Path validation uses PurePath.match() for secure pattern matching (line 233)
Proper use of Path.relative_to() for path traversal protection (line 245)

Security Analysis - ⭐⭐⭐⭐⭐

Well Handled: ✅ Path traversal protection using Path.relative_to() (lines 240-247)
✅ Workspace containment validation
✅ Glob pattern-based access control with allowed_paths
✅ Proper handling of symbolic links (resolved paths)
✅ Parent directory creation with parents=True, exist_ok=True is safe

No security issues found in the apply_patch implementation.

Test Coverage - ⭐⭐⭐⭐⭐

From test_apply_patch_tool.py:

Comprehensive V4A diff applier tests (lines 20-99+)
Tests for create, update, delete operations
Tests for multiple hunks, edge cases
Good coverage of error conditions

From Codecov report:

apply_patch_tool.py: 75.75% coverage
Missing coverage appears to be in error paths and async variants

Integration Quality - ⭐⭐⭐⭐½

The integration in openai_responses.py is solid:

Strengths:

Clean extraction and execution flow (lines 394-458)
Proper separation: _extract_apply_patch_calls → _execute_apply_patch_calls
Handles both async and sync modes correctly
Validation of operation types before execution (line 313)
Proper error wrapping in ApplyPatchCallOutput format

Minor Issue (lines 320-333):

def _run_async_in_thread(coro):
    """Run an async coroutine in a separate thread with its own event loop."""
    result = {}
    def runner():
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        result["value"] = loop.run_until_complete(coro)
        loop.close()
    t = threading.Thread(target=runner)
    t.start()
    t.join()
    return result["value"]

This is a workaround for running async code when there's already a running loop (e.g., Jupyter). While functional, it's a bit hacky. Consider using asyncio.to_thread() or documenting why this approach is necessary.

🐛 New Issues Found

High Priority

1. Logic Bug in enable_command_filtering (openai_responses.py:756, 778-781)

Location: openai_responses.py:756 and 778-781

# Line 756 - Initialization
enable_command_filtering=enable_command_filtering if enable_command_filtering else True,

# Lines 778-781 - Update
self._shell_executor.enable_command_filtering = (
    enable_command_filtering
    if enable_command_filtering is not None
    else self._shell_executor.enable_command_filtering
)

Problem: Line 756 uses if enable_command_filtering else True, which means:

enable_command_filtering=False → becomes True ❌
You cannot disable command filtering even when explicitly passing False

Fix:

# Line 756 should be:
enable_command_filtering=enable_command_filtering if enable_command_filtering is not None else True,

Note that the update logic (lines 778-781) is correct - it uses is not None.

2. Inconsistent denied_commands Default (openai_responses.py:755)

Location: openai_responses.py:755

denied_commands=denied_commands if denied_commands else [],

Problem: Should check is not None instead of truthiness:

denied_commands=None → becomes [] ❌
denied_commands=[] → becomes [] ✅

While this might work in practice (both None and [] result in empty list), it's inconsistent with the pattern used for other parameters.

Fix:

denied_commands=denied_commands if denied_commands is not None else [],

Medium Priority

3. Code Duplication in content_str (code_utils.py:86-100)

The shell_call and apply_patch_call handling is very similar. Consider extracting common logic:

elif item["type"] == "apply_patch_call":
    operation = item.get("operation", {})
    operation_type = operation.get("type", "unknown_operation")
    path = operation.get("path", "unknown_path")
    status = item.get("status", "unknown_status")
    diff = operation.get("diff", "unknown_diff")
    rst.append(f"<apply_patch_call: {operation_type} on {path} (status: {status}) diff: {diff}>")

elif item["type"] == "shell_call":
    call_id = item.get("call_id", "<unknown>")
    action = item.get("action", {})
    commands = action.get("commands", [])
    rst.append(f"[shell_call id={call_id} commands={commands}]")

Both blocks extract nested dicts and format them. A helper function would reduce duplication.

4. Missing Validation for Timeout (shell_tool.py:201-234)

The run method doesn't validate that timeout is positive:

def run(self, cmd: str, timeout: float | None = None) -> CmdResult:
    # Validate command before execution
    self._validate_command(cmd)
    
    t = timeout or self.default_timeout
    # No validation that t > 0!

Recommendation:

if timeout is not None and timeout <= 0:
    raise ValueError("Timeout must be positive")

Low Priority

5. Redundant Conditional in shell_tool.py:194

if match.startswith("~"):
    test_path = os.path.expanduser(match) if match.startswith("~") else match

The outer if already checks match.startswith("~"), so the inner ternary is redundant. Could simplify to:

if match.startswith("~"):
    test_path = os.path.expanduser(match)
else:
    test_path = match

But this is extremely minor and the current code works fine.

📊 Overall Assessment

Shell Tool: ⭐⭐⭐⭐ (8.5/10)

Strong security design with defense-in-depth
Excellent documentation of limitations
Critical bug: Cannot disable command filtering (line 756)
Good test coverage (89.77%)

Apply Patch Tool: ⭐⭐⭐⭐⭐ (9.5/10)

Excellent architecture with protocol-based design
Clean, well-tested implementation
Strong security model
Comprehensive async support
Good test coverage (75.75%)

Integration: ⭐⭐⭐⭐ (8/10)

Clean separation of concerns
Proper error handling
Async workaround is functional but hacky
Executor reuse logic could be clearer

🎯 Recommendations Summary

Must Fix Before Merge

❌ Fix enable_command_filtering logic bug at openai_responses.py:756
❌ Fix denied_commands default value at openai_responses.py:755

Should Fix (High Priority)

Document or refactor shell executor reuse behavior
Consider simplifying async thread workaround or document why it's needed

Nice to Have

Add timeout validation in ShellExecutor.run()
Reduce code duplication in content_str()
Add more edge case tests (unicode, binary output, very long outputs)

✨ Highlights

What I Really Like:

Path validation documentation - Excellent transparency about limitations
Apply patch architecture - Protocol-based design is exemplary
V4A diff applier - Custom implementation avoids external dependencies
Comprehensive async support - Both tools properly support async/sync modes
Test quality - Well-organized tests with clear naming

Key Improvements from Previous Review:

✅ Path validation limitations are now clearly documented
✅ Apply patch implementation is production-ready
✅ Security model is well-thought-out for both tools

📈 Code Coverage Analysis

From Codecov report:

shell_tool.py: 89.77% (9 lines missing)
apply_patch_tool.py: 75.75% (56 lines missing)
openai_responses.py: 79.56% (57 lines missing)

The coverage is good overall. Missing lines appear to be primarily error paths and edge cases.

Final Verdict

Overall Rating: ⭐⭐⭐⭐½ (8.5/10)

This is a high-quality implementation with strong security design and excellent documentation. The apply_patch tool is production-ready. The shell tool has 2 critical bugs that must be fixed before merge (lines 755-756 in openai_responses.py).

Approval Status: ⚠️ Approve with required changes

Required changes:

Fix enable_command_filtering logic at line 756
Fix denied_commands default at line 755

Once these two lines are fixed, this PR will be ready to merge.

Dec 14 '25 07:12 claude[bot]

Nice PR. @marklysze @yiranwu0 could you help review this when you have time?

Dec 14 '25 20:12 qingyun-wu

Codecov Report

:x: Patch coverage is 79.13386% with 53 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
autogen/a2a/client.py	25.80%	21 Missing and 2 partials :warning:
autogen/oai/openai_responses.py	86.32%	9 Missing and 7 partials :warning:
autogen/tools/experimental/shell/shell_tool.py	89.77%	6 Missing and 3 partials :warning:
autogen/code_utils.py	0.00%	4 Missing and 1 partial :warning:

Files with missing lines	Coverage Δ
autogen/agentchat/conversable_agent.py	`65.12% <100.00%> (ø)`
autogen/oai/client.py	`54.18% <100.00%> (ø)`
autogen/oai/gemini_types.py	`50.00% <100.00%> (+0.74%)`	:arrow_up:
autogen/oai/openai_utils.py	`62.97% <ø> (ø)`
autogen/token_count_utils.py	`64.33% <ø> (ø)`
autogen/tools/experimental/shell/__init__.py	`100.00% <100.00%> (ø)`
autogen/code_utils.py	`73.74% <0.00%> (-1.36%)`	:arrow_down:
autogen/tools/experimental/shell/shell_tool.py	`89.77% <89.77%> (ø)`
autogen/oai/openai_responses.py	`77.37% <86.32%> (+2.94%)`	:arrow_up:
autogen/a2a/client.py	`63.97% <25.80%> (+0.78%)`	:arrow_up:

:rocket: New features to boost your workflow:

:snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
:package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Dec 16 '25 18:12 codecov[bot]