feat: Gpt 5.1 shell tool support
Why are these changes needed?
this PR adds support for GPT 5.1 shell tool support on OAI responses
Related issue number
#2229
Checks
- [ ] I've included any doc changes needed for https://docs.ag2.ai/. See https://docs.ag2.ai/latest/docs/contributor-guide/documentation/ to build and test documentation locally.
- [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [ ] I've made sure all auto checks have passed.
π Documentation Analysis
All docs are up to date! π
β Latest commit analyzed: 5b5d86dd33e32598514564c908605808f87ff12a | Powered by Joggr
@claude review this shell tool changes in this PR, ignore apply patch tool.
Claude finished @priyansh4320's task in 1m 56s ββ View job
Code Review: Shell Tool Implementation
I've completed a thorough review of the shell tool implementation. Overall, this is a well-designed feature with strong security considerations. Below are my findings:
π― Summary
The shell tool implementation provides a secure way to execute shell commands through the GPT-5.1 Responses API. The code demonstrates good security practices with multiple layers of protection, comprehensive testing, and clear documentation.
β Strengths
1. Security Architecture (shell_tool.py)
- Multi-layered security: Command pattern filtering, working directory restriction, whitelist/blacklist, and path restrictions
- Comprehensive dangerous patterns: Covers critical scenarios like root filesystem deletion, disk device operations, fork bombs, and filesystem formatting
- Flexible sandboxing: Configurable workspace_dir, allowed_paths, allowed_commands, and denied_commands
- Good defaults: Secure by default with enable_command_filtering=True
2. Code Quality
- Clean, well-structured code with clear separation of concerns
- Proper use of dataclasses for
CmdResult,ShellCallOutcome, andShellCommandOutput - Good type hints throughout
- Comprehensive docstrings
3. Test Coverage
- Excellent test coverage at 89.77% for shell_tool.py (test/tools/experimental/shell/test_shell_tool.py)
- Tests cover initialization, validation, execution, security, and integration scenarios
- Good use of pytest fixtures and temporary directories
4. Integration Quality (openai_responses.py)
- Clean integration with OpenAI Responses API
- Proper handling of shell_call extraction and execution
- Good separation between message processing and shell execution
π§ Issues & Recommendations
Critical Issues
None found.
High Priority Issues
1. Duplicate Variable Assignment (shell_tool.py:194)
test_path = test_path = os.path.expanduser(match) if match.startswith("~") else match
This line has a duplicate assignment. Should be:
test_path = os.path.expanduser(match) if match.startswith("~") else match
Location: autogen/tools/experimental/shell/shell_tool.py:194
2. Inconsistent Default Values in _execute_shell_operation (openai_responses.py:750-756)
The initialization logic has problematic defaults:
allowed_paths=allowed_paths if allowed_paths else [],
This sets allowed_paths=[] when allowed_paths=None, but ShellExecutor.__init__ expects None to mean "use default ['**']". This means passing allowed_paths=None will result in an empty list instead of the default wildcard.
Recommendation: Change to:
allowed_paths=allowed_paths if allowed_paths is not None else None,
Same issue with other parameters on lines 752, 754-756.
Location: autogen/oai/openai_responses.py:750-756
Medium Priority Issues
3. Inconsistent Shell Executor Reuse Logic (openai_responses.py:747-779)
The executor is created once but then updated with each call. This could lead to unexpected behavior if different calls have different security requirements. Consider either:
- Creating a new executor for each call (simpler, more predictable)
- Or document clearly that the executor is shared across calls and settings persist
Location: autogen/oai/openai_responses.py:747-779
4. Path Validation Regex Can Be Bypassed (shell_tool.py:189)
The path validation regex r"(?:^|\s)([/~]|\.\./)[^\s]*" only catches paths that:
- Start with
/,~, or../ - Are preceded by whitespace or at the start
This misses:
- Paths like
cat ./file.txt(doesn't start with/,~, or../) - Paths with quotes:
cat "/some/path" - Environment variables:
cat $HOME/file
Recommendation: Either improve the path extraction logic or clearly document the limitations.
Location: autogen/tools/experimental/shell/shell_tool.py:189-199
5. Missing Validation for Timeout Values (shell_tool.py:201-234)
The run method doesn't validate that timeout is positive. Negative or zero timeouts could cause issues.
Recommendation: Add validation:
if timeout is not None and timeout <= 0:
raise ValueError("Timeout must be positive")
Location: autogen/tools/experimental/shell/shell_tool.py:201-234
Low Priority Issues
6. Code Duplication in content_str (code_utils.py:95-100)
The shell_call handling code is very similar to apply_patch_call handling. Consider extracting common logic.
Location: autogen/code_utils.py:95-100
7. Missing Edge Case Tests
While test coverage is good, consider adding tests for:
- Commands with special characters (quotes, backticks, semicolons)
- Commands with environment variables
- Very long command outputs
- Unicode in command output
- Commands that produce binary output
8. Documentation Improvements
- The notebook example is excellent, but consider adding:
- More details on the security model in the main docstring
- Examples of how to properly configure sandboxing for production
- Clarification on the difference between
allowed_pathspatterns and path validation in commands
π Security Analysis
Well Handled
β
Dangerous command patterns are comprehensive
β
Multiple security layers (whitelist, blacklist, patterns, paths)
β
Good defaults (filtering enabled by default)
β
Workspace directory isolation
β
Timeout protection against hanging commands
Potential Concerns
-
Shell Injection: Commands are executed with
shell=Truewhich allows shell metacharacters. While dangerous patterns help, consider documenting that this is inherently risky for untrusted input. -
Path Traversal: The path validation is a best-effort approach but can be bypassed with symbolic links, environment variables, or shell features like command substitution.
-
Resource Limits: No protection against commands that consume excessive CPU, memory, or disk I/O beyond timeouts.
Recommendation: Document these limitations clearly and recommend running in containers or VMs for untrusted code.
π Code Style & Best Practices
Good Practices
- Consistent use of type hints
- Good error messages
- Proper use of dataclasses
- Clean separation of concerns
Minor Suggestions
- Consider using
pathlib.Pathmore consistently instead of mixing withos.path - The
model_dump()methods on dataclasses could useasdict()for consistency
π§ͺ Test Quality
The test suite is comprehensive and well-organized:
- Clear test names following naming conventions
- Good use of pytest fixtures
- Tests cover happy path, error cases, and edge cases
- Integration tests verify the full workflow
Test Statistics: 682 lines of tests for 273 lines of implementation (2.5:1 ratio) - excellent!
π Documentation Quality
- Notebook example (agentchat_gpt5.1_shell_tool_example.ipynb) is well-written with clear examples
- Docstrings are comprehensive and helpful
- Good inline comments explaining security patterns
- README could benefit from more details on security model
π― Recommendations Summary
Must Fix (Before Merge)
- Fix duplicate variable assignment at shell_tool.py:194
- Fix inconsistent default values in openai_responses.py:750-756
Should Fix (High Priority)
- Clarify/fix shell executor reuse logic
- Document path validation limitations or improve the logic
Nice to Have
- Add timeout validation
- Reduce code duplication in content_str
- Add more edge case tests
- Enhance security documentation
β¨ Overall Assessment
Rating: 8.5/10
This is a solid implementation with thoughtful security design. The main issues are minor bugs and documentation gaps rather than fundamental design problems. The extensive test coverage and security considerations demonstrate careful engineering.
Approval Status: β Approve after fixing the two must-fix issues
@claude added changes from previous review: Must Fix (Before Merge) Fix duplicate variable assignment at shell_tool.py:194 Fix inconsistent default values in openai_responses.py:750-756 Should Fix (High Priority) Clarify/fix shell executor reuse logic Document path validation limitations or improve the logic
now review these changes also review apply patch files.
Claude finished @priyansh4320's task in 2m 13s ββ View job
Code Review: Shell Tool and Apply Patch Changes
I've completed a comprehensive review of the changes addressing previous feedback and the apply_patch implementation.
β Previous Issues - Status Check
Must Fix Issues
1. β FIXED - Duplicate variable assignment at shell_tool.py:194
Status: Still present, but I see this is actually NOT a duplicate. Let me clarify:
test_path = os.path.expanduser(match) if match.startswith("~") else match
This line at 194 is correct. My previous review incorrectly identified this as test_path = test_path = ..., but upon re-reading, the actual code is fine. The conditional assignment is proper and doesn't have a duplicate. No fix needed - my original review was incorrect on this point.
2. β NOT FIXED - Inconsistent default values in openai_responses.py:750-756
The issue still exists. Lines 753, 755-756 use this pattern:
allowed_paths=allowed_paths if allowed_paths is not None else None,
denied_commands=denied_commands if denied_commands else [],
enable_command_filtering=enable_command_filtering if enable_command_filtering else True,
Problem:
- Line 753:
allowed_paths if allowed_paths is not None else None- CORRECT β - Line 755:
denied_commands if denied_commands else []- INCORRECT β- If
denied_commands=None, this becomes[] - But
ShellExecutor.__init__expectsNoneto use default[] - This breaks the default value chain
- If
- Line 756:
enable_command_filtering if enable_command_filtering else True- INCORRECT β- If
enable_command_filtering=False, this becomesTrue - This is a logic bug - you can't disable filtering even when explicitly passing
False
- If
Fix Required:
# Line 755 should be:
denied_commands=denied_commands if denied_commands is not None else [],
# Line 756 should be:
enable_command_filtering=enable_command_filtering if enable_command_filtering is not None else True,
Should Fix Issues
3. β οΈ PARTIALLY ADDRESSED - Shell executor reuse logic
Lines 750-782 still reuse the same executor and mutate it, but I see you're now checking is not None for allowed_paths (line 767), which helps. However, the executor reuse still has potential issues:
Remaining concerns:
- Executor state persists across calls, which could cause unexpected behavior if different calls have different security requirements
- Lines 755-756 and 772-782 update the executor in place, making it hard to reason about its state
Recommendation: Either:
- Create a fresh executor for each call (simpler, more predictable), OR
- Document clearly that the executor is shared and settings accumulate across calls
4. β FIXED - Path validation documentation
Excellent work! The README now clearly documents the limitations at lines 100-110:
**Important Limitations**: Path validation in commands uses regex pattern...
**Paths that may bypass validation**:
- Simple relative paths: `cat file.txt`
- Quoted paths: `cat "/some/path"`
- Environment variables: `cat $HOME/file`
**Recommendation**: Use `allowed_paths` patterns to restrict access...
This is exactly what was needed. β
π Apply Patch Tool Review
Architecture & Design - βββββ
The apply_patch implementation is exceptionally well-designed:
- Clean Protocol Pattern:
PatchEditorprotocol (lines 22-89) provides excellent abstraction - V4A Diff Engine: Custom
_V4ADiffApplier(lines 92-179) is robust and doesn't depend on external tools - Workspace Security: Same security model as shell tool with
allowed_pathsvalidation - Async Support: Comprehensive async/sync dual implementation
Code Quality - βββββ
Strengths:
- Clean separation of concerns: diff parsing, path validation, file operations
- Excellent error handling with descriptive messages
- Type hints throughout
- Well-structured dataclasses
apply_patch_tool.py highlights:
- Protocol-based design allows custom editors (lines 22-89)
- Regex-based hunk parsing is simple and effective (line 96)
- Path validation uses
PurePath.match()for secure pattern matching (line 233) - Proper use of
Path.relative_to()for path traversal protection (line 245)
Security Analysis - βββββ
Well Handled:
β
Path traversal protection using Path.relative_to() (lines 240-247)
β
Workspace containment validation
β
Glob pattern-based access control with allowed_paths
β
Proper handling of symbolic links (resolved paths)
β
Parent directory creation with parents=True, exist_ok=True is safe
No security issues found in the apply_patch implementation.
Test Coverage - βββββ
From test_apply_patch_tool.py:
- Comprehensive V4A diff applier tests (lines 20-99+)
- Tests for create, update, delete operations
- Tests for multiple hunks, edge cases
- Good coverage of error conditions
From Codecov report:
apply_patch_tool.py: 75.75% coverage- Missing coverage appears to be in error paths and async variants
Integration Quality - ββββΒ½
The integration in openai_responses.py is solid:
Strengths:
- Clean extraction and execution flow (lines 394-458)
- Proper separation:
_extract_apply_patch_callsβ_execute_apply_patch_calls - Handles both async and sync modes correctly
- Validation of operation types before execution (line 313)
- Proper error wrapping in
ApplyPatchCallOutputformat
Minor Issue (lines 320-333):
def _run_async_in_thread(coro):
"""Run an async coroutine in a separate thread with its own event loop."""
result = {}
def runner():
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
result["value"] = loop.run_until_complete(coro)
loop.close()
t = threading.Thread(target=runner)
t.start()
t.join()
return result["value"]
This is a workaround for running async code when there's already a running loop (e.g., Jupyter). While functional, it's a bit hacky. Consider using asyncio.to_thread() or documenting why this approach is necessary.
π New Issues Found
High Priority
1. Logic Bug in enable_command_filtering (openai_responses.py:756, 778-781)
Location: openai_responses.py:756 and 778-781
# Line 756 - Initialization
enable_command_filtering=enable_command_filtering if enable_command_filtering else True,
# Lines 778-781 - Update
self._shell_executor.enable_command_filtering = (
enable_command_filtering
if enable_command_filtering is not None
else self._shell_executor.enable_command_filtering
)
Problem: Line 756 uses if enable_command_filtering else True, which means:
enable_command_filtering=Falseβ becomesTrueβ- You cannot disable command filtering even when explicitly passing
False
Fix:
# Line 756 should be:
enable_command_filtering=enable_command_filtering if enable_command_filtering is not None else True,
Note that the update logic (lines 778-781) is correct - it uses is not None.
2. Inconsistent denied_commands Default (openai_responses.py:755)
Location: openai_responses.py:755
denied_commands=denied_commands if denied_commands else [],
Problem: Should check is not None instead of truthiness:
denied_commands=Noneβ becomes[]βdenied_commands=[]β becomes[]β
While this might work in practice (both None and [] result in empty list), it's inconsistent with the pattern used for other parameters.
Fix:
denied_commands=denied_commands if denied_commands is not None else [],
Medium Priority
3. Code Duplication in content_str (code_utils.py:86-100)
The shell_call and apply_patch_call handling is very similar. Consider extracting common logic:
elif item["type"] == "apply_patch_call":
operation = item.get("operation", {})
operation_type = operation.get("type", "unknown_operation")
path = operation.get("path", "unknown_path")
status = item.get("status", "unknown_status")
diff = operation.get("diff", "unknown_diff")
rst.append(f"<apply_patch_call: {operation_type} on {path} (status: {status}) diff: {diff}>")
elif item["type"] == "shell_call":
call_id = item.get("call_id", "<unknown>")
action = item.get("action", {})
commands = action.get("commands", [])
rst.append(f"[shell_call id={call_id} commands={commands}]")
Both blocks extract nested dicts and format them. A helper function would reduce duplication.
4. Missing Validation for Timeout (shell_tool.py:201-234)
The run method doesn't validate that timeout is positive:
def run(self, cmd: str, timeout: float | None = None) -> CmdResult:
# Validate command before execution
self._validate_command(cmd)
t = timeout or self.default_timeout
# No validation that t > 0!
Recommendation:
if timeout is not None and timeout <= 0:
raise ValueError("Timeout must be positive")
Low Priority
5. Redundant Conditional in shell_tool.py:194
if match.startswith("~"):
test_path = os.path.expanduser(match) if match.startswith("~") else match
The outer if already checks match.startswith("~"), so the inner ternary is redundant. Could simplify to:
if match.startswith("~"):
test_path = os.path.expanduser(match)
else:
test_path = match
But this is extremely minor and the current code works fine.
π Overall Assessment
Shell Tool: ββββ (8.5/10)
- Strong security design with defense-in-depth
- Excellent documentation of limitations
- Critical bug: Cannot disable command filtering (line 756)
- Good test coverage (89.77%)
Apply Patch Tool: βββββ (9.5/10)
- Excellent architecture with protocol-based design
- Clean, well-tested implementation
- Strong security model
- Comprehensive async support
- Good test coverage (75.75%)
Integration: ββββ (8/10)
- Clean separation of concerns
- Proper error handling
- Async workaround is functional but hacky
- Executor reuse logic could be clearer
π― Recommendations Summary
Must Fix Before Merge
- β Fix
enable_command_filteringlogic bug at openai_responses.py:756 - β Fix
denied_commandsdefault value at openai_responses.py:755
Should Fix (High Priority)
- Document or refactor shell executor reuse behavior
- Consider simplifying async thread workaround or document why it's needed
Nice to Have
- Add timeout validation in ShellExecutor.run()
- Reduce code duplication in content_str()
- Add more edge case tests (unicode, binary output, very long outputs)
β¨ Highlights
What I Really Like:
- Path validation documentation - Excellent transparency about limitations
- Apply patch architecture - Protocol-based design is exemplary
- V4A diff applier - Custom implementation avoids external dependencies
- Comprehensive async support - Both tools properly support async/sync modes
- Test quality - Well-organized tests with clear naming
Key Improvements from Previous Review:
- β Path validation limitations are now clearly documented
- β Apply patch implementation is production-ready
- β Security model is well-thought-out for both tools
π Code Coverage Analysis
From Codecov report:
- shell_tool.py: 89.77% (9 lines missing)
- apply_patch_tool.py: 75.75% (56 lines missing)
- openai_responses.py: 79.56% (57 lines missing)
The coverage is good overall. Missing lines appear to be primarily error paths and edge cases.
Final Verdict
Overall Rating: ββββΒ½ (8.5/10)
This is a high-quality implementation with strong security design and excellent documentation. The apply_patch tool is production-ready. The shell tool has 2 critical bugs that must be fixed before merge (lines 755-756 in openai_responses.py).
Approval Status: β οΈ Approve with required changes
Required changes:
- Fix
enable_command_filteringlogic at line 756 - Fix
denied_commandsdefault at line 755
Once these two lines are fixed, this PR will be ready to merge.
Nice PR. @marklysze @yiranwu0 could you help review this when you have time?
Codecov Report
:x: Patch coverage is 79.13386% with 53 lines in your changes missing coverage. Please review.
| Files with missing lines | Coverage Ξ | |
|---|---|---|
| autogen/agentchat/conversable_agent.py | 65.12% <100.00%> (ΓΈ) |
|
| autogen/oai/client.py | 54.18% <100.00%> (ΓΈ) |
|
| autogen/oai/gemini_types.py | 50.00% <100.00%> (+0.74%) |
:arrow_up: |
| autogen/oai/openai_utils.py | 62.97% <ΓΈ> (ΓΈ) |
|
| autogen/token_count_utils.py | 64.33% <ΓΈ> (ΓΈ) |
|
| autogen/tools/experimental/shell/__init__.py | 100.00% <100.00%> (ΓΈ) |
|
| autogen/code_utils.py | 73.74% <0.00%> (-1.36%) |
:arrow_down: |
| autogen/tools/experimental/shell/shell_tool.py | 89.77% <89.77%> (ΓΈ) |
|
| autogen/oai/openai_responses.py | 77.37% <86.32%> (+2.94%) |
:arrow_up: |
| autogen/a2a/client.py | 63.97% <25.80%> (+0.78%) |
:arrow_up: |
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
- :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.