Open
Majdoddin opened this issue 1 year ago · 9 comments
User description
This PR is based on the observation that ChatGPT cannot reliably perform the computation needed to determine the output of a whole program for a given input (in web interface mode, it often resorts to generating and running a script instead).
It therefore introduces three enhancements:
In the "Iterate on Public Tests" phase, keep a log of the call stack while running the code, and if the code output doesn't match the test output, the LLM should first analyse the log (in YAML):
Check formatting of final output.
For each function call check the parameters are valid,
and check if function output is correct,
and given an incorrect function output or a raised exception, what is its cause.
Based on the analysis, the LLM should then generated the corrected code.
In "Initial Code Solution" phase, the LLM first generates a code structure, with function signatures, and comments according to the generated algorithm. In the next step, the LLM generates the function bodies and adds imports.
In the "Generate Additional AI Tests", the LLM generates test inputs to cover various challenging cases, sorted by difficulty. The test outputs are not generated, because, as discussed earlier, the LLM is not able to compute them. Special provision is needed to generate very long inputs necessary to test the runtime. This stage is not implemented yet.
The vision is to enable the LLM to connect to the debugger and run debugging sessions, with breakpoints, watches, ...
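To make the first enhancement concrete, here is a minimal sketch, assuming a decorator-based logger (the names log_calls, call_log, and solve are illustrative and not the PR's actual implementation), of how function calls could be recorded while running a candidate solution and then dumped to YAML for the LLM to analyse:

import functools
import yaml  # requires PyYAML

call_log = []  # chronological record of function calls

def log_calls(func):
    """Record each call's arguments, return value, and any raised exception."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        entry = {"function": func.__name__, "args": repr(args), "kwargs": repr(kwargs)}
        try:
            result = func(*args, **kwargs)
            entry["returned"] = repr(result)
            return result
        except Exception as e:
            entry["exception"] = f"{type(e).__name__}: {e}"
            raise
        finally:
            call_log.append(entry)
    return wrapper

@log_calls
def solve(a, b):
    return a // b

try:
    solve(10, 0)
except ZeroDivisionError:
    pass
print(yaml.safe_dump(call_log, sort_keys=False))  # the YAML the LLM would analyse step by step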
Comparison
I ran the main branch and this PR on the same problem (see the attached logs). Both initially generated incorrect code, but the PR successfully debugged the code after 2 attempts (9 LLM inferences in total), while main could not produce correct code even after 26 LLM inferences.
Delete Two Elements - 60 - main.log
Delete Two Elements - 60 - PR.log
Type
enhancement, bug_fix
Description
Implemented new stages for code structure generation and function body generation.
Added debugging mechanism with function call logging and YAML serialization.
Enhanced solution selection and possible solutions generation with additional YAML keys.
Modified AI test generation to focus on input generation.
Updated configuration files with new prompts for debugging, code fixing, and more.
Changed file logging mode from overwrite to append.
Changes walkthrough
Relevant files
Enhancement
7 files
coding_competitor.py
Integrate New Stages for Code Generation and Debugging
alpha_codium/gen/coding_competitor.py
Added new stages for code structure generation and function body generation.
Replaced initial code generation and public test evaluation with new stages.
Added execution of public tests after code generation.
PR-Agent was enabled for this repository. To use it, please link your git user with your CodiumAI identity here.
PR Review
PR feedback
⏱️ Estimated effort to review [1-5]
4, because the PR introduces significant changes across multiple files, including new features and modifications to existing logic. The complexity of the changes, especially those related to debugging and code generation, requires careful review to ensure correctness and adherence to project standards.
🧪 Relevant tests
No
🔍 Possible issues
The compare_titles function in maj/another_sorting_problem.py uses a custom comparator but does not return an integer (-1, 0, 1) as expected by Python's sorting functions when using cmp_to_key. This could lead to incorrect sorting behavior.
The exec_code function in alpha_codium/gen/stages/debug.py modifies the built-in input and print functions but does not restore them, which could affect other parts of the code that rely on these functions. A save-and-restore pattern is sketched after this list.
The use of pass at the end of the try block in alpha_codium/gen/coding_competitor.py is unnecessary and could be removed for clarity.
The run_public_tests function in alpha_codium/gen/stages/run_public_tests.py has a success variable that is set but never used, which could be an oversight or unnecessary code.
In several configuration files (e.g., alpha_codium/settings/code_contests_prompts_generate_ai_tests.toml), the frequency_penalty parameter is added, but its impact on the behavior of the AI models used should be carefully considered to ensure it aligns with the intended use cases.
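Regarding the point above about exec_code overriding input and print, a common way to keep such overrides from leaking is to save the originals and restore them in a finally block. The sketch below is a generic illustration under assumed names (run_with_captured_io), not the PR's actual exec_code:

import builtins
import io

def run_with_captured_io(code: str, stdin_text: str) -> str:
    """Run candidate code with patched input()/print(), restoring the built-ins afterwards."""
    lines = iter(stdin_text.splitlines())
    captured = io.StringIO()
    original_input, original_print = builtins.input, builtins.print

    def patched_input(prompt=""):
        return next(lines)

    def patched_print(*args, **kwargs):
        kwargs.setdefault("file", captured)
        original_print(*args, **kwargs)

    builtins.input, builtins.print = patched_input, patched_print
    try:
        exec(code, {"__name__": "__main__"})
    finally:
        # Restore the real built-ins even if the candidate code raises.
        builtins.input, builtins.print = original_input, original_print
    return captured.getvalue()

# Example: run_with_captured_io("a = int(input()); b = int(input()); print(a + b)", "2\n3") returns "5\n"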
🔒 Security concerns
No
✨ Usage guide:
Overview:
The review tool scans the PR code changes, and generates a PR review. The tool can be triggered automatically every time a new PR is opened, or can be invoked manually by commenting on any PR.
When commenting, to edit configurations related to the review tool (pr_reviewer section), use the following template:
The review tool can be configured with extra instructions, which can be used to guide the model toward feedback tailored to the needs of your project.
Be specific, clear, and concise in the instructions. With extra instructions, you are the prompter. Specify the relevant sub-tool, and the relevant aspects of the PR that you want to emphasize.
Examples for extra instructions:
[pr_reviewer] # /review #
extra_instructions="""
In the 'possible issues' section, emphasize the following:
- Does the code logic cover relevant edge cases?
- Is the code logic clear and easy to understand?
- Is the code logic efficient?
...
"""
Use triple quotes to write multi-line instructions. Use bullet points to make the instructions more readable.
How to enable/disable automation
When you first install the PR-Agent app, the default mode for the review tool is:
pr_commands = ["/review", ...]
meaning the review tool will run automatically on every PR, with the default configuration.
Edit this field to enable/disable the tool, or to change the used configurations
Auto-labels
The review tool can auto-generate two specific types of labels for a PR:
a possible security issue label, that detects possible security issues (enable_review_labels_security flag)
a Review effort [1-5]: x label, where x is the estimated effort to review the PR (enable_review_labels_effort flag)
Extra sub-tools
The review tool provides a collection of possible feedback items about a PR.
It is recommended to review the possible options and choose the ones relevant for your use case.
Some of the features that are disabled by default are quite useful and should be considered for enabling. For example:
require_score_review, require_soc2_ticket, and more.
Auto-approve PRs
By invoking:
/review auto_approve
The tool will automatically approve the PR, and add a comment with the approval.
To ensure safety, the auto-approval feature is disabled by default. To enable auto-approval, you need to explicitly set the following in a pre-defined configuration file:
[pr_reviewer]
enable_auto_approval = true
(this specific flag cannot be set with a command line argument, only in the configuration file, committed to the repository)
You can also enable auto-approval only if the PR meets certain requirements, such as the estimated_review_effort being equal to or below a certain threshold, by adjusting the flag:
[pr_reviewer]
maximal_review_effort = 5
More PR-Agent commands
To invoke the PR-Agent, add a comment using one of the following commands:
/review: Request a review of your Pull Request.
/describe: Update the PR title and description based on the contents of the PR.
-exec(code, candidate_module.__dict__)
+try:
+    exec(code, candidate_module.__dict__)
+except SyntaxError as e:
+    logger.error(f"Syntax error in provided code: {e}")
+    raise
Simplify character comparison logic in sorting function.
Instead of manually comparing characters in compare_titles, consider using the cmp_to_key function from the functools module to simplify the sorting logic.
-for i in range(len(title1)):
-    if i % 2 == 0:  # Odd position (0-indexed)
-        if title1[i] != title2[i]:
-            return ord(title1[i]) - ord(title2[i])
-    else:  # Even position (0-indexed)
-        if title1[i] != title2[i]:
-            return ord(title2[i]) - ord(title1[i])
-return 0
+# Assuming the improved logic is implemented in a separate function
+return cmp_to_key(your_new_comparison_function)(title1, title2)
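For reference (not part of the PR), functools.cmp_to_key is normally passed as the key argument of sorted() or list.sort(), wrapping a comparator that returns a negative number, zero, or a positive number; the titles and comparator below are made-up illustrations:

from functools import cmp_to_key

def compare_titles(title1: str, title2: str) -> int:
    # Comparator contract: negative if title1 should sort first, 0 if equal, positive otherwise.
    if title1 == title2:
        return 0
    return -1 if title1 < title2 else 1

titles = ["bca", "abc", "bac"]
print(sorted(titles, key=cmp_to_key(compare_titles)))  # ['abc', 'bac', 'bca']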
Improve clarity and conciseness of guidance on choosing the best solution.
Consider rephrasing the guidance to emphasize the importance of insight and simplicity in the solution. The current phrasing "Don't just pick the most efficient solution. The main consideration is that the solution has the most insightfull key observation and can fully solve the problem in a simple and robust manner." could be made more concise and clear.
-Don't just pick the most efficient solution. The main consideration is that the solution has the most insightfull key observation and can fully solve the problem in a simple and robust manner.
+Prioritize solutions with insightful observations and simplicity, ensuring they fully and robustly solve the problem.
Enhance guidelines for commenting on function purposes and interactions.
The guideline "Skip the function bodies, just comment which part of algorithm it implements, which other generated functions it calls, and what it returns." could be enhanced by specifying that comments should also briefly mention any significant assumptions or preconditions for each function.
-Skip the function bodies, just comment which part of algorithm it implements, which other generated functions it calls, and what it returns.
+Skip the function bodies, but include comments detailing the part of the algorithm implemented, any other generated functions it calls, what it returns, and any significant assumptions or preconditions.
Add a default value for the explanation field in the InputOutput class.
Consider adding a default value for the explanation field in the InputOutput class to ensure consistency and avoid potential errors when examples are missing explanations.
-explanation: str = Field(description="Short explanation why the examples are in correct format.")
+explanation: str = Field(default="", description="Short explanation why the examples are in correct format.")
Add a description for the frequency_penalty parameter.
Add a description for the frequency_penalty parameter to clarify its purpose and impact on the code contest prompt solving process.
Clarify instructions for reporting issues not directly related to function calls or output formatting.
In the section that outlines the debugging task, it would be beneficial to clarify the expectations around the analysis of the call stack and the output. Specifically, the instruction "If you think the false output has other cause, like false algorithm, or that a function comments are false, say it." could be expanded to guide the user on how to report such findings effectively.
-If you think the false output has other cause, like false algorithm, or that a function comments are false, say it.
+If you identify other causes for incorrect output, such as errors in the algorithm or inaccuracies in function comments, please provide a detailed explanation.
Improve the clarity and correctness of instructions for generating AI tests.
The instruction "All the inputs should be valid, explicit, and can be directly inputted to the code. Double check them, and validate if they strictly match the problem description ans rules." contains a typo and could be more clearly worded. Consider revising for clarity and correctness.
-All the inputs should be valid, explicit, and can be directly inputted to the code. Double check them, and validate if they strictly match the problem description ans rules.
+Ensure all inputs are valid, explicit, and directly usable by the code. Double-check to confirm they strictly adhere to the problem description and rules.
best practice
Emphasize the importance of handling edge cases in function implementations.
The guideline "Double-check each function. It should implement the part of algorithm in its comments, and generalize to any valid parameters, and not just the provided examples." could be improved by explicitly mentioning the importance of considering edge cases and ensuring the function handles them appropriately.
-Double-check each function. It should implement the part of algorithm in its comments, and generalize to any valid parameters, and not just the provided examples.
+Carefully review each function to ensure it implements the algorithm as described in its comments, generalizes to any valid parameters, and correctly handles edge cases.
Align frequency_penalty values for consistency across configurations.
Ensure consistency in the configuration by aligning the frequency_penalty values across different TOML files if the intent is to maintain similar behavior in solving direct and regular code contest prompts.
-frequency_penalty = 0.1
+frequency_penalty = 0.1 # Ensure this value aligns with similar configurations in other TOML files for consistency.
Use environment variables for model and cache directory configurations.
Consider using environment variables or a configuration management system to dynamically set the model and private_dataset_cache_dir paths to facilitate easier switching between models and managing cache directories across different environments.
-model="gpt-4-0125-preview"
-private_dataset_cache_dir="~/ai/alphacodium"
+model=env.get("MODEL", "gpt-4-0125-preview") # Use environment variable or default
+private_dataset_cache_dir=env.get("CACHE_DIR", "~/ai/alphacodium") # Use environment variable or default
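Since TOML itself cannot call env.get, one way to realize this suggestion is to resolve environment variables in the Python code that loads the settings; the variable names MODEL and CACHE_DIR and the resolve_runtime_settings helper below are assumptions for illustration:

import os

def resolve_runtime_settings() -> dict:
    """Let environment variables override the TOML defaults at load time."""
    return {
        "model": os.environ.get("MODEL", "gpt-4-0125-preview"),
        "private_dataset_cache_dir": os.path.expanduser(
            os.environ.get("CACHE_DIR", "~/ai/alphacodium")
        ),
    }

# Example: running with MODEL=gpt-4o switches models without editing the TOML file.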
bug
Correct typo in the remove_brute_force_solutions setting.
Correct the typo in the remove_brute_force_solutions setting to ensure the configuration is correctly applied and brute force solutions are appropriately managed according to the intended settings.
-remove_brute_force_solutions=false
+remove_brute_force_solutions=false # Corrected typo from "remove_bruce_force_solutions"
✨ Usage guide:
Overview:
The improve tool scans the PR code changes, and automatically generates suggestions for improving the PR code. The tool can be triggered automatically every time a new PR is opened, or can be invoked manually by commenting on a PR.
When commenting, to edit configurations related to the improve tool (pr_code_suggestions section), use the following template:
meaning the improve tool will run automatically on every PR, with summarization enabled. Delete this line to disable the tool from running automatically.
Utilizing extra instructions
Extra instructions are very important for the improve tool, since they enable you to guide the model toward suggestions that are more relevant to the specific needs of the project.
Be specific, clear, and concise in the instructions. With extra instructions, you are the prompter. Specify relevant aspects that you want the model to focus on.
Examples for extra instructions:
[pr_code_suggestions] # /improve #
extra_instructions="""
Emphasize the following aspects:
- Does the code logic cover relevant edge cases?
- Is the code logic clear and easy to understand?
- Is the code logic efficient?
...
"""
Use triple quotes to write multi-line instructions. Use bullet points to make the instructions more readable.
A note on code suggestions quality
While the current AI for code keeps getting better (GPT-4), it is not flawless. Not all suggestions will be perfect, and a user should not accept all of them automatically.
Suggestions are not meant to be simplistic. Instead, they aim to give deep feedback and raise questions, ideas, and thoughts for the user, who can then apply their judgment, experience, and understanding of the code base.
It is recommended to use the 'extra_instructions' field to guide the model toward suggestions that are more relevant to the specific needs of the project, or to use the custom suggestions :gem: tool.
With large PRs, best quality will be obtained by using 'improve --extended' mode.
More PR-Agent commands
To invoke the PR-Agent, add a comment using one of the following commands:
/review: Request a review of your Pull Request.
/describe: Update the PR title and description based on the contents of the PR.
@mrT23 I really appreciate your interest.
Sure, the code is ready to try.
The reason for draft mode is that:
1. The stages "Generate Additional AI Tests" and "Iterate on AI Tests" are still out of band. TODO:
- The generated AI tests should contain only inputs, because the LLM is not able to compute the outputs.
- Running the code against the AI tests should employ the function-level debugging.
2. I added a quick module to run the generated code. It works, but it would be better to integrate it with alpha_codium/code_contests/eval/
PR-Agent was enabled for this repository. To use it, please link your git user with your CodiumAI identity here.
PR Review
⏱️ Estimated effort to review [1-5]
4, because the PR introduces significant changes across multiple files, including new functionalities, changes to existing processes, and the addition of debugging capabilities. The complexity and breadth of these changes necessitate a thorough review to ensure correctness, performance, and alignment with the project's architecture and coding standards.
🧪 Relevant tests
No
🔍 Possible issues
Possible Bug: The implementation of run_public_tests in run_public_tests.py relies on a retry mechanism with a fixed number of iterations (max_iter). This could lead to non-deterministic behavior and potentially infinite loops if the underlying issue causing a test to fail is not resolved within the allowed attempts.
Performance Concern: The debugging and function call logging mechanism introduced in debug.py could significantly impact performance, especially for complex codebases or when processing a large number of function calls. The overhead of logging every function call and its details might not be suitable for all environments.
Code Quality: There are several instances where comments are used to disable code blocks (e.g., # # generate ai tests (only inputs) in coding_competitor.py). This approach can lead to confusion and maintenance challenges. It would be better to remove unused code or clarify its purpose if it's meant to be re-enabled later.
Consistency Issue: The change from 'w' to 'a' in the file mode for logging setup in __init__.py of the log module could lead to logs being appended indefinitely, potentially causing issues with log file management and disk space usage.
🔒 Security concerns
No
Use more descriptive variable names and avoid shadowing built-in functions.
Consider using a more descriptive variable name than iter to avoid confusion with built-in functions and improve code readability. Additionally, ensure that the loop variable does not shadow the built-in iter function.
-for iter in range(max_iter):
+for attempt in range(max_iter):
Use is None for None checks to follow Pythonic practices.
Instead of using output == None, it's more Pythonic to use output is None to check for None values. This change enhances readability and follows Python's recommended practices.
-if output == None or (output.strip() != outp.strip()):
+if output is None or (output.strip() != outp.strip()):
Encourage the use of modular design principles in code structure.
Add a guideline to encourage the use of modular design principles when dividing the code into sub-functions, to enhance code readability and maintainability.
-You must divide the generated code into small sub-functions, with meaningful names, parameters and functionality.
+Divide the generated code into small sub-functions, applying modular design principles. Ensure each function has a clear purpose, meaningful names, and well-defined parameters.
Adjust the default verbosity level to balance feedback and information overload.
Consider setting the verbosity_level to 1 by default to balance between providing enough feedback for debugging and avoiding overwhelming users with too much information.
Enhance the custom print function to handle sep correctly.
To ensure that the custom print function correctly handles multiple arguments and keyword arguments, consider using sep from kwargs or defaulting to ' ' if not provided. This change ensures that the custom print function behaves more like the built-in print.
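A minimal sketch of such a print replacement, under assumed names (patched_print, output_buffer) rather than the PR's actual function, that honors sep and end while capturing output:

import io

output_buffer = io.StringIO()

def patched_print(*args, sep=" ", end="\n"):
    """Behave like the built-in print for sep/end, but write to an in-memory buffer."""
    output_buffer.write(sep.join(str(a) for a in args) + end)

patched_print("a", "b", 3, sep=",")  # writes "a,b,3\n"
patched_print("done")                # writes "done\n"
print(output_buffer.getvalue(), end="")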
Prevent indefinite log file growth by using a rotating file handler.
Consider using a rotating file handler or setting a file size limit for the log file to prevent it from growing indefinitely. This can be achieved by using RotatingFileHandler from the logging module instead of FileHandler.
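For reference, a rotating handler from Python's standard logging module could look like the sketch below; the file name and size limits are placeholder assumptions:

import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("alpha_codium_example")
logger.setLevel(logging.DEBUG)

# Rotate at roughly 5 MB and keep the 3 most recent files instead of appending forever.
handler = RotatingFileHandler("alpha_codium.log", maxBytes=5_000_000, backupCount=3)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("logging with rotation enabled")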
Encourage exploring efficient algorithms before resorting to brute force solutions.
Consider rephrasing the guideline about brute force solutions to encourage exploring more efficient algorithms before resorting to brute force. This can foster a deeper understanding of the problem and promote the development of more sophisticated solutions.
-- Give an efficient brute force solution, if you do not find a better algorithm.
+- Explore efficient algorithms first. If none are found, then consider a brute force solution as a last resort.
Add a guideline to ensure the diversity of the generated tests.
Add a guideline to ensure the diversity of the generated tests, emphasizing the importance of covering a wide range of scenarios, including edge cases and typical use cases.
-Try to cover cases that are not covered by the original tests, or are challenging for this implementation. Also include a test for large inputs.
+Ensure the diversity of the generated tests by covering a wide range of scenarios, including edge cases, typical use cases, and large inputs.
Clarification
Clarify guidelines on providing feedback for the chosen solution.
Clarify the guideline about not changing the selected solution to specify that while the chosen solution should not be altered, constructive feedback on how to improve or optimize the solution is encouraged.
-Do not change the selected solution.
+Do not change the selected solution. However, providing constructive feedback on potential improvements or optimizations is encouraged.