FLAML Agent improvement

Comments from @gagb: Some observations:

the agent many times starts to suggest shell commands which makes the code fail. Especially as the conversation gets longer
Sometimes the user responds with empty strings and the code agent never returns terminal and the code gets stuck in a loop. Also happens when lang=unknown eg cuz the agent didn't wrap the python code in codeblockss
The code fails if the context size > 8k Original comment: https://github.com/microsoft/FLAML/commit/3b3dd60730931c793e88e4b2aa870fa0192db3f5#diff-9ac9829642f8aa5ad3ed717f7f60eabedf33210195465c1f6473cd2cfd4cd2af

PR microsoft/FLAML#1025

### Tasks
- [ ] https://github.com/microsoft/autogen/issues/9

May 10 '23 02:05 qingyun-wu

@gagb The second problem should have been addressed in the latest PR. Let me know if you still have this observation.

May 12 '23 01:05 qingyun-wu

More feedback based on integration with tinyRA and using gpt-3.5-turbo:

Drift: The conversation may drift and start to execute code that unrelated to the goal and possibly very unsafe. We need more safety checks on the code it suggests.
Memory refreshing: Others have found that occasionally refreshing agent memory with goal can help.
Guaranteed structured output: Currently there are no guarantees that the coding agent will output a python code block (or even use code blocks). This can cause the conversation to fail.
Shell agent: Currently agent can't execute shell commands to succeed (e.g., pip commands to install python packages).

May 25 '23 17:05 gagb

@gagb The second problem should have been addressed in the latest PR. Let me know if you still have this observation.

I think I still happens with gpt-3.5. I haven't been able to test with gpt-4 because I don't have access to it. I am working on a feature to share failure cases from tinyRA easily.

May 25 '23 20:05 gagb