
Terminal interactions, future of web navigation, and antagonistic webpages

Open TimeLordRaps opened this issue 10 months ago • 6 comments

I'm interested in this model as a catch-all for a project I am working on, similar to Devin. I think as Devin-like clones and similar apparatuses become available, terminal commands will also take on heightened importance.

Earlier today, before seeing this model, I had the idea of something akin to an inverse of this model: one that generates web pages on the fly for the user, where essentially every action a user takes is fed into the server (the model), which responds with valid HTML based on the site's base architecture, which is itself AI-designed in the long term. Hence the existence of antagonistic webpage "agents". Let's call them something other than agents, though, because they are more like architects, so I'll go with that.
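To make the idea concrete, here is a minimal, purely illustrative sketch of that action-to-page loop. The model call is a stub, and names like `call_architect_model` are hypothetical; a real system would query an LLM trained to emit valid HTML.

```python
import json

def call_architect_model(prompt: str) -> str:
    """Stub for the page-generating model.

    A real architect would be an LLM conditioned on the site's base
    architecture; here we fake a deterministic page so the loop runs.
    """
    action = json.loads(prompt)["action"]
    return f"<html><body><h1>Result of {action}</h1></body></html>"

def handle_user_action(action: str, site_state: dict) -> str:
    """Feed a user action into the model and return the next page's HTML."""
    site_state["history"].append(action)
    prompt = json.dumps({"action": action, "history": site_state["history"]})
    return call_architect_model(prompt)

state = {"history": []}
page = handle_user_action("click:signup", state)
```

The key point is that the server never serves static files: every interaction round-trips through the model, which is what would let an "antagonistic" site reshape itself against automated visitors.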

  1. I think this is the future of web-related content.
  2. I think websites will build defensive efforts, intentionally designed in this manner, to defend against the autonomy of agents.
  3. I think autonomous agents will win out through the selective pressure of GPU concentration and constraints. Why? Because they started first, and most websites won't be antagonistic, so autonomy will have a larger space from which to build generality.
  4. Generality to the point of nines of reliability leads to agents that are most likely better than anything that could defend against them, while remaining suitable for humans, assuming vision.

What are your thoughts on these matters, and what datasets would fit best at the intersection of these ideas, toward creating the true future AI architect agent that designs new websites in tandem with its ability to use them?

TimeLordRaps avatar Apr 24 '24 21:04 TimeLordRaps

Thanks for the interesting discussion! I believe that red-teaming security vs. automation will be an important research subject, especially concerning autonomous agents. Since webllama aims to build human-centric agents (i.e., agents that interact through dialogue rather than a single instruction), it may be less affected, but some may still attempt to use conversational web agents for automation.

Regarding your idea of generating HTML, it reminds me of DeepMind's Genie, but for websites. I think it's an interesting area of research that will take time to yield fruitful results. Perhaps datasets like WebLINX and Mind2Web, and environments like WebArena and VisualWebArena, could be used to build such a "web Genie".

xhluca avatar Apr 24 '24 22:04 xhluca

I agree that it is very similar to Genie, though I feel Genie operated more at the pixel level.

Directly fine-tuned front-end architects seem more tractable in the near term than Genie, as does any type of backend-level integration of architects shortly after. What do you think would be more opportune to focus the architects on in the near term: front-end elements built off static foundations of components, or integrating the whole model into a single cohesive agentic framework that architects sites from the ground up, so to speak, and is generally agentic across sites? I'm asking based on the current state of the art, which I would say is this project, judging by your reports as well as publicly available datasets.

TimeLordRaps avatar Apr 24 '24 22:04 TimeLordRaps

I think it'll be quite challenging to tell which approach would be better near or long term. Front-end elements built off static foundations of components would be limited by the scoped functionality of the components, their limitations by design, etc. However, the existing structure of the components might make the result easier to evaluate. On the other hand, a single-framework approach might rely only on architects that are more general and have simpler inductive biases, but it would be harder to evaluate and could be more prone to failure. Overall, both avenues are worth exploring (probably with different groups working in different directions).

xhluca avatar Apr 25 '24 04:04 xhluca

Thank you for your very thorough analyses.

I had not considered evaluability.

Do you know of any datasets and/or resources besides the ones you listed that would be helpful towards either endeavor?

I feel interested in pursuing either avenue, but I think static foundations are probably the way to go in the near term. Once enough existing open-source foundations have been built, perhaps off a central framework like React, a more general model could be trained that is capable of generalizing across the central framework toward the beginnings of component design as well as cross-project combination.

At the point of reliable component design, I think it makes sense to integrate backends, which will probably undergo a similar evolution themselves, though that one seems murkier. I suppose the static foundation on the backend would be something like an API router, where the AI receives a request with a body and determines how to use the available APIs, which seems like where function-calling LLMs are already headed.
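As a rough illustration of that router idea (all names here are hypothetical, and the model's function-calling decision is stubbed with a keyword check rather than an actual LLM call):

```python
# Registry of available APIs the router can dispatch to.
AVAILABLE_APIS = {
    "get_user": lambda body: {"user_id": body.get("id")},
    "create_order": lambda body: {"order_id": 1, "items": body.get("items", [])},
}

def choose_api(request_body: dict) -> str:
    """Stub for the model's function-calling step.

    A real router would pass the request body and the API schemas to a
    function-calling LLM and parse its chosen tool call.
    """
    return "create_order" if "items" in request_body else "get_user"

def route(request_body: dict) -> dict:
    """Dispatch the request to whichever API the model selects."""
    return AVAILABLE_APIS[choose_api(request_body)](request_body)

order = route({"items": ["book"]})
user = route({"id": 7})
```

The "static foundation" is the API registry; only the selection step is learned.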

On the backend, component design seems like creating endpoints.

The combination of both paradigms seems like some form of communication between the architect on the frontend and the architect(s) on the backend, where the backend provides context for the frontend to "render".
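One way to picture that communication, purely as a sketch (both architects are stubbed functions here, not models):

```python
def backend_architect(request: dict) -> dict:
    """Stub backend architect: returns structured context for the page."""
    return {"title": request["page"].title(), "items": ["alpha", "beta"]}

def frontend_architect(context: dict) -> str:
    """Stub frontend architect: 'renders' the backend's context as HTML."""
    items = "".join(f"<li>{item}</li>" for item in context["items"])
    return f"<h1>{context['title']}</h1><ul>{items}</ul>"

html = frontend_architect(backend_architect({"page": "inventory"}))
```

The interface between the two is just structured context; each side could be trained and evaluated independently against that contract.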

Based on your experience with this project, if I were to train Llama 3 models for each separation of tasks in order to create some sort of architect architecture, what datasets would be most useful for each?

  1. Frontend component design (I doubt there are any reliable combination datasets).
  2. Function calling; I can probably find some for this, but your input would be appreciated.
  3. API endpoint examples/documentation for the design side of the backend.

TimeLordRaps avatar Apr 25 '24 19:04 TimeLordRaps

  1. https://huggingface.co/datasets/EddieChen372/react_repos (800 MB)
  2. Any state-of-the-art, high-quality, relevant function-calling dataset; it feels like there are a few.
  3. A giant function dataset that seems applicable, as it probably contains a ton of API endpoint functions, so it would just need to be filtered: https://huggingface.co/datasets/Fsoft-AIC/the-vault-function
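A rough filtering pass for point 3 might look like the following. The markers and the `code` field name are assumptions about the corpus, not its actual schema, and the rows are dummies so the sketch is self-contained.

```python
# Crude markers suggesting a function defines or handles an API endpoint.
ENDPOINT_MARKERS = ("@app.route", "@router.", "@api_view", "HttpResponse", "request.json")

def looks_like_endpoint(code: str) -> bool:
    """Heuristic check for endpoint-like source code."""
    return any(marker in code for marker in ENDPOINT_MARKERS)

# In practice you would stream the-vault-function with the `datasets`
# library and apply this as a filter; dummy rows stand in for it here.
rows = [
    {"code": "@app.route('/users')\ndef users():\n    return jsonify([])"},
    {"code": "def add(a, b):\n    return a + b"},
]
endpoints = [row for row in rows if looks_like_endpoint(row["code"])]
```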

TimeLordRaps avatar Apr 25 '24 23:04 TimeLordRaps

I think 1 would make sense, since human users would still see what's happening and be able to step in to take control. Sort of like driving a car vs. sitting in an automated train with no windows.

xhluca avatar Apr 26 '24 18:04 xhluca

Thanks

TimeLordRaps avatar Apr 27 '24 02:04 TimeLordRaps