AutoGPT
I made a CLIP vision + GPT-* + Stable Diffusion = Auto-GPT-SD + visual web (image) search. But...
... I have no idea how to adhere to your "contribution guidelines" (your "code of conduct" is a 404 btw).
I am not a developer. I can't really code. I can read Python well enough to understand general concepts and to find a variable so I can make an intended change - or to save an image that was originally only being displayed. But that's about it. I do, however, consider myself to have "an abundance of creative ideas", and I consider myself a "prompt engineer". In other words, GPT-4 wrote 90% of the code; I would never have been able to do it myself without GPT-4's help.
AI & I have implemented a custom run_clip command in AutoGPT (plus a custom command-line Stable Diffusion command, but that's rather irrelevant in the broad scheme of things).
Goal 1: GPT-3.5 (in my case, no GPT-4 API access yet) obtains "a CLIP opinion" about an initial image (gradient ascent, using @advadnoun's "CLIP gradient ascent" script, for which I have obtained explicit permission to make it public).
The script runs as a subprocess so that its raw output is hidden from GPT-3.5 and "pre-processed": non-printable characters and duplicates are stripped, and all tokens (words) returned by CLIP are dumped onto a single line in a txt file (a minimal sketch follows below, after Goal 5).
Goals 2, 3: read_file <clip_tokens.txt>. Make a coherent, meaningful, but creative prompt for Stable Diffusion from the CLIP opinion, explicitly including CLIP's "weird" token opinions like "spiderrollercoaster" (real example), and launch [the custom Stable Diffusion submodule process, in my case].
Goal 4: Obtain a CLIP opinion about the image I (AI) have created with stable diffusion and use it to make a new prompt.
Goal 5: Repeat ad infinitum (or token limit, or catastrophic forgetting of what the AI is doing)
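Here is a minimal sketch of the subprocess wrapping and token cleanup described under Goal 1. The script name, its CLI flags, and the cleanup rules are placeholders, not the exact code in my repo:

```python
import re
import subprocess
from pathlib import Path

def run_clip(image_path: str, workspace: Path = Path("workspace")) -> Path:
    """Run the CLIP gradient-ascent script as a subprocess and write a
    cleaned, single-line token list for the agent to read_file later."""
    # Hypothetical script name and flags; the real script may differ.
    result = subprocess.run(
        ["python", "clip_gradient_ascent.py", "--image", image_path],
        capture_output=True, text=True, check=True,
    )

    # Keep printable, word-like tokens only; drop duplicates but keep order.
    raw_tokens = re.findall(r"[A-Za-z0-9]+", result.stdout)
    seen, tokens = set(), []
    for tok in raw_tokens:
        low = tok.lower()
        if low not in seen:
            seen.add(low)
            tokens.append(low)

    out_file = workspace / "clip_tokens.txt"
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(" ".join(tokens))  # single line, as described above
    return out_file
```

The agent never sees the subprocess output directly - it only gets the cleaned clip_tokens.txt via read_file.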
Result: a self-reinforcing, self-improving image generation process. CLIP (or iterations thereof) is part of pretty much any text-to-image generative AI system, CLIP's text encoder is itself a GPT-style transformer, and GPT-3.5 / GPT-4 are later iterations of that lineage, I guess - so the whole thing works like a "hard soft prompt" (see arXiv:2302.03668): the prompts are "strange" to a human, but the outcome is (to a human) ever-improving - up until the "demented-are-go-GPT point" / model limitations.
https://twitter.com/zer0int1/status/1652372002546524160
CLIP's "opinion" tokens always contain some spot-on descriptions of what's in the image, mixed in with incomprehensible AI-weirdness. GPT-3.5 handles that extremely well: it can use CLIP's vision to extract an accurate picture classification and then engage in a web search to "find more pictures thereof", for example:
https://twitter.com/zer0int1/status/1652590746581585922
Nevertheless, oddly enough, if GPT-3.5 is instructed not just to "generate image" but to generate an "image of footwear", then - even if the CLIP opinion tokens describe footwear, as in my video example above - GPT-3.5 will conclude it must "get a better prompt" and run off to Pinterest to look for inspiration there. Or, in the absence of internet access, it will conclude "the CLIP opinion is not very good. I must use a different image to get a better CLIP opinion".
So I am leveraging CLIP's "typographic attack" vulnerability (its obsession with text present in the image, which steers CLIP towards the meaning of that text - and which, if that text is coherent with the actual object / image content, is excellent for reinforcing a certain interpretation / focus on a topic).
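You can see the effect for yourself with a minimal sketch like the one below (the file name and labels are made up; it assumes torch, Pillow, and the openai/clip package are installed): draw a word onto a copy of an image and compare how CLIP's zero-shot scores shift.

```python
import torch
import clip
from PIL import Image, ImageDraw

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["a photo of a shoe", "a photo of a cube", "a photo of text"]
text = clip.tokenize(labels).to(device)

# Original image vs. a copy with the word "cube" drawn onto it (placeholder file name).
plain = Image.open("shoe.png").convert("RGB")
typographic = plain.copy()
ImageDraw.Draw(typographic).text((10, 10), "cube", fill="white")

with torch.no_grad():
    for name, img in [("plain", plain), ("with text", typographic)]:
        image = preprocess(img).unsqueeze(0).to(device)
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).squeeze(0).tolist()
        print(name, {label: round(p, 3) for label, p in zip(labels, probs)})
```

With the drawn-in word, the scores typically shift towards the written text rather than the depicted object - which is exactly the effect I'm exploiting.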
AI & I are currently working on a CLIP GradCAM implementation, to have a "debug CLIP for the human" option that shows what CLIP is "looking at". It's also an excellent example of why I am not exaggerating when I say "GPT-4 coded this; I have essentially nothing to do with the code":
https://twitter.com/zer0int1/status/1652944918984249344
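For reference, the general Grad-CAM recipe applied to CLIP looks roughly like the sketch below. This is not our implementation: it assumes the RN50 model (hooking its last convolutional stage is simpler than dealing with a ViT), and the image file and prompt are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)
model.float()  # avoid fp16 gradient quirks on GPU

image = preprocess(Image.open("input.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a shoe"]).to(device)

# Capture activations and gradients of the last conv stage of the visual backbone.
activations, gradients = {}, {}
h1 = model.visual.layer4.register_forward_hook(
    lambda m, i, o: activations.update(feat=o))
h2 = model.visual.layer4.register_full_backward_hook(
    lambda m, gi, go: gradients.update(feat=go[0]))

similarity = torch.cosine_similarity(model.encode_image(image),
                                     model.encode_text(text))
model.zero_grad()
similarity.backward()

acts, grads = activations["feat"], gradients["feat"]         # [1, C, 7, 7]
weights = grads.mean(dim=(2, 3), keepdim=True)               # pooled gradients per channel
cam = torch.relu((weights * acts).sum(dim=1, keepdim=True))  # weighted channel sum
cam = torch.nn.functional.interpolate(cam, size=image.shape[-2:],
                                      mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # heatmap in [0, 1]

h1.remove(); h2.remove()
```

The resulting cam tensor can then be overlaid on the input image as a heatmap to show where CLIP is "looking".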
So, I have this thing which is a bit of a mess, likely grossly disrespecting developer / coding guidelines due to how it came to be.
I would like to contribute it, though.
But I wonder how. I could eventually do a fork that works for me, locally, but is "totally broken" for everybody else: absolute paths (how would I even define a "home" path that is the home on Linux AND Windows? No idea!), OpenAI/CLIP + models as a prerequisite, having to mess with the gradient ascent script code to change iterations and the CLIP model, crashes when the model is too big for VRAM, etc. - you name it.
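(Aside: apparently the cross-platform "home" question at least has a standard-library answer; a minimal sketch, with the workspace folder name being just a placeholder:)

```python
from pathlib import Path

# Path.home() resolves to /home/<user> on Linux and C:\Users\<user> on Windows.
workspace = Path.home() / "autogpt_sd_workspace"  # placeholder folder name
workspace.mkdir(parents=True, exist_ok=True)
clip_tokens = workspace / "clip_tokens.txt"
print(clip_tokens)  # an OS-appropriate absolute path, no hard-coding needed
```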
I would be happy for you to just "steal the idea" from inside this mess and implement it in a professional way, if you leave the author credits intact.
Let me know what you think - sorry about the verbosity, feel free to task your GPT-* with a tl;dr. ;-)
And: Thank you very much for this awesome repo!
I’m about to head to bed, but I’m glad to hear you want to give back. It really made my night. That’s what this project is all about. I’ll let someone like @Vwing or @k-boikov handle it from here but I love the spirit.
"Word of Mouth AI evaluation"
What are CLIP, GPT-* and Stable Diffusion communicating in their special CLIP token language...? 🧐 Coherent creativity, but no spatial information, I'd say... The "blue cube on top of a red cube" problem cannot be solved this way, it seems.
(See this as proof that the coherent generation of a series of shoe designs is not random but holds up consistently - just limited by GPT-3.5 hiccups / model limitations.)
https://twitter.com/zer0int1/status/1653510082364047365
PS: I tried the cube stack with just the word "cube" written on it, too. That was too much for CLIP's typographic attack vulnerability; it essentially fell down a rabbit hole of its own neural net:
CLIP: "quescube cube node physics box cubic roblox ︎ cubecube efficient cscrambled object itercubecubes cubes"
Messy or not, every contribution is a contribution! No matter what comes out of a piece of code, it's always worth sharing if you feel like it. We will be happy to look at it and potentially spark a discussion on how it can be useful for other people; it might end up as a plugin or maybe part of the core repo - who knows.
Alright - AI & I have struggled with v0.3.0, the now non-existent prompts.py (hey, I needed that! :-P ), and what GPT-4 referred to as an "@command decorator", until I ran out of generated responses for the next 3 hours - a good time to give up on the "fork" for now and just upload the working [for me] v0.2.2 as a new repo, "totally disconnected from this project".
Sorry about that - but at least this way I know I didn't just edit master and break everything either way (which would have been all too easy with a fork, given my lack of orientation therein).
https://github.com/zer0int/Auto-GPT-SD-Vision-alpha
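To illustrate what GPT-4 meant by an "@command decorator": the sketch below is a generic decorator-registry pattern, not AutoGPT's actual API - all names are made up.

```python
from typing import Callable, Dict

COMMANDS: Dict[str, Callable] = {}  # command name -> callable the agent may invoke

def command(name: str, description: str):
    """Register a function under a command name so the agent can look it up."""
    def decorator(func: Callable) -> Callable:
        func.description = description
        COMMANDS[name] = func
        return func
    return decorator

@command("run_clip", "Get a CLIP 'opinion' (token list) for an image")
def run_clip(image_path: str) -> str:
    # Placeholder body; the real command would launch the gradient-ascent subprocess.
    return f"clip_tokens.txt written for {image_path}"

# The agent loop then dispatches by name:
print(COMMANDS["run_clip"]("example.png"))
```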
I have no idea whether I can receive a "pull request" there by default, but "yes please", in case you (ANY "you" here) fix the messed-up absolute paths into something that works for everyone - I'd happily work with that instead of the "this ONLY works for me" version!
If you have GPT-4 API access and try this (I don't, although I am applying for API access with this very project), I'd love to see what you do with it - so please increase my AI FOMO level to over 9000 and share it with me: tag me on Twitter, @zer0int1.
Sorry for the mess, but I really hope you'll find it enjoyable (once you've fixed the enraging mess of awful coding).
I guess this is the future: repos / forks spawned by lusers leveraging GPT-4 to make cool things happen, but in outrageously messed-up ways that nobody else can figure out, because they cluelessly don't follow any coding conventions. ✌
tl;dr visual summary of what this even does, in embedded images:
Update: I got GPT-4 API access some 15 hours ago. The issue with catastrophic forgetting is indeed just a model limitation of GPT-3.5; GPT-4 has executed the task loop with perfection, i.e. 0 user interventions / 100% approval rate for 42 iterations of the "generate image" loop.
Together with the fact that GPT-4 coded the working v0.2.2 implementation of AutoGPT-SD-Vision, AND this flawless context memory, the adaptation to v0.3.0 will likely be possible too, now that I won't have to wait 3 hours and can work on it continuously, even if via the playground. However, I'll first try putting two stripped-down copies of the v0.2.2 and v0.3.0 folders into its workspace - let's see if it can just auto-fix / port the code, given careful and precise prompt engineering / goals.
If that works, it's likely AI & I can also fix all the "n00b" problems like absolute paths, special characters bugging out GradCAM, etc. In short, it's probably best for you not to bother with this and my infinitely verbose explanation of the current issues (if I had known that I would get GPT-4 API access eventually, I would have refrained from revealing the project to you at this point!).
Cheers & wishing you all a great weekend!
https://user-images.githubusercontent.com/132047210/236417681-21913953-bbbc-483d-a34a-0a963d5f167e.mp4
This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days.
This issue was closed automatically because it has been stale for 10 days with no activity.