
Follow up: Means of Evaluation and Instruction Automatic Generation

Open clean99 opened this issue 1 year ago • 5 comments

Following up on #70, we now have a potential way to evaluate the results produced by the system. We can also generate update instructions automatically via vision comparison.

Our system already performs vision comparison, but it is coupled with the rest of the generation process, so we cannot take full advantage of it:

  1. Evaluation: We need a way to evaluate the results our system generates, and vision comparison is well suited to this.
  2. Instruction generation: Today, users compare the original and result images themselves, then describe the differences to GPT and ask it to update the code. With vision comparison we can generate those instructions automatically, so users spend less effort spotting differences and typing them out.

Current flow:

comparison flow

Update flow:

comparison flow new

We will add another button, “Generate Instruction”, which inserts the “Auto Generate Instruction” and Eval steps into the flow.

System design

Frontend

  1. Add a Generate Instruction button:
new ui

It should call instructionGenerate when the user clicks it.

  2. instructionGenerate
// The function must be async since it awaits takeScreenshot.
async function instructionGenerate() {
  // Screenshot the current generated result and compare it with the reference image.
  const resultImage = await takeScreenshot();
  const originalImage = referenceImages[0];
  doGenerateInstruction({
    generationType: "update",
    image: originalImage,
    resultImage: resultImage,
  });
}

function doGenerateInstruction(params: InstructionGenerationParams) {
    setAppState(AppState.INSTRUCTION_GENERATING);

    // Merge settings with params
    const updatedParams = { ...params, ...settings };

    generateInstruction(
      wsRef,
      updatedParams,
      (token) => setUpdateInstruction((prev) => prev + token),
      (code) => setUpdateInstruction(code),
      () => setAppState(AppState.CODE_READY)
    );
}
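The callbacks passed to generateInstruction follow a streaming pattern: append tokens as they arrive, then replace the buffer with the final text. A minimal sketch of that contract is below; the real helper would read tokens from the WebSocket in wsRef, but here the token source is abstracted to an async iterable so the callback flow is easy to see. All names are illustrative, not the project's actual API.

```typescript
// Sketch only: a simplified stand-in for the streaming side of
// generateInstruction. Tokens come from any async iterable instead of a
// live WebSocket.
async function streamInstruction(
  tokens: AsyncIterable<string>,
  onToken: (token: string) => void,
  onComplete: (full: string) => void,
  onDone: () => void
): Promise<void> {
  let full = "";
  for await (const token of tokens) {
    full += token; // accumulate the partial instruction
    onToken(token); // lets the UI append tokens as they arrive
  }
  onComplete(full); // deliver the final instruction text
  onDone(); // caller flips appState back to CODE_READY here
}
```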

  3. AppState.INSTRUCTION_GENERATING

All buttons on the panel should be disabled and show a loading state while appState is AppState.INSTRUCTION_GENERATING.
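A single shared predicate keeps the disabled/loading logic consistent across the panel. The enum values below are assumptions based on the states named in this issue, not the project's actual definitions:

```typescript
// Assumed app states; only INSTRUCTION_GENERATING and CODE_READY appear in
// this issue, the rest are illustrative.
enum AppState {
  INITIAL = "INITIAL",
  CODE_READY = "CODE_READY",
  INSTRUCTION_GENERATING = "INSTRUCTION_GENERATING",
}

// One predicate every panel button can use for its disabled/loading props.
function isPanelDisabled(appState: AppState): boolean {
  return appState === AppState.INSTRUCTION_GENERATING;
}
```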

Backend

  1. prompt
You are a Frontend Vision Comparison expert.
You are required to compare two website screenshots: the first is the original site and the second is a redesigned version.
Your task is to identify differences in elements and their CSS, focusing on layout, style, and structure.
Do not consider the content (text, placeholders) of the elements, only the elements themselves.
Analyze the screenshots considering these categories:

Lack of Elements: Identify any element present in the original but missing in the redesign.
Redundant Elements: Spot elements in the redesign that were not in the original.
Wrong Element Properties: Note discrepancies in element properties like size, color, font, and layout.

Provide a clear conclusion as a list, specifying the element, the mistake, and its location.
In ambiguous cases, suggest a manual review.
Remember, this comparison is not pixel-by-pixel, but at a higher, more conceptual level.

Return only the JSON array in this format:
[
  {
    "element": "name, text, etc.",
    "mistake": "wrong color, wrong size, etc.(strictly use css properties to describe)",
    "improvement": "use #xxx color, use width: xxx px, etc.",
    "location": "header"
  },
]
Do not include markdown "```" or "```JSON" at the start or end.
  2. api
generate-instruction
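On the frontend, the model's JSON array reply has to be turned into the text placed in the update-instruction textarea. The sketch below is one way to do that; the Mistake shape mirrors the format in the prompt above, while the function name and the fallback for malformed output are assumptions:

```typescript
// Mirrors the JSON format requested in the prompt above.
interface Mistake {
  element: string;
  mistake: string;
  improvement: string;
  location: string;
}

// Convert the raw model reply into numbered, human-readable instructions.
function toInstructionText(raw: string): string {
  try {
    const mistakes: Mistake[] = JSON.parse(raw);
    return mistakes
      .map(
        (m, i) =>
          `${i + 1}. In the ${m.location}, the ${m.element} has ${m.mistake}; ${m.improvement}.`
      )
      .join("\n");
  } catch {
    // Model ignored the format; show the raw reply for manual review.
    return raw;
  }
}
```

Since the prompt already tells the model to suggest a manual review in ambiguous cases, falling back to the raw reply keeps that information visible to the user.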

Eval

Have GPT count the mistakes it previously made; this gives the user a rough sense of how well our system performs. I will explore more rigorous evaluation in a follow-up.

mistake
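For the mistake count, one simple approach is to tally the comparison output per location, so successive "generate instruction" → "update" loops can be compared run to run. This is a sketch under the assumption that the eval consumes the same JSON array the prompt defines; the names are illustrative:

```typescript
// Same shape as the JSON array the comparison prompt returns.
interface MistakeEntry {
  element: string;
  mistake: string;
  improvement: string;
  location: string;
}

// Tally mistakes per page region; a shrinking total across runs suggests
// the update loop is converging on the original screenshot.
function mistakeCountByLocation(mistakes: MistakeEntry[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const m of mistakes) {
    counts.set(m.location, (counts.get(m.location) ?? 0) + 1);
  }
  return counts;
}
```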

clean99 avatar Nov 24 '23 02:11 clean99

Apply Sweep Rules to your PR?

  • [ ] Apply: All new business logic should have corresponding unit tests.
  • [ ] Apply: Refactor large functions to be more modular.
  • [ ] Apply: Add docstrings to all functions and file headers.

sweep-ai[bot] avatar Nov 24 '23 02:11 sweep-ai[bot]

When I messed around with ChatGPT, it hallucinated a lot when it did a visual comparison. I'm curious if your prompt works well. How are the visual comparison results? Are they accurate?

Thanks for exploring this method of improvement.

abi avatar Nov 24 '23 03:11 abi

> When I messed around with ChatGPT, it hallucinated a lot when it did a visual comparison. I'm curious if your prompt works well. How are the visual comparison results? Are they accurate?
>
> Thanks for exploring this method of improvement.

It is not bad after limiting it to CSS-property mistakes only, but I believe there is room to improve. This will be an experimental feature: users can choose to skip it or edit the generated result, so I'd like to land it here, and hopefully better prompts will be contributed in the future.

clean99 avatar Nov 24 '23 03:11 clean99

Thank you for this and sorry I'm slow to review it. Will get it in tomorrow.

abi avatar Dec 01 '23 21:12 abi

Finally found some time to try out this PR.

The primary issue I have with merging this in is that I think the quality of the outputs is not good. Here's an example:

[Original screenshot (Nov 29, 2023) · Result screenshot (Dec 3, 2023) · Generated instruction text (Dec 3, 2023)]

As you can see, it most obviously gets the colors wrong, and it claims there's a background video. Not sure what's going on.

Other than that,

  • Make "Generate instruction" smaller
  • Need to fix the textarea so that it's not too big but expands based on its text, perhaps.

Fundamentally, the user's goal is to make the generated result more like the screenshot through repeated "generate instruction" -> "update" loops, but unfortunately I don't know if GPT4 vision works well with this approach.

Would love to hear thoughts on how this can be improved, and your experiences with it.

abi avatar Dec 04 '23 00:12 abi

Closing this PR for now since it's been a while. Still looking to improve the quality of generations so might come back to this approach of iterative LLM-driven improvements. I haven't tested GPT4o yet but with all the other older models, this method did not yield better results than just a single prompt. Will test for GPT4o in the next few weeks.

abi avatar Jun 05 '24 19:06 abi