
[v0.9.1] Formatting issues while rendering code

Open adhishthite opened this issue 1 year ago • 8 comments

image

@nsarrazin Whenever I ask chat-ui to explain / generate code, the < does not get rendered correctly. Can you please take a look?

adhishthite avatar Jul 10 '24 16:07 adhishthite

If you still have access, could you send me the raw conversation that shows this behaviour? There's a download button next to user messages in the UI (see image).

nsarrazin avatar Jul 11 '24 09:07 nsarrazin

OK. Think I can explain this one, and offer an improvement.

Code blocks in markdown can either be fenced (```html) or indented by 4 spaces.

The issue arises when the LLM responds with a code block that is both fenced AND indented.

In this case I think the correct behaviour is to show a code block, with the fences displayed as part of the code. VSCode and https://markdownlivepreview.com/ do this.

What is happening in Chat-UI seems to be:

  • The marked lexer does not pick this up as a code block, meaning that <CodeBlock> isn't used.
  • The marked renderer does, and emits <pre> and <code> tags, so the styling looks similar to a correctly rendered code block while the &lt; goes through as-is. Note that the Copy to Clipboard button is not present because the block hasn't been rendered by CodeBlock.
  • The behaviour is incorrect as in this case it should be including the triple backticks as part of its display (although I'd expect in >99% of cases the user would prefer standard CodeBlock behaviour and the LLM has made a mistake.)
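
To make the fenced-versus-indented distinction concrete, here is a minimal TypeScript sketch. It assumes the CommonMark rule that an opening fence may be indented by at most 3 spaces, and that a top-level line indented by 4 or more spaces is treated as an indented code block instead. `classifyFenceLine` and `FENCE` are hypothetical names written for illustration; they are not part of marked or chat-ui, and the helper deliberately ignores list nesting.

```typescript
type FenceKind = "fenced" | "indented-fence" | "not-a-fence";

// Three backticks, built at runtime so the sketch stays easy to paste into markdown.
const FENCE = "`".repeat(3);

// Classify a single top-level markdown line by how its fence is indented.
function classifyFenceLine(line: string): FenceKind {
  // Leading spaces, followed by a run of 3 or more backticks or tildes.
  const match = /^( *)(`{3,}|~{3,})/.exec(line);
  if (match === null) {
    return "not-a-fence";
  }
  const indent = (match[1] ?? "").length;
  // 0-3 spaces: a normal fence. 4+ spaces: the fence itself is indented code.
  return indent >= 4 ? "indented-fence" : "fenced";
}

console.log(classifyFenceLine(FENCE + "html"));          // fenced
console.log(classifyFenceLine("    " + FENCE + "html")); // indented-fence
```

A line in the second category is the "fenced AND indented" case described above: a spec-following parser would show the backticks as part of the code.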

While looking at this, I bumped the marked library to 13.0.3 and then 14.0, to see whether this fix (https://github.com/markedjs/marked/pull/3264) would make a difference. It doesn't. The upgrade does change the interface a little, but it is fairly easy to update.

In the meantime, adding this to ChatMessage turns the &lt; back into < and renders the fences, although it doesn't benefit from the Syntax Highlighting and Copy to Clipboard button that <CodeBlock> provides.

// `code` is the marked code token; `code.raw` is its original markdown source, fences included.
renderer.code = (code) => {
   return `<pre><code>${sanitizeMd(code.raw)}</code></pre>`;
};
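
A plausible reading of why a literal &lt; ends up on screen (an assumption, not something confirmed in this thread) is that the text gets HTML-escaped twice: once before it reaches marked, and once more by the default code renderer. The sketch below only illustrates that mechanism; `escapeHtml` is a hypothetical stand-in and is not chat-ui's `sanitizeMd` or marked's own escaping.

```typescript
// Hypothetical minimal HTML escaper, used only to illustrate double escaping.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const source = "<foo />";
const escapedOnce = escapeHtml(source);       // "&lt;foo /&gt;"
const escapedTwice = escapeHtml(escapedOnce); // "&amp;lt;foo /&amp;gt;"

// A browser decodes one level of entities when displaying text, so the
// once-escaped string shows as <foo /> and the twice-escaped one as &lt;foo /&gt;.
console.log(escapedOnce);
console.log(escapedTwice);
```

If that is what happens, a render hook that emits already-escaped text without escaping it again would show < correctly, which matches the behaviour seen with the hook above.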

@nsarrazin - let me know if you want this as a PR - I think this improves behaviour but isn't a proper fix. In the meantime, I'm going to tweak my System prompt to advise the LLM against using indents for Code Blocks, and ultimately think this is a defect in the marked library?

:( image

:) with additional render hook image

evalstate avatar Aug 15 '24 20:08 evalstate

I'm leaving the previous comment up; however, after more testing I am not 100% confident that indents are the only cause. The output below uses <CodeBlock> from list entry 10 onward...

        "updates": [
            {
                "type": "status",
                "status": "started"
            },
            {
                "type": "title",
                "title": "💻 HTML tags"
            },
            {
                "type": "finalAnswer",
                "text": "Certainly! Here are some common HTML tags with a brief description:\n\n1. **`<html>`**: The root element of an HTML page. It encloses all the content of the page.\n\n   ```html\n   <html>\n   </html>\n   ```\n\n2. **`<head>`**: Contains meta-information about the HTML document, like title and links to scripts and stylesheets.\n\n   ```html\n   <head>\n   </head>\n   ```\n\n3. **`<title>`**: Defines the title of the document, shown in the browser's title bar or tab.\n\n   ```html\n   <title>Page Title</title>\n   ```\n\n4. **`<body>`**: Contains the content of the document, such as text, images, and hyperlinks.\n\n   ```html\n   <body>\n   </body>\n   ```\n\n5. **`<h1>` to `<h6>`**: Header tags, `<h1>` is the highest level and `<h6>` is the lowest. They define headings.\n\n   ```html\n   <h1>Main Heading</h1>\n   ```\n\n6. **`<p>`**: Represents a paragraph of text.\n\n   ```html\n   <p>This is a paragraph.</p>\n   ```\n\n7. **`<a>`**: The anchor tag, used to create hyperlinks.\n\n   ```html\n   <a href=\"https://www.example.com\">Visit Example</a>\n   ```\n\n8. **`<img>`**: Embeds an image in the document. It should include the `src` (source) attribute and often an `alt` (alternative text) attribute.\n\n   ```html\n   <img src=\"image.jpg\" alt=\"Description of image\">\n   ```\n\n9. **`<ul>` / `<ol>`**: Unordered (bulleted) and ordered (numbered) list containers, respectively.\n\n   ```html\n   <ul>\n       <li>List item 1</li>\n       <li>List item 2</li>\n   </ul>\n   ```\n\n   ```html\n   <ol>\n       <li>First item</li>\n       <li>Second item</li>\n   </ol>\n   ```\n\n10. **`<li>`**: Represents a list item, used within `<ul>` or `<ol>`.\n\n   ```html\n   <li>A list item</li>\n   ```\n\n11. **`<div>`**: A generic container for content, often used for styling or layout purposes.\n\n   ```html\n   <div>This is a division.</div>\n   ```\n\n12. 
**`<span>`**: A generic inline container, typically used to apply styles or scripts.\n\n   ```html\n   <span style=\"color:blue\">This is a blue text.</span>\n   ```\n\n13. **`<input>`**: Represents an input field in a form, where data can be entered.\n\n   ```html\n   <input type=\"text\" name=\"username\">\n   ```\n\n14. **`<button>`**: Represents a clickable button.\n\n   ```html\n   <button>Click me</button>\n   ```\n\nRemember, these are just foundational tags, and HTML supports many more elements you can learn about as you build more complex pages.",
                "interrupted": false,
                "usage": {
                    "input_tokens": 88,
                    "output_tokens": 691
                }
            }
        ],
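
One possible explanation for the switch at entry 10, offered as a sketch rather than a confirmed diagnosis: under CommonMark, content belongs to a list item only if it is indented at least as far as the item's text. The marker "1. " puts that column at 3, while "10. " puts it at 4, and every fence in the output above is indented by 3 spaces. The helper names below are hypothetical and the marker regex is a simplification of the spec.

```typescript
// Column where a list item's content starts, e.g. "1. x" -> 3, "10. x" -> 4.
// Returns null when the line does not start with a simple list marker.
function listContentOffset(line: string): number | null {
  const match = /^( {0,3})(\d{1,9}[.)]|[-+*])( {1,4})(?=\S)/.exec(line);
  return match === null ? null : match[0].length;
}

// Number of leading space characters on a line.
function leadingSpaces(line: string): number {
  let count = 0;
  while (count < line.length && line[count] === " ") {
    count++;
  }
  return count;
}

// True when fenceLine is indented far enough to be a child of the list item.
function fenceIsListChild(itemLine: string, fenceLine: string): boolean {
  const offset = listContentOffset(itemLine);
  return offset !== null && leadingSpaces(fenceLine) >= offset;
}

const fenceLine = "   " + "`".repeat(3) + "html"; // 3-space indent, as in the output above
console.log(fenceIsListChild("9. **<ul>**", fenceLine));  // true
console.log(fenceIsListChild("10. **<li>**", fenceLine)); // false
```

On this reading, the fences under entries 1 to 9 are children of their list items, while from entry 10 onward they fall outside the list and are parsed as ordinary top-level fenced blocks.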

evalstate avatar Aug 15 '24 20:08 evalstate

Here is a snippet that shows the issue:

  • https://gist.github.com/evalstate/6b5ca3f67634602f7ce8dd8c3dbab7a3
  • Marked Demo

The handling of code blocks in lists changes; asking the LLM via Chat-UI to repeat all or part of the block verbatim shows the behaviour.

The GFM spec recommends using a blank HTML comment to disambiguate indented blocks: https://github.github.com/gfm/#example-288


## Inside a List

- This is a test (normal fences)

```html
<foo />
  • This is another test (indented block)

  • This is a further test (indents and fences)

    <foo />
       <bar />
    
  • Test complete

Outside a List

This is a test (normal fences)

<foo />

This is another test (indented block)

<foo />
    <bar />

This is another test (indents and fences)

```
<foo />
   <bar />
```

Test complete

evalstate avatar Aug 16 '24 08:08 evalstate

Final update on this for the moment - the issue also occurs when code blocks are children of lists, causing the parse(token.raw) to show the child codeblock rather than being caught by the type==="code" clause here:

https://github.com/huggingface/chat-ui/blob/97b6feb8b9ed57148e76b11944ace966029ea108/src/lib/components/chat/ChatMessage.svelte#L267-L276

Can't see an obvious quick way to fix this.
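
For what it's worth, one direction would be to walk the token tree instead of checking only top-level tokens. The sketch below is not a fix for ChatMessage and has not been tried against chat-ui. It assumes tokens shaped like the lexer output, where a list token carries `items` and each list item carries child `tokens`; `SketchToken` is a local stand-in type, not marked's own.

```typescript
// Local stand-in for a lexer token; only the fields this sketch needs.
interface SketchToken {
  type: string;
  raw: string;
  tokens?: SketchToken[]; // children of e.g. a list item or blockquote
  items?: SketchToken[];  // list items of a list token
}

// Collect every code token in document order, however deeply it is nested.
function collectCodeTokens(tokens: SketchToken[]): SketchToken[] {
  const found: SketchToken[] = [];
  for (const token of tokens) {
    if (token.type === "code") {
      found.push(token);
      continue;
    }
    found.push(...collectCodeTokens(token.items ?? []));
    found.push(...collectCodeTokens(token.tokens ?? []));
  }
  return found;
}

// Hand-built example tree: a list whose only item contains a code block.
const exampleTree: SketchToken[] = [
  {
    type: "list",
    raw: "",
    items: [
      {
        type: "list_item",
        raw: "",
        tokens: [
          { type: "text", raw: "Loop through each item" },
          { type: "code", raw: "nested code" },
        ],
      },
    ],
  },
  { type: "code", raw: "top-level code" },
];

console.log(collectCodeTokens(exampleTree).map((token) => token.raw));
```

Finding the nested tokens is the easy part; rendering them through CodeBlock while keeping the surrounding list markup intact is where a quick fix gets harder.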

evalstate avatar Aug 16 '24 12:08 evalstate

Getting this issue with Qwen2.5-Coder-32B-Instruct:

Screenshot_1

The raw markdown looks like:

### Explanation of the Code

1. **Loop through each `char*` and delete it:**
   ```cpp
   for (size_t i = 0; i < count; i++) {
       delete suggestions[i];
       suggestions[i] = 0;
   }

Seems like the code block produced by Qwen is indented, which usually isn't common, but seems to be more common with this particular model.

rotemdan avatar Nov 12 '24 09:11 rotemdan

It's because it's a child of a bulleted/numbered list. In this case it doesn't use the CodeBlock component but the marked output.

evalstate avatar Nov 12 '24 09:11 evalstate

My last reply wasn't helpful. There are 2 separate issues:

  1. Code blocks that are children of lists don't get rendered via the CodeBlock component.
  2. Those code blocks render "<" symbols incorrectly.

I can produce a PR for the second issue (I fixed this in my fork but left it as it's not a "complete" fix).

Adding this to ChatMessage fixes the <'s.

// `code` is the marked code token; `code.raw` is its original markdown source, fences included.
renderer.code = (code) => {
   return `<pre><code>${sanitizeMd(code.raw)}</code></pre>`;
};

evalstate avatar Nov 12 '24 09:11 evalstate