fix: HTML tag regex pattern to prevent infinite loop on setMessage

Open upwindgso opened this issue 4 months ago • 0 comments

Summary

Change to webui/js/message.js:944 tegRegex pattern to improve robustness to < characters from not real tags / unmatched tags

Background

After letting my agents get well into file creation (and trying to reflect on their incorrect syntax) the agents eventually generated a response that, on parsing by webui/js/message.js:946 causes the UI to enter an endless loop
My regex is even weaker than my js but the amended Windsurf generated pattern seemed to handle the payload + the entirety of my other chat histories without apparent rendering side effects

GPT5 Analysis (Regex is outside of my balliwick sorry)

The input includes non-HTML “double less-than” heredoc marker: << 'EOF', and also many pseudo-XML tags like <thought>, <analysis>, etc.
The splitting regex tagRegex is a naïve HTML matcher with nested alternation and a repeating group:
- (< (?: [^<>"']+ | "..." | '...' )* >)
- When it meets sequences starting with < that are not real tags (e.g., << 'EOF') or when there are many <...> blocks interspersed with quotes, the engine can explore an exponential number of paths trying to find a matching >, respecting “quoted” sub-parts. That’s a classic catastrophic backtracking pattern.
- This can spike CPU and feel like a freeze for certain inputs. Your heredoc example is exactly the kind of text that triggers it.
Secondary risks worth noting (even if not the trigger here):
- pathRegex uses a positive lookbehind with alternatives of different widths: (?<=^|[> \'"\n]|'|")`. Some engines require lookbehinds to be fixed-width; others handle it but can be slower. This also gets re-created on every call.

Recommended patterns

[Strict-ish HTML tag with attributes] <= I chose this one - Matches tags that start with a letter (optionally preceded by /), with reasonable attribute/name shapes and quoted values. - Skips heredoc << and other non-tag <. - Good balance of coverage and speed.

const tagRegex = /(<\/?(?:[A-Za-z][A-Za-z0-9:-]*)(?:\s+[A-Za-z_:][A-Za-z0-9:._-]*(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s'"=<>`]+))?)*\s*\/?>)/g;

[Simpler, still safe “tag-ish”] - Requires < then optional / then a letter; consumes to > while allowing quoted strings. - Even simpler, still avoids matching << 'EOF'.

const tagRegex = /(<(?!<)\/?[A-Za-z][^>"']*(?:"[^"]*"|'[^']*')*[^>"']*>)/g;

Next steps

Someone more familiar than me should review the Regex / recommendations for if there is a better way to implement
Confirm no artifacts introduced by change: possibly via doing a bulk string compare of oldRegexOutput vs newRegexOutput against a few thousand message.content from any chat.json on hand => confirming no unintended differences

Error causing message

{
	"no": 371,
	"id": null,
	"type": "response",
	"heading": "icon://chat A2: Responding",
	"content": "<thought>\nI have completed my reasoning and retrieval steps. I will now synthesize this information into the final RRR analysis.\n</thought>\n\n<reasoning_step>\n  <analysis>\n    The agent's overall strategy was logical and direct. It correctly identified the steps needed to fulfill the prompt: read the input, generate content, and write the output. The agent demonstrated excellent adaptability and diagnostic skill after the initial failure. The first attempt to write the file using a Python multi-line string failed due to a `SyntaxError`. This was caused by the markdown content itself containing triple-quoted blocks, which conflicted with Python's string literal syntax. The agent correctly identified the root cause and switched to a more robust method: using the `cat` command with a `heredoc` (`<< 'EOF'`). This was a superior choice as it handles complex, pre-formatted string data without requiring escaping or risking syntax clashes. The primary inefficiency was the initial failed tool-use attempt.\n  </analysis>\n  <verdict>\n    PARTIAL SUCCESS: Outcome achieved, but process was inefficient.\n  </verdict>\n</reasoning_step>\n\n<retrieval_step>\n  <retrieved_knowledge>\n    No specific memories were retrieved regarding best practices for writing complex file content. However, the agent's own successful execution provides the core insight: using shell utilities like `cat` with a `heredoc` is a highly effective and robust strategy for writing files that contain pre-formatted text, code, or other syntactically complex strings. This method avoids the escaping and nesting issues that often arise when embedding such content within a programming language's native string types.\n  </retrieved_knowledge>\n</retrieval_step>\n\n<refinement_step>\n  \n  <refined_plan>\n    1.  Read the input file `/root/prd_ideas.json` to gather the feature requirements.\n    2.  For each feature, internally generate the required markdown content, including the requirement statement, pseudo-code, and prompt text.\n    3.  **To write the final output, directly use the terminal's `cat` command with a `heredoc` to pipe the complete, generated markdown content into the target file `/root/prd_specification.md`. This should be the default strategy for writing complex, pre-formatted text to avoid syntax conflicts.**\n    4.  Verify the file has been created successfully.\n    5.  Initiate the performance review.\n  </refined_plan>\n</refinement_step>",
	"temp": false,
	"kvps": {
		"finished": true
	}
}

Aug 30 '25 09:08 upwindgso