[Bug] Incorrect Handling of Mixed LaTeX Math Symbols and Natural Language Text with Dollar Signs

Open ayanamists opened this issue 1 year ago • 6 comments

Bug Description

To solve issue #2841, the text buffer is now preprocessed by escapeDollarNumber:

// app/components/markdown.tsx
function escapeDollarNumber(text: string) {
  let escapedText = "";

  for (let i = 0; i < text.length; i += 1) {
    let char = text[i];
    const nextChar = text[i + 1] || " ";

    if (char === "$" && nextChar >= "0" && nextChar <= "9") {
      char = "\\$";
    }

    escapedText += char;
  }

  return escapedText;
}

However, the current algorithm affects every LaTeX formula that starts with a number. Even $1 + 1 = 2$ cannot be displayed correctly.
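To make the failure concrete, here is a minimal reproduction that reuses the function quoted above unchanged. The sample inputs are illustrative assumptions, not taken from the app's test suite.

```typescript
// Copied from the snippet above (app/components/markdown.tsx at the time of the report).
function escapeDollarNumber(text: string) {
  let escapedText = "";

  for (let i = 0; i < text.length; i += 1) {
    let char = text[i];
    const nextChar = text[i + 1] || " ";

    // Any "$" directly followed by a digit is escaped, whether it is a price or math.
    if (char === "$" && nextChar >= "0" && nextChar <= "9") {
      char = "\\$";
    }

    escapedText += char;
  }

  return escapedText;
}

console.log(escapeDollarNumber("The price is $5")); // The price is \$5
console.log(escapeDollarNumber("$1 + 1 = 2$"));     // \$1 + 1 = 2$
```

In the second call the opening delimiter is escaped while the closing one is not, so the formula no longer has a matching pair of dollar signs.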

I also have a real-world example:

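For example, in the expression $\lambda . \lambda . 1$, the innermost $1$ is closed, because its index $1$ equals its depth $1$ in the expression. Similarly, in the expression $\lambda . \lambda . 2$, the innermost $2$ is also closed, because its index $2$ equals its depth $2$ in the expression.

If a variable's index is greater than its depth, it is considered free. For example, in the expression $\lambda . 2$, $2$ is a free variable, because its index $2$ is greater than its depth $1$ in the expression.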

I know this is a difficult problem to solve, but such cases are not rare.

Steps to Reproduce

  1. Start a new chat.
  2. Say: "Please output 1 + 1 = 2 in latex"
  3. GPT will answer in LaTeX, and the formula will not be displayed correctly.

Expected Behavior

Most LaTeX formulas that start with a number should be displayed correctly.

Deployment Method

  • [X] Docker
  • [ ] Vercel
  • [ ] Server

ayanamists avatar Mar 06 '24 11:03 ayanamists

I know this is related to these issues:

  • #4155
  • #3964
  • #3239

H0llyW00dzZ avatar Mar 07 '24 03:03 H0llyW00dzZ

Has the final solution not been confirmed yet? Honestly, until a pull request fixing the dollar sign issue is merged, further improvements are out of the question.

daiaji avatar Mar 09 '24 12:03 daiaji

This issue is challenging to resolve. I'm not convinced it's feasible to fix given its complexity, particularly on the frontend with React Markdown. It might be more practical to create a simpler, standalone package rather than dealing with the complexities of this issue.

H0llyW00dzZ avatar Mar 09 '24 13:03 H0llyW00dzZ

Regardless of how the code is encapsulated, it seems that there is no way to avoid using complex logic and regular expressions to address this issue. I conducted a brief search for Markdown rendering packages in Node.js, and it appears that almost all packages have given up on properly handling the rendering of the dollar sign. The maintainers seem to have chosen a rather passive approach of not addressing such rendering issues.

The issue might be the only valuable thing there; markdown-it doesn't support LaTeX at all, and as for react-markdown, you know.

Frankly, if everyone keeps approaching this issue with a negative attitude, its maintenance may eventually be left to LLMs.

daiaji avatar Mar 09 '24 13:03 daiaji

I believe there's always a way to resolve this without resorting to complex logic and excessive use of regular expressions. It's just that I currently don't have the time to do it.

H0llyW00dzZ avatar Mar 09 '24 16:03 H0llyW00dzZ

I took a quick look at the example of markdown-to-jsx, and it seems that it requires writing LaTeX rendering conditions. This task seems a bit simpler compared to what we are currently working on, at least we don't have to replace dollar signs. However, the question is whether it's worth refactoring the code.

Honestly, if there are no existing solutions available, our choices might be limited.

daiaji avatar Mar 09 '24 21:03 daiaji

I ran into this problem too. Could we use a bare $ to mark math syntax and `$` (wrapped in backticks) to mark a price or something similar?

The backticks around $ xxx could be added by the LLM via the prompt.

Just some simple ideas.

ClConstantine avatar Mar 19 '24 06:03 ClConstantine

Adding this to the prompt may not be a good idea, since it costs extra tokens on every request.

ayanamists avatar Mar 19 '24 08:03 ayanamists

I found a solution: https://github.com/ChatGPTNextWeb/ChatGPT-Next-Web/pull/4354 I tested it with different examples, and it worked.

Algorithm5838 avatar Mar 19 '24 22:03 Algorithm5838

Would you like to share some insights into your PR? I cannot understand the complex regex used in your code:

/(?<!`)\$(\d+(?:[.,]\d+)*)(?=\s*[a-zA-Z.,;!?]?\s*$|\s+[a-zA-Z]|\s+\$)(?!`)/g

ayanamists avatar Mar 20 '24 02:03 ayanamists

I came up with it with the help of LLMs.

Here is an explanation:

  1. Ensure that the dollar sign ($) is not preceded by a backtick (`).
  2. Match the dollar sign ($).
  3. Match one or more digits (\d+).
  4. Optionally match one or more groups of a separator (. or ,) followed by one or more digits ((?:[.,]\d+)*).
  5. Ensure that the matched dollar amount is followed by:
    • Optional whitespace, at most one letter or punctuation mark (., ,, ;, !, ?), optional whitespace, and then the end of the input ($; without the m flag this is the end of the whole string, not of each line), or
    • One or more whitespace characters and then a letter, or
    • One or more whitespace characters and then another dollar sign ($).
  6. Ensure that the matched amount is not immediately followed by a backtick (`).
  7. The g flag at the end makes the regular expression global, meaning it will match all occurrences in the text.

I think it still can be improved upon.
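As a rough illustration of the steps above, the sketch below applies this first version of the regex to a few strings. The helper name escapeDollarAmounts, the backslash-prefix replacement, and the sample inputs are assumptions made for the example; the thread does not show how the PR itself calls the regex.

```typescript
// First regex quoted in this thread (its author revises it in later comments).
const DOLLAR_AMOUNT =
  /(?<!`)\$(\d+(?:[.,]\d+)*)(?=\s*[a-zA-Z.,;!?]?\s*$|\s+[a-zA-Z]|\s+\$)(?!`)/g;

// Hypothetical helper: prefix every match with a backslash so the "$" is rendered literally.
function escapeDollarAmounts(text: string): string {
  return text.replace(DOLLAR_AMOUNT, (match) => "\\" + match);
}

console.log(escapeDollarAmounts("It costs $5."));             // It costs \$5.
console.log(escapeDollarAmounts("Buy it for $0.95 or less")); // Buy it for \$0.95 or less
console.log(escapeDollarAmounts("$1 + 1 = 2$"));              // $1 + 1 = 2$
```

The formula is left alone only because nothing in the lookahead matches " + 1 = 2$"; the regex has no notion of a closing delimiter.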

Algorithm5838 avatar Mar 20 '24 02:03 Algorithm5838

I noticed an issue with the regex and fixed it. It is now:

/(?<!`)\$(\d+(?:[.,]\d+)*)(?=\s*[.,;!?]\s*\B|\s+[a-zA-Z]|\s+\$)(?!`)/g
    // Regex explanation:
    // (?<!`)                 # Negative lookbehind to ensure the '$' is not preceded by a backtick (`)
    // \$                     # Match a literal '$' character
    // (\d+(?:[.,]\d+)*)      # Capture group 1: Match one or more digits, optionally followed by a decimal part (e.g., 123.45)
    // (?=                    # Positive lookahead to ensure the following conditions are met:
    //   \s*[.,;!?]\s*\B      #   The number is followed by a punctuation mark (.,;!?) and a non-word boundary
    //   |                    #   OR
    //   \s+[a-zA-Z]          #   The number is followed by one or more whitespace characters and a letter
    //   |                    #   OR
    //   \s+\$                #   The number is followed by one or more whitespace characters and a '$' sign
    // )
    // (?!`)                  # Negative lookahead to ensure the matched amount is not followed by a backtick (`)
    // /g                     # Global flag to replace all occurrences

Algorithm5838 avatar Mar 20 '24 04:03 Algorithm5838

Thanks for your explanation. I tried some examples:

function check(line) {
     console.log(line.replace(/(?<!`)\$(\d+(?:[.,]\d+)*)(?=\s*[.,;!?]\s*\B|\s+[a-zA-Z]|\s+\$)(?!`)/g, '\\$&'));
}

check('The price of xxx is $1')
check('The price of xxx is $1. You can buy it for $0.95 or lower')
check('例如,在表达式$\lambda . \lambda . 1$中,最内层的$1$是封闭的,因为它的索引值$1$等于它在表达式中的深度$1$。同样,在表达式$\lambda . \lambda . 2$中,最内层的$2$也是封闭的,因为它的索引值$2$等于它在表达式中的深度$2$')
check('$1 + 1 = 2$')

The output:

The price of xxx is $1
The price of xxx is \$1. You can buy it for \$0.95 or lower
例如,在表达式$lambda . lambda . 1$中,最内层的$1$是封闭的,因为它的索引值$1$等于它在表达式中的深度$1$。同样,在表达式$lambda . lambda . 2$中,最内层的$2$也是封闭的,因为它的索引值$2$等于它在表达式中的深度$2$
$1 + 1 = 2$

It seems your solution works fine in these examples.

ayanamists avatar Mar 20 '24 06:03 ayanamists

It is now:

/(?<!`|\\)\$(\d+(?:[.,]\d+)*)(?=\s*[.,;!?]\s*\B|\s+[a-zA-Z]|\s+\$|$)(?!`)/g

I noticed the issues in your output and fixed them. Update: fixing other scenarios and rare use cases.

/(?<!`|\\)\$(\d+(\w+)?(?:[.,]\d+(\w+)?)*)(?=\s*[.,;?]\s*\B|!?\s+[a-zA-Z]|!?\s+\$|!?\s*[-=+\/]\s*\$\b|$)(?!`)/g

Algorithm5838 avatar Mar 20 '24 07:03 Algorithm5838

Update: I went about it the wrong way. The new PR is the one to use; it has a better, shorter regex that covers all cases. https://github.com/ChatGPTNextWeb/ChatGPT-Next-Web/pull/4363

/(?<!`|\\)\$\d+([,.](\d+[,.])?\d+)?(?!.*\$\B)(?!`)/g
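A small sketch of how this shorter regex behaves on prices versus formulas. The helper name escapePrices, the backslash-prefix replacement, and the inputs are assumptions for illustration; the PR's actual call site is not shown in this thread.

```typescript
// The regex quoted above. The key part is (?!.*\$\B): a "$<number>" is skipped when
// a later "$" on the same line is followed by a non-word character or the end of input,
// which is how a closing math delimiter usually looks.
const PRICE = /(?<!`|\\)\$\d+([,.](\d+[,.])?\d+)?(?!.*\$\B)(?!`)/g;

// Hypothetical helper: prefix every match with a backslash.
function escapePrices(text: string): string {
  return text.replace(PRICE, (match) => "\\" + match);
}

console.log(escapePrices("The price is $5"));              // The price is \$5
console.log(escapePrices("It costs $20,000 and $30,000")); // It costs \$20,000 and \$30,000
console.log(escapePrices("$1 + 1 = 2$"));                  // $1 + 1 = 2$
console.log(escapePrices("`$5`"));                         // `$5`
```

Because the decision depends on what else appears later on the same line, mixed lines that contain both a price and a formula are worth testing separately.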

Algorithm5838 avatar Mar 21 '24 05:03 Algorithm5838

Thanks to Algorithm5838's contribution, this problem has now been solved.

Dean-YZG avatar Apr 09 '24 02:04 Dean-YZG

Unfortunately, there are still some issues:

  1. If you did not use the Inject System Prompt, the issues will persist, as the LLM might still use single dollar signs for inline LaTeX.
  2. Similarly, the same problem is present in block LaTeX, where if the double dollar signs are followed by a number, the LaTeX rendering would break. The first two issues are related because the dollar sign(s) is followed by a number.
  3. Another issue is that if the dollar sign and number are inside a code block or inline code, a backslash would be rendered incorrectly.
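One possible way to approach the third point, sketched under assumptions: split the text on fenced code blocks and inline code, and run the escaping step only on the remaining parts. This is not the workaround from the fork mentioned in this comment; escapeOutsideCode and escapeDollarDigit are hypothetical names, and the escaper used here is deliberately naive.

```typescript
// Hypothetical helper: apply `escape` only to text outside fenced code blocks and inline code.
function escapeOutsideCode(text: string, escape: (s: string) => string): string {
  // The capturing group makes split() keep the code spans, which land at odd indices.
  const parts = text.split(/(```[\s\S]*?```|`[^`\n]*`)/);
  return parts.map((part, i) => (i % 2 === 1 ? part : escape(part))).join("");
}

// Deliberately naive escaper, used only to make the effect visible.
const escapeDollarDigit = (s: string): string => s.replace(/\$(?=\d)/g, () => "\\$");

console.log(escapeOutsideCode("Run `echo $1` to pay $5", escapeDollarDigit));
// Run `echo $1` to pay \$5
```

Any of the regexes discussed in this thread could be passed in place of escapeDollarDigit; the split itself does not handle nested or unbalanced backticks.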

And here is a related issue: https://github.com/ChatGPTNextWeb/ChatGPT-Next-Web/issues/4537

My workaround has fixed these three issues.

Current implementation: Screenshot 2024-04-18 at 13 01 58

My workaround: Screenshot 2024-04-18 at 13 00 26

You can try it yourself, here is the instance of my fork https://github.com/Algorithm5838/NextChat/tree/dollar-sign: https://nextchat-git-dollar-sign-algorithm5838s-projects.vercel.app/

Algorithm5838 avatar Apr 18 '24 10:04 Algorithm5838

Perhaps this problem will never be solved.

daiaji avatar Apr 18 '24 10:04 daiaji

@daiaji Did you try my workaround? If so, how did you find it?

Algorithm5838 avatar Apr 18 '24 10:04 Algorithm5838

Sorry, I just feel very frustrated.

As you can see, I submitted this PR. Honestly, even though GPT has provided a lot of help and it has taken up a significant amount of my time, it seems that the problem is still far from being solved.

That's all for now.😔

daiaji avatar May 12 '24 09:05 daiaji

I understand your frustration. It can be disheartening when you've put in a significant amount of time and effort into a pull request and the problem still remains unsolved.

In my opinion, this issue should definitely be addressed in the remark parser. The parser should correctly identify what is math and what is a US dollar symbol. Interestingly, I have never encountered such a problem when using pandoc (for converting and blogging, see My Blog Project). This is because pandoc uses a stronger rule for markdown math, as documented in pandoc's user guide:

Extension: tex_math_dollars Anything between two $ characters will be treated as TeX math. The opening $ must have a non-space character immediately to its right, while the closing $ must have a non-space character immediately to its left, and must not be followed immediately by a digit. Thus, $20,000 and $30,000 won’t parse as math. If for some reason you need to enclose text in literal $ characters, backslash-escape them and they won’t be treated as math delimiters.

I have tested my inputs, and all of them are handled correctly by pandoc. Most of the time, ChatGPT's output follows this guideline, so I'd like to figure out why remark doesn't use this rule.
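To show what the quoted rule does in practice, here is a minimal sketch of a scanner that follows it. It only approximates the rule as worded in the user guide; it is not pandoc's parser, it ignores $$ display math, and findDollarMath is a hypothetical name.

```typescript
// Hypothetical helper: return the contents of every "$...$" span accepted by the quoted rule.
function findDollarMath(text: string): string[] {
  const spans: string[] = [];
  const isSpace = (c: string): boolean => /\s/.test(c);
  const isDigit = (c: string): boolean => c >= "0" && c <= "9";
  let i = 0;
  while (i < text.length) {
    if (text[i] === "\\") { i += 2; continue; } // backslash-escaped character: skip it
    const hasNext = i + 1 < text.length;
    // An opening "$" needs a non-space character immediately to its right.
    if (text[i] !== "$" || !hasNext || isSpace(text[i + 1]) || text[i + 1] === "$") {
      i += 1;
      continue;
    }
    // The next unescaped "$" is the only candidate for the closing delimiter.
    let j = i + 1;
    while (j < text.length && text[j] !== "$") j += text[j] === "\\" ? 2 : 1;
    const closes =
      j < text.length &&
      !isSpace(text[j - 1]) &&                        // non-space immediately to its left
      !(j + 1 < text.length && isDigit(text[j + 1])); // not immediately followed by a digit
    if (closes) {
      spans.push(text.slice(i + 1, j));
      i = j + 1;
    } else {
      i += 1; // not math: treat this "$" as literal text and keep scanning
    }
  }
  return spans;
}

console.log(findDollarMath("$20,000 and $30,000"));   // []
console.log(findDollarMath("$1 + 1 = 2$"));           // [ '1 + 1 = 2' ]
console.log(findDollarMath("I paid $5 for $x + 1$")); // [ 'x + 1' ]
```

The first input is the example from the user guide: the second dollar sign has a space to its left, so it cannot close a math span and both amounts stay plain text.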

ayanamists avatar May 15 '24 02:05 ayanamists

It's not possible to fix anything related to LaTeX anyway, because the module conflicts with the front-end CSS and UI/UX.

H0llyW00dzZ avatar May 15 '24 02:05 H0llyW00dzZ