textlint-plugin-latex2e

Working on v2

Open pddg opened this issue 1 year ago • 8 comments

I have made some modifications since I became one of the maintainers, but I have not been able to participate actively in development over the past few years. I discovered some problems along the way but put them off because of the complexity of the current code base.

Several major events have occurred during this time.

  • latex-utensils has published three more major releases.
    • textlint-plugin-latex2e depends on v3.x, but the latest is v6.1.
    • Updating is known to be difficult.
  • Starting with textlint v13.0.0, all AST NodeTypes are now defined!
    • https://textlint.github.io/blog/2023/01/27/textlint-13.html#-breaking-changes
  • LaTeX Workshop, the VS Code extension that led me to introduce latex-utensils, has started using unified-latex.
    • https://github.com/James-Yu/LaTeX-Workshop/issues/3877

Even in 2024, this project remains the only viable way to use textlint with LaTeX. I began to think that it might be appropriate to rebuild the entire project once so that we can keep maintaining it. This means building v2 from scratch.

Which parser library is better for us?

For v2, I re-evaluated several LaTeX parsers written in TypeScript or JavaScript. The sample TeX source is as follows:

% This is the beginning of the document
\documentclass{article}  % This is a comment

\begin{document}

This is a paragraph that contains inline math: $E=mc^2$.
This is a second line.

% This is the beginning of the nested document
\begin{itemize}
    \item First item
    \item Second item
    \item Third item
\end{itemize}

\begin{tabular}{|c|c|c|}
    \hline
    Column 1 & Column 2 & Column 3 \\
    \hline
    Row 1 & Row 1 & Row 1 \\
    \hline
    Row 2 & Row 2 & Row 2 \\
    \hline
\end{tabular}

\end{document}

latex-utensils v6.2.0

The generated AST of the sample TeX source is here.

  • Pros
    • We already know this parser
    • Fast
    • Fewer dependencies
  • Cons
    • space, linebreak, and softbreak nodes do not have location information.
    • Comments are parsed but are not included in the AST, so we have to calculate their positions in the AST ourselves.
      • The completeComment function is implemented for this.
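
Since comments are missing from the AST, their positions have to be recovered from the source text. The following is only a minimal sketch of that idea, not the plugin's actual completeComment implementation. `findComments` and `CommentSpan` are hypothetical names, and the scan assumes that no verbatim-like environment contains a literal `%`.

```typescript
// Hypothetical helper for illustration; not part of latex-utensils or this plugin.
interface CommentSpan {
  start: number; // offset of the '%' character
  end: number;   // exclusive offset; the trailing newline is not included
  text: string;  // comment text, including the leading '%'
}

function findComments(source: string): CommentSpan[] {
  const spans: CommentSpan[] = [];
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    if (ch === "\\") {
      // Skip the escaped character so that "\%" is not treated as a comment.
      i += 2;
      continue;
    }
    if (ch === "%") {
      // A comment runs up to (but not including) the end of the line.
      let end = source.indexOf("\n", i);
      if (end === -1) end = source.length;
      spans.push({ start: i, end, text: source.slice(i, end) });
      i = end;
      continue;
    }
    i += 1;
  }
  return spans;
}
```

The resulting offsets could then be merged into the AST by comparing them with the locations of neighbouring nodes.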

unified-latex v1.6.0

The generated AST of the sample TeX source is here.

  • Pros
    • Most whitespace has location information.
    • Comments are included in AST by default.
  • Cons
    • whitespace and parbreak nodes around \item have no location information
      • https://github.com/siefkenj/unified-latex/issues/73
      • We may need to implement a way to calculate their positions ourselves.
    • A whitespace node may actually be a softbreak
      • latex-utensils parses it as softbreak: https://github.com/pddg/eval-js-latex-parsers/blob/88f3ed88f8da01540da3e73d64b4d076ad6c94fb/latex-utensils/output.json#L310-L312
      • unified-latex parses it as whitespace (softbreak is not defined in unified-latex): https://github.com/pddg/eval-js-latex-parsers/blob/88f3ed88f8da01540da3e73d64b4d076ad6c94fb/unified-latex/output.json#L514-L528
    • Large dependencies
    • Compatibility issues with ES modules
      • https://github.com/siefkenj/unified-latex/issues/72
      • It can now be used by setting "module": "ESNext" and "moduleResolution": "Bundler".
    • Slower than latex-utensils
      • https://github.com/James-Yu/LaTeX-Workshop/issues/3877#issuecomment-1544277944
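
As a sketch, the two compiler settings mentioned in the ES module item would sit in tsconfig.json roughly like this. Every other option is omitted here and depends on the project. Note that the "Bundler" value of moduleResolution requires TypeScript 5.0 or later.

```json
{
  "compilerOptions": {
    "module": "ESNext",
    "moduleResolution": "Bundler"
  }
}
```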

completeBlank and completeComment are hard to maintain, so I think the transition to unified-latex might be worth it.

LaTeX.js v0.12.6

As far as I could tell, this cannot be used as a mere parser. It may be possible to define a Generator that produces textlint ASTs from LaTeX ASTs, but I do not know how to create a Generator. The documentation currently just says TODO. https://latex.js.org/api.html#class-generator

Create original one

Another option would be to write our own parser. However, the other parsers produce the output they do because that is how TeX syntax works, so it is hard to imagine that a home-built parser would be significantly easier for us to use.

Goal of v2

  • Construct a correct textlint AST from LaTeX source.
    • We should fix these issues:
      • #84
      • #69
  • Add snapshot tests.
    • By verifying that the actual output matches a snapshot of the expected output, we should be able to add more tests with less effort.
  • Add integration tests if possible.
    • Ensure that rule violations can be detected while using this plugin.
    • Ensure that fixable problems reported by rules can be fixed while using this plugin.
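
A snapshot test boils down to serializing the generated AST deterministically and comparing it with a stored copy. The following is a rough sketch under that assumption. `stableStringify` and `matchesSnapshot` are hypothetical names, and in practice a test framework's built-in snapshot support would probably take their place.

```typescript
// JSON-like value type used to stand in for a generated AST.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Serialize with sorted object keys so that the snapshot text does not
// depend on property insertion order.
function stableStringify(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map((v) => stableStringify(v)).join(",") + "]";
  }
  const entries = Object.entries(value).sort(([a], [b]) =>
    a < b ? -1 : a > b ? 1 : 0
  );
  return (
    "{" +
    entries.map(([k, v]) => JSON.stringify(k) + ":" + stableStringify(v)).join(",") +
    "}"
  );
}

// Hypothetical check: compare the actual AST with a stored snapshot string.
function matchesSnapshot(actual: Json, snapshot: string): boolean {
  return stableStringify(actual) === snapshot;
}
```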

Non-goals

  • Keep AST compatible with v1
  • Enhance unit tests currently implemented

How do we proceed?

As maintainers, we should choose one of the following:

  • Work on v2
  • Continue to improve on v1
  • Declare the project deprecated and archive the repository

Which one do you think is better? @tani @kn1cht

pddg avatar Feb 11 '24 10:02 pddg

Thank you for revisiting our project. I would like to express my gratitude for your thorough survey and valuable comments.

I agree that our project depends on an old version of the library and should be updated to the latest version. I am also happy to hear that I can collaborate with you again. Let's reboot our project and make it better.

In my opinion, we should use the unified-latex library. Even if we stay with latex-utensils, we would need to modify the upstream code to make it work with our project. I think it is time to switch to unified-latex because it is more feature-rich. (As a personal preference, I tend to pick the newer library for a better experience :wink:.)

tani avatar Feb 11 '24 12:02 tani

Hello, I am glad to hear that you are doing well! I also agree with working on v2.

kn1cht avatar Feb 11 '24 13:02 kn1cht

whitespace and parbreak nodes around \item have no location information

This may be resolved by the parseMinimal method of unified-latex. https://github.com/pddg/eval-js-latex-parsers/blob/e31b34348473fdecbc1f628c07d49fb9373c3f8f/unified-latex-minimal/output.json#L726-L740

pddg avatar Feb 12 '24 05:02 pddg

Looking at markdown-to-ast as a reference, it appears that whitespace and similar nodes caused by List nesting are simply ignored. Each ListItem has no whitespace before or after it, and whitespace and line breaks exist only in the raw value of the List. We might simply remove all elements without a position from the AST and then convert the rest.
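
The idea of removing elements without a position could be sketched as follows. `LatexNode` here is a simplified stand-in and not the actual unified-latex node type (assumption: children live in an optional `content` array), and `dropUnpositioned` is a hypothetical helper name.

```typescript
// Simplified stand-in for a parser node; not the real unified-latex type.
interface Position {
  start: { offset: number };
  end: { offset: number };
}

interface LatexNode {
  type: string;
  position?: Position;
  content?: LatexNode[];
}

// Recursively drop every node that carries no position information.
function dropUnpositioned(nodes: LatexNode[]): LatexNode[] {
  return nodes
    .filter((node) => node.position !== undefined)
    .map((node) =>
      node.content
        ? { ...node, content: dropUnpositioned(node.content) }
        : node
    );
}
```

A dropped node takes its children with it, which is acceptable only if position-less nodes are always leaves such as whitespace or parbreak.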

pddg avatar Feb 12 '24 06:02 pddg

I have written a sample that works to some extent, but have found additional problems and reported them to unified-latex. https://github.com/siefkenj/unified-latex/issues/77#issue-2140725952

pddg avatar Feb 18 '24 04:02 pddg

Because LaTeX mixes formatting and text, it is often difficult to express in an AST designed for plain text.

I am currently struggling with the description environment. What kind of textlint AST should it be expressed as?

\begin{description}
  \item[item1] desc
  \item[item2] desc
\end{description}

I have three ideas.

1. Consider headings and their descriptions to be lists.

- item1
  - desc
- item2
  - desc

textlint AST will be as follows:

List
  ListItem
    Paragraph
    List
      ListItem
        Paragraph
  ListItem
    Paragraph
    List
      ListItem
        Paragraph
  • pros
    • None (I think)
  • cons
    • We need to generate list elements that don't actually exist in the source.

2. Consider the heading as a Header and the description as a Paragraph.

- ## item1
  desc
- ## item2
  desc

textlint AST will be as follows:

List
  ListItem
    Header
    Paragraph
  ListItem
    Header
    Paragraph
  • pros
    • Simple
  • cons
    • TxtHeaderNode requires a depth parameter that must be one of 1|2|3|4|5|6, but we don't know which depth should be used here.
      • https://github.com/textlint/textlint/blob/6cef9504c25a7ab771dd45986e4710463c094369/packages/%40textlint/ast-node-types/src/NodeType.ts#L142

3. Consider each heading and its description as independent Paragraphs.

- item1
  desc
- item2
  desc

textlint AST will be as follows:

List
  ListItem
    Paragraph
    Paragraph
  ListItem
    Paragraph
    Paragraph
  • pros
    • The implementation will be simple
  • cons
    • The heading is not actually a paragraph, so some rules may report false positives.
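
To make idea 3 concrete, here is a rough sketch of the conversion. The node shape is deliberately simplified (real textlint nodes also need raw, range, and loc), and `SketchNode`, `DescriptionItem`, and `descriptionToList` are hypothetical names used only for illustration.

```typescript
// Simplified node shape: real TxtNode objects also carry raw, range, and loc.
interface SketchNode {
  type: string;
  value?: string;
  children?: SketchNode[];
}

// Hypothetical intermediate form of one "\item[label] body" entry.
interface DescriptionItem {
  label: string;
  body: string;
}

function paragraph(text: string): SketchNode {
  return { type: "Paragraph", children: [{ type: "Str", value: text }] };
}

// Idea 3: each item becomes a ListItem holding two independent Paragraphs,
// one for the label and one for the description.
function descriptionToList(items: DescriptionItem[]): SketchNode {
  return {
    type: "List",
    children: items.map((item) => ({
      type: "ListItem",
      children: [paragraph(item.label), paragraph(item.body)],
    })),
  };
}
```

For the sample above, `descriptionToList([{ label: "item1", body: "desc" }, { label: "item2", body: "desc" }])` yields a List with two ListItems, each containing two Paragraphs.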

pddg avatar Feb 18 '24 04:02 pddg

Since Re:VIEW also has definition lists, I examined the behavior of textlint-plugin-review. It turned out to output an AST similar to idea 1.

input (test/chunker-test.js)

 : Alpha
    A RISC CPU that was made by DEC.
    Fast at floating-point arithmetic.
 : POWER
    A RISC CPU jointly developed by IBM and Motorola.
    POWER PC is a derivative of it.
 : SPARC
    A RISC CPU made by Sun.
    Good at scaling up the number of CPUs.

output: It seems that the Re:VIEW plugin generates a ListItem element for each header and for each list item. This is close to idea 1; the difference is whether the ListItem hierarchy is flat or nested. Personally, I feel that the flat hierarchy is simpler.

kn1cht avatar Feb 18 '24 09:02 kn1cht

@tani @kn1cht I have created a PR for v2. At least the parsing of simple source code and many of the implemented tests seem to work.

pddg avatar Feb 23 '24 07:02 pddg