📣 Plans for V2

Open JohannesKaufmann opened this issue 1 year ago • 8 comments

The V2 of the library is in the works. It is a rewrite from the ground up — even more accurate than the current version.

Some new features:

  • Nested lists: More edge cases around (deeply) nested lists are supported
  • Smart escaping: Only escape characters if they would be mistaken for markdown syntax
  • ...
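To illustrate what "smart escaping" could mean, here is a minimal sketch. It is not the V2 implementation: the function name `escapeLine` and the set of patterns are assumptions made for this example, and it only looks at the start of a line (headings and list markers), which is just a small subset of Markdown syntax.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Patterns that would turn a plain-text line into a Markdown block.
// Hypothetical subset chosen for this sketch.
var (
	heading = regexp.MustCompile(`^#{1,6}(\s|$)`)
	bullet  = regexp.MustCompile(`^[-+*]\s`)
	ordered = regexp.MustCompile(`^\d{1,9}[.)]\s`)
)

// escapeLine adds a backslash only when the line would otherwise be
// parsed as a heading or a list item. Everything else is left alone.
func escapeLine(line string) string {
	switch {
	case heading.MatchString(line), bullet.MatchString(line):
		return `\` + line
	case ordered.MatchString(line):
		// Escape the delimiter after the number: "1. x" becomes "1\. x".
		i := strings.IndexAny(line, ".)")
		return line[:i] + `\` + line[i:]
	}
	return line
}

func main() {
	for _, l := range []string{"# not a heading", "issue #42", "1. first", "version 1.5"} {
		fmt.Println(escapeLine(l))
	}
}
```

A character like `#` in the middle of a sentence stays unescaped, because it cannot start a heading there.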

➡️ What are some things that you would want to see? How could the API be improved? What currently annoys you?

JohannesKaufmann avatar Mar 18 '23 13:03 JohannesKaufmann

I'm building a text-only browser using this and playwright, and I'd love some kind of support for forms (showing inputs and names / ids so that the correct element can be later "clicked").

benmyles avatar Apr 09 '23 18:04 benmyles

@benmyles sounds cool!

That is probably not something I would add to the core for now. But it can be implemented with some rules & then packaged as a plugin.
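As a rough idea of what such a rule could return, here is a library-independent sketch. The `Field` type, the `renderField` function and the bracket format are all hypothetical. A rule for `input` elements would read the attributes from the element and return a string like this one.

```go
package main

import (
	"fmt"
	"strings"
)

// Field holds the attributes of a form element that matter for
// finding it again later (hypothetical type for this sketch).
type Field struct {
	Tag  string
	Type string
	Name string
	ID   string
}

// renderField prints the element as a bracketed placeholder and
// skips attributes that are empty.
func renderField(f Field) string {
	var b strings.Builder
	b.WriteString("[" + f.Tag)
	for _, attr := range []struct{ key, val string }{
		{"type", f.Type}, {"name", f.Name}, {"id", f.ID},
	} {
		if attr.val != "" {
			fmt.Fprintf(&b, " %s=%q", attr.key, attr.val)
		}
	}
	b.WriteString("]")
	return b.String()
}

func main() {
	fmt.Println(renderField(Field{Tag: "input", Type: "text", Name: "q", ID: "search"}))
}
```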

Let me know if you have some questions while doing this...

JohannesKaufmann avatar Apr 16 '23 16:04 JohannesKaufmann

What is the current stage of implementation? Is it possible to upload code to the dev branch?

ilovesusu avatar Nov 19 '23 05:11 ilovesusu

Hi @ilovesusu,

the V2 already works quite well. But it is still a prototype and there is still a lot of manual testing needed.

The reason I haven't made it public yet: The public API is still undecided. I don't know if I want to keep the Options or use a more functional options pattern.
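For readers who don't know the two styles, here is a small side-by-side sketch. All names (`Converter`, `Options`, `WithBulletListMarker`, and so on) are made up for this example and are not the V2 API.

```go
package main

import "fmt"

// Options is the "options struct" style: callers fill in a struct.
type Options struct {
	BulletListMarker string
}

// Converter is a stand-in type for this sketch.
type Converter struct {
	opts Options
}

// NewConverterWithOptions takes the struct directly. Zero values
// have to be interpreted as "use the default".
func NewConverterWithOptions(o Options) *Converter {
	if o.BulletListMarker == "" {
		o.BulletListMarker = "-"
	}
	return &Converter{opts: o}
}

// Option is the "functional options" style: each option is a
// function that modifies the converter.
type Option func(*Converter)

// WithBulletListMarker overrides the default list marker.
func WithBulletListMarker(marker string) Option {
	return func(c *Converter) { c.opts.BulletListMarker = marker }
}

// NewConverter sets the defaults first and then applies the options.
func NewConverter(options ...Option) *Converter {
	c := &Converter{opts: Options{BulletListMarker: "-"}}
	for _, apply := range options {
		apply(c)
	}
	return c
}

func main() {
	a := NewConverterWithOptions(Options{BulletListMarker: "*"})
	b := NewConverter(WithBulletListMarker("*"))
	fmt.Println(a.opts.BulletListMarker, b.opts.BulletListMarker)
}
```

With functional options, new settings can be added later as new `With...` functions without changing the signature of the constructor. The struct style is easier to discover, since all fields are in one place.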

And once I release an early version, other people would use it — and depend on it. Then every change is going to break existing programs 🤷‍♂️

JohannesKaufmann avatar Nov 19 '23 16:11 JohannesKaufmann

I think it's possible to make the project public and use dev branches so that people can collaborate on contributions. It could then be stabilized before releasing the full version.

ilovesusu avatar Apr 12 '24 06:04 ilovesusu

Hi, nice topic. I noticed that when a table has a nested table inside it, it doesn't convert properly. It would be perfect if the package handled this situation.

guvenaltunsoyy avatar Jul 24 '24 08:07 guvenaltunsoyy

when a table has nested table inside itself, it doesn't convert properly

@guvenaltunsoyy can you share an HTML example?

Unfortunately GitHub Flavored Markdown does not support nested tables. So one of the two tables would have to be replaced with something else. That is possible, but the decision which to keep is difficult.

I will think a bit more about this... 🤔

JohannesKaufmann avatar Jul 24 '24 20:07 JohannesKaufmann

Sure. I solved it in my case by adding a new table rule. With this solution the content of the table is not converted, though. That was okay for me, but it would be perfect if the library handled this.

The rule I added:

// HasInnerTable and ReplaceTableWithMarkdown are helper functions that are not shown here.
mdParser.AddRules(md.Rule{
	Filter: []string{"table"},
	Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
		tableHtml, _ := goquery.OuterHtml(selec)
		var innerTables []string
		if HasInnerTable(tableHtml) {
			selec.Find("table").Each(func(i int, s *goquery.Selection) {
				innerTableAsHtml, _ := goquery.OuterHtml(s)
				innerTableAsMD := ReplaceTableWithMarkdown(innerTableAsHtml)
				innerTables = append(innerTables, innerTableAsMD)
				// Replace the inner table with a placeholder, because a nested table breaks the MD format.
				// The converted inner table is appended after the main table below.
				s.ReplaceWithHtml("inner-table-" + strconv.Itoa(i))
			})
		}

		// Re-read the outer table now that the inner tables are placeholders.
		tableHtml, _ = goquery.OuterHtml(selec)
		tableHtml = ReplaceTableWithMarkdown(tableHtml)
		for i, table := range innerTables {
			tableHtml += "\ninner-table-" + strconv.Itoa(i) + table
		}
		return &tableHtml
	},
})

Example table:

<div class=\"table-wrap\">
  <table class=\"relative-table wrapped confluenceTable\" style=\"width: 100.0%;\">
    <colgroup class=\"\">
      <col class=\"\" style=\"width: 4.84443%;\" />
      <col class=\"\" style=\"width: 95.1687%;\" />
    </colgroup>
    <tbody class=\"\">
      <tr class=\"\">
        <td class=\"confluenceTd\"><strong>Meeting Notes / Key Points / Open Points</strong></td>
        <td class=\"confluenceTd\">
          <div class=\"content-wrapper\">
            <div class=\"table-wrap\">
              <table class=\"wrapped confluenceTable\" data-mce-resize=\"false\">
                <tbody class=\"\">
                  <tr class=\"\">
                    <th class=\"confluenceTh\">Tarih</th>
                    <th class=\"confluenceTh\">Key Points</th>
                    <th class=\"confluenceTh\">Open Points</th>
                  </tr>
                  <tr class=\"\">
                    <td class=\"confluenceTd\"><strong>30 Kasım 2023</strong></td>
                    <td class=\"confluenceTd\">
                      <ul style=\"list-style-type: square;\">
                        <li>-</li>
                        <li>-</li>
                        <li>-</li>
                      </ul>
                    </td>
                    <td class=\"confluenceTd\">
                      <ul style=\"list-style-type: square;\">
                        <li>-</li>
                      </ul>
                      blabal
                    </td>
                  </tr>
                  <tr>
                    <td class=\"confluenceTd\"><strong>6 Kasım 2023 - Notları</strong></td>
                    <td class=\"confluenceTd\">
                      <p>Indexe </p>
                      <p>test</p>
                    </td>
                    <td class=\"confluenceTd\">
                      <ul style=\"list-style-type: square;\">
                        <li>item1.</li>
                        <li>1</li>
                      </ul>
                    </td>
                  </tr>
                </tbody>
              </table>
            </div>
          </div>
        </td>
      </tr>
    </tbody>
  </table>
</div>
<p><br /></p>
<p>--</p>
<p><br /></p>
<div class=\"table-wrap\">
  <table class=\"relative-table wrapped confluenceTable\" style=\"width: 97.5917%;\">
    <colgroup>
      <col style=\"width: 11.5899%;\" />
      <col style=\"width: 22.6782%;\" />
      <col style=\"width: 65.7502%;\" />
    </colgroup>
    <tbody>
      <tr>
        <th colspan=\"3\" scope=\"colgroup\" class=\"confluenceTh\">
          <h1>lucid</h>
        </th>
      </tr>
      <tr>
        <th scope=\"col\" class=\"confluenceTh\">why?</th>
        <th scope=\"col\" class=\"confluenceTh\">External Source</th>
        <th scope=\"col\" class=\"confluenceTh\"><br /></th>
      </tr>
      <tr>
        <td class=\"confluenceTd\">Brand</td>
        <td class=\"confluenceTd\"><span><span style=\"color: rgb(58,65,74);\">some header</td>
        <td class=\"confluenceTd\">
          <div class=\"content-wrapper\">
        some data
           
          </div>
        </td>
      </tr>
      <tr>
        <td class=\"confluenceTd\">Business Unit</td>
        <td class=\"confluenceTd\"><span><span style=\"color: rgb(58,65,74);\">Pim<br /></span></span></td>
        <td class=\"confluenceTd\">
          some data
        </td>
      </tr>
      <tr>
        <td class=\"confluenceTd\">Category</td>
        <td class=\"confluenceTd\"><span><span style=\"color: rgb(58,65,74);\">Pim<br /></span></span></td>
        <td class=\"confluenceTd\">
          <div class=\"content-wrapper\">
      pim 11
          </div>
        </td>
      </tr>
      <tr>
        <td class=\"confluenceTd\">Attribute Cache</td>
        <td class=\"confluenceTd\"><span><span style=\"color: rgb(58,65,74);\">Pim Elastic<br /></span></span></td>
        <td class=\"confluenceTd\">
          <div class=\"content-wrapper\">
            <p><strong>attribute-lookup </strong> data</p>
            <p>cache</p>
        
          </div>
        </td>
      </tr>
      <tr>
        <td class=\"confluenceTd\">Category Cache</td>
        <td class=\"confluenceTd\"><span><span style=\"color: rgb(58,65,74);\">Pim Elastic<br />test</span></span></td>
        <td class=\"confluenceTd\">
          <div class=\"content-wrapper\">
            <p>category</p>
          </div>
        </td>
      </tr>
      <tr>
        <td class=\"confluenceTd\">
          <p>Category Brand Chart Cache</p>
        </td>
        <td class=\"confluenceTd\">
          <p><span><span style=\"color: rgb(58,65,74);\">Pim</span></span></p>
          <p><span><span style=\"color: rgb(58,65,74);\">test</span></span></p>
          <p><strong><span style=\"color: rgb(58,65,74);\">Future: (Maybe)</span></strong></p>
        </td>
        <td class=\"confluenceTd\">
          <div class=\"content-wrapper\">
            <p><strong>category-brand-chart-lookup </strong> </p>
          </div>
        </td>
      </tr>
      <tr>
        <td class=\"confluenceTd\">
          <p>Category Section Rules Cache</p>
        </td>
        <td class=\"confluenceTd\">
          <p><span><span style=\"color: rgb(58,65,74);\">Pim</span></span></p>
          <p><span><span style=\"color: rgb(58,65,74);\">test</span></span><span>test</span></p>
          <p><br /></p>
         
        </td>
        <td class=\"confluenceTd\">
          <div class=\"content-wrapper\">
            <p><strong>category-section-lookup </strong> </p>
          </div>
        </td>
      </tr>
      <tr>
        <td class=\"confluenceTd\">
          <p>Content Blacklist Cache</p>
        </td>
        <td class=\"confluenceTd\">
          <p><span><span style=\"color: rgb(58,65,74);\">International Elastic</span></span></p>
        
        </td>
        <td class=\"confluenceTd\">
          <div class=\"content-wrapper\">
            test
         
          </div>
        </td>
      </tr>
      <tr>
        <td class=\"confluenceTd\">
          <p>Legal Rule Cache </p>
        </td>
        <td class=\"confluenceTd\">
          <div class=\"content-wrapper\">
            test
          </div>
          <p><strong><span style=\"color: rgb(58,65,74);\">Example:</span></strong></p>
          <p><span style=\"color: rgb(58,65,74);\">Category</span></p>
        </td>
        <td class=\"confluenceTd\">
          <div class=\"content-wrapper\">
            <p>cache</p>
          </div>
  test

        </td>
      </tr>
      <tr>
        <td class=\"confluenceTd\">
          <p>Storefront Cache Service</p>
        </td>
        <td class=\"confluenceTd\">
          <div class=\"content-wrapper\">
            <p>Example:</p>
          </div>
          blabla
        </td>
        <td class=\"confluenceTd\">
          <div class=\"content-wrapper\">
            <p>example</p>
          </div>
        </td>
      </tr>
      <tr>
        <td class=\"confluenceTd\">
          test
        </td>
        <td class=\"confluenceTd\">
          test3
        </td>
        <td class=\"confluenceTd\">
          some code
        </td>
      </tr>
    </tbody>
  </table>
</div>

guvenaltunsoyy avatar Jul 26 '24 09:07 guvenaltunsoyy