node-html-to-text

Table Cells in Row: Tab Separator

Open ajl000 opened this issue 3 years ago • 4 comments

I am trying to use html-to-text as part of a spreadsheet IMPORTHTML function (webix sheets library).

It works really well using browserify.

With tables it would be wonderful if the cells could be separated by a tab character.

Possibly an option could be used, such as selectors: [ { selector: 'table', rowCellSeparator: '\t' } ]

Many thanks for this great project.

ajl000 avatar Dec 22 '21 06:12 ajl000

This was first requested in #98

Am I right that what you are essentially looking for is HTML to CSV/TSV conversion?

If that's the case then the right approach would be to have a separate formatter rather than an option for the default one shipped with html-to-text.

I'll see whether I can include it in version 9. Making a custom formatter on your own is also possible. It will be simpler than the default one but still more complicated than any tags people usually customize.

KillyMXI avatar Dec 22 '21 13:12 KillyMXI

Yes, this is correct: I am using html-to-text for HTML conversion to a JSON array (row-column) for a JavaScript spreadsheet, which is essentially CSV/TSV.

I don't know NodeJS so I cannot customize formatter.js to add a \t before each th/td in a row (non-first). Even every th/td in a row would be fine for my needs.

If you would include it, I would be pleased to sponsor a relatively small amount of USD150 for this feature.

Something possibly like the following would be great.

const {convert} = require('html-to-text');
const vs1 = "<p>Heading</p><table><tr><th>Month</th><th>Savings</th></tr><tr><td>January</td><td>$100</td></tr><tr><td>February</td><td>$80</td></tr></table>"

console.log(convert(vs1, {
 selectors: [ { selector: 'table', format: 'dataTableRowCellSeparator', rowCellSeparator: '\t' } ]
}));

I have tried to look at custom formatters and may not have fully understood your comments above.

As an aside I note that

const {convert} = require('html-to-text');
const vs1 = "<p>Heading</p><table><tr><th>Month</th><th>Savings</th></tr><tr><td>January</td><td>$100</td></tr><tr><td>February</td><td>$80</td></tr></table>"

console.log(convert(vs1, {
 selectors: [ { selector: 'table', format: 'table' } ]
}));

console.log(convert(vs1, {
}));

Seems to output all the words together.

"Heading

MonthSavingsJanuary$100February$80"

ajl000 avatar Dec 23 '21 03:12 ajl000

format: 'table'

This is a legacy format that comes with the tables option (now deprecated). It was the way to select which tables should be rendered as tables before selectors were introduced. Since one of the main purposes of html-to-text is to clean up HTML emails, and many emails use tables for layout, table tags can't be treated as tables by default. Once I remove the tables option, the default format for tables will simply be block.

MonthSavingsJanuary$100February$80

This is because format: 'table' is essentially equivalent to format: 'block' but there is no format specified for rows and cells, so they are interpreted as inline tags. Thanks for bringing this up - it can actually be used to achieve the desired output without a complex table formatter.

{
  wordwrap: false,
  whitespaceCharacters: ' \r\n\f\u200b', // excluded tab character
  formatters: {
    'cellFormatter': function (elem, walk, builder, formatOptions) {
      builder.addInline('\t');
      walk(elem.children, builder);
    }
  },
  selectors: [
    { selector: 'table', format: 'block', options: { leadingLineBreaks: 2, trailingLineBreaks: 2 } },
    { selector: 'tr', format: 'block', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
    { selector: 'th + th', format: 'cellFormatter' },
    { selector: 'th + td', format: 'cellFormatter' },
    { selector: 'td + td', format: 'cellFormatter' }
  ]
}

This should do the job if there is no complex content inside cells.
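For the row-column JSON array mentioned earlier in the thread, the tab-delimited text still has to be split into rows and cells. The following is a minimal sketch, assuming the converted text has one table row per line and tabs between cells. `delimitedTextToRows` is a hypothetical helper, not part of html-to-text, and the sample string is written by hand rather than captured from a real run.

```javascript
// Hypothetical helper (not part of html-to-text): turn delimited text
// into an array of rows, each row being an array of cell strings.
function delimitedTextToRows(text, delimiter = '\t') {
  return text
    .split(/\r?\n/)                   // one entry per output line
    .filter((line) => line !== '')    // drop the blank lines between blocks
    .map((line) => line.split(delimiter));
}

// Hand-written sample in the shape the options above are meant to produce.
const sample = 'Heading\n\nMonth\tSavings\nJanuary\t$100\nFebruary\t$80';
console.log(JSON.stringify(delimitedTextToRows(sample)));
// prints [["Heading"],["Month","Savings"],["January","$100"],["February","$80"]]
```

Text outside the table (the "Heading" paragraph here) ends up as a single-cell row, so a caller may want to filter such rows out. Cells that contain line breaks or tabs of their own would break this simple split, which matches the "no complex content" caveat.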


When I get to a dedicated (and more robust) formatter implementation, I think I'll call it delimitedTable, to match what seems to be the umbrella term: delimiter-separated values.

With the workaround figured out, I don't think I'll make an 8.2.0 release for this. Version 9 is still a few months away, and there are a couple of big issues to address.

KillyMXI avatar Dec 23 '21 12:12 KillyMXI

I haven't included delimitedTable among the new default formatters in version 9, but one improvement there helps simplify the example above: the builder.addLiteral function. It is made for markup elements and bypasses whitespace processing, so there is no need to alter whitespaceCharacters.

{
  wordwrap: false,
  formatters: {
    'cellFormatter': function (elem, walk, builder, formatOptions) {
      builder.addLiteral('\t');
      walk(elem.children, builder);
    }
  },
  selectors: [
    { selector: 'table', format: 'block', options: { leadingLineBreaks: 2, trailingLineBreaks: 2 } },
    { selector: 'tr', format: 'block', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
    { selector: 'th + th', format: 'cellFormatter' },
    { selector: 'th + td', format: 'cellFormatter' },
    { selector: 'td + td', format: 'cellFormatter' }
  ]
}
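Since the eventual goal is delimiter-separated values in general, the options above can be wrapped in a small factory that takes the delimiter as a parameter. This is only a sketch under stated assumptions: `makeDelimitedTableOptions` and `htmlToDelimitedText` are hypothetical names, not part of html-to-text, and the wrapper assumes version 9 or later is installed so that builder.addLiteral is available.

```javascript
// Hypothetical factory (not part of html-to-text): builds the options
// object shown above for an arbitrary cell delimiter.
function makeDelimitedTableOptions(delimiter = '\t') {
  return {
    wordwrap: false,
    formatters: {
      // Write the delimiter verbatim, then render the cell's own content.
      cellFormatter(elem, walk, builder, formatOptions) {
        builder.addLiteral(delimiter);
        walk(elem.children, builder);
      }
    },
    selectors: [
      { selector: 'table', format: 'block', options: { leadingLineBreaks: 2, trailingLineBreaks: 2 } },
      { selector: 'tr', format: 'block', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
      // Only cells that follow another cell get a delimiter in front.
      { selector: 'th + th', format: 'cellFormatter' },
      { selector: 'th + td', format: 'cellFormatter' },
      { selector: 'td + td', format: 'cellFormatter' }
    ]
  };
}

// Hypothetical wrapper; assumes html-to-text version 9 or later.
// The package is loaded inside the function so that the factory above
// can be inspected without it.
function htmlToDelimitedText(html, delimiter = '\t') {
  const { convert } = require('html-to-text');
  return convert(html, makeDelimitedTableOptions(delimiter));
}
```

Calling htmlToDelimitedText(vs1) with the sample HTML from earlier in the thread should then give tab-separated rows. Passing ',' gives a naive comma-separated variant with no quoting or escaping, so cell text that itself contains commas would need extra handling.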

KillyMXI avatar Dec 19 '22 15:12 KillyMXI