Number of characters / number of words

Open parisni opened this issue 5 years ago • 3 comments

Hello,

Interesting information would be:

  • number of characters in the resulting format (html, odm, pdf)
  • number of words in the resulting format

Right now, the character count includes every single Markdown markup character.

parisni avatar Nov 30 '19 11:11 parisni

I once tried to build that, but it turned out to be a nontrivial problem.

There are things like punctuation and emoticons that should be ignored. One might think a simple regex would do the job, but it doesn't. We live in a UTF-8 world that knows more languages than English and French. We support CJK languages, where a word count is anything but easy. I don't even have an idea how to start a word count there.
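To illustrate the regex problem, here is a minimal Python sketch (not CodiMD code; the function names and the one-character-per-word rule for CJK are assumptions made for this example). A Unicode-aware `\w+` skips punctuation and emoji, but it sees an unspaced run of Chinese or Japanese text as a single word.

```python
import re

def naive_word_count(text: str) -> int:
    """Count runs of Unicode word characters; punctuation and emoji are skipped."""
    return len(re.findall(r"\w+", text))

# Han ideographs, Hiragana and Katakana (basic blocks only): scripts that are
# written without spaces between words.
_CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff]")

def heuristic_word_count(text: str) -> int:
    """Assumption: every CJK character counts as one word; the rest is counted by runs."""
    cjk_chars = len(_CJK.findall(text))
    rest = _CJK.sub(" ", text)
    return cjk_chars + len(re.findall(r"\w+", rest))

print(naive_word_count("我喜欢写作"))      # the whole sentence is one run
print(heuristic_word_count("我喜欢写作"))  # one count per character
```

Counting each CJK character as a word is only a rough convention. Real segmentation needs a dictionary or a dedicated library, which is exactly the trade-off described above.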

Also, we might have embedded source code. Do we count those words, or do we remove all raw tags? Then again, some people have a writing style that keeps some words in code style. Do we simply ignore them?
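The code question can be made concrete with a small Python sketch. It is only one possible policy, not how CodiMD handles it: fenced blocks are always dropped, and a hypothetical `keep_inline_code` flag decides whether words in inline code style are counted. The regexes are simplified and do not cover every Markdown edge case.

```python
import re

# A fenced block: a line starting with three backticks up to the next such line.
FENCED = re.compile(r"^`{3}.*?^`{3}[ \t]*$", re.DOTALL | re.MULTILINE)
# Inline code: text between single backticks on one line.
INLINE = re.compile(r"`([^`\n]+)`")

def count_words(markdown: str, keep_inline_code: bool = True) -> int:
    """Count words in Markdown source, ignoring fenced code blocks."""
    text = FENCED.sub(" ", markdown)
    if keep_inline_code:
        text = INLINE.sub(r"\1", text)  # unwrap, so the words are counted
    else:
        text = INLINE.sub(" ", text)    # drop inline code entirely
    return len(re.findall(r"\w+", text))
```

Both settings are defensible, which is why the resulting number depends on a decision rather than on a single correct answer.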

And when people use other HTML notations to change something or hide words, how do we handle that?

Yes, the DOM API provides a text-only version of the content, but it makes it impossible to figure out a lot of other things (like what was source code) and therefore, sadly, renders the word count pretty useless.

I'm not saying that it's an unsolvable problem, but it involves a lot of trade-offs and decisions that make it hard to consider the result even remotely correct.



SISheogorath avatar Dec 03 '19 00:12 SISheogorath

For the word count, this library looks quite complete with regard to recognizing words in UTF-8 text: https://www.npmjs.com/package/words-count

ErikMichelson avatar Oct 28 '20 00:10 ErikMichelson

The request concerns the word and character count in the result. Currently that means only HTML, because there is no PDF export at the moment. This could be done with the HTML output, but we would need something that extracts the correct word and character count from an HTML document.
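As a rough idea of what such an extractor could look like, here is a Python sketch built on the standard library's `html.parser`. The class name, the set of skipped tags and the list of block tags are assumptions for this example. It skips `<pre>`, `<script>` and `<style>` content and counts the remaining visible text.

```python
import re
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text outside of code, script and style elements."""
    SKIP = {"pre", "script", "style"}
    BLOCK = {"p", "div", "li", "br", "tr", "td", "th", "blockquote",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")  # keep adjacent blocks from merging

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def count_rendered(html: str) -> tuple[int, int]:
    """Return (words, characters); characters include single spaces between words."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())  # collapse whitespace
    return len(re.findall(r"\w+", text)), len(text)
```

Whether code blocks should be skipped, and whether spaces belong in the character count, are open choices here, not settled behavior.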

In the meantime, we could use the lib @ErikMichelson suggested to add a word count to the info in the status bar below the editor. Between lines and length (aka characters) would be a neat place.
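A status bar line of that shape could be computed from the raw editor content roughly as follows. This is a Python sketch for illustration only; the label format is hypothetical, and the simple `\w+` count stands in for whatever library ends up being used.

```python
import re

def status_bar(source: str) -> str:
    """Build a 'lines, words, length' summary from the raw Markdown source."""
    lines = source.count("\n") + 1
    words = len(re.findall(r"\w+", source))
    return f"Lines: {lines} | Words: {words} | Length: {len(source)}"

print(status_bar("# Title\n\nHello world"))
```

Note that the length here still includes the Markdown markup, so it only addresses the word count and not the original request about the rendered result.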

DerMolly avatar Oct 30 '20 18:10 DerMolly

This has been implemented.

ErikMichelson avatar Nov 22 '22 20:11 ErikMichelson