better-word-count

Markdown-oriented word counts?

Open jon-heard opened this issue 2 years ago • 6 comments

My documents often contain a fair bit of markdown syntax, such as dash bullets, which currently inflates my word count by a good amount even though it isn't displayed. Would it be possible to count as words only those non-whitespace character strings that contain at least one alphanumeric character? Thanks for your work on this!

jon-heard avatar Nov 11 '22 08:11 jon-heard

@lukeleppan For a different use case, I once wrote a function that strips markdown syntax before counting words, which you might snatch for this issue? (Not making a PR, since I read that you are reorganizing the code right now.)

function removeMarkdown (text, { excludeComments = false, excludeTasks = false } = {}) {
	// Comments are handled first: the syntax pass below strips "> ",
	// which would otherwise break a "--> " comment terminator.
	if (excludeComments) {
		text = text
			.replace(/<!--.*?-->/sg, "") // s-flag for comments spanning multiple lines
			.replace(/%%.*?%%/sg, "");
	}
	else text = text.replace(/%%|<!--|-->/g, ""); // remove only comment syntax

	text = text
		.replace(/`\$?=[^`]+`/g, "") // inline dataview
		.replace(/^---\n.*?\n---\n/s, "") // YAML Header
		.replace(/!?\[(.+?)\]\(.+?\)/g, "$1") // URLs & Image Captions (lazy, so several links on one line stay separate)
		.replace(/\*|_|\[\[|\]\]|\||==|~~|---|#|> |`/g, ""); // Markdown Syntax

	if (excludeTasks) text = text.replace(/^\s*- \[[x ]\] .*$/gm, ""); // drop whole task lines
	else text = text.replace(/^\s*- \[[x ]\] /gm, ""); // drop only the checkbox

	return text;
}
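
A rough sketch of how stripping and counting could fit together, not code from the plugin. `stripSyntax` is a hypothetical, cut-down stand-in for the function above (links and inline markers only), and `countWords` is an assumed counter that only accepts tokens containing an alphanumeric character.

```javascript
// Hypothetical, cut-down stand-in for removeMarkdown: links and inline markers only.
function stripSyntax(text) {
	return text
		.replace(/!?\[(.+?)\]\(.+?\)/g, "$1") // keep link text or image caption, drop the URL
		.replace(/\*|_|\[\[|\]\]|==|~~|#|`/g, ""); // drop inline markdown markers
}

// Assumed counter (not the plugin's): whitespace-separated tokens
// that contain at least one ASCII letter or digit.
function countWords(text) {
	return text.split(/\s+/).filter((token) => /[a-zA-Z0-9]/.test(token)).length;
}

const note = "# Title\n\n- **bold** item\n- [a link](https://example.com/some path)";
const rawCount = countWords(note);
const strippedCount = countWords(stripSyntax(note));
console.log({ rawCount, strippedCount });
```

The URL in the raw text contributes tokens of its own, which is why stripping before counting changes the result even with an alphanumeric-only counter.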

chrisgrieser avatar Nov 15 '22 10:11 chrisgrieser

Thank you @chrisgrieser. Very much appreciated. Like I said in issue #55, you may want to create a PR when I release the rewrite next week; otherwise, I will implement it myself (still crediting @jon-heard and yourself).

My one concern with options like this (e.g. don't count comments) has to do with statistics collection. The way the stats are collected and stored will change many times, but currently the total count and the per-day counts are stored. What I haven't figured out is how stats will be collected for people who have these settings enabled. My current thinking is that status bar counts and stored counts will have separate versions of these options. However, that still doesn't fix the issue that would occur if the storage settings are changed: when, say, the total-words-over-time graph is viewed, it will show a sudden drop. So maybe storage counts are kept to defaults. Not sure.

Sorry for the ramble.

As a side note, in the future there should probably be options to choose which stats are stored, so if you wanted to view an increase in citations over time, you could enable that at a certain storage cost, I guess.

lukeleppan avatar Nov 15 '22 10:11 lukeleppan

Very nice @chrisgrieser! I ended up writing something that worked for me, but what you have is way more comprehensive.

For posterity, here's what I ended up writing and using. It only accepts words that contain at least one alphanumeric character.

// word separators - whitespace (space, tab, newline), forward-slash, comma
static REGEX_SEPARATORS = "\\s\\/,";
// word - 0 or more non-separators, an alphanumeric, 0 or more non-separators
static REGEX_WORDS = new RegExp(`[^${this.REGEX_SEPARATORS}]*[a-zA-Z0-9][^${this.REGEX_SEPARATORS}]*`, "g");

static updateStatusBar(text)
{
	// text may be null or undefined, so guard both counts
	const words = text?.match(this.REGEX_WORDS)?.length ?? 0;
	const chars = text?.length ?? 0;
	this.statusBar.displayText(`${words} words ${chars} characters`);
}
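
For anyone wanting to try the rule outside the plugin, here is a standalone sketch of the same regex. The names `SEPARATORS`, `WORDS` and `countWords` are made up for this example, and there is no status bar involved.

```javascript
// Same separator set as above: whitespace, forward-slash, comma.
const SEPARATORS = "\\s\\/,";
// A word: non-separators around at least one ASCII letter or digit.
const WORDS = new RegExp(`[^${SEPARATORS}]*[a-zA-Z0-9][^${SEPARATORS}]*`, "g");

// Hypothetical helper: returns 0 for null, undefined, or text without words.
function countWords(text) {
	return text?.match(WORDS)?.length ?? 0;
}

const bullets = countWords("- first item\n- second/third, **bold**");
const rulesOnly = countWords("--- *** ###");
const nothing = countWords(null);
console.log({ bullets, rulesOnly, nothing });
```

Bare bullet dashes and horizontal rules contain no alphanumeric character, so they are skipped. Markers attached to a word (like `**bold**`) do not add to the count, since the whole token is still one match.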

jon-heard avatar Nov 15 '22 16:11 jon-heard

@lukeleppan. When you say you are concerned about "statistic collection", are you referring to an issue where having multiple formula choices (all vs non-markdown) opens the possibility for a different formula to be used for stored value collection than for real-time display?

As a user, I'd really expect that the stored values would be collected using the same formula as the dynamic values, i.e. whatever formula is chosen in the settings. This does open the possibility that the user switches formulas multiple times during the data collection, meaning that the collected data is mixed-formula.

At first glance, I see two reactions to this issue:

  1. Let the user deal with the fallout from the stored data being mixed-formula. It's their choice to change the formula, and it's up to them whether doing so is worth the stored data becoming mixed. Perhaps a warning next to the setting would be useful.
  2. Collect data using all formulas, but maybe only show the data for the formula that is currently selected. This would be easiest on the user, but would at least double the data storage. Is the data storage big enough to warrant concern about this?
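
To make the storage cost of the second reaction concrete, here is a minimal sketch that assumes per-day counts are kept in a plain object keyed by date. The `formulas` table, `recordDay` and `seriesFor` are hypothetical names, not the plugin's storage format.

```javascript
// Hypothetical formulas keyed by name; neither is the plugin's actual counter.
const formulas = {
	all: (text) => text.split(/\s+/).filter(Boolean).length,
	noMarkdown: (text) => (text.match(/[^\s\/,]*[a-zA-Z0-9][^\s\/,]*/g) ?? []).length,
};

// Record the day's count under every formula, so switching formulas later
// never mixes data (cost: one stored number per formula per day).
function recordDay(history, date, text) {
	history[date] = Object.fromEntries(
		Object.entries(formulas).map(([name, count]) => [name, count(text)])
	);
	return history;
}

// Read back a single-formula series for display.
function seriesFor(history, formulaName) {
	return Object.entries(history).map(([date, counts]) => [date, counts[formulaName]]);
}

const history = {};
recordDay(history, "2022-11-18", "- one\n- two");
recordDay(history, "2022-11-19", "- one\n- two\n- three");
console.log(seriesFor(history, "noMarkdown"));
```

In this shape the stored data grows linearly with the number of formulas, which is the "at least double" cost mentioned above when there are two of them.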

jon-heard avatar Nov 19 '22 17:11 jon-heard

Thanks @jon-heard for the feedback. That is what I'm talking about. I don't believe that collecting all the different formulas as stats each day would be feasible. So the way I see it, there are 2 options.

  1. The "formula" chosen in settings is what's stored and a warning is added explaining how issues can arise.

  2. Or, there is a second set of settings that allows them to change the "formula" that is stored. However, this would still require a warning and may be redundant, because most people would choose the same "formula".
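
A small sketch of how the two options could share one settings shape. Everything here is hypothetical (the field names `displayFormula` and `storageFormula`, the helpers and the warning text); it only illustrates that option 1 is the special case where no separate storage formula is set.

```javascript
// Hypothetical settings shape; the field names are illustrative only.
const defaultSettings = {
	displayFormula: "all", // formula used for the status bar
	storageFormula: null, // null means "same as displayFormula" (option 1)
};

// Option 2 overrides the stored formula; otherwise fall back to the display one.
function resolveStorageFormula(settings) {
	return settings.storageFormula ?? settings.displayFormula;
}

// Returns a warning string when a settings change would make stored history
// mixed-formula, or null when the stored formula stays the same.
function storageChangeWarning(oldSettings, newSettings) {
	const before = resolveStorageFormula(oldSettings);
	const after = resolveStorageFormula(newSettings);
	if (before === after) return null;
	return `Stored counts will switch from "${before}" to "${after}"; graphs over time may show a sudden jump or drop.`;
}

const warning = storageChangeWarning(defaultSettings, {
	...defaultSettings,
	displayFormula: "noMarkdown",
});
console.log(warning);
```

With a separate `storageFormula` set, changing only the display formula produces no warning, which is the main thing the second set of settings would buy.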

Thank you again, this helps.

lukeleppan avatar Nov 19 '22 21:11 lukeleppan

Just wanted to chime in with a modified version of jon-heard's approach that should work for all languages:

const pattern = /[^\p{Z}\s]*[\p{L}\p{N}][^\p{Z}\s]*/gu;
export function getWordCount(text: string): number {
  return (text.match(pattern) || []).length;
}

Specifically, this uses Unicode property matching to simplify things (see https://javascript.info/regexp-unicode)

[^\p{Z}\s]* -> Matches 0 or more of anything that is not a separator or whitespace (e.g. space, tab, newline). \s is needed alongside \p{Z} because tab and newline are control characters, not members of the Unicode separator category.
[\p{L}\p{N}] -> Matches anything that is a letter or a number
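
A quick way to sanity-check the Unicode approach, written as plain JavaScript so it runs without a TypeScript build. This is a sketch, not the plugin's code: the variant below excludes `\s` in addition to `\p{Z}`, since tab and newline are control characters rather than members of the separator category.

```javascript
// Word: non-separator, non-whitespace characters around at least one letter or number.
const pattern = /[^\p{Z}\s]*[\p{L}\p{N}][^\p{Z}\s]*/gu;

function getWordCount(text) {
	return (text.match(pattern) || []).length;
}

const cyrillic = getWordCount("- привет мир");
const mixedWhitespace = getWordCount("one\ntwo\tthree");
const syntaxOnly = getWordCount("--- *** ###");
const unspaced = getWordCount("你好世界");
console.log({ cyrillic, mixedWhitespace, syntaxOnly, unspaced });
```

One limitation to keep in mind: scripts that do not put spaces between words (the last sample) are counted as one word per unbroken run, so this rule alone is probably not enough for those languages.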

mgmeyers avatar Jul 16 '23 17:07 mgmeyers