
Complete rewrite without RegExp

Virtual-X opened this issue 3 years ago · 6 comments

Somewhat inspired by this: https://gist.github.com/WizKid/1170297, I have completely rewritten the minifier to achieve better performance. It passes the given tests and returns the same result as upstream for all valid JSON that I have tested, except that it fixes this issue: https://github.com/getify/JSON.minify/issues/71 . Invalid JSON may produce different results; e.g., I have seen upstream tolerate unescaped quotes inside strings, while my code toggles in_comment. Another note: the upstream code removes newlines from inside strings. Is that intended? I think it is wrong, but I did the same to avoid introducing another difference, which should be addressed separately. If it needs to be fixed, the test at line 50 can simply be removed.
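
For illustration, the core of a regex-free, single-pass minifier along these lines might look like the sketch below. This shows the general technique, not the exact code in my branch: the state names are mine, it keeps raw newlines inside strings (unlike the upstream behaviour noted above), and it appends to the output one character at a time, whereas a real implementation should copy larger slices for speed.

	// Sketch of a regex-free, single-pass minifier (illustrative, not the PR code).
	function minify(json) {
		const len = json.length;
		let out = "";
		let i = 0;
		let inString = false, inLineComment = false, inBlockComment = false;
		while (i < len) {
			const c = json[i];
			if (inString) {
				out += c;
				if (c === "\\") out += json[++i] ?? ""; // copy escape pairs verbatim
				else if (c === "\"") inString = false;
				i++;
			} else if (inLineComment) {
				if (c === "\n" || c === "\r") inLineComment = false;
				i++;
			} else if (inBlockComment) {
				if (c === "*" && json[i + 1] === "/") { inBlockComment = false; i++; }
				i++;
			} else if (c === "/" && json[i + 1] === "/") {
				inLineComment = true; i += 2;
			} else if (c === "/" && json[i + 1] === "*") {
				inBlockComment = true; i += 2;
			} else if (c === " " || c === "\t" || c === "\n" || c === "\r") {
				i++; // drop insignificant whitespace outside strings
			} else {
				if (c === "\"") inString = true;
				out += c;
				i++;
			}
		}
		return out;
	}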

Performance:

  1. wireguard.txt mentioned here: https://github.com/getify/JSON.minify/pull/52
     source length 154741, minified length 154741
     upstream time 18 ms = 8502 chars/ms
     forked time 6 ms = 25367 chars/ms

  2. https://github.com/json-iterator/test-data/blob/master/large-file.json
     source length 26141343, minified length 26129992
     upstream time 1164 ms = 22454 chars/ms
     forked time 562 ms = 46490 chars/ms

  3. https://github.com/cytoscape/cytoscape.js/blob/unstable/documentation/demos/cose-layout/data.json
     source length 259660, minified length 186042
     upstream time 36 ms = 7213 chars/ms
     forked time 10 ms = 27048 chars/ms

22 MB/sec is not bad, but if you can do 45, I'd say why not? The code is also probably easier to port to other languages, since it does not need a regex implementation, although other languages may have more efficient ways to build large strings.

Virtual-X · May 12 '22 13:05

I appreciate the contribution. It's something to consider.

FWIW, the reason I've pushed back on various "complete rewrites" in the past is:

  1. RegEx should, if written correctly, be faster in most cases where you're finding occurrences of a small subset of chars/strings in a larger string. The regex engine should be more than capable of doing that logic at native speed, so at best JS should approach that, not beat it by 3x or more. That tells me the problem might simply be that a more appropriate regex would take advantage of the native engine more properly. We've made tweaks to the regex recently that significantly improved its performance by avoiding a really unfortunate back-tracking that was happening (a generic illustration follows this list).

    I would really like to understand, perhaps from JS engine implementors or regex experts, why the regex approach is not at least as fast as the manual loop approach.

  2. I do not want the ports to be radically different implementations, if at all possible, as it makes it much harder to find and fix cross-implementation bugs or security issues. As such, a "complete rewrite" as opposed to a tweak or improvement now means that I, or someone(s), have to painstakingly go through and re-port this to all the other languages, and test and re-benchmark in all those envs, before we can just greenlight a complete switch of the code.

  3. Performance is obviously important, but so is readability of the code. I don't yet have an opinion on whether your version of the code is more or less readable, but one reason I favored regex over manual imperative logic is that regex is often a more "universally understood" programming construct, which I hoped would aid readability in a cross-language way. If porting this new version ends up looking "less readable" in one or more of the target languages, that's a concern.
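
To illustrate the back-tracking cost mentioned in point 1, here is a generic example (not the actual JSON.minify regex): nested quantifiers can force a back-tracking engine into exponential work on a near-miss input, while an equivalent pattern without the nesting rejects in linear time.

	// Generic catastrophic back-tracking demo (NOT the JSON.minify regex).
	const bad = /^(a+)+b$/;              // nested quantifiers
	const input = "a".repeat(28) + "c";  // near-miss input: can never match
	console.time("backtracking");
	bad.test(input);                     // roughly 2^28 paths to reject; may take seconds
	console.timeEnd("backtracking");
	console.time("linear");
	/^a+b$/.test(input);                 // same language, rejects almost instantly
	console.timeEnd("linear");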

None of these are deal breakers. But that hopefully helps explain why I think we'll be careful in the consideration of this complete rewrite rather than merging at first glance.

getify · May 12 '22 14:05

Certainly. I'd say the biggest risk with a complete rewrite is the behaviour for untested cases; maybe some users are even relying on the current behaviour for malformed JSON?

The upstream code also does a lot more than just run the regexp, which may compromise performance (all those temporary substrings) and may impact readability (my version is clearer to me, but that's likely because I wrote it, of course).

As for my question about removing newlines from strings, I can answer it myself: I have found out that raw newlines are not allowed inside JSON strings; a newline must be written as the escape sequence \n.
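
A quick check with the standard JSON.parse confirms this:

	JSON.parse('"a\nb"');   // SyntaxError: a raw newline inside a string is invalid
	JSON.parse('"a\\nb"');  // OK: the two-character escape \n parses to "a\nb"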

Virtual-X · May 12 '22 14:05

RegExp is not that good, or Chrome is just very fast at iterating strings. Here are measurements for the 26 MB string:

	const len = json.length;
	let i = 0, c0 = 0, c1 = 0, c = 0;
	while (i < len) {
		c = json[i++];
		if (c !== " " && c !== "\t" && c !== "\n" && c !== "\r" && c !== "\"" && c !== "/" && c !== "*") {
			c0++;
		}
		c1++;
	}
	console.log(c0 + c1);

time 112 ms = 233405 chars/ms

	var tokenizer = /"|(\/\*)|(\*\/)|(\/\/)|\n|\r/g, tmp;
	while (tmp = tokenizer.exec(json));

time 228 ms = 114857 chars/ms

	var tokenizer = /"|(\/\*)|(\*\/)|(\/\/)|\n|\r/g, tmp, lc, rc;
	while (tmp = tokenizer.exec(json)) {
		lc = RegExp.leftContext;
		rc = RegExp.rightContext;
	}

time 349 ms = 74818 chars/ms
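
(The measurement code isn't shown above; a harness along these lines, assuming performance.now() is available, reproduces the shape of these numbers.)

	// Hypothetical harness; the original measurement code isn't shown.
	function bench(label, json, fn) {
		const t0 = performance.now();
		fn(json);
		const ms = performance.now() - t0;
		console.log(label + ": " + ms.toFixed(0) + " ms = " +
			Math.round(json.length / ms) + " chars/ms");
	}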

Virtual-X · May 12 '22 15:05

Just out of curiosity, can you check your same benchmark against this alternate regex?

var tokenizer = /["\n\r]|(?:\/\*)|(?:\*\/)|(?:\/\/)/g

That consolidates some of the alternation into a single character class (which engines can typically match more cheaply than alternation) and makes the three groupings non-capturing. I'm just wondering if that regex is better optimized, letting Chrome's regex engine run faster?

getify · May 12 '22 15:05

	var tokenizer = /["\n\r]|(?:\/\*)|(?:\*\/)|(?:\/\/)/g, tmp;
	while (tmp = tokenizer.exec(json));

time 184 ms = 142460 chars/ms

	var tokenizer = /["\n\r]|(?:\/\*)|(?:\*\/)|(?:\/\/)/g, tmp, lc, rc;
	while (tmp = tokenizer.exec(json)) {
		lc = RegExp.leftContext;
		rc = RegExp.rightContext;
	}

time 342 ms = 76414 chars/ms

	var tokenizer = /\n/g, tmp, lc, rc;
	while (tmp = tokenizer.exec(json)) {
		lc = RegExp.leftContext;
		rc = RegExp.rightContext;
	}

time 5 ms = 5125754 chars/ms

	var tokenizer = /\s/g, tmp, lc, rc;
	while (tmp = tokenizer.exec(json)) {
		lc = RegExp.leftContext;
		rc = RegExp.rightContext;
	}

time 45 ms = 579631 chars/ms

	var tokenizer = /\"/g, tmp, lc, rc;
	while (tmp = tokenizer.exec(json)) {
		lc = RegExp.leftContext;
		rc = RegExp.rightContext;
	}

time 251 ms = 104024 chars/ms

I am not sure how to make sense of those measures; it probably depends on the content of the string, which is not ideal.
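
One way to check that hypothesis (a sketch, not something I measured above) is to count how often each pattern actually matches, since the per-match exec() overhead should scale with the match count:

	// Count the match density of each pattern in the 26 MB string.
	for (const re of [/\n/g, /\s/g, /"/g, /["\n\r]|(?:\/\*)|(?:\*\/)|(?:\/\/)/g]) {
		const count = (json.match(re) || []).length;
		console.log(re.source + ": " + count + " matches");
	}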

Virtual-X · May 12 '22 16:05

It was straightforward to port my code to C++ and C; see https://github.com/Virtual-X/JSON.minify/tree/cpp

  1. https://github.com/json-iterator/test-data/blob/master/large-file.json
     source length 26141343, minified length 26129992
     js upstream time 1164 ms = 22454 chars/ms
     js forked time 562 ms = 46490 chars/ms
     cpp time 159 ms = 164410 chars/ms
     c time 124 ms = 210817 chars/ms

Virtual-X · May 13 '22 12:05