jawn icon indicating copy to clipboard operation
jawn copied to clipboard

Faster whitespace detection

Open non opened this issue 11 years ago • 6 comments

At nescala Matthias suggested there was a faster way to check for possible whitespace characters using bitwise logic. We should look into that.

non avatar Feb 11 '15 15:02 non

fancy

softprops avatar Feb 11 '15 18:02 softprops

Did he explain it to you? He talked me through it - it's simple, cunning and opportunistic. :)

propensive avatar Feb 11 '15 19:02 propensive

I don't remember exactly how it worked. I will try to reverse engineer it (or ask him) later.

non avatar Feb 11 '15 20:02 non

The basic idea, IIRC, was that because all the whitespace characters are low in the ASCII table, you can shift 1.byte left by each byte, AND it with a bitmask containing a few strategically-placed 1s, and if the answer is nonzero, then it's whitespace. I forget how much of a saving this was, but Matthias had worked out exactly what the difference was at the machine level, and it saved a tiny amount in a particularly hot hotspot...

propensive avatar Feb 11 '15 21:02 propensive

@propensive If you can a link to some cod that does this I'd be appreciative.

My instinct was first to just try an initial c <= 32 test to find any potential whitespace candidate (since any char that is 32 or less is either whitespace or illegal).

non avatar Feb 11 '15 21:02 non

I don't remember exactly -- I only saw the code on his computer, but recollection is that it was something like this:

def isWhitespace(b: Byte) = ((1L << b) & 4294977024L) != 0

where I calculated that magic number from List('\n', '\r', ' ', '\t').map(1L << _).sum.

Though do you need it to fail fast? Could you do the shift (as above), and OR the result from each byte into an accumulator which you check once at the end of parsing, something like this? Just throwing ideas out there... I've no idea if this actually works, or profiles well...

propensive avatar Feb 11 '15 22:02 propensive