jetscii
jetscii copied to clipboard
expose more primitive operations
As I understand it, the public API currently requires that haystacks be &str
and that all needles be ASCII. Might you consider exposing a way to run a search on a &[u8]
? And possibly also exposing a way to use arbitrary bytes instead of limiting it to ASCII? (I feel less strongly about the latter point than the former point.)
The specific reason why I'd want this is for use in regex
. In particular, the internal matching engines can work on either &str
or &[u8]
, so the prefix literal scanning operates on &[u8]
so that it can be used with any of the matching engines. Since all inputs originate with &str
(currently), I could unsafely transmute to &[u8]
before calling out to jetscii, but I don't think that assumption will always hold in the future and the use of unsafe there seems a little unfortunate.
Other ideas? Thoughts?
All of this absolutely makes sense. I actually thought there was an issue for this already, as I'm sure someone else wanted it.
run a search on a
&[u8]
a way to use arbitrary bytes instead of limiting it to ASCII
Yeah, I'd probably just set up parallel structures that allow you to do whatever you want with bytes. This would allow folk to use it for binary files where you need to find 255
or somesuch.
For my own education, will you actually be searching for non-ASCII characters? If so, are you just going to search for the first byte of a multiple-byte UTF8 character and then do a sanity check to make sure it was the right one?
@shepmaster regex
today does indeed search on bytes, which may be any value in 0-255
. The invariants are little more complex than I'd like, but:
- For the DFA, it matches on raw bytes. UTF-8 decoding is handled implicitly in the compiled program. Since we guarantee (by construction) that the automata can only enter a match state via valid UTF-8, we can do whatever kind of prefix scanning we like---even if it's only part of a single encoded codepoint.
- For the NFAs, they generally do UTF-8 decoding explicitly since the automata is itself defined in terms of codepoints. (Yup, this means we have two distinct flavors of automata in
regex
.) When prefix literals are extract from such a program, then the literals are themselves just sequences of codepoints. When we do the actual prefix scanning, if the prefix starts with a non-ASCII codepoint, then it will just pass the first byte tomemchr
(for example). The matching engine then picks up at the beginning of the literal match, so it never needs to deal with partial decoding or any of that.
With that said...
- Using memchr on a non-ASCII byte probably isn't wise for performance reasons, and would probably be better to run memchr on, say, the ending byte of a UTF-8 encoding of a single codepoint. The performance picture isn't exactly clear here though.
- In the future, I would like for
Regex
to be capable of matching on&[u8]
, which may contain arbitrary bytes. Today, search text must be valid UTF-8.
Hope that makes sense. Happy to answer more questions!
@BurntSushi what kind of API would you like to see? Strings are nice in that they have the Matcher
API to bolt into. Byte slices don't seem to have the same thing. There's a bit of setup involved, so there probably wants to be a struct to hold it. My natural inclination is to have something like
let needle = some_macro_name!(needle_byte_1, needle_byte_2, ...);
needle.position(haystack);
But it would probably feel more natural to have haystack.position(needle)
which would involve a trait. I'm pretty flexible here, so as an interested API consumer, I think your voice gets a lot of weight! :palm_tree:
@shepmaster Hmm. The absolute simplest API I can think of is: fn pcmpestri(needles: &[u8], haystack: &[u8]) -> Option<usize>
(excuse me if pcmpestri
is not the appropriate name here). It could panic if needles.len() > 16
.
I suspect what you're going to say is that this adds some kind of overhead to every search since the bytes in needles
need to be packed into something before being passed to asm!
. If that's true, then perhaps:
struct PcmpSearcher {
// some packed representation?
}
impl PcmpSearcher {
// Create a new PcmpNeedles with the bytes given.
// This panics if needles.len() > 16.
// Duplicate bytes are ignored. (Seems suspicious.)
fn new(needles: &[u8]) -> Self { ... }
fn position(&self, haystack: &[u8]) -> Option<usize> { ... }
}
(Naming isn't my specialty. :P)
You could even impl FromIterator
so that let pcmp: PcmpSearcher = vec![b'a', b'b', b'c'].into_iter().collect()
works, but that might be too much fluff. :-)
Other thoughts:
- I could live with a trait, but there's really nothing in this space for
&[u8]
, so my natural inclination is to not add any. - I think a macro would be regrettable.
@shepmaster Thanks for working with me on this btw. I actually think this might help regex
catch up to PCRE on a few benchmarks (because they are precisely the benchmarks with a small number of prefix bytes). At least on Rust nightly, anyway. :-)
I think a macro would be regrettable.
The macro is to allow for fallback cases where the intrinsic operations aren't available. Technically, I could make a few small changes and release Jetscii right now and it would work on stable Rust. It just isn't worth it because the fallback would be the only supported case!
Right now, when you call something like:
part_number.split(ascii_chars!('-', ':')).collect();
That macro expands into the packed structures needed for the intrinsic as well as an optimized closure that checks for those two bytes "manually" (part_number.as_bytes().position(|b| b == '-' || b == ':')
). Conditional compilation means only one path is actually used.
This all corresponds to AsciiChars
and AsciiCharsWithFallback
.
If you compile with the right features (which the docs don't), then there are methods on AsciiChars
like find
, which returns the byte index.
It's possible that you'd prefer to have your own fallback code for when the intrinsics aren't available. If so, then you'd be able to use the raw-bytes equivalent of find
(which I agree should be called position
in the context of bytes).
I did think about the one-shot function, and it's something I'd be open to if enough people wanted it, but I look at it as a bit of antipattern - you probably don't want to call it in a tight loop, much like Regex::new
.
The FromIterator
implementation is indeed cute, and I'll have to give some thought to that. :smile_cat:
Thanks for working with me on this btw.
Absolutely! Having people use the library is the only way it will get better! And the faster Rust is, the more likely I am to get paid to write it for a living! :innocent:
Hmm, right, so then I definitely agree that the one-shot case should just be completely thrown out the window. Encouraging folks to write slow code is bad (and is one reason why, e.g., I want to remove the free function is_match
from regex
when it hits 1.0).
With that said... you're right about the fallback. I definitely want to be able to control, up front, which type of prefix searching is used. Sometimes it's just memchr
, sometimes it's Boyer Moore and other times it's Aho-Corasick (and of course, the latter two make use of memchr
internally, and hopefully jetscii too!). If for some reason the prefix isn't amenable to pcmpstri
, then I want to choose a different strategy at compile time.
I think the fallback code is nice from a usability perspective, but I do also believe there should be a way to execute pcmpstri with as few assumptions as possible. I think that's what I'd like most.
Now, if for some reason it's impossible to know whether a fallback is needed up front, then that could be a problem! (I feel like there shouldn't be, but my understanding of the pcmpstri instruction is extremely limited.)
With all that said, if you feel strongly about tying it to a fallback in the public API, then I can cope and write my own fallback routine. :-) (For example, I'd use memchr2
in the memchr
crate before the naive iterator approach if there are two bytes.)
I don't think I am doing a great job explaining, but hopefully code will help. Grab the bytes branch. You should be able to use jetscii::bytes::Bytes
:
extern crate jetscii;
use jetscii::bytes::Bytes;
fn main() {
let words = include_bytes!("/tmp/words.gz");
let mut needle = Bytes::new();
needle.push(0x80);
println!("{:?}", needle.position(words));
}
Make sure you compile Jetscii with features=["unstable"]
to enable everything (and a nightly, of course).
The big thing is that Bytes::position
does not exist unless you
- Have the feature enabled
- Are on x86_64
The fallback code I'm talking about makes it so a consumer of the library doesn't need to know about those concerns. You want to care about that, and this is the current way I expose it.
@shepmaster Code is good, thanks for that. And it has made me realize that I had a major brain fart. For whatever reason (I blame lack of sleep), I was thinking that the fallback was used when a caller added more than 16 bytes to the needle, but now of course I realize that's silly. I forgot about the pcmp instructions not being supported everywhere!
So with that said... Yes, I would use with_fallback
and do my special memchr stuff in there. :-) Sorry for the confusion! Up to you whether you should still have position
on the Bytes
type.
I would use
with_fallback
and do my specialmemchr
stuff in there
Hmm. That makes sense, but could be a bit tricky. Right now, the fallback takes a Fn(u8) -> bool
which is usually provided with a closure like |b| b == b'A' || b == b'Z'
. It's really impressive how well the compiler optimizes and inlines those!
In fact, a lot of my existing usage has centered around a set of characters fixed at compile time. For example, XML parsing can benefit from looking for the first of ['&', '<', '>']
. You can see the benchmarks to see that an inlined closure is about 1 GB/s faster than a slice!
In your case, you'd rather a fallback that operates more like Fn(&[u8]) -> Option<usize>
, basically a complete substitute implementation.
I can certainly add something like that, but it starts to feel a bit circular. What would you actually be using for your fallback? I'd assume something like memchr
. If that's the case, then perhaps that's what Jetscii should actually be using under the hood any way.
What would you actually be using for your fallback?
Yes, absolutely. :-)
The memchr
functions are here: http://burntsushi.net/rustdoc/memchr/ Note the existence of memchr2
and memchr3
, which I cobbled together specifically in cases where something like jetscii can't be used. They should be faster than the naive |b| b == b'A' || b == b'Z'
.
@huonw - I wonder if you might have any special knowledge in this space, since I know you've been working on simd stuff. (I've looked at it, but can't quite grok how to use it for substring/byte matching!)
One thing I notice is that your performance numbers for memchr
blow Jetscii out of the water:
which | speed |
---|---|
Iterator::position |
1893 MB/s |
memchr |
49504 MB/s |
Jetscii | 5719 MB/s |
I'm not even sure why you'd want to switch!
@shepmaster Hehe, well that's memchr
, which only works on a single needle and is defined inside libc
which is probably doing all sorts of cool things. It would be better to compare jetscii with memchr2
and memchr3
, which I wrote based on memchr
's fallback code written by @bluss. Even then, memchr2
and memchr3
only support 2 and 3 needles respectively, whereas jetscii will go up to 16.
To be clear, when there's a single leading byte, I won't use jetscii---I'll just use memchr. But if there's more than that.... :-)
Here are the latest benchmarks in memchr
from Travis: https://travis-ci.org/BurntSushi/rust-memchr/jobs/109233136#L251-L258 --- I'm not sure if those numbers are directly comparable to what you have in jetscii's README (but you could run the benchmarks on the same machine easily enough).
@shepmaster Anything I can do to help move this along? :-)
@BurntSushi Let's try to get this started again! Maybe one thing that would help me would be if you could point (or hand-wave) to where ever some jetscii code would fit in.
I do see there's some SIMD stuff in regex already, so don't feel bad telling me that I'm just too late to catch the train with the cool kids 😊
@shepmaster The SIMD stuff did relieve some of the pressure, but I bet jetscii
could still be useful, for example, for matching an alternation of single bytes. e.g., the regex (a|b|c|d|e)\w+
seems like it would benefit from jetscii. The specific place in the code is probably as a different literal searcher: https://github.com/rust-lang-nursery/regex/blob/master/src/literals.rs#L47
Thank you! The code is pretty easy to understand; very nice!
I was on a plane and didn't have the benchmark suite downloaded, but the test suite takes 94% of the previous time with the Jetscii stuff added; an encouraging sign.
@BurntSushi care to give this an appraisal?
name bench-master ns/iter bench-more-assembly ns/iter diff ns/iter diff %
misc::anchored_literal_long_match 24 (16250 MB/s) 26 (15000 MB/s) 2 8.33%
misc::anchored_literal_long_non_match 27 (14444 MB/s) 25 (15600 MB/s) -2 -7.41%
misc::anchored_literal_short_match 23 (1130 MB/s) 24 (1083 MB/s) 1 4.35%
misc::anchored_literal_short_non_match 27 (962 MB/s) 26 (1000 MB/s) -1 -3.70%
misc::easy0_1K 69 (15231 MB/s) 69 (15231 MB/s) 0 0.00%
misc::easy0_1MB 87 (12052908 MB/s) 89 (11782056 MB/s) 2 2.30%
misc::easy0_32 69 (855 MB/s) 69 (855 MB/s) 0 0.00%
misc::easy0_32K 69 (475289 MB/s) 69 (475289 MB/s) 0 0.00%
misc::easy1_1K 54 (19333 MB/s) 55 (18981 MB/s) 1 1.85%
misc::easy1_1MB 57 (18396421 MB/s) 57 (18396421 MB/s) 0 0.00%
misc::easy1_32 55 (945 MB/s) 56 (928 MB/s) 1 1.82%
misc::easy1_32K 54 (607185 MB/s) 56 (585500 MB/s) 2 3.70%
misc::hard_1K 69 (15231 MB/s) 68 (15455 MB/s) -1 -1.45%
misc::hard_1MB 85 (12336505 MB/s) 86 (12193058 MB/s) 1 1.18%
misc::hard_32 68 (867 MB/s) 70 (842 MB/s) 2 2.94%
misc::hard_32K 69 (475289 MB/s) 69 (475289 MB/s) 0 0.00%
misc::literal 20 (2550 MB/s) 18 (2833 MB/s) -2 -10.00%
misc::long_needle1 3,230 (30960 MB/s) 3,221 (31046 MB/s) -9 -0.28%
misc::long_needle2 812,507 (123 MB/s) 822,673 (121 MB/s) 10,166 1.25%
misc::match_class 99 (818 MB/s) 32 (2531 MB/s) -67 -67.68%
misc::match_class_in_range 36 (2250 MB/s) 31 (2612 MB/s) -5 -13.89%
misc::medium_1K 71 (14816 MB/s) 72 (14611 MB/s) 1 1.41%
misc::medium_1MB 88 (11915954 MB/s) 89 (11782067 MB/s) 1 1.14%
misc::medium_32 72 (833 MB/s) 72 (833 MB/s) 0 0.00%
misc::medium_32K 72 (455500 MB/s) 72 (455500 MB/s) 0 0.00%
misc::no_exponential 500 (200 MB/s) 512 (195 MB/s) 12 2.40%
misc::not_literal 129 (395 MB/s) 135 (377 MB/s) 6 4.65%
misc::one_pass_long_prefix 74 (351 MB/s) 74 (351 MB/s) 0 0.00%
misc::one_pass_long_prefix_not 73 (356 MB/s) 74 (351 MB/s) 1 1.37%
misc::one_pass_short 54 (314 MB/s) 55 (309 MB/s) 1 1.85%
misc::one_pass_short_not 56 (303 MB/s) 55 (309 MB/s) -1 -1.79%
misc::reallyhard2_1K 96 (10833 MB/s) 100 (10400 MB/s) 4 4.17%
misc::reallyhard_1K 2,328 (451 MB/s) 2,332 (450 MB/s) 4 0.17%
misc::reallyhard_1MB 2,349,958 (446 MB/s) 2,448,527 (428 MB/s) 98,569 4.19%
misc::reallyhard_32 147 (401 MB/s) 158 (373 MB/s) 11 7.48%
misc::reallyhard_32K 72,215 (454 MB/s) 70,968 (462 MB/s) -1,247 -1.73%
misc::reverse_suffix_no_quadratic 7,563 (1057 MB/s) 2,088 (3831 MB/s) -5,475 -72.39%
regexdna::find_new_lines 18,613,370 (273 MB/s) 18,270,181 (278 MB/s) -343,189 -1.84%
regexdna::subst1 1,095,902 (4638 MB/s) 1,149,494 (4422 MB/s) 53,592 4.89%
regexdna::subst10 1,159,534 (4384 MB/s) 1,182,921 (4297 MB/s) 23,387 2.02%
regexdna::subst11 1,103,694 (4605 MB/s) 1,137,528 (4468 MB/s) 33,834 3.07%
regexdna::subst2 1,119,467 (4540 MB/s) 1,146,736 (4432 MB/s) 27,269 2.44%
regexdna::subst3 1,099,312 (4624 MB/s) 1,190,286 (4270 MB/s) 90,974 8.28%
regexdna::subst4 1,123,742 (4523 MB/s) 1,131,335 (4493 MB/s) 7,593 0.68%
regexdna::subst5 1,098,962 (4625 MB/s) 1,134,847 (4479 MB/s) 35,885 3.27%
regexdna::subst6 1,099,016 (4625 MB/s) 1,133,407 (4485 MB/s) 34,391 3.13%
regexdna::subst7 1,101,411 (4615 MB/s) 1,132,839 (4487 MB/s) 31,428 2.85%
regexdna::subst8 1,118,263 (4545 MB/s) 1,191,274 (4267 MB/s) 73,011 6.53%
regexdna::subst9 1,127,738 (4507 MB/s) 1,113,407 (4565 MB/s) -14,331 -1.27%
regexdna::variant1 4,895,873 (1038 MB/s) 4,997,135 (1017 MB/s) 101,262 2.07%
regexdna::variant2 8,394,295 (605 MB/s) 8,543,739 (594 MB/s) 149,444 1.78%
regexdna::variant3 9,993,058 (508 MB/s) 10,457,008 (486 MB/s) 463,950 4.64%
regexdna::variant4 10,092,960 (503 MB/s) 10,644,563 (477 MB/s) 551,603 5.47%
regexdna::variant5 8,498,861 (598 MB/s) 8,773,161 (579 MB/s) 274,300 3.23%
regexdna::variant6 8,309,121 (611 MB/s) 8,289,828 (613 MB/s) -19,293 -0.23%
regexdna::variant7 8,707,494 (583 MB/s) 8,932,302 (569 MB/s) 224,808 2.58%
regexdna::variant8 8,883,837 (572 MB/s) 9,022,304 (563 MB/s) 138,467 1.56%
regexdna::variant9 8,800,123 (577 MB/s) 8,819,907 (576 MB/s) 19,784 0.22%
sherlock::before_after_holmes 1,318,164 (451 MB/s) 1,334,614 (445 MB/s) 16,450 1.25%
sherlock::before_holmes 86,066 (6912 MB/s) 87,794 (6776 MB/s) 1,728 2.01%
sherlock::everything_greedy 2,955,550 (201 MB/s) 2,963,474 (200 MB/s) 7,924 0.27%
sherlock::everything_greedy_nl 1,280,764 (464 MB/s) 1,264,705 (470 MB/s) -16,059 -1.25%
sherlock::holmes_cochar_watson 212,473 (2800 MB/s) 215,042 (2766 MB/s) 2,569 1.21%
sherlock::holmes_coword_watson 710,506 (837 MB/s) 713,156 (834 MB/s) 2,650 0.37%
sherlock::ing_suffix 478,865 (1242 MB/s) 506,858 (1173 MB/s) 27,993 5.85%
sherlock::ing_suffix_limited_space 1,537,909 (386 MB/s) 1,690,583 (351 MB/s) 152,674 9.93%
sherlock::letters 28,868,458 (20 MB/s) 31,051,139 (19 MB/s) 2,182,681 7.56%
sherlock::letters_lower 28,242,104 (21 MB/s) 30,509,948 (19 MB/s) 2,267,844 8.03%
sherlock::letters_upper 2,372,619 (250 MB/s) 2,548,747 (233 MB/s) 176,128 7.42%
sherlock::line_boundary_sherlock_holmes 1,356,437 (438 MB/s) 1,379,389 (431 MB/s) 22,952 1.69%
sherlock::name_alt1 42,433 (14020 MB/s) 43,734 (13603 MB/s) 1,301 3.07%
sherlock::name_alt2 183,175 (3247 MB/s) 191,617 (3104 MB/s) 8,442 4.61%
sherlock::name_alt3 195,035 (3050 MB/s) 204,909 (2903 MB/s) 9,874 5.06%
sherlock::name_alt3_nocase 1,476,059 (403 MB/s) 1,547,693 (384 MB/s) 71,634 4.85%
sherlock::name_alt4 227,746 (2612 MB/s) 256,287 (2321 MB/s) 28,541 12.53%
sherlock::name_alt4_nocase 310,494 (1916 MB/s) 337,413 (1763 MB/s) 26,919 8.67%
sherlock::name_alt5 188,038 (3163 MB/s) 209,919 (2834 MB/s) 21,881 11.64%
sherlock::name_alt5_nocase 822,826 (723 MB/s) 894,400 (665 MB/s) 71,574 8.70%
sherlock::name_holmes 57,112 (10416 MB/s) 54,114 (10994 MB/s) -2,998 -5.25%
sherlock::name_holmes_nocase 274,654 (2166 MB/s) 275,834 (2156 MB/s) 1,180 0.43%
sherlock::name_sherlock 76,547 (7772 MB/s) 79,323 (7500 MB/s) 2,776 3.63%
sherlock::name_sherlock_holmes 42,817 (13894 MB/s) 44,528 (13360 MB/s) 1,711 4.00%
sherlock::name_sherlock_holmes_nocase 241,296 (2465 MB/s) 244,435 (2433 MB/s) 3,139 1.30%
sherlock::name_sherlock_nocase 236,539 (2515 MB/s) 247,517 (2403 MB/s) 10,978 4.64%
sherlock::name_whitespace 89,045 (6681 MB/s) 88,568 (6717 MB/s) -477 -0.54%
sherlock::no_match_common 28,122 (21155 MB/s) 28,633 (20777 MB/s) 511 1.82%
sherlock::no_match_really_common 410,134 (1450 MB/s) 459,290 (1295 MB/s) 49,156 11.99%
sherlock::no_match_uncommon 27,861 (21353 MB/s) 30,621 (19428 MB/s) 2,760 9.91%
sherlock::quotes 644,258 (923 MB/s) 633,001 (939 MB/s) -11,257 -1.75%
sherlock::repeated_class_negation 108,826,573 (5 MB/s) 112,777,709 (5 MB/s) 3,951,136 3.63%
sherlock::the_lower 736,800 (807 MB/s) 776,598 (766 MB/s) 39,798 5.40%
sherlock::the_nocase 603,604 (985 MB/s) 625,625 (950 MB/s) 22,021 3.65%
sherlock::the_upper 55,682 (10684 MB/s) 61,550 (9665 MB/s) 5,868 10.54%
sherlock::the_whitespace 1,363,712 (436 MB/s) 1,373,879 (433 MB/s) 10,167 0.75%
sherlock::word_ending_n 2,476,486 (240 MB/s) 2,509,844 (237 MB/s) 33,358 1.35%
sherlock::words 11,806,797 (50 MB/s) 11,842,803 (50 MB/s) 36,006 0.30%
poke @BurntSushi
@shepmaster Can you show the code you used to generate those benchmarks? It looks like the improvement to match_class
is because of jetscii, right? I was surprised to see the improvement to reverse_suffix_no_quadratic
as well, since that should be using a suffix literal, maybe you changed it to use the beginning character class?
Can you show the code you used to generate those benchmarks?
Oh, yes, that was silly of me to omit. Here's my regex branch. You'll see that it's basically a direct clone of the existing types. I also have changes to jetscii, which basically boil down to copy-paste-remove checks.
maybe you changed it to use the beginning character class?
Nope! That's one of many reasons why your expertise is requested 😜.