fst
fst copied to clipboard
consider adding a tokenizer/scanning routine
@llogiq brought up this use case where one might have a really big set of tokens that one wants to use to scan some text. It turns out that this is pretty easy to support using existing public APIs, but it might be nice to actually make this a proper part of the API. Here is a candidate implementation:
use fst::raw::Fst;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let fst = Fst::from_iter_map(vec![
(" ", 0),
("a", 1),
("ab", 2),
("abc", 3),
("bc", 4),
("uvwxyz", 5),
]).unwrap();
let mut input = "abc a uvwxyz uvwxyz ab ab a ab abc";
while !input.is_empty() {
let (end, id) = match scan(&fst, input.as_bytes()) {
None => return Err(From::from("unrecognized token")),
Some((end, id)) => (end, id),
};
// don't print whitespace
if id != 0 {
println!("{:?} :: {:?}", id, &input[..end]);
}
input = &input[end..];
}
Ok(())
}
fn scan<D: AsRef<[u8]>>(fst: &Fst<D>, input: &[u8]) -> Option<(usize, u64)> {
let mut node = fst.root();
let mut out = fst::raw::Output::zero();
// If the empty string is a token, then `last_match` probably needs to
// be initialized in a smarter way.
let mut last_match: Option<(usize, u64)> = None;
for (i, &b) in input.iter().enumerate() {
node = match node.find_input(b) {
None => return last_match,
Some(trans_index) => {
let t = node.transition(trans_index);
out = out.cat(t.out);
fst.node(t.addr)
}
};
if node.is_final() {
last_match = Some((i+1, out.value()));
}
}
last_match
}
And the output:
$ cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.00s
Running `target/debug/fst-longest`
3 :: "abc"
1 :: "a"
5 :: "uvwxyz"
5 :: "uvwxyz"
2 :: "ab"
2 :: "ab"
1 :: "a"
2 :: "ab"
3 :: "abc"
I think if node
is final we need to cat
the final_output
of the node. So I replaced the loop with
for (i, &b) in input.iter().enumerate() {
match node.find_input(b) {
None => return last_match,
Some(trans_index) => {
let t = node.transition(trans_index);
node = fst.node(t.addr);
if node.is_final() {
out = out.cat(node.final_output());
last_match = Some((i + 1, out.value()));
} else {
out = out.cat(t.out);
}
}
}
}
This gives me correct results.
Ah yeah, great catch!
There's still one error, because we prematurely cat the final output, but this "final" node may still be a prefix of another node that's valid. So with that corrected:
for (i, &b) in input.iter().enumerate() {
if let Some(trans_index) = node.find_input(b) {
let t = node.transition(trans_index);
node = fst.node(t.addr);
if node.is_final() {
last_match = Some((i + 1, out.cat(node.final_output()).value()));
}
out = out.cat(t.out);
} else {
return last_match;
}
}
Regarding the name, I think scan
is short, but not too descriptive. Perhaps find_longest_prefix
?
Having done exactly this, I agree that inclusion in the public API would be desirable. (Perhaps more generally, prefix matching would be a natural inclusion given the trie's internal structure, though expectations vary when it comes to APIs.)