fst consider adding a tokenizer/scanning routine

@llogiq brought up this use case where one might have a really big set of tokens that one wants to use to scan some text. It turns out that this is pretty easy to support using existing public APIs, but it might be nice to actually make this a proper part of the API. Here is a candidate implementation:

use fst::raw::Fst;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fst = Fst::from_iter_map(vec![
        (" ", 0),
        ("a", 1),
        ("ab", 2),
        ("abc", 3),
        ("bc", 4),
        ("uvwxyz", 5),
    ]).unwrap();

    let mut input = "abc a uvwxyz uvwxyz ab ab a ab abc";
    while !input.is_empty() {
        let (end, id) = match scan(&fst, input.as_bytes()) {
            None => return Err(From::from("unrecognized token")),
            Some((end, id)) => (end, id),
        };
        // don't print whitespace
        if id != 0 {
            println!("{:?} :: {:?}", id, &input[..end]);
        }
        input = &input[end..];
    }

    Ok(())
}

fn scan<D: AsRef<[u8]>>(fst: &Fst<D>, input: &[u8]) -> Option<(usize, u64)> {
    let mut node = fst.root();
    let mut out = fst::raw::Output::zero();
    // If the empty string is a token, then `last_match` probably needs to
    // be initialized in a smarter way.
    let mut last_match: Option<(usize, u64)> = None;
    for (i, &b) in input.iter().enumerate() {
        node = match node.find_input(b) {
            None => return last_match,
            Some(trans_index) => {
                let t = node.transition(trans_index);
                out = out.cat(t.out);
                fst.node(t.addr)
            }
        };
        if node.is_final() {
            last_match = Some((i+1, out.value()));
        }
    }
    last_match
}

And the output:

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/fst-longest`
3 :: "abc"
1 :: "a"
5 :: "uvwxyz"
5 :: "uvwxyz"
2 :: "ab"
2 :: "ab"
1 :: "a"
2 :: "ab"
3 :: "abc"

Apr 03 '20 13:04 BurntSushi

I think if node is final we need to cat the final_output of the node. So I replaced the loop with

    for (i, &b) in input.iter().enumerate() {
        match node.find_input(b) {
            None => return last_match,
            Some(trans_index) => {
                let t = node.transition(trans_index);
                node = fst.node(t.addr);
                if node.is_final() {
                    out = out.cat(node.final_output());
                    last_match = Some((i + 1, out.value()));
                } else {
                    out = out.cat(t.out);
                }
            }
        }
    }

This gives me correct results.

Apr 03 '20 18:04 llogiq

Ah yeah, great catch!

Apr 03 '20 18:04 BurntSushi

There's still one error, because we prematurely cat the final output, but this "final" node may still be a prefix of another node that's valid. So with that corrected:

    for (i, &b) in input.iter().enumerate() {
        if let Some(trans_index) = node.find_input(b) {
            let t = node.transition(trans_index);
            node = fst.node(t.addr);
            if node.is_final() {
                last_match = Some((i + 1, out.cat(node.final_output()).value()));
            }
            out = out.cat(t.out);
        } else {
            return last_match;
        }
    }

Apr 03 '20 22:04 llogiq

Regarding the name, I think scan is short, but not too descriptive. Perhaps find_longest_prefix?

Apr 04 '20 06:04 llogiq

Having done exactly this, I agree that inclusion in the public API would be desirable. (Perhaps more generally, prefix matching would be a natural inclusion given the trie's internal structure, though expectations vary when it comes to APIs.)

May 19 '20 17:05 tapeinosyne

fst fst copied to clipboard

consider adding a tokenizer/scanning routine

fst
fst copied to clipboard