fst icon indicating copy to clipboard operation
fst copied to clipboard

consider adding a tokenizer/scanning routine

Open BurntSushi opened this issue 4 years ago • 5 comments

@llogiq brought up this use case where one might have a really big set of tokens that one wants to use to scan some text. It turns out that this is pretty easy to support using existing public APIs, but it might be nice to actually make this a proper part of the API. Here is a candidate implementation:

use fst::raw::Fst;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fst = Fst::from_iter_map(vec![
        (" ", 0),
        ("a", 1),
        ("ab", 2),
        ("abc", 3),
        ("bc", 4),
        ("uvwxyz", 5),
    ]).unwrap();

    let mut input = "abc a uvwxyz uvwxyz ab ab a ab abc";
    while !input.is_empty() {
        let (end, id) = match scan(&fst, input.as_bytes()) {
            None => return Err(From::from("unrecognized token")),
            Some((end, id)) => (end, id),
        };
        // don't print whitespace
        if id != 0 {
            println!("{:?} :: {:?}", id, &input[..end]);
        }
        input = &input[end..];
    }

    Ok(())
}

fn scan<D: AsRef<[u8]>>(fst: &Fst<D>, input: &[u8]) -> Option<(usize, u64)> {
    let mut node = fst.root();
    let mut out = fst::raw::Output::zero();
    // If the empty string is a token, then `last_match` probably needs to
    // be initialized in a smarter way.
    let mut last_match: Option<(usize, u64)> = None;
    for (i, &b) in input.iter().enumerate() {
        node = match node.find_input(b) {
            None => return last_match,
            Some(trans_index) => {
                let t = node.transition(trans_index);
                out = out.cat(t.out);
                fst.node(t.addr)
            }
        };
        if node.is_final() {
            last_match = Some((i+1, out.value()));
        }
    }
    last_match
}

And the output:

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/fst-longest`
3 :: "abc"
1 :: "a"
5 :: "uvwxyz"
5 :: "uvwxyz"
2 :: "ab"
2 :: "ab"
1 :: "a"
2 :: "ab"
3 :: "abc"

BurntSushi avatar Apr 03 '20 13:04 BurntSushi

I think if node is final we need to cat the final_output of the node. So I replaced the loop with

    for (i, &b) in input.iter().enumerate() {
        match node.find_input(b) {
            None => return last_match,
            Some(trans_index) => {
                let t = node.transition(trans_index);
                node = fst.node(t.addr);
                if node.is_final() {
                    out = out.cat(node.final_output());
                    last_match = Some((i + 1, out.value()));
                } else {
                    out = out.cat(t.out);
                }
            }
        }
    }

This gives me correct results.

llogiq avatar Apr 03 '20 18:04 llogiq

Ah yeah, great catch!

BurntSushi avatar Apr 03 '20 18:04 BurntSushi

There's still one error, because we prematurely cat the final output, but this "final" node may still be a prefix of another node that's valid. So with that corrected:

    for (i, &b) in input.iter().enumerate() {
        if let Some(trans_index) = node.find_input(b) {
            let t = node.transition(trans_index);
            node = fst.node(t.addr);
            if node.is_final() {
                last_match = Some((i + 1, out.cat(node.final_output()).value()));
            }
            out = out.cat(t.out);
        } else {
            return last_match;
        }
    }

llogiq avatar Apr 03 '20 22:04 llogiq

Regarding the name, I think scan is short, but not too descriptive. Perhaps find_longest_prefix?

llogiq avatar Apr 04 '20 06:04 llogiq

Having done exactly this, I agree that inclusion in the public API would be desirable. (Perhaps more generally, prefix matching would be a natural inclusion given the trie's internal structure, though expectations vary when it comes to APIs.)

tapeinosyne avatar May 19 '20 17:05 tapeinosyne