Public Rust API
We want to design the public API so that Sudachi can be used as shown below. The syntax may not be fully valid, and all names are open for discussion.
let model = JapaneseModel::from_cfg("...")?;
let mut analyzer = model.new_analyzer();
for line in data {
    // analyze one line of input; the result is split into sentences
    for sentence in analyzer.analyze(line)? {
        for token in sentence {
            println!("{}", token.surface);
        }
    }
}
Key points of API
- Model should be immutable and safe to share between threads
- It contains the dictionary, the connection data, and the constant data needed for preprocessing
- Analyzer contains mutable data for analysis, e.g. the lattice, and tries to reuse allocations as much as possible.
- In the long run, the analyzer should perform O(1) allocations, i.e. their number should not grow with the amount of input processed
Because of the Python API and lifetime considerations, Model should be a thin wrapper around Arc<RealModel> or something similar.
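The split between a shared immutable model and a per-thread analyzer could look roughly like the sketch below. It is only an illustration under assumptions: the fields of RealModel, the toy "analysis" (looking up whitespace-separated words), and the helper analyze_on_threads are hypothetical and not part of the proposal.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical immutable model data (stands in for dictionary, connection data, ...).
struct RealModel {
    words: Vec<String>,
}

/// Thin handle over the shared data; cloning only bumps a reference count.
#[derive(Clone)]
struct Model(Arc<RealModel>);

impl Model {
    fn new(words: &[&str]) -> Self {
        Model(Arc::new(RealModel {
            words: words.iter().map(|w| w.to_string()).collect(),
        }))
    }

    fn new_analyzer(&self) -> Analyzer {
        Analyzer { model: Arc::clone(&self.0), ids: Vec::new() }
    }
}

/// Mutable per-thread state; `ids` plays the role of the reusable lattice.
struct Analyzer {
    model: Arc<RealModel>,
    ids: Vec<usize>,
}

impl Analyzer {
    /// Toy analysis: returns the model ids of the known words in `input`.
    fn analyze(&mut self, input: &str) -> &[usize] {
        // `clear` keeps the capacity, so repeated calls stop allocating
        // once the buffer has grown large enough.
        self.ids.clear();
        for word in input.split_whitespace() {
            if let Some(id) = self.model.words.iter().position(|w| w == word) {
                self.ids.push(id);
            }
        }
        &self.ids
    }
}

/// One analyzer per thread, all sharing the same model.
fn analyze_on_threads(model: &Model, lines: Vec<String>) -> Vec<Vec<usize>> {
    let handles: Vec<_> = lines
        .into_iter()
        .map(|line| {
            let model = model.clone();
            thread::spawn(move || {
                let mut analyzer = model.new_analyzer();
                analyzer.analyze(&line).to_vec()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Because the analyzer owns an Arc rather than a reference, it carries no lifetime parameter, which is what makes the same type usable from Python bindings.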
Layering
We have a Rust API and a Python API with different lifetime considerations.
The Rust API should use lifetimes to safeguard against misuse and should mostly use references for sharing data. The Python API, on the other hand, can't use Rust lifetimes and should mostly use Arc for sharing data.
The design proposal here is to have pointer-generic internals, with thin wrappers for the API types which mostly exist to instantiate concrete types.
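A minimal sketch of what "pointer-generic internals with thin wrappers" might mean, assuming the internals are generic over any pointer type that dereferences to the dictionary data. All names here (DictionaryData, TokenizerImpl, SharedTokenizer, count_known) are hypothetical and chosen only for illustration.

```rust
use std::ops::Deref;
use std::sync::Arc;

/// Hypothetical immutable dictionary data.
struct DictionaryData {
    words: Vec<String>,
}

/// Internals, generic over how the dictionary is held (`&T`, `Arc<T>`, ...).
struct TokenizerImpl<D: Deref<Target = DictionaryData>> {
    dict: D,
}

impl<D: Deref<Target = DictionaryData>> TokenizerImpl<D> {
    /// Toy operation: counts whitespace-separated words found in the dictionary.
    fn count_known(&self, input: &str) -> usize {
        input
            .split_whitespace()
            .filter(|w| self.dict.words.iter().any(|known| known.as_str() == *w))
            .count()
    }
}

/// Rust-facing wrapper: borrows the dictionary, checked by lifetimes.
struct Tokenizer<'a>(TokenizerImpl<&'a DictionaryData>);

impl<'a> Tokenizer<'a> {
    fn new(dict: &'a DictionaryData) -> Self {
        Tokenizer(TokenizerImpl { dict })
    }

    fn count_known(&self, input: &str) -> usize {
        self.0.count_known(input)
    }
}

/// Python-facing wrapper: shares ownership, no lifetime parameter.
struct SharedTokenizer(TokenizerImpl<Arc<DictionaryData>>);

impl SharedTokenizer {
    fn new(dict: Arc<DictionaryData>) -> Self {
        SharedTokenizer(TokenizerImpl { dict })
    }

    fn count_known(&self, input: &str) -> usize {
        self.0.count_known(input)
    }
}
```

The wrappers contain no logic of their own; they only pick the concrete pointer type, so the actual implementation is written once.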
API Surface (Types)
- Dictionary: stores immutable data for tokenization
- Tokenizer: stores mutable state for tokenization
- InputBuffer: handles zero-copy input, sentence splitting and streaming of input data (eventually)
- MorphemeList: analysis result of a single block of input data
- Morpheme: unit of analysis result
Names and semantics should be as close to the Java version as possible. (comment from Takaoka-san)
TL;DR
A nice API for multiple sentences is currently blocked in stable Rust by the in-progress GATs (generic associated types) feature; see also http://lukaskalbertodt.github.io/2018/08/03/solving-the-generalized-streaming-iterator-problem-without-gats.html.
Want to have:
- An analyzer with internal mutable state that stores one sentence (and no more), for performance reasons
- An iterator over analyzed sentences that borrows the analyzer mutably, so that it is impossible to access the next sentence before consuming the current one
Problems:
- Current Rust iterators can't yield items that borrow from &self without GATs, which would probably introduce a new-ish Iterator API as well
What to do
- Provide non-allocating API for
  - Sentence splitting, returning Iterator of &str
  - Sentence-based analysis
- Optionally provide an allocating combined API, which copies the needed information (mostly POS) from the analyzer
- Another option would be to implement an Iterator-like pattern without using standard library traits to iterate over analyzed sentences (for consuming in a while loop).
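The last option, an Iterator-like type that does not implement the standard Iterator trait, could be sketched as follows. This is an assumption-laden toy: sentences are split on '.', tokens are whitespace-separated words, and the names Sentences, next_sentence and collect_surfaces are hypothetical.

```rust
/// Hypothetical analyzer holding the state for exactly one sentence.
struct Analyzer {
    tokens: Vec<String>,
}

impl Analyzer {
    fn new() -> Self {
        Analyzer { tokens: Vec::new() }
    }

    /// Toy analysis: overwrites the internal buffer with the words of `sentence`.
    fn analyze_sentence(&mut self, sentence: &str) -> &[String] {
        self.tokens.clear();
        self.tokens
            .extend(sentence.trim_end_matches('.').split_whitespace().map(str::to_owned));
        &self.tokens
    }

    fn sentences<'a>(&'a mut self, text: &'a str) -> Sentences<'a> {
        Sentences { analyzer: self, rest: text }
    }
}

/// Iterator-like cursor; it holds the analyzer mutably for its whole life.
struct Sentences<'a> {
    analyzer: &'a mut Analyzer,
    rest: &'a str,
}

impl<'a> Sentences<'a> {
    /// Unlike `Iterator::next`, the returned slice borrows from `self`,
    /// so it must be dropped before the next call.
    fn next_sentence(&mut self) -> Option<&[String]> {
        let text: &'a str = self.rest;
        let text = text.trim_start();
        if text.is_empty() {
            return None;
        }
        // A sentence ends after the first '.', or at the end of input.
        let end = text.find('.').map(|i| i + 1).unwrap_or(text.len());
        let (sentence, tail) = text.split_at(end);
        self.rest = tail;
        Some(self.analyzer.analyze_sentence(sentence))
    }
}

/// Consumes the cursor in a `while let` loop instead of a `for` loop.
fn collect_surfaces(text: &str) -> Vec<String> {
    let mut analyzer = Analyzer::new();
    let mut sentences = analyzer.sentences(text);
    let mut out = Vec::new();
    while let Some(tokens) = sentences.next_sentence() {
        out.push(tokens.join("|"));
    }
    out
}
```

The borrow checker rejects any attempt to keep the tokens of one sentence alive across the next call to next_sentence, which is the guarantee asked for above; the cost is that for loops and iterator adapters are not available.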
Splitting API into sentence splitter / analysis
for sentence in analyzer.split_sentences(line)? {
    let result = analyzer.analyze_sentence(sentence)?;
    for token in result.tokens() {
        // process token
    }
}
Morpheme's part_of_speech should not return an Option of the POS array; it should panic when given an invalid POS id instead.
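A small sketch of that behaviour, assuming a hypothetical Grammar type that holds the POS table and a Morpheme that stores only a POS id. The field names and the panic message are illustrative, not taken from the actual code.

```rust
/// Hypothetical grammar holding the part-of-speech table.
struct Grammar {
    pos_table: Vec<Vec<String>>,
}

/// Hypothetical morpheme that refers to its POS by id.
struct Morpheme<'a> {
    grammar: &'a Grammar,
    pos_id: u16,
}

impl<'a> Morpheme<'a> {
    /// Returns the POS components directly instead of an `Option`.
    ///
    /// Panics if `pos_id` is not present in the grammar, because that
    /// indicates corrupted data or a bug rather than a recoverable condition.
    fn part_of_speech(&self) -> &'a [String] {
        self.grammar
            .pos_table
            .get(self.pos_id as usize)
            .unwrap_or_else(|| panic!("invalid POS id: {}", self.pos_id))
    }
}
```

Callers then never have to unwrap the result in the common case, at the cost of a panic when the id is inconsistent with the dictionary.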