nlprule
Support for AnnotatedText
Hey, thanks for this awesome project! Would you consider adding AnnotatedText support?
This would allow nlprule to spell-check Markdown/Word/HTML/etc. documents converted to the AnnotatedText format (supported by LanguageTool).
Right now I'm thinking about how it could be done, but I can't quite figure out how LanguageTool spell-checks while ignoring the markup and then maps the ranges back to the original document.
I've managed to actually crack it (I think); it's really simple:
- Start at offset 0 and line 0.
- Iterate over the nodes/elements in the annotation.
- Push each node's contents into the text at the current line; additionally, check for a newline (`interpretAs`) and increase the current line count.
- Then add the length of the node to the offset.

The only thing remaining is to iterate over the texts line by line and add each text's offset to nlprule's offset. For example, if nlprule reports a misspelling at offset 30 within a text whose stored offset is 4, you just add 30 + 4; that final offset is consistent with your document length.
Here's an implementation in JS:
let annotatedText = {"annotation": [
  {"text": "A "},
  {"markup": "<b>"},
  {"text": "test"},
  {"markup": "</b>"},
  {"markup": "<p>", "interpretAs": "\n\n"},
  {"text": "Interpret as new line"},
  {"markup": "</p>"}
]}
let offset = 0
let currentLine = 0
let texts = []
let result = ''
annotatedText.annotation.forEach((node) => {
  // result is only for debugging
  result += node.text ? node.text : node.markup
  if (node.interpretAs === '\n\n') {
    currentLine++
  }
  if (node.text) {
    if (!texts[currentLine]) texts[currentLine] = []
    texts[currentLine].push({text: node.text, offset})
  }
  offset += node.text ? node.text.length : node.markup.length
})
console.log(texts)
console.log(result)
Here's the result; you can check that the offsets are correct:
[
  [{ text: 'A', offset: 0 }, { text: 'test', offset: 5 }],
  [{ text: 'Interpret as new line', offset: 16 }]
]
A <b>test</b><p>Interpret as new line</p>
I'm thinking of translating this into Rust now, but I'd need to handle the edge cases first. One that comes to mind is UTF-8.
Hi, I'm not familiar with the AnnotatedText format, can you link to some resources? Is it specific to LanguageTool?
In principle this does sound like a good feature though.
Thanks for the sample implementation, it does seem pretty straightforward.
Hi! Yep, take a look at the LanguageTool HTTP API:
https://languagetool.org/http-api/#!/default/post_check
The annotated text feature in LanguageTool allows you to check documents with markup (HTML/Word/Markdown) without writing parsers.
You only have to convert the text into the annotated text format (using tools that are already available). I'd personally be interested in building annotated text converters so that people don't have to build their own (as in the case of prose-md) if they want to check markup.
Annotated text is just a nice abstraction to allow that.
The workflow would look like this: convert into annotated text using a converter → check with nlprule.
What's even better is that one could take it a step further and build an HTTP server on top of nlprule. That server could then be used as a drop-in replacement for LanguageTool, which I think is a good thing because it would drive more people towards this project.
OK, thanks, I've had a look.
If there is a clean, simple implementation we can support AnnotatedText in the main library. I am currently not actively working on nlprule so a PR would be very welcome.
Regarding the HTTP server. That would be a nice tool but it's not something that should be in the main library, and not something I currently want to work on / maintain - but it would be a good fit for a separate package!
I'd start an annotatedtext crate for building and parsing annotated text. Then I'd try to do spell-checking using nlprule and think about how to add it to the library.
Totally agree that the HTTP server shouldn't be included in the library. Anyway, I will report progress here.
I have finished my AnnotatedText library for Rust and am ready to test it with nlprule; I will publish it as soon as I get them working together.
In my original implementation above I overlooked a bug:
texts[currentLine] = {text: node.text, offset}
should actually be
texts[currentLine].push({text: node.text, offset})
Result
texts = [
  [{ text: 'A', offset: 0 }, { text: 'test', offset: 5 }],
  [{ text: 'Interpret as new line', offset: 16 }]
]
Then you can get your sentences line by line:
texts.map(line => line.map(t => t.text).join(''))
(I have updated my reference code above.)
Here's how you'd do the same thing using the Rust library:
fn main() {
    let example = r#"
    {"annotation": [
        {"text": "A "},
        {"markup": "<b>"},
        {"text": "test"},
        {"markup": "</b>"},
        {"markup": "<p>", "interpretAs": "\n\n"},
        {"text": "Interpret as new line"},
        {"markup": "</p>"}
    ]}"#;
    let annotation = lib::Annotation::from_str(&example).unwrap();
    let result = annotation.to_text_tree();
    let r = result[0].iter().cloned().map(|r| r.text)
        .collect::<Vec<_>>()
        .join("");
    println!("{}", r)
}
Also, I still don't know whether the offset should be expressed in bytes or in chars (currently it's in bytes). Maybe you have an opinion on that?
Hi, sorry for the late response.
It is probably best to keep track of both and return a Position. That way it is compatible with the Python bindings / LT API (where counting in characters is natural) and with the Rust API (where counting in bytes is natural).
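As a sketch of what tracking both could look like (the `Position` struct and `position_at` helper here are illustrative, not nlprule's actual API): given a byte offset that falls on a char boundary, the char offset can be derived by counting the chars of the prefix.

```rust
// Illustrative sketch: pair a byte offset with a char offset, similar in
// spirit to what a Position type could carry. `position_at` assumes
// `byte_offset` falls on a UTF-8 char boundary.
#[derive(Debug, PartialEq)]
pub struct Position {
    pub byte: usize,
    pub char: usize,
}

pub fn position_at(text: &str, byte_offset: usize) -> Position {
    Position {
        byte: byte_offset,
        // Count the chars in the prefix; O(n), fine for occasional lookups.
        char: text[..byte_offset].chars().count(),
    }
}

fn main() {
    let text = "naïve test"; // 'ï' is 2 bytes but 1 char
    let pos = position_at(text, 7); // byte offset where "test" starts
    println!("{:?}", pos); // byte: 7, char: 6
}
```

The two counts diverge as soon as any non-ASCII char appears before the match, which is exactly the UTF-8 edge case mentioned earlier in the thread.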
Hey, here's the code so far:
use std::{str::FromStr, collections::HashMap, ops::Range};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
pub struct Annotation {
    pub annotation: Vec<AnnotatedText>
}

#[derive(Serialize, Deserialize, Debug)]
pub struct AnnotatedText {
    pub text: Option<String>,
    pub markup: Option<String>,
    pub interpretAs: Option<String>
}

pub type AnnotatedTextMap = HashMap<usize, String>;

impl FromStr for Annotation {
    type Err = serde_json::error::Error;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        return serde_json::from_str(&s);
    }
}

impl ToString for Annotation {
    fn to_string(&self) -> String {
        let mut result: String = "".to_string();
        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), .. } => result += &text,
                AnnotatedText { interpretAs: Some(interpretAs), .. } => result += &interpretAs,
                _ => ()
            }
        });
        return result;
    }
}

impl Annotation {
    pub fn to_original(&self) -> String {
        let mut result: String = "".to_string();
        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), .. } => result += &text,
                AnnotatedText { markup: Some(markup), .. } => result += &markup,
                _ => ()
            }
        });
        return result;
    }

    pub fn to_text_map(&self) -> AnnotatedTextMap {
        let mut offset: usize = 0;
        let mut map = HashMap::new();
        let _terminator = String::from("\n\n");
        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), .. } => {
                    map.insert(offset, text.clone());
                    offset += text.len()
                },
                // AnnotatedText { interpretAs: Some(_terminator), .. } => offset += _terminator.len(),
                AnnotatedText { markup: Some(markup), .. } => offset += markup.len(),
                _ => ()
            }
        });
        return map
    }

    pub fn find_original_position(&self, text_position: Range<usize>) -> Range<usize> {
        let text_map = self.to_text_map();
        let mut min_distance = usize::MAX;
        let mut best_match: usize = 0;
        text_map.iter().for_each(|(key, _value)| {
            let closest_position = *key;
            if text_position.start <= closest_position {
                let distance = closest_position - text_position.start;
                if distance < min_distance {
                    best_match = *key;
                    min_distance = distance;
                }
            }
        });
        return best_match + text_position.start..text_position.end
    }
}
I decided the best approach would be to port the Java implementation to Rust.
I'll leave the code here so that someone could take it and reimplement the find_original_position function to find the correct range.
The function takes the AnnotatedText, converts it to an offset map, and returns the position in the original document corresponding to the given plain-text position.
The LanguageTool reference source can be found here: https://github.com/languagetool-org/languagetool/blob/b5f85984ea2fcbce8b64da1d88fc701528810a13/languagetool-core/src/main/java/org/languagetool/markup/AnnotatedText.java#L109-L141
The issue with find_original_position right now is that the end position is incorrect.
I can't really fix it at the moment: I don't know Java and don't fully understand the algorithm, and on top of that I haven't completely figured out how borrowing works in Rust.
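For anyone picking this up: one possible approach (a sketch under my own assumptions, not a verified port of the Java code) is to walk the annotation nodes once, tracking the plain-text and original byte offsets in parallel, and resolve an offset when it falls inside a text node. The `Node` enum and `to_original_offset` name here are illustrative.

```rust
// Sketch: map a plain-text byte offset to the original document by a
// single walk over the annotation nodes. Illustrative, not the
// LanguageTool algorithm verbatim.
enum Node {
    Text(&'static str),
    Markup(&'static str),
}

fn to_original_offset(nodes: &[Node], plain_offset: usize) -> usize {
    let mut plain = 0; // position in the markup-stripped text
    let mut orig = 0;  // position in the original document
    for node in nodes {
        match node {
            Node::Text(t) => {
                // `>` so that an offset on a boundary resolves to the
                // next text node, i.e. after any markup in between.
                if plain + t.len() > plain_offset {
                    return orig + (plain_offset - plain);
                }
                plain += t.len();
                orig += t.len();
            }
            Node::Markup(m) => orig += m.len(),
        }
    }
    orig
}

fn main() {
    // Plain text: "She was not been here since "
    let nodes = [
        Node::Markup("<h1>"),
        Node::Text("She was "),
        Node::Markup("<span>"),
        Node::Text("not been here since "),
    ];
    // "not" is at plain offset 8; in the original it sits after
    // "<h1>" (4 bytes) and "<span>" (6 bytes): 8 + 4 + 6 = 18.
    println!("{}", to_original_offset(&nodes, 8)); // 18
}
```

This sidesteps the offset-map lookup entirely; whether a single walk per lookup is fast enough depends on how many suggestions are being mapped.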
I decided to give it another try yesterday.
Changelog
- BTreeMap instead of HashMap (sorted keys)
- The closest position takes the sentence length into account for more precision
- Some progress on the end range calculation
What doesn't work
The end range is still incorrect. Example program output:
HTML: <h1>She was <span>not been here since </span><b>Monday</b></h1>
Text: She was not been here since Monday
Text Range: 4..16
Original Range: 8..20
Snippet: "was <span>no"
LanguageTool calculates the end range correctly, taking the <span> into account.
Code
lib.rs
use std::{str::FromStr, collections::BTreeMap, ops::Range};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
pub struct Annotation {
    pub annotation: Vec<AnnotatedText>
}

#[derive(Serialize, Deserialize, Debug)]
pub struct AnnotatedText {
    pub text: Option<String>,
    pub markup: Option<String>,
    pub interpretAs: Option<String>
}

pub type AnnotatedTextMap = BTreeMap<usize, String>;

impl FromStr for Annotation {
    type Err = serde_json::error::Error;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        return serde_json::from_str(&s);
    }
}

impl ToString for Annotation {
    fn to_string(&self) -> String {
        let mut result: String = "".to_string();
        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), .. } => result += &text,
                AnnotatedText { interpretAs: Some(interpretAs), .. } => result += &interpretAs,
                _ => ()
            }
        });
        return result;
    }
}

impl Annotation {
    pub fn to_original(&self) -> String {
        let mut result: String = "".to_string();
        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), .. } => result += &text,
                AnnotatedText { markup: Some(markup), .. } => result += &markup,
                _ => ()
            }
        });
        return result;
    }

    pub fn to_text_map(&self) -> AnnotatedTextMap {
        let mut offset: usize = 0;
        let mut map = BTreeMap::new();
        let _terminator = String::from("\n\n");
        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), .. } => {
                    map.insert(offset, text.clone());
                    offset += text.len()
                },
                // AnnotatedText { interpretAs: Some(_terminator), .. } => offset += _terminator.len(),
                AnnotatedText { markup: Some(markup), .. } => offset += markup.len(),
                _ => ()
            }
        });
        return map
    }

    pub fn find_original_position(&self, text_position: &Range<usize>) -> Range<usize> {
        let text_map = self.to_text_map();
        let mut min_distance = usize::MAX;
        let mut best_match: usize = 0;
        text_map.iter().for_each(|(key, value)| {
            let closest_position = *key + value.len();
            if text_position.start <= closest_position {
                let distance = closest_position - text_position.start;
                if distance < min_distance {
                    best_match = *key;
                    min_distance = distance;
                }
            }
        });
        return best_match + text_position.start..best_match + text_position.end
    }
}
main.rs
mod lib;
use std::str::FromStr;
use nlprule::{Tokenizer, Rules};
fn main() {
    // This example doesn't work correctly
    let example = r#"
    {"annotation": [
        {"markup": "<h1>"},
        {"text": "She was "},
        {"markup": "<span>"},
        {"text": "not been here since "},
        {"markup": "</span>"},
        {"markup": "<b>"},
        {"text": "Monday"},
        {"markup": "</b>"},
        {"markup": "</h1>"}
    ]}"#;
    // let example = r#"
    // {"annotation": [
    //     {"text": "She was "},
    //     {"text": "not been here since "},
    //     {"markup": "<b>"},
    //     {"text": "Monday"},
    //     {"markup": "</b>"}
    // ]}"#;
    let tokenizer = Tokenizer::new("./en_tokenizer.bin").unwrap();
    let rules = Rules::new("./en_rules.bin").unwrap();
    let annotation = lib::Annotation::from_str(&example).unwrap();
    let text = annotation.to_string();
    let original = annotation.to_original();
    let suggestions = rules.suggest(&text, &tokenizer);
    let original_range = suggestions[0].span().byte();
    let result = annotation.find_original_position(&original_range);
    println!("HTML: {}", &original);
    println!("Text: {}", &text);
    println!("Text Range: {:?}", &original_range);
    println!("Original Range: {:?}", &result);
    println!("Snippet: {:?}", &original[result])
}
The only unsolved question right now is the end range.
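One idea for that (a hedged sketch, not verified against the Java code): map the start and end of the range independently, so any markup sitting between the two endpoints is counted. The `map_offset` name and the segment tuples are illustrative; with the segments from the failing example above, plain range 4..16 then maps to an original range whose snippet includes the `<span>`.

```rust
// Sketch: map range endpoints independently. Each segment is
// (plain_start, orig_start, len) for one text node; offsets are bytes.
// Boundary handling differs for ends: an exclusive end on a boundary
// should resolve to the *earlier* segment, a start to the *later* one.
fn map_offset(segments: &[(usize, usize, usize)], offset: usize, is_end: bool) -> usize {
    for &(plain, orig, len) in segments {
        let inside = if is_end {
            offset > plain && offset <= plain + len
        } else {
            offset >= plain && offset < plain + len
        };
        if inside {
            return orig + (offset - plain);
        }
    }
    offset
}

fn main() {
    // <h1>She was <span>not been here since </span><b>Monday</b></h1>
    let segments = [
        (0, 4, 8),   // "She was "
        (8, 18, 20), // "not been here since "
        (28, 48, 6), // "Monday"
    ];
    // Text range 4..16 ("was not been") from the example above:
    let start = map_offset(&segments, 4, false);
    let end = map_offset(&segments, 16, true);
    println!("{}..{}", start, end); // 8..26, spanning the "<span>"
}
```

The current find_original_position adds the same best_match to both endpoints, which can only be right when both fall inside the same text segment; two independent lookups would cover the `<span>` case in the example output.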