Support for AnnotatedText

Open mishushakov opened this issue 2 years ago • 10 comments

hey, thanks for this awesome project! would you consider adding AnnotatedText support?

this would allow nlprule to be used to spell-check markdown/word/html/etc. documents converted to the AnnotatedText format (supported by LanguageTool)

right now i'm thinking about how it could be done, but i can't quite figure out how LanguageTool can spell-check while ignoring the markup and then map the ranges back to the original document

mishushakov · Apr 28 '22 13:04

i've managed to actually crack the code (i think). it's really easy: you start at offset 0 and line 0, iterate the nodes/elements in the annotation, and push each node's text into the texts for the current line. additionally you check for a new line (interpretAs "\n\n") and increase the current line count, then you add the length of the node to the offset

the only thing remaining is to iterate over the texts line-by-line and add each text's offset to nlprule's offset. for example, if nlprule reports a misspelling at offset 30 and the text fragment it occurred in starts at offset 4, you just add 30 + 4; this is your final offset, and it is valid within the original document

here's an implementation in js:

let annotatedText = {"annotation": [
  {"text": "A "},
  {"markup": "<b>"},
  {"text": "test"},
  {"markup": "</b>"},
  {"markup": "<p>", "interpretAs": "\n\n"},
  {"text": "Interpret as new line"},
  {"markup": "</p>"}
]}

let offset = 0
let currentLine = 0
let texts = []
let result = ''

annotatedText.annotation.forEach((node) => {
  // result is only for debugging: it reconstructs the original document
  result += node.text ? node.text : node.markup

  // a markup node interpreted as "\n\n" starts a new line
  if (node.interpretAs === '\n\n') {
    currentLine++
  }

  // collect text nodes per line, together with their offset in the original document
  if (node.text) {
    if (!texts[currentLine]) texts[currentLine] = []
    texts[currentLine].push({text: node.text, offset})
  }

  // advance the offset by whatever the node occupies in the original document
  offset += node.text ? node.text.length : node.markup.length
})

console.log(texts)
console.log(result)

here's the result; you can check that the offsets are correct:

[
 [{ text: 'A ', offset: 0 }, { text: 'test', offset: 5 }],
 [{ text: 'Interpret as new line', offset: 16 }]
]
A <b>test</b><p>Interpret as new line</p>

i'm thinking of translating this into Rust now, but i'd need to make sure about the edge cases first. one that i can think of is UTF-8.
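as a first pass, here's a rough Rust sketch of the same logic (throwaway types, not the final API). note that .len() counts bytes, which is exactly where the UTF-8 question comes in:

#[derive(Debug)]
struct Node {
    text: Option<String>,
    markup: Option<String>,
    interpret_as: Option<String>,
}

#[derive(Debug)]
struct Fragment {
    text: String,
    offset: usize, // byte offset into the original (markup) document
}

fn to_fragments(nodes: &[Node]) -> Vec<Vec<Fragment>> {
    let mut offset = 0;
    let mut lines: Vec<Vec<Fragment>> = vec![Vec::new()];

    for node in nodes {
        // a node interpreted as "\n\n" starts a new line
        if node.interpret_as.as_deref() == Some("\n\n") {
            lines.push(Vec::new());
        }

        // collect text nodes per line, together with their offset
        if let Some(text) = &node.text {
            lines.last_mut().unwrap().push(Fragment { text: text.clone(), offset });
        }

        // advance by whatever the node occupies in the original document;
        // .len() counts bytes, .chars().count() would count characters
        let original = node.text.as_deref().or(node.markup.as_deref()).unwrap_or("");
        offset += original.len();
    }

    lines
}

for the example above this produces the same fragments as the JS version, with offsets 0, 5 and 16 (bytes and chars happen to agree here because everything is ASCII)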

mishushakov · Apr 29 '22 17:04

Hi, I'm not familiar with the AnnotatedText format; can you link some resources? Is this specific to LanguageTool?

In principle this does sound like a good feature though.

Thanks for the sample implementation, it does seem pretty straightforward.

bminixhofer · Apr 29 '22 17:04

hi! yep, take a look at the LanguageTool HTTP API:

https://languagetool.org/http-api/#!/default/post_check

the annotated text feature in LanguageTool allows you to check documents with markup (html/word/markdown) without writing your own parsers

you only have to convert the text into the annotated text format (using tools that are already available). i'd personally be interested in building annotated text converters so that people don't have to build their own (like in the case of prose-md) if they want to check markup

annotated text is just a nice abstraction to allow that
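for example, a markdown converter could (hypothetically) turn A **test** into something like:

{"annotation": [
  {"text": "A "},
  {"markup": "**"},
  {"text": "test"},
  {"markup": "**"}
]}

nlprule would then only see "A test", and the offsets of any match can still be mapped back to the markdown source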

mishushakov · Apr 29 '22 17:04

the workflow would look like this: convert into annotated text using a converter > check with nlprule

what's even better is that one could take it a step further and build an HTTP server on top of nlprule. the HTTP server could then be used as a drop-in replacement for LanguageTool, which i think is a good thing, because it would drive more people towards this project

mishushakov · Apr 29 '22 17:04

OK, thanks, I've had a look.

If there is a clean, simple implementation we can support AnnotatedText in the main library. I am currently not actively working on nlprule so a PR would be very welcome.

Regarding the HTTP server: that would be a nice tool, but it's not something that should be in the main library, and not something I currently want to work on / maintain. It would be a good fit for a separate package, though!

bminixhofer · Apr 29 '22 17:04

i'd start an annotatedtext crate for building and parsing AnnotatedText. then i'd try to do spell-checking using nlprule and think about how to add it to the library

totally agree that the http server shouldn't be included in the library. anyways, i will report the progress here

mishushakov · Apr 29 '22 18:04

I have finished my AnnotatedText library for Rust and am ready to test it with nlprule. i will publish it as soon as i get them working together

in the original implementation i overlooked a small bug:

texts[currentLine] = {text: node.text, offset}

should actually be

texts[currentLine].push({text: node.text, offset})

Result

texts = [
 [{ text: 'A ', offset: 0 }, { text: 'test', offset: 5 }],
 [{ text: 'Interpret as new line', offset: 16 }]
]

then you can get your sentences line-by-line

texts.map(line => line.map(t => t.text).join(''))

(i have updated my reference code above)

here's how you'd do the same thing using the Rust library:

mod lib;
use std::str::FromStr;

fn main() {
    let example = r#"
    {"annotation": [
        {"text": "A "},
        {"markup": "<b>"},
        {"text": "test"},
        {"markup": "</b>"},
        {"markup": "<p>", "interpretAs": "\n\n"},
        {"text": "Interpret as new line"},
        {"markup": "</p>"}
    ]}"#;

    let annotation = lib::Annotation::from_str(example).unwrap();
    let result = annotation.to_text_tree();

    // join the text fragments of the first line back into a sentence
    let r = result[0]
        .iter()
        .cloned()
        .map(|r| r.text)
        .collect::<Vec<_>>()
        .join("");

    println!("{}", r);
}

also, i still don't know whether the offset should be expressed in bytes or in chars (currently it's in bytes), maybe you have an opinion on that?

mishushakov · May 01 '22 20:05

Hi, sorry for the late response.

It is probably best to keep track of both and return a Position. That way it is compatible with the Python bindings / LT API (where counting in characters is natural) and with the Rust API (where counting in bytes is natural).
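For illustration, roughly something like this (not necessarily nlprule's exact Position type, just to show keeping both units in sync):

#[derive(Debug, Clone, Copy, Default)]
pub struct Position {
    pub byte: usize,
    pub char: usize,
}

impl Position {
    // advance past a piece of text, keeping both units in sync
    pub fn advance(&mut self, s: &str) {
        self.byte += s.len();           // UTF-8 bytes
        self.char += s.chars().count(); // Unicode scalar values
    }
}

fn main() {
    let mut pos = Position::default();
    pos.advance("Ä "); // 'Ä' is 2 bytes but 1 char
    assert_eq!((pos.byte, pos.char), (3, 2));
}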

bminixhofer · May 13 '22 16:05

Hey, here's the code so far:

use std::{str::FromStr, collections::HashMap, ops::Range};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
pub struct Annotation {
    pub annotation: Vec<AnnotatedText>
}

#[derive(Serialize, Deserialize, Debug)]
pub struct AnnotatedText {
    pub text: Option<String>,
    pub markup: Option<String>,
    pub interpretAs: Option<String>
}

pub type AnnotatedTextMap = HashMap<usize, String>;

impl FromStr for Annotation {
    type Err = serde_json::error::Error;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        return serde_json::from_str(&s);
    }
}

impl ToString for Annotation {
    fn to_string(&self) -> String {
        let mut result: String = "".to_string();

        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), ..} => result += &text,
                AnnotatedText { interpretAs: Some(interpretAs), ..} => result += &interpretAs,
                _ => ()
            }
        });

        return result;
    }
}

impl Annotation {
    pub fn to_original(&self) -> String {
        let mut result: String = "".to_string();

        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), ..} => result += &text,
                AnnotatedText { markup: Some(markup), ..} => result += &markup,
                _ => ()
            }
        });

        return result;
    }

    pub fn to_text_map(&self) -> AnnotatedTextMap {
        let mut offset: usize = 0;
        let mut map = HashMap::new();
        let _terminator = String::from("\n\n");

        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), ..} => {
                    map.insert(offset, text.clone());
                    offset += text.len()
                },
                // AnnotatedText { interpretAs: Some(_terminator), ..} => offset += _terminator.len(),
                AnnotatedText { markup: Some(markup), ..} => offset += markup.len(),
                _ => ()
            }
        });

        return map
    }

    pub fn find_original_position(&self, text_position: Range<usize>) -> Range<usize> {
        let text_map = self.to_text_map();
        let mut min_distance = usize::MAX;
        let mut best_match: usize = 0;

        text_map.iter().for_each(|(key, _value)| {
            let closest_position = *key;
            if text_position.start <= closest_position {
                let distance = closest_position - text_position.start;
                if distance < min_distance {
                    best_match = *key;
                    min_distance = distance;
                }
            }
        });

        return best_match + text_position.start..text_position.end
    }
}

i decided the best approach would be to just copy the Java implementation to Rust

i'll leave the code here, so maybe someone could take it and reimplement the find_original_position function to find the correct range

the function takes the AnnotatedText, converts it to an offset map and returns the original position for a given plain-text position
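to illustrate, for the example annotation from earlier in the thread the text map looks like this (keys are byte offsets in the original markup document):

annotation.to_text_map()
// => {0: "A ", 5: "test", 16: "Interpret as new line"}

find_original_position then picks the closest key and shifts the plain-text range by it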

LanguageTool reference source can be found here: https://github.com/languagetool-org/languagetool/blob/b5f85984ea2fcbce8b64da1d88fc701528810a13/languagetool-core/src/main/java/org/languagetool/markup/AnnotatedText.java#L109-L141

the issue with find_original_position right now is that the end position is incorrect

i can't really fix it right now, because i don't know Java and don't really understand the algorithm; on top of that, i haven't completely figured out how borrowing works in Rust

mishushakov · May 16 '22 14:05

I decided to give it another try yesterday.

Changelog

  • B-tree instead of HashMap (sorted)
  • Closest position takes the sentence length into account for more precision
  • Some progress in end range calculation

What doesn't work

The end range is still incorrect. Example program output:

HTML: <h1>She was <span>not been here since </span><b>Monday</b></h1>
Text: She was not been here since Monday
Text Range: 4..16
Original Range: 8..20
Snippet: "was <span>no"

LanguageTool calculates the end range correctly, taking the <span> into account

Code

lib.rs

use std::{str::FromStr, collections::BTreeMap, ops::Range};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
pub struct Annotation {
    pub annotation: Vec<AnnotatedText>
}

#[derive(Serialize, Deserialize, Debug)]
pub struct AnnotatedText {
    pub text: Option<String>,
    pub markup: Option<String>,
    pub interpretAs: Option<String>
}

pub type AnnotatedTextMap = BTreeMap<usize, String>;

impl FromStr for Annotation {
    type Err = serde_json::error::Error;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        return serde_json::from_str(&s);
    }
}

impl ToString for Annotation {
    fn to_string(&self) -> String {
        let mut result: String = "".to_string();

        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), ..} => result += &text,
                AnnotatedText { interpretAs: Some(interpretAs), ..} => result += &interpretAs,
                _ => ()
            }
        });

        return result;
    }
}

impl Annotation {
    pub fn to_original(&self) -> String {
        let mut result: String = "".to_string();

        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), ..} => result += &text,
                AnnotatedText { markup: Some(markup), ..} => result += &markup,
                _ => ()
            }
        });

        return result;
    }

    /// Maps the byte offset of each text fragment in the original (markup)
    /// document to the fragment itself, sorted by offset.
    pub fn to_text_map(&self) -> AnnotatedTextMap {
        let mut offset: usize = 0;
        let mut map = BTreeMap::new();
        let _terminator = String::from("\n\n");

        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), ..} => {
                    map.insert(offset, text.clone());
                    offset += text.len()
                },
                // AnnotatedText { interpretAs: Some(_terminator), ..} => offset += _terminator.len(),
                AnnotatedText { markup: Some(markup), ..} => offset += markup.len(),
                _ => ()
            }
        });

        return map
    }

    /// Translates a range in the plain text back into a range in the original
    /// document by finding the closest text fragment and shifting by its offset.
    pub fn find_original_position(&self, text_position: &Range<usize>) -> Range<usize> {
        let text_map = self.to_text_map();
        let mut min_distance = usize::MAX;
        let mut best_match: usize = 0;

        // find the fragment whose end is closest to (but not before) the start
        text_map.iter().for_each(|(key, value)| {
            let closest_position = *key + value.len();
            if text_position.start <= closest_position {
                let distance = closest_position - text_position.start;
                if distance < min_distance {
                    best_match = *key;
                    min_distance = distance;
                }
            }
        });

        // the same shift is applied to start and end, so markup inside the
        // range is not accounted for (this is the open end-range issue)
        return best_match + text_position.start..best_match + text_position.end
    }
}

main.rs

mod lib;
use std::str::FromStr;
use nlprule::{Tokenizer, Rules};

fn main () {
    // This example doesn't work correctly
    let example = r#"
    {"annotation": [
      {"markup": "<h1>"},
      {"text": "She was "},
      {"markup": "<span>"},
      {"text": "not been here since "},
      {"markup": "</span>"},
      {"markup": "<b>"},
      {"text": "Monday"},
      {"markup": "</b>"},
      {"markup": "</h1>"}
    ]}"#;

    // let example = r#"
    // {"annotation": [
    //   {"text": "She was "},
    //   {"text": "not been here since "},
    //   {"markup": "<b>"},
    //   {"text": "Monday"},
    //   {"markup": "</b>"}
    // ]}"#;

    let tokenizer = Tokenizer::new("./en_tokenizer.bin").unwrap();
    let rules = Rules::new("./en_rules.bin").unwrap();

    let annotation = lib::Annotation::from_str(&example).unwrap();
    let text = annotation.to_string();
    let original = annotation.to_original();
    let suggestions = rules.suggest(&text, &tokenizer);

    let original_range = suggestions[0].span().byte();
    let result = annotation.find_original_position(&original_range);
    println!("HTML: {}", &original);
    println!("Text: {}", &text);
    println!("Text Range: {:?}", &original_range);
    println!("Original Range: {:?}", &result);
    println!("Snippet: {:?}", &original[result])
}

The only question unsolved right now is the end range
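one idea i want to try next: translate the start and the end independently, since a different amount of markup can precede each of them. working the example through by hand, the start 4 falls into the "She was " fragment (original offset 4, plain offset 0), so it maps to 4 + 4 = 8; the end 16 falls into the "not been here since " fragment (original offset 18, plain offset 8), so it should map to 18 + (16 - 8) = 26. the expected original range would then be 8..26 ("was <span>not been") instead of 8..20. roughly something like this, slotted into the impl above (untested, so treat it as a starting point):

    pub fn find_original_position(&self, text_position: &Range<usize>) -> Range<usize> {
        // (plain offset, original offset, length) for every text fragment
        let mut fragments: Vec<(usize, usize, usize)> = Vec::new();
        let mut plain = 0;
        let mut original = 0;

        for annotated_text in &self.annotation {
            match annotated_text {
                AnnotatedText { text: Some(text), .. } => {
                    fragments.push((plain, original, text.len()));
                    plain += text.len();
                    original += text.len();
                },
                AnnotatedText { markup: Some(markup), interpretAs, .. } => {
                    // interpretAs shows up in the plain text, the markup in the original
                    if let Some(interpret_as) = interpretAs {
                        plain += interpret_as.len();
                    }
                    original += markup.len();
                },
                _ => ()
            }
        }

        // translate a single plain-text offset into an original-document offset
        let translate = |position: usize| -> usize {
            for (plain_offset, original_offset, length) in &fragments {
                if position >= *plain_offset && position <= plain_offset + length {
                    return original_offset + (position - plain_offset);
                }
            }
            position
        };

        translate(text_position.start)..translate(text_position.end)
    }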

mishushakov · May 23 '22 12:05