
Is there a good way?

Open jellybobbin opened this issue 3 years ago • 3 comments

enum Tokens<'a>{
    Words(&'a str),
    Spaces(usize),
    Return,
    NewLine
}

let input = "\"H\\u{65}llo \\u{20} rust\\n\"";

I want this result:

vec![
    Tokens::Words("Hello"),
    Tokens::Spaces(3),
    Tokens::Words("rust"),
    Tokens::NewLine,
]

Is this feasible? Here is the simplified code:

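use nom::{
    branch::alt,
    bytes::complete::take_while_m_n,
    character::complete::{alpha1, char},
    combinator::{map, map_opt, map_res, value},
    error::{FromExternalError, ParseError},
    multi::many1,
    sequence::{delimited, preceded},
    IResult,
};

pub fn parse_token(input: &str) -> IResult<&str, Vec<Tokens>>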
{
    many1(get_token)(input)
}

fn get_token(input: &str) -> IResult<&str, Tokens>
{
    alt((
        // This only returns Tokens::Words("H").
        map(alpha1, Tokens::Words),
        // This only returns a single char, which does not work well with `&str`.
        map(parse_escaped_char, Tokens::CJKString),
    ))(input)
}

pub fn parse_escaped_char<'a, E>(input: &'a str) -> IResult<&'a str, char, E>
where
  E: ParseError<&'a str> + FromExternalError<&'a str, std::num::ParseIntError>,
{
    preceded(
        char('\\'),
        alt((
            parse_unicode,
            value('\n', char('n')),
        )),
    )(input)
}

fn parse_unicode<'a, E>(input: &'a str) -> IResult<&'a str, char, E>
where
  E: ParseError<&'a str> + FromExternalError<&'a str, std::num::ParseIntError>,
{
    let parse_hex = take_while_m_n(1, 6, |c: char| c.is_ascii_hexdigit());

    let parse_delimited_hex = preceded(
        char('u'),
        delimited(char('{'), parse_hex, char('}')),
    );

    let parse_u32 = map_res(parse_delimited_hex, move |hex| u32::from_str_radix(hex, 16));

    map_opt(parse_u32, |value| std::char::from_u32(value))(input)
}

I'll close it as soon as possible, thx!!!

jellybobbin, Jan 23 '22 09:01

I think the main problem is that your Tokens::Words contains a &str, which means it references a direct slice of the input. That's not what you want, though: you want to apply transformations to the input (unescaping Unicode escapes), so you'll have to copy the data into a String.
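To illustrate the point about copying into a String: the sketch below is not from the thread and deliberately avoids nom, using only the standard library. It assumes an owned variant of the enum (`Words(String)`, with `Return` left out), assumes the surrounding quotes have already been stripped, and uses two hypothetical helpers, `decode_char` and `tokenize`. Each character is decoded first and only then appended to the previous token when the kinds match, so escaped and literal characters end up in the same word.

```rust
// Assumption: an owned variant of the enum from the question.
#[derive(Debug, PartialEq)]
enum Tokens {
    Words(String),
    Spaces(usize),
    NewLine,
}

/// Hypothetical helper: decodes one character at the start of `input`,
/// resolving the `\n` and `\u{XXXX}` escapes. Returns the decoded char
/// and the remaining input, or `None` on an unknown or malformed escape.
fn decode_char(input: &str) -> Option<(char, &str)> {
    let mut chars = input.chars();
    let first = chars.next()?;
    if first != '\\' {
        // A literal character: pass it through unchanged.
        return Some((first, chars.as_str()));
    }
    let escape = chars.as_str();
    if let Some(rest) = escape.strip_prefix('n') {
        Some(('\n', rest))
    } else if let Some(rest) = escape.strip_prefix("u{") {
        let end = rest.find('}')?;
        let code = u32::from_str_radix(&rest[..end], 16).ok()?;
        Some((char::from_u32(code)?, &rest[end + 1..]))
    } else {
        None
    }
}

/// Hypothetical helper: groups decoded characters into tokens, extending
/// the previous token whenever the next character is of the same kind.
fn tokenize(mut input: &str) -> Option<Vec<Tokens>> {
    let mut tokens: Vec<Tokens> = Vec::new();
    while !input.is_empty() {
        let (c, rest) = decode_char(input)?;
        input = rest;
        if c == '\n' {
            tokens.push(Tokens::NewLine);
        } else if c == ' ' {
            match tokens.last_mut() {
                Some(Tokens::Spaces(count)) => *count += 1,
                _ => tokens.push(Tokens::Spaces(1)),
            }
        } else {
            match tokens.last_mut() {
                // The word owns its data, so a decoded char can be appended.
                Some(Tokens::Words(word)) => word.push(c),
                _ => tokens.push(Tokens::Words(c.to_string())),
            }
        }
    }
    Some(tokens)
}
```

Because the escapes are decoded before grouping, `tokenize("H\\u{65}llo \\u{20} rust\\n")` should yield the four tokens asked for in the question. In nom, a similar accumulation could probably be written with a folding combinator such as `fold_many1`, though that variant is not tested here.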

Xiretza, Jan 23 '22 09:01

@Xiretza

Even if I don't use &str, the parser returns a char. When the input contains escaped Unicode, the pieces do not become one continuous string.

With let input = "\"H\\u{65}llo \\u{20} rust\\n\""; I want to get:

vec![
    Tokens::Words(String::from("Hello")),
    Tokens::Spaces(3),
    Tokens::Words(String::from("rust")),
    Tokens::NewLine,
]

but not:

vec![
    Tokens::Words(String::from("H")),
    Tokens::Words(String::from("e")),
    Tokens::Words(String::from("llo")),
    Tokens::Spaces(1),
    Tokens::Spaces(1),
    Tokens::Spaces(1),
    Tokens::Words(String::from("rust")),
    Tokens::NewLine,
]

jellybobbin, Jan 23 '22 10:01

You can do post-parsing transformations on it, for instance
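One possible shape for such a post-parsing pass, as a sketch rather than anything posted in the thread: it assumes the owned `Words(String)` variant of the enum and a hypothetical function `merge_adjacent` that runs over the fragmented token list after parsing, joining neighbouring `Words` and summing neighbouring `Spaces`.

```rust
// Assumption: an owned variant of the enum from the question.
#[derive(Debug, PartialEq)]
enum Tokens {
    Words(String),
    Spaces(usize),
    NewLine,
}

/// Hypothetical post-parsing pass: merges runs of adjacent `Words`
/// and adjacent `Spaces` tokens; every other token is kept as-is.
fn merge_adjacent(tokens: Vec<Tokens>) -> Vec<Tokens> {
    let mut merged: Vec<Tokens> = Vec::with_capacity(tokens.len());
    for token in tokens {
        // Try to fold the token into the previous one of the same kind.
        let absorbed = match (merged.last_mut(), &token) {
            (Some(Tokens::Words(prev)), Tokens::Words(next)) => {
                prev.push_str(next);
                true
            }
            (Some(Tokens::Spaces(prev)), Tokens::Spaces(next)) => {
                *prev += *next;
                true
            }
            _ => false,
        };
        if !absorbed {
            merged.push(token);
        }
    }
    merged
}
```

Applied to the fragmented list from the previous comment, this pass should collapse `"H"`, `"e"`, `"llo"` into one `Words` token and the three `Spaces(1)` into `Spaces(3)`. The trade-off is an extra walk over the tokens after parsing, in exchange for leaving the per-character parsers untouched.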

svelterust, Jan 23 '22 23:01