chumsky icon indicating copy to clipboard operation
chumsky copied to clipboard

Feature-Request: Apply parsers in any order

Open taladar opened this issue 2 years ago • 7 comments

An example use case would be LDAP schema as retrieved via LDAP

( 1.3.6.1.4.1.1466.115.121.1.8 DESC 'Certificate' X-BINARY-TRANSFER-REQUIRED 'TRUE' X-NOT-HUMAN-READABLE 'TRUE' )

which I want to parse into

pub struct LDAPSyntax {
    pub oid: ObjectIdentifier,
    pub desc: String,
    pub x_binary_transfer_required: bool,
    pub x_not_human_readable: bool,
}

The tags (DESC, X-BINARY-TRANSFER-REQUIRED,...) here can appear in any order and some of them can be optional (with or without a default value) but others are required (DESC in this case) for a given type of schema entry (the above is one for an LDAP-Syntax). Some have values after the tag and some do not (but a given tag always has a value or never has one). So what we really want is to express the form of each tag as a parser.

I can of course build a parser for a specific entry by chaining or() and then making it repeated() and then manually sifting through the results to create custom errors if required tags are missing but it feels like the general problem of having things to parse that are made from components that are required to exist but in an unspecified order could benefit from a general solution in chumsky directly, maybe something that could take a parameter

vec![ Foobar::Required(parser1),
        Foobar::Required(parser2),
        Foobar::Optional(parser3),
      ]

Maybe the API could even optionally allow the use of a Builder (like the ones created by derive-builder) pattern for the result type.

Alternatively one could return a vector sorted the same way as the input vector with a value for required parsers and Option for optional ones. Then one would probably want to require all the parsers to return the same type.

taladar avatar Dec 16 '21 11:12 taladar

Thanks for opening this issue!

I have a few thoughts about this:

  1. While I agree that in principle there are cases where it might be useful, this is quite a context-sensitive thing and any slight deviation of the intended semantics from what a user wants would likely make the combinator useless and waste a user's time trying to get it to work, or invite further feature requests to account for such cases. Although chumsky has some combinators that are technically context-sensitive, the general approach of the library is to cater primary to context-free grammars.

  2. I'm not sure how the behaviour of this would be communicated or how edge cases would be handled. What if an input can be parsed as more than one of the inputs? What if an input can be covered by either a single parser, or by two other parsers chained together?

  3. On a type level, I'm unsure how this might be implemented. Do all of the parsers need to produce the same output? If not, how does the combinator multiplex over them? It sounds like we'd need to figure out how to generate some enum-like data structure at a type level that can abstract over each of the outputs. That sounds difficult or perhaps even impossible without the user providing this enum themselves, which rather defeats the point of the abstraction.

I'm not opposed to this feature in principle, but I'd like to hear your thoughts on this because I'm having difficulty thinking about what form such a feature would take.

zesterer avatar Dec 16 '21 14:12 zesterer

Maybe it would make sense to come up with at least a second and possible some more use cases to see what they all have in common. I know I have encountered similar parsing use cases in the past but I can't quite remember which project had similar issues. Probably one I did a few years ago related to parsing information on website imprint pages but I can't recall the specifics.

Some aspects of this problem are certainly more specific to my current problem than to the general problem (e.g. the bit where each parser parses a tag and then a value).

Edge cases are certainly a concern. Presumably the same issue can arise when using or() and I would expect similar behaviour here (the first parser that successfully parses something wins).

One possibility would be to use something like the struct derive-builder produces which is basically a version of the input struct with all fields wrapped in an Option.

My current solution is below (in shortened form, I think the omitted bits should be pretty obvious). It is far from the most elegant or clean way of solving the problem, nor the most general but it works.

#[derive(PartialEq, Eq, Clone, Debug, Is, EnumAsInner)]
pub enum LDAPSchemaTagValue {
    Standalone,
    OID(ObjectIdentifier),
[...]
}

#[derive(PartialEq, Eq, Debug)]
pub struct LDAPSchemaTag {
    tag_name: String,
    tag_value: LDAPSchemaTagValue,
}

#[derive(PartialEq, Eq, Debug)]
pub enum LDAPSchemaTagType {
    Standalone,
    OID,
[...]
}

#[derive(PartialEq, Eq, Debug)]
pub struct LDAPSchemaTagDescriptor {
    tag_name: String,
    tag_type: LDAPSchemaTagType,
}

pub fn ldap_schema_tag_value_parser(
    tag_type: &LDAPSchemaTagType,
) -> impl Parser<char, LDAPSchemaTagValue, Error = Simple<char>> {
    match tag_type {
        LDAPSchemaTagType::Standalone => empty()
            .map(|_| LDAPSchemaTagValue::Standalone)
            .labelled("no value")
            .boxed(),
        LDAPSchemaTagType::OID => oid_parser()
            .map(LDAPSchemaTagValue::OID)
            .labelled("OID")
            .boxed(),
[...]
    }
}

pub fn ldap_schema_tag_parser(
    tag_descriptor: &LDAPSchemaTagDescriptor,
) -> impl Parser<char, LDAPSchemaTag, Error = Simple<char>> + '_ {
    seq(tag_descriptor.tag_name.chars())
        .padded()
        .ignore_then(ldap_schema_tag_value_parser(&tag_descriptor.tag_type).padded())
        .map(move |tag_value| LDAPSchemaTag {
            tag_name: tag_descriptor.tag_name.to_string(),
            tag_value,
        })
}

pub fn ldap_schema_parser(
    tag_descriptors: &[LDAPSchemaTagDescriptor],
) -> impl Parser<char, (ObjectIdentifier, Vec<LDAPSchemaTag>), Error = Simple<char>> + '_ {
    let (first, rest) = tag_descriptors
        .split_first()
        .expect("tag descriptors must have at least one element");
    oid_parser()
        .then(
            rest.iter()
                .fold(ldap_schema_tag_parser(first).boxed(), |p, td| {
                    p.or(ldap_schema_tag_parser(td)).boxed()
                })
                .padded()
                .repeated(),
        )
        .padded()
        .delimited_by('(', ')')
}

pub fn required_tag(
    tag_name: &str,
    span: &std::ops::Range<usize>,
    tags: &[LDAPSchemaTag],
) -> Result<LDAPSchemaTagValue, Simple<char>> {
    tags.iter()
        .find(|x| x.tag_name == tag_name)
        .ok_or_else(|| {
            Simple::custom(
                span.clone(),
                format!("No {} tag in parsed LDAP schema tag list", tag_name),
            )
        })
        .map(|x| x.tag_value.to_owned())
}

pub fn optional_tag(tag_name: &str, tags: &[LDAPSchemaTag]) -> Option<LDAPSchemaTagValue> {
    tags.iter()
        .find(|x| x.tag_name == tag_name)
        .map(|x| x.tag_value.to_owned())
}

#[derive(Clone, PartialEq, Eq)]
pub struct LDAPSyntax {
    pub oid: ObjectIdentifier,
    pub desc: String,
    pub x_binary_transfer_required: bool,
    pub x_not_human_readable: bool,
}

pub fn ldap_syntax_parser() -> impl Parser<char, LDAPSyntax, Error = Simple<char>> {
    lazy_static! {
        static ref LDAP_SYNTAX_TAGS: Vec<LDAPSchemaTagDescriptor> = vec![
            LDAPSchemaTagDescriptor {
                tag_name: "DESC".to_string(),
                tag_type: LDAPSchemaTagType::String
            },
            LDAPSchemaTagDescriptor {
                tag_name: "X-BINARY-TRANSFER-REQUIRED".to_string(),
                tag_type: LDAPSchemaTagType::Boolean
            },
            LDAPSchemaTagDescriptor {
                tag_name: "X-NOT-HUMAN-READABLE".to_string(),
                tag_type: LDAPSchemaTagType::Boolean
            },
        ];
    }
    ldap_schema_parser(&LDAP_SYNTAX_TAGS).try_map(|(oid, tags), span| {
        Ok(LDAPSyntax {
            oid,
            desc: required_tag("DESC", &span, &tags)?
                .as_string()
                .unwrap()
                .to_string(),
            x_binary_transfer_required: *optional_tag("X-BINARY-TRANSFER-REQUIRED", &tags)
                .unwrap_or(LDAPSchemaTagValue::Boolean(false))
                .as_boolean()
                .unwrap(),
            x_not_human_readable: *optional_tag("X-NOT-HUMAN-READABLE", &tags)
                .unwrap_or(LDAPSchemaTagValue::Boolean(false))
                .as_boolean()
                .unwrap(),
        })
    })
}

taladar avatar Dec 16 '21 15:12 taladar

Perhaps this sort of thing could be implemented with an extension crate? Intuitively, I feel like it's a bit too complex for the main crate though, and rather diverges from the goal of context-free parsing and into the domain of semantic analysis.

zesterer avatar Dec 17 '21 15:12 zesterer

I think I found myself looking for something like this today. I have the same situation: there are a bunch of parsers that could exist in any other.

So it should accept both this input:

a=1 b="B"

as well as the other way around:

b="B" a=1

it's important to note that each entry may have its own type.

I want to turn it into a struct:

struct S {
   a: i64,
   b: String
}

So I could implement this by something like:

let a = just("a=").then(int);
let b = just("b=").then(string);

(a.then(b).map(|(a, b) => S { a, b }).or(b.then(a).map(|(b, a) => S { a, b})

but this gets old really fast with more fields in S.

If I had a

(a.next_to(b)).map(|(a, b) => S { a, b })

where next_to was "I don't care about the order" then my life would get a lot easier. The order in the parser still determines the order they appear in map, but the parsers are applied in any order; any permutation of them is valid. If fields are required and missing, the parse would fail. If I need optional fields, I can use or_not(). If I need default values, I can use a combination of .or_not() with map that does an unwrap_or.

Does the objection that this is a context-sensitive grammar still apply here? Is this expressible in the type system? Generating all the permutations would be one way to implement it but would be explosive - is there a better way?

faassen avatar Nov 10 '23 17:11 faassen

I think my position for this sort of thing is that this is a semantic issue rather than a syntactic issue. Chumsky isn't intended as some sort of general-purpose input checking/verification system, it's intended to transform unstructured input into structured input: whether that final structuring is semantically correct according to some specification is a job for whatever system is invoking chumsky (although admittedly combinators like validate do allow chumsky to handle this sort of input validation to some extent). I think the best approach is to do ident.then_ignore(eq).then(value).repeated() and have a second pass that ensures that there are no duplicates.

zesterer avatar Nov 10 '23 18:11 zesterer

I'm doing that now of course, but I'm frustrated with the work I need to do to get the types out of the vector that creates.

Given the struct

struct S {
   a: i64
   b: String
}

I also need an enum:

enum SField {
   A(i64),
   B(String)
}

so I can do the repeated with arbitrary types. Then I have a vector of SField which I then need to turn back into an instance of the struct S. All of that is verbose.

The alternative is to keep the values as strings (in a hashmap, say) and then take the fields out to convert them later, but then I can't associate knowledge of how to decode a field in the parser anymore. Instead I get a big try_map that tries to convert each string, without the help of parser combinators to convert these strings.

The nice thing about chumsky is that you can compose structured input you get from previous steps. That's why I'm looking for a combinator to help me doing this type of composition.

I'll note that this use case is from a programming language, and not a general-purpose input checking use case. In this case the language is XSLT, which has instructions that are XML elements, which each have XML attributes. The XML is tokenized into a stream of tokens, and some of the tokens are attributes.

faassen avatar Nov 10 '23 18:11 faassen

I can definitely understand why this would be annoying. I can sort of imagine the design of such a combinator implemented generically, like so:

let (mut a_res, mut b_res, mut c_res) = (None, None, None);

loop {
    if a.is_none() && let Ok(a_out) = a.parse(...) {
        a_res = Some(a_out);
    } else if b.is_none() && let Ok(b_out) = b.parse(...) {
        b_res = Some(b_out);
    } else if c.is_none() && let Ok(c_out) = c.parse(...) {
        c_res = Some(c_out);
    } else if let (Some(a_res), Some(b_res), Some(c_res)) = (a_res, b_res, c_res) {
        break Ok((a_res, b_res, c_res))
    } else {
        break Err(...)
    }
}

API-wise, it would probably look similar to choice, but with an output type which is a tuple of the outputs of each parser in turn.

If someone is interested in having a knock at implementing this, I'd be happy to assist. The first step would probably be to duplicate and then adapt Choice.

zesterer avatar Nov 10 '23 19:11 zesterer