quick-xml

Document a couple of recommended patterns of usage

Open dralley opened this issue 5 years ago • 4 comments

Hi, I've started using this library for a personal project, and I've found that it's difficult to figure out how my code should be structured. I think it would be great if there were some docs that were a little more prescriptive about certain patterns you can use to accomplish certain goals (such as when you might want to use a state machine) or to provide clean abstractions in a larger non-trivial codebase.

One example: A pattern like this is really great for parsing nested objects using nested readers. (The provided nested reader example uses no abstractions - if you try to make it more sophisticated than it already is it would get messy very quickly).

Another example could be the state machine pattern used in this blog post: https://usethe.computer/posts/14-xmhell.html. The issue68 example is somewhat similar but the namespaces make it more difficult to understand what the general case might look like.

If you're not keen on putting too much detail in the quick-xml docs, then maybe just linking to a few projects / blog posts which use quick-xml "well" would be a good idea, or explain some of the general principles.

A sidenote:

Nearly every project I've looked at has some kind of implementation of get_element_text or get_attribute (https://github.com/tafia/quick-xml/issues/146) or write_text_element. I actually think it might be a good idea to include them in the library outright, but otherwise, showing some basic helpers like these in the examples would be great as well.

dralley avatar Mar 01 '21 16:03 dralley

I agree more documentation is always better. I am not sure I'll find time to write it soon but in a sketch:

  1. small xml / performance not critical / xml "simple" enough => serde
  2. the "items" are simple and not too nested => simple function with state machine
use std::io::BufRead;
use quick_xml::events::{BytesStart, Event};
use quick_xml::{Error, Reader};

// Sketch; assumes the quick-xml reader API as of early 2021
// (`read_event(&mut buf)`, `name()`, `read_text(end, &mut buf)`).
fn parse_items<R: BufRead>(reader: &mut Reader<R>) -> Result<Vec<(String, String, Vec<String>)>, Error> {

    #[derive(Debug)]
    enum State {
        Start,
        Level0,
        Level1(String),
        Level2(String, String, Vec<String>),
    }

    let mut items = Vec::new();
    let mut state = State::Start;
    let mut buf = Vec::new();
    let mut txt_buf = Vec::new();

    // Returns the value of attribute `name`, or an empty string if it is absent.
    fn att_to_string<R: BufRead>(reader: &Reader<R>, event: &BytesStart, name: &[u8]) -> Result<String, Error> {
        for a in event.attributes() {
            let a = a?;
            if a.key == name {
                return Ok(a.unescape_and_decode_value(reader)?);
            }
        }
        Ok(String::new())
    }

    loop {
        // Match the current state and the next event together.
        state = match (state, reader.read_event(&mut buf)?) {
            (State::Start, Event::Start(e)) if e.name() == b"level0" => State::Level0,
            (State::Level0, Event::Start(e)) if e.name() == b"level1" => {
                State::Level1(att_to_string(reader, &e, b"attr1")?)
            }
            (State::Level1(att1), Event::Start(e)) if e.name() == b"level2" => {
                State::Level2(att1, att_to_string(reader, &e, b"attr2")?, Vec::new())
            }
            (State::Level2(att1, att2, mut lev3), Event::Start(e)) if e.name() == b"level3" => {
                lev3.push(reader.read_text(b"level3", &mut txt_buf)?);
                txt_buf.clear();
                State::Level2(att1, att2, lev3)
            }
            (State::Level2(att1, att2, lev3), Event::End(e)) if e.name() == b"level2" => {
                items.push((att1.clone(), att2, lev3)); // flatten level1
                State::Level1(att1)
            }
            (State::Level1(_), Event::End(e)) if e.name() == b"level1" => {
                State::Level0
            }
            (State::Level0, Event::End(e)) if e.name() == b"level0" => return Ok(items),
            (state, Event::Eof) => return Err(Error::UnexpectedEof(format!("{:?}", state))),
            // Any other event is ignored and the state is unchanged.
            (state, _) => state,
        };
        buf.clear();
    }
}
  3. Else => state machine split into many functions as specified in your example

In terms of occurrence I believe 1 >> 2 >> 3.
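
For case 3, a minimal sketch of what "split into many functions" could look like, with one function per nesting level. To keep it self-contained it uses a stand-in `Ev` enum and a plain iterator instead of quick-xml's `Reader` and `Event`; the names `Ev`, `Entry`, `parse_entries`, `parse_entry` and `parse_text` are all made up for illustration.

```rust
// Stand-in for a pull parser's events; NOT quick-xml's `Event` type.
#[derive(Debug, Clone, PartialEq)]
enum Ev {
    Start(&'static str),
    Text(&'static str),
    End(&'static str),
}

#[derive(Debug, PartialEq)]
struct Entry {
    name: String,
    values: Vec<String>,
}

// Top level: only knows about <items>; each <entry> is delegated.
fn parse_entries(events: &mut impl Iterator<Item = Ev>) -> Result<Vec<Entry>, String> {
    let mut entries = Vec::new();
    while let Some(ev) = events.next() {
        match ev {
            Ev::Start("items") => {}
            Ev::Start("entry") => entries.push(parse_entry(events)?),
            Ev::End("items") => return Ok(entries),
            other => return Err(format!("unexpected at top level: {:?}", other)),
        }
    }
    Err("unexpected end of input inside <items>".to_string())
}

// One level down: consumes events up to and including </entry>.
fn parse_entry(events: &mut impl Iterator<Item = Ev>) -> Result<Entry, String> {
    let mut entry = Entry { name: String::new(), values: Vec::new() };
    while let Some(ev) = events.next() {
        match ev {
            Ev::Start("name") => entry.name = parse_text(events, "name")?,
            Ev::Start("value") => entry.values.push(parse_text(events, "value")?),
            Ev::End("entry") => return Ok(entry),
            other => return Err(format!("unexpected in <entry>: {:?}", other)),
        }
    }
    Err("unexpected end of input inside <entry>".to_string())
}

// Leaf: collects text until the matching end tag.
fn parse_text(events: &mut impl Iterator<Item = Ev>, tag: &str) -> Result<String, String> {
    let mut text = String::new();
    while let Some(ev) = events.next() {
        match ev {
            Ev::Text(t) => text.push_str(t),
            Ev::End(name) if name == tag => return Ok(text),
            other => return Err(format!("unexpected in <{}>: {:?}", tag, other)),
        }
    }
    Err(format!("unexpected end of input inside <{}>", tag))
}
```

Each function owns exactly one level and returns when it sees its own end tag, so the "state" is implicit in which function is currently running.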

Thank you also for the sidenote, these functions are indeed very common and we would benefit from having them implemented by default.

tafia avatar Mar 04 '21 06:03 tafia

Thanks, that is helpful! What about quick-xml without a state machine, just nested readers? I've seen a couple of projects doing it, and it's the way my code is written atm, but are there downsides? I haven't gotten around to strict validation or anything like that yet, if that is where it becomes helpful.

https://github.com/dralley/rpmrepo_rs/blob/master/src/metadata/repomd.rs#L235-L319

dralley avatar Mar 04 '21 13:03 dralley

Nested readers are good when there are really a lot of levels.

I find them more complicated than simple state machines, but this is subjective (matching the state and the event at once really shows what we're expecting). One potential drawback of nested parsers is that it is hard to reuse the same buf, so in some cases you may need to allocate large chunks over and over (tags_buf in your example is created many times).
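
A rough illustration of the buffer point, using only the standard library rather than quick-xml: the outer function allocates one scratch buffer and lends it to the nested function instead of each level creating its own. The line-based input format and the names `next_line`, `parse_section` and `parse_document` are invented for this sketch.

```rust
use std::io::{self, BufRead};

// Reads one line into the shared scratch buffer and returns it trimmed.
// The buffer is cleared rather than recreated, so its capacity is reused.
fn next_line<R: BufRead>(reader: &mut R, buf: &mut Vec<u8>) -> io::Result<Option<String>> {
    buf.clear();
    if reader.read_until(b'\n', buf)? == 0 {
        return Ok(None); // end of input
    }
    Ok(Some(String::from_utf8_lossy(buf.as_slice()).trim().to_string()))
}

// Nested parser: collects item lines until a blank line or end of input.
// It borrows the caller's buffer instead of allocating a new one.
fn parse_section<R: BufRead>(reader: &mut R, buf: &mut Vec<u8>) -> io::Result<Vec<String>> {
    let mut items = Vec::new();
    while let Some(line) = next_line(reader, buf)? {
        if line.is_empty() {
            break;
        }
        items.push(line);
    }
    Ok(items)
}

// Outer parser: reads a section name, then hands the same reader and
// buffer down to the nested parser.
fn parse_document<R: BufRead>(reader: &mut R) -> io::Result<Vec<(String, Vec<String>)>> {
    let mut buf = Vec::new(); // allocated once for the whole document
    let mut sections = Vec::new();
    while let Some(name) = next_line(reader, &mut buf)? {
        if name.is_empty() {
            continue; // skip extra blank lines between sections
        }
        let items = parse_section(reader, &mut buf)?;
        sections.push((name, items));
    }
    Ok(sections)
}
```

The cost is that every nested function needs an extra `&mut Vec<u8>` parameter, which is part of why the flat state machine can be simpler when the nesting is shallow.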

tafia avatar Mar 11 '21 10:03 tafia

I am interested in your comment that performance-intensive code would be better served by the Reader/Writer APIs than by Serde. I have been using Serde for speed of development but am coming to realize that I should probably use these lower-level tools instead (the objects themselves are fairly simple but it's performance critical). However, I have a very large number of things that have to be parsed (the full protocol includes probably ~100). Do you have recommendations for implementing these parse_*/decode_* functions reusably, in a way that minimizes boilerplate, or will I just have to buckle down and hand-parse everything?

Additionally, does something spring to mind for good patterns when writing with nested elements?
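
One pattern that may be worth considering for nested output is a closure-based helper that writes the start tag, runs a closure for the children, then writes the end tag, so the code nests the same way the XML does. The sketch below uses plain `std::io::Write` rather than quick-xml's `Writer`, and `element`, `text`, `text_element` and `write_repo` are hypothetical helpers, not library functions.

```rust
use std::io::{self, Write};

// Hypothetical helper: writes <name>, lets `body` emit the children,
// then writes </name>, so open and close tags cannot get out of sync.
fn element<W: Write>(
    w: &mut W,
    name: &str,
    body: impl FnOnce(&mut W) -> io::Result<()>,
) -> io::Result<()> {
    write!(w, "<{}>", name)?;
    body(&mut *w)?;
    write!(w, "</{}>", name)
}

// Writes text content, escaping the characters that are special in XML text.
fn text<W: Write>(w: &mut W, s: &str) -> io::Result<()> {
    for c in s.chars() {
        match c {
            '<' => w.write_all(b"&lt;")?,
            '>' => w.write_all(b"&gt;")?,
            '&' => w.write_all(b"&amp;")?,
            _ => write!(w, "{}", c)?,
        }
    }
    Ok(())
}

// The common "element containing only text" case.
fn text_element<W: Write>(w: &mut W, name: &str, s: &str) -> io::Result<()> {
    element(w, name, |w| text(w, s))
}

// Example document: the nesting of closures mirrors the nesting of elements.
fn write_repo<W: Write>(w: &mut W, name: &str, packages: &[&str]) -> io::Result<()> {
    element(w, "repo", |w| {
        text_element(w, "name", name)?;
        element(w, "packages", |w| {
            for p in packages {
                text_element(w, "package", p)?;
            }
            Ok(())
        })
    })
}
```

Attributes are left out here; they would need their own escaping for quotes, and a real implementation would more likely build on the library's own writer and escaping.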

phdavis1027 avatar Feb 27 '24 20:02 phdavis1027