rust-plist icon indicating copy to clipboard operation
rust-plist copied to clipboard

Add Ascii Reader

Open fstephany opened this issue 5 years ago • 23 comments

As discussed in issue #42, here's a first pass on an ASCII reader. I haven't used the fuzzer. If you have direction on how to setup and test it, I would be glad to add id.

fstephany avatar Nov 25 '19 16:11 fstephany

The fuzzer is fairly straightforward, we use cargo-fuzz. If you follow the (minimal) guide to get it installed and then create a new fuzz-target based on the xml_reader.rs target and run it that should just work. You can then put some valid ascii plists in the corpus folder to speed things up a bit if you want.

ebarnard avatar Nov 27 '19 17:11 ebarnard

Woah! I am interested in this. I'm working in the font design space and the dominant font design application https://glyphsapp.com uses a "dialect" of the old ASCII plist format for primary storage, with some adjustment here and there "to save bytes". I.e. it will leave quotes out if something is probably either an integer or string. @anthrotype created a Python/Cython implementation at https://github.com/fonttools/openstep-plist that handles these.

I am going to play around with this PR and see how far I can get.

madig avatar Dec 01 '19 13:12 madig

@madig That's another use case indeed ;) If you have some sample files for testing, it could be nice to check that the code is covering all the cases. If you want to port some unit tests from the Python lib, go ahead!

@ebarnard Thanks a lot for the comments. If you see further things to change do not hesitate. Even if it is code style conformance.

I'm not that sure how to bubble the IO errors from read_exact so I've just added another ErrorKind. If you had something clever in mind, I'll change my naive implementation.

fstephany avatar Dec 01 '19 17:12 fstephany

@fstephany: On it. I have test files over at https://github.com/madig/glyphsrw-rs/tree/master/data that I'm feeding into my toy implementation.

[Long code listing deleted because of irrelevance]

Edit: The optional interpretation of unquoted strings may need to optional maybe :thinking:

madig avatar Dec 01 '19 17:12 madig

Oh, this is much more involved than what I'm currently doing:

match Integer::from_str(&string_literal) {
       Ok(i) => Ok(Some(Event::Integer(i))),
       Err(_) => Ok(Some(Event::String(string_literal))),
}

I just accumulate the characters and then try to convert the resulting string to an integer. I do not support floats (actually, even Integer are not part of the Apple spec).

fstephany avatar Dec 01 '19 18:12 fstephany

Fair enough. This may be overkill for anything not Glyphs.

madig avatar Dec 01 '19 18:12 madig

(If there is interest in upstreaming the more complicated logic, maybe this guessing should be optional and off by default since it is not part of the spec.)

madig avatar Dec 01 '19 18:12 madig

(Ugh, ignore my code above. Your code is more robust: either it parses as an int (or maybe a float) or it doesn't and then is a string)

madig avatar Dec 01 '19 18:12 madig

Hm. Strings with escaped quotes aren't handled properly right now, example:

{
filter = "subCategory == \"Currency\"";
locked = 1;
name = currency;
position = "{-196, 760}";
}

Edit: Let's see if I can implement the tests at https://github.com/fonttools/openstep-plist/blob/master/tests/test_parser.py#L135-L152.

madig avatar Dec 01 '19 21:12 madig

@madig I've fixed it. Can you check the simple test case to be sure that it was the expected behavior ?

That makes me thing that I should add unit tests for invalid inputs. I don't know if I should completely reject malformed files (e.g., missing comma in a list) or be tolerant. I feel like this format has been abused by so many tools with their own conventions that it might be more effective to accept inputs even if they don't comply 100% with the format.

fstephany avatar Dec 03 '19 17:12 fstephany

Thanks! Can do at some point this week. Yeah, I want to look into tests for this at some point. There's more stuff like escape sequences (think "hello\012world") and some tricky handling of Unicode escapes that openstep-plist handles.

madig avatar Dec 03 '19 17:12 madig

Dialects

This library aims to be minimal and focussed on parsing Apple spec Plists. This means that a Plist that follows Apple's spec should always parse per the spec.

I'm happy to support dialects/extensions so long as they do not affect the parsing of Apple spec Plists and have a simple implementation.

Extensions that require complex code to implement are, unfortunately, out of scope for this library. Any extension that requires a configuration option falls into this category.

@madig I'd be happy to support the dialect you've made for glyphsapp provided we can get it parsing without configuration options.

Malformed Files

@fstephany I'm not keen to make an effort to parse obviously malformed files as I don't want people to rely on how this library interprets them, which could change as bugs are fixed or new extensions added. I'd rather we work to explicitly support various common dialects (per the limitations above).

ebarnard avatar Dec 03 '19 19:12 ebarnard

@ebarnard thanks for the feedback. I understand that if I wanted any crazy parsing, I'd need to make my own library (I did not come up with that dialect, I just need to parse it :grin:). I think the code should be able to handle the Glyphs dialect by simply not interpreting strings as anything and instead leaving that to client code. I think that is how the specification intends it. What may be needed is slightly more complex parsing of strings like marked in https://github.com/fonttools/openstep-plist/blob/master/tests/test_parser.py#L135-L152. I can see if I can implement that when I get some time. Maybe a fuzzer could find even more corner cases :thinking:

madig avatar Dec 03 '19 20:12 madig

@madig Those examples all look to be supported by Apple's parser as well: https://pewpewthespells.com/blog/dangers_of_ascii_plists.html. It's probably worth reading the whole of CFOldStylePlist.c to check exactly what escapes it supports as \ escaped as \\ is notable absent from the code listing on the blog post.

The only thing we would be unable to support is unpaired surrogates are these are not supported in UTF8 and therefore are not surpported by Rust's str or String. As I can't imagine a reason for using them it's probably fine to error in that case.

ebarnard avatar Dec 04 '19 21:12 ebarnard

The test should probably be

    #[test]
    fn escaped_sequences_in_strings() {
        let plist = r#"{
            key1 = "va\"lue";
            key2 = 'va"lue';
            key3 = "va\a\b\f\n\r\t\v\"\nlue";
            key4 = "a\012b";
            key5 = "\\UD83D\\UDCA9";
            key6 = "\UD83D\UDCA9";
        }"#;
        let cursor = Cursor::new(plist.as_bytes());
        let streaming_parser = AsciiReader::new(cursor);
        let events: Vec<Event> = streaming_parser.map(|e| e.unwrap()).collect();

        let comparison = &[
            StartDictionary(None),
            String("key1".to_owned()),
            String(r#"va"lue"#.to_owned()),
            String("key2".to_owned()),
            String(r#"va"lue"#.to_owned()),
            String("key3".to_owned()),
            String("va\u{07}\u{08}\u{0C}\n\r\t\u{0B}\"\nlue".to_owned()),
            String("key4".to_owned()),
            String("a\nb".to_owned()),
            String("key5".to_owned()),
            String("\\UD83D\\UDCA9".to_owned()),
            String("key6".to_owned()),
            String("💩".to_owned()),
            EndCollection,
        ];

        assert_eq!(events, comparison);
    }

Parses with openstep-plist:

{'key1': 'va"lue',
 'key2': 'va"lue',
 'key3': 'va\x07\x08\x0c\n\r\t\x0b"\nlue',
 'key4': 'a\nb',
 'key5': '\\UD83D\\UDCA9',
 'key6': '💩'}

Next step: implement the equivalent of https://github.com/opensource-apple/CF/blob/master/CFOldStylePList.c#L123-L171. Decoding table for values 128-255 at https://github.com/fonttools/openstep-plist/blob/master/src/openstep_plist/parser.pyx#L87-L106.

madig avatar Dec 08 '19 21:12 madig

What is left for this PR to reach mergeable state, besides fixing the merge conflicts? Is @fstephany still interested in picking this one up?

anthrotype avatar Dec 05 '22 11:12 anthrotype

I think the only thing missing was correct handling of escapes per @madig's comment, and some tests for those escapes.

ebarnard avatar Dec 09 '22 08:12 ebarnard

Oh, I forgot about this. Sorry for letting it slip :/ I've forgot about the details, @anthrotype go ahead if you need it!

fstephany avatar Dec 09 '22 14:12 fstephany

I'd like to see this feature land. @anthrotype @fstephany if you don't mind I'd be happy to finish it off, and if either of you have any branches with outstanding work I can take it from there, let me know.

steven-joruk avatar Mar 15 '24 18:03 steven-joruk

I still need to add escape handling and I'll go through all the previous comments to check there's nothing else, but wanted to point out two things. My branch here.

The commit that fixes one of the serde tests highlighted that documents that don't begin with exactly <?xml will be treated as ascii, which might not be desirable.

The master branch is broken due to denying warnings and a deprecation, I updated to swap_remove in my branch.

error: use of deprecated method `indexmap::IndexMap::<K, V, S>::remove`: `remove` disrupts the map order -- use `swap_remove` or `shift_remove` for explicit behavior.
  --> src/dictionary.rs:70:18
   |
70 |         self.map.remove(key)

steven-joruk avatar Mar 16 '24 17:03 steven-joruk

My branch has been updated to include the escape sequence handling and the unicode mappings. It passes the tests provided by madig.

I had to allow escaping \ (\\) because the test that parses netnewswire.pbxproj fails without it.

@ebarnard Would you prefer me to open a new PR? I don't think I'll be able to address any code review comments without assistance if we continue in this PR.

steven-joruk avatar Mar 18 '24 01:03 steven-joruk

It would be amazing to see this land. It would be very useful with some of apple CLIs like launchctl that return old style plists in stdout.

pronebird avatar Mar 20 '24 08:03 pronebird

@steven-joruk If you can't push to this branch you should definitely open a new PR and I'll look at it afresh. I can't remember much of this one.

ebarnard avatar Mar 20 '24 21:03 ebarnard

Closed via #136.

ebarnard avatar Jun 30 '24 21:06 ebarnard