quick-xml icon indicating copy to clipboard operation
quick-xml copied to clipboard

Add normalize_line_end for unescape and test

Open yorkz1994 opened this issue 1 year ago • 11 comments

Regarding #806 I added a normalize_line_end function in escape module and related tests. If unescape function is called, then line end will be normalized.

yorkz1994 avatar Sep 25 '24 02:09 yorkz1994

I think the failed test is because we did the line end normalization after the unescape, but unescape can contain \r, so these new \rs should not be normalized. We have to do the normalization only for data in the not unescape parts.

yorkz1994 avatar Sep 26 '24 01:09 yorkz1994

:warning: Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 60.36%. Comparing base (3ebe221) to head (d3ee1ad). Report is 3 commits behind head on master.

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #807      +/-   ##
==========================================
+ Coverage   60.17%   60.36%   +0.18%     
==========================================
  Files          41       41              
  Lines       15958    15980      +22     
==========================================
+ Hits         9603     9646      +43     
+ Misses       6355     6334      -21     
Flag Coverage Δ
unittests 60.36% <100.00%> (+0.18%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Sep 26 '24 07:09 codecov-commenter

Good. But I would also like to see the tests when use API of Attribute and BytesText, which are listed here.

Sorry, don't know what do you want me to do. I am not familiar with XML specification on attributes. What I did is just do normalization of the line end before your further unescape work. So it removes \r before unescape means it is just like no such character in your input at all, so your previous logic should continue to work. Now all your previous cases passed and the added case for testing normalization also works. I don't see anything I can do.

yorkz1994 avatar Sep 26 '24 14:09 yorkz1994

I like just see tests that will call unescape family of methods on Attribute and BytesText which is get from the reader and check that the result does not contain \r. Something like:

// XML with \n \r\n and \r style newlines in various places
const XML: &str = "...";

let mut reader = Reader::from_str(XML);

match reader.read_event().unwrap() {
  Event::Start(event) => {
    let iter = event.attributes();
    let a = iter.next().unwrap();
    #[cfg(not(feature = "encoding"))]
    assert_eq!(a.unescape_value(), "...");
    assert_eq!(a.decode_and_unescape_value(), "...");
  }
  event => panic!("Expected Start, found {:?}", event),
}

match reader.read_event().unwrap() {
  Event::Text(event) => assert_eq!(event.unescape(), "..."),
  event => panic!("Expected Text, found {:?}", event),
}

Mingun avatar Sep 26 '24 14:09 Mingun

Like I said, I am not familiar with XML specification, so I don't know where is the proper places to put \r in it and what should be expected after unescape if one appears. For example is it valid to put a \r in a attribute key or value, in tag name? If you know better than me how about you try it yourself.

yorkz1994 avatar Sep 27 '24 00:09 yorkz1994

Just add it everywhere where spaces can occur, we are not not talking about correctness for now (this is another question, we are definitely do not process everything according to the specification). I only want to have a starting point and be sure that this feature worked as assumed when you would use actual API.

Mingun avatar Sep 27 '24 09:09 Mingun

Could you provide such input data. I don't want to create such invalid XML. Normally I can think is that the \r will only appear in BytesText due to line end diffs in OS.

yorkz1994 avatar Sep 27 '24 09:09 yorkz1994

<root attribute="\r\r\n\nvalue1\r\r\n\nvalue2\r\r\n\n">\r\r\n\nvalue3\r\r\n\nvalue4\r\r\n\n</root>

Mingun avatar Sep 27 '24 09:09 Mingun

f970370 This commit is due to forgot to normalize if input has nothing to unescape.

I add a case from your input, don't know the result is your expected:

#[test]
fn line_ends() {
    const XML: &str = "<root attribute=\"\r\r\n\nvalue1\r\r\n\nvalue2\r\r\n\n\">\r\r\n\nvalue3\r\r\n\nvalue4\r\r\n\n</root>";
    let mut reader = Reader::from_str(XML);
    match reader.read_event().unwrap() {
        Event::Start(event) => {
            let mut iter = event.attributes();
            let a = iter.next().unwrap().unwrap();
            #[cfg(not(feature = "encoding"))]
            assert_eq!(
                a.unescape_value().unwrap(),
                "\n\n\nvalue1\n\n\nvalue2\n\n\n"
            );
            assert_eq!(
                a.decode_and_unescape_value(reader.decoder()).unwrap(),
                "\n\n\nvalue1\n\n\nvalue2\n\n\n"
            );
        }
        event => panic!("Expected Start, found {:?}", event),
    }
    match reader.read_event().unwrap() {
        Event::Text(event) => {
            assert_eq!(event.unescape().unwrap(), "\n\n\nvalue3\n\n\nvalue4\n\n\n")
        }
        event => panic!("Expected Text, found {:?}", event),
    }
}

There is a case in serde-se.rs always fail. Don't know if we should modify serde implementation?

serialize_as!(tuple:
        // Use to_string() to get owned type that is required for deserialization
        ("<\"&'>".to_string(), "with\t\r\n spaces", 3usize)
        => "<root>&lt;\"&amp;'&gt;</root>\
            <root>with\t\r\n spaces</root>\
            <root>3</root>");
---- with_root::tuple stdout ----
thread 'with_root::tuple' panicked at tests/serde-se.rs:1956:5:
deserialization roundtrip: Custom("invalid type: string \"with\\t\\n spaces\", expected a borrowed string")

yorkz1994 avatar Sep 27 '24 13:09 yorkz1994

The failed test is because there is \r in the roundtrip test. It is impossible to have equal serialize and deserialize due to line end normalization. Therefore I have to remove the \r in roundtrip test to get the case pass.

yorkz1994 avatar Sep 27 '24 13:09 yorkz1994

I did more line ends normalization. Also changed the normalization implementation to use iterator to avoid extra allocation during normalization.

yorkz1994 avatar Sep 29 '24 09:09 yorkz1994