dash.js XML parser

trafficstars

The goal of this PR is to provide a new XML parser in order to speed up parsing time. Parsing time may be criticial for long contents with SegmentTimelines.

What has been achieved in this PR:

integration of tXml parser: https://github.com/TobiasNickel/tXml
modification of the tXml parser in order to apply simplification process (https://github.com/TobiasNickel/tXml#xmlsimplify-txml_dom_object) on-the-fly during parsing, while also applying matchers on the attributes values and managing children that shall be stored as arrays
remove the StringMatcher since it is not useful anymore as every attribute is parsed by default as a string (BTW simply removing this matcher could reduce up to 30-40% the original parsing processing time)
update of the NumericMatcher in order to avoid convert into numbers attributes that shall be in string format but that can be filled as a numeric value ("id" attribute for example)
remove all references to '_asArray' notation (no more duplication of child nodes as a single object and as an array of object)

Here are some numbers for the XML parsing processing time (in ms) for a 240mn content having one AdaptationSet with a SegmentTimeline without repeat pattern:

	Chrome desktop (core i7 2x2.7GHz 16Go)	Chromecast 3rd Gen	Chromecast Ultra
dash.js	110	2 500	1 800
dash.js + tXml	30	500	350

Sep 23 '20 13:09 bbert

@technogeek00 can you please check this PR on your side regarding all the changes from your PR #3451? Can you check if does not break anything? Also can you check the unit tests since some of them do not success anymore? But I don't know how to fix them.

Feb 10 '21 09:02 bbert

@bbert I can take a look, will take me a few days to get to it.

Feb 16 '21 17:02 technogeek00

@technogeek00 Did you have time to check this PR and unit tests?

Mar 26 '21 08:03 bbert

Reviewing tonight

Apr 08 '21 01:04 technogeek00

First I apologize for the delay in my review, but hopefully this is insightful feedback.

Thanks a lot for this review.

I pulled this branch and attempted to run the player and the tests locally, both appeared broken as indicated, but upon further investigation there seems to be deeper problems with the library proposed. The points I would highlight for discussion:

The ordering of sibling elements of different tag names is not persisted, this is critical to patching operations as the add/remove/replace operations operate on the consecutively modified document

That's already the case with current version. Attributes and nodes are stored as Object properties, thus you can't guarantee having them stored in original order.

There appears to be little to no proper support for xml namespacing, the prefixes are directly preserved, but these are arbitrary in a document and need to be properly handled

Is there a use case not correclty supported with this version?

As a note on the first point, this is technically only critical for the patch document, the non-modified format of the tXml library does appear to support parsing and simplification independently. If that version was taken directly the DashParser.js file could apply simplification to the MPD element to keep footprints small while the Patch could be left untouched to persist the ordering information for usage in application.

I did modified the original library to speed parsing time. If we keep parsing and simplification independants then we'll loose efficiency.

While I agree the parsing and internal structuring of the current library should be replaced with a more performant option, this particular library seems very purpose built for a specific use case and may not be fully conformant. If this library was used, I'd recommend reverting back to the original version, re-porting the attribute matcher framework, and then utilize the simplified version for mpds and lossless simplified version for patches instead of manually modifying the original parse process.

For which use cases this library is not conformant? If there are some conformance issues, maybe we can check to modify/patch this library to make it fully conformant.

Outside the library we may need to take a close look at document usage cases to ensure single vs array element usage is done consistently and accurately with the updated parsed structure. I highlighted a couple bugs in this review, but I'd worry about areas that did not reference the _asArray and still always expected it or others that always assumed standalone elements.

Apr 16 '21 08:04 bbert

@technogeek00 Have you seen my answers to your reviews ans my last comment?

May 12 '21 08:05 bbert

@bbert Sorry I've fixed my email notifications, no more long delays

That's already the case with current version. Attributes and nodes are stored as Object properties, thus you can't guarantee having them stored in original order.

Within a DASH manifest the ordering of elements is only truly significant with respect to elements of the same name and the ordering of attributes is not significant. This means the representation of attributes and nodes as object properties is generally okay, the caveat being you must have an ordered access to sibling same name elements which are granted by the _asArray copies of the structure which guarantees the ordering and even patch application ensures properly updated ordering.

With a DASH Patch manifest the ordering of all operation elements is significant among all operations, meaning you must have access to the exact ordering of the elements to apply the operations in the right order. The current library again bridges this gap with the __children array.

Is there a use case not correclty supported with this version?

All namespacing is not supported properly by this version, some explicit features common encryption, patching, and custom element extensions. Namespace prefixes are arbitrary in xml schemas, the library will need to properly parse and assign elements to namespaces (or omit if in the default).

I did modified the original library to speed parsing time. If we keep parsing and simplification independants then we'll loose efficiency.

Understood, but in doing so left no way to properly handle DASH Patching use cases. If you can persist the full information set in the simplified parse operation then you can keep it as modified, this will just be a greater lift in complexity. My reasoning for keeping these separate is that the non-simplified state is only applicable during in-memory MPD management after which the player can easily operate on the simplified model. If the data parse followed by data projection is expensive the single pass will need to capture all use cases.

For which use cases this library is not conformant?

If there are some conformance issues, maybe we can check to modify/patch this library to make it fully conformant. Namespacing and element ordering are the cases I'm referring to, details above

May 18 '21 06:05 technogeek00

@technogeek00 in order to progress effectively, do you have any test streams that highlights the issues with current implementation in this PR?

Jun 01 '21 15:06 bbert

@bbert with the namespace parsing issues I'm thinking a stream providing a cenc:pssh element would be revealing. Though I would look to @dsilhavy to confirm that Dash.js actually utilizes this node today.

The ordering of elements would be important in patches which I don't think there is a good public test stream for yet.

Jun 04 '21 01:06 technogeek00

Did update to support tag namespaces. Did update unit tests. Sibling same name elements order is respected, but unit tests required to be updated to consider some elements being always represented as arrays, conforming to DASH spec.

Jun 09 '21 06:06 bbert

Closed, see #4180

May 05 '23 08:05 bbert

dash.js dash.js copied to clipboard

XML parser

dash.js
dash.js copied to clipboard