ccextractor icon indicating copy to clipboard operation
ccextractor copied to clipboard

[QUESTION] Valid xml document

Open jamoore5 opened this issue 4 years ago • 8 comments

Should the XML generator be updated to guarantee that it is generating valid XML?

Example output from one of my video that I exacted subtitles from

1029 | <!--
1030 | And that's exactly it--   that
1031 | human beings perceive patterns
1032 | -->

I was getting an invalid token error from 3 different parsers that I tried to read it with. When I validated the XML in a validator I found

1030: | 25 | The string "--" is not permitted within comments.

Is this a possible issue in the code base?

jamoore5 avatar Feb 21 '20 22:02 jamoore5

Yes, I don't think we're escaping anything - just happily concatenating strings without a care in the world :-)

cfsmp3 avatar Feb 21 '20 22:02 cfsmp3

Is there a reason that the subtitles are a comments instead of text within the element?

Sorry, I do not know much about subtitles and formats.

jamoore5 avatar Feb 21 '20 22:02 jamoore5

Is there a reason that the subtitles are a comments instead of text within the element?

If you mean when writing spupng, because they're not supposed there at all - the XML is just an index to the image files that contain the subtitles. We do add them to the XML for debugging purposes, and only if the OCR subsystem is present. spupng doesn't need a text representation of the subtitles.

cfsmp3 avatar Feb 21 '20 22:02 cfsmp3

@cfsmp3 thanks that was a helpful explanation.

I got introduced to this library and the concept of spupng when following this raspberry pi tutorial. https://magpi.raspberrypi.org/articles/make-comics-from-tv-recordings

The funny thing is that the content in the XML is a lot cleaner then what I exact out of my pngs using the above tutorial. On my raspberry pi, I get boxes instead of text in the png, but the right text in the xml. On my mac, I get the text in the pngs but it is not clear as the text in the xml.

For this project I was going to take advantage of the helpful debug addition and parse the text from the xml.

jamoore5 avatar Feb 21 '20 23:02 jamoore5

hey @cfsmp3 do we need the -- in the xml file . I mean clearly -- is not allowed in XML files. So do we remove it or add a space between the two hyphens?

PulkitMishra avatar Feb 22 '20 21:02 PulkitMishra

@PulkitMishra if we're doing any work on this it should be to generate proper XML. But honestly I don't think this is high priority (based on number of users), but it's low hanging fruit so if you want to do it to get started by all means go ahead.

Just remember that the goal is always to improve quality of output -just removing things we don't like is not a good idea :-)

cfsmp3 avatar Feb 22 '20 23:02 cfsmp3

@PulkitMishra From your discussion I conclude that subtitle text is provided inside the spu-xml only for debugging purposes. I also assume that the source for invalid xml tokens is from the subtitle itself (implying this is not a decoding bug). I suggest the solutions:

  1. Remove the subtitle text comments from the spu-xml for production. Only keep it for debug builds. We can display a warning message while using spu-xml on debug builds for the case when someone might accidentally generate it.

  2. Filter the text written to the spu-xml to avoid having invalid token. Might escape the strings or update/remove some tokens.

Please suggest what you think is appropriate. And also confirm my above assumption.

dak-x avatar Mar 21 '21 05:03 dak-x

@PulkitMishra From your discussion I conclude that subtitle text is provided inside the spu-xml only for debugging purposes. I also assume that the source for invalid xml tokens is from the subtitle itself (implying this is not a decoding bug). I suggest the solutions:

The invalid XML is caused by our writer not escaping some characters at all to comply with XML specifications. It's not a problem with the subtitles (which don't care at all about XML) or the format itself.

It's our own code being very sloppy here.

  1. Filter the text written to the spu-xml to avoid having invalid token. Might escape the strings or update/remove some tokens.

Yes, this is the right approach.

cfsmp3 avatar Mar 21 '21 19:03 cfsmp3