php-mediainfo

Parsing MediaInfo fails on Chinese chars in XML

Open Fossil01 opened this issue 4 years ago • 11 comments

In the following XML between the <Copyright> tags there are some Chinese chars. SimpleXML doesn't seem to like those and crashes the process.

<Copyright>�꤀ 刀漀渀 䠀愀爀爀椀猀</Copyright>
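
One possible reading of the sample above, not confirmed anywhere in this thread: the characters may be byte-swapped UTF-16 rather than real Chinese text. The sketch below assumes the two space-like characters are U+2000 and leaves out the leading character. It serializes the remaining code points as UTF-16BE and decodes the same bytes as UTF-16LE.

```php
<?php
// Assumption: the code points after the leading character, with the spaces taken as U+2000.
$garbled = "\u{A900}\u{2000}\u{5200}\u{6F00}\u{6E00}\u{2000}"
    . "\u{4800}\u{6100}\u{7200}\u{7200}\u{6900}\u{7300}";

// Serialize as UTF-16BE, then decode the same bytes as UTF-16LE.
$bytes = mb_convert_encoding($garbled, 'UTF-16BE', 'UTF-8');
$decoded = mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');

echo $decoded, PHP_EOL; // © Ron Harris
```

If that reading holds, it would match the Performer field in the dump, and the rejected U+FFFE would simply be a byte-swapped byte order mark (U+FEFF).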

   ErrorException  : simplexml_load_string(): Entity: line 54: parser error : Char 0xFFFE out of allowed range

  at /var/www/removed/vendor/mhor/php-mediainfo/src/Parser/AbstractXmlOutputParser.php:18
    14|         if (mb_detect_encoding($xmlString, 'UTF-8', true) === false) {
    15|             $xmlString = utf8_encode($xmlString);
    16|         }
    17|
  > 18|         $xml = simplexml_load_string($xmlString);
    19|         $json = json_encode($xml);
    20|
    21|         return json_decode($json, true);
    22|     }

  Exception trace:

  1   simplexml_load_string("<?xml version="1.0" encoding="UTF-8"?>
<Mediainfo version="19.09">
<File>
<track type="General">
<Count>331</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>General</Kind_of_stream>
<Kind_of_stream>General</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<Count_of_video_streams>1</Count_of_video_streams>
<Count_of_audio_streams>1</Count_of_audio_streams>
<Video_Format_List>VC-1</Video_Format_List>
<Video_Format_WithHint_List>VC-1 (WMV3)</Video_Format_WithHint_List>
<Codecs_Video>VC-1</Codecs_Video>
<Audio_Format_List>WMA</Audio_Format_List>
<Audio_Format_WithHint_List>WMA</Audio_Format_WithHint_List>
<Audio_codecs>WMA</Audio_codecs>
<Complete_name>/mnt/ramdisk/5/54c9f93b-8550-4100-8eeb-328841dc00d6/782247_ohrly-rh131aaso.wmv</Complete_name>
<Folder_name>/mnt/ramdisk/5/54c9f93b-8550-4100-8eeb-328841dc00d6</Folder_name>
<File_name_extension>782247_ohrly-rh131aaso.wmv</File_name_extension>
<File_name>782247_ohrly-rh131aaso</File_name>
<File_extension>wmv</File_extension>
<Format>Windows Media</Format>
<Format>Windows Media</Format>
<Format_Extensions_usually_used>asf dvr-ms wma wmv</Format_Extensions_usually_used>
<Commercial_name>Windows Media</Commercial_name>
<Internet_media_type>video/x-ms-wmv</Internet_media_type>
<File_size>760169</File_size>
<File_size>742 KiB</File_size>
<File_size>742 KiB</File_size>
<File_size>742 KiB</File_size>
<File_size>742 KiB</File_size>
<File_size>742.4 KiB</File_size>
<Duration>489056</Duration>
<Duration>8 min 9 s</Duration>
<Duration>8 min 9 s 56 ms</Duration>
<Duration>8 min 9 s</Duration>
<Duration>00:08:09.056</Duration>
<Duration>00:08:09;03</Duration>
<Duration>00:08:09.056 (00:08:09;03)</Duration>
<Overall_bit_rate>12435</Overall_bit_rate>
<Overall_bit_rate>12.4 kb/s</Overall_bit_rate>
<Maximum_Overall_bit_rate>5136894</Maximum_Overall_bit_rate>
<Maximum_Overall_bit_rate>5 137 kb/s</Maximum_Overall_bit_rate>
<Frame_rate>29.970</Frame_rate>
<Frame_rate>29.970 FPS</Frame_rate>
<Frame_count>14657</Frame_count>
<HeaderSize>1046</HeaderSize>
<DataSize>759123</DataSize>
<Performer>Ron Harris</Performer>
<Encoded_date>UTC 2012-05-14 00:53:44.000</Encoded_date>
<File_last_modification_date>UTC 2019-12-17 17:20:55</File_last_modification_date>
<File_last_modification_date__local_>2019-12-17 18:20:55</File_last_modification_date__local_>
<Copyright>�꤀ 刀漀渀 䠀愀爀爀椀猀</Copyright>
<Comment>HD Videos</Comment>
</track>
<track type="Video">
<Count>377</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>Video</Kind_of_stream>
<Kind_of_stream>Video</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<StreamOrder>0</StreamOrder>
<ID>1</ID>
<ID>1</ID>
<Format>VC-1</Format>
<Format>VC-1</Format>
<Commercial_name>VC-1</Commercial_name>
<Format_profile>Main</Format_profile>
<Internet_media_type>video/vc1</Internet_media_type>
<Codec_ID>WMV3</Codec_ID>
<Codec_ID_Info>Windows Media Video 9</Codec_ID_Info>
<Codec_ID_Hint>WMV3</Codec_ID_Hint>
<Codec_ID_Url>http://www.microsoft.com/windows/windowsmedia/format/codecdownload.aspx</Codec_ID_Url>
<Description_of_the_codec>Windows Media Video 9 - 2-pass VBR</Description_of_the_codec>
<Duration>489056</Duration>
<Duration>8 min 9 s</Duration>
<Duration>8 min 9 s 56 ms</Duration>
<Duration>8 min 9 s</Duration>
<Duration>00:08:09.056</Duration>
<Duration>00:08:09;03</Duration>
<Duration>00:08:09.056 (00:08:09;03)</Duration>
<Bit_rate>5000000</Bit_rate>
<Bit_rate>5 000 kb/s</Bit_rate>
<Width>1920</Width>
<Width>1 920 pixels</Width>
<Height>1080</Height>
<Height>1 080 pixels</Height>
<Pixel_aspect_ratio>1.000</Pixel_aspect_ratio>
<Display_aspect_ratio>1.778</Display_aspect_ratio>
<Display_aspect_ratio>16:9</Display_aspect_ratio>
<Frame_rate>29.970</Frame_rate>
<Frame_rate>29.970 (29970/1000) FPS</Frame_rate>
<FrameRate_Num>29970</FrameRate_Num>
<FrameRate_Den>1000</FrameRate_Den>
<Frame_count>14657</Frame_count>
<Color_space>YUV</Color_space>
<Chroma_subsampling>4:2:0</Chroma_subsampling>
<Chroma_subsampling>4:2:0</Chroma_subsampling>
<Bit_depth>8</Bit_depth>
<Bit_depth>8 bits</Bit_depth>
<Scan_type>Progressive</Scan_type>
<Scan_type>Progressive</Scan_type>
<Compression_mode>Lossy</Compression_mode>
<Compression_mode>Lossy</Compression_mode>
<Bits__Pixel_Frame_>0.080</Bits__Pixel_Frame_>
<Stream_size>305660000</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>291.5 MiB</Stream_size>
</track>
<track type="Audio">
<Count>280</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>Audio</Kind_of_stream>
<Kind_of_stream>Audio</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<StreamOrder>1</StreamOrder>
<ID>2</ID>
<ID>2</ID>
<Format>WMA</Format>
<Format>WMA</Format>
<Commercial_name>WMA</Commercial_name>
<Format_version>Version 2</Format_version>
<Codec_ID>161</Codec_ID>
<Codec_ID_Info>Windows Media Audio</Codec_ID_Info>
<Codec_ID_Url>http://www.microsoft.com/windows/windowsmedia/format/codecdownload.aspx</Codec_ID_Url>
<Description_of_the_codec>Windows Media Audio 9 - 128 kbps, 44 kHz, stereo CBR</Description_of_the_codec>
<Duration>489056</Duration>
<Duration>8 min 9 s</Duration>
<Duration>8 min 9 s 56 ms</Duration>
<Duration>8 min 9 s</Duration>
<Duration>00:08:09.056</Duration>
<Duration>00:08:09.056</Duration>
<Bit_rate>128000</Bit_rate>
<Bit_rate>128 kb/s</Bit_rate>
<Channel_s_>2</Channel_s_>
<Channel_s_>2 channels</Channel_s_>
<Sampling_rate>44100</Sampling_rate>
<Sampling_rate>44.1 kHz</Sampling_rate>
<Samples_count>21567370</Samples_count>
<Bit_depth>16</Bit_depth>
<Bit_depth>16 bits</Bit_depth>
<Stream_size>7824896</Stream_size>
<Stream_size>7.46 MiB</Stream_size>
<Stream_size>7 MiB</Stream_size>
<Stream_size>7.5 MiB</Stream_size>
<Stream_size>7.46 MiB</Stream_size>
<Stream_size>7.462 MiB</Stream_size>
</track>
</File>
</Mediainfo>

")
      /var/www/removed/vendor/mhor/php-mediainfo/src/Parser/AbstractXmlOutputParser.php:18

  2   Mhor\MediaInfo\Parser\AbstractXmlOutputParser::transformXmlToArray("<?xml version="1.0" encoding="UTF-8"?>
[same XML document as in frame 1]
")
      /var/www/removed/vendor/mhor/php-mediainfo/src/Parser/MediaInfoOutputParser.php:22

  Please use the argument -v to see more details.
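
The failure can probably be reproduced without MediaInfo. The following is a minimal sketch, assuming a hypothetical two-element document that carries U+FFFE encoded as UTF-8 (bytes EF BF BE). It collects libxml errors instead of letting them surface as warnings or exceptions; the exact message text may differ between libxml2 versions.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical minimal document: U+FFFE as UTF-8 inside an element.
$xmlString = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
    . "<Mediainfo><Copyright>\xEF\xBF\xBE</Copyright></Mediainfo>";

// Returns false when the document is not well-formed.
$xml = simplexml_load_string($xmlString);

$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message);
}
libxml_clear_errors();

var_dump($xml === false);
print_r($messages);
```

XML 1.0 excludes U+FFFE and U+FFFF from its allowed character range, so the parser rejects the document even though the bytes are structurally valid UTF-8. That may also be why the `mb_detect_encoding` check does not catch it.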

Fossil01, Dec 17 '19 18:12

@Fossil01 Thanks for reporting this issue. An old pull request considered removing utf8_encode to solve a "bug". Could you try removing this call and see if that solves the problem? If not, I will try to fix this issue this weekend.

mhor, Dec 18 '19 21:12

@mhor nope same thing happens if I remove those 3 lines.

Fossil01, Dec 18 '19 21:12

Thanks for your quick answer, so it's definitely related to the XML string returned by mediainfo. This looks like an acceptable solution to me. I will try to implement it as soon as possible, but if you want, feel free to open a pull request with your solution and I will be happy to review it.

mhor, Dec 18 '19 22:12

I'll have a crack at it after Christmas. Cheers.

Fossil01, Dec 25 '19 12:12

@Fossil01 have you tested the fix I made (PR #93)?

mhor, Jan 23 '20 15:01

Completely forgot about this. It seems to work now.

Fossil01, Apr 12 '20 12:04

Looks like I am still having this issue.

ErrorException

  simplexml_load_string(): Entity: line 10: parser error : Char 0xFFFE out of allowed range

  at vendor/mhor/php-mediainfo/src/Parser/AbstractXmlOutputParser.php:18
    14|         if (mb_detect_encoding($xmlString, 'UTF-8', true) === false) {
    15|             $xmlString = utf8_encode($xmlString);
    16|         }
    17|
  > 18|         $xml = simplexml_load_string($xmlString);
    19|         $json = json_encode($xml);
    20|
    21|         return json_decode($json, true);

Maybe we can use a function like this to strip out invalid chars: https://stackoverflow.com/a/3466049

Fossil01, Oct 13 '21 08:10

Aha. It looks like https://github.com/mhor/php-mediainfo/commit/aca11985672076caac7e5e683c668862b0bf92a4 never made it into the master branch, and thus never into a release.

When I add these lines it seems to fix the issue too:

$xmlString = preg_replace(
    '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
    "\xEF\xBF\xBD",
    $xmlString
);
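
For anyone wanting to try this outside the vendor directory, here is a minimal sketch that wraps the pattern above in a helper and parses the result. The function name `sanitizeXmlString` and the sample document are hypothetical, not part of php-mediainfo. The pattern works on raw bytes (no `/u` flag) and replaces C0 control characters, UTF-8 encoded surrogates, and U+FFFE/U+FFFF with U+FFFD.

```php
<?php
/**
 * Hypothetical helper: replace byte sequences that XML 1.0 forbids
 * with U+FFFD, using the pattern quoted in this thread.
 */
function sanitizeXmlString(string $xmlString): string
{
    $clean = preg_replace(
        '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
        "\xEF\xBF\xBD",
        $xmlString
    );

    // preg_replace() returns null on failure; fall back to the input.
    return $clean ?? $xmlString;
}

// Sample input with U+FFFE (EF BF BE) and a control byte (0x01).
$raw = "<Mediainfo><Copyright>\xEF\xBF\xBE\x01Ron Harris</Copyright></Mediainfo>";

$xml = simplexml_load_string(sanitizeXmlString($raw));
echo (string) $xml->Copyright, PHP_EOL;
```

Note that this only makes the document parseable: each offending sequence becomes a replacement character, so the original field value is not recovered.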

Fossil01, Oct 13 '21 08:10

XML it fails on currently:

<?xml version="1.0" encoding="UTF-8"?>
<Mediainfo version="20.03">
<File>
<track type="General">
<Count>331</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>General</Kind_of_stream>
<Kind_of_stream>General</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<Complete_name>/mnt/ramdisk/1/15f4594a-c211-4acc-9f58-cae2b09c8151/160095_[ Kuro Ookami ] Pet Life [ISO DVD-RIP 1920x1080 x264 10bits AC-3] [69A9399D].mkv</Complete_name>
<Folder_name>/mnt/ramdisk/1/15f4594a-c211-4acc-9f58-cae2b09c8151</Folder_name>
<File_name_extension>160095_[ Kuro Ookami ] Pet Life [ISO DVD-RIP 1920x1080 x264 10bits AC-3] [69A9399D].mkv</File_name_extension>
<File_name>160095_[ Kuro Ookami ] Pet Life [ISO DVD-RIP 1920x1080 x264 10bits AC-3] [69A9399D]</File_name>
<File_extension>mkv</File_extension>
<File_size>1048394</File_size>
<File_size>1 024 KiB</File_size>
<File_size>1 024 KiB</File_size>
<File_size>1 024 KiB</File_size>
<File_size>1 024 KiB</File_size>
<File_size>1 023.8 KiB</File_size>
<Stream_size>1048394</Stream_size>
<Stream_size>1 024 KiB (100%)</Stream_size>
<Stream_size>1 024 KiB</Stream_size>
<Stream_size>1 024 KiB</Stream_size>
<Stream_size>1 024 KiB</Stream_size>
<Stream_size>1 023.8 KiB</Stream_size>
<Stream_size>1 024 KiB (100%)</Stream_size>
<Proportion_of_this_stream>1.00000</Proportion_of_this_stream>
<File_last_modification_date>UTC 2021-10-13 08:47:31</File_last_modification_date>
<File_last_modification_date__local_>2021-10-13 10:47:31</File_last_modification_date__local_>
</track>
</File>
</Mediainfo>

Fossil01, Oct 13 '21 09:10

@Fossil01 Oops, I don't know why I never merged #93. I've opened a new PR (#128). Could you check if it fixes the bug?

mhor, Oct 24 '21 20:10

I'll have a look this week, thanks. In the meantime I manually edited the file in the vendor dir and added the preg_replace I pasted here before as an ugly temp fix :-)

Fossil01, Oct 25 '21 07:10

Closed for now, due to inactivity.

mhor, Jun 20 '23 20:06