ebu-tt-live-toolkit
Occasional encoding errors
Using the EBU-TT-D Encoder I'm occasionally getting Unicode errors like:
Unhandled Error
Traceback (most recent call last):
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/twisted/python/log.py", line 103, in callWithLogger
return callWithContext({"system": lp}, func, *args, **kw)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/twisted/python/log.py", line 86, in callWithContext
return context.call({ILogContext: newCtx}, func, *args, **kw)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/twisted/python/context.py", line 122, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/twisted/python/context.py", line 85, in callWithContext
return func(*args,**kw)
--- <exception caught here> ---
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/twisted/internet/selectreactor.py", line 149, in _doReadOrWrite
why = getattr(selectable, method)()
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/twisted/internet/tcp.py", line 208, in doRead
return self._dataReceived(data)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/twisted/internet/tcp.py", line 214, in _dataReceived
rval = self.protocol.dataReceived(data)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/twisted/websocket.py", line 131, in dataReceived
self._dataReceived(data)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/websocket/protocol.py", line 1175, in _dataReceived
self.consumeData()
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/websocket/protocol.py", line 1187, in consumeData
while self.processData() and self.state != WebSocketProtocol.STATE_CLOSED:
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/websocket/protocol.py", line 1553, in processData
fr = self.onFrameEnd()
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/websocket/protocol.py", line 1674, in onFrameEnd
self._onMessageEnd()
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/twisted/websocket.py", line 159, in _onMessageEnd
self.onMessageEnd()
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/websocket/protocol.py", line 627, in onMessageEnd
self._onMessage(payload, self.message_is_binary)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/twisted/websocket.py", line 162, in _onMessage
self.onMessage(payload, isBinary)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/twisted/websocket.py", line 362, in onMessage
self._write_to_consumer(payload, sequence_identifier=self._sequence_identifier)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/twisted/websocket.py", line 111, in _write_to_consumer
self.consumer.write(data, **kwargs)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/twisted/websocket.py", line 208, in write
self._custom_consumer.on_new_data(data, **kwargs)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/carriage/websocket.py", line 32, in on_new_data
self.consumer_node.process_document(data, **kwargs)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/adapters/node_carriage.py", line 174, in process_document
self.consumer_node.process_document(conv_doc, **new_kwargs)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/node/encoder.py", line 48, in process_document
self.producer_carriage.emit_data(data=converted_doc, sequence_identifier='default', time_base='media', **kwargs)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/adapters/node_carriage.py", line 116, in emit_data
self.producer_carriage.emit_data(conv_data, **new_kwargs)
File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/carriage/filesystem.py", line 158, in emit_data
destfile.write(data)
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 1523: ordinal not in range(128)
This is annoying. I don't know what's causing it, but there's probably an easy fix (though possibly a dangerous one): https://docs.python.org/2.7/howto/unicode.html#the-unicode-type suggests using codecs.open, and setting errors='ignore' will at least make the error go away...
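For illustration, here is a minimal sketch of why errors='ignore' is the "dangerous" option: it avoids the exception by silently discarding every character the codec cannot represent. The helper name encode_lossy is hypothetical and is not part of the toolkit.

```python
from __future__ import print_function


def encode_lossy(text, codec='ascii'):
    """Encode text, silently discarding characters the codec cannot represent."""
    return text.encode(codec, 'ignore')


# u'\xa3' is the pound sign reported in the traceback above.
price = u'\xa3100'
print(encode_lossy(price))  # the pound sign is dropped; only the digits survive
```

The output no longer contains the currency symbol, so a subtitle written this way would be wrong without any error being reported.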
Using codecs.open with errors='ignore' doesn't fix the issue: it still arises sometimes. I need to dig further into the content that triggers it and trace the error back to its source. It could be related to a specific feed and the way that feed is produced.
Needs to be re-reviewed in the context of Python3, where the issue may no longer arise.
The exception seems to be caused by characters that occupy more than a single byte in UTF-8, i.e. characters with a Unicode code point > 127 (outside the ASCII range), for example the German umlauts äöüÄÖÜ and the "sharp s" ß. I'm affected, too.
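The behaviour described above can be reproduced in isolation. The following is a small sketch (the helper try_encode is a hypothetical name) showing that code points above 127 cannot be encoded with the 'ascii' codec, while UTF-8 represents them using more than one byte.

```python
from __future__ import print_function


def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


# Pound sign (from the traceback), German sharp s, and a plain ASCII letter.
for ch in (u'\xa3', u'\xdf', u'a'):
    ascii_bytes = try_encode(ch, 'ascii')
    utf8_bytes = try_encode(ch, 'utf-8')
    print(repr(ch), 'ascii ok:', ascii_bytes is not None,
          'utf-8 length:', len(utf8_bytes))
```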
With codecs.open and encoding='utf-8' (taken from Python 2's Unicode HOWTO), tested with the filesystem output, the exception doesn't occur.
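A minimal sketch of that approach, assuming the data being written is Unicode text. The function name write_document, the sample string and the temporary path are all hypothetical; this is not the toolkit's actual filesystem carriage code.

```python
import codecs
import os
import shutil
import tempfile


def write_document(path, text):
    """Write Unicode text to path, encoding it as UTF-8 on the way out."""
    with codecs.open(path, 'w', encoding='utf-8') as destfile:
        destfile.write(text)


# Hypothetical document fragment containing the characters mentioned above.
sample = u'<tt:span>\xa3 \xe4\xf6\xfc\xdf</tt:span>'
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'sample.xml')
write_document(out_path, sample)

# Read the raw bytes back to confirm they are the UTF-8 encoding of the text.
with open(out_path, 'rb') as f:
    written = f.read()
shutil.rmtree(out_dir)
```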
Sounds promising @spoeschel, does this mean you can generate a test case? That would be great because even if we fix it for Python 2, we will also need to check that it still works in Python 3 when we migrate.
@spoeschel I made a comment in #484 a long, long time ago suggesting this was worth re-testing in Python 3. I don't know if Python 3 would work for you, but I've pushed a working Python 3 build to the release/3.0 branch; if you have a repeatable test case, would you be interested in trying that branch and seeing whether this bug is indeed resolved by moving to Python 3?
I haven't yet familiarised myself with the testing subsystem, but I will create a test case for this.
Testing with the Python 3 branch this issue indeed no longer occurs when using one of the German letters mentioned above.
However I get an exception when using the WebSocket output with the Python 3 branch (the WS input works), regardless of using any of the problematic letters or not. The filesystem output works though. I will have a look into that and probably open a new issue.
Thank you @spoeschel !
> With codecs.open and encoding='utf-8' (taken from Python 2's Unicode HOWTO), tested with the filesystem output, the exception doesn't occur.
It just turned out that this quick fix for the Python 2 branch only worked when I used the Resequencer. With the buffer-delay, the exception still occurs even though the UTF-8 encoding is set for writing the output file. So it seems that the Resequencer's processing somehow helps/sanitizes here, and that received documents cannot be forwarded to the output without such further processing without triggering the exception. So it is probably easiest to go the Python 3 way here.
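The thread does not establish why the buffer-delay path still fails. One possibility, which is an assumption and not confirmed here, is that documents forwarded without re-serialisation reach the writer as byte strings rather than Unicode text; in Python 2 a codecs writer implicitly decodes a byte string as ASCII before encoding it, which fails on non-ASCII bytes. Under that assumption, a sketch of normalising the data before writing could look like this (to_text and write_normalised are hypothetical names, not toolkit functions):

```python
import codecs
import os
import shutil
import tempfile


def to_text(data, encoding='utf-8'):
    """Return data as Unicode text, decoding byte strings with the given encoding."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def write_normalised(path, data):
    """Write data as UTF-8, whether it arrived as bytes or as Unicode text."""
    with codecs.open(path, 'w', encoding='utf-8') as destfile:
        destfile.write(to_text(data))


# The same hypothetical content, once as UTF-8 bytes and once as Unicode text.
out_dir = tempfile.mkdtemp()
results = []
for name, data in (('bytes.xml', b'\xc2\xa3100'), ('text.xml', u'\xa3100')):
    path = os.path.join(out_dir, name)
    write_normalised(path, data)
    with open(path, 'rb') as f:
        results.append(f.read())
shutil.rmtree(out_dir)
```

Both inputs end up as the same bytes on disk. Whether this matches what actually happens in the buffer-delay node would need checking against the real data flow.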
I think this is a strong argument for tying up the release/2.1.2 work, releasing it as our final Python2 release and moving all future work into release/3.0.
I agree; this makes more sense than fixing a complex issue for a Python version that will be deprecated very soon anyway.