nbformat
Clarify the role of utf-8
In several places nbformat seems to choose utf-8 as the default encoding (in particular when read or write get filenames as input).
Does that mean that notebook files must be utf-8 encoded? If so, it would be worth stating this explicitly.
See https://github.com/jupyter/jupyter-sphinx/pull/125 for an example context where this matters.
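To make the question concrete, here is a minimal sketch of writing and reading notebook-style JSON with the encoding passed explicitly. It uses only the standard library, not nbformat itself. The helper names `write_notebook_json` and `read_notebook_json` and the tiny notebook dict are hypothetical, for illustration only.

```python
import json
import os
import tempfile

# Hypothetical, minimal notebook-like document (not a fully validated notebook).
notebook = {
    "cells": [{"cell_type": "markdown", "metadata": {}, "source": ["Café ☕"]}],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}


def write_notebook_json(path, nb):
    # Passing encoding="utf-8" keeps the result independent of the platform locale.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nb, f, ensure_ascii=False, indent=1)


def read_notebook_json(path):
    # Read back with the same explicit encoding.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.ipynb")
    write_notebook_json(path, notebook)
    roundtripped = read_notebook_json(path)
    # Keep the raw bytes to confirm the non-ASCII text is stored as UTF-8.
    with open(path, "rb") as f:
        raw = f.read()
```

Without the explicit `encoding` argument, `open()` may fall back to a locale-dependent default, which is where an undocumented encoding assumption can cause trouble.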
Yes, most of the toolchains around ipynb files assume UTF-8 encoding. Good point that this isn't documented.
The spec for JSON essentially says it's stored in UTF-8 unless a specific 'closed ecosystem' needs something different:
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8
But it doesn't hurt to make it explicit.
Hi, I'd like to try to work on stating explicitly that UTF-8 encoding must be used. Where should this information be stated, in the nbformat documentation? This would be my first attempt at a contribution to code, so please let me know if there is anything else I need to consider. Thanks!
Since it's a statement about the overall format, I'd say it belongs close to the statement that the file is JSON, so perhaps around here: https://nbformat.readthedocs.io/en/latest/format_description.html
That's my take, the maintainers might have a different opinion though.
To quote in full from the stdlib json module docs: https://docs.python.org/3/library/json.html#character-encodings
Standard Compliance and Interoperability
----------------------------------------
The JSON format is specified by :rfc:`7159` and by
`ECMA-404 <http://www.ecma-international.org/publications/standards/Ecma-404.htm>`_.
This section details this module's level of compliance with the RFC.
For simplicity, :class:`JSONEncoder` and :class:`JSONDecoder` subclasses, and
parameters other than those explicitly mentioned, are not considered.
This module does not comply with the RFC in a strict fashion, implementing some
extensions that are valid JavaScript but not valid JSON. In particular:
- Infinite and NaN number values are accepted and output;
- Repeated names within an object are accepted, and only the value of the last
  name-value pair is used.
Since the RFC permits RFC-compliant parsers to accept input texts that are not
RFC-compliant, this module's deserializer is technically RFC-compliant under
default settings.
Character Encodings
^^^^^^^^^^^^^^^^^^^
The RFC requires that JSON be represented using either UTF-8, UTF-16, or
UTF-32, with UTF-8 being the recommended default for maximum interoperability.
As permitted, though not required, by the RFC, this module's serializer sets
*ensure_ascii=True* by default, thus escaping the output so that the resulting
strings only contain ASCII characters.
Other than the *ensure_ascii* parameter, this module is defined strictly in
terms of conversion between Python objects and
:class:`Unicode strings <str>`, and thus does not otherwise directly address
the issue of character encodings.
The RFC prohibits adding a byte order mark (BOM) to the start of a JSON text,
and this module's serializer does not add a BOM to its output.
The RFC permits, but does not require, JSON deserializers to ignore an initial
BOM in their input. This module's deserializer raises a :exc:`ValueError`
when an initial BOM is present.
The RFC does not explicitly forbid JSON strings which contain byte sequences
that don't correspond to valid Unicode characters (e.g. unpaired UTF-16
surrogates), but it does note that they may cause interoperability problems.
By default, this module accepts and outputs (when present in the original
:class:`str`) code points for such sequences.
Are you suggesting that the nbformat spec needs to say:
- UTF-8 only
- No BOM: Byte Order Mark
- ensure_ascii=yes|no
What use cases would this unnecessarily impede?
Is this solving for an actual current problem?
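The three options in the list above can be illustrated with the standard library alone. This is a sketch of how Python's `json` module behaves, not a statement of what nbformat does or should require. The sample dict is made up for illustration.

```python
import codecs
import json

# Made-up sample data containing non-ASCII characters.
data = {"source": "naïve ☃"}

# ensure_ascii=True is the default: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(data)
# ensure_ascii=False keeps the characters as-is in the resulting str.
literal = json.dumps(data, ensure_ascii=False)

# Both forms decode to the same Python object.
same = json.loads(escaped) == json.loads(literal) == data

# A str that begins with a BOM (U+FEFF) is rejected by json.loads.
try:
    json.loads("\ufeff" + literal)
    bom_rejected = False
except ValueError:
    bom_rejected = True

# Decoding the bytes with "utf-8-sig" strips a leading BOM before parsing.
bom_bytes = codecs.BOM_UTF8 + literal.encode("utf-8")
recovered = json.loads(bom_bytes.decode("utf-8-sig"))
```

So `ensure_ascii` only changes how the text is spelled, not what it decodes to, while a leading BOM is something a reader has to handle deliberately.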
I suggest stating that some particular encoding (like UTF-8) is preferred or recommended, or is the default.
JSON already specifies UTF-8, and Python implementations of JSON default to it.
Nbformat has not needed to specify UTF-8.
Why does nbformat need to choose the Unicode character representation, and which existing .ipynb notebook files would be affected, and how?
On Sun, Nov 12, 2023, 9:58 AM sls1005 wrote:
I suggest stating that some particular encoding (like UTF-8) is preferred or recommended, or is the default.
To help people understand how to generate or decode this kind of file.
It would not affect the files themselves, only the people who use or will use them.