nb-clean
nb-clean copied to clipboard
Option to preserve tag metadata
The metadata element in a notebook may include "reserved" tags list data that may be used to to modify the display of a cell in a live notebook UI or when rendering the notebook using a tool such as Jupyter Book.
It would be useful to be able to clear all metadata except the reserved tags element when cleaning notebooks.
Hi @psychemedia, can you give the example notebook and the preferred outputs? See the test notebook here for references.
The nbformat docs identify the list of cell tags that are conventionally defined:
| Key | Value | Interpretation |
|---|---|---|
| collapsed | bool | Whether the cell's output container should be collapsed |
| scrolled | bool or 'auto' | Whether the cell's output is scrolled, unscrolled, or autoscrolled |
| deletable | bool | If False, prevent deletion of the cell |
| editable | bool | If False, prevent editing of the cell (by definition, this also prevents deleting the cell) |
| format | 'mime/type' | The mime-type of a {ref}Raw NBConvert Cell <raw nbconvert cells> |
| name | str | A name for the cell. Should be unique across the notebook. Uniqueness must be verified outside of the json schema. |
| tags | list of str | A list of string tags on the cell. Commas are not allowed in a tag |
| jupyter | dict | A namespace holding jupyter specific fields. See docs below for more details |
| execution | dict | A namespace holding execution specific fields. See docs below for more details |
I'll PR a test notebook with such examples.
I'm potentially interested in making a PR for this feature. nb-clean would be a welcome addition to our pre-commit config, but we'd like to keep the tags. Other metadata should be removed in our use case.
Before giving it a try, it would be good to discuss the command-line interface. I'm assuming backward compatibility is important. It's not entirely trivial because the -m option is currently a binary switch, while preserving some conventional metadata is blurring that binary line a bit.
The following would work for our use case, and offers a lot of flexibility:
import argparse
class MetadataAction(argparse.Action):
def __init__(self, option_strings, dest, nargs="*", **kwargs):
if nargs != "*":
raise ValueError("nargs can only be *")
super().__init__(option_strings, dest, nargs=nargs, **kwargs)
def __call__(self, parser, namespace, values, option_string=None):
# print('%r %r %r' % (namespace, values, option_string))
if len(values) == 0:
setattr(namespace, self.dest, ["__all__"])
else:
setattr(namespace, self.dest, values)
parser = argparse.ArgumentParser()
parser.add_argument(
"-m",
"--preserve-cell-metadata",
action=MetadataAction,
default=["__none__"],
help="preserve cell metadata",
)
print(parser.parse_args([]))
print(parser.parse_args("-m".split()))
print(parser.parse_args("-m tags".split())) # This is what we would like.
print(parser.parse_args("-m tags format".split()))
print(parser.parse_args("-m __all__".split()))
print(parser.parse_args("-m __nbformat__".split()))
print(parser.parse_args("-m __nbformat__ foobar".split()))
The output of this snippet is
Namespace(preserve_cell_metadata=['__none__'])
Namespace(preserve_cell_metadata=['__all__'])
Namespace(preserve_cell_metadata=['tags'])
Namespace(preserve_cell_metadata=['tags', 'format'])
Namespace(preserve_cell_metadata=['__all__'])
Namespace(preserve_cell_metadata=['__nbformat__'])
Namespace(preserve_cell_metadata=['__nbformat__', 'foobar'])
The dunder options should be treated as special cases:
__all__preserves all metadata__nbformat__preserves those specified in thenbformatdocumentation.
More special cases may be added if relevant. The only limitation of this approach is that preserving specifically metadata fields with names __all__ or __nbformat__ would not work. I don't believe these likely to ever appear, though.
One may overcome the previous limitation by dropping support for the special case __nbformat__ and by replacing ["__all__"] by "__all__" or another constant that is not a list of strings. Preserving all nbformat metadata would then require listing them as -m collapsed scrolled deletable editable format name tags jupyter execution. Preserving any metadata would then only be possible with a bare -m. I'd actually prefer this approach because it makes the CLI easier to understand.
Any thoughts?
P.S. The simpler version (without __nbformat__) can be implemented with less code, which may be useful:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument(
"-m",
"--preserve-cell-metadata",
default=None,
nargs="*",
help="preserve cell metadata",
)
print(parser.parse_args([]))
print(parser.parse_args("-m".split()))
print(parser.parse_args("-m tags".split()))
print(parser.parse_args("-m tags format".split()))
print(parser.parse_args("-m __all__".split()))
print(parser.parse_args("-m __nbformat__".split()))
print(parser.parse_args("-m __nbformat__ foobar".split()))
Output:
Namespace(preserve_cell_metadata=None)
Namespace(preserve_cell_metadata=[])
Namespace(preserve_cell_metadata=['tags'])
Namespace(preserve_cell_metadata=['tags', 'format'])
Namespace(preserve_cell_metadata=['__all__'])
Namespace(preserve_cell_metadata=['__nbformat__'])
Namespace(preserve_cell_metadata=['__nbformat__', 'foobar'])
Three possible cases:
None: preserve no metadata[]: preserve all metadata- a list of keywords: preserve only these.
I think the ability to preserve the __nbformat__ tags is useful, eg in cases where you are running notebooks against extensions that exploited documented tags/metadata attributes but aren't necessarily familiar with the mechanics of those extensions.
The alternative would be to make it very clear in nb-clean documentation how to add all the __nbformat__ tags, eg with an example -m extended command.
What out the following idea? We could add, in addition to the simple approach for -m proposed above, a second option:
parser.add_argument(
"-M",
"--preserve-nbformat-cell-metadata",
default=False,
action="Store True",
help="preserve cell metadata specified in nbformat",
)
When this is switched on, the list of metadata fields to preserve is extended with ["collapsed", "scrolled", "deletable", "editable", "format", "name", "tags", "jupyter", "execution"], with duplicates removed.
This make the following possible:
# do not preserve cell metadata
nb-clean clean example.ipynb
# preserve all cell metadata
nb-clean clean example.ipynb -m
# preserve nbformat cell metadata
nb-clean clean example.ipynb -M
# preserve only "tags" field in the cell metadata
nb-clean clean example.ipynb -m tags
# preserve "myspecialfield", "otherspecial" and nbformat cell metadata
nb-clean clean example.ipynb -M -m myspecialfield otherspecial
One possible variation on this theme is to require one argument to -M, e.g. a class of metadata fields to preserve. (The long version should then be renamed, e.g. --preserve-cell-metadata-preset.) This may seem a bit far-fetched, but it could be potentially useful. One would then have to replace -M by -M nbformat to achieve the same effect. Other possibilities could then be -M deepnote, etc. Even if nbformat is the only one supported initially, it would make -M more future-proof.
I think, it all need to be started with an example. The original dirty file and the new clean file.
I've made a few examples for testing PR #159:
dirtywas extended with a few extra metadata fieldsclean_with_tags_metadata.ipynbis a clean version, with only thetagsmetadata preserved.clean_with_tags_special_metadata.ipynbis a clean version, with only thetagsandspecialmetadata preserved.
These are based on what would be useful for my use case, but feel free to add other perspectives.
Based on the example, I vote:
-m behave as current implementation, preserve all cell metadata
-m [list] to preserve only specified cell metadata
-M lets not use it for now, and not making nbformat tags any special. We can refer user to a doc of best practices while working on nbformat tags.
I also believe that is the best way forward.
The -M is not critical, and if later it turns out to be worth the extra code, it can still be added. For that, it would be good to hear from cases where -M would be useful.
For my usage, -M would not be attractive, because I'd like to use nb-clean with pre-commit, to remove as much as possible. Only the metadata that we consciously put in, as part of the writing process, e.g. tags and slideshow, are worth keeping. All the other fields only cause noise in git diff.
@yasirroni PR #159 essentially implements the -m option as you suggest. (I guess our thoughts were pretty much aligned, even before discussing it.) Would you be ok reviewing it, provided you have time for it, obviously?
I realize it is not a small PR. I tried to cover everything, including unit tests, making poetry pass, and updating documentation. This all resulted in a significant number of changes.
@yasirroni PR #159 essentially implements the
-moption as you suggest. (I guess our thoughts were pretty much aligned, even before discussing it.) Would you be ok reviewing it, provided you have time for it, obviously?I realize it is not a small PR. I tried to cover everything, including unit tests, making poetry pass, and updating documentation. This all resulted in a significant number of changes.
Sadly, I'm not the maintainer, just a regular contributor like you. Hope that @srstevenson got the time to read the PR.
Thanks all for your patience whilst I got to this. I've read the discussion above and agree with the consensus here to support an optional list of metadata fields passed to --preserve-cell-metadata. I'll take a look at #159 and we can carry that through!
If https://github.com/srstevenson/nb-clean/pull/159 was merged and released in 2.4.0, should this issue be closed as completed?
Unless, someone feels the need to implement the -M idea above, this issue can be closed. In #159, only -m was implemented because that seemed sufficient.
Personally, I don't think the -M option has sufficient added value for the extra complexity it adds. It is potentially also fragile, as one has to keep it in sync with nbformat. I also don't see a great use case for it.