nb-clean Option to preserve tag metadata

The metadata element in a notebook may include "reserved" tags list data that may be used to to modify the display of a cell in a live notebook UI or when rendering the notebook using a tool such as Jupyter Book.

It would be useful to be able to clear all metadata except the reserved tags element when cleaning notebooks.

Jan 10 '22 15:01 psychemedia

Hi @psychemedia, can you give the example notebook and the preferred outputs? See the test notebook here for references.

Aug 29 '22 04:08 yasirroni

The nbformat docs identify the list of cell tags that are conventionally defined:

Key	Value	Interpretation
collapsed	bool	Whether the cell's output container should be collapsed
scrolled	bool or 'auto'	Whether the cell's output is scrolled, unscrolled, or autoscrolled
deletable	bool	If False, prevent deletion of the cell
editable	bool	If False, prevent editing of the cell (by definition, this also prevents deleting the cell)
format	'mime/type'	The mime-type of a {ref}`Raw NBConvert Cell <raw nbconvert cells>`
name	str	A name for the cell. Should be unique across the notebook. Uniqueness must be verified outside of the json schema.
tags	list of str	A list of string tags on the cell. Commas are not allowed in a tag
jupyter	dict	A namespace holding jupyter specific fields. See docs below for more details
execution	dict	A namespace holding execution specific fields. See docs below for more details

I'll PR a test notebook with such examples.

Aug 30 '22 10:08 psychemedia

I'm potentially interested in making a PR for this feature. nb-clean would be a welcome addition to our pre-commit config, but we'd like to keep the tags. Other metadata should be removed in our use case.

Before giving it a try, it would be good to discuss the command-line interface. I'm assuming backward compatibility is important. It's not entirely trivial because the -m option is currently a binary switch, while preserving some conventional metadata is blurring that binary line a bit.

The following would work for our use case, and offers a lot of flexibility:

import argparse

class MetadataAction(argparse.Action):
    def __init__(self, option_strings, dest, nargs="*", **kwargs):
        if nargs != "*":
            raise ValueError("nargs can only be *")
        super().__init__(option_strings, dest, nargs=nargs, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        # print('%r %r %r' % (namespace, values, option_string))
        if len(values) == 0:
            setattr(namespace, self.dest, ["__all__"])
        else:
            setattr(namespace, self.dest, values)

parser = argparse.ArgumentParser()
parser.add_argument(
    "-m",
    "--preserve-cell-metadata",
    action=MetadataAction,
    default=["__none__"],
    help="preserve cell metadata",
)
print(parser.parse_args([]))
print(parser.parse_args("-m".split()))
print(parser.parse_args("-m tags".split()))  # This is what we would like.
print(parser.parse_args("-m tags format".split()))
print(parser.parse_args("-m __all__".split()))
print(parser.parse_args("-m __nbformat__".split()))
print(parser.parse_args("-m __nbformat__ foobar".split()))

The output of this snippet is

Namespace(preserve_cell_metadata=['__none__'])
Namespace(preserve_cell_metadata=['__all__'])
Namespace(preserve_cell_metadata=['tags'])
Namespace(preserve_cell_metadata=['tags', 'format'])
Namespace(preserve_cell_metadata=['__all__'])
Namespace(preserve_cell_metadata=['__nbformat__'])
Namespace(preserve_cell_metadata=['__nbformat__', 'foobar'])

The dunder options should be treated as special cases:

__all__ preserves all metadata
__nbformat__ preserves those specified in the nbformat documentation.

More special cases may be added if relevant. The only limitation of this approach is that preserving specifically metadata fields with names __all__ or __nbformat__ would not work. I don't believe these likely to ever appear, though.

One may overcome the previous limitation by dropping support for the special case __nbformat__ and by replacing ["__all__"] by "__all__" or another constant that is not a list of strings. Preserving all nbformat metadata would then require listing them as -m collapsed scrolled deletable editable format name tags jupyter execution. Preserving any metadata would then only be possible with a bare -m. I'd actually prefer this approach because it makes the CLI easier to understand.

Any thoughts?

Oct 10 '22 20:10 tovrstra

P.S. The simpler version (without __nbformat__) can be implemented with less code, which may be useful:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "-m",
    "--preserve-cell-metadata",
    default=None,
    nargs="*",
    help="preserve cell metadata",
)
print(parser.parse_args([]))
print(parser.parse_args("-m".split()))
print(parser.parse_args("-m tags".split()))
print(parser.parse_args("-m tags format".split()))
print(parser.parse_args("-m __all__".split()))
print(parser.parse_args("-m __nbformat__".split()))
print(parser.parse_args("-m __nbformat__ foobar".split()))

Output:

Namespace(preserve_cell_metadata=None)
Namespace(preserve_cell_metadata=[])
Namespace(preserve_cell_metadata=['tags'])
Namespace(preserve_cell_metadata=['tags', 'format'])
Namespace(preserve_cell_metadata=['__all__'])
Namespace(preserve_cell_metadata=['__nbformat__'])
Namespace(preserve_cell_metadata=['__nbformat__', 'foobar'])

Three possible cases:

None: preserve no metadata
[]: preserve all metadata
a list of keywords: preserve only these.

Oct 10 '22 20:10 tovrstra

I think the ability to preserve the __nbformat__ tags is useful, eg in cases where you are running notebooks against extensions that exploited documented tags/metadata attributes but aren't necessarily familiar with the mechanics of those extensions.

The alternative would be to make it very clear in nb-clean documentation how to add all the __nbformat__ tags, eg with an example -m extended command.

Oct 11 '22 08:10 psychemedia

What out the following idea? We could add, in addition to the simple approach for -m proposed above, a second option:

parser.add_argument(
    "-M",
    "--preserve-nbformat-cell-metadata",
    default=False,
    action="Store True",
    help="preserve cell metadata specified in nbformat",
)

When this is switched on, the list of metadata fields to preserve is extended with ["collapsed", "scrolled", "deletable", "editable", "format", "name", "tags", "jupyter", "execution"], with duplicates removed.

This make the following possible:

# do not preserve cell metadata
nb-clean clean example.ipynb

# preserve all cell metadata
nb-clean clean example.ipynb -m

# preserve nbformat cell metadata
nb-clean clean example.ipynb -M

# preserve only "tags" field in the cell metadata
nb-clean clean example.ipynb -m tags

# preserve "myspecialfield", "otherspecial" and nbformat cell metadata
nb-clean clean example.ipynb -M -m myspecialfield otherspecial

One possible variation on this theme is to require one argument to -M, e.g. a class of metadata fields to preserve. (The long version should then be renamed, e.g. --preserve-cell-metadata-preset.) This may seem a bit far-fetched, but it could be potentially useful. One would then have to replace -M by -M nbformat to achieve the same effect. Other possibilities could then be -M deepnote, etc. Even if nbformat is the only one supported initially, it would make -M more future-proof.

Oct 14 '22 07:10 tovrstra

I think, it all need to be started with an example. The original dirty file and the new clean file.

Oct 17 '22 02:10 yasirroni

I've made a few examples for testing PR #159:

dirty was extended with a few extra metadata fields
clean_with_tags_metadata.ipynb is a clean version, with only the tags metadata preserved.
clean_with_tags_special_metadata.ipynb is a clean version, with only the tags and special metadata preserved.

These are based on what would be useful for my use case, but feel free to add other perspectives.

Oct 17 '22 07:10 tovrstra

Based on the example, I vote:

-m behave as current implementation, preserve all cell metadata -m [list] to preserve only specified cell metadata -M lets not use it for now, and not making nbformat tags any special. We can refer user to a doc of best practices while working on nbformat tags.

Oct 17 '22 21:10 yasirroni

I also believe that is the best way forward.

The -M is not critical, and if later it turns out to be worth the extra code, it can still be added. For that, it would be good to hear from cases where -M would be useful.

For my usage, -M would not be attractive, because I'd like to use nb-clean with pre-commit, to remove as much as possible. Only the metadata that we consciously put in, as part of the writing process, e.g. tags and slideshow, are worth keeping. All the other fields only cause noise in git diff.

Oct 18 '22 18:10 tovrstra

@yasirroni PR #159 essentially implements the -m option as you suggest. (I guess our thoughts were pretty much aligned, even before discussing it.) Would you be ok reviewing it, provided you have time for it, obviously?

I realize it is not a small PR. I tried to cover everything, including unit tests, making poetry pass, and updating documentation. This all resulted in a significant number of changes.

Oct 26 '22 17:10 tovrstra

@yasirroni PR #159 essentially implements the -m option as you suggest. (I guess our thoughts were pretty much aligned, even before discussing it.) Would you be ok reviewing it, provided you have time for it, obviously?

I realize it is not a small PR. I tried to cover everything, including unit tests, making poetry pass, and updating documentation. This all resulted in a significant number of changes.

Sadly, I'm not the maintainer, just a regular contributor like you. Hope that @srstevenson got the time to read the PR.

Oct 27 '22 10:10 yasirroni

Thanks all for your patience whilst I got to this. I've read the discussion above and agree with the consensus here to support an optional list of metadata fields passed to --preserve-cell-metadata. I'll take a look at #159 and we can carry that through!

Oct 30 '22 10:10 srstevenson

If https://github.com/srstevenson/nb-clean/pull/159 was merged and released in 2.4.0, should this issue be closed as completed?

Apr 20 '23 07:04 jamesbraza

Unless, someone feels the need to implement the -M idea above, this issue can be closed. In #159, only -m was implemented because that seemed sufficient.

Personally, I don't think the -M option has sufficient added value for the extra complexity it adds. It is potentially also fragile, as one has to keep it in sync with nbformat. I also don't see a great use case for it.

Apr 20 '23 08:04 tovrstra

nb-clean nb-clean copied to clipboard

Option to preserve tag metadata

nb-clean
nb-clean copied to clipboard