nb-clean icon indicating copy to clipboard operation
nb-clean copied to clipboard

Option to preserve tag metadata

Open psychemedia opened this issue 3 years ago • 2 comments

The metadata element in a notebook may include "reserved" tags list data that may be used to to modify the display of a cell in a live notebook UI or when rendering the notebook using a tool such as Jupyter Book.

It would be useful to be able to clear all metadata except the reserved tags element when cleaning notebooks.

psychemedia avatar Jan 10 '22 15:01 psychemedia

Hi @psychemedia, can you give the example notebook and the preferred outputs? See the test notebook here for references.

yasirroni avatar Aug 29 '22 04:08 yasirroni

The nbformat docs identify the list of cell tags that are conventionally defined:

Key Value Interpretation
collapsed bool Whether the cell's output container should be collapsed
scrolled bool or 'auto' Whether the cell's output is scrolled, unscrolled, or autoscrolled
deletable bool If False, prevent deletion of the cell
editable bool If False, prevent editing of the cell (by definition, this also prevents deleting the cell)
format 'mime/type' The mime-type of a {ref}Raw NBConvert Cell <raw nbconvert cells>
name str A name for the cell. Should be unique across the notebook. Uniqueness must be verified outside of the json schema.
tags list of str A list of string tags on the cell. Commas are not allowed in a tag
jupyter dict A namespace holding jupyter specific fields. See docs below for more details
execution dict A namespace holding execution specific fields. See docs below for more details

I'll PR a test notebook with such examples.

psychemedia avatar Aug 30 '22 10:08 psychemedia

I'm potentially interested in making a PR for this feature. nb-clean would be a welcome addition to our pre-commit config, but we'd like to keep the tags. Other metadata should be removed in our use case.

Before giving it a try, it would be good to discuss the command-line interface. I'm assuming backward compatibility is important. It's not entirely trivial because the -m option is currently a binary switch, while preserving some conventional metadata is blurring that binary line a bit.

The following would work for our use case, and offers a lot of flexibility:

import argparse

class MetadataAction(argparse.Action):
    def __init__(self, option_strings, dest, nargs="*", **kwargs):
        if nargs != "*":
            raise ValueError("nargs can only be *")
        super().__init__(option_strings, dest, nargs=nargs, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        # print('%r %r %r' % (namespace, values, option_string))
        if len(values) == 0:
            setattr(namespace, self.dest, ["__all__"])
        else:
            setattr(namespace, self.dest, values)

parser = argparse.ArgumentParser()
parser.add_argument(
    "-m",
    "--preserve-cell-metadata",
    action=MetadataAction,
    default=["__none__"],
    help="preserve cell metadata",
)
print(parser.parse_args([]))
print(parser.parse_args("-m".split()))
print(parser.parse_args("-m tags".split()))  # This is what we would like.
print(parser.parse_args("-m tags format".split()))
print(parser.parse_args("-m __all__".split()))
print(parser.parse_args("-m __nbformat__".split()))
print(parser.parse_args("-m __nbformat__ foobar".split()))

The output of this snippet is

Namespace(preserve_cell_metadata=['__none__'])
Namespace(preserve_cell_metadata=['__all__'])
Namespace(preserve_cell_metadata=['tags'])
Namespace(preserve_cell_metadata=['tags', 'format'])
Namespace(preserve_cell_metadata=['__all__'])
Namespace(preserve_cell_metadata=['__nbformat__'])
Namespace(preserve_cell_metadata=['__nbformat__', 'foobar'])

The dunder options should be treated as special cases:

  • __all__ preserves all metadata
  • __nbformat__ preserves those specified in the nbformat documentation.

More special cases may be added if relevant. The only limitation of this approach is that preserving specifically metadata fields with names __all__ or __nbformat__ would not work. I don't believe these likely to ever appear, though.

One may overcome the previous limitation by dropping support for the special case __nbformat__ and by replacing ["__all__"] by "__all__" or another constant that is not a list of strings. Preserving all nbformat metadata would then require listing them as -m collapsed scrolled deletable editable format name tags jupyter execution. Preserving any metadata would then only be possible with a bare -m. I'd actually prefer this approach because it makes the CLI easier to understand.

Any thoughts?

tovrstra avatar Oct 10 '22 20:10 tovrstra

P.S. The simpler version (without __nbformat__) can be implemented with less code, which may be useful:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "-m",
    "--preserve-cell-metadata",
    default=None,
    nargs="*",
    help="preserve cell metadata",
)
print(parser.parse_args([]))
print(parser.parse_args("-m".split()))
print(parser.parse_args("-m tags".split()))
print(parser.parse_args("-m tags format".split()))
print(parser.parse_args("-m __all__".split()))
print(parser.parse_args("-m __nbformat__".split()))
print(parser.parse_args("-m __nbformat__ foobar".split()))

Output:

Namespace(preserve_cell_metadata=None)
Namespace(preserve_cell_metadata=[])
Namespace(preserve_cell_metadata=['tags'])
Namespace(preserve_cell_metadata=['tags', 'format'])
Namespace(preserve_cell_metadata=['__all__'])
Namespace(preserve_cell_metadata=['__nbformat__'])
Namespace(preserve_cell_metadata=['__nbformat__', 'foobar'])

Three possible cases:

  • None: preserve no metadata
  • []: preserve all metadata
  • a list of keywords: preserve only these.

tovrstra avatar Oct 10 '22 20:10 tovrstra

I think the ability to preserve the __nbformat__ tags is useful, eg in cases where you are running notebooks against extensions that exploited documented tags/metadata attributes but aren't necessarily familiar with the mechanics of those extensions.

The alternative would be to make it very clear in nb-clean documentation how to add all the __nbformat__ tags, eg with an example -m extended command.

psychemedia avatar Oct 11 '22 08:10 psychemedia

What out the following idea? We could add, in addition to the simple approach for -m proposed above, a second option:

parser.add_argument(
    "-M",
    "--preserve-nbformat-cell-metadata",
    default=False,
    action="Store True",
    help="preserve cell metadata specified in nbformat",
)

When this is switched on, the list of metadata fields to preserve is extended with ["collapsed", "scrolled", "deletable", "editable", "format", "name", "tags", "jupyter", "execution"], with duplicates removed.

This make the following possible:

# do not preserve cell metadata
nb-clean clean example.ipynb

# preserve all cell metadata
nb-clean clean example.ipynb -m

# preserve nbformat cell metadata
nb-clean clean example.ipynb -M

# preserve only "tags" field in the cell metadata
nb-clean clean example.ipynb -m tags

# preserve "myspecialfield", "otherspecial" and nbformat cell metadata
nb-clean clean example.ipynb -M -m myspecialfield otherspecial

One possible variation on this theme is to require one argument to -M, e.g. a class of metadata fields to preserve. (The long version should then be renamed, e.g. --preserve-cell-metadata-preset.) This may seem a bit far-fetched, but it could be potentially useful. One would then have to replace -M by -M nbformat to achieve the same effect. Other possibilities could then be -M deepnote, etc. Even if nbformat is the only one supported initially, it would make -M more future-proof.

tovrstra avatar Oct 14 '22 07:10 tovrstra

I think, it all need to be started with an example. The original dirty file and the new clean file.

yasirroni avatar Oct 17 '22 02:10 yasirroni

I've made a few examples for testing PR #159:

These are based on what would be useful for my use case, but feel free to add other perspectives.

tovrstra avatar Oct 17 '22 07:10 tovrstra

Based on the example, I vote:

-m behave as current implementation, preserve all cell metadata -m [list] to preserve only specified cell metadata -M lets not use it for now, and not making nbformat tags any special. We can refer user to a doc of best practices while working on nbformat tags.

yasirroni avatar Oct 17 '22 21:10 yasirroni

I also believe that is the best way forward.

The -M is not critical, and if later it turns out to be worth the extra code, it can still be added. For that, it would be good to hear from cases where -M would be useful.

For my usage, -M would not be attractive, because I'd like to use nb-clean with pre-commit, to remove as much as possible. Only the metadata that we consciously put in, as part of the writing process, e.g. tags and slideshow, are worth keeping. All the other fields only cause noise in git diff.

tovrstra avatar Oct 18 '22 18:10 tovrstra

@yasirroni PR #159 essentially implements the -m option as you suggest. (I guess our thoughts were pretty much aligned, even before discussing it.) Would you be ok reviewing it, provided you have time for it, obviously?

I realize it is not a small PR. I tried to cover everything, including unit tests, making poetry pass, and updating documentation. This all resulted in a significant number of changes.

tovrstra avatar Oct 26 '22 17:10 tovrstra

@yasirroni PR #159 essentially implements the -m option as you suggest. (I guess our thoughts were pretty much aligned, even before discussing it.) Would you be ok reviewing it, provided you have time for it, obviously?

I realize it is not a small PR. I tried to cover everything, including unit tests, making poetry pass, and updating documentation. This all resulted in a significant number of changes.

Sadly, I'm not the maintainer, just a regular contributor like you. Hope that @srstevenson got the time to read the PR.

yasirroni avatar Oct 27 '22 10:10 yasirroni

Thanks all for your patience whilst I got to this. I've read the discussion above and agree with the consensus here to support an optional list of metadata fields passed to --preserve-cell-metadata. I'll take a look at #159 and we can carry that through!

srstevenson avatar Oct 30 '22 10:10 srstevenson

If https://github.com/srstevenson/nb-clean/pull/159 was merged and released in 2.4.0, should this issue be closed as completed?

jamesbraza avatar Apr 20 '23 07:04 jamesbraza

Unless, someone feels the need to implement the -M idea above, this issue can be closed. In #159, only -m was implemented because that seemed sufficient.

Personally, I don't think the -M option has sufficient added value for the extra complexity it adds. It is potentially also fragile, as one has to keep it in sync with nbformat. I also don't see a great use case for it.

tovrstra avatar Apr 20 '23 08:04 tovrstra