Linting Markdown Cells in Jupyter Notebook ipynb documents

Open psychemedia opened this issue 2 years ago • 13 comments

Is there a simple API for running linting checks over multiple strings from the same document context within a Python script?

For example, I'm looking for a way to lint markdown in a Jupyter notebook.

In a Jupyter notebook .ipynb document, there are multiple "cells" which can be either markdown or code. It would be useful to be able to co-opt pymarkdown to run something of the form:

# Linter and parse are hypothetical names, sketched to show the desired API
for notebook in directory:
  linter = Linter(context=directory/notebook)
  
  for i, cell in enumerate(parse(notebook).cells):
    if cell.type == 'markdown':
      linter.lint(cell.content, i)

The document context would allow for things like detecting out-of-order headings across separate cells, whilst at the same time allowing each cell to be reported on separately and identifiably.

A workaround is to convert an .ipynb document to Markdown using something like jupytext and then just lint the resulting Markdown document, but this loses the ability to provide cell-level reports.

psychemedia avatar May 23 '22 13:05 psychemedia

Been thinking about this for a week or so while I work on dealing with container issues. I am good with Python, but not as much with Jupyter notebooks. Can you point me to a good definition that will help me understand your specific needs? (i.e. there is a lot to digest, and would like to do focused research).

jackdewinter avatar Jun 05 '22 23:06 jackdewinter

Jupyter notebook .ipynb format is a JSON structure that defines "cells" that can be of various content types, including code or markdown cells: https://nbformat.readthedocs.io/en/latest/format_description.html

The format can be converted into various other document types, including various flavours of enhanced markdown or python, using tools such as jupytext.

However, when the document is in an .ipynb format it is trivial to get access to the cells by type. It would be useful to be able to run pymarkdown over each markdown cell in a notebook separately and report on each cell separately.

That said, some tests would probably also need to run in a "global" / "all cell" context; for example, checks that header levels are correctly nested over the course of the notebook.

psychemedia avatar Jun 06 '22 11:06 psychemedia
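
A minimal sketch of the "access the cells by type" step described above, using only the standard library. The function name and the tiny in-memory notebook are illustrative assumptions, not part of pymarkdown or nbformat; the only facts relied on are that an .ipynb file is JSON with a top-level "cells" list and that a cell's "source" may be a string or a list of strings.

```python
import json


def markdown_cells(notebook_json):
    """Yield (cell_index, source_text) for each markdown cell, in file order."""
    notebook = json.loads(notebook_json)
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # The notebook format allows "source" to be a string or a list of strings.
        if isinstance(source, list):
            source = "".join(source)
        yield index, source


# Tiny in-memory notebook, made up purely for illustration.
SAMPLE = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Title\n", "Intro text"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "outputs": [], "source": ["print('hi')"]},
        {"cell_type": "markdown", "metadata": {}, "source": "### Details"},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

for index, text in markdown_cells(SAMPLE):
    print(index, repr(text))
```

The index yielded here is the position in the whole notebook, so a per-cell report can point back at the right cell even though code cells are skipped.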

Sorry to leave this around for so long, life is kinda escaping from me this summer. Can you provide a sample file that I can use as a test?

jackdewinter avatar Aug 07 '22 00:08 jackdewinter

This notebook has various simple bits of markdown in it; will try to see if there is a "reference" Jupyter md test notebook anywhere, else I will try to create one.

psychemedia avatar Aug 07 '22 11:08 psychemedia

Cool... adding some options that may help provide this feature, but still thinking through it. Just wanted another tool in my belt for handling it.

The cell by cell part I am pretty confident that it can handle, especially with the new changes. And I can probably expose a more programmatic way to call it.

The one thing that has me concerned is your request for executing an "all cell" context. For me, it kind of sounds like multiple scans, and I would want to try and avoid that.

Let me think on this during the week, but I am seriously thinking about this.

jackdewinter avatar Aug 16 '22 04:08 jackdewinter

If I have a markdown cell with an H1, and the next markdown cell has an H3, I would expect the linter to tell me I have a missing H2.

psychemedia avatar Aug 16 '22 08:08 psychemedia
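
The H1-then-H3 expectation above can be sketched as a single pass that carries the last heading level from one cell to the next, which avoids a second scan for the "all cell" context. This is a hedged illustration, not PyMarkdown's rule implementation: it only recognises ATX (`#`) headings, handles fenced code blocks crudely, and the function name is hypothetical.

```python
import re

# ATX heading: up to three spaces of indent, 1-6 "#", then whitespace or end of line.
ATX_HEADING = re.compile(r"^ {0,3}(#{1,6})(?:\s|$)")


def heading_jumps(cells):
    """Report headings that skip a level, tracking state across cells.

    cells: iterable of (cell_index, markdown_text) in notebook order.
    Returns a list of (cell_index, line_in_cell, previous_level, level).
    """
    problems = []
    previous_level = 0  # 0 means "no heading seen yet"; carried across cells
    for cell_index, text in cells:
        in_fence = False
        for line_number, line in enumerate(text.splitlines(), start=1):
            # Simplified fence tracking so "# comment" inside code is ignored.
            if line.lstrip().startswith(("```", "~~~")):
                in_fence = not in_fence
                continue
            if in_fence:
                continue
            match = ATX_HEADING.match(line)
            if not match:
                continue
            level = len(match.group(1))
            if previous_level and level > previous_level + 1:
                problems.append((cell_index, line_number, previous_level, level))
            previous_level = level
    return problems


# Cell 0 holds an H1, cell 2 jumps straight to an H3.
print(heading_jumps([(0, "# Title"), (2, "### Deep section")]))
```

Because the only shared state is `previous_level`, each finding is still reported against the cell and line where it occurs.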

Okay, this may seem like a stupid question, but bear with me. Can I assume that the ordering of the cells in the notebook is consistent and matches the order in which they appear in the file? Just asking for a friend... :-)

jackdewinter avatar Aug 17 '22 02:08 jackdewinter

Other hopefully not stupid question... when you are looking at the items in the notebooks, do you look at pages/indices, or the ids that seem to come with the items?
i.e. is

{
  "cell_type": "markdown",
  "id": "agreed-cancellation",
  "metadata": {},
  "source": [
    "# 2 The interactive read-writable notebook environment"
  ]
},

best thought of as "index 0", "page 1", or "id agreed-cancellation"?

jackdewinter avatar Aug 17 '22 02:08 jackdewinter

Not a silly question at all...

The notebooks are linear documents and if you "Run All" cells, they execute in linear order. When you run a cell, the output element displays a cell run execution order number. In the interactive notebook UI, a user can run cells in any order. Notebooks that are saved as run notebooks often demonstrate output cell execution history numbers that are out of order, reflecting the execution path the interactive user followed. Similarly, when writing a document, a user might add new code or markdown cells above other cells (just as you might add a paragraph midway through a pre-existing document in a text editor). But for reproducible execution purposes, you should view the document as a linear one.

psychemedia avatar Aug 17 '22 08:08 psychemedia
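
To make the index / id / execution-order distinction concrete, here is a small sketch that lists cells in file order alongside their ids and execution counts. The markdown cell is the one quoted earlier in the thread; the two code cells, their ids, and the function name are made up for illustration.

```python
import json


def describe_cells(notebook_json):
    """Return (index, id, cell_type, execution_count) per cell, in file order."""
    notebook = json.loads(notebook_json)
    rows = []
    for index, cell in enumerate(notebook["cells"]):
        rows.append((
            index,                        # position in the file: the linear document order
            cell.get("id"),               # cell id, when the notebook stores one
            cell["cell_type"],
            cell.get("execution_count"),  # only code cells carry this; may be out of order
        ))
    return rows


SAMPLE_NOTEBOOK = json.dumps({
    "cells": [
        {"cell_type": "markdown", "id": "agreed-cancellation", "metadata": {},
         "source": ["# 2 The interactive read-writable notebook environment"]},
        # Hypothetical code cells that were run interactively in reverse order.
        {"cell_type": "code", "id": "cell-b", "metadata": {},
         "execution_count": 2, "outputs": [], "source": ["print('ran second')"]},
        {"cell_type": "code", "id": "cell-c", "metadata": {},
         "execution_count": 1, "outputs": [], "source": ["print('ran first')"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

for row in describe_cells(SAMPLE_NOTEBOOK):
    print(row)
```

The list position is the order a linter would follow; the execution counts only record what an interactive user happened to do.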

This specific request, to lint a Python Jupyter notebook, is on my list; I just have not had much time to work on it. The relatively new PyMarkdown API should allow you to scan it though. Thoughts?

jackdewinter avatar Sep 05 '23 02:09 jackdewinter
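
The thread does not show how the PyMarkdown API would be called, so the sketch below keeps the scanner as a parameter: each markdown cell is written to its own temporary file and any path-based scanner is run over it. `lint_cells`, the `scan_file` interface, and `toy_scan` are all hypothetical. PyMarkdown's `PyMarkdownApi().scan_path` looks like a candidate to wrap here, but treat that as an assumption to check against the installed version.

```python
import os
import tempfile


def lint_cells(cells, scan_file):
    """Run a path-based scanner over each markdown cell separately.

    cells: iterable of (cell_index, markdown_text).
    scan_file: callable(path) -> list of findings (hypothetical interface).
    Returns {cell_index: findings} for cells that produced any findings.
    """
    report = {}
    with tempfile.TemporaryDirectory() as workdir:
        for cell_index, text in cells:
            path = os.path.join(workdir, f"cell_{cell_index}.md")
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(text)
            findings = scan_file(path)
            if findings:
                report[cell_index] = findings
    return report


def toy_scan(path):
    """Stand-in scanner for the demo: flags lines with trailing whitespace."""
    with open(path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    return [
        (line_number, "trailing whitespace")
        for line_number, line in enumerate(lines, start=1)
        if line != line.rstrip()
    ]


print(lint_cells([(0, "# Title \nfine"), (3, "clean")], toy_scan))
```

Scanning cell by cell gives per-cell reports for free, but rules that need whole-document context (such as heading order) would still need state shared between calls.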

I'll try to have a look at the API - thanks for heads-up.

Some of my original notes on linting notebooks etc can be found here.

psychemedia avatar Sep 05 '23 23:09 psychemedia

Cool, will try looking at it in the next couple of weeks as I have time. One question that I do have about that entire process that you can help with...

Say I have a notebook with markdown, python, markdown, markdown, python.

  • Based on previous conversations, I assume that PyMarkdown should leave the python segments alone, correct?
  • There are 3 Markdown segments. Are they considered one? i.e. one of the rules is that each document should begin with a top-level heading like # this is the title.
  • From a line numbering point of view, if PyMarkdown found an issue with the second line of the third document, what would be the best way to reflect that specific line to the user?

jackdewinter avatar Sep 06 '23 02:09 jackdewinter

I assume that PyMarkdown should leave the python segments alone, correct?

Probably... any text in a code cell would probably be handled by a code linter / formatter.

(This partly depends on whether your markdown linting would ordinarily try to modify anything inside a code block?)

There are 3 Markdown segments. Are they considered one? i.e. one of the rules is that each document should begin with a top-level heading like # this is the title.

I would say the .ipynb file is the document. (It is possible to convert an .ipynb file to an extended markdown format file, e.g. a MyST markdown doc, or to a code file, for example .py.) So ideally you would only have one H1 heading, and that at the top (note that some notebooks may have YAML config data in the first markdown cell).

From a line numbering point of view, if PyMarkdown found an issue with the second line of the third document, what would be the best way to reflect that specific line to the user?

Two ways come to mind:

  1. give the markdown cell number (e.g. 3 for the third markdown cell);
  2. give the notebook cell number (e.g. in your example, 4 for the third markdown cell)

psychemedia avatar Sep 07 '23 12:09 psychemedia
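
Both numbering options above can be supported at once if the markdown cells are joined into one document and a line map is kept alongside. The sketch below is one hedged way to do that, using the markdown, python, markdown, markdown, python example from the thread; the function name and the blank separator line between cells are assumptions, not anything PyMarkdown does.

```python
def build_line_map(cells):
    """Join markdown cells into one document and remember where each line came from.

    cells: list of (cell_type, source_text) in notebook order.
    Returns (combined_markdown, line_map), where line_map[n - 1] is
    (notebook_cell_number, markdown_cell_number, line_in_cell) for combined line n.
    Separator lines have line_in_cell set to None.
    """
    combined_lines = []
    line_map = []
    markdown_number = 0
    for notebook_number, (cell_type, text) in enumerate(cells, start=1):
        if cell_type != "markdown":
            continue  # code cells are left alone
        markdown_number += 1
        for line_in_cell, line in enumerate(text.splitlines(), start=1):
            combined_lines.append(line)
            line_map.append((notebook_number, markdown_number, line_in_cell))
        # Blank line so adjacent cells do not merge into one paragraph.
        combined_lines.append("")
        line_map.append((notebook_number, markdown_number, None))
    return "\n".join(combined_lines), line_map


CELLS = [
    ("markdown", "# Title\nIntro"),
    ("code", "x = 1"),
    ("markdown", "## Part"),
    ("markdown", "Text\nMore text"),
    ("code", "print(x)"),
]

combined, line_map = build_line_map(CELLS)
# An issue on combined line 7 is the second line of the third markdown cell,
# which is the fourth cell of the notebook.
print(line_map[7 - 1])
```

A report could then show either number, or both, since the map keeps the markdown cell number and the notebook cell number together.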