pymarkdown
Linting Markdown Cells in Jupyter Notebook ipynb documents
Is there a simple API for running linting checks over multiple strings from the same document context within a Python script?
For example, I'm looking for a way to lint markdown in a Jupyter notebook.
In a Jupyter notebook .ipynb document, there are multiple "cells" which can be either markdown or code. It would be useful to be able to co-opt pymarkdown to run something of the form:
for notebook in directory:
    linter = Linter(context=directory/notebook)
    for i, cell in iterate(parse(notebook).cells):
        if cell.type == 'markdown':
            linter.lint(cell.content, i)
The document context would allow for things like detecting out-of-order headings across separate cells, whilst at the same time allowing each cell to be reported on separately and identifiably.
A workaround is to convert an .ipynb document to Markdown using something like jupytext and then just lint the markdown document, but this loses the ability to provide cell level reports.
Been thinking about this for a week or so while I work on dealing with container issues. I am good with Python, but not as much with Jupyter notebooks. Can you point me to a good definition that will help me understand your specific needs? (i.e. there is a lot to digest, and would like to do focused research).
The Jupyter notebook .ipynb format is a JSON structure that defines "cells" that can be of various content types, including code or markdown cells: https://nbformat.readthedocs.io/en/latest/format_description.html
The format can be converted into various other document types, including various flavours of enhanced markdown or python, using tools such as jupytext.
However, when the document is in .ipynb format it is trivial to get access to the cells by type. It would be useful to be able to run pymarkdown over each markdown cell in a notebook separately and report on each cell separately.
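To make "access to the cells by type" concrete, here is a minimal sketch that pulls the markdown cells out of a notebook using only the standard library. The helper name `markdown_cells` and the sample notebook are made up for illustration; the only things taken from the nbformat description are the `cells`, `cell_type`, `id` and `source` keys, and the fact that `source` may be a string or a list of strings.

```python
import json


def markdown_cells(notebook_json: str):
    """Yield (cell_index, cell_id, text) for each markdown cell, in file order.

    Hypothetical helper for illustration; not part of pymarkdown or nbformat.
    """
    notebook = json.loads(notebook_json)
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "markdown":
            continue  # code and raw cells are left for other tools
        source = cell.get("source", "")
        # nbformat allows "source" to be one string or a list of strings.
        text = source if isinstance(source, str) else "".join(source)
        yield index, cell.get("id"), text


# A tiny made-up notebook: markdown, code, markdown.
sample = json.dumps({
    "cells": [
        {"cell_type": "markdown", "id": "a", "metadata": {},
         "source": ["# Title\n", "Intro text"]},
        {"cell_type": "code", "id": "b", "metadata": {},
         "source": ["print('hi')"], "outputs": [], "execution_count": None},
        {"cell_type": "markdown", "id": "c", "metadata": {},
         "source": "### Details"},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})

for index, cell_id, text in markdown_cells(sample):
    print(index, cell_id, repr(text))
```

Each yielded text could then be handed to whatever linter entry point is available, with the index or id kept alongside it for reporting.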
That said, some tests would probably also need to run in a "global" / "all cell" context; for example, checks that header levels are correctly nested over the course of the notebook.
Sorry to leave this around for so long, life is kinda escaping from me this summer. Can you provide a sample file that I can use as a test?
This notebook has various simple bits of markdown in it; will try to see if there is a "reference" Jupyter md test notebook anywhere, else I will try to create one.
Cool... adding some options that may help provide this feature, but still thinking through it. Just wanted another tool in my belt for handling it.
The cell by cell part I am pretty confident that it can handle, especially with the new changes. And I can probably expose a more programmatic way to call it.
The one thing that has me concerned is your request for executing an "all cell" context. For me, it kind of sounds like multiple scans, and I would want to try and avoid that.
Let me think on this during the week, but I am seriously thinking about this.
If I have a markdown cell with an H1, and the next markdown cell has an H3, I would expect the linter to tell me I have a missing H2.
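As a rough illustration of that expectation, the sketch below carries the last seen heading level from one markdown cell to the next and reports any jump of more than one level. It is a stand-alone approximation written for this thread, not pymarkdown's rule implementation: it only recognises ATX (`#`) headings and uses a deliberately naive check to skip fenced code blocks.

```python
import re

# ATX heading: up to three leading spaces, 1-6 '#', then whitespace or end of line.
ATX_HEADING = re.compile(r"^ {0,3}(#{1,6})(?:\s|$)")


def heading_jumps(cells):
    """Return (cell_index, line_number, previous_level, level) for each skipped level.

    `cells` is a list of (cell_index, markdown_text) pairs in notebook order.
    The previous heading level is remembered across cells, which is the
    "all cell" context discussed above.
    """
    problems = []
    previous_level = 0  # 0 means no heading seen yet in the notebook
    for cell_index, text in cells:
        in_fence = False
        for line_number, line in enumerate(text.splitlines(), start=1):
            # Naive fence tracking so '#' comments inside code are ignored.
            if line.lstrip().startswith(("```", "~~~")):
                in_fence = not in_fence
                continue
            if in_fence:
                continue
            match = ATX_HEADING.match(line)
            if not match:
                continue
            level = len(match.group(1))
            if previous_level and level > previous_level + 1:
                problems.append((cell_index, line_number, previous_level, level))
            previous_level = level
    return problems


cells = [(0, "# Title"), (2, "### Details\ntext"), (3, "#### More")]
print(heading_jumps(cells))
```

Because each problem is reported with the cell index and the line number inside that cell, a single pass gives both the notebook-wide check and the per-cell location.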
Okay, this may seem like a stupid question, but bear with me. Can I assume that the ordering of the cells in the notebook is consistent and matches the order in which they appear in the file? Just asking for a friend... :-)
Other hopefully not stupid question... when you are looking at the items in the notebooks, do you look at pages/indices, or the ids that seem to come with the items?
i.e. is
{
  "cell_type": "markdown",
  "id": "agreed-cancellation",
  "metadata": {},
  "source": [
    "# 2 The interactive read-writable notebook environment"
  ]
},
best thought of as "index 0", "page 1", or "id agreed-cancellation"?
Not a silly question at all...
The notebooks are linear documents and if you "Run All" cells, they execute in linear order. When you run a cell, the output element displays a cell run execution order number. In the interactive notebook UI, a user can run cells in any order. Notebooks that are saved as run notebooks often demonstrate output cell execution history numbers that are out of order, reflecting the execution path the interactive user followed. Similarly, when writing a document, a user might add new code or markdown cells above other cells (just as you might add a paragraph midway through a pre-existing document in a text editor). But for reproducible execution purposes, you should view the document as a linear one.
This specific request, to lint a python Jupyter notebook, is on my list; I just have not had much time to work on it. The relatively new PyMarkdown API should allow you to scan it though. Thoughts?
I'll try to have a look at the API - thanks for heads-up.
Some of my original notes on linting notebooks etc can be found here.
Cool, will try looking at it in the next couple of weeks as I have time. One question that I do have about that entire process that you can help with...
Say I have a notebook with markdown, python, markdown, markdown, python.
- Based on previous conversations, I assume that PyMarkdown should leave the python segments alone, correct?
- There are 3 Markdown segments. Are they considered one? i.e. one of the rules is that each document should begin with a top-level heading like # this is the title.
- From a line numbering point of view, if PyMarkdown found an issue with the second line of the third document, what would be the best way to reflect that specific line to the user?
I assume that PyMarkdown should leave the python segments alone, correct?
Probably... any text in a code cell would probably be handled by a code linter / formatter.
(This partly depends on whether your markdown linting would ordinarily try to modify anything inside a code block?)
There are 3 Markdown segments. Are they considered one? i.e. one of the rules is that each document should begin with a top-level heading like # this is the title.
I would say the .ipynb file is the document. (It is possible to convert an .ipynb file to an extended markdown format file (e.g. a MyST markdown doc) or a code file (for example, .py).) So ideally you would only have one H1 heading, and that at the top (note that some notebooks may have yaml config data in the first markdown cell).
From a line numbering point of view, if PyMarkdown found an issue with the second line of the third document, what would be the best way to reflect that specific line to the user?
Two ways come to mind:
- give the markdown cell number (e.g. 3 for the third markdown cell);
- give the notebook cell number (e.g. in your example, 4 for the third markdown cell)
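Both numbering options can be recovered from one scan if the markdown cells are joined into a single document and the starting line of each cell is remembered. The sketch below is one possible way to do that, under the assumption that cells are joined with a blank separator line; `join_cells` and `locate` are hypothetical names, not PyMarkdown functions.

```python
from bisect import bisect_right


def join_cells(cells):
    """Join markdown cells into one document and record where each one starts.

    `cells` is a list of (notebook_cell_index, text) pairs, 0-based, in order.
    Returns (document, starts, origins): starts[i] is the 1-based document line
    on which markdown cell i begins, origins[i] is its notebook cell index.
    """
    lines, starts, origins = [], [], []
    for notebook_index, text in cells:
        starts.append(len(lines) + 1)
        origins.append(notebook_index)
        lines.extend(text.splitlines() or [""])
        lines.append("")  # blank separator line between cells (an assumption)
    return "\n".join(lines), starts, origins


def locate(document_line, starts, origins):
    """Map a 1-based line in the joined document back to cell-level numbers."""
    position = bisect_right(starts, document_line) - 1
    return {
        "markdown_cell": position + 1,            # counting markdown cells only
        "notebook_cell": origins[position] + 1,   # counting every cell
        "line_in_cell": document_line - starts[position] + 1,
    }


# The example layout: markdown, python, markdown, markdown, python.
# The markdown cells therefore sit at notebook indices 0, 2 and 3.
cells = [(0, "# Title\nIntro"), (2, "## Section"), (3, "Some text\nSecond line")]
document, starts, origins = join_cells(cells)

# The second line of the third markdown cell is line 7 of the joined document.
print(locate(7, starts, origins))
```

A report could then show either number, or both, next to the line within the cell. A line that falls on a separator is attributed to the cell before it.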