jupytext icon indicating copy to clipboard operation
jupytext copied to clipboard

Pre commit hook workflow

Open MicheleQuasar opened this issue 3 years ago • 10 comments

I'm trying to set up a pre-commit hook workflow following the docs but having some issues with what I have in mind. The workflow I'm immagining right now is as follows:

  1. Make changes to notebook
  2. add notebook to staging area
  3. commit => before committing the pre-commit hook runs converts the notebook to .py adds it to the commit and then proceeds to committing

From reading the issue where support for the pre-commit framework was added, it looks like that the problem is in the last point: the pre-commit hook is supposed to generate the .py file but not add it to the staging area.

My issue with this behaviour is that this way I always have to make tha same commit twice, once for the notebook, once for the .py. Indeed, if I'm working on the notebook file and not touching the .py, from one commit to the next the .py is going to be the same so git won't event let me add it to the staging area. Then, when committing the first time around, the jupytext pre-commit hook fails telling me that the git index is outdated and asking me to add the .py to the staging area; then running the commit a second time adding the generated .py everything works fine.

My question is, is this the intended behaviour or am I missing something?

MicheleQuasar avatar Oct 18 '21 15:10 MicheleQuasar

Hi @MicheleQuasar , well that is a good question.

I think this is the intended behaviour - at least this also appears in our tests.

However I agree with you that having to add the paired file to the index makes the pre-commit workflow a bit more complex to use than standard hooks. As you state, trying to commit twice is not enough, unlike for most hooks.

I just had a look at the history of our work on the workflow, and in that comment we concluded that adding the paired file to the index was not the right approach. I think the reason is that if we add the file in the hook itself, then we don't have a guarantee that it passes the other pre-commit hooks.

Maybe I'll double check with @JohnPaton and @Skylion007 if my understanding is correct? Also, is anyone aware of other pre-commit hooks that like Jupytext might modify files outside of the current git index (and if so, how did they address this question?)

mwouts avatar Oct 19 '21 21:10 mwouts

This is correct and the intended behaviour. pre-commit themselves promise to never modify the staging area - the user needs to take responsibility for any code changes that are being committed, essentially. See for example this issue

JohnPaton avatar Oct 20 '21 10:10 JohnPaton

Thanks for the clarification @mwouts and @JohnPaton. The choice of not committing anything that hasn't been checked by a human makes sense in the general case. It can make the workflow a bit more cumbersome, like in this instance, but is for sure the safest!

@mwouts would it make sense to add something about this in the docs?

MicheleQuasar avatar Oct 26 '21 12:10 MicheleQuasar

Agreed, I admit once I trust a hook I commit its output pretty uncritically - but at the end of the day it's still my name on the commit.

JohnPaton avatar Oct 26 '21 13:10 JohnPaton

I'm sorry for coming back on an old thread, but this has left me confused: what is the purpose of the jupytext pre-commit hook then?

I mean, I think I'm getting something wrong here, but in my mind one of the main use cases (and the one I'm interested in at the moment) of jupytext is versioning exclusively .py files and no .ipynb ones. However, the pre-commit hook forces its user to stage the .ipynb files and does not remove them from the index afterwards (for the reasons you mentioned). The conclusion I've arrived at is if I wanted to only version .py files generated from jupytext, I'd be better off running jupytext manually before each commit instead of using the hook, which does not make sense to me.

The workflow I'm imagining is something like

git add *.ipynb
git commit ... # Runs hook and e.g. updates .py files using `jupytext --sync`
git rm --cached *.ipynb # Have to manually remove files before commiting again
git commit ...

whereas running manually you'd only have to do

jupytext --sync *.ipynb
git add *.py
git commit ...

Is there another use case for this hook in its current form that I'm missing?

tomaz-suller avatar May 17 '22 15:05 tomaz-suller

Hi @tomaz-suller , thank you for your input.

The use case for which the jupytext pre-commit hook was designed was to make sure both text and ipynb notebooks are in sync when committed.

Our pre-commit hook does not cover at all the case when you want to update local files (here, .ipynb files) that are not committed.

An alternative is to use a custom pre-commit script with a git reset HEAD **/*.ipynb instruction. While this is not documented any more in the current version, it should still work. To use it you can refer to this previous version of the docs.

Out of curiosity, may I ask why you are willing to use a pre-commit (or a post checkout?) hook to update the .ipynb files? Are you using another editor than Jupyter (which will take the source from the .py file when Jupytext is installed?)

mwouts avatar May 18 '22 09:05 mwouts

Hi @mwouts. I made this comment thinking about two possibilities for using jupytext, one I found in a personal project and another at the company I intern at:

  1. You want to version notebooks based on .py files. Quickly after I made the comment I realised I could just as well use the py:percent format for everything if I was okay with sacrificing the ability to save my output (which I was), so I only used jupytext once to convert all .ipynb files to .py files, no need for the hook here.

  2. I'm working for a company in which there is a need to version notebooks, so naturally .ipynb is less than ideal. The example I gave, using the --sync flag, was meant to give the developer flexibility on whether to modify the .ipynb or the .py according to their preference. Still, even if you use --from ipynb --to py:percent in the hook, you have to stage and then unstage the changes to the .ipynb notebooks, which comes back to how that can be cumbersome. I want to be able to edit exclusively the .ipynb files because that's how people at my organisation are used to doing things, and I'm just an intern who was asked to do this thing haha. Of course I could just talk to them to go to something like case 1. I mentioned above.

And on your question, my idea is to use VS Code, though it supports py:percent just as well as Jupyter.

For now I'll use another solution I found on StackOverflow from the (allegedly, haven't verified myself) creator of pre-commit. It's downside is evidently to running on every notebook every time, but if that causes problems I'll just create a custom hook like you suggested.

tomaz-suller avatar May 18 '22 10:05 tomaz-suller

there is a need to version notebooks

I want to be able to edit exclusively the .ipynb files

This is a good use case for the existing hook imo. Workflow:

  • Both the .ipynbs and the synced .pys (in whichever format) are tracked in git
  • Edit your notebooks in whichever editor you want
  • Stage them in git
  • Try to commit or run the hook manually
  • If the output produced doesn't match the .py files (because you have changed the code), those .pys will be updated
  • Add the updated files to git and try again
  • Commit succeeds
  • Only look at the diffs of the .pys when examining changes

It seems like a lot but in practice it's just

  • Do your work as usual
  • git add -u
  • git commit -m "Changes to notebooks" (fails, .pys updated)
  • git add -u
  • git commit -m "Changes to notebooks and synced .pys" (succeeds)
  • git push

JohnPaton avatar May 23 '22 12:05 JohnPaton

@JohnPaton in this situation though, adding both the py and ipynb could create merge conflicts on the ipynb files, couldn't it? The hook fails if you just stage the py files.

tomaz-suller avatar May 25 '22 15:05 tomaz-suller

Only if there are also conflicts on the py files (in the case of code changes), in which case they need to be resolved anyway, or in case people commit different outputs. If you're worried about that you could use something like kynan/nbstripout to additionally remove the outputs, and then there can only be conflicts in both files or neither, since the content will be essentially the same.

JohnPaton avatar May 27 '22 08:05 JohnPaton