Binary file when outputs are not cleared

Open IsabellLehmann opened this issue 3 years ago • 13 comments

I use SourceTree, where I can see the changes on the right by clicking on a file. I have created a new file and run all cells. If I click on the file, I see this: image

If I stage it, I get this message: image If I clear the outputs in the notebook, it looks like this after staging: image

I thought that nbstripout should suppress the outputs and wonder why it is not working. I also have to say that this is not the case for all notebooks with plots. In another .ipynb-file it works as expected without cleaning the outputs.

IsabellLehmann avatar Mar 02 '21 14:03 IsabellLehmann

Is nbstripout enabled for the repository where output isn't being stripped? You can check by typing the following:

nbstripout --status

horseshoe107 avatar Mar 10 '21 03:03 horseshoe107

@horseshoe107 Thanks for your quick answer. It is the same repository in which some outputs are stripped and others are not. The folder structure is: mlsp (repository) / notebooks.

I cd in the mlsp repository, activate my anaconda virtual environment (which is also named mlsp) and type: nbstripout --status.

This is the response:

nbstripout is installed b'C:\Users\Isi\Documents\Research\mlsp'

Filter: clean = b'"c:/users/isi/anaconda3/envs/mlsp/python.exe" -m nbstripout' smudge = b'cat' diff= b'"c:/users/isi/anaconda3/envs/mlsp/python.exe" -m nbstripout -t' extrakeys=

Attributes: b'*.ipynb: filter: nbstripout'

Diff Attributes: b'*.ipynb: diff: ipynb'

IsabellLehmann avatar Mar 10 '21 06:03 IsabellLehmann

I was just able to reproduce the error in some way:

When I load the datafile, I create an fMRI plot in the end. I get a really huge output figure which consists of 7x8 subplots (where the first 50 are filled with data and the last ones are empty). In this case, I get the problem with the binary file as shown above.

However, when I plot just half the data, i.e., 7x4 subplots where the first 25 are filled with data and the last ones are empty, I can stage the file with no problems and see all changes.

So I guess this happens if the output figures are too big. Is that possible?

With this code, you should get the problem with the binary file:

import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(50, 250, 250)

fig, axes = plt.subplots(nrows=8, ncols=7, figsize=(7*3, 8*3))
for idx, ax in enumerate(axes.flatten()):
   ax.axis('off')
   if idx < data.shape[0]:
       ax.imshow(data[idx,:,:])

IsabellLehmann avatar Mar 10 '21 06:03 IsabellLehmann

It does look like nbstripout is installed. Can you try command-line git next: git diff (exit by pressing q)

If you don't see any binary data in the command-line diff, then the problem may be how nbstripout is hooking into SourceTree. I can't reproduce the problem, by the way, but I'm using a different client (GitHub Desktop).

horseshoe107 avatar Mar 10 '21 13:03 horseshoe107

Hi! With git diff, I don't see any new files. Therefore, I tried an existing file for which I see the binary message in SourceTree. With git diff on the command line, I can see the changes. Then I added the changes via the command line and used git diff --cached to check them, which works fine. In SourceTree, I can now see the files as staged, but I still see the binary message instead of the changed lines. So yes, it seems to be a problem of nbstripout with SourceTree.

Thanks for clarification!

IsabellLehmann avatar Mar 11 '21 07:03 IsabellLehmann

@IsabellLehmann can you confirm what happens if you only use Git on the command line and not via SourceTree? Do you see any issues?

I've tried the example from your comment and when I run it in a cell in a notebook, save and then git add the notebook file I get the expected git diff --cached:

diff --git a/largedata.ipynb b/largedata.ipynb
new file mode 100644
index 0000000..024fde6
--- /dev/null
+++ b/largedata.ipynb
@@ -0,0 +1,44 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "altered-permit",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "import numpy as np\n",
+    "\n",
+    "data = np.random.randn(50, 250, 250)\n",
+    "\n",
+    "fig, axes = plt.subplots(nrows=8, ncols=7, figsize=(7*3, 8*3))\n",
+    "for idx, ax in enumerate(axes.flatten()):\n",
+    "   ax.axis('off')\n",
+    "   if idx < data.shape[0]:\n",
+    "       ax.imshow(data[idx,:,:])"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

I wonder if SourceTree uses its own heuristic to decide whether or not a file is binary?
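For comparison, here is a minimal sketch of the check command-line Git applies when no attribute says otherwise: it treats content as binary if a NUL byte appears near the start of the file. The function name is hypothetical, and whether SourceTree uses this rule or something else (for example, file size or line length) is not known from this thread.

```python
def looks_binary_like_git(data: bytes, first_few_bytes: int = 8000) -> bool:
    """Approximate Git's default binary check: a NUL byte in the leading bytes.

    The function name is hypothetical; this is a sketch of Git's default
    behaviour, not of SourceTree's.
    """
    return b"\0" in data[:first_few_bytes]


# A base64-encoded PNG inside notebook JSON is plain ASCII, so it contains no NUL bytes.
notebook_like = b'{"outputs": [{"data": {"image/png": "iVBORw0KGgoAAAANSUhEUg"}}]}'
raw_png_like = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"
```

Under this rule, a notebook with embedded base64 images still counts as text, which would be consistent with git diff working on the command line.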

The notebook file from your example is ~2.5M and almost all of it is a base64 encoded PNG, which is technically binary content.
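To see how much of a given notebook is embedded image data, a rough sketch along these lines may help. It assumes the standard nbformat 4 layout, where each output stores its payloads under a "data" key indexed by MIME type; the function name is hypothetical.

```python
import json


def image_output_chars(notebook_json: str) -> tuple:
    """Return (total_chars, image_chars) for a notebook passed as a JSON string."""
    nb = json.loads(notebook_json)
    image_chars = 0
    for cell in nb.get("cells", []):
        for output in cell.get("outputs", []):
            for mime, payload in output.get("data", {}).items():
                if not mime.startswith("image/"):
                    continue
                # Payloads may be stored as one string or as a list of strings.
                if isinstance(payload, list):
                    payload = "".join(payload)
                image_chars += len(payload)
    return len(notebook_json), image_chars
```

For a file on disk, read it as text first, e.g. `image_output_chars(open("Untitled.ipynb", encoding="utf-8").read())`.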

kynan avatar Apr 12 '21 19:04 kynan

@kynan I did as you suggested. In the command line I get:

diff --git a/notebooks/Untitled.ipynb b/notebooks/Untitled.ipynb
new file mode 100644
index 0000000..9d30785
--- /dev/null
+++ b/notebooks/Untitled.ipynb
@@ -0,0 +1,52 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "chubby-tutorial",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "import numpy as np\n",
+    "\n",
+    "data = np.random.randn(50, 250, 250)\n",
+    "\n",
+    "fig, axes = plt.subplots(nrows=8, ncols=7, figsize=(7*3, 8*3))\n",
+    "for idx, ax in enumerate(axes.flatten()):\n",
+    "   ax.axis('off')\n",
+    "   if idx < data.shape[0]:\n",
+    "       ax.imshow(data[idx,:,:])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "alive-retirement",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "sportsdata",
+   "language": "python",
+   "name": "sportsdata"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

When checking the staged file in SourceTree: image

IsabellLehmann avatar Apr 13 '21 07:04 IsabellLehmann

The diff you posted is from git diff --cached I assume?

Is SourceTree showing you the state after having staged the file? In this case, the diff should indeed be empty.

Does it show you the staged diff as well? And is it the same as the diff you get on the command line?

I don't have a Mac or Windows system, so can't test with SourceTree myself.

kynan avatar Apr 13 '21 21:04 kynan

Yes, it is from git diff --cached.

What do you mean by state? Usually, SourceTree shows the diff on the right in the file contents. If I clear all outputs, save and stage it in SourceTree, it looks like this: image So it gives the same output as git diff --cached. However, if I do not clear the outputs, the file is detected as a binary file and I cannot see any changes (as in the screenshot above).

IsabellLehmann avatar Apr 14 '21 07:04 IsabellLehmann

Thanks for confirming. I'm not sure what I could do on the nbstripout side to fix this.

Does SourceTree have a community forum or similar where you could report this, to make sure it's not a behaviour of SourceTree itself? Or maybe how you can configure SourceTree not to treat .ipynb files as binary?

kynan avatar Apr 14 '21 20:04 kynan

Thanks for the idea! I have posted a question in the Atlassian community forum: https://community.atlassian.com/t5/Sourcetree-questions/Why-is-nbstripout-not-always-working-with-SourceTree/qaq-p/1664240

IsabellLehmann avatar Apr 15 '21 06:04 IsabellLehmann

@IsabellLehmann did you ever find a solution for this issue?

kynan avatar Jan 02 '22 20:01 kynan

@IsabellLehmann did you ever find a solution for this issue?

Nope. As a workaround, whenever SourceTree detects a binary file, I manually clear the outputs in the Jupyter notebook.
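The manual step could also be scripted. The following is a minimal sketch that works on the notebook JSON directly, assuming the nbformat 4 layout; it is not nbstripout's own implementation, and the function name is hypothetical.

```python
import json


def clear_outputs(notebook_json: str) -> str:
    """Return the notebook JSON with outputs and execution counts removed."""
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    # Jupyter writes notebooks with one-space indentation and a trailing newline.
    return json.dumps(nb, indent=1, ensure_ascii=False) + "\n"
```

Writing the result back to the .ipynb file before staging should leave SourceTree with the same content that clearing the outputs in Jupyter produces.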

IsabellLehmann avatar Jan 10 '22 08:01 IsabellLehmann

@IsabellLehmann is this still an issue for you?

kynan avatar Sep 24 '22 12:09 kynan

@kynan Using my example above, I still have the problem using SourceTree but not using git add in the console. So, I think we can close the issue here.

IsabellLehmann avatar Sep 27 '22 07:09 IsabellLehmann

Thanks, will do :)

kynan avatar Oct 02 '22 09:10 kynan