nbsphinx
Avoid sphinx searching on output cells
In our documentation, we have some notebooks rendered by nbsphinx that include Plotly plots. The output cells of these notebooks include the full Plotly JavaScript library. When we use Sphinx's search bar in our documentation, we get hits for Plotly's JavaScript (circled in red):
Is there any way to avoid this?
Can you please show the HTML of the affected page?
It looks like HTML tags are supposed to get stripped, see https://github.com/sphinx-doc/sphinx/commit/53ea1cb2808e90b51f0ed9468740a34c00decc2a
It would be ideal if you could reproduce this with the raw directive (without using nbsphinx); then you could raise this as a Sphinx issue.
Ok, so what I understand from the code there is that in principle everything inside a script tag is ignored, right?
Thanks, I'll try to dig deeper!
This is the page: https://zerothi.github.io/sisl/visualization/viz_module/showcase/GeometryPlot.html#GeometryPlot
(I can't upload html files to github)
And if I grep on that html file:
grep -n "not have a valid GeoJSON geometry" geometry_plot.html | cut -d : -f 1
I get a match on line 208, which is where the plotly library is included inside a script tag.
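The grep check above only says which line matches, not whether the match sits inside a script tag. A minimal sketch of a more direct check, using only Python's standard html.parser module, could look like the following. The names PhraseLocator and locate are hypothetical helpers made up for this illustration, and the sample HTML is a stand-in, not the real page.

```python
from html.parser import HTMLParser


class PhraseLocator(HTMLParser):
    """Record where a phrase occurs and whether it is inside <script>."""

    def __init__(self, phrase):
        super().__init__()
        self.phrase = phrase
        self.in_script = False
        self.hits = []  # list of (line_number, "script" or "text")

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        idx = data.find(self.phrase)
        if idx < 0:
            return
        # getpos() is the position where this data chunk starts; add the
        # newlines that precede the phrase within the chunk.
        start_line, _ = self.getpos()
        line = start_line + data.count("\n", 0, idx)
        self.hits.append((line, "script" if self.in_script else "text"))


def locate(html, phrase):
    """Return (line, context) pairs for chunks that contain the phrase."""
    parser = PhraseLocator(phrase)
    parser.feed(html)
    parser.close()
    return parser.hits


# Stand-in HTML for illustration only (not the actual documentation page).
SAMPLE = (
    "<html><body>\n"
    "<p>geometry plot</p>\n"
    "<script>var m = 'not have a valid GeoJSON geometry';</script>\n"
    "</body></html>"
)
print(locate(SAMPLE, "not have a valid GeoJSON geometry"))
```

On a real page, the HTML file would be read from disk and passed to locate; a "script" context for every hit would confirm that the phrase only appears in the embedded library.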
Thanks for the link!
BTW, the "download ipynb" link is broken: https://raw.githubusercontent.com/zerothi/sisl/main//home/runner/work/sisl/sisl/docs/visualization/viz_module/showcase/GeometryPlot.ipynb
I guess it is meant to be this: https://raw.githubusercontent.com/zerothi/sisl/main/docs/visualization/viz_module/showcase/GeometryPlot.ipynb
However, this doesn't contain the outputs. Can you please provide the .ipynb file with outputs?
Yes, I'll send it to you as soon as I get home 👍
Thanks for the broken link report!
Here it is: GeometryPlot.zip
Thanks for the notebook file!
Playing around with that, I could reduce this to a pure Sphinx problem: https://github.com/sphinx-doc/sphinx/issues/12052
It looks like the <script> tag is indeed ignored when building the search index, but it is not ignored in the search preview.
Note that in your example the Plotly stuff is only shown because the word "geometry" is also used somewhere else on the page. If you search for "GeoJSON", you'll find nothing, even though the word is right next to "geometry".
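To make the mismatch concrete, here is a toy model in Python. It is not Sphinx's actual implementation: extract_text, build_index, preview and search are hypothetical names, and the page content is invented. The only thing it is meant to show is an index that skips script content combined with a preview that does not.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text data, optionally skipping the content of <script>."""

    def __init__(self, skip_script):
        super().__init__()
        self.skip_script = skip_script
        self.in_script = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script and self.skip_script:
            return
        self.parts.append(data)


def extract_text(html, skip_script):
    parser = _TextExtractor(skip_script)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def build_index(html):
    """Toy index: lowercase words of the page, script content skipped."""
    return set(re.findall(r"\w+", extract_text(html, skip_script=True).lower()))


def preview(html, term, width=30):
    """Toy preview: text around the first hit, script content NOT skipped."""
    text = extract_text(html, skip_script=False)
    pos = text.lower().find(term.lower())
    if pos < 0:
        return None
    return text[max(0, pos - width): pos + len(term) + width]


def search(html, term):
    """Only terms found in the index produce a result with a preview."""
    if term.lower() not in build_index(html):
        return None
    return preview(html, term)


# Invented page: the script comes before the visible text.
PAGE = (
    "<script>var msg = 'does not have a valid GeoJSON geometry';</script>"
    "<p>Plotting the geometry of a structure.</p>"
)
print(search(PAGE, "GeoJSON"))   # not in the index, so no result at all
print(search(PAGE, "geometry"))  # indexed via the <p>, preview leaks the script
```

In this model, searching for "GeoJSON" returns nothing because the word only exists inside the script, while searching for "geometry" succeeds through the visible paragraph and then previews the script text, which mirrors the behavior described above.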
Thank you very much! That's an interesting bug 😅
I guess I can close this then 👍
That's an interesting bug
Yes indeed!
It is a dangerous pattern to look out for: there is one piece of data (in our case the HTML source text) and there are two sub-systems handling that data separately (in our case the search index generation and the search preview generation). Those two systems are supposed to have the same behavior, but if they don't, we have a problem.
This reminds me of a vulnerability in the librsvg library that I read about recently: https://nvd.nist.gov/vuln/detail/CVE-2023-38633
In that case, the common piece of data was a URL, which was rejected by one sub-system, but not by another, which resulted in a potential exploit.
Love this! Thanks!