pandas
pandas copied to clipboard
DOC/DEV: document approach to bisect regressions
@simonjayhawkins you have been doing great work on finding the causes of bugs/regressions! To make it easier for other contributors to also do this, I think it would be nice to document this is the contributing docs (even though it is relying on git, using it with pandas (which needs to be rebuilt for each commit) requires a bit of custom setup I think?)
This would be good...I've got a local script mybisect.sh
:
python setup.py build_ext --inplace -j 8;
python pandas/tests/myscript.py
and then put whatever example I'm trying to check in pandas/tests/myscript.py
. Then I do
git bisect start
git bisect bad
git bisect good <some older commit>
git bisect run bash mybisect.sh
. It took me more than I'm proud to admit to figure this out, so making a note of this (or of whatever the proper way of doing it is) would be useful IMO
good idea. I'll use this issue to post a few notes in the first instance.
as well as finding bug/regressions it can also be useful to know when issues are fixed, e.g. #35650
In that issue the old behaviour raised ValueError and the new behaviour didn't. (more detail on exit status will be needed in doc)
so the script needed to exit with a non-zero exit code for the new behaviour and with a zero exit code for the old behaviour
so in this case, this script was used (print statements not needed)
import sys
import numpy as np
import pandas as pd
print(pd.__version__)
arr = np.array([np.datetime64("2015-11-20T15:06:58.000")])
arr.flags.writeable = False
try:
res = pd.factorize(arr)
print(res)
except:
pass
else:
sys.exit(1)
I've been wondering how exactly @simonjayhawkins does this but was too afraid to ask 😊. Great idea to document this.
for an regression which is not fixed on master and now raises an Exception can just use code sample (this will give a non-zero exit status for a bad commit and zero exit status for good commit eg. https://github.com/pandas-dev/pandas/issues/35700#issuecomment-673472111
(don't need to assert correct result but probably better to document a better practice)
~/test.py
import pandas as pd
print(pd.__version__)
df = pd.DataFrame({"a": [1, 2, 3] * 10000, "b": ["x", "y", "z"] * 10000})
res = df == "x"
print(res)
and execute the following commands in the shell (git bash on Windows) see examples in https://git-scm.com/docs/git-bisect#_examples
git bisect start HEAD v1.0.5
git bisect run /bin/bash -c "python setup.py build_ext -i;python ~/test.py"
git bisect reset
python setup.py build_ext -i # need to rebuild after reset
We could even put something like this into a script in scripts/, possibly even accessible in via make
possibly even accessible in via
make
if it for dev use that's probably ok. but if we also want to encourage users to bisect the issues themselves we need to be Windows friendly or have a guide that includes Linux/Mac and Windows instructions.
Thanks for writing these examples up so clearly here @simonjayhawkins, made it very easy to get started! A potentially useful addition is prepending CFLAGS="-O0"
to the run command, which made bisection much faster for me because of greatly decreased build time.
A potentially useful addition is prepending
CFLAGS="-O0"
to the run command
Is this only true for bisecting? Any reason to not use it all the time?
Anyway, here's a fully worked notebook I use to find regressions: https://www.kaggle.com/code/marcogorelli/pandas-regression-example
Right, let's try to move this forwards - I think this could be a nice addition to this page https://github.com/pandas-dev/pandas/blob/0dadc71dd4653e5b858d7b4153df1d7aded4ba46/doc/source/development/maintaining.rst
Doesn't need to be in-depth, I think it just needs:
- a link to the git bisect documentation
- a little example like the one Simon gave, e.g.:
If a user reported that
pd.Series([1,2,3]).sum()
returned9
in version pandas1.5.0
and6
in version1.4.0
, then you could find out which commit changed the result by making a filet.py
withimport pandas as pd assert pd.Series([1,2,3]).sum() == 6
and then running
git bisect start git bisect good v1.4.0 git bisect bad v1.5.0 git bisect run bash -c "python setup.py build_ext -j 4; python t.py" git bisect reset python setup.py build_ext -j 4 # need to rebuild after reset
Marking as good-first-issue then, if anyone wants to work on this feel free to open a pull request and to tag me
Is this only true for bisecting? Any reason to not use it all the time?
Gives a tradeoff between compile time and performance. No optimizations gives much slower code, but for a case like bisecting when (usually) there's a bunch of compiling going on (since cython files often change across bisection steps), it can be much faster provided that the bisection test case still runs quickly
Hi @MarcoGorelli I'd like to take a crack at writing this up.