pandas DOC/DEV: document approach to bisect regressions

@simonjayhawkins you have been doing great work on finding the causes of bugs/regressions! To make it easier for other contributors to also do this, I think it would be nice to document this is the contributing docs (even though it is relying on git, using it with pandas (which needs to be rebuilt for each commit) requires a bit of custom setup I think?)

Aug 12 '20 11:08 jorisvandenbossche

This would be good...I've got a local script mybisect.sh :

python setup.py build_ext --inplace -j 8;
python pandas/tests/myscript.py

and then put whatever example I'm trying to check in pandas/tests/myscript.py. Then I do

git bisect start
git bisect bad
git bisect good <some older commit>
git bisect run bash mybisect.sh

. It took me more than I'm proud to admit to figure this out, so making a note of this (or of whatever the proper way of doing it is) would be useful IMO

Aug 12 '20 11:08 MarcoGorelli

good idea. I'll use this issue to post a few notes in the first instance.

as well as finding bug/regressions it can also be useful to know when issues are fixed, e.g. #35650

In that issue the old behaviour raised ValueError and the new behaviour didn't. (more detail on exit status will be needed in doc)

so the script needed to exit with a non-zero exit code for the new behaviour and with a zero exit code for the old behaviour

so in this case, this script was used (print statements not needed)

import sys
import numpy as np
import pandas as pd

print(pd.__version__)

arr = np.array([np.datetime64("2015-11-20T15:06:58.000")])
arr.flags.writeable = False
try:
    res = pd.factorize(arr)
    print(res)
except:
    pass
else:
    sys.exit(1)

Aug 12 '20 17:08 simonjayhawkins

I've been wondering how exactly @simonjayhawkins does this but was too afraid to ask 😊. Great idea to document this.

Aug 13 '20 02:08 dsaxton

for an regression which is not fixed on master and now raises an Exception can just use code sample (this will give a non-zero exit status for a bad commit and zero exit status for good commit eg. https://github.com/pandas-dev/pandas/issues/35700#issuecomment-673472111

(don't need to assert correct result but probably better to document a better practice)

~/test.py

import pandas as pd
print(pd.__version__)
df = pd.DataFrame({"a": [1, 2, 3] * 10000, "b": ["x", "y", "z"] * 10000})
res = df == "x"
print(res)

and execute the following commands in the shell (git bash on Windows) see examples in https://git-scm.com/docs/git-bisect#_examples

git bisect start HEAD v1.0.5
git bisect run /bin/bash -c "python setup.py build_ext -i;python ~/test.py"
git bisect reset
python setup.py build_ext -i  # need to rebuild after reset

Aug 14 '20 15:08 simonjayhawkins

We could even put something like this into a script in scripts/, possibly even accessible in via make

Aug 14 '20 16:08 jbrockmendel

possibly even accessible in via make

if it for dev use that's probably ok. but if we also want to encourage users to bisect the issues themselves we need to be Windows friendly or have a guide that includes Linux/Mac and Windows instructions.

Aug 14 '20 16:08 simonjayhawkins

Thanks for writing these examples up so clearly here @simonjayhawkins, made it very easy to get started! A potentially useful addition is prepending CFLAGS="-O0" to the run command, which made bisection much faster for me because of greatly decreased build time.

Jan 02 '21 20:01 mzeitlin11

A potentially useful addition is prepending CFLAGS="-O0" to the run command

Is this only true for bisecting? Any reason to not use it all the time?

Anyway, here's a fully worked notebook I use to find regressions: https://www.kaggle.com/code/marcogorelli/pandas-regression-example

Sep 21 '22 07:09 MarcoGorelli

Right, let's try to move this forwards - I think this could be a nice addition to this page https://github.com/pandas-dev/pandas/blob/0dadc71dd4653e5b858d7b4153df1d7aded4ba46/doc/source/development/maintaining.rst

Doesn't need to be in-depth, I think it just needs:

a link to the git bisect documentation
a little example like the one Simon gave, e.g.:

If a user reported that pd.Series([1,2,3]).sum() returned 9 in version pandas 1.5.0 and 6 in version 1.4.0, then you could find out which commit changed the result by making a file t.py with
import pandas as pd
assert pd.Series([1,2,3]).sum() == 6
and then running
git bisect start
git bisect good v1.4.0
git bisect bad v1.5.0
git bisect run bash -c "python setup.py build_ext -j 4; python t.py"
git bisect reset
python setup.py build_ext -j 4  # need to rebuild after reset

Marking as good-first-issue then, if anyone wants to work on this feel free to open a pull request and to tag me

Sep 28 '22 21:09 MarcoGorelli

Is this only true for bisecting? Any reason to not use it all the time?

Gives a tradeoff between compile time and performance. No optimizations gives much slower code, but for a case like bisecting when (usually) there's a bunch of compiling going on (since cython files often change across bisection steps), it can be much faster provided that the bisection test case still runs quickly

Sep 28 '22 22:09 mzeitlin11

Hi @MarcoGorelli I'd like to take a crack at writing this up.

Sep 29 '22 21:09 seljaks

pandas pandas copied to clipboard

DOC/DEV: document approach to bisect regressions

pandas
pandas copied to clipboard