shap
shap copied to clipboard
Blacken source code?
Black is emerging as the standard code formatter for Python (https://github.com/python/black, https://www.youtube.com/watch?v=esZLCuWs_2Y)
Are you open to reformatting the code using this? If so, I can submit a PR for it.
@jamesmyatt That is great idea! Would you be interested in making PR? It should be added to pyproject.toml and maybe into pre-commit.
@thatlittleboy Could you confirm that there is no black formatted used at the moment?
@detrin We discussed that before over at https://github.com/dsgibbons/shap/issues/21, our conclusion was to not use black for the time being.
I think we can repoen the discussion for it at some stage, but I would say it's not a huge priority.
@thatlittleboy Okay so I read the other issue and all the points are valid.
My habit is that when I join new project that wasn't maintained properly using black and any fancy tools can make the whole developer experience a bit easier. I see it as motor oil.
One concern though is that reformatting with black might make it very difficult to merge any changes between forks, as the diff would be enormous and merge conflicts would likely arise everywhere. It would be great if the original repo were already formatted with black, but we have to be pragmatic.
Agree.
but I would say it's not a huge priority.
Also, agree. I think splitting the repository would also help to make it more maintainable.
@connortann Is this already solved by #3550?
No, although it's related: that PR only related to docstrings.
We have applied the black code style (implemented via ruff format) to the notebooks, but not yet to the code. We could indeed extend the autoformatting to the code base, although I've been a bit hesitant so far as the diff is huge and would be a bit of a pain to review. However, I think now is probably a good a time as any, so I'll have a look on Monday.
I made a quick script to see what the effect of the formatter for a range of line lengths, counting the lines changed in the resulting git diff.
A line length of 120 seems like a reasonable choice that would keep the diff to a minimum. FYI @CloseChoice
Script to reproduce:
import subprocess
import sys
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
REPO = Path("~/projects/shap").expanduser().resolve()
RUFF = "/home/connor/miniforge3/envs/shap/bin/ruff"
def run_repo_command(command) -> str:
result = subprocess.run(
command,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
shell=True,
text=True,
cwd=REPO,
)
if result.returncode != 0:
print(f"Error running command '{command}': {result.stderr}")
sys.exit(1)
return result.stdout.strip()
def evaluate_line_length_effect(line_length) -> dict[str, int]:
print("Evaluating line length effect for length:", line_length)
# Nb. on my local "format" branch, the .py files are included by the ruff formatter.
run_repo_command("git checkout format")
run_repo_command("git reset --hard")
run_repo_command(f"{RUFF} format shap tests --line-length {line_length}")
diff = run_repo_command("git diff --shortstat")
# E.g."181 files changed, 13157 insertions(+), 5131 deletions(-)"
metrics = [int(s) for s in diff.split() if s.isdigit()]
assert len(metrics) == 3, "Unexpected output from git diff --shortstat: " + diff
return dict(
length=line_length,
changed=metrics[0],
insertions=metrics[1],
deletions=metrics[2],
)
def scrape():
data = [evaluate_line_length_effect(length) for length in range(50, 200)]
df = pd.DataFrame(data)
df.to_csv("line_length_effect.csv", index=False)
def plot():
df = pd.read_csv("line_length_effect.csv")
fig, ax = plt.subplots()
df.plot(x="length", y="changed", ax=ax)
ax2 = df.plot(x="length", y="insertions", ax=ax, secondary_y=True)
ax2.set_ylim(bottom=0)
plt.show()
if __name__ == "__main__":
scrape()
plot()