
regression transform not correct when axis in log scale

Open robna opened this issue 1 year ago • 4 comments

When using transform_regression(...) (linear) while one axis is on a log scale, e.g. x=alt.X('x', scale=alt.Scale(type='log')), I get something like the plot below, with the blue regression line. It should instead look like the red (hand-drawn) line.

Altair version is 5.01, but I think this is not new.

visualization_RegTest

Here is some code that reproduces this:

import altair as alt
import pandas as pd
import numpy as np

# create a dummy DataFrame with random data
np.random.seed(0)
df = pd.DataFrame({
    'Horsepower': np.random.randint(1, 10, size=4),
    'Miles_per_Gallon': np.random.randint(1, 10, size=4)
})

# create a scatter chart
scatter = alt.Chart(df).mark_point().encode(
    x=alt.X('Horsepower:Q', scale=alt.Scale(type='log')),
    y='Miles_per_Gallon:Q'
)

# create a linear regression line overlay
reg_line = scatter.transform_regression('Horsepower', 'Miles_per_Gallon').mark_line()

scatter + reg_line

robna, Oct 11 '23 02:10

Thanks for the report! To me at first glance this seems like a Vega-Lite issue: Open the Chart in the Vega Editor. I tried searching Vega-Lite issues but didn't see anything. (Is there any chance this is intended behavior, with the line of best fit being a line with respect to the chosen scales? That doesn't seem correct to me, but I don't work much with log scales...)
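
To make the two readings concrete: a line "with respect to the chosen scales" would be a fit of y against ln(x), which is a different model from the linear fit y = a + b*x that the report expects. Below is a minimal pure-Python sketch of both fits on the small dataset used later in this thread. The helper `fit_line` is hypothetical (it is not part of Altair or Vega-Lite), and the remark that the second model corresponds to method='log' in transform_regression is my assumption about how that option is defined.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


xs = [1, 3, 5, 7]
ys = [1, 3, 5, 7]

# Fit in data space: y = a + b * x.
# Drawn correctly, this appears curved on a log-scaled x-axis.
a_lin, b_lin = fit_line(xs, ys)

# Fit against ln(x): y = a + b * ln(x).
# This one appears as a straight line on a log-scaled x-axis.
a_log, b_log = fit_line([math.log(x) for x in xs], ys)

# The two models generally disagree between the data points.
pred_lin_at_3 = a_lin + b_lin * 3
pred_log_at_3 = a_log + b_log * math.log(3)
```

For this dataset the linear fit reproduces y = x exactly, while the fit against ln(x) predicts a noticeably different value at x = 3, so the two interpretations are not interchangeable.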

ChristopherDavisUCI, Oct 11 '23 04:10

Yes, apparently when using linear regression, Vega-Lite represents the line only by its first and last coordinates. Connecting them results in a straight line on screen, even when the axis is on a log scale.

Other regression methods that may produce a curved line (like pow) include intermediate points, even if the relation is a perfectly straight line that is optimally approximated by $y = x^1$.

This can be seen when switching the regression mark to circle.

import altair as alt
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({
    'x': [1, 3, 5, 7],
    'y': [1, 3, 5, 7],
})

scatter = alt.Chart(df).mark_point(size=400).encode(
    x=alt.X('x', scale=alt.Scale(type='log')),
    y='y'
)
reg_line = scatter.transform_regression('x', 'y', method='pow').mark_circle(size=100, color='red')

scatter + reg_line

image
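
A possible workaround, sketched in plain Python: fit the line outside the chart and evaluate it at many x values, so the drawn path has enough intermediate points to curve correctly on a log axis. The helpers `fit_line` and `sample_line` and the choice of 50 points are hypothetical, not anything Altair or Vega-Lite provides, and the sketch assumes all x values are positive (as a log scale requires).

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sample_line(xs, ys, n_points=50):
    """Evaluate the fitted line at n_points x values spaced evenly in log space."""
    intercept, slope = fit_line(xs, ys)
    lo, hi = min(xs), max(xs)
    # Constant ratio between consecutive x values gives even spacing on a log axis.
    ratio = (hi / lo) ** (1 / (n_points - 1))
    rows = []
    for i in range(n_points):
        x = lo * ratio ** i
        rows.append({"x": x, "y": intercept + slope * x})
    return rows


rows = sample_line([1, 3, 5, 7], [1, 3, 5, 7])
```

The resulting rows could then be drawn with something like alt.Chart(pd.DataFrame(rows)).mark_line().encode(x=alt.X('x', scale=alt.Scale(type='log')), y='y') and layered on top of the scatter chart in place of the transform_regression layer.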

robna, Oct 11 '23 14:10

Thanks @robna for investigating this further! Would it make sense to open an issue with Vega-Lite? https://github.com/vega/vega-lite/issues

ChristopherDavisUCI, Oct 11 '23 18:10