
[BUG] Result inconsistent when intermediate results executed

Open wjsi opened this issue 3 years ago • 1 comments

Describe the bug

The same DataFrame expression produces different results depending on whether its inputs have already been executed. In the first snippet below, the intermediate results are not executed before the final expression runs; in the second, they are.

To Reproduce

import numpy as np
import pandas as pd
import mars.dataframe as md
import mars.tensor as mt

rs = np.random.RandomState(0)
raw_df = rs.rand(20, 10)
raw_df = pd.DataFrame(
    np.where(raw_df > 0.4, raw_df, np.nan), columns=list("ABCDEFGHIJ")
)
raw_df2 = rs.rand(20, 10)
raw_df2 = pd.DataFrame(
    np.where(raw_df2 > 0.4, raw_df2, np.nan), columns=list("ACDEGHIJKL")
)

df = md.DataFrame(raw_df, chunk_size=4)
df2 = md.DataFrame(raw_df2, chunk_size=6)

joined = md.merge(
    df.index.to_frame(), df2.index.to_frame(), how="outer", left_index=True, right_index=True
)
df, df2 = df.reindex(joined.index), df2.reindex(joined.index)

nna_df = df.notna().astype(np.float64)  # np.float_ was an alias of np.float64 and was removed in NumPy 2.0
nna_df2 = df2.notna().astype(np.float64)

df, df2 = df.fillna(0), df2.fillna(0)

print(df.mul(nna_df2, axis=0).sum(axis=0).execute().fetch())
"""
prints

A    0.0
B    0.0
C    0.0
D    0.0
E    0.0
F    0.0
G    0.0
H    0.0
I    0.0
J    0.0
K    0.0
L    0.0
dtype: float64
"""

df.execute()
nna_df2.execute()
print(df.mul(nna_df2, axis=0).sum(axis=0).execute().fetch())
"""
prints

A    4.392781
B    0.000000
C    5.063856
D    5.276695
E    4.752212
F    0.000000
G    5.451037
H    4.206434
I    6.948446
J    2.303648
K    0.000000
L    0.000000
dtype: float64
"""

wjsi avatar Apr 02 '22 06:04 wjsi

To simplify:

In [6]: rs = np.random.RandomState(0)
   ...: raw_df = rs.rand(20, 10)
   ...: raw_df = pd.DataFrame(
   ...:     np.where(raw_df > 0.4, raw_df, np.nan), columns=list("ABCDEFGHIJ")
   ...: )
   ...: raw_df2 = rs.rand(20, 10)
   ...: raw_df2 = pd.DataFrame(
   ...:     np.where(raw_df2 > 0.4, raw_df2, np.nan), columns=list("ACDEGHIJKL")
   ...: )
   ...:
   ...: df = md.DataFrame(raw_df, chunk_size=4)
   ...: df2 = md.DataFrame(raw_df2, chunk_size=6)
   ...:
   ...: joined = md.merge(
   ...:     df.index.to_frame(), df2.index.to_frame(), how="outer", left_index=True, right_index=True
   ...: )
   ...: df, df2 = df.reindex(joined.index), df2.reindex(joined.index)
   ...: print(df.mul(df2, axis=0).sum(axis=0).execute().fetch())
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0/100 [00:00<00:00, 289.25it/s]
A    0.0
B    0.0
C    0.0
D    0.0
E    0.0
F    0.0
G    0.0
H    0.0
I    0.0
J    0.0
K    0.0
L    0.0
dtype: float64

In [7]: df2.execute()
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0/100 [00:00<00:00, 755.23it/s]
Out[7]:
           A         C         D         E         G         H         I         J         K         L
0        NaN  0.696343       NaN       NaN       NaN       NaN  0.679393  0.453697  0.536579  0.896671
1   0.990339       NaN  0.663078       NaN       NaN  0.758379       NaN       NaN  0.588317  0.831048
2   0.628982  0.872651       NaN  0.798047       NaN  0.952792  0.687488       NaN  0.947371  0.730856
3        NaN       NaN  0.518201       NaN       NaN  0.424685       NaN  0.463575       NaN  0.586784
4   0.863856       NaN  0.517379       NaN  0.716860       NaN  0.565421       NaN       NaN  0.488056
5        NaN  0.940432  0.765325  0.748664  0.903720       NaN  0.552192  0.584476  0.961936       NaN
6        NaN       NaN       NaN  0.929529  0.669917  0.785153       NaN  0.586410       NaN  0.485628
7   0.977495  0.876505       NaN  0.961570       NaN  0.949319  0.941378  0.799203  0.630448  0.874288
8        NaN  0.848944  0.617877       NaN       NaN       NaN  0.981829  0.478370  0.497391  0.639473
9        NaN       NaN  0.822118       NaN  0.511319       NaN       NaN  0.862192  0.972919  0.960835
10  0.906555  0.774047       NaN       NaN  0.407241       NaN       NaN       NaN  0.725594       NaN
11  0.770581       NaN       NaN       NaN  0.672048       NaN  0.420539  0.557369  0.860551  0.727044
12       NaN       NaN       NaN       NaN       NaN  0.456141  0.683281  0.695625       NaN       NaN
13       NaN  0.788546       NaN  0.696997  0.778695  0.777408       NaN       NaN  0.587600       NaN
14       NaN       NaN  0.459856       NaN  0.799796       NaN  0.518835       NaN  0.577543  0.959433
15  0.645570       NaN  0.430402  0.510017  0.536177  0.681393       NaN       NaN       NaN  0.956406
16       NaN  0.903984  0.543806  0.456911  0.882041  0.458604  0.724168       NaN  0.904044  0.690025
17  0.699622       NaN  0.756779  0.636061       NaN       NaN  0.796391  0.959167  0.458139  0.590984
18  0.857723  0.457223  0.951874  0.575751  0.820767  0.908844  0.815524       NaN  0.628898       NaN
19       NaN  0.424032       NaN  0.849038       NaN  0.958983       NaN       NaN       NaN       NaN

In [8]: print(df.mul(df2, axis=0).sum(axis=0).execute().fetch())
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0/100 [00:00<00:00, 292.82it/s]
A    3.677647
B    0.000000
C    4.093799
D    3.071312
E    3.224467
F    0.000000
G    3.593087
H    3.225237
I    4.713964
J    1.146618
K    0.000000
L    0.000000
dtype: float64
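
For comparison, here is a pandas-only sketch of what both expressions should evaluate to, independent of Mars. It rebuilds the same inputs from the seed above and assumes that `Index.union` is an adequate stand-in for the outer `md.merge` on the two indexes (both frames use the default index 0..19, so the reindex should leave the rows unchanged). The names `expected_simple` and `expected_masked` are made up for this illustration and are not part of the original report.

```python
import numpy as np
import pandas as pd

# Rebuild the inputs exactly as in the report (same seed, same thresholds).
rs = np.random.RandomState(0)
raw = rs.rand(20, 10)
raw_df = pd.DataFrame(np.where(raw > 0.4, raw, np.nan), columns=list("ABCDEFGHIJ"))
raw2 = rs.rand(20, 10)
raw_df2 = pd.DataFrame(np.where(raw2 > 0.4, raw2, np.nan), columns=list("ACDEGHIJKL"))

# Assumption: the union of the two indexes plays the role of the outer merge.
joined_index = raw_df.index.union(raw_df2.index)
pdf = raw_df.reindex(joined_index)
pdf2 = raw_df2.reindex(joined_index)

# Simplified expression from the second comment.
expected_simple = pdf.mul(pdf2, axis=0).sum(axis=0)

# Original expression from the report: zero-filled values times a not-NaN mask.
mask2 = pdf2.notna().astype(np.float64)
expected_masked = pdf.fillna(0).mul(mask2, axis=0).sum(axis=0)

print(expected_simple)
print(expected_masked)
```

In plain pandas, columns present in only one of the frames (B, F, K, L) multiply to all-NaN and therefore sum to 0, while the shared columns should not all be 0. That pattern matches the results obtained after the intermediate `execute()` calls, which suggests the all-zero output from the unexecuted graph is the incorrect one.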

hekaisheng avatar Apr 11 '22 07:04 hekaisheng