Return a DataFrame from HPAT Jited function

Open bigwater opened this issue 5 years ago • 1 comments

Hi,

I am trying to use HPAT to accelerate an ETL process. HPAT gives a significant speedup on a multi-core CPU for the data frame transformations, but I have run into an issue that I have not been able to figure out.

When the jitted function returns the data frame, there is no speedup, and with enough processes the run fails with an error. A minimal example follows.

import time

import pandas
import hpat


@hpat.jit
def test2():
    t0 = time.time()
    # Read the headerless CSV: columns A-H are floats, I and J are strings.
    df = pandas.read_csv(
        'random.csv',
        names=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
        dtype={'A': 'float', 'B': 'float', 'C': 'float', 'D': 'float',
               'E': 'float', 'F': 'float', 'G': 'float', 'H': 'float',
               'I': 'str', 'J': 'str'},
    )
    t_readcsv = time.time() - t0
    print('t_readcsv = ', t_readcsv)

    t0 = time.time()
    res = ( df.A.mean(), df.A.max(), df.A.min() , df.B.mean(), df.B.max(), df.B.min())
    t_calc = time.time() - t0
    print('t_calc = ', t_calc)

    return df

df = test2()
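
The file random.csv is not included with this report. As a hypothetical stand-in for reproducing the layout read above (eight float columns followed by two string columns, no header row), a minimal sketch using only the standard library could look like this. The function name write_random_csv, the row count, the seed, and the 8-character lowercase strings are all assumptions for illustration, not properties of the original file.

```python
import csv
import random
import string


def write_random_csv(path, n_rows, seed=0):
    """Write n_rows rows of 8 random floats followed by 2 random strings (no header row)."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(n_rows):
            # Columns A..H: floats in [0, 1)
            floats = [rng.random() for _ in range(8)]
            # Columns I, J: 8-character lowercase strings (assumed format)
            strings = [
                "".join(rng.choices(string.ascii_lowercase, k=8))
                for _ in range(2)
            ]
            writer.writerow(floats + strings)
```

Calling write_random_csv('random.csv', n_rows) with a row count of your choice produces a file that the read_csv call above can parse; the size needed to reproduce the reported timings is not known.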

In the baseline case (without mpiexec), "time python test_hpat3.py" takes 30.31 s.

time mpiexec -n 2 python test_hpat3.py
real    0m33.568s

time mpiexec -n 8 python test_hpat3.py
real    0m32.557s

time mpiexec -n 16 python test_hpat3.py
real    0m37.037s

time mpiexec -n 32 python test_hpat3.py
real    0m48.858s

Every MPI run of this example is slower than the baseline, and the run time grows to 48.86 s with 32 processes.
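
To put numbers on this, the following small sketch computes each run's wall-clock time relative to the baseline. It uses only the timings reported above; the helper name relative_slowdown is introduced here purely for illustration.

```python
# Wall-clock times in seconds, copied from the runs reported above.
baseline_s = 30.31
mpi_runs_s = {2: 33.568, 8: 32.557, 16: 37.037, 32: 48.858}


def relative_slowdown(run_s, base_s):
    """Return run time divided by baseline time (values above 1.0 mean slower)."""
    return run_s / base_s


slowdowns = {n: relative_slowdown(t, baseline_s) for n, t in mpi_runs_s.items()}

for n, ratio in slowdowns.items():
    print(f"{n:>2} processes: {mpi_runs_s[n]:.3f} s, {ratio:.2f}x the baseline")
```

All four ratios are above 1.0, so none of the MPI runs beats the single-process baseline.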

The behavior is different when I remove "return df" from the jitted function: in that case the speedup grows with the number of processes.

In addition, with even more processes the run is killed and an error is reported.

time mpiexec -n 44 python test_hpat3.py

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 68909 RUNNING AT CR3PPM-SER010
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

real    0m38.648s
user    22m1.044s
sys     4m20.828s

I am not sure if the slowdown/error is supposed to happen since I am quite new to HPAT.

Could you give me more explanation and suggestions about it? Let me know if other information is needed.

Since I would like to pass the data frame on to later processing after the ETL step, how can I return the data frame from the HPAT jitted function?

Thank you so much.

Best regards, Hongyuan Liu

Software configuration:

hpat   0.30.0  py37hc547734_15  intel/label/test
numba  0.45.0  py37h962f231_0

bigwater • Sep 26 '19 00:09

@bigwater thanks for your report. We'll look into it.

fschlimb • Sep 27 '19 14:09