Return a DataFrame from an HPAT jitted function
Hi,
I am trying to use HPAT to accelerate an ETL process. Although HPAT gives a significant speedup on a multi-core CPU for data frame transformations, I have run into an issue that I have not been able to figure out.
When the data frame is returned from the jitted function, there is either no speedup or the run fails with an error. A minimal example follows.
import time
import hpat
import pandas

@hpat.jit
def test2():
    # Time the CSV read.
    t0 = time.time()
    df = pandas.read_csv('random.csv',
        names=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
        dtype={'A': 'float', 'B': 'float', 'C': 'float', 'D': 'float', 'E': 'float',
               'F': 'float', 'G': 'float', 'H': 'float', 'I': 'str', 'J': 'str'})
    t_readcsv = time.time() - t0
    print('t_readcsv = ', t_readcsv)
    # Time the column reductions.
    t0 = time.time()
    res = (df.A.mean(), df.A.max(), df.A.min(), df.B.mean(), df.B.max(), df.B.min())
    t_calc = time.time() - t0
    print('t_calc = ', t_calc)
    # Returning df is the step that changes the scaling behavior.
    return df

df = test2()
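The input file random.csv is not attached to this report. For anyone trying to reproduce the behavior, here is a minimal sketch of a generator for a stand-in file with the same layout: no header row, eight float columns (A to H) followed by two string columns (I and J). The function name write_random_csv, the row count, the value ranges, and the string length are all assumptions made for illustration. The size and contents of the original file are unknown.

```python
import csv
import random
import string

FLOAT_COLS = 8  # columns A..H are read as float in the example
STR_COLS = 2    # columns I and J are read as str


def write_random_csv(path, n_rows, seed=0):
    """Write n_rows header-less rows: 8 random floats, then 2 random strings.

    Hypothetical helper; the real random.csv may look quite different.
    """
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(n_rows):
            floats = [rng.random() for _ in range(FLOAT_COLS)]
            strs = ["".join(rng.choices(string.ascii_lowercase, k=8))
                    for _ in range(STR_COLS)]
            writer.writerow(floats + strs)


if __name__ == "__main__":
    # The row count is an arbitrary placeholder; scale it up to get timings
    # that are dominated by the read and the reductions, not by startup.
    write_random_csv("random.csv", n_rows=1000)
```

A file produced this way matches the names and dtype arguments passed to read_csv above, but timings obtained with it are not comparable to the numbers in this report.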
In the baseline case, running time python test_hpat3.py takes 30.31s. With mpiexec:
time mpiexec -n 2 python test_hpat3.py
real 0m33.568s
time mpiexec -n 8 python test_hpat3.py
real 0m32.557s
time mpiexec -n 16 python test_hpat3.py
real 0m37.037s
time mpiexec -n 32 python test_hpat3.py
real 0m48.858s
In other words, using more MPI processes for this example program only makes it slower.
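To make the trend explicit, the timings can be turned into a speedup relative to the baseline (baseline time divided by MPI time, so values below 1 mean a slowdown). This is only a sketch over the numbers quoted above. It assumes the 30.31s baseline is wall-clock time comparable to the "real" values, and all of these figures include Python startup and JIT compilation, not just the jitted work.

```python
# Wall-clock times in seconds, copied from the runs quoted above.
BASELINE_S = 30.31  # time python test_hpat3.py (assumed to be "real" time)
mpi_runs = {2: 33.568, 8: 32.557, 16: 37.037, 32: 48.858}  # mpiexec -n N


def speedup(baseline_s, run_s):
    """Return baseline/run; a value below 1.0 means the run was slower."""
    return baseline_s / run_s


speedups = {n: speedup(BASELINE_S, t) for n, t in mpi_runs.items()}

for n, s in speedups.items():
    print(f"-n {n:>2}: {s:.2f}x of baseline")
```

Every ratio comes out below 1, and the 32-process run is the slowest of the four.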
The behavior is different when I remove the return df statement from the jitted function: in that case the speedup grows with the number of processes.
In addition, if I use even more processes, the run fails with an error.
time mpiexec -n 44 python test_hpat3.py
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 68909 RUNNING AT CR3PPM-SER010
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
real 0m38.648s
user 22m1.044s
sys 4m20.828s
I am quite new to HPAT, so I am not sure whether this slowdown and error are expected.
Could you explain what is happening and suggest how to address it? Let me know if you need any other information.
Since I need to use the data frame after the ETL step, how can I return it from the HPAT jitted function?
Thank you so much.
Best regards, Hongyuan Liu
Software configuration:
hpat 0.30.0 py37hc547734_15 intel/label/test
numba 0.45.0 py37h962f231_0
@bigwater thanks for your report. We'll look into it.