Pyarrow Performance Problem

Open mxwli opened this issue 6 months ago • 4 comments

Kùzu version

projection-py branch

What operating system are you using?

No response

What happened?

When scanning from large Python dataframes, Arrow sources appear to be noticeably slower than NumPy sources. For example, consider the following results for a 100-million-row table with two integer columns.

loadNumpy: 0.09682886581867933s
loadNumpyProj: 0.03378376178443432s

loadArrow: 0.4811821011826396s
loadArrowProj: 0.4280042313039303s

The query used for the non-projected results is:

LOAD FROM df RETURN max(col0), max(col1)

Results with the Proj suffix used the following query:

LOAD FROM df RETURN max(col0)

Comparing the absolute difference between the non-projected and projected queries, the NumPy source improves by about 0.063s and the Arrow source by about 0.053s. This is most likely the time saved by not scanning the second column and not computing its max value. However, Arrow scanning appears to carry a significant overhead on top of that. The most likely cause of this overhead is the binding process, though I can't be sure.
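To make the arithmetic explicit, here is a small sketch that recomputes the deltas from the timings reported above. The variable names are only for illustration. It also prints the Arrow-minus-NumPy gap for each query, which comes out roughly the same with and without projection. That would be consistent with an overhead that does not depend on the number of columns scanned, but it does not prove where the time goes.

```python
# Timings copied from the results above (seconds).
numpy_full = 0.09682886581867933
numpy_proj = 0.03378376178443432
arrow_full = 0.4811821011826396
arrow_proj = 0.4280042313039303

# Time saved by projecting away the second column, per source.
numpy_delta = numpy_full - numpy_proj
arrow_delta = arrow_full - arrow_proj

# Extra time spent by the Arrow source relative to NumPy, per query.
gap_full = arrow_full - numpy_full
gap_proj = arrow_proj - numpy_proj

print(f"projection saves: numpy={numpy_delta:.3f}s arrow={arrow_delta:.3f}s")
print(f"arrow - numpy gap: full={gap_full:.3f}s proj={gap_proj:.3f}s")
```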

Are there known steps to reproduce?

The following script was used to generate the results. It also contains commented-out code for benchmarking JSON, but the performance results there are not expected to be significant.

import kuzu
import os
import random
import pandas as pd
import polars as pl
import numpy as np
import time

conn = kuzu.Connection(kuzu.Database())
conn.execute("LOAD EXTENSION 'extension/json/build/libjson.kuzu_extension';")
# conn.execute("CALL THREADS=1")

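# The data below comes from np.random, so seed NumPy's generator;
# random.seed() would not affect it.
np.random.seed(100)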
datalen = 100_000_000
maxnum = 1_000_000_000
mp = {'col0': np.random.randint(maxnum, size=(datalen)), 'col1': np.random.randint(maxnum, size=(datalen))}
polarsDF = pl.DataFrame(mp)
pandasDF = pd.DataFrame(mp)
# open('test.json', 'w').write('[\n' + ',\n'.join(['{"col0": ' + str(mp['col0'][i]) + ', "col1": ' + str(mp['col1'][i]) + '}' for i in range(datalen)]) + '\n]\n')

def benchmark(func, params):
        begin = time.perf_counter()
        func(*params)
        end = time.perf_counter()
        return end - begin

# In each function below, the query refers to the `df` argument by its variable name.
def loadNumpy(conn, df):
        conn.execute('LOAD FROM df RETURN max(col0), max(col1)').get_as_pl()

def loadNumpyProj(conn, df):
        conn.execute('LOAD FROM df RETURN max(col0)').get_as_pl()

def loadArrow(conn, df):
        conn.execute('LOAD FROM df RETURN max(col0), max(col1)').get_as_pl()

def loadArrowProj(conn, df):
        conn.execute('LOAD FROM df RETURN max(col0)').get_as_pl()

def loadJSON(conn):
        conn.execute('LOAD FROM "test.json" RETURN max(col0), max(col1)').get_as_pl()

def loadJSONProj(conn):
        conn.execute('LOAD FROM "test.json" RETURN max(col0)').get_as_pl()

print(f'loadNumpy: {benchmark(loadNumpy, [conn, pandasDF])}')
print(f'loadNumpyProj: {benchmark(loadNumpyProj, [conn, pandasDF])}')
print(f'loadArrow: {benchmark(loadArrow, [conn, polarsDF])}')
print(f'loadArrowProj: {benchmark(loadArrowProj, [conn, polarsDF])}')
# print(f'loadJSON: {benchmark(loadJSON, [conn])}')
# print(f'loadJSONProj: {benchmark(loadJSONProj, [conn])}')

# os.system('rm test.json')
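
The script above times each query once. To rule out one-off effects such as warm-up, a possible variant is to run each function several times and report the best and median time. This is a sketch only: `benchmark_repeated`, `warmup`, and `repeats` are hypothetical names that are not in the original script, and the workload at the bottom is a stand-in so the snippet runs without Kùzu.

```python
import statistics
import time

def benchmark_repeated(func, params, warmup=1, repeats=5):
    """Run func(*params) several times; return (best, median) in seconds."""
    # Untimed warm-up runs, so one-time costs do not skew the samples.
    for _ in range(warmup):
        func(*params)
    samples = []
    for _ in range(repeats):
        begin = time.perf_counter()
        func(*params)
        samples.append(time.perf_counter() - begin)
    return min(samples), statistics.median(samples)

# Stand-in workload; in the script above this would be e.g. loadArrow
# with [conn, polarsDF] as params.
calls = []
def workload(n):
    calls.append(n)
    return sum(range(n))

best, median = benchmark_repeated(workload, [100_000])
print(f"best={best:.6f}s median={median:.6f}s")
```

If the gap between the Arrow and NumPy sources stays about the same across repeated runs, it is less likely to be measurement noise.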

mxwli, Aug 23 '24 15:08