MonetDBLite-Python
Parallelized inserts using MonetDBLite are not persistent
- MonetDBLite-Python version: 0.6.3
- Python version: 3.7.3
- Pip version: 19.0.3
- Operating System: Arch Linux
Description
I'm importing point cloud data. Since some processing steps are pretty CPU-intensive, I'm parallelizing the processing. At the end of the preprocessing, the data is loaded into MonetDB from a Pandas data frame.
As long as the Python process is active, the size of the database on disk increases with each insert. But as soon as the process/worker terminates, the on-disk size shrinks back to 1.5 MB.
How can I make the changes persistent?
What I Did
This is a rough simplification of the code:
import numpy
import pandas
import monetdblite
from multiprocessing import Pool

def process(item):
    # preprocessing...
    x, y = numpy.meshgrid(numpy.arange(1000), numpy.arange(1000))
    z = numpy.random.rand(1000000)
    # flatten the 2-D meshgrid arrays so all columns are 1-D
    data = pandas.DataFrame({"x": x.ravel(), "y": y.ravel(), "z": z})
    # per-worker client connection for the insert
    conn = monetdblite.connectclient()
    monetdblite.insert('points', data, client=conn)
    del conn

datalist = [...]
monetdblite.init("./database/")
with Pool(processes=2, maxtasksperchild=1) as p:
    p.map(process, datalist, 1)
monetdblite.shutdown()
Hi,
Without knowing the specifics of your application, I think the problem you might be facing is how MonetDB (and, as a consequence, MonetDBLite) handles transactions. Take a look at this blog post. Please note that, contrary to MonetDB, MonetDBLite is not in autocommit mode by default.
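As an illustration only (using the stdlib sqlite3 module as an analogy, not the MonetDBLite API), this is what a non-autocommit connection looks like in practice: an insert made without an explicit commit() is rolled back when the connection closes.

```python
import os
import sqlite3
import tempfile

# Hypothetical sketch with stdlib sqlite3, which also runs without
# autocommit by default: uncommitted changes are discarded on close.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

conn = sqlite3.connect(path)
conn.execute("CREATE TABLE points (z REAL)")
conn.commit()                      # persist the schema

conn.execute("INSERT INTO points VALUES (1.5)")
conn.close()                       # closed without commit(): insert is lost

conn = sqlite3.connect(path)
count = conn.execute("SELECT COUNT(*) FROM points").fetchone()[0]
print(count)                       # 0 -- the uncommitted insert is gone
conn.close()
```

The same reasoning would apply to MonetDBLite: each worker has to commit its own transaction before its connection goes away.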
Thanks for your valuable time! I created a minimal working example that should demonstrate my case. Parallel inserts don't work anyway, due to #33 I suppose, but even without parallel execution I can't get transactions to commit.
import numpy as np
import pandas as pd
import monetdblite

def process(idx):
    # generate data
    z = np.random.rand(10000000) + idx
    data = pd.DataFrame({"z": z})
    # create a new client connection
    connection = monetdblite.make_connection("./database/")
    cursor = connection.cursor()
    # insert data
    cursor.insert('points', data)
    cursor.commit()
    cursor.close()
    connection.close()
    del cursor
    del connection

# init db
monetdblite.init("./database/")
monetdblite.sql('CREATE TABLE points (Z FLOAT);')
monetdblite.shutdown()

process(1)
Hi again. Thanks for the reports, I really appreciate them!
I will take a look at the snippet you just posted, but unfortunately it may take some time because of pressing deadlines in other projects. One question: do you get any error messages when you execute it?
Also, try inserting the data using a plain dictionary instead of a Pandas dataframe. If that works, the bug is better localized and I would know roughly where to look.
Cheers. There is no error message. With
cursor.insert('points', {"Z": np.random.rand(10000000)})
the behaviour does not change.
I have reproduced the problem. I will investigate further and let you know what I find, but again it might take some time to fix. Thanks again.