
Parallelized inserts using MonetDBLite are not persistent

Open sedot42 opened this issue 5 years ago • 5 comments

  • MonetDBLite-Python version: 0.6.3
  • Python version: 3.7.3
  • Pip version: 19.0.3
  • Operating System: Arch Linux

Description

I'm importing point cloud data. Since some processing steps are pretty CPU-intensive, I'm parallelizing the processing. At the end of the preprocessing, the data is loaded into MonetDB from a Pandas data frame.

As long as the Python process is active, the size of the database on disk increases with each insert. But as soon as the process/worker terminates, the on-disk size shrinks back to 1.5 MB.

How can I make the changes persistent?

What I Did

This is a rough simplification of the code:

def process(item):
    # preprocessing...
    x, y = numpy.meshgrid(numpy.arange(1000), numpy.arange(1000))
    z = numpy.random.rand(1000000)
    # flatten the 2-D meshgrid output so every column is 1-D
    data = pandas.DataFrame({"x": x.ravel(), "y": y.ravel(), "z": z})
    conn = monetdblite.connectclient()
    monetdblite.insert('points', data, client=conn)
    del conn

datalist = [...]
monetdblite.init("./database/")
with Pool(processes=2, maxtasksperchild=1) as p:
    p.map(process, datalist, 1)
monetdblite.shutdown()

Related stackoverflow question

sedot42 avatar May 20 '19 13:05 sedot42

Hi,

Without knowing the specifics of your application, I think the problem you might be facing is how MonetDB (and, as a consequence, MonetDBLite) handles transactions. Take a look at this blog post. Please note that, contrary to MonetDB, MonetDBLite is not in autocommit mode by default.
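To illustrate why uncommitted inserts vanish when the process exits: with manual-commit semantics, closing a connection rolls back any open transaction. MonetDBLite's API differs, but the same pattern can be sketched with the standard-library sqlite3 module, which also defaults to manual commits for DML:

```python
import os
import sqlite3
import tempfile

db = os.path.join(tempfile.mkdtemp(), "demo.db")

# First connection: create the table (committed), then insert WITHOUT committing.
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE points (z REAL)")
conn.commit()                       # persist the schema
conn.execute("INSERT INTO points VALUES (1.0)")
conn.close()                        # no commit: the open transaction is rolled back

# Second connection: the uncommitted row is gone.
conn = sqlite3.connect(db)
rows = conn.execute("SELECT COUNT(*) FROM points").fetchone()[0]
print(rows)  # 0 -- the insert never became durable
conn.close()
```

The fix in either library is the same: call `commit()` on the connection before closing it (or enable autocommit, if the driver supports it).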

kutsurak avatar May 20 '19 16:05 kutsurak

Thanks for your valuable time! I created a minimal working example that should demonstrate my case. Parallel inserts don't work anyway, presumably due to #33, but even without parallel execution I can't get transactions to commit.

import numpy as np
import pandas as pd
import monetdblite

def process(idx):
    # generate data
    z = np.random.rand(10000000) + idx
    data = pd.DataFrame({"z": z})
    # create a new client connection
    connection = monetdblite.make_connection("./database/")
    cursor = connection.cursor()
    # insert data
    cursor.insert('points', data)
    cursor.commit()
    cursor.close()
    connection.close()
    del cursor
    del connection

# init db
monetdblite.init("./database/")
monetdblite.sql('CREATE TABLE points (Z FLOAT);')
monetdblite.shutdown()

process(1)

sedot42 avatar May 21 '19 13:05 sedot42

Hi again. Thanks for the reports, I really appreciate them!

I will take a look at the snippet you just posted, but unfortunately it will take some time because of pressing deadlines in other projects. One question: do you get any error messages when you execute it?

Also try inserting the data using a plain dictionary instead of a Pandas dataframe. If that works, then the bug is better localized and I would know more or less where to look.

kutsurak avatar May 21 '19 14:05 kutsurak

Cheers. There is no error message. With

cursor.insert('points', {"Z": np.random.rand(10000000)})

the behaviour does not change.

sedot42 avatar May 22 '19 05:05 sedot42

I have reproduced the problem. I will investigate further and let you know what I find, but again it might take some time to fix. Thanks again.

kutsurak avatar May 22 '19 07:05 kutsurak