
Is there a chance to implement multithreading?

Open linuzer opened this issue 8 years ago • 6 comments

I use pgfutter to regularly import JSON files of several GBs and it works great!

The only thing is, it seems to run on only one thread, which leaves a lot of resources unused during the import. It would be cool if pgfutter automatically split the import file into chunks that get imported on multiple threads in parallel. Is there a chance to implement this?

linuzer · Jan 24 '17 11:01

I actually tried this before and it did not yield significant speed improvements (for CSV). Internally I use the Postgres COPY command with streaming, and it is usually all IO bound. When I ran 2 goroutines, each copy routine just went half as fast.

However, for JSON streams it might make sense to do the decoding/encoding in parallel.
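For illustration, here is a rough Go sketch of that idea (not pgfutter's actual code): several goroutines decode and re-encode the JSON lines in parallel, while a single goroutine feeds the rows into one COPY stream via lib/pq. The schema/table/column names are placeholders, and row order is not preserved.

```go
package main

import (
	"bufio"
	"database/sql"
	"encoding/json"
	"log"
	"os"
	"runtime"
	"sync"

	"github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "dbname=import sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	txn, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	// Assumed target: a single json column "data" in schema "import".
	stmt, err := txn.Prepare(pq.CopyInSchema("import", "mytable", "data"))
	if err != nil {
		log.Fatal(err)
	}

	lines := make(chan string, 1024) // raw JSON lines from the input
	rows := make(chan string, 1024)  // validated/re-encoded rows for COPY

	// Fan out: one decoding worker per CPU core.
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for line := range lines {
				var v interface{}
				if err := json.Unmarshal([]byte(line), &v); err != nil {
					log.Printf("skipping invalid JSON line: %v", err)
					continue
				}
				encoded, _ := json.Marshal(v)
				rows <- string(encoded)
			}
		}()
	}
	go func() { wg.Wait(); close(rows) }()

	// Fan in: a single goroutine drives the COPY stream.
	done := make(chan struct{})
	go func() {
		defer close(done)
		for row := range rows {
			if _, err := stmt.Exec(row); err != nil {
				log.Fatal(err)
			}
		}
	}()

	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long lines
	for scanner.Scan() {
		lines <- scanner.Text()
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	close(lines)
	<-done

	if _, err := stmt.Exec(); err != nil { // flush the COPY buffer
		log.Fatal(err)
	}
	if err := stmt.Close(); err != nil {
		log.Fatal(err)
	}
	if err := txn.Commit(); err != nil {
		log.Fatal(err)
	}
}
```

Whether this helps in practice depends on how much time is spent decoding versus waiting on the COPY stream.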


lukasmartinelli · Jan 24 '17 11:01

Oh, do you import large JSON files or files in the JSON lines format (one JSON object per line)?


lukasmartinelli · Jan 24 '17 11:01

The latter, so it's one JSON object per line, but millions of lines. I see the pgfutter process using one of eight cores at 100%, and PostgreSQL (on the same machine), although running as several processes, using on average another core, so there are basically six cores doing nothing.

linuzer · Jan 24 '17 11:01

Oh, I should mention that I have a fast SSD, which is still far from being at its limit.

linuzer · Jan 24 '17 11:01

It's cool to hear someone else is importing large JSON files as well. I used it before to import a few hundred GBs with 500 million lines of JSON objects - but there I also ran the processes in parallel.

The only thing is, it seems to run on only one thread, which leaves a lot of resources unused during the import. It would be cool if pgfutter automatically split the import file into chunks that get imported on multiple threads in parallel. Is there a chance to implement this?

It might make transactional failure harder. When the copy stream fails I cannot guarantee that no data is inserted. But perhaps that's not an issue, because I wrap everything in a transaction (it might be that the copy operation is a transaction on its own, though).
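For reference, a minimal sketch of the usual lib/pq bulk-import pattern, with a hypothetical single-column table: the COPY statement is prepared on the transaction, so any error before Commit can be rolled back and none of the copied rows remain visible. Splitting the work across several parallel COPY streams in separate transactions is what would make that single-rollback guarantee harder.

```go
package main

import (
	"database/sql"
	"log"

	"github.com/lib/pq"
)

// copyRows streams rows into tableName via COPY inside one transaction.
// If anything fails before Commit, Rollback discards the whole import.
func copyRows(db *sql.DB, tableName string, rows []string) error {
	txn, err := db.Begin()
	if err != nil {
		return err
	}
	// "data" is a hypothetical single json column.
	stmt, err := txn.Prepare(pq.CopyIn(tableName, "data"))
	if err != nil {
		txn.Rollback()
		return err
	}
	for _, row := range rows {
		if _, err := stmt.Exec(row); err != nil {
			txn.Rollback() // none of the copied rows remain visible
			return err
		}
	}
	if _, err := stmt.Exec(); err != nil { // final Exec flushes the COPY buffer
		txn.Rollback()
		return err
	}
	if err := stmt.Close(); err != nil {
		txn.Rollback()
		return err
	}
	return txn.Commit()
}

func main() {
	db, err := sql.Open("postgres", "dbname=import sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	if err := copyRows(db, "import_table", []string{`{"a": 1}`, `{"b": 2}`}); err != nil {
		log.Fatal(err)
	}
}
```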

This is where one could optimize: https://github.com/lukasmartinelli/pgfutter/blob/master/json.go#L28. Not sure whether I will get around to this soon - but it is interesting, because optimizing is fun :)

For you, I recommend trying the import with multiple processes and checking whether that makes it faster.

lukasmartinelli · Jan 24 '17 12:01

When the copy stream fails I can not guarantee that no data is inserted

That would not be a huge issue for me, since I always make sure that the table is empty before the import.

This is where one could optimize

Unfortunately I'm not a Go programmer at all, so I'm completely unable to work directly on the code.

optimizing is fun :)

100% agreed! That's why I was asking...

For you I recommend to try import with multiple processes and check whether that makes it faster?

For this I would need to split up the import file myself. Since I don't control the program that creates the JSON file, I can only try to use another third-party tool to do the split, but since it is a 16 GB gzipped JSON file, that probably isn't easy either. Do you have a recommendation or a tool I could try to achieve this?
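In case it helps, here is a rough Go sketch (a hypothetical helper, not an existing tool) that streams the gzipped file and distributes complete JSON lines round-robin into N smaller gzipped parts, without ever holding the whole file in memory; the file names and part count are placeholders.

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"log"
	"os"
)

func main() {
	const parts = 4 // number of output chunks, e.g. one per spare core

	in, err := os.Open("big.json.gz") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	gz, err := gzip.NewReader(in)
	if err != nil {
		log.Fatal(err)
	}
	defer gz.Close()

	// One gzipped output writer per part.
	writers := make([]*gzip.Writer, parts)
	for i := range writers {
		f, err := os.Create(fmt.Sprintf("part-%d.json.gz", i))
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		writers[i] = gzip.NewWriter(f)
		defer writers[i].Close()
	}

	// Distribute complete lines round-robin so no JSON object is split.
	scanner := bufio.NewScanner(gz)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long JSON lines
	for i := 0; scanner.Scan(); i++ {
		if _, err := fmt.Fprintln(writers[i%parts], scanner.Text()); err != nil {
			log.Fatal(err)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```

Each resulting part could then be decompressed and fed to its own import run, so the imports happen in parallel.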

linuzer · Jan 24 '17 12:01