
Slow import / performance.

dessalines opened this issue · 2 comments

This import should only take a few seconds, but it takes minutes because:

  • It uses many separate INSERT statements instead of a single multi-row insert. Or it could use COPY.
  • It selects the community and user IDs by name for every statement, rather than getting them from a join.

SQL performance is all about minimizing the number of queries / statements you have to run.

dessalines · Apr 10 '22

It uses many separate insert statements, instead of one as it should. Or it could use COPY.

The problem is that we can't know the parent ID for each comment until after we insert the parent comment. So afaict all of the comments on a post cannot be inserted at once. This is also why we can't use COPY.

At best the comments for each post could be inserted in layers with comments at each depth inserted at once. That might speed up performance; I'd have to play around with it.
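For reference, a minimal sketch of that layered approach. This assumes a staging table (with the source comment ID, the source parent ID, and a precomputed depth) and a temporary source_id column on comment so that children can resolve their parent's new ID with a join — none of which exist in the current code:

```sql
-- Hypothetical sketch: insert comments one depth layer at a time,
-- so each layer becomes a single multi-row INSERT.
-- Assumes: staging(source_id, parent_source_id, post_id, content, depth)
-- and a temporary source_id column on comment for parent resolution.
DO $$
DECLARE
  d int;
BEGIN
  FOR d IN 0 .. (SELECT max(depth) FROM staging) LOOP
    INSERT INTO comment (post_id, parent_id, content, source_id)
    SELECT s.post_id, p.id, s.content, s.source_id
    FROM staging s
    LEFT JOIN comment p ON p.source_id = s.parent_source_id
    WHERE s.depth = d;   -- one statement per layer, not per comment
  END LOOP;
END $$;
```

This turns one statement per comment into one statement per tree depth, which is usually a much smaller number.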

It selects the community and user id from the name for each statement, rather than getting them from a join.

You mean getting both the comm ID and the user ID in a single statement with a join instead of as two separate selects? I mean okay, that saves one select statement, but it still accesses the same data, it's all in the same transaction anyway, and it conflates two logically separate lookups just to reduce the statement count. I'm skeptical that this makes more than a couple of seconds of difference across even thousands of posts, with a warmed-up cache.

If you mean why I'm selecting them for each post instead of just once in an outer block, that was sort of by design. One line in the input corresponds to one PL/pgSQL block in the output, and you can run any one of them to test without having to disentangle it from nested blocks.

The number one consideration here for me was to be as simple as possible so that people could easily audit the code and decide it was trustworthy, since I'm a rando essentially asking to run queries on someone else's database. And also this is a one-off batch import which runs in under a minute anyway on my dockerized dev instance, at least with the GenZhou data and with no other load. If performance is a real constraint for someone then maybe I can revisit this.

mesmere · Apr 18 '22

That does make sense for comments, since you have to build a tree, but all the posts could be a single insert statement:

INSERT INTO post (...) VALUES (post1data...), (post2data...), ...;
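Spelled out with placeholder columns (the actual column list would come from the importer; comm_id and user_id are the variables already selected elsewhere in the block):

```sql
-- One statement for all posts; each parenthesized tuple is one row.
-- Column names here are illustrative, not the importer's actual list.
INSERT INTO post (name, body, community_id, creator_id)
VALUES
  ('post 1 title', 'post 1 body', comm_id, user_id),
  ('post 2 title', 'post 2 body', comm_id, user_id);
```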

You mean getting both the comm ID and the user ID in a single statement with a join instead of as two separate selects?

No, you are selecting the same data thousands of times; every single insert starts with this, which isn't necessary.

SELECT id INTO STRICT comm_id FROM community WHERE name = 'genzhouarchive';
SELECT id INTO STRICT user_id FROM person WHERE name = 'archive_bot';

You only need those values once for the entire program.
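I.e., something like hoisting the two lookups into a single outer block, so every generated insert reuses the same variables (a sketch of the shape, not the importer's actual output):

```sql
DO $$
DECLARE
  comm_id int;
  user_id int;
BEGIN
  -- Look these up once for the entire import...
  SELECT id INTO STRICT comm_id FROM community WHERE name = 'genzhouarchive';
  SELECT id INTO STRICT user_id FROM person WHERE name = 'archive_bot';

  -- ...then every generated INSERT references comm_id / user_id here.
END $$;
```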

dessalines · Apr 18 '22