dbdpg
DBD::Pg Array and string optimization patch
Replace string operations with an efficient string concat library.
Replace all linked lists with arrays.
Use arrays to remove quadratic-complexity algorithms.
Fix CPAN RT #105492
Hi, so this is a pretty huge rewrite of the guts of DBD::Pg. When using large numbers of placeholders (10,000 - 50,000) the performance boost is dramatic.
I'd like to get some reviews of this if anyone has the time available. Given the size of the patch it's quite possible I've goofed somewhere.
If you have any questions or concerns please let me know.
Hi Matt,
Thanks for the contribution. I note the following test errors introduced by your patch:
t/03smethod.t ....... 1/122
# Failed test 'Statement handle attribute pg_numbound returns 0 if no placeholders bound yet'
# at t/03smethod.t line 622.
# got: '2'
# expected: '0'
# Failed test 'Statement handle attribute pg_numbound returns 1 if one placeholder bound'
# at t/03smethod.t line 626.
# got: '2'
# expected: '1'
# Failed test 'Statement handle attribute pg_numbound returns 1 if one placeholders bound as NULL'
# at t/03smethod.t line 635.
# got: '2'
# expected: '1'
t/03smethod.t ....... 122/122 # Looks like you failed 3 tests of 122.
These don't exist in master, so they're definitely introduced by your patch; hopefully you'll have some quick insights there. Can you also add a test file for the high number of placeholders use case you mention?
Have you done any timing tests for the more common use case of a smaller number of placeholders?
Also, was this original code you wrote for the efficient concat library?
Best,
David
Whoa, sorry about those test errors. I don't see them myself. What platform or distro are you using? I'm on Fedora 24, and t/03smethod.t returns no errors for me, which is strange.
I haven't done any timing tests for small placeholder lists. I'll do that and report back. I can also add tests for the larger lists in the next couple of days.
Yes, I wrote the string concat library myself, from scratch.
Hi David,
Sorry about that, an installed version of DBD::Pg was interfering with the tests. I've pushed a new commit that fixes the issue.
I'm also getting a test error in t/30unicode.t, but it's there without my patch too: DBD::Pg::st execute failed: ERROR: invalid byte sequence for encoding "UTF8": 0xc9 0x6d at t/30unicode.t line 117.
I'll add the large placeholder test and the benchmarks in the next couple of days.
Thanks for the quick response!
Regards, Matt.
Hi Matt,
Thanks for the updates; the t/30unicode.t failure is known at this point and is actually a broken test.
I'll review the code; I assume the issue was a naïve O(n^2) algorithm for multiple concats which wasn't revealed in the typical use case with a more pedestrian number of bind vars. :-)
Thanks,
David
Hi David,
I wrote this patch over a year ago, so my memory is a bit fuzzy.
The first issue addressed by replacing linked lists with arrays is this loop at removed line 2754, which walks the whole placeholder list just to find placeholder number phnum:
for (x=1,currph=imp_sth->ph; NULL != currph; currph = currph->nextph,x++) {
if (x==phnum){
break;
}
}
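For contrast, here is a minimal sketch of the array-based lookup the patch moves to; the names (ph_array, numphs) are illustrative, not necessarily the patch's actual identifiers. With the placeholders in a contiguous array, finding placeholder phnum is a single O(1) index instead of an O(n) walk:

ph_t *currph = NULL;
if (phnum >= 1 && phnum <= imp_sth->numphs) {
    /* placeholder numbers are 1-based; the array is 0-based */
    currph = &imp_sth->ph_array[phnum - 1];
}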
The second is this section at removed line 2223, which scans every segment for each placeholder it rebinds, so the work grows with placeholders times segments. Looking at it now, it seems I've removed this and not replaced it properly; I'll have to check this further.
/* Let the correct segment(s) point to it */
for (currseg=imp_sth->seg; NULL != currseg; currseg=currseg->nextseg) {
if (currseg->placeholder==xint) {
currseg->ph = newph;
}
}
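As a rough illustration of why an array helps here (this is hypothetical code, not what the patch actually does): if each segment stored only the placeholder number and the placeholders lived in an array indexed by that number, the re-pointing pass above would not be needed at all, since the lookup can happen at use time.

/* Hypothetical sketch: resolve a segment's placeholder by direct
 * index instead of keeping a pointer in sync across rebinds. */
static ph_t *seg_placeholder(imp_sth_t *imp_sth, seg_t *currseg)
{
    /* a placeholder number of 0 means this segment has none */
    if (0 == currseg->placeholder)
        return NULL;
    return &imp_sth->ph_array[currseg->placeholder - 1];
}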
The final performance issue happens when building the query string to dispatch to Postgres. Repeated calls to strchr() and strcat() were causing a serious slowdown, since strcat() has to rescan the whole destination string on every call. It's not really noticeable with a dozen placeholders, but with 20,000 arguments the query string can be a megabyte or more in size, which causes a noticeable performance hit.
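To make the strcat() problem concrete, here is a minimal sketch of the kind of length-tracking append buffer such a concat library can be built around; the names and layout are illustrative, not the patch's actual API. Because the buffer remembers where its string ends, each append costs only the appended bytes, whereas strcat() rescans the whole destination first, making n appends O(n^2) overall:

#include <stdlib.h>
#include <string.h>

typedef struct {
    char   *buf;
    size_t  len;   /* bytes in use, excluding the trailing NUL */
    size_t  cap;   /* bytes allocated */
} strbuf;

/* Append n bytes of s; grows the buffer geometrically.
 * Error handling omitted for brevity. */
static void strbuf_append(strbuf *sb, const char *s, size_t n)
{
    if (sb->len + n + 1 > sb->cap) {
        size_t newcap = sb->cap ? sb->cap : 64;
        while (sb->len + n + 1 > newcap)
            newcap *= 2;
        sb->buf = realloc(sb->buf, newcap);
        sb->cap = newcap;
    }
    memcpy(sb->buf + sb->len, s, n);  /* write at the remembered end */
    sb->len += n;
    sb->buf[sb->len] = '\0';
}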
I found all of this when porting bugzilla.redhat.com from MySQL 5.0 to Postgres 9.2. With the amount of data we have in Bugzilla, it can generate some pretty insane queries.
> Have you done any timing tests for the more common use case of a smaller number of placeholders?
results.txt - a CSV file with benchmarks for placeholder counts from 5 to 100 in increments of 5, and from 1,000 to 65,000 in increments of 1,000.
reproducer.txt - the Perl script I used to generate those results.
You can run the reproducer in valgrind, and then look at the output in kcachegrind which will show you where the CPU spends all its time.
Note that you will probably need the perl, dbd and dbd-pg debug packages installed for this to give useful output.
$ valgrind --tool=callgrind perl reproducer.pl
$ kcachegrind callgrind.out.1234
Hi @mtyson01,
Has this patch been updated since the last correspondence in CPAN RT#105492? I'm just wondering if you consider this a little more bulletproof than you seemed in that ticket.
I'm going to review the code in a little more depth now that it is passing tests.
Best,
David
Hi David,
> Has this patch been updated since the last correspondence in CPAN RT#105492?
Yes, the patch in the CPAN RT only replaces the placeholder linked list (ph_t). This patch on GitHub has been expanded to replace the segment (seg_t) linked list as well.
> I'm just wondering if you consider this a little more bulletproof than you seemed in that ticket.
I'm fairly confident that the patch is good for my use case. The patch in the CPAN RT was originally written against 2.19.3 (what we are running in production now) and then ported to 3.5.
The 2.19.3 version of the patch has been running in production for a year with no problems, so for my use case I'm pretty confident it's bulletproof. (That version does not include the seg_t array.)
I've noticed that DBD::Pg makes an effort to support Postgres-isms such as named and positional placeholders. We don't actually use these, so those codepaths have not had much testing. I did some quick tests to check that they still worked, and it seemed OK, but they haven't received anywhere near the usage that the DBI ? style placeholders have.
I'm not in a rush to get this merged; I can maintain my fork for as long as needed. If you like, we can wait until I push this version of the patch into production (though that could be a long time).
I tried implementing this recently, but ran into a number of compilation issues with the new files. Will try again someday, but anyone else is welcome to take a stab at it.
Anyone want to take another pass at this? I'm still having issues, but maybe you two (@mtyson01 @machack666) will have better luck?