PDL::Core::dog doesn't scale
To investigate this, timing perl -MPDL -e 'zeroes(X)->dog' is being discussed, prompted by an observation from @djerius.
With X=1e4,6 it takes no appreciable time; with X=6,1e4 it takes around 5s. The current implementation of dog does a Perl map, slicing over the highest dimension. This reveals two problems:
- the current machinery of starting up an ndarray operation seems to be very expensive (as noted long ago by @run4flat) and needs profiling to seek improvements
- it's not trivial to make dog a PP operation, because we still don't have a notation for a Par to be expressed in a way that would help, nor do we have a way to return many ndarrays
A partial solution to the second point would be to pass in an OtherPar as the input (with no Pars at all), and an OtherPar [o] pdl *retval[] where retval_count gets set automatically as with any OtherPar that's a [], but used for output. The typemap would turn that into an AVREF populated with new ndarray SVs. The input version of that would allow a PP cat operation as well.
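As a sketch, the proposal might look something like this as a pp_def signature. This is purely hypothetical: the [o] pdl *retval[] OtherPars notation does not exist in PDL::PP today, and the names here are illustrative only.

```perl
# HYPOTHETICAL sketch of the proposed notation -- not working PP code
pp_def('dog',
    Pars      => '',    # no Pars at all, per the proposal
    OtherPars => 'pdl *in; [o] pdl *retval[]',
    # retval_count would be set automatically, as with any [] OtherPar,
    # and the typemap would turn retval into an AVREF populated with
    # new ndarray SVs
    Code      => '/* slice $COMP(in) along its highest dim into retval */',
);
```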
For comparison, within perldl, with_time { my @x = 1..1e7; $_++ for @x } takes about 750ms, and without the ++ operation, it's about 400ms just for the allocation.
Here's a plot of user time vs N, where the code run is zeroes(6,N)->dog.
This was run on an AMD Epyc 7763 w/ 128 cores & 638GB, using 100 parallel processes. Noise might be due to clocking of cores at different speeds (3244MHz vs. 2445MHz).
Code to generate the data:
{ for t in $(seq 1000 1000 1000000); do echo /bin/time --format "'%E %U %S'" perl -MPDL -e "'print $t, q{ };zeroes(6,$t)->dog'" ; done; } | parallel --jobs 100 --joblog joblog |& tee timelog
Plotting code is here: plot-times.txt

I've attached a zip file with NYTProf output for N= 1_000, 10_000, and 100_000
My analysis of the 1e5 results: for 100k calls, _slice_int took 126s, and DESTROY took 349s!
This is really cool - how did you make it?
For future reference, it's been added to the plot comment above.
Thank you. That will be at least twice as valuable if you also include the rest, i.e. specifically how you turned those times into the plot shown.
Check out the plot_times.txt file attached to the comment.
Thanks! This is the code, to avoid having to download things:
#! /usr/bin/env perl
use PDL;
use Path::Tiny;
use PDL::Graphics::PGPLOT;

my ( @n, @e, @u, @s );
for ( path('timelog')->lines ) {
    chomp;
    my ( $n, $e, $u, $s ) = split /\h+/;
    # convert each /bin/time field ([H:]M:S.ss) to hours
    for my $time ( $e, $u, $s ) {
        my @c = reverse split /:/, $time;
        $time = ( pdl(@c) * 60**sequence( 0 + @c ) )->dsumover / 3600;
    }
    push @n, $n;
    push @e, $e;
    push @u, $u;
    push @s, $s;
}
my $n = PDL->new(@n);
my $e = PDL->new(@e);
my $u = PDL->new(@u);
my $s = PDL->new(@s);

# points( $n, $e, { symbol => 2, symbolsize => 3 } );
# label_axes( 'N', 'Elapsed Time [hours]' );
points( $n, $u, { symbol => 2, symbolsize => 3 } );
label_axes( 'N', 'User Time [hours]' );
# points( $n, $s, { symbol => 2, symbolsize => 3 } );
# label_axes( 'N', 'System Time [hours]' );
Updated the plot in the above comment, just for kicks.
Final update of the plot. Now the relationship looks really odd at higher N.
Maybe there was process contention on the machine after all? I'll run it again with fewer processes and see if it changes.
Multiple clear-ish lines are very confusing!
Rerunning as:
seq 1000 1000 1000000 \
| parallel --shuf -t --jobs 25 --joblog joblog \
taskset --cpu-list '{%}' \
/bin/time --format "'%E %U %S'" \
perl -MPDL -e "'print STDERR {}, q{ }; zeroes(6,{})->dog'" '2>&1' \
| tee timelog
With results (as plotted via the code in the comment above):

@mohawk2 new results finished; see plot in comment w/ revised code. Much cleaner.
Thank you. It does look like it might be O(n^2). Feel like making a log graph to see? :-)
To be comparable, with the current code:
pdl> with_time { @x = map sequence(10), 1..1e5 }
1957.02 ms
pdl> with_time { pdl(\@x) }
283.285 ms
pdl> with_time { cat(@x) }
1073.88 ms
Using cachegrind as noted in #451, 62% of the time for zeroes(6,1e4)->dog was taken in pdl__removetrans_children (called in DESTROY). Simply adding a count of how many children each ndarray has, so the linked-list walking can terminate early, brought the typical time for that from about 800ms down to about 450ms.