PDL::Core::dog doesn't scale
To investigate this, timing perl -MPDL -e 'zeroes(X)->dog' is being discussed, prompted by an observation from @djerius.
With X=1e4,6 it takes no appreciable time; with X=6,1e4 it takes around 5s. The current implementation of dog does a Perl map, slicing over the highest dimension. This reveals two problems:
- the current machinery of starting up an ndarray operation seems to be very expensive (as noted long ago by @run4flat) and needs profiling to seek improvements
- it's not trivial to make dog a PP operation, because we still don't have a notation for a Par to be expressed in a way that would help, nor do we have a way to return many ndarrays
A partial solution to the second point would be to pass in an OtherPar as the input (with no Pars at all), and an OtherPar [o] pdl *retval[] where retval_count gets set automatically as with any OtherPar that's a [], but used for output. The typemap would turn that into an AVREF populated with new ndarray SVs. The input version of that would allow a PP cat operation as well.
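As a sketch, the proposal might look something like this as a pp_def signature. This is purely hypothetical: the [o] pdl *retval[] OtherPars notation does not exist in PDL::PP today, and the names here are illustrative only.

```perl
# HYPOTHETICAL sketch of the proposed notation -- not working PP code
pp_def('dog',
    Pars      => '',    # no Pars at all, per the proposal
    OtherPars => 'pdl *in; [o] pdl *retval[]',
    # retval_count would be set automatically, as with any [] OtherPar,
    # and the typemap would turn retval into an AVREF populated with
    # new ndarray SVs
    Code      => '/* slice $COMP(in) along its highest dim into retval */',
);
```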
For comparison, within perldl, with_time { my @x = 1..1e7; $_++ for @x } takes about 750ms, and without the ++ operation, it's about 400ms just for the allocation.
Here's a plot of user time vs N, where the code run is zeroes(6,N)->dog.
This was run on an AMD Epyc 7763 w/ 128 cores & 638GB, using 100 parallel processes. Noise might be due to clocking of cores at different speeds (3244MHz vs. 2445MHz).
Code to generate the data:
{ for t in $(seq 1000 1000 1000000); do echo /bin/time --format "'%E %U %S'" perl -MPDL -e "'print $t, q{ };zeroes(6,$t)->dog'" ; done; } | parallel --jobs 100 --joblog joblog |& tee timelog
Plotting code is here: plot-times.txt

I've attached a zip file with NYTProf output for N= 1_000, 10_000, and 100_000
My analysis of the 1e5 results: for 100k calls, _slice_int took 126s, and DESTROY took 349s!
This is really cool - how did you make it?
For future reference, it's been added to the plot comment above.
Thank you. That will be at least twice as valuable if you also include the rest, i.e. specifically how you turned those times into the plot shown.
Check out the plot_times.txt file attached to the comment.
Thanks! This is the code, to avoid having to download things:
#! /usr/bin/env perl
use PDL;
use Path::Tiny;
use PDL::Graphics::PGPLOT;

my ( @n, @e, @u, @s );
for ( path('timelog')->lines ) {
    chomp;
    my ( $n, $e, $u, $s ) = split /\h+/;
    # convert each /bin/time field ([H:]M:S.ss) to hours
    for my $time ( $e, $u, $s ) {
        my @c = reverse split /:/, $time;
        $time = ( pdl(@c) * 60**sequence( 0 + @c ) )->dsumover / 3600;
    }
    push @n, $n;
    push @e, $e;
    push @u, $u;
    push @s, $s;
}
my $n = PDL->new(@n);
my $e = PDL->new(@e);
my $u = PDL->new(@u);
my $s = PDL->new(@s);

# points( $n, $e, { symbol => 2, symbolsize => 3 } );
# label_axes( 'N', 'Elapsed Time [hours]' );
points( $n, $u, { symbol => 2, symbolsize => 3 } );
label_axes( 'N', 'User Time [hours]' );
# points( $n, $s, { symbol => 2, symbolsize => 3 } );
# label_axes( 'N', 'System Time [hours]' );
Updated the plot in the above comment, just for kicks.
Final update of the plot. Now the relationship looks really odd at higher N.
Maybe there was process contention on the machine after all? I'll run it again with fewer processes and see if it changes.
Multiple clear-ish lines are very confusing!
Rerunning as:
seq 1000 1000 1000000 \
| parallel --shuf -t --jobs 25 --joblog joblog \
taskset --cpu-list '{%}' \
/bin/time --format "'%E %U %S'" \
perl -MPDL -e "'print STDERR {}, q{ }; zeroes(6,{})->dog'" '2>&1' \
| tee timelog
With results (as plotted via the code in the comment above):

@mohawk2 new results finished; see plot in comment w/ revised code. Much cleaner.
Thank you. It does look like it might be O(n^2). Feel like making a log graph to see? :-)
To be comparable, with the current code:
pdl> with_time { @x = map sequence(10), 1..1e5 }
1957.02 ms
pdl> with_time { pdl(\@x) }
283.285 ms
pdl> with_time { cat(@x) }
1073.88 ms
Using cachegrind as noted in #451, 62% of the time for zeroes(6,1e4)->dog was taken in pdl__removetrans_children (called in DESTROY). Simply adding a count of how many children each ndarray has, so the linked-list walking can terminate early, brought the typical time for that from about 800ms down to about 450ms.