BUG: segfault on large host arrays with timeit
This segfaults for me locally with PyKokkos at a 1D (host) array_size_1d of 1e6 elements, but not with CuPy or NumPy, and not with 1e4 elements with PyKokkos. Clearly, PyKokkos should be able to handle this--it probably makes sense to cut the reproducer/regression test down and investigate (latest develop branch):
"""
Record sqrt() ufunc performance.
"""
import os
import pykokkos as pk
import numpy as np
import cupy as cp
import pandas as pd
from tqdm import tqdm
if __name__ == "__main__":
    scenario_name = "pk_gp160_sqrt"
    current_sqrt_fn = "pk.sqrt"
    space = pk.ExecutionSpace.OpenMP
    pk.set_default_space(space)
    num_global_repeats = 50
    num_repeats = 5000
    array_size_1d = int(1e6)

    import timeit
    rng = np.random.default_rng(18898787)
    arr = rng.random(array_size_1d).astype(float)
    view = pk.from_numpy(arr)
    arr = view
    #arr = cp.array(arr)

    num_threads = os.environ.get("OMP_NUM_THREADS")
    if num_threads is None:
        raise ValueError("must set OMP_NUM_THREADS for benchmarks!")

    df = pd.DataFrame(np.full(shape=(num_global_repeats, 2), fill_value=np.nan),
                      columns=["scenario", "time (s)"])
    df["scenario"] = df["scenario"].astype(str)
    print("df before trials:\n", df)

    counter = 0
    for global_repeat in tqdm(range(1, num_global_repeats + 1)):
        sqrt_time_sec = timeit.timeit(f"{current_sqrt_fn}(arr)",
                                      globals=globals(),
                                      number=num_repeats)
        df.iloc[counter, 0] = f"{scenario_name}"
        df.iloc[counter, 1] = sqrt_time_sec
        counter += 1

    print("df after trials:\n", df)
    filename = f"{scenario_name}_array_size_1d_{array_size_1d}_{num_global_repeats}_trials_{num_repeats}_calls_per_trial.parquet.gzip"
    df.to_parquet(filename,
                  engine="pyarrow",
                  compression="gzip")
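The script raises unless OMP_NUM_THREADS is set, so it needs to be invoked along the lines of the following (repro.py being whatever name the reproducer is saved under):

OMP_NUM_THREADS=4 python repro.py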
Not specific to usage in timeit, and the segfault happens during execution of the compiled workunit/kernel.
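For reference, a timeit-free variant that I'd expect to hit the same path (a minimal sketch reusing the setup from the reproducer above, not a separately verified test):

import numpy as np
import pykokkos as pk

pk.set_default_space(pk.ExecutionSpace.OpenMP)

array_size_1d = int(1e6)
rng = np.random.default_rng(18898787)
arr = rng.random(array_size_1d).astype(float)
view = pk.from_numpy(arr)
arr = view

# plain loop instead of timeit; the crash happens inside the compiled kernel
for _ in range(5000):
    pk.sqrt(arr)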
Ah yeah, at 1e5 1D array size (of doubles) PyKokkos is using 96% of 128 GB of RAM... something is wrong here.
[plot: PyKokkos with memory_profiler and timeit per above, 1e4 float64 elements]
Identical workflow with NumPy doesn't show the leak:
[plot: NumPy memory profile, same workflow]
Providing a smaller reproducer:
import pykokkos as pk
import numpy as np

@profile
def test_gh_184():
    arr = np.zeros(10000)
    view = pk.from_numpy(arr)
    for i in range(50000):
        pk.sqrt(view)

if __name__ == "__main__":
    test_gh_184()
And profiling the memory usage with memory_profiler:
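(The @profile decorator is provided by memory_profiler; the report below, for test.py, comes from running something along the lines of:)

python -m memory_profiler test.py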
Filename: test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
5 191.637 MiB 191.637 MiB 1 @profile
6 def test_gh_184():
7 191.637 MiB 0.000 MiB 1 arr = np.zeros(10000)
8 191.637 MiB 0.000 MiB 1 view = pk.from_numpy(arr)
9 4062.195 MiB 0.516 MiB 50001 for i in range(50000):
10 4062.195 MiB 3870.043 MiB 50000 pk.sqrt(view)
I hypothesized that PyKokkos is not properly freeing the new "out" views/arrays used in ufuncs, which at first glance the profiler seems to agree with:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
220 4061.914 MiB 191.746 MiB 50000 @profile
221 def sqrt(view):
222 """
223 Return the non-negative square root of the argument, element-wise.
224
225 Parameters
226 ----------
227 view : pykokkos view
228 Input view.
229
230 Returns
231 -------
232 y : pykokkos view
233 Output view.
234
235 Notes
236 -----
237 .. note::
238 This function should exhibit the same branch cut behavior
239 as the equivalent NumPy ufunc.
240 """
241 # TODO: support complex types when they
242 # are available in pykokkos?
243 4061.914 MiB 0.000 MiB 50000 if len(view.shape) > 2:
244 raise NotImplementedError("only up to 2D views currently supported for sqrt() ufunc.")
245 4062.172 MiB 3830.570 MiB 50000 out = pk.View(view.shape, view.dtype)
246 4062.172 MiB 0.516 MiB 50000 if "double" in str(view.dtype) or "float64" in str(view.dtype):
247 4062.172 MiB 0.000 MiB 50000 if view.shape == ():
248 pk.parallel_for(1, sqrt_impl_1d_double, view=view, out=out)
249 4062.172 MiB 0.000 MiB 50000 elif len(view.shape) == 1:
250 4062.172 MiB 38.824 MiB 50000 pk.parallel_for(view.shape[0], sqrt_impl_1d_double, view=view, out=out)
251 elif len(view.shape) == 2:
252 pk.parallel_for(view.shape[0], sqrt_impl_2d_double, view=view, out=out)
253 elif "float" in str(view.dtype):
254 if view.shape == ():
255 pk.parallel_for(1, sqrt_impl_1d_float, view=view, out=out)
256 elif len(view.shape) == 1:
257 pk.parallel_for(view.shape[0], sqrt_impl_1d_float, view=view, out=out)
258 elif len(view.shape) == 2:
259 pk.parallel_for(view.shape[0], sqrt_impl_2d_float, view=view, out=out)
260 4062.172 MiB 0.516 MiB 50000 return out
Simplifying even more, I think this can be summarized as:
PyKokkos is not freeing the memory associated with a view that is passed into a workunit/kernel (at least, not fast enough), and the memory copy behavior appears to be sensitive to the scope/destruction/replacement of the view that is fed in.
For example:
import pykokkos as pk
import numpy as np
from tqdm import tqdm

@pk.workunit
def nothing(tid: int, view: pk.View1D[pk.double]):
    view[tid] = view[tid]

def test_gh_184():
    for i in tqdm(range(500000)):
        v = pk.View((10_000,), pk.float64)
        pk.parallel_for(1, nothing, view=v)

if __name__ == "__main__":
    test_gh_184()
vs. hoisting the view to reduce the memory punishment (this also runs orders of magnitude slower...):
import pykokkos as pk
import numpy as np
from tqdm import tqdm

@pk.workunit
def nothing(tid: int, view: pk.View1D[pk.double]):
    view[tid] = view[tid]

def test_gh_184():
    v = pk.View((10_000,), pk.float64)
    for i in tqdm(range(500000)):
        pk.parallel_for(1, nothing, view=v)

if __name__ == "__main__":
    test_gh_184()
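To put a rough number on the difference between the two variants, a peak-RSS readout can be appended to either script (a sketch using the standard-library resource module; ru_maxrss is reported in kilobytes on Linux):

import resource

# peak resident set size of the process so far (kilobytes on Linux, bytes on macOS)
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024:.1f} MiB")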
Did some testing ... to me this looks like a classic Python garbage collector problem.
The gc is not obliged to instantly delete an object when it goes out of scope (the deletion would be costly, so why do it now). But it is also not allowed to reuse the allocated memory of an object on the death list as long as that object has not been deleted yet. Furthermore, as far as I know it is not even obliged to reuse already-allocated memory at all (for externally defined types this might even be dangerous or lead to memory corruption).
Thus my assessment is: the gc sees the view v that gets created in every iteration but does not delete it ... because why should it do that now. As a result, every iteration allocates new space for a view, since that guarantees no memory corruption. That is why you see the time needed to allocate a new v in every iteration in the triangular plot.
If the view is created before the loop, the gc does even less ... it will only delete the single existing v at exit of the function that does the iteration.
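If that is what is happening, explicitly dropping the view and forcing a collection inside the loop should flatten the memory curve. A quick way to check (a sketch reusing the nothing workunit from above, not something I have run here):

import gc
import pykokkos as pk
from tqdm import tqdm

@pk.workunit
def nothing(tid: int, view: pk.View1D[pk.double]):
    view[tid] = view[tid]

def test_gh_184():
    for i in tqdm(range(500000)):
        v = pk.View((10_000,), pk.float64)
        pk.parallel_for(1, nothing, view=v)
        del v          # drop the only Python reference to the view
        gc.collect()   # if the collector alone were holding the memory back,
                       # usage should now stay flat across iterations

if __name__ == "__main__":
    test_gh_184()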
I'm confused--do you see the same leak with NumPy? I think the gc is usually pretty fast, so I still think PyKokkos is the problem here.
I still maintain that something is wrong here, and regardless of the details, we should be competitive with NumPy and other array API providers for usage with tools like timeit, without saturating memory.
Consider, for example, that if we don't send the view to the workunit, the gc does exactly the right thing:
import pykokkos as pk
import numpy as np
from tqdm import tqdm

@pk.workunit
def nothing(tid: int, view: pk.View1D[pk.double]):
    view[tid] = view[tid]

def test_gh_184():
    for i in tqdm(range(500000)):
        v = pk.View((10_000,), pk.float64)
        #pk.parallel_for(1, nothing, view=v)

if __name__ == "__main__":
    test_gh_184()
And if you add the workunit back, suddenly the garbage collector is lazy?
import pykokkos as pk
import numpy as np
from tqdm import tqdm

@pk.workunit
def nothing(tid: int, view: pk.View1D[pk.double]):
    view[tid] = view[tid]

def test_gh_184():
    for i in tqdm(range(500000)):
        v = pk.View((10_000,), pk.float64)
        pk.parallel_for(1, nothing, view=v)

if __name__ == "__main__":
    test_gh_184()
And if you switch the loop to use a NumPy ufunc, suddenly the gc is working again?
import pykokkos as pk
import numpy as np
from tqdm import tqdm

def test_gh_184():
    for i in tqdm(range(500000)):
        v = np.ones(10_0000, dtype=np.float64)
        np.sqrt(v)

if __name__ == "__main__":
    test_gh_184()
I've been looking into it, and I think the cause is the kernel argument caching in parallel_dispatch.py, here: https://github.com/kokkos/pykokkos/blob/a4108286fcc0fac28300670dbd35a494b00c4097/pykokkos/interface/parallel_dispatch.py#L220
I'm still figuring out the details, but @tylerjereddy could you try commenting out that line and see if it fixes your problem?
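For anyone following along, the suspected failure mode is roughly the following (a sketch with hypothetical names, not the actual pykokkos code): a module-level cache of kernel arguments keeps a strong reference to every view ever passed in, so those views can never be collected.

import weakref

_kernel_arg_cache: dict = {}   # hypothetical module-level cache

def dispatch(kernel_name: str, **kwargs) -> None:
    # Caching the arguments like this keeps a strong reference to every view
    # that was ever passed to a kernel, so their refcounts never reach zero
    # and the underlying buffers are never released.
    _kernel_arg_cache[kernel_name] = kwargs

def dispatch_without_pinning(kernel_name: str, **kwargs) -> None:
    # One conventional alternative (assuming the cached objects support weak
    # references): keep only weak references, so caching does not by itself
    # extend the views' lifetimes.
    _kernel_arg_cache[kernel_name] = {k: weakref.ref(v) for k, v in kwargs.items()}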
@NaderAlAwar unfortunately, I'm seeing identical memory saturation with that change on its own locally.
Ah wait, that's for reduction instead of parallel_for, as @JBludau points out--let me try.
@NaderAlAwar ok, if I apply a similar patch for parallel_for then the memory saturation does go away, as below. I'll check if the PyKokkos test suite can still pass with this--if it can, I could benchmark like that for now...
@NaderAlAwar Well, the original timeit reproducer at the top still segfaults for me with large arrays (1e6). That doesn't appear to be memory saturation because it happens so fast--maybe an illegal/unreasonable request or something, dunno... maybe I'll just use older benchmarks for now.
um … It also segfaults on my machine for sizes above 1e8. Running a debug version I get mixed results in the debugger: for different numbers of threads it segfaults at different indices in the view. All of them are valid indices though…
ok ... so I tried with valgrind, and it looks to me like the memory we are using from numpy via:
arr = rng.random(array_size_1d).astype(float)
view = pk.from_numpy(arr)
arr = view
gets freed while we are still doing our work e.g.:
==26850== Invalid read of size 8
==26850== at 0xF0AF5F74: pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::operator()(pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double const&, int) const (functor.hpp:42)
==26850== by 0xF0AF0CF2: std::enable_if<(!std::integral_constant<bool, false>::value)&&std::is_same<pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double, pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double>::value, void>::type Kokkos::Impl::ParallelFor<pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>, Kokkos::RangePolicy<Kokkos::OpenMP, pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double>, Kokkos::OpenMP>::exec_work<pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double>(pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP> const&, unsigned long) (Kokkos_OpenMP_Parallel.hpp:106)
==26850== by 0xF0AFF5E3: std::enable_if<!std::is_same<Kokkos::RangePolicy<Kokkos::OpenMP, pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double>::schedule_type::type, Kokkos::Dynamic>::value, void>::type Kokkos::Impl::ParallelFor<pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>, Kokkos::RangePolicy<Kokkos::OpenMP, pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double>, Kokkos::OpenMP>::execute_parallel<Kokkos::RangePolicy<Kokkos::OpenMP, pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double> >() const [clone ._omp_fn.0] (Kokkos_OpenMP_Parallel.hpp:131)
==26850== by 0x12CFFB9D: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==26850== by 0x4A29B42: start_thread (pthread_create.c:442)
==26850== by 0x4ABABB3: clone (clone.S:100)
==26850== Address 0xf044e040 is 2,368 bytes inside a block of size 8,224 free'd
==26850== at 0x484B27F: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==26850== by 0x2E9119: _PyArena_Free (in /usr/bin/python3.10)
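In other words, the suspicion is that after arr = view the original numpy array has no remaining Python reference, so CPython frees its buffer while the kernel is still reading it. A small diagnostic sketch (it assumes pk.from_numpy does not keep its own reference to the array, which is exactly the open question):

import sys
import numpy as np
import pykokkos as pk

np_arr = np.zeros(8)
view = pk.from_numpy(np_arr)

# getrefcount reports one extra reference for its own argument; a value of 2
# here would mean only the local name np_arr is keeping the array alive,
# i.e. the view holds no reference of its own
print(sys.getrefcount(np_arr))

np_arr = view  # rebinding drops that last reference -> the buffer may be freed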
If I use a view allocated by pykokkos as arr for the test, it passes without segfaults:
arr = pk.View([array_size_1d], pk.double)
Furthermore, if I get rid of the self-assignment on arr (to prevent the gc from deleting the numpy storage), it also works without segfaulting:
np_array = rng.random(array_size_1d) #.astype(float)
view = pk.from_numpy(np_array)
arr = view
I guess we have to look at from_numpy again ... and tell the gc to keep the reference count up for the associated memory we take from numpy. I guess the easy workaround until then is to not do self-assignments on arrays that we grab from numpy.
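The usual pattern for that kind of fix (a sketch with hypothetical names; the real change would live inside from_numpy and the view class) is for the wrapper to stash a reference to the source array, tying the numpy buffer's lifetime to the view's:

import numpy as np

class WrappedView:
    """Toy stand-in for a pykokkos view built on top of a numpy array."""

    def __init__(self, np_array: np.ndarray):
        # Holding a reference here means the numpy buffer cannot be freed
        # while this view is alive, even if the caller rebinds or deletes
        # their own name for the array (e.g. arr = view).
        self._source = np_array
        self.data = np_array  # the real view would wrap the same buffer

def from_numpy_like(np_array: np.ndarray) -> WrappedView:
    # hypothetical from_numpy-style constructor illustrating the idea
    return WrappedView(np_array)

arr = np.zeros(8)
view = from_numpy_like(arr)
arr = view  # the self-assignment is now safe: the numpy buffer stays
            # reachable through view._source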