BUG: segfault on large host arrays with timeit
This segfaults for me locally with PyKokkos at a 1D (host) array_size_1d of 1e6 elements, but not with CuPy or NumPy, and not with 1e4 elements with PyKokkos. Clearly, PyKokkos should be able to handle this--it probably makes sense to cut the reproducer/regression test down and investigate (latest develop branch):
"""
Record sqrt() ufunc performance.
"""
import os
import pykokkos as pk
import numpy as np
import cupy as cp
import pandas as pd
from tqdm import tqdm
if __name__ == "__main__":
    scenario_name = "pk_gp160_sqrt"
    current_sqrt_fn = "pk.sqrt"
    space = pk.ExecutionSpace.OpenMP
    pk.set_default_space(space)
    num_global_repeats = 50
    num_repeats = 5000
    array_size_1d = int(1e6)

    import timeit
    rng = np.random.default_rng(18898787)
    arr = rng.random(array_size_1d).astype(float)
    view = pk.from_numpy(arr)
    arr = view
    #arr = cp.array(arr)

    num_threads = os.environ.get("OMP_NUM_THREADS")
    if num_threads is None:
        raise ValueError("must set OMP_NUM_THREADS for benchmarks!")

    df = pd.DataFrame(np.full(shape=(num_global_repeats, 2), fill_value=np.nan),
                      columns=["scenario", "time (s)"])
    df["scenario"] = df["scenario"].astype(str)
    print("df before trials:\n", df)

    counter = 0
    for global_repeat in tqdm(range(1, num_global_repeats + 1)):
        sqrt_time_sec = timeit.timeit(f"{current_sqrt_fn}(arr)",
                                      globals=globals(),
                                      number=num_repeats)
        df.iloc[counter, 0] = f"{scenario_name}"
        df.iloc[counter, 1] = sqrt_time_sec
        counter += 1

    print("df after trials:\n", df)
    filename = f"{scenario_name}_array_size_1d_{array_size_1d}_{num_global_repeats}_trials_{num_repeats}_calls_per_trial.parquet.gzip"
    df.to_parquet(filename,
                  engine="pyarrow",
                  compression="gzip")
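The script raises unless OMP_NUM_THREADS is set, so it needs to be invoked along the lines of the following (repro.py being whatever name the reproducer is saved under):

OMP_NUM_THREADS=4 python repro.py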
Not specific to usage in timeit, and the segfault happens during execution of the compiled workunit/kernel.
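For reference, a timeit-free variant that I'd expect to hit the same path (a minimal sketch reusing the setup from the reproducer above, not a separately verified test):

import numpy as np
import pykokkos as pk

pk.set_default_space(pk.ExecutionSpace.OpenMP)

array_size_1d = int(1e6)
rng = np.random.default_rng(18898787)
arr = rng.random(array_size_1d).astype(float)
view = pk.from_numpy(arr)
arr = view

# plain loop instead of timeit; the crash happens inside the compiled kernel
for _ in range(5000):
    pk.sqrt(arr)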
Ah yeah, at 1e5 1D array size (of doubles) PyKokkos is using 96% of 128 GB of RAM... something is wrong here.
[plot: PyKokkos with memory_profiler and timeit per above, 1e4 float64 elements]
Identical workflow with NumPy doesn't show the leak:
[plot: NumPy memory profile, same workflow]
Providing a smaller reproducer:
import pykokkos as pk
import numpy as np

@profile
def test_gh_184():
    arr = np.zeros(10000)
    view = pk.from_numpy(arr)
    for i in range(50000):
        pk.sqrt(view)

if __name__ == "__main__":
    test_gh_184()
And profiling the memory usage with memory_profiler:
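(The @profile decorator is provided by memory_profiler; the report below, for test.py, comes from running something along the lines of:)

python -m memory_profiler test.py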
Filename: test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
5 191.637 MiB 191.637 MiB 1 @profile
6 def test_gh_184():
7 191.637 MiB 0.000 MiB 1 arr = np.zeros(10000)
8 191.637 MiB 0.000 MiB 1 view = pk.from_numpy(arr)
9 4062.195 MiB 0.516 MiB 50001 for i in range(50000):
10 4062.195 MiB 3870.043 MiB 50000 pk.sqrt(view)
I hypothesized that PyKokkos is not properly freeing the new "out" views/arrays used in ufuncs, which at first glance the profiler seems to agree with:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
220 4061.914 MiB 191.746 MiB 50000 @profile
221 def sqrt(view):
222 """
223 Return the non-negative square root of the argument, element-wise.
224
225 Parameters
226 ----------
227 view : pykokkos view
228 Input view.
229
230 Returns
231 -------
232 y : pykokkos view
233 Output view.
234
235 Notes
236 -----
237 .. note::
238 This function should exhibit the same branch cut behavior
239 as the equivalent NumPy ufunc.
240 """
241 # TODO: support complex types when they
242 # are available in pykokkos?
243 4061.914 MiB 0.000 MiB 50000 if len(view.shape) > 2:
244 raise NotImplementedError("only up to 2D views currently supported for sqrt() ufunc.")
245 4062.172 MiB 3830.570 MiB 50000 out = pk.View(view.shape, view.dtype)
246 4062.172 MiB 0.516 MiB 50000 if "double" in str(view.dtype) or "float64" in str(view.dtype):
247 4062.172 MiB 0.000 MiB 50000 if view.shape == ():
248 pk.parallel_for(1, sqrt_impl_1d_double, view=view, out=out)
249 4062.172 MiB 0.000 MiB 50000 elif len(view.shape) == 1:
250 4062.172 MiB 38.824 MiB 50000 pk.parallel_for(view.shape[0], sqrt_impl_1d_double, view=view, out=out)
251 elif len(view.shape) == 2:
252 pk.parallel_for(view.shape[0], sqrt_impl_2d_double, view=view, out=out)
253 elif "float" in str(view.dtype):
254 if view.shape == ():
255 pk.parallel_for(1, sqrt_impl_1d_float, view=view, out=out)
256 elif len(view.shape) == 1:
257 pk.parallel_for(view.shape[0], sqrt_impl_1d_float, view=view, out=out)
258 elif len(view.shape) == 2:
259 pk.parallel_for(view.shape[0], sqrt_impl_2d_float, view=view, out=out)
260 4062.172 MiB 0.516 MiB 50000 return out
Simplifying even more, I think this can be summarized as:
PyKokkos is not freeing the memory associated with a view that is passed into a workunit/kernel (at least, not fast enough), and the memory copy behavior appears to be sensitive to the scope/destruction/replacement of the view that is fed in.
For example:
import pykokkos as pk
import numpy as np
from tqdm import tqdm

@pk.workunit
def nothing(tid: int, view: pk.View1D[pk.double]):
    view[tid] = view[tid]

def test_gh_184():
    for i in tqdm(range(500000)):
        v = pk.View((10_000,), pk.float64)
        pk.parallel_for(1, nothing, view=v)

if __name__ == "__main__":
    test_gh_184()
vs. hoisting the view to reduce the memory punishment (this also runs orders of magnitude slower...):
import pykokkos as pk
import numpy as np
from tqdm import tqdm

@pk.workunit
def nothing(tid: int, view: pk.View1D[pk.double]):
    view[tid] = view[tid]

def test_gh_184():
    v = pk.View((10_000,), pk.float64)
    for i in tqdm(range(500000)):
        pk.parallel_for(1, nothing, view=v)

if __name__ == "__main__":
    test_gh_184()
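To put a rough number on the difference between the two variants, a peak-RSS readout can be appended to either script (a sketch using the standard-library resource module; ru_maxrss is reported in kilobytes on Linux):

import resource

# peak resident set size of the process so far (kilobytes on Linux, bytes on macOS)
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024:.1f} MiB")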
Did some testing ... to me this looks like a classic Python garbage collector problem.
The gc is not obliged to instantly delete an object when it goes out of scope (the deletion would be costly, so why do it now). But it is also not allowed to reuse the allocated memory of an object on the death list as long as that object has not been deleted yet. Furthermore, as far as I know it is not even obliged to reuse already-allocated memory at all (for externally defined types this might even be dangerous or lead to memory corruption).
Thus my assessment is: the gc sees the view v that gets created in every iteration but does not delete it ... because why should it do that now. As a result, every iteration allocates new space for a view, since that guarantees no memory corruption. That is why you see the time needed to allocate a new v in every iteration in the triangular plot.
If the view is created before the loop, the gc does even less ... it will only delete the single existing v at exit of the function that does the iteration.
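If that is what is happening, explicitly dropping the view and forcing a collection inside the loop should flatten the memory curve. A quick way to check (a sketch reusing the nothing workunit from above, not something I have run here):

import gc
import pykokkos as pk
from tqdm import tqdm

@pk.workunit
def nothing(tid: int, view: pk.View1D[pk.double]):
    view[tid] = view[tid]

def test_gh_184():
    for i in tqdm(range(500000)):
        v = pk.View((10_000,), pk.float64)
        pk.parallel_for(1, nothing, view=v)
        del v          # drop the only Python reference to the view
        gc.collect()   # if the collector alone were holding the memory back,
                       # usage should now stay flat across iterations

if __name__ == "__main__":
    test_gh_184()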
I'm confused--do you see the same leak with NumPy? I think the gc is usually pretty fast, so I still think PyKokkos is the problem here.
I still maintain that something is wrong here, and regardless of the details, we should be competitive with NumPy and other array API providers for usage with tools like timeit, without saturating memory.
Consider, for example, that if we don't send the view to the workunit, the gc does exactly the right thing:
import pykokkos as pk
import numpy as np
from tqdm import tqdm

@pk.workunit
def nothing(tid: int, view: pk.View1D[pk.double]):
    view[tid] = view[tid]

def test_gh_184():
    for i in tqdm(range(500000)):
        v = pk.View((10_000,), pk.float64)
        #pk.parallel_for(1, nothing, view=v)

if __name__ == "__main__":
    test_gh_184()
And if you add the workunit back, suddenly the garbage collector is lazy?
import pykokkos as pk
import numpy as np
from tqdm import tqdm

@pk.workunit
def nothing(tid: int, view: pk.View1D[pk.double]):
    view[tid] = view[tid]

def test_gh_184():
    for i in tqdm(range(500000)):
        v = pk.View((10_000,), pk.float64)
        pk.parallel_for(1, nothing, view=v)

if __name__ == "__main__":
    test_gh_184()
And if you switch the loop to use a NumPy ufunc, suddenly the gc is working again?
import pykokkos as pk
import numpy as np
from tqdm import tqdm

def test_gh_184():
    for i in tqdm(range(500000)):
        v = np.ones(10_0000, dtype=np.float64)
        np.sqrt(v)

if __name__ == "__main__":
    test_gh_184()
I've been looking into it, and I think the cause is the kernel argument caching in parallel_dispatch.py, here: https://github.com/kokkos/pykokkos/blob/a4108286fcc0fac28300670dbd35a494b00c4097/pykokkos/interface/parallel_dispatch.py#L220
I'm still figuring out the details, but @tylerjereddy could you try commenting out that line and see if it fixes your problem?
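For anyone following along, the suspected failure mode is roughly the following (a sketch with hypothetical names, not the actual pykokkos code): a module-level cache of kernel arguments keeps a strong reference to every view ever passed in, so those views can never be collected.

import weakref

_kernel_arg_cache: dict = {}   # hypothetical module-level cache

def dispatch(kernel_name: str, **kwargs) -> None:
    # Caching the arguments like this keeps a strong reference to every view
    # that was ever passed to a kernel, so their refcounts never reach zero
    # and the underlying buffers are never released.
    _kernel_arg_cache[kernel_name] = kwargs

def dispatch_without_pinning(kernel_name: str, **kwargs) -> None:
    # One conventional alternative (assuming the cached objects support weak
    # references): keep only weak references, so caching does not by itself
    # extend the views' lifetimes.
    _kernel_arg_cache[kernel_name] = {k: weakref.ref(v) for k, v in kwargs.items()}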
@NaderAlAwar unfortunately, I'm seeing identical memory saturation with that change on its own locally.
Ah wait, that's for reduction instead of parallel_for, as @JBludau points out--let me try.
@NaderAlAwar ok, if I apply a similar patch for parallel_for then the memory saturation does go away, as below. I'll check if the PyKokkos test suite can still pass with this--if it can, I could benchmark like that for now...
@NaderAlAwar Well, the original timeit reproducer at the top still segfaults for me with large arrays (1e6). That doesn't appear to be memory saturation because it happens so fast--maybe an illegal/unreasonable request or something, dunno... maybe I'll just use older benchmarks for now.
um … It also segfaults on my machine for sizes above 1e8. Running a debug version I get mixed results in the debugger: for different numbers of threads it segfaults at different indices in the view. All of them are valid indices though…
ok ... so I tried with valgrind, and it looks to me like the memory we are using from numpy via:
arr = rng.random(array_size_1d).astype(float)
view = pk.from_numpy(arr)
arr = view
gets freed while we are still doing our work e.g.:
==26850== Invalid read of size 8
==26850== at 0xF0AF5F74: pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::operator()(pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double const&, int) const (functor.hpp:42)
==26850== by 0xF0AF0CF2: std::enable_if<(!std::integral_constant<bool, false>::value)&&std::is_same<pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double, pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double>::value, void>::type Kokkos::Impl::ParallelFor<pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>, Kokkos::RangePolicy<Kokkos::OpenMP, pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double>, Kokkos::OpenMP>::exec_work<pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double>(pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP> const&, unsigned long) (Kokkos_OpenMP_Parallel.hpp:106)
==26850== by 0xF0AFF5E3: std::enable_if<!std::is_same<Kokkos::RangePolicy<Kokkos::OpenMP, pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double>::schedule_type::type, Kokkos::Dynamic>::value, void>::type Kokkos::Impl::ParallelFor<pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>, Kokkos::RangePolicy<Kokkos::OpenMP, pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double>, Kokkos::OpenMP>::execute_parallel<Kokkos::RangePolicy<Kokkos::OpenMP, pk_functor_sqrt_impl_1d_double<Kokkos::OpenMP>::sqrt_impl_1d_double> >() const [clone ._omp_fn.0] (Kokkos_OpenMP_Parallel.hpp:131)
==26850== by 0x12CFFB9D: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==26850== by 0x4A29B42: start_thread (pthread_create.c:442)
==26850== by 0x4ABABB3: clone (clone.S:100)
==26850== Address 0xf044e040 is 2,368 bytes inside a block of size 8,224 free'd
==26850== at 0x484B27F: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==26850== by 0x2E9119: _PyArena_Free (in /usr/bin/python3.10)
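In other words, the suspicion is that after arr = view the original numpy array has no remaining Python reference, so CPython frees its buffer while the kernel is still reading it. A small diagnostic sketch (it assumes pk.from_numpy does not keep its own reference to the array, which is exactly the open question):

import sys
import numpy as np
import pykokkos as pk

np_arr = np.zeros(8)
view = pk.from_numpy(np_arr)

# getrefcount reports one extra reference for its own argument; a value of 2
# here would mean only the local name np_arr is keeping the array alive,
# i.e. the view holds no reference of its own
print(sys.getrefcount(np_arr))

np_arr = view  # rebinding drops that last reference -> the buffer may be freed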
If I use a view allocated by pykokkos as arr for the test, it passes without segfaults:
arr = pk.View([array_size_1d], pk.double)
Furthermore, if I get rid of the self-assignment on arr (to prevent the gc from deleting the numpy storage), it also works without segfaulting:
np_array = rng.random(array_size_1d) #.astype(float)
view = pk.from_numpy(np_array)
arr = view
I guess we have to look at from_numpy again ... and tell the gc to keep the reference count up for the associated memory we take from numpy. I guess the easy workaround until then is to not do self-assignments on arrays that we grab from numpy.
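The usual pattern for that kind of fix (a sketch with hypothetical names; the real change would live inside from_numpy and the view class) is for the wrapper to stash a reference to the source array, tying the numpy buffer's lifetime to the view's:

import numpy as np

class WrappedView:
    """Toy stand-in for a pykokkos view built on top of a numpy array."""

    def __init__(self, np_array: np.ndarray):
        # Holding a reference here means the numpy buffer cannot be freed
        # while this view is alive, even if the caller rebinds or deletes
        # their own name for the array (e.g. arr = view).
        self._source = np_array
        self.data = np_array  # the real view would wrap the same buffer

def from_numpy_like(np_array: np.ndarray) -> WrappedView:
    # hypothetical from_numpy-style constructor illustrating the idea
    return WrappedView(np_array)

arr = np.zeros(8)
view = from_numpy_like(arr)
arr = view  # the self-assignment is now safe: the numpy buffer stays
            # reachable through view._source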