[BUG]: free(): invalid pointer

Open whybeyoung opened this issue 1 year ago • 1 comments

Required prerequisites

  • [X] Make sure you've read the documentation. Your issue may be addressed there.
  • [X] Search the issue tracker and Discussions to verify that this hasn't already been reported. +1 or comment there if it has.
  • [ ] Consider asking first in the Gitter chat room or in a Discussion.

Problem description

As in issue https://github.com/pybind/pybind11/issues/1472, we still see this problem in 2.10.0:

free(): invalid pointer

c++ code:

#include "pybind11/embed.h"
#include <iostream>
#include <thread>
#include <chrono>
#include <sstream>
#include <vector>

namespace py = pybind11;
using namespace std::chrono_literals;

class Wrapper
{
public:
  Wrapper()
  {
    py::gil_scoped_acquire acquire;
    _obj = py::module::import("wrapper").attr("Wrapper")();
    _wrapperInit = _obj.attr("wrapperInit");
    _wrapperFini = _obj.attr("wrapperFini");

  }

  ~Wrapper()
  {
    _wrapperInit.release();
    _wrapperFini.release();
  }

  int wrapperInit()
  {
    py::gil_scoped_acquire acquire;
    return _wrapperInit(nullptr).cast<int>();
  }

  void wrapperFini(int x)
  {
    py::gil_scoped_acquire acquire;
    _wrapperFini(x);
  }

private:
  py::object _obj;
  py::object _wrapperInit;
  py::object _wrapperFini;
};
void thread_func(int iteration)
{
  Wrapper w;

  for (int i = 0; i < 1; i++)
  {
    w.wrapperInit();
    std::stringstream msg;
    msg << "iteration: " << iteration << " thread: " << std::this_thread::get_id() << std::endl;
    std::cout << msg.str();
    std::this_thread::sleep_for(100ms);
  }
}

int main() {
  py::scoped_interpreter guard{};
  py::gil_scoped_release release; // add this to release the GIL

  std::vector<std::thread> threads;

  for (int i = 0; i < 1; ++i)
    threads.push_back(std::thread(thread_func, 1));

  for (auto& t : threads)
    t.join();

  return 0;
}

The wrapper.py code is:


class Wrapper():
    serviceId = "mmocr"
    version = "backup.0"


    '''
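    Service initialization
    @param config:
        Configuration required for plugin initialization, as a dict
        key: configuration name
        value: configuration value
    @return
        ret: error code; 0 when there is no error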
    '''

    def wrapperInit(cls, config: {}) -> int:
        import torch
        print(config)

        print("Initializing ..")
        return 0

    def wrapperFini(cls) -> int:
        return 0

We run this code in an Ubuntu 18.04 Docker container. The image repo is public.ecr.aws/iflytek-open/opensource/demo/mmocr:v3.1

Reproducible example code

No response

whybeyoung • Aug 11 '22 11:08

I'm guessing this is https://github.com/pybind/pybind11/issues/4105.

henryiii • Aug 24 '22 17:08

I verified this is not #4105, this code was broken in 2.9 as well.

henryiii • Oct 23 '22 03:10

I couldn't reproduce the free(): invalid pointer crash using the code here, but there is certainly a GIL issue that you can confirm by using PR #4146. The problem in the reproducer code is that the GIL is not being held when the destructor for Wrapper::_obj is running. You can "fix" it by adding _obj.release(); in the Wrapper destructor. "fix" is in quotation marks because it is simply leaking the Python reference, "masking" would be a more fitting word. To not leak:

--- main_using_embed_h.cpp.orig 2022-10-23 21:29:46.559375849 -0700
+++ main_using_embed_h.cpp      2022-10-23 21:56:25.089334464 -0700
@@ -21,7 +21,12 @@

   ~Wrapper()
   {
+    py::gil_scoped_acquire hold_gil;
+    _obj.dec_ref();
+    _obj.release();
+    _wrapperInit.dec_ref();
     _wrapperInit.release();
+    _wrapperFini.dec_ref();
     _wrapperFini.release();
   }

I'm closing this bug because it's pretty likely that the free(): invalid pointer has nothing to do with a bug in pybind11.

Until we merge PR #4146, I recommend you patch it locally and run all your tests.
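
A note on why the diff above calls dec_ref() and release() on each member instead of only adding the guard: in C++, the locals of a destructor body are destroyed before the class members are, so a py::gil_scoped_acquire declared inside ~Wrapper() is already gone by the time the py::object members' own destructors run. The sketch below models that ordering with standard-library-only stand-ins. FakeGilGuard, FakeRef, and the gil_held flag are hypothetical names made up for illustration; they are not pybind11 types and nothing here touches a real interpreter.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration only; none of these are pybind11 types.
static bool gil_held = false;            // models "this thread holds the GIL"
static std::vector<bool> held_at_drop;   // lock state observed at each reference drop

struct FakeGilGuard {                    // plays the role of py::gil_scoped_acquire
    FakeGilGuard() { gil_held = true; }
    ~FakeGilGuard() { gil_held = false; }
};

struct FakeRef {                         // plays the role of an owning py::object
    bool owns = true;
    void drop() {                        // like dec_ref() followed by release()
        if (owns) {
            held_at_drop.push_back(gil_held);
            owns = false;
        }
    }
    ~FakeRef() { drop(); }               // no-op if drop() already ran
};

struct GuardOnly {
    FakeRef obj;
    ~GuardOnly() { FakeGilGuard guard; } // guard is destroyed before obj is
};

struct GuardAndDrop {
    FakeRef obj;
    ~GuardAndDrop() {
        FakeGilGuard guard;
        obj.drop();                      // reference dropped while the lock is held
    }
};

// Each helper reports whether the lock was held when the reference was dropped.
bool held_with_guard_only() {
    held_at_drop.clear();
    { GuardOnly w; }
    assert(held_at_drop.size() == 1);
    return held_at_drop.front();
}

bool held_with_guard_and_drop() {
    held_at_drop.clear();
    { GuardAndDrop w; }
    assert(held_at_drop.size() == 1);
    return held_at_drop.front();
}
```

In this model held_with_guard_only() returns false and held_with_guard_and_drop() returns true, which mirrors why the diff empties every member inside the destructor body while the GIL is held.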

rwgk • Oct 24 '22 05:10

I am encountering this under the same conditions. Here is my setup, which can be replicated:

# dummy.py
import torch

def simple_return():
    return 1

simple.cpp:

#include <iostream>
#include <future>
#include <pybind11/embed.h>

namespace py = pybind11;

std::future<int> callPythonFunctionAsync(py::object &pyFunction) {
    return std::async(std::launch::async, [&](){
        py::gil_scoped_acquire acquire;
        int result = pyFunction().cast<int>();
        return result;
    });
}

int main() {
    py::scoped_interpreter guard{}; // Start the interpreter and keep it alive

    // Import the Python module
    py::module pyModule = py::module::import("dummy");
    py::object pyFunction = pyModule.attr("simple_return");

    // Call the function asynchronously
    std::cout << "Calling Python function asynchronously..." << std::endl;
    py::gil_scoped_release release;
    auto futureResult = callPythonFunctionAsync(pyFunction);

    // Wait for the result and print it
    try {
        int result = futureResult.get();
        std::cout << "Result from Python: " << result << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "Exception caught: " << e.what() << std::endl;
    }

    return 0;
}

with the following CMakeLists.txt:

cmake_minimum_required(VERSION 3.10)  # Updated minimum required version
project(py_cpp_func)

set(CMAKE_CXX_STANDARD 11)  # Setting C++ standard to C++11
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pthread")

# Manually set Python include directories and libraries
set(PYTHON_INCLUDE_DIR /usr/local/include/python3.10)
set(PYTHON_LIBRARY /usr/local/lib/libpython3.10.so)
include_directories(${PYTHON_INCLUDE_DIR})

# Include pybind11 from the external directory
add_subdirectory(external/pybind11)
add_executable(py_dummy simple.cpp)
target_link_libraries(py_dummy PRIVATE ${PYTHON_LIBRARIES} pybind11::embed)
configure_file(dummy.py ${CMAKE_BINARY_DIR}/dummy.py COPYONLY)

with the following Dockerfile:

FROM ubuntu:18.04

RUN apt-get update && \
    apt-get install -y software-properties-common && \
    add-apt-repository ppa:ubuntu-toolchain-r/test && \
    apt-get update && \
    apt-get install -y \
    gcc \
    g++ \
    cmake \
    libboost-all-dev \
    wget

RUN apt-get remove -y cmake && \
    wget https://cmake.org/files/v3.10/cmake-3.10.0-Linux-x86_64.sh && \
    chmod +x cmake-3.10.0-Linux-x86_64.sh && \
    ./cmake-3.10.0-Linux-x86_64.sh --skip-license --prefix=/usr/local

RUN apt-get install -y git

# Clone pybind11 into the external directory
RUN mkdir -p /external && \
    git clone --branch v2.11.1 https://github.com/pybind/pybind11.git /external/pybind11

# Install Python 3.10.13
ENV PYTHON_VERSION 3.10.13

# Install necessary packages
RUN apt-get update && \
    apt-get install -y software-properties-common wget git \
    build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev \
    libssl-dev libsqlite3-dev libreadline-dev libffi-dev curl libbz2-dev liblzma-dev
RUN apt-get install -y libgomp1 libgl1-mesa-glx

# Download Python 3.10 source
RUN cd /tmp && \
    wget https://www.python.org/ftp/python/$PYTHON_VERSION/Python-$PYTHON_VERSION.tar.xz && \
    tar -xf Python-$PYTHON_VERSION.tar.xz

# Compile Python 3.10
RUN cd /tmp/Python-$PYTHON_VERSION && \
    ./configure --enable-optimizations --enable-shared && \
    make -j 8 && \
    make altinstall && \
    ldconfig

# Install pip for Python 3.10
RUN cd /tmp && \
    wget https://bootstrap.pypa.io/get-pip.py && \
    python3.10 get-pip.py && \
    rm get-pip.py

# Install OpenCV for C++
RUN DEBIAN_FRONTEND="noninteractive" apt-get install -y libopencv-dev

WORKDIR /usr/src/three-stage-object-detection
# Install Triton Inference Server
COPY three-stage-object-detection /usr/src/three-stage-object-detection/
RUN python3.10 -m pip install -e .

WORKDIR /usr/src/app
COPY CMakeLists.txt /usr/src/app/
COPY dummy.py /usr/src/app/
COPY simple.cpp /usr/src/app/
RUN mkdir external && \
    ln -s /external/pybind11 external/pybind11
RUN mkdir build && \
    cd build && \
    cmake -DCMAKE_BUILD_TYPE=Debug .. && \
    make

WORKDIR /usr/src/app/build

# Clean up
RUN apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

Davidnet • Jan 31 '24 19:01

I am encountering this under the same conditions. Here is my setup, which can be replicated:

  • Does this run successfully if you remove import torch?
  • Do you have a stack trace from the crash?
  • I don't think that's it, but I'd make this change:
-std::future<int> callPythonFunctionAsync(py::object &pyFunction)
+std::future<int> callPythonFunctionAsync(py::handle pyFunction)
  • I don't think any of the maintainers will have the time to reproduce the crash. If this is important to you, I recommend you send a PR that adds a .github/workflows/reproducer.yml job to run in GitHub Actions.

  • I really really doubt the root cause is in pybind11.
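
One caveat on the signature change suggested above, sketched with standard-library stand-ins: once the parameter is a by-value py::handle, the lambda should capture it by value as well, because a [&] capture would then refer to a parameter that no longer exists after the function returns while the task may still be running. FakeHandle and call_async are hypothetical names for illustration; the struct only models a small, non-owning, copyable handle and is not a pybind11 type.

```cpp
#include <cassert>
#include <future>

// Hypothetical stand-in for a small non-owning handle such as py::handle.
struct FakeHandle {
    const int* target;
};

// Takes the handle by value and copies it into the asynchronous task.
std::future<int> call_async(FakeHandle h) {
    assert(h.target != nullptr);
    // [h], not [&]: the parameter h dies when call_async returns,
    // while the task may still be running on another thread.
    return std::async(std::launch::async, [h]() { return *h.target + 1; });
}
```

The object the handle points at still has to outlive the task; in simple.cpp that is the case because pyFunction in main() stays alive until after futureResult.get().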

rwgk • Jan 31 '24 20:01

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff6c7f7f1 in __GI_abort () at abort.c:79
#2  0x00007ffff6cc8837 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff6df5a7b "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007ffff6ccf8ba in malloc_printerr (str=str@entry=0x7ffff6df3c76 "free(): invalid pointer") at malloc.c:5342
#4  0x00007ffff6cd6dec in _int_free (have_lock=0, p=0x7fff280e49a8, av=0x7ffff702ac40 <main_arena>) at malloc.c:4167
#5  __GI___libc_free (mem=0x7fff280e49b8) at malloc.c:3134
#6  0x000055555542c508 in __gnu_cxx::new_allocator<std::_Fwd_list_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::destroy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (this=0x5555556dae78, __p=0x7fff280e6158) at /usr/include/c++/7/ext/new_allocator.h:140
#7  0x000055555542876b in std::allocator_traits<std::allocator<std::_Fwd_list_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::destroy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__a=..., __p=0x7fff280e6158) at /usr/include/c++/7/bits/alloc_traits.h:487
#8  0x000055555542319d in std::_Fwd_list_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_erase_after (this=0x5555556dae78, __pos=0x5555556dae78, __last=0x0) at /usr/include/c++/7/bits/forward_list.tcc:90
#9  0x000055555541e84a in std::_Fwd_list_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::~_Fwd_list_base (this=0x5555556dae78, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/forward_list.h:329
#10 0x000055555541a82c in std::forward_list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::~forward_list (this=0x5555556dae78, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/forward_list.h:559
#11 0x000055555540fb3b in pybind11::detail::internals::~internals (this=0x5555556dacd0, __in_chrg=<optimized out>) at /external/pybind11/include/pybind11/detail/internals.h:207
#12 0x0000555555419629 in pybind11::finalize_interpreter () at /external/pybind11/include/pybind11/embed.h:263
#13 0x00005555554196ea in pybind11::scoped_interpreter::~scoped_interpreter (this=0x7fffffffe533, __in_chrg=<optimized out>) at /external/pybind11/include/pybind11/embed.h:308
#14 0x0000555555407d2d in main () at /usr/src/app/simple.cpp:16

I got this backtrace. Also, the program runs successfully if I update the Docker base image to Ubuntu 20.04.

Someone on Gitter helped me get the trace.

Davidnet • Jan 31 '24 20:01

If I do not import torch, the code works, so it is definitely something with torch.

Davidnet • Jan 31 '24 20:01

If I do not import torch, the code works, so it is definitely something with torch.

I'd work on sending them a PR that reproduces the crash.

rwgk • Jan 31 '24 20:01