
[bug][graph] Segfault when reading a registered buffer via module._buffers["0"] inside forward (eager OK)

Open tinywisdom opened this issue 2 months ago • 0 comments

Summary

Reading a buffer that was registered under the name "0" (a numeric string) through module._buffers["0"] inside forward() works in eager mode, but segfaults in Graph mode while functional::Add/AddN executes. The crash occurs even in an otherwise trivial model (a single Linear layer whose output is added to the buffer).

Code to reproduce bug

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3"  # or any single GPU

import oneflow as flow
import oneflow.nn as nn

# ---- Minimal BufferList: registers a single buffer under key "0" ----
class BufferListTemplate(nn.Module):
    def __init__(self, *buffers):
        super().__init__()
        assert len(buffers) == 1, "MRE only needs one buffer"
        # Critical: register under the name "0" (numeric string)
        self.register_buffer("0", buffers[0])

    def __getitem__(self, idx: int):
        # Not used in this MRE; present to show a 'normal' accessor
        return getattr(self, str(idx))

# ---- Model: Linear + fetch buffer from _buffers["0"] and add ----
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 10)
        self.buflist = BufferListTemplate(flow.randn(10))  # float32

    def forward(self, x):
        y = self.fc(x)
        # Trigger: directly read from the internal _buffers dict
        buf = self.buflist._buffers["0"]
        return y + buf  # hits functional::Add/AddN in Graph

# ---- Graph wrapper ----
class G(nn.Graph):
    def __init__(self, m):
        super().__init__()
        self.m = m
    def build(self, x):
        return self.m(x)

def main():
    flow.manual_seed(0)
    model = MyModel()
    x = flow.randn(2, 10)

    print("------------------------")
    # Eager: usually OK
    out_eager = model(x)

    print("------------------------")
    # Graph: typically segfaults here
    g = G(model)
    out_graph = g(x)  # <-- crash expected

    print("------------------------")
    # If it doesn't crash, compare numerics (rarely reached)
    import numpy as np
    np.testing.assert_allclose(out_eager.numpy(), out_graph.numpy(), rtol=1e-3, atol=1e-2)

if __name__ == "__main__":
    main()

Observed Output (abridged)

------------------------
------------------------
Stack trace (most recent call last):
  ... oneflow/_oneflow_internal.cpython-310-...so
  ... functional::add(...)
  ... functional::Add(TensorTuple const&, bool)
  ... functional::impl::AddNFunctor::operator()
  ... OpInterpUtil::Dispatch(...)
Segmentation fault (Address not mapped to object [0x61])
Segmentation fault (core dumped)
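
Because the segfault kills the interpreter, anything after the Graph call never runs. A minimal sketch for triaging this kind of crash is to launch the repro in a child interpreter with Python's faulthandler enabled, so the parent survives and can report which signal ended the child. The helper names `run_isolated` and `describe` and the file name `repro.py` are hypothetical and not part of OneFlow; the sketch assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys


def run_isolated(code: str, timeout: float = 120.0) -> subprocess.CompletedProcess:
    """Run `code` in a fresh interpreter with faulthandler enabled.

    Hypothetical helper: a crash in the child cannot take down the
    calling process, and faulthandler writes the Python-level traceback
    to the child's stderr before the process dies.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def describe(proc: subprocess.CompletedProcess) -> str:
    """Summarise how the child ended (POSIX: negative code = signal)."""
    if proc.returncode < 0:
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exited with code {proc.returncode}"
```

Assuming the script above is saved as `repro.py`, `proc = run_isolated(open("repro.py").read())` followed by `print(describe(proc))` and `print(proc.stderr)` would show the terminating signal together with the faulthandler traceback, which may help locate the Python frame that was active when the crash occurred.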

System Information

  • OS: Ubuntu 22.04.4 LTS (x86_64)
  • OneFlow version: 1.0.0.dev20250921+cpu
  • Python version: 3.10.16

tinywisdom, Oct 16 '25 06:10