[BUG] supervisor memory leak

Open fyrestone opened this issue 3 years ago • 0 comments

Describe the bug

image

The Mars job executed hundreds of tasks, some of which have large graphs:

image

Set the environment variable PYTHONTRACEMALLOC=1, then run the following code:

import tracemalloc
import gc
import mars
import mars.tensor as mt
import mars.dataframe as md


def run_round(extra_config=None):
    df = md.DataFrame(
        mt.random.rand(1200, 70, chunk_size=70),
        columns=[f"col{i}" for i in range(70)])

    for i in range(70):
        df[f"col{i + 70}"] = df[f"col{i}"].fillna(0)
        df[f"col{i + 140}"] = df[f"col{i}"].fillna(0)
    for i in range(70):
        df[f"col{i}"] = df[f"col{i}"] / 100
    df = df.fillna(0)
    df.map_chunk(lambda x: x).execute(extra_config=extra_config)


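def main():
    mars.new_session()
    # This round runs before the baseline snapshot, so its allocations
    # are already counted in s1.
    run_round(extra_config={"enable_profiling": True})
    gc.collect()
    s1 = tracemalloc.take_snapshot()

    # Run three more rounds; memory still held afterwards shows up in the diff.
    for count in range(3):
        print(f"==============>run {count}")
        run_round()

    gc.collect()
    s2 = tracemalloc.take_snapshot()
    # Compare allocations per source line between the two snapshots.
    diff = s2.compare_to(s1, "lineno")
    print("[ Top 10 differences ]")
    for stat in diff[:10]:
        print(stat)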


if __name__ == "__main__":
    main()

Output:

[ Top 10 differences ]
/home/admin/mars/mars/core/base.py:96: size=10.5 MiB (+8098 KiB), count=59034 (+44276), average=187 B
/home/admin/mars/mars/core/operand/core.py:100: size=10200 KiB (+7647 KiB), count=61221 (+45900), average=171 B
/home/admin/mars/mars/core/graph/builder/base.py:72: size=9861 KiB (+7400 KiB), count=85245 (+63956), average=118 B
/home/admin/mars/mars/services/task/analyzer/analyzer.py:250: size=9629 KiB (+7222 KiB), count=91908 (+68931), average=107 B
/home/admin/mars/mars/core/graph/builder/base.py:62: size=9194 KiB (+6895 KiB), count=91812 (+68859), average=103 B
/home/admin/mars/mars/core/base.py:37: size=7590 KiB (+5691 KiB), count=131726 (+98781), average=59 B
/home/admin/mars/mars/core/operand/base.py:266: size=6733 KiB (+5048 KiB), count=132583 (+99408), average=52 B
/home/admin/mars/mars/core/operand/base.py:225: size=5691 KiB (+4268 KiB), count=132393 (+99294), average=44 B
/home/admin/mars/mars/core/entity/core.py:35: size=5180 KiB (+3883 KiB), count=66294 (+49706), average=80 B
/home/admin/mars/mars/core/operand/core.py:121: size=5025 KiB (+3769 KiB), count=30629 (+22971), average=168 B
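To see full allocation call stacks instead of single lines, and to restrict the report to Mars frames, the two snapshots can be filtered and compared by traceback. The following is a minimal sketch using only the standard-library tracemalloc API. The helper name `top_growth`, the `*/mars/*` filename pattern, and the frame depth of 25 are illustrative assumptions, not part of the original reproduction.

```python
import tracemalloc


def top_growth(before, after, pattern="*/mars/*", limit=10):
    """Return up to `limit` growing allocation sites whose frames match `pattern`.

    `pattern` is an fnmatch-style filename pattern (an assumption here;
    adjust it to the local checkout path).
    """
    # Keep only traces where at least one frame matches the pattern.
    filters = [tracemalloc.Filter(True, pattern, all_frames=True)]
    before = before.filter_traces(filters)
    after = after.filter_traces(filters)
    # Group by full traceback instead of a single source line.
    stats = after.compare_to(before, "traceback")
    return [stat for stat in stats if stat.size_diff > 0][:limit]


if __name__ == "__main__":
    # Record up to 25 frames per allocation so tracebacks are meaningful.
    tracemalloc.start(25)
    s1 = tracemalloc.take_snapshot()
    held = [bytes(1000) for _ in range(1000)]  # stand-in for leaked objects
    s2 = tracemalloc.take_snapshot()
    for stat in top_growth(s1, s2, pattern="*"):
        print(stat)
        print("\n".join(stat.traceback.format()))
```

In the reproduction above, `s1` and `s2` could be passed to this helper directly; setting PYTHONTRACEMALLOC=25 instead of 1 stores 25 frames per allocation from interpreter startup.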

After digging into the code, the memory leak occurs because the TaskProcessor is not removed. As a result, the tileable graph, chunk graph and subtask graph are never garbage-collected.
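The mechanism can be illustrated with a small standalone sketch. It is not the Mars implementation: `Graph`, `Processor`, `Manager`, and their methods are hypothetical stand-ins, used only to show how a registry that keeps a processor also keeps every graph the processor references.

```python
import gc
import weakref


class Graph:
    """Hypothetical stand-in for a tileable, chunk or subtask graph."""


class Processor:
    """Hypothetical stand-in for a task processor that owns its graphs."""

    def __init__(self):
        self.tileable_graph = Graph()
        self.chunk_graph = Graph()
        self.subtask_graph = Graph()


class Manager:
    """Hypothetical registry mapping task ids to processors."""

    def __init__(self):
        self._processors = {}

    def submit(self, task_id):
        processor = Processor()
        self._processors[task_id] = processor
        # A weak reference lets the caller observe collection
        # without keeping the graph alive.
        return weakref.ref(processor.chunk_graph)

    def finish(self, task_id, remove=True):
        # Leaving the entry in place is the leak scenario.
        if remove:
            self._processors.pop(task_id, None)


def graph_survives(remove):
    """Return True if the chunk graph is still alive after the task finishes."""
    manager = Manager()
    ref = manager.submit("task-0")
    manager.finish("task-0", remove=remove)
    gc.collect()
    return ref() is not None


if __name__ == "__main__":
    print("processor kept:   ", graph_survives(remove=False))
    print("processor removed:", graph_survives(remove=True))
```

Under this assumption, the fix is to drop the processor from the registry once its task is done, so that the graphs become unreachable and can be collected.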

To Reproduce To help us reproduce this bug, please provide the information below:

  1. Your Python version
  2. The version of Mars you use Latest master
  3. Versions of crucial packages, such as numpy, scipy and pandas
  4. Full stack of the error.
  5. Minimized code to reproduce the error.

fyrestone, Jul 15 '22 04:07