chatnoir-resiliparse

Steady memory growth while processing web pages

Open prnake opened this issue 3 months ago • 5 comments

I'm trying to use Resiliparse to process Common Crawl (CC) data, but the memory usage of a single process grows slowly and parsing web pages gets slower over time. I can't find the source of the problem with a memory analysis tool; it looks like some sort of memory leak.

from fastwarc.warc import ArchiveIterator as FastWarcArchiveIterator
from fastwarc.warc import WarcRecordType, WarcRecord
from fastwarc.stream_io import FastWARCError
from fastwarc.warc import is_http

from resiliparse.parse.html import HTMLTree

import os
import psutil
import time


file = open("../CC-MAIN-20231129041834-20231129071834-00864.warc.gz", "rb")

archive_iterator = FastWarcArchiveIterator(
    file,
    record_types=WarcRecordType.response,
    parse_http=True,
    func_filter=is_http,
)
idx = 0
s = time.time()
process = psutil.Process(os.getpid())
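for record in archive_iterator:
    raw = record.reader.read()
    try:
        # Parse the page and serialize the tree; the result is discarded
        str(HTMLTree.parse(raw.decode("utf-8")))
    except Exception:
        # Skip records that fail to decode or parse
        pass

    idx += 1
    if idx % 1000 == 0:
        # Report RSS in MiB and the seconds taken for the last 1000 records
        print(process.memory_info().rss / 1024**2, time.time() - s)
        s = time.time()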

And here is the output (RSS in MiB, followed by the seconds taken for the last 1000 records):

91.0625 0.5237786769866943
108.921875 0.5350120067596436
132.015625 0.6991021633148193
132.828125 0.6604471206665039
144.25 0.6309971809387207
149.703125 0.9664597511291504
163.6875 1.0105879306793213
168.65625 0.9966611862182617
175.578125 1.1011531352996826
219.890625 0.9911088943481445
233.671875 1.1710970401763916
238.421875 1.0304219722747803
238.453125 0.9967031478881836
242.5625 1.016160011291504
246.40625 1.0688650608062744
246.40625 1.0952439308166504
246.40625 1.100167989730835
248.84375 1.077517032623291
248.84375 1.0870559215545654
248.84375 1.0127251148223877
248.84375 1.1437797546386719
253.125 1.0540661811828613
255.75 1.1424810886383057
255.75 1.0953960418701172
261.203125 1.1094980239868164
262.96875 1.293619155883789
265.6875 1.2223970890045166
265.6875 1.2639102935791016
265.6875 1.2537767887115479
265.6875 1.1750257015228271
266.203125 1.2749638557434082
266.203125 1.2159233093261719
272.484375 1.2846689224243164
275.515625 1.2536969184875488
275.515625 1.1752068996429443
275.515625 1.156343936920166
275.515625 1.249042272567749
276.765625 1.1668882369995117
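
One way to narrow this down is to check whether the growth is visible on the Python heap at all. The sketch below uses the standard-library `tracemalloc` module to record the traced heap size after each batch. The names `python_heap_growth` and `work` are hypothetical helpers, not part of Resiliparse or FastWARC; in practice `work` would wrap the `HTMLTree.parse` call from the script above. `tracemalloc` only sees allocations made through Python's allocators, so if these numbers stay flat while RSS keeps climbing, the growth is more likely in native code or in allocator fragmentation. That is an assumption to verify, not a diagnosis.

```python
import tracemalloc


def python_heap_growth(work, batches=5, batch_size=1000):
    """Call work(i) in batches and return the traced Python heap size
    (in bytes) measured after each batch.

    Hypothetical helper for illustration only.
    """
    sizes = []
    tracemalloc.start()
    try:
        for b in range(batches):
            for i in range(batch_size):
                work(b * batch_size + i)
            # get_traced_memory() returns (current, peak) in bytes
            current, _peak = tracemalloc.get_traced_memory()
            sizes.append(current)
    finally:
        tracemalloc.stop()
    return sizes


if __name__ == "__main__":
    retained = []
    # A deliberately leaky workload: every call keeps 1000 bytes alive
    leaky = python_heap_growth(lambda i: retained.append(bytes(1000)))
    # A workload that retains nothing between calls
    clean = python_heap_growth(lambda i: str(i) * 10)
    print("leaky:", leaky)
    print("clean:", clean)
```

A steadily rising list points at Python objects being retained; a flat list alongside rising RSS suggests looking outside the Python heap.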

prnake, Apr 06 '24 02:04