
Very high memory usage with write_html.py

Open ghost opened this issue 6 years ago • 10 comments

On my system, running write_html.py without arguments requires too much memory and takes too long. After more than 30 minutes I had to stop it manually because my system became unresponsive. Memory usage increased slowly but relentlessly, until write_html.py had used all 8 gigabytes of RAM plus 5 gigabytes of swap.

My data directory is currently 3.1 gigabytes. It will continue to expand in the future because I'm always fetching new subreddits.

How can I help debug this?

P.S. My knowledge of Python is still very limited...

ghost avatar Oct 11 '19 10:10 ghost

I think you could add something like this to the start of every function in write_html.py, and try to determine which function grows memory the most.

import os
import psutil
# at the start of each function:
process = psutil.Process(os.getpid())
print('function X runs, memory used: %d KiB' % (process.memory_info().rss / 1024))
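To avoid pasting that snippet into every function, the same idea can be wrapped in a decorator. This is a sketch using the standard-library resource module (Unix-only) rather than psutil; build_index is a hypothetical stand-in for one of the write_html.py functions, not a real name from the script.

```python
import functools
import resource  # Unix-only standard library


def log_memory(func):
    """Print peak resident memory (ru_maxrss) after each call,
    to help locate which function grows memory the most."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
        print('%s returned, peak memory: %d KiB' % (func.__name__, peak_kib))
        return result
    return wrapper


@log_memory
def build_index():  # hypothetical stand-in for a write_html.py function
    return list(range(100_000))
```

Decorating each suspect function this way keeps the instrumentation in one place and is easy to strip out once the leak is found.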

libertysoft3 avatar Oct 12 '19 05:10 libertysoft3

I launched the following command for one minute only:

timeout 60 ./write_html.py > write_html.log 2>&1

write_html.zip

ghost avatar Oct 12 '19 15:10 ghost

@fturco Awesome man, thanks for putting in the effort. I'll try to fix this over the weekend.

libertysoft3 avatar Oct 26 '19 02:10 libertysoft3

Can you try commenting out these two lines and seeing how the memory goes?

sub_links.append(l)
user_index[l['author']].append(l)

The script intentionally loads all content for a sub into memory, so it's a fundamental design problem. I'll have to rewrite some of it, maybe a lot of it. Not holding all of the comments in memory at once might be enough to get by.

libertysoft3 avatar Nov 05 '19 10:11 libertysoft3

After commenting out those lines, write_html.py no longer uses too much memory. But I noticed index.html files for each subreddit are now missing, so I can't display the archives with a browser.

ghost avatar Nov 05 '19 21:11 ghost

Okay I pushed an update. Not everything was optimized, but I think it should be a lot better.

If it's still bad, can you try commenting out this line:

user_index[l['author']].append(l)

libertysoft3 avatar Nov 12 '19 10:11 libertysoft3

I tried running write_html.py again after updating it, and it seems it successfully generated all HTML pages. While processing posts, my system reached a peak of 2.9 GiB of used RAM and then it went back to 1.1 GiB after write_html.py finished. So that's a lot better.

I haven't yet tried commenting out the line you specified.

ghost avatar Nov 12 '19 11:11 ghost

To give you a better idea, I have already archived 16 subreddits:

$ ./write_html.py
...
all done. 581830 links filtered to 581830

$ du -sm data
4715    data

ghost avatar Nov 12 '19 15:11 ghost

Thanks for the stats. As it stands now, I'm basically loading all of the /data/*/links.csv data into memory. So once you accumulate around 13 GB of link data (links only, not comments) in your archive, you won't be able to generate the HTML.

So, I dunno. Maybe we'll leave this open and I'll do more optimization in the future. Bug me when you reach 8 GB of data?

libertysoft3 avatar Nov 13 '19 07:11 libertysoft3

OK, sure.

ghost avatar Nov 13 '19 12:11 ghost