reddit-html-archiver
Very high memory usage with write_html.py
On my system, running write_html.py without arguments uses too much memory and takes too long. After more than 30 minutes I had to stop it manually because my system became unresponsive. Memory usage grew slowly but relentlessly until write_html.py had consumed all 8 gigabytes of RAM plus 5 gigabytes of swap.
My data directory is currently 3.1 gigabytes, and it will keep growing because I'm always fetching new subreddits.
How can I help debug this?
P.S. My knowledge of Python is still very limited...
I think you could add something like this to the start of every function in write_html.py, and try to determine which function grows memory the most.
import os
import psutil

# in functions:
process = psutil.Process(os.getpid())
print('function X runs, memory used: %s KiB' % (process.memory_info().rss // 1024))
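Rather than pasting that snippet into every function, the same idea can be wrapped in a decorator. This is a hypothetical sketch (log_memory and build_big_list are made-up names, not part of write_html.py); it uses the stdlib resource module instead of psutil, so nothing extra needs installing:

```python
import functools
import resource  # stdlib, Unix-only

def log_memory(func):
    """Print peak RSS after each call, to spot which function grows memory.
    On Linux ru_maxrss is reported in KiB (on macOS it is in bytes)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print('%s done, peak RSS so far: %d' % (func.__name__, peak))
        return result
    return wrapper

@log_memory
def build_big_list():
    # stand-in for a write_html.py function under investigation
    return list(range(100_000))

data = build_big_list()
```

Decorating each suspect function once is less invasive than editing every function body, and the per-call prints land in the same log captured by the timeout command below.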
I launched the following command for one minute only:
timeout 60 ./write_html.py > write_html.log 2>&1
@fturco Awesome man, thanks for putting in the effort. I'll try to fix it this weekend.
Can you try commenting out these two lines and seeing how the memory goes?
sub_links.append(l)
user_index[l['author']].append(l)
The script intentionally loads all of a sub's content into memory, so it's kind of a big design flaw. I'll have to rewrite some of it, maybe a lot of it. Not keeping all of the comments in memory at once might be enough to get by.
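The streaming alternative could look something like this. A hypothetical sketch only: the real write_html.py internals differ, and iter_links and the column names here are made up for illustration (the demo uses an in-memory CSV standing in for data/&lt;sub&gt;/links.csv):

```python
import csv
import io

def iter_links(csv_file):
    """Yield one link dict at a time instead of accumulating a full list,
    so memory use stays flat no matter how large links.csv gets."""
    for row in csv.DictReader(csv_file):
        yield row

# demo: a tiny CSV in memory, standing in for a subreddit's links.csv
sample = io.StringIO(
    "id,author,title\n"
    "abc,alice,first post\n"
    "def,bob,second post\n"
)
authors = [link['author'] for link in iter_links(sample)]
```

Each link page could then be rendered as soon as its row is read; only features that genuinely need a global view (like per-subreddit index pages) would have to keep anything around.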
After commenting out those lines, write_html.py no longer uses too much memory.
But I noticed that the index.html files for each subreddit are now missing, so I can't browse the archives.
Okay I pushed an update. Not everything was optimized, but I think it should be a lot better.
If it's still bad, can you try commenting out this line:
user_index[l['author']].append(l)
I tried running write_html.py again after updating it, and it seems to have successfully generated all the HTML pages. While processing posts, my system peaked at 2.9 GiB of used RAM, then dropped back to 1.1 GiB once write_html.py finished. So that's a lot better.
I haven't yet tried commenting out the line you specified.
To give you a better idea, I have already archived 16 subreddits:
$ ./write_html.py
...
all done. 581830 links filtered to 581830
$ du -sm data
4715 data
Thanks for the stats. Well, as it stands now, I'm basically loading all /data/*/links.csv data into memory. So once you have 13GB of link data (links only, not comments) in your archive, you won't be able to generate the HTML.
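One way to stretch that limit without a full rewrite might be to keep only the small fields the user pages actually need (say, author and link id) instead of whole rows. A hypothetical sketch, not the actual write_html.py code (build_user_index and the column names are invented for illustration):

```python
import csv
import io
from collections import defaultdict

def build_user_index(csv_file):
    """Map author -> list of link ids. Storing two short strings per link
    instead of the full row dict cuts the index's memory footprint."""
    index = defaultdict(list)
    for row in csv.DictReader(csv_file):
        index[row['author']].append(row['id'])
    return index

# demo: a tiny CSV standing in for one subreddit's links.csv
sample = io.StringIO(
    "id,author,title\n"
    "abc,alice,first\n"
    "def,bob,second\n"
    "ghi,alice,third\n"
)
idx = build_user_index(sample)
```

The full rows could then be re-read from disk only for the links each user page needs, trading a second pass over the CSVs for a much smaller resident index.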
So uhh, I dunno. Maybe we'll leave this open and I'll do more optimization in the future. Bug me when you get to 8GB of data?
OK, sure.