
Memory Leak (XML parsing, Templates, libzim ?)

Open rgaudin opened this issue 3 years ago • 52 comments

Bounty Hunters: See https://github.com/openzim/sotoki/issues/243#issuecomment-1170237942 for additional details

When running sotoki on a very large domain such as StackOverflow, memory becomes an issue. A full run (without images) on the athena worker (which has a slow CPU and disk) consumed nearly all of its 96GiB of configured RAM.

[Screenshot: netdata memory graph, 2021-09-10 08:12]

Note: because it took so long to complete, we've lost most of the netdata graph… Another run on a faster worker lasted half the time (but we have no RAM usage graph).

This is obviously unexpected and unsustainable. The expected RAM usage for StackOverflow is 36-38GiB (of which 15-18GiB is used by redis).

While it allowed us to discover some memory issues in libzim and scraperlib, it is clear that sotoki is leaking most of its generated content.

Below is a graph of memory consumption over a full run using a fake ZIM creator (so no data is written to the ZIM nor to disk).

[Graph: full-sotoki-nozim]

Don't look at the initial spike, which is a debug artefact. While the absolute values are small (a max of 215MiB), the curve should be flat. Here, we're seeing a 70% RAM increase.

rgaudin avatar Sep 10 '21 08:09 rgaudin

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Mar 02 '22 10:03 stale[bot]

@rgaudin @mgautierfr Maybe this new open-source tool would help us understand what is going on? https://github.com/bloomberg/memray
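
For reference, memray can also be driven from Python. Below is a minimal sketch, assuming memray's documented `Tracker` context manager and its `native_traces` option; `run_questions_phase` is a hypothetical placeholder for whichever sotoki stage is under investigation.

```python
# Hedged sketch: capture allocations for one suspect phase with memray's Python API.
from memray import Tracker

def run_questions_phase():
    ...  # hypothetical stand-in for the stage being profiled

with Tracker("questions-phase.bin", native_traces=True):
    run_questions_phase()

# Afterwards, `memray flamegraph questions-phase.bin` renders an HTML report.
```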

kelson42 avatar Apr 21 '22 05:04 kelson42

Note: a bounty can now be claimed for solving this issue. https://app.bountysource.com/issues/109534260-leaking-memory

Popolechien avatar Jun 23 '22 12:06 Popolechien

I guess you already made sure that no processes are hanging, too? (I guess a graph of actively running processes might also help.) I just took a quick look through the code and saw some subprocess.run calls, but sadly no info about them is logged (when they start and finish). Maybe there is a timeout option you could use to make sure they finish within a defined time ...

PS: It could also be that "zimscraperlib" is leaking, but I haven't taken a look at it yet ...
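
Regarding the timeout suggestion above, here is a minimal sketch of what bounding and logging those calls could look like; the command, file name and timeout values are made up.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

def run_step(args, timeout=3600):
    """Run one external preparation step, logging start/end and bounding its runtime."""
    logger.info("starting: %s", " ".join(args))
    try:
        subprocess.run(args, check=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        logger.error("step %s did not finish within %ss", args[0], timeout)
        raise
    logger.info("finished: %s", args[0])

# e.g. run_step(["7z", "x", "some-dump.7z"], timeout=7200)  # placeholder command
```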

TheCrazyT avatar Jun 28 '22 17:06 TheCrazyT

@TheCrazyT those processes happen early on in the whole scraping process and complete before we start the intensive work. Now that we're expecting outsiders to take a look at this, I'll rewrite the issue in the coming day(s) with as much detail as possible. Yes, https://github.com/openzim/python-scraperlib is a possible culprit.

rgaudin avatar Jun 28 '22 17:06 rgaudin

I'm currently looking at: https://githubmemory.com/repo/openzim/python-libzim/issues/117?page=2 Does that mean that "4. Observe that memory consumption is stable" is wrong because you still have problems? If you let the proof-of-concept script on that page run long enough (with a bigger dataset), would you get the same curve?

If your answer is yes, then I'm wondering if it could be a Python-internal problem, somewhere inside xml.sax.handler. At least it would also explain why most of the process is leaking. But the impact must be really, really small, because you won't notice it on short runs.
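
One rough way to test that hypothesis, sketched below under the assumption that psutil is available (the synthetic document is made up), would be to feed the same data through xml.sax repeatedly and watch the process RSS:

```python
# Hedged sketch: repeatedly parse a synthetic document (entities plus a long Body
# attribute) and print the process RSS; a parser-level leak should show up as a
# steadily climbing number across passes.
import io
import xml.sax

import psutil  # third-party, assumed available

ROW = b"<row Id='1' Body='&lt;p&gt;hello&amp;nbsp;world&lt;/p&gt;'/>" * 10_000
DOC = b"<posts>" + ROW + b"</posts>"

class NoopHandler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        _ = attrs.get("Body", "")  # touch the attribute, then discard it

proc = psutil.Process()
for i in range(20):
    xml.sax.parse(io.BytesIO(DOC), NoopHandler())
    print(f"pass {i}: rss={proc.memory_info().rss / 1024 / 1024:.1f} MiB")
```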

TheCrazyT avatar Jun 28 '22 19:06 TheCrazyT

Here's a more complete description for those not familiar with sotoki and this issue.

  • sotoki creates a ZIM file for a given Stack Exchange domain (there are 356; see sotoki --list-all)
  • a ZIM is a binary archive of a website: every resource is stored at a path and non-binary resources are compressed
  • It takes a ZIM reader to use/browse/read a ZIM. Most readers rely on libzim. End-user friendly ones are made by Kiwix. kiwix-serve is recommended for tests.
  • sotoki works off (7z-compressed) XML dumps provided by stack-exchange.
  • the leak certainly affects all Stack Exchange domains, but it's only a concern for StackOverflow because of its size. The dump is ~26GB and expands to somewhere near 120GB.
  • sotoki works in stages (see sotoki.scraper:start()). At the beginning, it downloads, extracts and then transforms (in subprocesses) the XML dumps into easier-to-work-with ones. This is sotoki.archives and sotoki.utils.preparation. Don't get distracted by how much memory these use: they are tailored for speed and complete before the actual work begins.
  • then sotoki parses (xml.sax) those prepared XML dumps and, for each record, stores some information, generates one or several HTML pages that are added directly to the ZIM, or both (a rough sketch of this pattern follows the list)
  • sotoki stores that to-be-used-later information in a Redis DB.
  • the Redis DB is in-RAM and quite large. For SO I believe it was around 18GB, which is fine: that's actual content that we need to store.
  • In parallel, images found in entries are fetched online and added to the ZIM.
  • all our stackoverflow runs are done in docker.
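
To make the parse-then-render stage concrete, here is a rough, hypothetical sketch of the pattern described above; the class names, template, and file path are illustrative, not sotoki's actual code.

```python
# Illustrative only: these names are placeholders, not sotoki's real API.
import xml.sax
from jinja2 import Template

PAGE = Template("<html><body><h1>{{ title }}</h1>{{ body }}</body></html>")

class FakeCreator:
    """Stand-in for the ZIM creator: just counts what would have been written."""
    def __init__(self):
        self.nb_items = 0

    def add_item(self, path, content):
        self.nb_items += 1

class RowHandler(xml.sax.handler.ContentHandler):
    """For each <row>, render an HTML page and hand it to the creator."""
    def __init__(self, creator):
        super().__init__()
        self.creator = creator

    def startElement(self, name, attrs):
        if name != "row":
            return
        html = PAGE.render(title=attrs.get("Title", ""), body=attrs.get("Body", ""))
        self.creator.add_item(f"questions/{attrs.get('Id', '')}", html)

creator = FakeCreator()
xml.sax.parse("prepared_posts.xml", RowHandler(creator))  # placeholder file name
print(creator.nb_items, "pages would have been added to the ZIM")
```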

Here's a graph that shows how it looks over time (this was an OOM'd run with ~60GB of RAM).

[Screenshot: memory usage graph, 2021-08-30 08:43]

The huge spike at the beginning is the preparation step. Then you can see sotoki processing the Users (storing info in redis and creating user pages in the ZIM), followed by a drop in usage: this is when we release some of the info that we won't need anymore (see :cleanup_users(), :purge() and :defrag_external() in sotoki.utils.database). Then it starts processing the questions/answers and the RAM increases linearly until it either OOMs or completes.

Latest stackoverflow run using 2022-03 dumps used approx. 125GB of RAM (run wasn't monitored)


What the RAM consumption shows is that the entirety of the content that we read/write (which of the two is unclear, because we both transform and compress while making the ZIM) goes into RAM and stays there.

Although Python doesn't release memory back to the system, this memory is genuinely unavailable: we hit OOM on systems without enough memory.

It seems (no hard evidence, just the match of XML size vs RAM usage) that the size of the images (fetched online and added to the ZIM) does not impact RAM consumption, which may indicate that the leak doesn't affect binary item addition in scraperlib/libzim.

The moving parts involved (and thus the suspects) are:

  • Python 3.8
  • the XML parser (xml.sax)
  • sotoki's shortcuts (sotoki.shared.Global)
  • templating (jinja2 3.0.1)
  • scraperlib (zimscraperlib 1.6.0)
  • libzim (python-libzim), a Cython wrapper around the C++ libzim, for which we have checked and verified that there is no such memory leak.

rgaudin avatar Jun 29 '22 16:06 rgaudin

Talking about the sax parser ... it is probably better to call close() after you are done:

```python
import xml.sax

parser = xml.sax.make_parser()

# ...

try:
    parser.close()
except xml.sax.SAXException:
    # not sure if anything can be done here
    # just ignoring this error for now ;-)
    pass
```

I tested it in Google Colab, and the docstring of the close function says the following:

This method is called when the entire XML document has been
passed to the parser through the feed method, to notify the
parser that there are no more data. This allows the parser to
do the final checks on the document and empty the internal
data buffer.

The parser will not be ready to parse another document until
the reset method has been called.

close may raise SAXException.

(same info as here : https://github.com/python/cpython/blob/main/Lib/xml/sax/xmlreader.py#L144 )

The default parser, ExpatParser, implements IncrementalParser and seems to use a "pyexpat C module", at least according to this comment: https://github.com/python/cpython/blob/f4c03484da59049eb62a9bf7777b963e2267d187/Lib/xml/sax/expatreader.py#L88

I also didn't find anything that calls close automatically. So my guess is that this "pyexpat C module" still holds some resources (probably not the whole document) and memray can't see it because it is not pure Python.
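
For completeness, the incremental lifecycle that docstring describes would look roughly like this; the file name and chunk size are arbitrary.

```python
import xml.sax

parser = xml.sax.make_parser()
parser.setContentHandler(xml.sax.handler.ContentHandler())

with open("dump.xml", "rb") as fh:          # placeholder file name
    for chunk in iter(lambda: fh.read(64 * 1024), b""):
        parser.feed(chunk)                  # incremental parsing
parser.close()                              # final checks, empty the internal buffer
parser.reset()                              # required before the parser can be reused
```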

TheCrazyT avatar Jun 29 '22 18:06 TheCrazyT

It is still possible that the xml library is leaking. At least I created a proof of concept with a Jupyter notebook:

https://gist.github.com/TheCrazyT/376bb0aac2d283f9ba505b889ef5da51

The bad thing is ... I currently can't really figure out what the problem is, but it seems to be caused by the HTML entities in combination with long text inside the Body attribute of the row tag. I was hoping that it was just a special entity causing the problems.

I will probably try to create a better proof of concept with a memory profiler ... just to be sure.

Edit: Oh well, profiling didn't show the result that I was expecting. Somehow I am still evaluating this wrong; there seems to be another condition I cannot figure out. For example, I just duplicated the row and numbered it, because I was expecting that the other rows would also stay in memory. But for some reason only the content of the first row stayed in memory, although the content of the other rows was nearly the same. I'm a bit clueless at the moment ...

Edit again: Guess that was just a red herring ... the text stays in memory for some reason, but you can't really increase the memory usage through it. At least I was not able to produce a case where the following rows would also stay in memory. So you can only increase memory usage if your first row is very, very large, and that won't match the described behaviour ...

Edit 04.07.2022: Now analysing a core dump of a full run-through with a smaller dataset ("astronomy"). There seem to be many "NavigableString" objects left over from beautifulsoup, but not sure if that means anything. Also some URLs seem to be left in memory when comparing multiple runs. Although I already checked by memory-profiling the beautifulsoup commands themselves, there might be something special about when it can leak memory. (According to https://bugs.launchpad.net/beautifulsoup there also seem to be problems around "recursion depth"; not sure if such errors could also leave memory uncleaned.)

Edit 06.07.2022: Just another Jupyter notebook to show that memory is leaking: https://gist.github.com/TheCrazyT/117dd8371f63708e6f655e444f4bfaa4 This time by executing sotoki itself and with garbage collection near the sys.exit. I guess it is kinda strange that the text of row number 12409 of 12504 is still within the dump. The last row would be understandable ... but that entry is kinda old. Another test notebook also showed that there are still about 24097 dict objects, 13144 list objects and 11848 tuple objects referenced within the core dump. Not so sure I can figure out how/where those are still referenced. My guess is that maybe some are still indirectly referenced by the Global object.
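
For anyone wanting to reproduce that kind of census without a core dump, a minimal in-process sketch (run just before exit) could look like this:

```python
# Hedged sketch: count live objects by type right before shutdown; the core-dump
# approach above counts the same kinds of objects, just post-mortem.
import gc
from collections import Counter

def object_census(top=10):
    gc.collect()
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    for name, count in counts.most_common(top):
        print(f"{name:<20} {count}")

object_census()
```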

TheCrazyT avatar Jul 02 '22 07:07 TheCrazyT

Guess I should run a comparative mem-profile with Python 3.10, because they seem to have changed something about garbage collection in this commit:

https://github.com/python/cpython/commit/59af59c2dfa52dcd5605185263f266a49ced934c

Hm, weird: I just noticed that the tag on the commit is v3.11.0b3 (although somewhere else it said 3.10), so maybe it is not released yet ...

Edit: Well, a lot of packages just won't work with Python 3.11.0b3. The only option left is to get the code of Python 3.8, cherry-pick that commit and cross fingers that it is compatible 😞.

Update 09.07.2022: I have now run a patched Python 3.8 with the following patch: https://gist.github.com/TheCrazyT/db4323d9cb3d3cc71b2a69020b1fa535 With the "blender" dataset the peak memory dropped by about 1.2 MB, but I guess more would be possible with some of the other modifications that were made in Python 3.11. (At least I still see room for improvement, since the curve from mprof still seems to increase.) It is not much, but scaled up it might be good enough for the "stackoverflow" dataset. However, because I do not have the resources, I'm not able to run it against that.

Guess the biggest problem will be adding the patch to the Dockerfile. The only approaches I can think of would increase the size and build time of the Docker image dramatically, because you would need to compile two Python extensions (so the image would need build tools and compile time). Adding the binaries themselves is probably not a good solution either.

Update 11.07.2022: I'm still missing something; the current changes (commits below) do not have much impact. Not really satisfying, to be honest.

TheCrazyT avatar Jul 06 '22 20:07 TheCrazyT

Sorry if this is too much to ask, but I would like to take a look, though Jupyter notebooks aren't quite helpful for me.

Can you please write a small Python script, with a provided XML file? (I recommend Wikipedia's XML dumps; they're beefy enough to illustrate trends like this.)

Thank you!

albassort avatar Jul 11 '22 15:07 albassort

@Retkid sorry, besides fishing strings out of the core dump and profiling the whole of sotoki, I have no simple script that profiles the memory.

I'm currently looking at the output of "log-malloc2" to figure out if there is something weird (something allocated but not freed).

It currently leads me to "lxml" inside "xmlBufGrow", but that could also be a red herring.

TheCrazyT avatar Jul 14 '22 08:07 TheCrazyT

Thanks for the updates. I missed those edits above as they don't trigger notifications.

FYI, a nopic (without any pictures) run of stackoverflow is about to end. This code includes the parser.close() patch but, as expected, there is no visible improvement.

It's important to note that the RAM usage is equivalent to the version with pictures, confirming that images are not involved, which can incidentally exclude libzim itself (for binary entries).

[Screenshot: memory usage graph, 2022-07-14 09:17]

rgaudin avatar Jul 14 '22 09:07 rgaudin

@rgaudin

OK ... here is another thing I noticed. The redis package actually has a "command stack" ( https://github.com/redis/redis-py/blob/c54dfa49dda6a7b3389dc230726293af3ffc68a3/redis/client.py#L2049 ) that only gets emptied after a commit.

But here is the catch: commit_maybe won't commit as often as you would think it would: https://github.com/openzim/sotoki/blob/692f59b601f2681a76c37351c41f799bb6259c08/src/sotoki/utils/database.py#L496

A commit will only happen if nb_seen is divisible by 1000: https://github.com/openzim/sotoki/blob/692f59b601f2681a76c37351c41f799bb6259c08/src/sotoki/utils/database.py#L82

But since you increase it by 4 + len(post.get("Tags", [])), it requires a bit of luck to land exactly on something divisible by 1000. What if you get 1001, 1002, 1003, 1004, 1005... after incrementing nb_seen by 4 + len(post.get("Tags", []))? It will need another loop until it gets lucky and hits exactly 2000, and so on.

So my guess is that the command_stack just grows and grows, because a commit barely ever happens. (Currently this theory has the problem that you do not see any drop in the graph at all.)

To be honest, I would simply remove the + len(post.get("Tags", [])), but I'm not sure if it has any special meaning. Another solution would be to reset nb_seen to 1 inside should_commit if it is bigger than 1000, and return self.nb_seen > self.commit_every.
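
A quick, self-contained simulation of that argument; the tag counts and the 1000 threshold are made up to mirror commit_every, and this is not sotoki's actual code.

```python
# Hedged sketch: compare how often a `nb_seen % 1000 == 0` check fires versus a
# plain threshold check, when nb_seen grows by 4 + a random tag count.
import random

random.seed(0)
nb_seen = since_last = 0
modulo_commits = threshold_commits = 0

for _ in range(100_000):                 # pretend posts
    step = 4 + random.randint(0, 5)      # made-up number of tags per post
    nb_seen += step
    since_last += step
    if nb_seen % 1000 == 0:              # "divisible by 1000" style check
        modulo_commits += 1
    if since_last >= 1000:               # threshold style check
        threshold_commits += 1
        since_last = 0

print("modulo-style commits:   ", modulo_commits)
print("threshold-style commits:", threshold_commits)
```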

TheCrazyT avatar Jul 17 '22 08:07 TheCrazyT

Good point; I'd go with self.nb_seen > self.commit_every as well, as it seems safer.

rgaudin avatar Jul 18 '22 11:07 rgaudin

Please submit a PR once you have a working version ; thanks 🙏

rgaudin avatar Jul 18 '22 11:07 rgaudin

I would go with a new last_commit and have something like:

```python
if self.nb_seen > self.last_commit + self.commit_every:
    do_commit(...)
    self.last_commit += self.commit_every
```

This way we would not reset nb_seen and could still use it for other statistics. (My two cents.)
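
Spelled out as a self-contained sketch (the class is illustrative, and commit() merely stands in for flushing the redis pipeline):

```python
# Hedged sketch of the last_commit idea: nb_seen keeps growing (and stays usable
# for statistics) while a commit happens roughly every commit_every records.
class Recorder:
    def __init__(self, commit_every=1000):
        self.commit_every = commit_every
        self.nb_seen = 0
        self.last_commit = 0

    def commit(self):
        print(f"flushing pipeline at nb_seen={self.nb_seen}")

    def record(self, weight=1):
        self.nb_seen += weight
        if self.nb_seen > self.last_commit + self.commit_every:
            self.commit()
            self.last_commit += self.commit_every

rec = Recorder()
for weight in (4, 7, 5, 990, 6, 4):
    rec.record(weight)
```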


> (Currently this theory has the problem that you do not see any drop in the graph at all.)

This is indeed a limitation of your theory (even if it is a good catch). On such a long run, we should see a few drops from time to time. We will see how memory usage behaves with this fix.

mgautierfr avatar Jul 18 '22 13:07 mgautierfr

Well, if I saw it correctly, every thread has its own nb_seen, starting from 0.

Maybe it's just not noticeable if one thread gets lucky and hits a commit.

Update:

Oh, by the way, the reason I have not created a pull request yet is that I was just logging the "commit".

For some reason I still only get the output at the end of the "Questions" progress, and that is driving me slightly insane at the moment.

Update again:

I just noticed that the actual recording phase is within "Questions_Meta" (within PostFirstPasser).

But that is near the beginning.

And this means ... I probably still haven't found the main memory-leak problem, I guess.

TheCrazyT avatar Jul 18 '22 14:07 TheCrazyT

I still believe that the Rewriter that gets executed during the HTML generation somehow causes a leak.

Sadly I'm not able to reproduce the problem if I just test the Rewriter on its own. But if you disable the rewrite method, you can see that the memory-profile curve seems to be flatter.

I already thought that BeautifulSoup might cause the leak because it does not seem to close the XML parser in case of an error during encoding detection: https://github.com/wention/BeautifulSoup4/blob/03a2b3a9d1fc5877212d9d382a512663f24c887d/bs4/builder/_lxml.py#L227

Although that's true ... it is not causing the leak. (At least I was not able to create an example that would trigger a UnicodeDecodeError, and I somehow doubt it would cause a steady increase of memory on a normal dataset.)

TheCrazyT avatar Jul 19 '22 11:07 TheCrazyT

Hi all, seeing that there did not seem to be progress on this, I started looking at it, using the memray tool and running on a dump of the gaming Stack Exchange site (I can't run anything more massive).

I've found that there does not seem to be a leak per se, but instead a diverging memory-usage behavior caused mainly by two culprits (both native memory, so not easily found):

  • Xapian search engine
  • Zim file compression

The Xapian behavior is especially bad, as it indexes the entire text of all the comments / questions / etc., so it grows very fast with the number of words rather than the number of questions. Disabling the file compression also helps keep memory usage down, as the compression stream is kept for the entire run and grows with the size of the document.

Both were enabled by default with no way of disabling them, so I've proposed two PRs on sotoki and python-scraperlib to make these two functionalities optional and disabled by default (enabling them explicitly from the command line).

The data about my run can be found at link, and I'll keep it available as long as the ticket is open.

If you can, please execute a run with this change on the main stackoverflow site so we can ascertain whether my analysis is actually correct. Thank you.

parvit avatar Jul 25 '22 16:07 parvit

For reference, this is the usage with the current code in both projects: [graph: original]

This is with my change and with both functionalities kept disabled: [graph: complete_fix]

To my understanding, and after viewing the memray reports, the second graph corresponds to normal behavior for the program.

parvit avatar Jul 25 '22 16:07 parvit

@parvit, thank you for your contribution, but compression and full-text indexing are enabled by default on purpose, because that's what's wanted 99% of the time. It is already possible to disable them should they not be wanted:

```python
Global.creator = Creator(
    ...
).config_verbose(True).config_indexing(False, Global.conf.iso_lang_3).config_compression("none")
```

I am skeptical about compression or indexing having a huge impact on memory, as I believe those work at most at the ZIM-cluster level, which is fixed-size. So the size of the content should not matter much… unless there's a bug in there.

Finally, those features are essential to sotoki so we don't want to disable them.

I will try to run it with both disabled on a large SE site to confirm or refute your hypothesis.

rgaudin avatar Jul 25 '22 16:07 rgaudin

@rgaudin Sure, you can check for yourself by looking at the analysis in the drive folder I've provided. In the original run, 2/3 of the total memory was allocated by the compression + indexer.

I can make the change so that it stays enabled by default with the possibility of disabling it, if you feel that's better for the project.

parvit avatar Jul 25 '22 16:07 parvit

@rgaudin I hope you do find at some point what you think is the bug; however, I always let myself be guided by data when doing these performance analyses, so I don't think I'll contribute anything else to this, because to me the data was very clear and so was the fix.

Good luck.

parvit avatar Jul 26 '22 18:07 parvit

@parvit IMO you are mixing two different things:

  • The sotoki options and default behaviour: we don't want to change them and we don't need to change them. This has nothing to do with the memory leak.
  • The memory leak itself, which is the topic. You have put forward a theory and provided evidence. We still have to redo the experiment to see if we come to the same conclusion, but we don't need your PRs for that.

kelson42 avatar Jul 26 '22 19:07 kelson42

Well, I'll admit I jumped to conclusions a bit too fast.

I'll be awaiting the results of your tests.


parvit avatar Jul 26 '22 19:07 parvit

@TheCrazyT PR #268 did not improve memory usage.

[Graph: baseline]

[Graph: redis-commit]

Look at the durations, as the second image is not complete (I had to restart the backend).

Will try @parvit's suggestion tomorrow.

rgaudin avatar Jul 28 '22 18:07 rgaudin

Thanks for testing.

That is a really disappointing result. I must admit that I always tested with the "FakeCreator" that you mentioned in your first comment and with a much smaller dataset.

I wish you luck in finding the leak.

It probably has nothing to do with the Python code itself, but I'm currently out of ideas. It is also really difficult to create a simple and small proof of concept of the leak.

Update: ~~Could also be some content-related thing.~~ ~~Since lxml is indirectly used on the posts, some stuff like the "billion laughs" expansion bug might be a thing, but I was not able to test that yet.~~ Nope ... that's not a thing. It would probably cause way too much memory usage before a crash; the process won't run through with that (and if it did ... you would see a big drop after the memory allocation, so this is clearly not the situation here).

TheCrazyT avatar Jul 28 '22 20:07 TheCrazyT

OK, here are the graphs of the run with neither indexing nor compression.

[Graph: baseline]

[Graph: nocomp-noindex]

This is very interesting. After the initial hill, the curve only ever descends. Also, on this 5GB-capped run, the regular code used up to 4.4GB (at the end) while this one peaked at 2.44GB, roughly 2GB less (45% of the total).

Thank you @parvit for bringing this to our attention. I need to process this and understand exactly what is going on before jumping to conclusions.

full netdata report

rgaudin avatar Jul 29 '22 16:07 rgaudin

Happy that my analysis was pointing in a useful direction.

parvit avatar Jul 29 '22 16:07 parvit