ktistec icon indicating copy to clipboard operation
ktistec copied to clipboard

Possible memory leak

Open vrthra opened this issue 2 years ago • 52 comments

There is a possible memory leak; Here is my memory utilization;

memory

At this point, the machine becomes unresponsive, and has to be rebooted. It is an oracle cloud instance, and Ktistec is the only application running on it.

$ uname -a
Linux instance-web 5.15.0-1022-oracle #28-Ubuntu SMP Wed Oct 26 09:38:05 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.1 LTS
Release:	22.04
Codename:	jammy
$ free -m
               total        used        free      shared  buff/cache   available
Mem:             964         217         107           1         638         595
Swap:              0           0           0

vrthra avatar Nov 17 '22 20:11 vrthra

how many inbound messages are you typically receiving in a given day? i'm assuming you're operating at about my scale, but i want to check. my vps has 1gig of memory, and my instance fits comfortably in that. memory usage does increase, but it doesn't grow until a crash. at least for me.

toddsundsted avatar Nov 17 '22 21:11 toddsundsted

Counting by the posts I see in the front page, I just got 9 posts yesterday before the server became unresponsive. I am not sure how many messages (pings etc.) I received. I have started collecting my logs; I will count the number of lines before this happens again and update.

vrthra avatar Nov 17 '22 22:11 vrthra

ok, i get 30-40 posts a minute on average. there's usually a pretty steep memory ramp when i first start the server, but then it generally plateaus, and it will run for days or weeks like that. i've had one or two crashes historically, but those might have been the server being killed off for memory issues unrelated to the server.

that's not to say there's not a memory leak. i'm keeping an eye on my instance in case i introduced a problem recently.

toddsundsted avatar Nov 17 '22 22:11 toddsundsted

For me, there was a small steep climb in the first hour, followed by a dip, and then constant climb in 12 hours. See this image.

memleak

I will update what happens today; So far, it has plateaued.

memleak

vrthra avatar Nov 17 '22 22:11 vrthra

Here is what I have currently; There is a small but noticeable climb in two days.

memleak

vrthra avatar Nov 19 '22 21:11 vrthra

i'm trying to find an easy way to chart the memory usage/growth on my server. i don't believe i've done anything that would prevent instances from being garbage collected, but i could be wrong, or there could be bugs in the core libraries.

toddsundsted avatar Nov 20 '22 04:11 toddsundsted

I have not monitored my server in detail, yet. But ktistec (in docker) crawled from 600MB to 1600MB in the past 5 days. I have been a bit more active over the weekend, so I am not sure if that's a problem.

Edit: that's on a VPS with 4GB RAM

JayVii avatar Nov 21 '22 07:11 JayVii

memory usage should generally be flat. it doesn't, for example, intentionally try to keep an increasing subset of data in memory. so if you're seeing growth like that day after day, then i'm confident you're seeing a memory leak. i'm going to add additional logging to dump gc and memory stats periodically to see if that help shed light on what's going on.

toddsundsted avatar Nov 21 '22 13:11 toddsundsted

commit https://github.com/toddsundsted/ktistec/commit/4ce1d3e1ddf5a0ad835b870455265cf86466fb9b introduces some code to capture memory information.

toddsundsted avatar Nov 23 '22 03:11 toddsundsted

Thanks; I have deployed this version. Let me know what to look for and report.

vrthra avatar Nov 23 '22 20:11 vrthra

Is there any way to see the current commit/version from a deployed instance?

vrthra avatar Nov 23 '22 20:11 vrthra

there is not. i thought version would be more useful—i didn't count on all the early adopters building from main

toddsundsted avatar Nov 24 '22 12:11 toddsundsted

@vrthra i've found a petty good candidate bug for the memory leak. i believe sqlite is caching prepared statements for all queries issued. this wouldn't be a huge problem—queries should be parameterized—except in one query i embedded an id directly in the string instead of passing the value in as a parameter. this is in code that handles inbound activities, so it's going to result in a lot of caching of non-reusable statements. i don't know if this is the whole problem, but in local tests for me it's made a big difference in footprint over time.

toddsundsted avatar Nov 24 '22 12:11 toddsundsted

@vrthra i created a branch with some changes that seem to make thing significantly better for me:

https://github.com/toddsundsted/ktistec/tree/memory-problems

i need to make some improvements to the code before i put it in main, but it should improve things for you, as well.

toddsundsted avatar Nov 24 '22 17:11 toddsundsted

After deploying this, I get

docker run -p 80:3000 -v `pwd`/db:/db -v `pwd`/uploads:/uploads  -it social:latest
1
2
GC::Stats(@heap_size=786432, @free_bytes=24576, @unmapped_bytes=0, @bytes_since_gc=749696, @total_bytes=749696)
Unhandled exception: more than one row (DB::Error)
  from /build/lib/db/src/db/query_methods.cr:81:9 in 'query_one:args'
  from /build/src/framework/model.cr:246:9 in 'find:iri:include_deleted:include_undone'
  from /build/src/framework/model.cr:256:9 in 'find?:iri:include_deleted:include_undone'
  from /build/src/models/activity_pub/activity/delete.cr:9:5 in 'object?'
  from /build/src/rules/content_rules.cr:43:3 in 'each_match'
  from /build/lib/school/src/school/domain/domain.cr:95:15 in 'each_match:trace'
  from /build/lib/school/src/school/domain/domain.cr:123:11 in 'run'
  from /build/src/rules/content_rules.cr:87:5 in 'run'
  from /build/src/utils/database.cr:41:15 in 'recreate_timeline_and_notifications'
  from /build/src/utils/database.cr:16:5 in 'recreate_timeline_and_notifications'
  from /build/src/database/migrations/000019-update-timeline-and-notifications.cr:5:1 in '->'
  from /build/src/framework/database.cr:60:41 in 'do_operation'
  from /build/src/framework/framework.cr:136:14 in '__crystal_main'
  from /usr/share/crystal/src/crystal/main.cr:115:5 in 'main_user_code'
  from /usr/share/crystal/src/crystal/main.cr:101:7 in 'main'
  from /usr/share/crystal/src/crystal/main.cr:127:3 in 'main'
  from src/env/__libc_start_main.c:95:2 in 'libc_start_main_stage2'
make: *** [Makefile:4: ktistec] Error 1

Any idea?

vrthra avatar Nov 25 '22 00:11 vrthra

it looks like your database has two activitypub object records with the same id/iri. that migration rebuilds your notification and timeline. the server is processing a delete activity but it's finding more than one object. strictly speaking you probably don't need to run that migration, but the duplicate object is disturbing. i'm at a loss at the moment on how it could have been created. something like the following query should return the duplicates. i'd be interested to know how many you have: select iri from objects group by iri having count(id) > 1;

toddsundsted avatar Nov 25 '22 02:11 toddsundsted

It doesn't seem link that is causing it.

~/ktistec$ sqlite3 db/ktistec.db 'select iri from objects'  | wc -l
5287
~/ktistec$ sqlite3 db/ktistec.db 'select iri from objects group by iri having count(id) > 1'
~/ktistec$ 

vrthra avatar Nov 25 '22 05:11 vrthra

ok. i don't have a better theory about the trace at the moment. thanks for testing!

toddsundsted avatar Nov 25 '22 09:11 toddsundsted

How do I turn off the migration?

vrthra avatar Nov 25 '22 10:11 vrthra

you can either delete src/database/migrations/000019-update-timeline-and-notifications.cr before you build, or give me ~1 more day to push out the changes in this branch. those changes will include a try/catch in that migration to log the stack traces, but continue running.

toddsundsted avatar Nov 25 '22 12:11 toddsundsted

@vrthra all changes have been merged into main, including some protection against the migration failing—you'll still see the error messages but the server should continue to start.

i'd be interested in how many error messages you get, along with any other logging output. i don't get any errors running that migration, so i want to get a better understanding of what you're seeing.

toddsundsted avatar Nov 25 '22 23:11 toddsundsted

These are the logs when starting. I can also supply my ktistec.db or give you access to my oracle machine if it helps you. I have no private information in that machine, and happy to help if you want to debug what is happening.

ktistec.log

vrthra avatar Nov 27 '22 00:11 vrthra

weird weird weird. i'm guessing you're following some server i'm not and it's sending something i don't expect. maybe it's another aspect of the guppe problems. the one common thing is that it's only happening when processing deletes, so it's probably not noticeable in your timeline or notifications, but if you see anything that looks amiss there, let me know. i've signed up to guppe now, so if it shows up in any way related to that i should see if myself...

toddsundsted avatar Nov 27 '22 00:11 toddsundsted

@vrthra have you seen any improvement in memory usage on your system?

toddsundsted avatar Nov 28 '22 14:11 toddsundsted

Yes, it is much better now! Thank you for the updated version. This was deployed on 27th November; See the memory graph after that. memory I will keep an eye out on this for a few more days and I will let you know.

vrthra avatar Nov 28 '22 19:11 vrthra

The original huge growth in memory is gone, but there is still a consistent small increase in memory -- 62% to 66% in the first day and 66% to 69% in the last one day.

memory

Screen Shot 2022-11-30 at 6 03 23 am Screen Shot 2022-11-30 at 6 04 27 am

(three days after, notice the climb). Screen Shot 2022-11-30 at 6 04 27 am

vrthra avatar Nov 29 '22 19:11 vrthra

Running the build including the latest fixes only since about 24h in total, I can confirm the big leak is gone: image I can't yet confirm an ever so slightly increase, will have to wait some more time to discern a pattern over the GC-related ups-and-downs: image

felixkrohn avatar Nov 29 '22 20:11 felixkrohn

thanks! there are still a few more possibilities, one is that there is another instance somewhere of a prepared statement with an interpolated value. sqlite also allocates pages for its own work, and i haven't looked to see how they are managed.

generally what i want to understand now is does this plateau and/or does this eventually crash. i'll set some vm limits next time i start my instance and see what happens when it bumps into them.

toddsundsted avatar Nov 30 '22 02:11 toddsundsted

My memory growth is fairly consistent. Any logs I can get you?

mem

vrthra avatar Dec 05 '22 19:12 vrthra

@vrthra the output from select * from points where chart = "heap-size"; would be useful. that captures the heap size as it appears to the crystal language garbage collector. if that's constant, then the only other source of allocations should be sqlite. this will help me diagnose that case. thanks!

toddsundsted avatar Dec 07 '22 12:12 toddsundsted