[TD-94] Taskd takes up all RAM
Tom Sydney Kerckhove on 2015-03-24T23:49:13Z says:
I can sync on one system, but when I try to sync with the other system, taskd starts consuming all available memory. Both systems use task 2.4.2.
Migrated metadata:
Created: 2015-03-24T23:49:13Z
Modified: 2015-05-02T12:42:33Z
Paul Beckingham on 2015-03-25T00:42:09Z says:
Thanks for the report. Can you tell us how long the taskd instance had been running before this happened? How often are you syncing? Are you using recurring tasks? Just looking for clues here.
Is there anything in the taskd.log file?
Tom Sydney Kerckhove on 2015-03-25T08:15:47Z says:
The taskd instance was running constantly for several days. I'm syncing every minute (crontab), but I guess it doesn't do anything when there's nothing to sync. I am using recurring tasks, though only two. I'm not sure what to look for in the (35 MB!) log file.
I had to increase the request size limit once to sync my tasks; I think the problem started then.
Is there any way I can work around this for now? (Like starting a new user on the server? It's mine anyway.)
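For what it's worth, the crontab entry is just a per-minute sync along these lines (the task path will vary; output is discarded so cron doesn't mail every run):
* * * * * /usr/bin/task sync >/dev/null 2>&1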
Paul Beckingham on 2015-03-25T11:28:22Z says:
Can you show your taskd installation details (XXXX out the hostname for privacy):
$ taskd diag
This will show me the various version numbers. I am looking for patterns, and very old GnuTLS libs.
Can you show me the size of the data:
$ wc ~/.task/*data
This will show me whether you have lots of tasks, or a few.
Then the number of transactions:
$ grep -c ' from ' taskd.log
Then the bounce count, which is how many times you restarted. Combining transactions and bounce count will give me a sense of how long your server stays up:
$ grep -c Daemonizing taskd.log
Then the size of the transactions:
$ grep Stored taskd.log | grep -v 'Stored 0 '
This will tell me if you have unusually large transactions (which I doubt, given the sync frequency). If you are not comfortable publishing this data here, please send it to support(at)taskwarrior.org.
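If it helps, here is a throwaway script bundling those checks (an untested sketch; it assumes taskd.log is in the current directory and the client data is in ~/.task):
#!/bin/sh
# Bundle the diagnostics above into one report.
LOG=taskd.log
taskd diag
wc ~/.task/*data
echo "transactions: $(grep -c ' from ' "$LOG")"
echo "bounces: $(grep -c Daemonizing "$LOG")"
echo "recent non-empty stores:"
grep Stored "$LOG" | grep -v 'Stored 0 ' | tail -20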
We don't have a workaround for this yet (this is the second reported case of this problem), but here are some things to try:
- Bounce the server (no need to create a new user).
- Delete the tx.data file on the server, then from one client run 'task sync init' to give the server a fresh copy, and continue as before (see the sketch after this list).
- Syncing every minute? That's a lot. This appears to be a cumulative memory leak, so any reduction in sync frequency will help while we find and fix it. So far we have not seen this behavior ourselves.
- Failing all that, bounce the server every night.
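Roughly, the tx.data workaround looks like this (an untested sketch; the pid.file location, org name, and user UUID are placeholders for your setup):
# stop the daemon (assumes pid.file is configured; otherwise kill the taskd process by hand)
$ kill $(cat $TASKDDATA/pid.file)
# remove the accumulated transaction data for the affected user
$ rm $TASKDDATA/orgs/<org>/users/<user-uuid>/tx.data
# restart the server
$ taskd server --data $TASKDDATA --daemon
# then, from ONE client only:
$ task sync init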
Meanwhile, we'll keep improving the server, and try to find this problem.
Tom Sydney Kerckhove on 2015-03-25T16:36:18Z says:
$ taskd diag

taskd 1.0.0
    Platform: Linux
    Hostname: XXXX

Compiler
     Version: 4.8.2 20140120 (Red Hat 4.8.2-16)
        Caps: +stdc +stdc_hosted +200809 +LP64 +c1 +i4 +l8 +vp8

Build Features
       Built: Jan 28 2015 19:15:58
      Commit: 2f40c1b
       CMake: 2.8.12
        Caps: +pthreads +tls
     libuuid: libuuid + uuid_unparse_lower
   libgnutls: 2.8.5
$ wc ~/.task/*data (on the system that makes taskd consume all RAM)
141 807 34350 .task/backlog.data
1472 34172 449841 .task/completed.data
398 5000 105789 .task/pending.data
24474 272705 3462316 .task/undo.data
26485 312684 4052296 total
$ wc ~/.task/*data (on the other system)
1 1 37 /home/syd/.task/backlog.data
1539 36627 484465 /home/syd/.task/completed.data
674 7970 160428 /home/syd/.task/pending.data
29043 307801 4043979 /home/syd/.task/undo.data
31257 352399 4688909 total
$ grep -c ' from ' taskd.log
59061
$ grep -c Daemonizing taskd.log
8
$ grep Stored taskd.log | grep -v 'Stored 0 ' (output is huge; only the last few lines are shown)
2015-03-19 16:36:01 [57767] Stored 2 tasks, merged 0 tasks
2015-03-19 17:35:56 [57886] Stored 1 tasks, merged 0 tasks
2015-03-20 10:28:57 [59005] Stored 1143 tasks, merged 0 tasks
2015-03-20 11:02:53 [59041] Stored 651 tasks, merged 0 tasks
2015-03-20 15:30:03 [1] Stored 6127 tasks, merged 0 tasks
2015-03-23 16:28:57 [1] Stored 184 tasks, merged 0 tasks
2015-03-24 17:38:05 [1] Stored 1 tasks, merged 0 tasks
I deleted the tx.data file and resynced, but now one system gets the following error:
Sync failed. The Taskserver returned error: 500 Client sync key not found.
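My guess (unconfirmed) is that this client still holds a sync key the server forgot when tx.data was deleted, so reinitializing from that client should clear it. After backing up ~/.task:
$ task sync init
That re-uploads the client's tasks and establishes a fresh sync key.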
Renato Alves on 2015-04-17T20:07:36Z says:
Given that the request size limit had to be increased, I suspect it's the same problem described in TD-77.
How much is "All available RAM" on your system?
Tom Sydney Kerckhove on 2015-04-17T20:23:26Z says:
Total RAM is about 1 GiB.
Paul Beckingham on 2015-04-25T22:13:33Z says:
Observations:
- Your taskserver handled 59,061 transactions. If there is a memory leak, that would certainly be enough transactions to make it show itself in this way.
- libgnutls 2.8.5 is from 2009; for a security product, that is horribly out of date. Is updating your GnuTLS, rebuilding, and rerunning an option?
Tom Sydney Kerckhove on 2015-04-28T21:53:37Z says:
A more recent version of GnuTLS doesn't seem to be available for Amazon Linux in the package repositories.
The problem hasn't presented itself recently. It originally happened after I had modified a lot of tasks with a command that had a very open filter. I even had to raise the server transaction limit to get it to start syncing again.
I will compile GnuTLS manually some time soon. Is there anything I should keep you updated on?
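For anyone else stuck on Amazon Linux, the rough plan (untested; the version and prefix are just examples, and GnuTLS build dependencies such as nettle are not shown):
# build a modern GnuTLS from source
$ wget https://www.gnupg.org/ftp/gcrypt/gnutls/v3.3/gnutls-3.3.14.tar.xz
$ tar xf gnutls-3.3.14.tar.xz && cd gnutls-3.3.14
$ ./configure --prefix=/usr/local && make && sudo make install
# then rebuild taskd against it
$ cd /path/to/taskd-1.0.0
$ cmake -DCMAKE_BUILD_TYPE=Release . && make && sudo make install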
Paul Beckingham on 2015-04-28T23:59:51Z says:
Thanks Tom, I appreciate the feedback. I added you as a contributor. I think the problem will go away with newer GnuTLS versions, and I'll just assume that is the case, which means I'll be very interested if it happens again, and if it does, I'm interested in the data we gathered above.
It makes sense that you had to increase the transaction size limit for that large update; it's really just a crude cutoff that prevents someone from attempting a 50 GB sync and taking out your server. Raising it (and keeping it raised) above the default 1 MB is probably a good idea. Perhaps I should raise the default.
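Concretely, the knob is the request.limit setting, measured in bytes (the value below is only an example; if I recall correctly, zero disables the check entirely):
$ taskd config --data $TASKDDATA request.limit 10485760
Editing the config file under $TASKDDATA directly works too; restart taskd afterwards.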
We have gathered some data on which GnuTLS versions show bad leaks:
  GOOD  3.3.14
  BAD   2.12.20
  BAD   2.8.5
Looking at the commits to the GnuTLS project, I see several leak fixes in the 3.2 branch, which corroborates this data. If you see data that extends the information in this list, I'd be interested.
I'll keep the issue open for now - perhaps someone will see it and add to the data, or learn from it.