
Large Redis store

Open doender opened this issue 6 years ago • 18 comments

As far as I understand Redis is used to cache statements that are not stored yet, so its size should remain relatively constant over time.

However, on our instance the Redis store has grown to around 11 GB. This in turn makes redis-rdb-bgsave take a long time.

INFO keyspace is db0:keys=16788661,expires=11,avg_ttl=16841

Is this a normal size or is there some way to decrease it?
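For reference, the INFO keyspace line above can be unpacked with a few lines of Python (a small sketch, using the exact line reported above). Note that only 11 of the ~16.8M keys carry a TTL:

```python
# Minimal sketch: parse one line of Redis `INFO keyspace` output.
# The line is the one reported above; the field layout follows the
# standard INFO format (keys=<n>,expires=<n>,avg_ttl=<ms>).
line = "db0:keys=16788661,expires=11,avg_ttl=16841"

db, _, stats = line.partition(":")
fields = dict(pair.split("=") for pair in stats.split(","))

keys = int(fields["keys"])
expires = int(fields["expires"])  # number of keys that have a TTL set

# Almost none of the ~16.8M keys have a TTL, so they never expire on
# their own -- consistent with the store growing unbounded.
print(db, keys, expires, keys - expires)
```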

doender avatar Aug 12 '19 14:08 doender

@doender I've copied our messages from Gitter into here so that we keep a track of this thread a little more easily. I suggest we move the conversation here from now on if that's okay with you.

@ryansmith94 August 13, 2019 10:56 AM

Hi @doender, the part I've quoted below from your comment is not true. We use Redis as a pub/sub so that the app can subscribe to new statements for worker jobs (personas, query builder cache, etc).

As far as I understand Learning Locker uses Redis to cache statements that are not stored yet, so its size should remain relatively constant over time (?).

We also use Redis to cache the results of aggregation queries. If you're making heavy use of the dashboards and visualisations in the user interface, it's possible the store is growing because of that, though I think we put TTLs on the caches so they should expire. If your Redis instance is growing just with statement IDs then it's possible you don't have workers running or configured correctly.
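The TTL-based caching described above can be sketched like this. This is a hypothetical illustration in pure Python standing in for Redis SETEX-style expiry, not Learning Locker's actual implementation; the key name and TTL are made up:

```python
import time

# Toy TTL cache illustrating the aggregation-cache pattern:
# entries carry an expiry time and are evicted lazily on read.
class TTLCache:
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: evict on access, like Redis
            return None
        return value

cache = TTLCache()
cache.set("aggregation:dashboard-1", {"count": 42}, ttl_seconds=300)
print(cache.get("aggregation:dashboard-1"))
```

A cache whose keys all carry TTLs shrinks by itself; the queue keys discussed in this thread do not, which is why they can accumulate.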

@doender August 13, 2019 2:46 PM

Thanks @ryansmith94 !

I've run redis-cli --bigkeys to gain more insight into the content of the store:

Sampled 16788658 keys in the keyspace!
Total key length in bytes is 742012940 (avg len 44.20)

Biggest string found 'bull:DEV_STATEMENT_QUERYBUILDERCACHE_QUEUE:3337928:lock' has 36 bytes
Biggest   list found 'bull:STATEMENT_QUERYBUILDERCACHE_QUEUE:wait' has 4904921 items
Biggest    set found 'bull:STATEMENT_FORWARDING_QUEUE:completed' has 3 members
Biggest   hash found 'bull:STATEMENT_FORWARDING_QUEUE:784130' has 11 fields

16 strings with 207 bytes (00.00% of keys, avg size 12.94)
8 lists with 16788623 items (00.00% of keys, avg size 2098577.88)
3 sets with 8 members (00.00% of keys, avg size 2.67)
16788631 hashs with 139532114 fields (100.00% of keys, avg size 8.31)
0 zsets with 0 members (00.00% of keys, avg size 0.00)

I've checked a number of values for the keys formed as bull:STATEMENT_FORWARDING_QUEUE:* as hashes seem to take up all of the space. They all contain information on statements that are present in the LRS. For example:

redis-cli HGETALL bull:STATEMENT_FORWARDING_QUEUE:814131
 1) "data"
 2) "{\"statementId\":\"5bbdd45b5c9b410fc6d22e8c\"}"
 3) "opts"
 4) "{\"removeOnFail\":false,\"removeOnComplete\":false}"
 5) "progress"
 6) "0"
 7) "delay"
 8) "0"
 9) "timestamp"
10) "1539167323294"
11) "attempts"
12) "1"
13) "attemptsMade"
14) "0"
15) "failedReason"
16) ""
17) "stacktrace"
18) "[]"
19) "returnvalue"
20) "null"

and statement 5bbdd45b5c9b410fc6d22e8c is stored in Mongo.

What's the reason these keys are in Redis? And could they all be removed to shrink the size?
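A Bull job hash like the one above can be decoded in a few lines. This sketch uses the flat field/value list exactly as HGETALL returned it above (truncated to the interesting fields):

```python
import json

# Alternating field/value list as returned by HGETALL, copied from
# the bull:STATEMENT_FORWARDING_QUEUE:814131 example above.
raw = [
    "data", '{"statementId":"5bbdd45b5c9b410fc6d22e8c"}',
    "opts", '{"removeOnFail":false,"removeOnComplete":false}',
    "progress", "0",
    "timestamp", "1539167323294",
    "attemptsMade", "0",
]

# Pair up fields with values, then decode the JSON payloads.
job = dict(zip(raw[::2], raw[1::2]))
data = json.loads(job["data"])
opts = json.loads(job["opts"])

# removeOnComplete=false means Bull keeps finished jobs around rather
# than deleting them -- one plausible reason job hashes accumulate.
print(data["statementId"], opts["removeOnComplete"])
```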

@ryansmith94 August 16, 2019 9:37 AM

Hi @doender, thanks for the extra info. Seems odd that one of your queues has that many items. I think it's possible that your workers aren't running. When you run pm2 status on the server what do you get back?

@doender September 3, 2019 12:47 PM

@ryansmith94 According to PM2, all workers are running. We get new records every day, so I think that's not the problem.

Could it have something to do with our specific setup?

We have two APIs, one in Amsterdam (AMS) and one in Singapore (SGP). Both connect to a replicated database; the primary is in AMS, the secondary in SGP. All xAPI statements are pushed to the AMS API, which then writes them to the SGP DB (as it is primary), so no statements are pushed to the SGP API.

Both APIs get aggregation queries, but SGP receives way more because our customers are mainly in that region.

The redis store in SGP is the large one, the one in AMS is smallish.

@ryansmith94 I checked several STATEMENT_FORWARDING_QUEUE and DEV_STATEMENT_QUERYBUILDERCACHE_QUEUE keys, but there don't seem to be any TTLs set.

@ryansmith94 September 4, 2019 9:51 AM

Hey @doender, possibly, but looking at the size of bull:STATEMENT_QUERYBUILDERCACHE_QUEUE:wait I think there must be some worker issue.

Perhaps try pm2 logs worker to see if there are any errors.

ryasmi avatar Sep 06 '19 11:09 ryasmi

@ryansmith94 I don't see any errors in pm2 logs. Anything I can check to see whether data can be removed? And if so; how can I safely remove the data?

doender avatar Sep 20 '19 11:09 doender

You can safely remove query builder cache and any aggregation caches from Redis. However, the query builder cache is quite important for keeping the query builder UI up to date with the latest values.
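For bulk deletion by key pattern, an incremental SCAN-based sweep is the usual non-blocking approach (e.g. redis-cli --scan --pattern '…' piped to DEL, or redis-py's scan_iter). The sketch below runs the same matching logic over a plain dict standing in for the keyspace; the keys and pattern are illustrative:

```python
import fnmatch

# Stand-in keyspace; against a real instance you would iterate with
# SCAN/MATCH instead of holding all keys in memory.
keyspace = {
    "bull:STATEMENT_QUERYBUILDERCACHE_QUEUE:1": "...",
    "bull:STATEMENT_QUERYBUILDERCACHE_QUEUE:2": "...",
    "bull:STATEMENT_FORWARDING_QUEUE:1": "...",
}

def delete_matching(store, pattern):
    """Delete keys matching a glob-style pattern; return the count."""
    # Collect first, then delete, to avoid mutating while iterating.
    doomed = [k for k in store if fnmatch.fnmatch(k, pattern)]
    for key in doomed:
        del store[key]
    return len(doomed)

removed = delete_matching(keyspace, "bull:*QUERYBUILDERCACHE*")
print(removed, sorted(keyspace))
```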

ryasmi avatar Sep 23 '19 08:09 ryasmi

Thanks, now deleting those. What about DEV_STATEMENT_FORWARDING_QUEUE (I've got a lot of those as well)?

doender avatar Sep 23 '19 13:09 doender

Interesting, I wouldn't have expected to see anything in DEV_STATEMENT_FORWARDING_QUEUE. That's usually what we use for development; I think it's only used when NODE_ENV is set to DEVELOPMENT in the .env file.

ryasmi avatar Sep 23 '19 13:09 ryasmi

Mmm strange... it's set to production in the .env file. So all keys prefixed with DEV_ can be removed as well?

doender avatar Sep 23 '19 13:09 doender

I'm pretty sure yeah. Can you print the sizes of the keys?

ryasmi avatar Sep 23 '19 13:09 ryasmi

Here's the content of one of the statements:

127.0.0.1:6379> HGETALL bull:DEV_STATEMENT_FORWARDING_QUEUE:4213125
 1) "opts"
 2) "{\"removeOnFail\":false,\"removeOnComplete\":false,\"attempts\":1,\"delay\":0,\"timestamp\":1546567921384}"
 3) "priority"
 4) "0"
 5) "data"
 6) "{\"statementId\":\"5c2ec0eddcbf3d6969956ab6\"}"
 7) "name"
 8) "__default__"
 9) "delay"
10) "0"
11) "timestamp"
12) "1546567921384"

Is this what you mean by printing the size?

I'm currently also running redis-cli --bigkeys to see what we have left after removing the QUERYBUILDERCACHE keys. We now have 8M keys left, where it used to be almost 17M. Interestingly, the number of keys hasn't changed much since I first posted this, so I have the feeling most are stale.

doender avatar Sep 23 '19 13:09 doender

Ok, good to know most keys are stale. I feel like there was probably an issue at some point in the past which caused things to back up. If the keys start increasing again, whatever that problem was will still need resolving.

When I mentioned printing the size of the keys I just meant running redis-cli --bigkeys to see which DEV_ keys were largest and how large they'd got.

ryasmi avatar Sep 23 '19 13:09 ryasmi

Here's the output:

Sampled 8297439 keys in the keyspace!
Total key length in bytes is 337395319 (avg len 40.66)

Biggest string found 'bull:DEV_STATEMENT_QUEUE:stalled-check' has 13 bytes
Biggest   list found 'bull:STATEMENT_FORWARDING_QUEUE:wait' has 4795148 items
Biggest    set found 'bull:STATEMENT_FORWARDING_QUEUE:completed' has 3 members
Biggest   hash found 'bull:STATEMENT_PERSON_QUEUE:4417956' has 11 fields

13 strings with 134 bytes (00.00% of keys, avg size 10.31)
3 lists with 8297412 items (00.00% of keys, avg size 2765804.00)
3 sets with 8 members (00.00% of keys, avg size 2.67)
8297420 hashs with 68965159 fields (100.00% of keys, avg size 8.31)
0 zsets with 0 members (00.00% of keys, avg size 0.00)

I've also run redis-cli --scan --pattern bull:DEV_* | wc -l, which outputs 3502271

doender avatar Sep 23 '19 14:09 doender

Okay I think it should be safe to delete keys prefixed with "DEV_"

ryasmi avatar Sep 23 '19 14:09 ryasmi

I've removed the DEV_ keys, but it seems some have returned:

redis-cli --scan --pattern bull:DEV_*
bull:DEV_STATEMENT_PERSON_QUEUE:stalled-check
bull:DEV_STATEMENT_FORWARDING_QUEUE:stalled-check
bull:DEV_STATEMENT_FORWARDING_DEADLETTER_QUEUE:stalled-check
bull:DEV_PERSONA_IMPORT_QUEUE:stalled-check
bull:DEV_STATEMENT_QUERYBUILDERCACHE_QUEUE:stalled-check
bull:DEV_STATEMENT_FORWARDING_REQUEST_DELAYED_QUEUE:stalled-check
bull:DEV_STATEMENT_FORWARDING_REQEUST_QUEUE:stalled-check

Is this problematic?

doender avatar Sep 23 '19 17:09 doender

It's odd that they're being prefixed with DEV_ unless that is the default. I'm not 100% sure what the stalled-check is. I'll see if Chris can have a look at this.

ryasmi avatar Sep 24 '19 10:09 ryasmi

Do you have any statement forwards setup @doender?

ryasmi avatar Sep 24 '19 11:09 ryasmi

No, not that I know. How can I check this?

doender avatar Sep 24 '19 12:09 doender

You should be able to see this via the user interface on the Statement Forward page.

ryasmi avatar Sep 26 '19 11:09 ryasmi

We don't have statement forwarding set up

novolrs:PRIMARY> db.statementForwarding.count()
0

FYI, this is our .env file:

# Node environment (production|development)
NODE_ENV=production

# Site URL (used for emails)
SITE_URL=http://127.0.0.1

#########
# PORTS #
#########

# Expose the UI on this port
UI_PORT=3000

# Expose the API on this port
API_PORT=8080

# Use an additional port for unit tests
#TEST_API_PORT=808080

# Unique string used for hashing
# Recommended length, 256 bits
APP_SECRET=<SECRET>

#############
# DATABASES #
#############

#### MONGO ####

# Mongo URL (e.g. mongodb://localhost:27017/learninglockerV2)
MONGODB_PATH=mongodb://learninglocker:<SECRET>@platform-prod-lrs-ams3-1:27017,platform-prod-lrs-sgp1-1:27017/learninglocker_v2?replicaSet=novolrs&w=1&maxPoolSize=50
# Mongo socket timeout in milliseconds (default=300000)
#MONGO_SOCKET_TIMEOUT_MS=

# Mongo pool size (default=20)
#MONGO_CONNECTION_POOLSIZE=

# Mongo URL for tests
MONGODB_TEST_PATH=mongodb://localhost:27017/llv2tests

# Are aggregations allowed to use the Mongo disk (true*|false)
#ALLOW_AGGREGATION_DISK_USE=

#### REDIS ####
# Redis is used for caching aggregations and can also be used as a queue driver where SQS is not available
# The address of your Redis instance e.g. 127.0.0.1
REDIS_HOST=127.0.0.1
# The port where your redis instance is running e.g. 6379
REDIS_PORT=6379
# Your Redis DB number e.g. 0
REDIS_DB=0
# A prefix for each key in the Redis DB e.g. DEVELOPMENT/PROD
REDIS_PREFIX=LEARNINGLOCKER

# How many seconds are aggregation results cached (default=300)
#AGGREGATION_CACHE_SECONDS=

# Refresh aggregations when this close to expiry (default=120)
#AGGREGATION_REFRESH_AT_SECONDS=

# Max time aggregations can run for in milliseconds (default=0)
# https://docs.mongodb.com/manual/reference/operator/meta/maxTimeMS/
#MAX_TIME_MS=

# Maximum number of documents an aggregation can scan
# https://docs.mongodb.com/manual/reference/operator/meta/maxScan/
#MAX_SCAN=

# Turn off fuzzy scoring on persona matching
# Set to true to make persona workers much faster at scale (default is true)
#DISABLE_PERSONA_SCORING=true

############
# SECURITY #
############

# Enable/disable frameguard
# Disabling this feature will allow your UI to be embedded (via iframe) into other hosts/domains
#ENABLE_FRAMEGUARD=true

###########
# Logging #
###########

# Minimum logging level (error|warning|info|debug|silly)
LOG_MIN_LEVEL=info
# Relative dir to store API access logs
LOG_DIR=logs
# Logging level for tests
TEST_LOG_MIN_LEVEL=debug
# Use ANSI colour in the logs
COLOR_LOGS=true

#######################
# AWS Cloudwatch logs #
# AWS credentials must be configured for Cloudwatch access
# Ref: http://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/iam-identity-based-access-control-cwl.html
#######################
# Enable cloudwatch logs (false*|true)
#WINSTON_CLOUDWATCH_ENABLED=
#WINSTON_CLOUDWATCH_LOG_GROUP_NAME=
#WINSTON_CLOUDWATCH_LOG_STREAM_NAME=
#WINSTON_CLOUDWATCH_ACCESS_KEY_ID=
#WINSTON_CLOUDWATCH_SECRET_ACCESS_KEY=
#WINSTON_CLOUDWATCH_REGION=


#########
# OAuth #
#########

# Enable Google OAuth
# These details are available from the google developer console
GOOGLE_ENABLED=false
#GOOGLE_CLIENT_ID=
#GOOGLE_CLIENT_SECRET=

########
# Mail #
########

# SMTP credentials
#SMTP_HOST= # 127.0.0.1
#SMTP_PORT= # 587
#SMTP_SECURED= # true
#SMTP_USER= # user
#SMTP_PASS= # password

##########
# Queues #
##########

# Queue management system (REDIS|SQS)
QUEUE_PROVIDER=REDIS

# Namespace for queues
QUEUE_NAMESPACE=DEV

# AWS SQS credentials
#AWS_SQS_ACCESS_KEY_ID=
#AWS_SQS_SECRET_ACCESS_KEY=
#AWS_SQS_DEFAULT_REGION=

################
# File storage #
################

# The file storage type (local*|amazon)
FS_REPO=local

# Defaults to `storage`
#FS_SUBFOLDER=

# Local storage

# Defaults to current working directory
#FS_LOCAL_ENDPOINT=

# AWS S3 config
#FS_AWS_S3_ACCESS_KEY_ID=
#FS_AWS_S3_SECRET_ACCESS_KEY=
#FS_AWS_S3_REGION=
#FS_AWS_S3_BUCKET=

#############
# New Relic #
#############

# New Relic License key
#NEW_RELIC_LICENSE_KEY=
# APM name for API
#NEWRELIC_API_NAME=
# APM name for UI
#NEWRELIC_UI_NAME=

########
# MISC #
########

# Location of virus scanning binary (ClamAV - https://www.clamav.net/)
CLAMSCAN_BINARY=/usr/bin/clamscan

We just noticed we're running Redis 2.8.17 instead of the minimum required version 2.8.18 and will upgrade, but I don't think that could be causing the DEV_ keys? To be clear, we're seeing them reappear even after we delete them.
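One possible explanation worth noting from the .env above: QUEUE_NAMESPACE is set to DEV, and if queue key names are built from that namespace rather than from NODE_ENV, every queue key would get a DEV_ prefix even in production. The sketch below shows how such a key could be assembled; the naming logic is an assumption for illustration, not Learning Locker's actual code:

```python
# Hypothetical sketch of how a Bull queue key name might be built
# from QUEUE_NAMESPACE -- not Learning Locker's actual code.
def queue_key(namespace, queue, suffix):
    return f"bull:{namespace}_{queue}:{suffix}"

# With QUEUE_NAMESPACE=DEV (as in the .env above), every queue key
# gets a DEV_ prefix even though NODE_ENV=production.
key = queue_key("DEV", "STATEMENT_FORWARDING_QUEUE", "stalled-check")
print(key)
```

The result matches one of the keys seen reappearing in the scan output above, which is consistent with this being the source of the prefix.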

doender avatar Nov 18 '19 11:11 doender