
Jobs stuck in scheduled after failing to acquire semaphore

Open colinblaise opened this issue 2 years ago • 14 comments

  • Hangfire.Core 1.8.3
  • Hangfire.Pro 2.3.2

My jobs were moved to the Scheduled state after failing to acquire a semaphore, which is expected.

However, after the processing jobs have completed, I would expect the scheduled jobs to start being processed.

This behavior is reproducible in my product; the following screenshots show the issue.

The maintenance semaphore allows for 250 jobs.
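
For reference, the semaphore is set up roughly like the sketch below with Hangfire.Throttling. It is illustrative only, written from memory of the docs rather than copied from our code, so treat the names and the maxCount parameter as assumptions:

// Simplified sketch of the semaphore setup with Hangfire.Throttling (Hangfire.Pro).
// Names are illustrative; DI wiring and error handling are omitted.
using Hangfire.Throttling;

IThrottlingManager manager = new ThrottlingManager();

// Allow at most 250 concurrent executions of jobs tagged with this semaphore.
manager.AddOrUpdateSemaphore("maintenance", new SemaphoreOptions(maxCount: 250));

public class MaintenanceJobs
{
    // When the semaphore can't be acquired, the job is rescheduled (moved to the
    // Scheduled state) instead of running, which is the behavior described above.
    [Semaphore("maintenance")]
    public void Run(int batchId)
    {
        // ... actual maintenance work ...
    }
}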

[screenshot]

I have 20 jobs processing.

[screenshot]

All of the scheduled jobs are stuck in the scheduled state with the semaphore acquisition error.

[screenshot]

Any thoughts on what would cause this behavior?

colinblaise avatar Nov 07 '23 18:11 colinblaise

I'm working on getting a dump to help diagnose the issue.

colinblaise avatar Nov 07 '23 18:11 colinblaise

I'm wondering if there is an issue with Redis. My Redis instance's CPU and memory appear stable, and I can connect successfully.

[screenshot]

[screenshot]

colinblaise avatar Nov 07 '23 19:11 colinblaise

I dug into our application logs and I am repeatedly seeing the following error as well.

[19:49:08 ERR Hangfire.Server.DelayedJobScheduler] 5 state change attempt(s) failed due to an exception, moving job to the FailedState
  {}
System.InvalidOperationException: Transaction was discarded to prevent inconsistency: one or more of active distributed locks were abandoned. Are you hitting Redis memory limits?
   at Hangfire.Pro.Redis.RedisTransaction.Commit()
   at Hangfire.Server.DelayedJobScheduler.EnqueueBackgroundJob(BackgroundProcessContext context, IStorageConnection connection, String jobId)

colinblaise avatar Nov 07 '23 19:11 colinblaise

Another piece of information:

If I click "enqueue now" in the hangfire dashboard, then the job kicks off successfully without issue.

colinblaise avatar Nov 07 '23 20:11 colinblaise

Can you please run the INFO command in Redis and post the results here? I'd like to understand your memory-related configuration settings and see how many keys have been evicted.
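
The redis-cli INFO output is enough. If it's easier, the same sections can also be read from code, for example with StackExchange.Redis; a minimal sketch (the host name is a placeholder):

// Minimal sketch: dump the "memory" and "stats" INFO sections, which contain
// maxmemory, maxmemory_policy and evicted_keys. The host name is a placeholder.
using System;
using StackExchange.Redis;

var mux = ConnectionMultiplexer.Connect("your-redis-host:6379");
var server = mux.GetServer(mux.GetEndPoints()[0]);

Console.WriteLine(server.InfoRaw("memory"));
Console.WriteLine(server.InfoRaw("stats"));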

odinserj avatar Nov 08 '23 08:11 odinserj

I will run the command as soon as I am able to reproduce the issue again, thank you.

colinblaise avatar Nov 08 '23 14:11 colinblaise

I haven't been able to reproduce the issue yet, but here's the INFO output anyway so you can see some of the settings we use.

# Server
redis_version:6.0.14
redis_mode:standalone
os:Windows  
arch_bits:64
multiplexing_api:winsock_IOCP
run_id:964418c96002a26c862b9718f183ef1fb0332112
uptime_in_seconds:1095375
uptime_in_days:12
hz:10

# Clients
connected_clients:66
maxclients:1000
client_recent_max_input_buffer:8
client_recent_max_output_buffer:0
client_total_writes_outstanding:0
client_total_sent_bytes_outstanding:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0

# Memory
used_memory:115320520
used_memory_human:109.98M
used_memory_rss:127672320
used_memory_rss_human:121.76M
used_memory_peak:149683640
used_memory_peak_human:142.75M
used_memory_peak_perc:77.04%
used_memory_overhead:102620464
used_memory_startup:329360
used_memory_dataset:12700056
used_memory_dataset_perc:11.04%
used_memory_lua:78848
maxmemory:1100000000
maxmemory_reservation:123000000
maxfragmentationmemory_reservation:123000000
maxmemory_desired_reservation:123000000
maxfragmentationmemory_desired_reservation:123000000
maxmemory_human:1.02G
maxmemory_policy:noeviction
mem_allocator:jemalloc-4.0.3

# Stats
total_connections_received:12618579
total_commands_processed:2003603211
instantaneous_ops_per_sec:2744
bytes_received_per_sec:228001
bytes_sent_per_sec:229611
bytes_received_per_sec_human:222.66K
bytes_sent_per_sec_human:224.23K
rejected_connections:0
expired_keys:3261814
evicted_keys:0
keyspace_hits:358695228
keyspace_misses:28076298
pubsub_channels:14
pubsub_patterns:0
total_oom_messages:0

# Replication
role:master

# CPU
used_cpu_sys:29082.609375
used_cpu_user:35627.312500
used_cpu_avg_ms_per_sec:73
server_load:8.71
event_wait:647
event_no_wait:1303
event_wait_count:505
event_no_wait_count:561

# Cluster
cluster_enabled:0
cluster_myself_name:

# Keyspace
db0:keys=6,expires=0,avg_ttl=0
db1:keys=98,expires=13,avg_ttl=497424746
db2:keys=63,expires=29,avg_ttl=401228145
db3:keys=57,expires=29,avg_ttl=379735132
db4:keys=112,expires=46,avg_ttl=262265486
db5:keys=86,expires=41,avg_ttl=398301437
db6:keys=728,expires=55,avg_ttl=240082493
db7:keys=1108,expires=49,avg_ttl=325907332
db8:keys=97,expires=54,avg_ttl=491462140
db9:keys=103,expires=56,avg_ttl=550863295
db10:keys=133,expires=8,avg_ttl=626600081
db11:keys=70,expires=40,avg_ttl=796224391
db12:keys=6,expires=0,avg_ttl=0
db13:keys=57,expires=38,avg_ttl=757435709
db14:keys=56,expires=38,avg_ttl=756466655
db15:keys=66,expires=44,avg_ttl=274352738

colinblaise avatar Nov 08 '23 16:11 colinblaise

Thank you for the details. How often do you see the "Transaction was discarded to prevent inconsistency" error in your logs? Also, did you see any strange-looking jobs in the Dashboard UI whose type and method couldn't be found? And last but not least, are your scheduled jobs stuck all the time, or are they recovered automatically after some period of time?

odinserj avatar Nov 09 '23 09:11 odinserj

When the issue was occurring, I saw that log about once every 5 seconds.

I did not see anything abnormal in the Dashboard UI where the type and method couldn't be found.

From what I can tell, the only jobs that were stuck belonged to a specific semaphore. There was no automatic recovery, and the only solution was to manually re-enqueue the jobs from the Dashboard UI.

The issue has not surfaced again yet, so everything is working as expected for now.

colinblaise avatar Nov 09 '23 20:11 colinblaise

Thanks for all the information. I see that everything is normal with memory in Redis: there were no evictions, which could theoretically have caused distributed lock keys to be removed, and the max memory policy is set to noeviction, which disallows evictions entirely.

Do you see any other exceptions in the log? Could you share all of them here or send them to [email protected]?

odinserj avatar Nov 13 '23 10:11 odinserj

I found the issue live again.

> info
# Server
redis_version:6.0.14
redis_mode:standalone
os:Windows  
arch_bits:64
multiplexing_api:winsock_IOCP
run_id:aaf31577de0cf0c1725badbe1894e0f1350399f8
uptime_in_seconds:580420
uptime_in_days:6
hz:10

# Clients
connected_clients:66
maxclients:256
client_recent_max_input_buffer:8
client_recent_max_output_buffer:0
client_total_writes_outstanding:0
client_total_sent_bytes_outstanding:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0

# Memory
used_memory:59463328
used_memory_human:56.71M
used_memory_rss:57434112
used_memory_rss_human:54.77M
used_memory_peak:64194296
used_memory_peak_human:61.22M
used_memory_peak_perc:92.63%
used_memory_overhead:44058904
used_memory_startup:287376
used_memory_dataset:15404424
used_memory_dataset_perc:26.03%
used_memory_lua:56320
maxmemory:285000000
maxmemory_reservation:30000000
maxfragmentationmemory_reservation:30000000
maxmemory_desired_reservation:30000000
maxfragmentationmemory_desired_reservation:30000000
maxmemory_human:271.80M
maxmemory_policy:noeviction
mem_allocator:jemalloc-4.0.3

# Stats
total_connections_received:6153406
total_commands_processed:174772021
instantaneous_ops_per_sec:233
bytes_received_per_sec:16917
bytes_sent_per_sec:21099
bytes_received_per_sec_human:16.52K
bytes_sent_per_sec_human:20.60K
rejected_connections:0
expired_keys:1086693
evicted_keys:0
keyspace_hits:28789085
keyspace_misses:8074166
pubsub_channels:13
pubsub_patterns:0
total_oom_messages:0

# Replication
role:master

# CPU
used_cpu_sys:2054.718750
used_cpu_user:2803.171875
used_cpu_avg_ms_per_sec:12
server_load:1.24
event_wait:109
event_no_wait:133
event_wait_count:119
event_no_wait_count:105

# Cluster
cluster_enabled:0
cluster_myself_name:

# Keyspace
db0:keys=89,expires=45,avg_ttl=378357697
db1:keys=73,expires=41,avg_ttl=728006364
db2:keys=55,expires=28,avg_ttl=310361091
db3:keys=117,expires=54,avg_ttl=276079124
db4:keys=157,expires=60,avg_ttl=529266541
db5:keys=45,expires=11,avg_ttl=525267847
db6:keys=50,expires=31,avg_ttl=298025991
db7:keys=136,expires=54,avg_ttl=573588579
db8:keys=75,expires=42,avg_ttl=412107464
db9:keys=70,expires=36,avg_ttl=526147186
db10:keys=47,expires=15,avg_ttl=233165684
db11:keys=5926,expires=60,avg_ttl=515367457
db12:keys=86,expires=35,avg_ttl=496958905
db13:keys=60,expires=28,avg_ttl=310154958
db14:keys=98,expires=47,avg_ttl=707934649
db15:keys=58,expires=31,avg_ttl=499694522

Here is the Redis INFO output. The DB with the issue is db11.

colinblaise avatar Nov 14 '23 18:11 colinblaise

More screenshots:

All jobs are stuck in the Enqueued state, and only one job is being processed at a time.

[screenshot]

[screenshot]

[screenshot]

colinblaise avatar Nov 14 '23 18:11 colinblaise

There are no error logs coming from Hangfire at all.

colinblaise avatar Nov 14 '23 18:11 colinblaise

Sorry for the delayed response. Please tell me the following things:

  1. Is processing restored after restarting the application?
  2. Are there any other extension filters applied to your jobs? (See the example after this list for the kind of filter I mean.)
  3. What Hangfire.Pro.Redis version are you using?
  4. Is the time on your Redis server synchronized with your application server?
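
Regarding question 2: by "extension filters" I mean anything along the lines of the example below, i.e. custom state filters that can change or veto state transitions. This is illustrative only, not code from Hangfire itself:

// Illustrative only: a custom filter registered globally or applied as an attribute
// runs on every state election and can interfere with the Scheduled -> Enqueued
// transition performed by the DelayedJobScheduler.
using Hangfire.Common;
using Hangfire.States;

public class ExampleElectStateFilter : JobFilterAttribute, IElectStateFilter
{
    public void OnStateElection(ElectStateContext context)
    {
        // A filter could, for example, replace the candidate state here:
        // context.CandidateState = new ScheduledState(TimeSpan.FromMinutes(5));
        // Anything of this kind is what question 2 is asking about.
    }
}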

Please use [email protected] for faster response times if you have an active Hangfire subscription. It is backed by a help desk, and each ticket is tracked, unlike GitHub issues.

odinserj avatar Nov 16 '23 09:11 odinserj