
v9.0.0: Crash-recovery loop when memory_session_id is not captured

Open • mrlfarano opened this issue 3 months ago • 2 comments

Bug Description

Sessions created without memory_session_id cause an infinite crash-recovery loop. The generator continuously retries and fails with:

[ERROR] [SDK] ✗ OpenRouter agent error {sessionDbId=607} Cannot store observations: memorySessionId not yet captured
[INFO] [SESSION] [session-607] Generator auto-starting (observation) using OpenRouter

This loop runs indefinitely, growing the queue depth and consuming API tokens on every retry attempt.

Environment

  • claude-mem version: 9.0.0
  • OS: macOS (Darwin)
  • Provider: OpenRouter (mimo-v2-flash:free, also reproduced with gpt-4o-mini)
  • Node version: not provided

Steps to Reproduce

  1. Start a Claude Code session
  2. Session gets created in sdk_sessions table but memory_session_id column remains NULL/empty
  3. Observations are enqueued to pending_messages
  4. Generator attempts to process queue
  5. Fails with "Cannot store observations: memorySessionId not yet captured"
  6. Generator auto-restarts (crash-recovery)
  7. Loop continues indefinitely

Evidence from Logs

[2026-01-08 11:19:16.784] [SDK] OpenRouter API usage {model=xiaomi/mimo-v2-flash:free, inputTokens=10452, outputTokens=129}
[2026-01-08 11:19:16.784] [ERROR] [SDK] ✗ OpenRouter agent error {sessionDbId=607} Cannot store observations: memorySessionId not yet captured
[2026-01-08 11:19:16.784] [INFO] [SESSION] [session-607] Generator aborted
[2026-01-08 11:19:16.853] [INFO] [SESSION] [session-607] Generator auto-starting (observation) using OpenRouter

Database State

Sessions missing memory_session_id:

SELECT id, content_session_id, memory_session_id, status FROM sdk_sessions WHERE memory_session_id IS NULL OR memory_session_id = '';

-- Results:
-- 607|a2265efb-c878-4be4-b2f5-1ed2323cc607||active
-- 605|83f05013-e4e2-4564-8cec-f03dfc8c5eb7||active
-- (multiple sessions affected)

Expected Behavior

  1. Sessions should not be created until memory_session_id is captured
  2. OR: Generator should skip/fail gracefully for sessions missing memory_session_id instead of infinite retry
  3. OR: Crash-recovery should have a max retry limit before marking session as failed

Workaround

Manual database cleanup:

npm run worker:stop
sqlite3 ~/.claude-mem/claude-mem.db "DELETE FROM pending_messages;"
sqlite3 ~/.claude-mem/claude-mem.db "UPDATE sdk_sessions SET status = 'failed' WHERE memory_session_id IS NULL OR memory_session_id = '';"
npm run worker:start

Impact

  • Queue grows unbounded (saw 25-36+ stuck items)
  • Consumes API tokens on every failed retry (~10k tokens per attempt)
  • Worker broadcasts isProcessing=true indefinitely
  • Web UI shows stuck queue badge that won't clear

Suggested Fix

Add a check in the generator that skips sessions with a missing memory_session_id and marks them as failed after N retries, instead of restarting them in an infinite crash-recovery loop.
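
A minimal sketch of such a check, in TypeScript. Everything here is hypothetical and for illustration only: the TrackedSession shape, the retries counter, the MAX_RETRIES value, and the shouldStartGenerator function are assumptions, not names from the claude-mem source.

```typescript
// Hypothetical in-memory view of a row in sdk_sessions (not the real schema).
type SessionStatus = "active" | "failed";

interface TrackedSession {
  id: number;
  memorySessionId: string | null;
  status: SessionStatus;
  retries: number; // assumed counter; the issue does not say one exists
}

// Assumed value for the "N" in the suggestion above.
const MAX_RETRIES = 3;

// Decides whether crash-recovery may (re)start the generator for a session.
function shouldStartGenerator(session: TrackedSession): boolean {
  if (session.status === "failed") return false;

  // ID already captured (non-null, non-empty): safe to process the queue.
  if (session.memorySessionId) return true;

  // ID still missing: allow a bounded number of attempts, then give up.
  if (session.retries >= MAX_RETRIES) {
    session.status = "failed"; // breaks the crash-recovery loop
    return false;
  }
  session.retries += 1;
  return true;
}
```

With this shape, a session like 607 would get at most MAX_RETRIES further attempts before being marked failed, so the token cost of a stuck session is bounded rather than open-ended.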

Temporary Fix for Affected Users

If you're stuck in this loop, run these commands to clear it:

# Stop the worker
cd ~/.claude/plugins/marketplaces/thedotmack
npm run worker:stop

# Clear stuck queue and mark broken sessions as failed
sqlite3 ~/.claude-mem/claude-mem.db "DELETE FROM pending_messages;"
sqlite3 ~/.claude-mem/claude-mem.db "UPDATE sdk_sessions SET status = 'failed' WHERE memory_session_id IS NULL OR memory_session_id = '';"

# Restart the worker
npm run worker:start

mrlfarano • Jan 08 '26 16:01

In version 9.0.0 of the claude-mem plugin, a bug was identified where sessions missing a memory_session_id result in an infinite crash-recovery loop. When such sessions are created, the generator repeatedly attempts to process the session queue, fails with an error indicating the missing memory_session_id, and auto-restarts itself. This cycle continues indefinitely, causing unbounded queue growth, excessive API token consumption (~10k tokens per failed attempt), and rendering the worker stuck in a processing state. Logs and database evidence confirm that multiple sessions are affected. The expected behavior should either prevent session creation without memory_session_id, gracefully skip these sessions, or enforce a maximum retry limit to mark them as failed. A temporary workaround involves manually stopping the worker, clearing the stuck queue, and marking broken sessions as failed via direct database manipulation. A suggested fix involves updating the generator logic to avoid infinite retries for sessions missing memory_session_id.

github-actions[bot] • Jan 08 '26 16:01

Additional observations from v9.0.3/v9.0.4

Still experiencing this issue on v9.0.4 with the Gemini provider (CLAUDE_MEM_PROVIDER=gemini).

Observed behavior:

  • Multiple sessions (sessionDbId=21877, 21915) stuck with "Cannot store observations: memorySessionId not yet captured"
  • Queue accumulated 48+ pending messages
  • Worker restart sometimes triggers auto-recovery that captures memorySessionId:

Auto-recovered 1 sessions with pending work {totalPending=1, started=1, sessionIds=21915}
MEMORY_ID_CAPTURED | sessionDbId=21915 | memorySessionId=37b7e2b8-...

Workaround that works:

  1. Kill worker: pkill -f "worker-service.cjs"
  2. Restart worker - auto-recovery may capture memorySessionId
  3. If still stuck, /clear the affected session in VSCode to force new session creation

Related:

PR #615 (generate memorySessionId for stateless providers) would fix this for Gemini/OpenRouter users but is not yet merged.
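
For illustration only, the general idea of generating an ID locally when the provider never supplies one might look like the sketch below. This is not the code from PR #615, which is not shown in this thread. The SessionRow shape and the ensureMemorySessionId helper are hypothetical; randomUUID is the standard function from Node's built-in crypto module.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical minimal session shape (not the real sdk_sessions schema).
interface SessionRow {
  id: number;
  memorySessionId: string | null;
}

// Returns the session's memorySessionId, generating a UUID when it is
// NULL or empty, which is the state shown in the database query above.
function ensureMemorySessionId(row: SessionRow): string {
  if (!row.memorySessionId) {
    row.memorySessionId = randomUUID();
  }
  return row.memorySessionId;
}
```

Under this assumption, a stateless provider such as Gemini or OpenRouter would always have an ID available by the time observations are stored, so the "not yet captured" error could not occur for that reason.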

soho-dev-account • Jan 11 '26 11:01

Fixed in v9.0.1+. Session ID capture and crash recovery were stabilized in subsequent releases. Please update to v9.1.1 (latest).

thedotmack • Feb 08 '26 00:02