linstor-server icon indicating copy to clipboard operation
linstor-server copied to clipboard

Unknown error during loading data from DB

Open kvaps opened this issue 4 years ago • 2 comments

Today our production cluster stopped working:

LINSTOR, Module Controller
Version:            1.13.0 (37c02e20aa52f26ef28ce4464925d9e53327171c)
Build time:         2021-06-25T13:20:41+00:00
Java Version:       11
Java VM:            Debian, Version 11.0.11+9-post-Debian-1deb10u1
Operating system:   Linux, Version 5.4.0-74-generic
Environment:        amd64, 1 processors, 15515 MiB memory reserved for allocations
System components initialization in progress
11:36:18.231 [main] INFO  LINSTOR/Controller - SYSTEM - ErrorReporter DB version 1 found.
11:36:18.232 [main] INFO  LINSTOR/Controller - SYSTEM - Log directory set to: '/logs'
11:36:18.251 [main] WARN  io.sentry.dsn.Dsn - *** Couldn't find a suitable DSN, Sentry operations will do nothing! See documentation: https://docs.sentry.io/clients/java/ ***
11:36:18.261 [Main] INFO  LINSTOR/Controller - SYSTEM - Loading API classes started.
11:36:18.456 [Main] INFO  LINSTOR/Controller - SYSTEM - API classes loading finished: 195ms
11:36:18.456 [Main] INFO  LINSTOR/Controller - SYSTEM - Dependency injection started.
11:36:18.464 [Main] INFO  LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule"
11:36:18.464 [Main] INFO  LINSTOR/Controller - SYSTEM - Extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule" is not installed
11:36:19.009 [Main] INFO  LINSTOR/Controller - SYSTEM - Dependency injection finished: 553ms
11:36:19.192 [Main] INFO  LINSTOR/Controller - SYSTEM - Initializing authentication subsystem
11:36:19.272 [Main] INFO  LINSTOR/Controller - SYSTEM - Starting service instance 'TimerEventService' of type TimerEventService
11:36:19.272 [Main] INFO  LINSTOR/Controller - SYSTEM - Initializing the database connection pool
11:36:19.349 [Main] WARN  org.flywaydb.core.api.Location - Use of dots (.) as path separators will be deprecated in Flyway 7. Path: com.linbit.linstor.dbcp.migration
11:36:19.352 [Main] INFO  org.flywaydb.core.internal.license.VersionPrinter - Flyway Community Edition 6.5.7 by Redgate
11:36:19.393 [Main] INFO  org.flywaydb.core.internal.database.DatabaseFactory - Database: jdbc:postgresql://linstor-db-stolon-proxy/linstor (PostgreSQL 10.12)
11:36:19.435 [Main] INFO  org.flywaydb.core.internal.command.DbValidate - Successfully validated 59 migrations (execution time 00:00.024s)
11:36:19.445 [Main] INFO  org.flywaydb.core.internal.command.DbMigrate - Current version of schema "LINSTOR": 2021.05.26.09.00
11:36:19.445 [Main] WARN  org.flywaydb.core.internal.command.DbMigrate - outOfOrder mode is active. Migration of schema "LINSTOR" may not be reproducible.
11:36:19.446 [Main] INFO  org.flywaydb.core.internal.command.DbMigrate - Schema "LINSTOR" is up to date. No migration necessary.
11:36:19.455 [Main] INFO  LINSTOR/Controller - SYSTEM - Starting service instance 'DatabaseService' of type DatabaseService
11:36:19.455 [Main] INFO  LINSTOR/Controller - SYSTEM - Loading security objects
11:36:19.486 [Main] INFO  LINSTOR/Controller - SYSTEM - Current security level is NO_SECURITY
11:36:19.523 [Main] INFO  LINSTOR/Controller - SYSTEM - Core objects load from database is in progress
11:36:35.575 [Main] ERROR LINSTOR/Controller - SYSTEM - Unknown error during loading data from DB [Report number 60F16F31-00000-000000]
11:36:35.580 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Shutdown in progress
11:36:35.581 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Shutting down service instance 'TaskScheduleService' of type TaskScheduleService
11:36:35.581 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Waiting for service instance 'TaskScheduleService' to complete shutdown
11:36:35.581 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Shutting down service instance 'DatabaseService' of type DatabaseService
11:36:35.582 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Waiting for service instance 'DatabaseService' to complete shutdown
11:36:35.582 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Shutting down service instance 'TimerEventService' of type TimerEventService
11:36:35.582 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Waiting for service instance 'TimerEventService' to complete shutdown
11:36:35.582 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Shutdown complete
ERROR REPORT 60F16F31-00000-000000

============================================================

Application:                        LINBIT? LINSTOR
Module:                             Controller
Version:                            1.13.0
Build ID:                           37c02e20aa52f26ef28ce4464925d9e53327171c
Build time:                         2021-06-25T13:20:41+00:00
Error time:                         2021-07-16 11:36:35
Node:                               linstor-controller-656bbb99db-2k4vh

============================================================

Reported error:
===============

Category:                           Error
Class name:                         ImplementationError
Class canonical name:               com.linbit.ImplementationError
Generated at:                       Method 'loadAll', Source file 'DatabaseLoader.java', Line #604

Error message:                      Unknown error during loading data from DB

Call backtrace:

    Method                                   Native Class:Line number
    loadAll                                  N      com.linbit.linstor.dbdrivers.DatabaseLoader:604
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:176
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:108
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:343
    main                                     N      com.linbit.linstor.core.Controller:564

Caused by:
==========

Description:
    Failed to restore data

Category:                           LinStorException
Class name:                         DatabaseException
Class canonical name:               com.linbit.linstor.dbdrivers.DatabaseException
Generated at:                       Method 'loadAll', Source file 'AbsDatabaseDriver.java', Line #109

Error message:                      Failed to restore data

Call backtrace:

    Method                                   Native Class:Line number
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:109
    loadAll                                  N      com.linbit.linstor.dbdrivers.DatabaseLoader:408
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:176
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:108
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:343
    main                                     N      com.linbit.linstor.core.Controller:564

Caused by:
==========

Category:                           Exception
Class name:                         MaxSizeException
Class canonical name:               com.linbit.drbd.md.MaxSizeException
Generated at:                       Method 'checkVolumeSize', Source file 'VolumeDefinition.java', Line #177

Error message:                      Volume size value 1181116006400000000 is out of range [4 - 1099511627776]

Call backtrace:

    Method                                   Native Class:Line number
    checkVolumeSize                          N      com.linbit.linstor.core.objects.VolumeDefinition:177
    <init>                                   N      com.linbit.linstor.core.objects.VolumeDefinition:112
    load                                     N      com.linbit.linstor.core.objects.VolumeDefinitionDbDriver:144
    load                                     N      com.linbit.linstor.core.objects.VolumeDefinitionDbDriver:43
    restoreData                              N      com.linbit.linstor.dbdrivers.sql.SQLEngine:345
    loadAll                                  N      com.linbit.linstor.dbdrivers.sql.SQLEngine:279
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:100
    loadAll                                  N      com.linbit.linstor.dbdrivers.DatabaseLoader:408
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:176
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:108
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:343
    main                                     N      com.linbit.linstor.core.Controller:564


END OF ERROR REPORT.

kvaps avatar Jul 16 '21 11:07 kvaps

The problem was caused by setting incredibly large number for this volume in Kubernetes:

pvc-23821312-c087-4cfd-834c-fc5403cf435e   1209462790553600m   RWO            Delete           Bound    hosting/vol-data-web-hc1-wd17-1   linstor-hosting-2            541d
┊ m16c11 ┊ pvc-23821312-c087-4cfd-834c-fc5403cf435e ┊ thindata    ┊     0 ┊    2926 ┊ /dev/drbd2926 ┊ 788.96 GiB ┊ Unused ┊ Resizing, UpToDate ┊
┊ m6c10  ┊ pvc-23821312-c087-4cfd-834c-fc5403cf435e ┊ thindata    ┊     0 ┊    2926 ┊ /dev/drbd2926 ┊   1.00 TiB ┊        ┊            Unknown ┊

solved by manually correcting the number in the "LINSTOR".volume_definitions table

kvaps avatar Jul 16 '21 14:07 kvaps

Still a bug, because that volume size should not have been able to go into the database in the first place. DRBD volumes are limited in size to 1 PiB. The specified size was about 1049 EiB, which should still be well within LINSTOR's internal handling capabilities for volume sizes.

raltnoeder avatar Sep 01 '21 16:09 raltnoeder