
v0.4.0-beta1 upgrade from v0.3.0

Open tjheeta opened this issue 2 years ago • 4 comments

During the upgrade from v0.3.0 to v0.4.0-beta1, one host now fails to start with the following error:

2023-04-14T03:55:31Z app[91857211b0e283] lax [info]config file read from /etc/litefs.yml
2023-04-14T03:55:31Z app[91857211b0e283] lax [info]LiteFS v0.4.0-beta1, commit=6421a2236f96b1d7bd88d18a1cc46aa1fd5c2ae7
2023-04-14T03:55:31Z app[91857211b0e283] lax [info]no proxy target set, skipping proxy
2023-04-14T03:55:31Z app[91857211b0e283] lax [info]Using Consul to determine primary
2023-04-14T03:55:31Z app[91857211b0e283] lax [info]initializing consul: key=litefs/someapp
2023-04-14T03:55:31Z app[91857211b0e283] lax [info]ERROR: cannot open store: open databases: open database("database"): init from database header: open /mnt/litefs/dbs/database/database: not a directory
2023-04-14T03:55:31Z app[91857211b0e283] lax [info]Starting clean up.
2023-04-14T03:55:31Z app[91857211b0e283] lax [info]Umounting /dev/vdb from /mnt/litefs
2023-04-14T03:55:32Z app[91857211b0e283] lax [info][   10.188133] reboot: Restarting system
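
That error points at /mnt/litefs/dbs/database being a regular file where LiteFS expects a per-database directory (the listing below shows a zero-length file at that path). Here is a minimal sketch of that failure mode, illustrative only and not LiteFS's actual code, using a made-up temp directory:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Stand-in for /mnt/litefs/dbs on the failing host.
	dbsDir, err := os.MkdirTemp("", "dbs")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dbsDir)

	// Simulate the stray zero-length file at dbs/database.
	strayPath := filepath.Join(dbsDir, "database")
	if err := os.WriteFile(strayPath, nil, 0o644); err != nil {
		panic(err)
	}

	// LiteFS expects dbs/<name>/ to be a directory containing a "database"
	// file; opening a path that goes through the stray regular file fails
	// with ENOTDIR, which surfaces as "not a directory".
	_, err = os.Open(filepath.Join(strayPath, "database"))
	fmt.Println(err) // open .../database/database: not a directory
}
```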

The contents of /mnt/litefs on the failing host:

        2      4 drwxr-xr-x   4 root     root         4096 Apr  5 08:37 /mnt/litefs
       11     16 drwx------   2 root     root        16384 Apr  5 08:36 /mnt/litefs/lost+found
       12      4 -rw-r--r--   1 root     root           25 Apr  5 08:37 /mnt/litefs/id
     8161      4 drwxr-xr-x   5 root     root         4096 Apr 14 03:51 /mnt/litefs/dbs
       13      4 drwxr-xr-x   3 root     root         4096 Apr 14 03:17 /mnt/litefs/dbs/main
       17     32 -rw-r--r--   1 root     root        32768 Apr 14 03:17 /mnt/litefs/dbs/main/shm
       14  54432 -rw-r--r--   1 root     root     55738368 Apr 14 03:17 /mnt/litefs/dbs/main/database
       15      4 drwxr-xr-x   2 root     root         4096 Apr  5 08:37 /mnt/litefs/dbs/main/ltx
       16  54488 -rw-r--r--   1 root     root     55792920 Apr  5 08:37 /mnt/litefs/dbs/main/ltx/0000000000000001-0000000000000001.ltx
       24      0 -rw-r--r--   1 root     root            0 Apr 14 03:51 /mnt/litefs/dbs/database
       18      4 drwxr-xr-x   3 root     root         4096 Apr 14 03:17 /mnt/litefs/dbs/main.sqlite
       23      0 -rw-r--r--   1 root     root            0 Apr 14 03:52 /mnt/litefs/dbs/main.sqlite/wal
       21     32 -rw-r--r--   1 root     root        32768 Apr 14 03:52 /mnt/litefs/dbs/main.sqlite/shm
       19  40280 -rw-r--r--   1 root     root     41246720 Apr 14 03:49 /mnt/litefs/dbs/main.sqlite/database
       20      4 drwxr-xr-x   2 root     root         4096 Apr 14 03:49 /mnt/litefs/dbs/main.sqlite/ltx
       26     28 -rw-r--r--   1 root     root        24720 Apr 14 03:49 /mnt/litefs/dbs/main.sqlite/ltx/00000000000003ec-00000000000003ec.ltx
       22      8 -rw-r--r--   1 root     root         4220 Apr 14 03:49 /mnt/litefs/dbs/main.sqlite/ltx/00000000000003ea-00000000000003ea.ltx
       25      8 -rw-r--r--   1 root     root         4220 Apr 14 03:49 /mnt/litefs/dbs/main.sqlite/ltx/00000000000003eb-00000000000003eb.ltx
       27      4 drwxr-xr-x   2 root     root         4096 Apr 14 03:51 /mnt/litefs/dbs/ltx

From a system that completed the upgrade successfully:

Connecting to fdaa:0:7bb8:a7b:f4:824c:e412:2... complete
        2      4 drwxr-xr-x   4 root     root         4096 Apr  2 03:10 /mnt/litefs
       12      4 -rw-r--r--   1 root     root           25 Apr  2 03:10 /mnt/litefs/id
     8161      4 drwxr-xr-x   4 root     root         4096 Apr  2 03:10 /mnt/litefs/dbs
       13      4 drwxr-xr-x   3 root     root         4096 Apr 14 03:50 /mnt/litefs/dbs/main
       14  54432 -rw-r--r--   1 root     root     55738368 Apr 14 03:50 /mnt/litefs/dbs/main/database
       17     32 -rw-r--r--   1 root     root        32768 Apr 14 03:50 /mnt/litefs/dbs/main/shm
       15      4 drwxr-xr-x   2 root     root         4096 Apr  2 03:10 /mnt/litefs/dbs/main/ltx
       16  54488 -rw-r--r--   1 root     root     55792920 Apr  2 03:10 /mnt/litefs/dbs/main/ltx/0000000000000001-0000000000000001.ltx
       18      4 drwxr-xr-x   3 root     root         4096 Apr 14 03:50 /mnt/litefs/dbs/main.sqlite
       19  40280 -rw-r--r--   1 root     root     41246720 Apr 14 03:50 /mnt/litefs/dbs/main.sqlite/database
       23     80 -rw-r--r--   1 root     root        78312 Apr 14 04:08 /mnt/litefs/dbs/main.sqlite/wal
       21     32 -rw-r--r--   1 root     root        32768 Apr 14 04:08 /mnt/litefs/dbs/main.sqlite/shm
       20     16 drwxr-xr-x   2 root     root        16384 Apr 14 04:08 /mnt/litefs/dbs/main.sqlite/ltx
       24      4 -rw-r--r--   1 root     root         2820 Apr 14 04:08 /mnt/litefs/dbs/main.sqlite/ltx/00000000000003f1-00000000000003f1.ltx
       22     12 -rw-r--r--   1 root     root        12198 Apr 14 04:07 /mnt/litefs/dbs/main.sqlite/ltx/00000000000003f0-00000000000003f0.ltx
       25      4 -rw-r--r--   1 root     root         1751 Apr 14 04:08 /mnt/litefs/dbs/main.sqlite/ltx/00000000000003f2-00000000000003f2.ltx
       11     16 drwx------   2 root     root        16384 Apr  2 03:10 /mnt/litefs/lost+found
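
Comparing the two listings, the failing host has two entries directly under dbs/ that the healthy host does not: a zero-length regular file /mnt/litefs/dbs/database and an extra /mnt/litefs/dbs/ltx directory with no contents listed. A quick scan to flag that kind of stray entry (written for this report, not part of LiteFS; adjust the dbs path to your mount):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dbs := "/mnt/litefs/dbs" // data directory from the listings above
	entries, err := os.ReadDir(dbs)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range entries {
		path := filepath.Join(dbs, e.Name())
		// Each direct child of dbs/ should be a per-database directory.
		if !e.IsDir() {
			fmt.Printf("stray file where a directory is expected: %s\n", path)
			continue
		}
		// Each database directory should contain a "database" file.
		if _, err := os.Stat(filepath.Join(path, "database")); err != nil {
			fmt.Printf("directory without a database file: %s\n", path)
		}
	}
}
```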

tjheeta · Apr 14 '23 04:04

After "rm /mnt/litefs/dbs/database" it appears to come up cleanly and latest data is there.

2023-04-14T04:23:58Z app[91857211b0e283] lax [info]database file does not exist on initialization: /mnt/litefs/dbs/ltx/database
2023-04-14T04:23:58Z app[91857211b0e283] lax [info]wal-sync: no wal file exists on "main", skipping sync with ltx
2023-04-14T04:23:59Z app[91857211b0e283] lax [info]wal-sync: short wal file exists on "main.sqlite", skipping sync with ltx
2023-04-14T04:23:59Z app[91857211b0e283] lax [info]LiteFS mounted to: /litefs
2023-04-14T04:23:59Z app[91857211b0e283] lax [info]http server listening on: http://localhost:20202
2023-04-14T04:23:59Z app[91857211b0e283] lax [info]waiting to connect to cluster
2023-04-14T04:23:59Z app[91857211b0e283] lax [info]FC1A4EAE16BD02A6: existing primary found (3287440dc92185), connecting as replica
2023-04-14T04:23:59Z app[3287440dc92185] ord [info]2F8633F30B331BF1: stream connected
2023-04-14T04:23:59Z app[3287440dc92185] ord [info]transaction file for txid 00000000000003ed no longer available, writing snapshot
2023-04-14T04:23:59Z app[3287440dc92185] ord [info]writing snapshot "main.sqlite" @ 00000000000003f2
2023-04-14T04:24:00Z app[91857211b0e283] lax [info]snapshot received for "main.sqlite", removing other ltx files: 0000000000000001-00000000000003f2.ltx
2023-04-14T04:24:00Z app[91857211b0e283] lax [info]connected to cluster, ready
2023-04-14T04:24:00Z app[91857211b0e283] lax [info]starting background subprocess: /app/bin/server []
2023-04-14T04:24:00Z app[91857211b0e283] lax [info]waiting for signal or subprocess to exit
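
For context on the last few log lines: my reading is that the replica reconnected at txid 3ed, whose LTX file the primary no longer retained, so the primary fell back to writing a full snapshot at 3f2 and the replica discarded its other LTX files. A rough sketch of that decision as I understand it from the log, not LiteFS's actual code (the "oldest retained" value is made up for illustration):

```go
package main

import "fmt"

// catchUpMode mirrors the behavior visible in the log: if the replica's
// position is older than the oldest LTX file the primary still retains,
// the primary writes a full snapshot instead of streaming LTX files.
func catchUpMode(replicaTXID, oldestRetainedTXID uint64) string {
	if replicaTXID < oldestRetainedTXID {
		return "snapshot" // "transaction file ... no longer available, writing snapshot"
	}
	return "ltx" // stream incremental LTX files from the replica's position
}

func main() {
	// Illustrative values: the replica reconnected at 0x3ed; 0x3f0 is an
	// assumed oldest-retained txid on the primary.
	fmt.Println(catchUpMode(0x3ed, 0x3f0)) // snapshot
}
```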

tjheeta · Apr 14 '23 04:04

That looks better after that file removal. I'm not sure why that happened. I have some docs to push up around the v0.4.0 changes tomorrow too. I'm glad it went smoothly outside of the odd zero-length file.

benbjohnson · Apr 14 '23 04:04

I'll see if I can reproduce that tomorrow.

benbjohnson · Apr 14 '23 04:04

@tjheeta I didn't have any luck reproducing this, and I don't see any obvious reason in the code why it would happen. I'll leave this issue open for now in case anyone else hits the same problem, and we can investigate further then.

benbjohnson · Apr 16 '23 17:04