Authentication issue in solidcommunity.net

Open VirginiaBalseiro opened this issue 6 months ago • 45 comments

The OIDC login flow itself seems to complete successfully, but once redirected back to the client app (we've tested with both SolidOS and Dokieli), any operation that requires authentication fails with a 401 Unauthorized.

An error message reading "Ooops" also pops up in the SolidOS UI.

This is not an issue with one particular account; I have tested with multiple accounts.

Image

VirginiaBalseiro avatar Sep 12 '25 17:09 VirginiaBalseiro

Can you try to log in with your browser in private mode?

bourgeoa avatar Sep 13 '25 11:09 bourgeoa

Tried both on Chrome and Firefox, private mode, clearing cookies, storage, cache, etc. Same issue. Even tried creating a new account (which seems to succeed) and same issue.

VirginiaBalseiro avatar Sep 13 '25 11:09 VirginiaBalseiro

I suspect the root cause is that internal networking on the server is down. This can be confirmed by running curl solidcommunity.net on the server to see if there is a connection error. Networking matters here because the Community Solid Server performs a fetch on the user's WebID during AuthZ operations.
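A minimal sketch of how the curl check could be interpreted. The helper name describe_curl_exit is hypothetical; the exit codes it maps (6, 7, 28) are curl's standard codes for DNS failure, connection failure, and timeout. The 10-second timeout is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: translate a few common curl exit codes into a diagnosis.
describe_curl_exit() {
    case "$1" in
        0)  echo "reachable" ;;
        6)  echo "could not resolve host" ;;
        7)  echo "failed to connect" ;;
        28) echo "timed out" ;;
        *)  echo "curl failed with exit code $1" ;;
    esac
}

# Example invocation on the server (commented out so the block only defines the helper):
# curl --silent --max-time 10 --output /dev/null solidcommunity.net
# describe_curl_exit "$?"
```

An exit code of 0 only shows that the request got an HTTP response; it says nothing about whether authentication works.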

The fix in this case should be to run systemctl restart systemd-networkd. The following script should be able to keep things running.

#!/bin/bash

PING_TARGET="1.1.1.1"
INTERFACE="eth0"
MAX_ATTEMPTS=2
LOG_TAG="network-watchdog"

check_network() {
    # Reachability check: two pings, 2-second timeout each, sent via INTERFACE
    if ! ping -c 2 -W 2 -I "$INTERFACE" "$PING_TARGET" > /dev/null 2>&1; then
        echo "Ping failed"
        return 1
    fi

    # Link-state check: the interface must report "state UP"
    if ! ip link show "$INTERFACE" | grep -q "state UP"; then
        echo "Interface down"
        return 1
    fi

    return 0
}

attempt_recovery() {
    logger -t "$LOG_TAG" "Network outage detected. Attempting recovery..."
    systemctl restart systemd-networkd
    sleep 5  # give the network stack time to come back up
}

# Main execution
if ! check_network; then
    for ((i=1; i<=MAX_ATTEMPTS; i++)); do
        attempt_recovery
        if check_network; then
            logger -t "$LOG_TAG" "Network recovered after $i attempt(s)"
            exit 0
        fi
    done
    logger -t "$LOG_TAG" "Failed to recover network after $MAX_ATTEMPTS attempts"
    exit 1
else
    logger -t "$LOG_TAG" "Network operational"
    exit 0
fi

The Element discussion from the last time we fixed networking on the droplet can be found here.


@kkuffour the steps I used to reproduce the error reported by @VirginiaBalseiro were to:

  • Go to https://jeswr.solidcommunity.net/ (i.e. my storage URL)
  • Click "Log In" on the SolidOS UI
  • Follow the log in flow, and observe the reported error after completing the log in flow

jeswr avatar Sep 14 '25 11:09 jeswr

@kkuffour has found curl works within the droplet so my suspicion of a networking error is incorrect.

jeswr avatar Sep 14 '25 15:09 jeswr

Image

This is what the logs show. My best guess is that there is a lock timeout issue, so either a performance bottleneck or a storage bug.

kkuffour avatar Sep 15 '25 09:09 kkuffour

@jeswr I propose restarting the application and monitoring to see if these errors reappear.

kkuffour avatar Sep 15 '25 10:09 kkuffour

@bourgeoa @jeswr @VirginiaBalseiro, my theory is that the file system resource locker, set to 6000, is timing out. Option 1: restart the application and monitor. Option 2: change the resource locker to 15000 and restart the application. Which do you prefer or advise?

kkuffour avatar Sep 15 '25 11:09 kkuffour

It should be safe to pm2 restart the server @kkuffour - give that a go. @joachimvh do you have any insight as to what is going on here?

jeswr avatar Sep 15 '25 19:09 jeswr

The log above seems to indicate that the cleanup of expired files is failing due to the lock timeout, but I can't say what is causing this. It seems to be failing on the folders themselves and not on individual entries, so perhaps it is because there are so many entries.

But there is no guarantee that this is related to the initial problem in this issue.

joachimvh avatar Sep 16 '25 07:09 joachimvh

I think you are not on the right track for this issue. The timeout issue has been there from the start; it is just more frequent with the current issue.

Did you try restarting DNS?

sudo systemctl restart systemd-resolved.service

PS: did you look at the wiki?

If this fails, there are other things to look at:

  • disk size
  • recent updates to Pivot or SolidOS not tested on a test server
    • conflict between npm versions, mainly rdflib
    • pivot build not run

bourgeoa avatar Sep 16 '25 10:09 bourgeoa

We have already tried "sudo systemctl restart systemd-resolved.service". Update-wise, nothing has changed.

Possibly it could be this? { "@id": "urn:solid-server:default:CookieStorage", "@type": "WrappedExpiringStorage", "timeout": 1 }

disk size

Filesystem      Size  Used Avail Use% Mounted on
udev            3.9G     0  3.9G   0% /dev
tmpfs           795M  1.1M  794M   1% /run
/dev/vda1        78G   17G   61G  22% /
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/vda15      105M  7.5M   97M   8% /boot/efi
/dev/sda         99G   67G   27G  72% /mnt/volume_lon1_01
/dev/sdc        100G   67G   28G  71% /mnt/volume_lon1_04
/dev/sdb        100G   56G   39G  60% /mnt/volume_lon1_03
/dev/loop8       92M   92M     0 100% /snap/lxd/29619
/dev/loop6       64M   64M     0 100% /snap/core20/2582
/dev/loop2       64M   64M     0 100% /snap/core20/2599
/dev/loop5       55M   55M     0 100% /snap/certbot/4737
/dev/loop7       50M   50M     0 100% /snap/snapd/24792
/dev/loop9       67M   67M     0 100% /snap/core24/1055
/dev/loop11      92M   92M     0 100% /snap/lxd/32662
/dev/loop0       51M   51M     0 100% /snap/snapd/25202
/dev/loop1       56M   56M     0 100% /snap/core18/2940
/dev/loop4       67M   67M     0 100% /snap/core24/1151
/dev/loop10      56M   56M     0 100% /snap/core18/2947
tmpfs           795M     0  795M   0% /run/user/1007

kkuffour avatar Sep 16 '25 11:09 kkuffour

Login with https://podpro.dev works without any issue.

We must consider that mashlib is the issue. Can you display the npm versions with npm ls? Not only the top-level packages but also their dependencies, so we can look for version conflicts.

Did ODI update anything?

bourgeoa avatar Sep 16 '25 16:09 bourgeoa

We must consider that mashlib is the issue.

There are reports about this being an issue in other apps on element https://matrix.to/#/!VAJLTawStGAyYwzTTY:gitter.im/$5ii4QGHxJXvb5TjKd9Cz0eM1T7Zz9I3YA7H0xZcUfUs?via=gitter.im&via=matrix.org&via=azadehafzar.io

jeswr avatar Sep 16 '25 16:09 jeswr

it is right that podpro.dev logs in correctly. but saving a file gives a 401, too.

Image

ewingson avatar Sep 17 '25 05:09 ewingson

Just to clarify: there are no visible errors or issues when logging in, the problem is when trying to make any authenticated fetch, it returns a 401.

VirginiaBalseiro avatar Sep 17 '25 06:09 VirginiaBalseiro

it is right that podpro.dev logs in correctly. but saving a file gives a 401, too.

Image

I have no issue editing a file with https://podpro.dev when logged in to my account https://bourgeoa.solidcommunity.net

@ewingson can you retry?

bourgeoa avatar Sep 17 '25 09:09 bourgeoa

Just to clarify: there are no visible errors or issues when logging in, the problem is when trying to make any authenticated fetch, it returns a 401.

That's correct. And this is why I suspect the error to be a mashlib/rdflib npm version conflict somewhere.

@jeswr @kkuffour could you display the result of npm ls rdflib?

bourgeoa avatar Sep 17 '25 09:09 bourgeoa

@bourgeoa here is the output - I've also messaged you on element

root@solidcommunity:/home/solid/test-pivot# npm ls rdflib
@solid/[email protected] /home/solid/test-pivot
├─┬ [email protected] invalid: "^1.10.4" from the root project
│ ├── [email protected] deduped invalid: "^2.2.37" from the root project
│ ├─┬ [email protected]
│ │ └── [email protected] deduped invalid: "^2.2.37" from the root project
│ └─┬ [email protected]
│   ├─┬ [email protected]
│   │ ├─┬ [email protected]
│   │ │ └── [email protected] deduped invalid: "^2.2.37" from the root project
│   │ └── [email protected] deduped invalid: "^2.2.37" from the root project
│   ├─┬ [email protected]
│   │ └── [email protected] deduped invalid: "^2.2.37" from the root project
│   └── [email protected] deduped invalid: "^2.2.37" from the root project
└── [email protected] invalid: "^2.2.37" from the root project

jeswr avatar Sep 17 '25 09:09 jeswr

let's shine a light from different perspectives -

I was following this strategy on 87.245.19.109 with Win10 and firefox:

  • using https://podpro.dev and https://ewingson.solidcommunity.net
  • mere login: positive
  • just show (read) public container: positive
  • just show (read) private container: negative
  • edit (write, save) public container: negative
  • edit (write, save) private container: negative (naturally)
  • add new container (both private and public): negative
  • add new resource (test.ttl) [both private and public]: negative

so the thesis that we have an issue with authenticated fetches (=> 401) hardens

ewingson avatar Sep 17 '25 10:09 ewingson

It might be useful to change loggingLevel of the server to debug temporarily. In my experience, CSS log gives quite detailed reasoning for why an authentication may have failed.

I can confirm that this is happening in a random app I develop as well (works fine with a local — as well as my personal — CSS instance); it's unlikely to be a client issue.

mrkvon avatar Sep 17 '25 10:09 mrkvon

@bourgeoa here is the output - I've also messaged you on element

root@solidcommunity:/home/solid/test-pivot# npm ls rdflib
@solid/[email protected] /home/solid/test-pivot
├─┬ [email protected] invalid: "^1.10.4" from the root project
│ ├── [email protected] deduped invalid: "^2.2.37" from the root project
│ ├─┬ [email protected]
│ │ └── [email protected] deduped invalid: "^2.2.37" from the root project
│ └─┬ [email protected]
│   ├─┬ [email protected]
│   │ ├─┬ [email protected]
│   │ │ └── [email protected] deduped invalid: "^2.2.37" from the root project
│   │ └── [email protected] deduped invalid: "^2.2.37" from the root project
│   ├─┬ [email protected]
│   │ └── [email protected] deduped invalid: "^2.2.37" from the root project
│   └── [email protected] deduped invalid: "^2.2.37" from the root project
└── [email protected] invalid: "^2.2.37" from the root project

@jeswr this is not a normal situation

I think there is a version conflict between creating Pivot from Github or from npmjs.

I remember that when I created the migration process the intent was to create solidcommunity.net Pivot from npm but it did not succeed in the migration time frame.

I went back to creating it from github.com: cloning the Pivot source, updating package.json at will, and building the new version when needed.

ODI then took responsibility, and the process was never updated to run from npm as it should. In the current situation it looks like there is a conflict between the two processes. Why, I don't know.

bourgeoa avatar Sep 17 '25 15:09 bourgeoa

The clock on the server was 2 minutes behind. As a result, the iat value in the DPoP token lay in the future relative to the current time on the server, which caused the authorization error.

The clock on the server has been re-synchronised, and I have now been able to log in successfully without receiving this error message. @VirginiaBalseiro could you please check whether this resolves the issue for you?
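The failure mode can be sketched as a simple comparison between the token's iat claim and the verifier's clock. This is an illustration only, not the Community Solid Server's actual check: the function name iat_acceptable and the 60-second default tolerance are assumptions.

```shell
#!/bin/bash
# Hypothetical helper: decide whether a token's iat (issued-at, in seconds
# since the epoch) is acceptable given the verifier's clock and a tolerated skew.
# The 60-second default tolerance is an assumption for illustration.
iat_acceptable() {
    local iat=$1 server_now=$2 skew=${3:-60}
    if (( iat > server_now + skew )); then
        echo "rejected: iat is in the future"
        return 1
    fi
    echo "accepted"
    return 0
}

# Example: the client stamps iat with its own clock, which is 120 seconds
# ahead of the server's clock.
# iat_acceptable 1000120 1000000
```

With a server clock 2 minutes behind the client, every freshly issued token looks as if it came from the future, so a check of this shape rejects it even though the login itself succeeded.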

jeswr avatar Sep 17 '25 18:09 jeswr

The clock on the server was 2 minutes behind

(behind what?)

This kind of issue crops up anywhere machines attach timestamps to anything and/or synchronize interactions, including but not limited to email, netnews, microblogs, "standard" blogs, sharded databases, etc.

The clock on the server has been re-synchronised

If that was just done as a one-off, it's a stopgap solution. It's not a future-proof solution.

Standard practice in such deployments is to set cron jobs to auto-sync the clocks on all machines involved in the deployment. ntp and/or ptp are your friends.
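A rough sketch of a periodic check that could run from cron, in the same spirit as the network watchdog earlier in this thread. It assumes a systemd host where timedatectl show is available; LOG_TAG and the function names are hypothetical. It only detects and logs an unsynchronised clock and does not perform the synchronisation itself.

```shell
#!/bin/bash
# Sketch of a clock-sync watchdog, meant to be run periodically (e.g. from cron).
# LOG_TAG and the function names are hypothetical.
LOG_TAG="clock-watchdog"

# Map the value of the NTPSynchronized property ("yes" or "no") to a status line.
clock_status() {
    if [ "$1" = "yes" ]; then
        echo "clock synchronised"
        return 0
    fi
    echo "clock NOT synchronised"
    return 1
}

# Query systemd and log a warning when the clock is not NTP-synchronised.
check_clock() {
    local synced
    synced=$(timedatectl show --property=NTPSynchronized --value)
    if ! clock_status "$synced" > /dev/null; then
        logger -t "$LOG_TAG" "System clock is not NTP-synchronised"
        return 1
    fi
}

# check_clock   # uncomment to run the check when the script is executed
```

Keeping the interpretation in clock_status separate from the timedatectl call makes the logic easy to exercise without touching the system clock.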

TallTed avatar Sep 17 '25 21:09 TallTed

@kkuffour @jeswr

Great catch. Thank you for taking care of this! 🙏🏼

(edited because I repeated what @jeswr already explained, sorry)

mrkvon avatar Sep 18 '25 09:09 mrkvon

Closing this issue. As for "behind what": behind the clock on my machine, which was the client for the tests I was running.

I suspect the firewalls that we put in place on the droplet were preventing the ntp/ptp synchronisation. I have created internal tickets for us to resolve this and put some alerting in place for the droplet.

jeswr avatar Sep 18 '25 11:09 jeswr

@jeswr Could it be that the clock has drifted again? I've created three new accounts yesterday and today, and am getting 401s when GET'ing their newly-minted WebIDs from both my app and mashlib running on solidcommunity.net.

(Also, I couldn't find a link to this repo from solidcommunity.net, which made it a bit hard to find this issue 😅)

Vinnl avatar Sep 23 '25 08:09 Vinnl

I [...] am getting 401s when GET'ing their newly-minted WebIDs

Strange, WebIDs are usually public, aren't they? They shouldn't respond with 401 — ever — regardless of validity of authentication. (?)

mrkvon avatar Sep 23 '25 12:09 mrkvon

Exactly 😅

Vinnl avatar Sep 23 '25 12:09 Vinnl

Strange, WebIDs are usually public, aren't they? They shouldn't respond with 401 — ever — regardless of validity of authentication. (?)

Not necessarily. Some use cases may involve WebIDs which are in VPNs or other private network spaces. Though this probably isn't part of what you're trying to address. :-)

TallTed avatar Sep 23 '25 13:09 TallTed

The issue has been fixed. Could you check and confirm?

kkuffour avatar Sep 23 '25 14:09 kkuffour