crates.io icon indicating copy to clipboard operation
crates.io copied to clipboard

Implement token-based sessions.

Open adsnaider opened this issue 3 years ago • 11 comments

This PR sets up the session table (called persistent_session because diesel... https://github.com/diesel-rs/diesel/issues/3055). Every time a user is logged in, a session will be created alongside the cookie that the user will use for authentication. This PR is the initial step for fixing https://github.com/rust-lang/crates.io/issues/2630

adsnaider avatar Apr 05 '22 02:04 adsnaider

:umbrella: The latest upstream changes (presumably #4678) made this pull request unmergeable. Please resolve the merge conflicts.

bors avatar Apr 06 '22 09:04 bors

Made some updates. Still need to:

  • [ ] Partition the table by date
  • [x] Split the table to store the last_used data in a different table

adsnaider avatar Apr 16 '22 00:04 adsnaider

@jsha I got a couple of questions for you :)

  1. I agree partitioning would be useful. Do you think it would be best to have a creation_date column and partition by that. Then we can automatically clean up old partitions by date? How can we achieve something similar partitioning by the primary key?
  2. Since you mentioned that it might be useful to store the last_used data in a different (non-essential) table in case we need to disable writes, why would this be better than just storing NULL values for those fields in the sessions table instead?

adsnaider avatar Apr 16 '22 00:04 adsnaider

Do you think it would be best to have a creation_date column and partition by that. Then we can automatically clean up old partitions by date? How can we achieve something similar partitioning by the primary key?

I hadn't noticed that there's no creation_date column. That's probably a good idea.

Since the primary key is SERIAL, we expect that the primary key id maps linearly to creation dates. To check if a partition in pruneable, we should be able to look at the creation date of the highest id in that partition. If that's past the expiration time, the whole partition can be dropped.

The best practices for partitioning choices are documented here: https://www.postgresql.org/docs/current/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-BEST-PRACTICES

One of the most critical design decisions will be the column or columns by which you partition your data. Often the best choice will be to partition by the column or set of columns which most commonly appear in WHERE clauses of queries being executed on the partitioned table. WHERE clauses that are compatible with the partition bound constraints can be used to prune unneeded partitions.

Since we're now looking up by the primary key, partitioning by that key seems like the best choice. This also saves us from needing another index.

Since you mentioned that it might be useful to store the last_used data in a different (non-essential) table in case we need to disable writes, why would this be better than just storing NULL values for those fields in the sessions table instead?

I haven't fully solidified my thinking, but in broad strokes I'm thinking about how much of the session table we can afford to fit in the DB's memory, and also about locking and update rates. If the non-essential (and potentially large) fields are in another table, we wouldn't necessarily need to keep them in memory. On the other hand, writing to them constantly might wind up keeping them in memory anyhow (to the detriment of other tables). I'm also slightly concerned that constantly writing to session rows could cause contention on reads of those session rows. In general, I'm having second thoughts on the idea of recording user-agent and IP on every access, because I suspect it will be very expensive. It's a good feature to have, but I want to maximize the chances of success for the core feature of revocable sessions. What do you think of omitting that part for now?

Also, fair warning: I am no Postgres expert. I'm taking what I've learned on MariaDB/MySQL and extrapolating/reading the docs.

jsha avatar Apr 16 '22 03:04 jsha

Here's another way of thinking about the access IP and user-agent: For the attack we're most worried about (compromised GitHub account), the IP and user-agent of the login are just about as useful as the IP and user-agent of the last access. So we could store that info at login to help people figure out if they have any inappropriate sessions.

jsha avatar Apr 16 '22 03:04 jsha

I hadn't noticed that there's no creation_date column

We don't have creation_date, but we have created_at (which is a timestamp)

I agree with starting with the basics for now. I'm thinking of keeping the created_at column so that we can still invalidate stale sessions but I'll get rid of the last_used_at, last_used_ip, and last_user_agent columns.

adsnaider avatar Apr 16 '22 04:04 adsnaider

@jsha I'm having trouble figuring out how partitioning will fit alongside the rest of the infrastructure. I can define the table to be partitioned by primary key but the partitions themselves need to be setup with SQL code (not from Rust). I have some ideas how this could work, so let me know what you think.

Assuming we can have some sort of cronjob in the server, I could set one up daily that reads the latest session (not sure how yet, maybe just select all and sort descending or something), and create a partition from the highest ID in the previous partition and the current highest. Then, we can have another cronjob that looks at the partitions starting from the oldest and deletes the ones that are stale (by whatever definition of stale we decide upon). I'm not sure that we can even know what partitions currently exist with SQL.

However, I think it might be worth considering having the partitions be ranged by creation_date and here's why: we don't need an index as this will be done with a cronjob once a day so efficiency isn't necessarily a priority (do you agree with that reasoning? Maybe DB locks will be a problem). The advantage of doing the partition by creation_date is that it's very easy to create and delete the older partitions without having to do any sorting to figure out if the partition should be deleted. Additionally, it would be easy to define such cronjob without having to figure out what partitions currently exist (we can just say look at the partition with date = today - x for whatever we decide xshould be.

What do you think?

adsnaider avatar Apr 17 '22 18:04 adsnaider

:umbrella: The latest upstream changes (presumably ca82d242e85312438530adaac1e22ec9758a282c) made this pull request unmergeable. Please resolve the merge conflicts.

bors avatar May 09 '22 07:05 bors

@bjorn3 or @pietroalbini do either of you know who I should contact in order to setup automatic creation/deletion of the partitions as a cron job within the server?

adsnaider avatar May 20 '22 21:05 adsnaider

:umbrella: The latest upstream changes (presumably 64feb48880e4799e74e2cde8646411fa41cb7c22) made this pull request unmergeable. Please resolve the merge conflicts.

bors avatar May 31 '22 07:05 bors

@adsnaider there is already a daily cronjob whose source code is https://github.com/rust-lang/crates.io/blob/master/src/worker/daily_db_maintenance.rs. I guess you can probably add it there?

Also, note that I'm not on the crates.io team, so it's up to them to review the change :slightly_smiling_face:

pietroalbini avatar Jun 04 '22 13:06 pietroalbini