
human-readable random url

Open mokurin000 opened this issue 8 months ago • 6 comments

Example program:

use cgisf_lib::SentenceConfigBuilder;

fn main() {
    let sentence = cgisf_lib::gen_sentence(
        SentenceConfigBuilder::random()
            .plural(false)
            .adjectives(1)
            .adverbs(1)
            .structure(cgisf_lib::Structure::AdjectivesNounVerbAdverbs)
            .build(),
    );
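    // Drop the leading article and the trailing period, then join the
    // remaining words with hyphens to form a URL-friendly slug.
    let sentence = sentence
        .replace("The ", "")
        .trim_end_matches(".")
        .replace(" ", "-");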
    println!("{sentence}");
}

Would output things like:

rust-kiss-alarms-stealthily

Compared to traditional random hash strings (e.g., "GSlZNwBUGKi"), the result is composed of real words in a sentence-like structure, which has these advantages:

  • Memorability - Uses natural language elements (adjective+noun+verb+adverb) that align with human memory patterns.
  • Readability - Word combinations form pseudo-sentence structures (e.g., "rust-kiss-alarms-stealthily" could be interpreted as "rusty kisses stealthily alarm").

mokurin000 avatar Apr 05 '25 10:04 mokurin000

In principle this is a good idea, but your proposal would mean generating new identifiers that are incompatible with the existing ones, unless there is some bijective function I don't know of that maps to and from the existing 32/64-bit identifiers. Storing additional string identifiers is not a viable alternative for me; I'd like to keep the database schema simple and lean.

matze avatar Apr 05 '25 16:04 matze

Not quite related, but I thought of adding alias identifiers with an unambiguous character set, e.g. the one from https://stackoverflow.com/a/58098360. This would avoid confusion between similar-looking characters like I/1/l or 0/O.

With an alphabet of 23 characters, the length would increase from the current 11 to ceil(ln(2^64) / ln(23)) = 15. To avoid clashes with current IDs, the aliases could be queried via /simple/{ID}.
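
A rough sketch of that arithmetic, together with a fixed-width encoding of a 64-bit ID, might look like the following. This is not code from wastebin: the 23-character `ALPHABET` is a made-up placeholder (it is not the set from the linked Stack Overflow answer), and `alias_length`, `encode` and `decode` are hypothetical helper names.

```rust
/// Hypothetical 23-character alphabet, for illustration only. It is NOT the
/// character set from the linked Stack Overflow answer.
const ALPHABET: &[u8; 23] = b"ABCDEFGHJKLMNPQRSTUVWXY";

/// Smallest length L with alphabet_size^L >= 2^bits, which equals
/// ceil(bits * ln 2 / ln alphabet_size). Computed with exact integers.
fn alias_length(alphabet_size: u32, bits: u32) -> u32 {
    assert!(alphabet_size >= 2 && bits <= 64);
    let target: u128 = 1u128 << bits;
    let mut capacity: u128 = 1;
    let mut length = 0;
    while capacity < target {
        capacity *= u128::from(alphabet_size);
        length += 1;
    }
    length
}

/// Encode a 64-bit ID as a fixed-width alias, most significant digit first.
fn encode(mut id: u64) -> String {
    let base = ALPHABET.len() as u64;
    let len = alias_length(ALPHABET.len() as u32, 64) as usize;
    let mut out = vec![ALPHABET[0]; len];
    for slot in out.iter_mut().rev() {
        *slot = ALPHABET[(id % base) as usize];
        id /= base;
    }
    String::from_utf8(out).expect("alphabet is ASCII")
}

/// Decode an alias back into the ID; returns None for characters outside the
/// alphabet or for values that do not fit into 64 bits.
fn decode(alias: &str) -> Option<u64> {
    let base = ALPHABET.len() as u64;
    let mut id: u64 = 0;
    for byte in alias.bytes() {
        let digit = ALPHABET.iter().position(|&c| c == byte)? as u64;
        id = id.checked_mul(base)?.checked_add(digit)?;
    }
    Some(id)
}

fn main() {
    let id = u64::MAX;
    let alias = encode(id);
    println!("{alias} ({} chars)", alias.len());
    assert_eq!(decode(&alias), Some(id));
}
```

Because the encoding is a plain change of base, it stays bijective with the existing 64-bit IDs and needs no extra column in the database.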

cgzones avatar Apr 05 '25 17:04 cgzones

> In principle this is a good idea, but your proposal would mean generating new identifiers that are incompatible with the existing ones, unless there is some bijective function I don't know of that maps to and from the existing 32/64-bit identifiers. Storing additional string identifiers is not a viable alternative for me; I'd like to keep the database schema simple and lean.

AFAIK the current mapping between the ID (the number) and the URL path just masks out 6 bits at a time (or 2/4 bits), and the IDs are generated from random i64 numbers.

I would suggest hashing such short strings with ahash (with hardware acceleration this would be faster than rustc-hash) and getting a u64 via RandomState::hash_one.

The only thing I am not sure about: do we really need a bidirectional mapping between the URL path and the ID number? For example, if a user accesses https://somedomain.tld/long-readable-string-url, we calculate the corresponding ID and use it to query the related data from the database.
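
To illustrate the one-way mapping, here is a minimal sketch that hashes a slug into an i64. Note the substitution: it uses the standard library's `DefaultHasher` instead of ahash so that it compiles without extra crates, and `slug_to_id` is a hypothetical name, not something from the code base. The standard library does not guarantee that `DefaultHasher` produces the same values across Rust releases, so a real implementation would need a hasher with fixed seeds and a stable algorithm.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{BuildHasher, BuildHasherDefault};

/// Map a human-readable slug to the i64 key assumed to be stored in the
/// database. `DefaultHasher::default()` uses fixed keys, so the same slug
/// yields the same ID within one build of the program.
fn slug_to_id(slug: &str) -> i64 {
    let build = BuildHasherDefault::<DefaultHasher>::default();
    // Reinterpret the u64 hash as i64; all 64 bits are preserved.
    build.hash_one(slug) as i64
}

fn main() {
    let slug = "rust-kiss-alarms-stealthily";
    println!("{slug} -> {}", slug_to_id(slug));
}
```

Only the path-to-ID direction is needed for lookups; the slug itself has to be shown to the user at creation time, because it cannot be recovered from the stored number.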

By the way, since the current URL parts can only be 6 or 11 characters long, requiring human-readable IDs to be longer than 11 bytes would prevent possible collisions. In practice the sentence contains 4 words, and it is almost impossible to get four 2-letter words, so the length check is not even required.
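
The length rule could be sketched as below. `PathKind` and `classify` are hypothetical names, and the sketch assumes the path segment has already been stripped of any file extension or leading slash.

```rust
/// How a URL path segment would be interpreted under the length rule.
#[derive(Debug, PartialEq)]
enum PathKind {
    /// Existing encoded identifier (6 or 11 characters).
    Legacy,
    /// Human-readable sentence that would be hashed into an ID.
    Readable,
}

/// Classify a path segment purely by its byte length.
fn classify(segment: &str) -> Option<PathKind> {
    match segment.len() {
        6 | 11 => Some(PathKind::Legacy),
        n if n > 11 => Some(PathKind::Readable),
        _ => None,
    }
}

fn main() {
    for segment in ["GSlZNwBUGKi", "rust-kiss-alarms-stealthily", "abc"] {
        println!("{segment}: {:?}", classify(segment));
    }
}
```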

mokurin000 avatar Apr 05 '25 17:04 mokurin000

> Storing additional string identifiers is not a viable alternative for me

You do not need to store additional identifiers if we just hash them to u64s.

We could let users specify an optional boolean, e.g. human_readable, to get a human-readable URL part, while still storing the IDs as i64.

mokurin000 avatar Apr 05 '25 17:04 mokurin000

> The only thing I am not sure about: do we really need a bidirectional mapping between the URL path and the ID number? For example, if a user accesses https://somedomain.tld/long-readable-string-url, we calculate the corresponding ID and use it to query the related data from the database.

Okay, got your point. There are two issues I still see left:

  1. To not break backwards compatibility we still need to support old identifiers, i.e. need two code paths in the same route. Not cool.
  2. I'd hate to change the default, i.e. after pasting a new item being confronted with a different-looking URL than before, which some people even prefer over long "readable" ones. So it would have to be an opt-in change with yet another configuration variable. Not cool either.

I'm on the fence to be honest.

matze avatar Apr 05 '25 19:04 matze

> > The only thing I am not sure about: do we really need a bidirectional mapping between the URL path and the ID number? For example, if a user accesses https://somedomain.tld/long-readable-string-url, we calculate the corresponding ID and use it to query the related data from the database.
>
> Okay, got your point. There are two issues I still see left:
>
> 1. To not break backwards compatibility we still need to support old identifiers, i.e. need two code paths in the same route. Not cool.
>
> 2. I'd hate to change the default, i.e. after pasting a new item being confronted with a different-looking URL than before, which some people even prefer over long "readable" ones. So it would have to be an opt-in change with yet another configuration variable. Not cool either.
>
> I'm on the fence to be honest.

Okay, I see your concern, so I will leave it in my fork for now. I am working on the implementation.

mokurin000 avatar Apr 05 '25 19:04 mokurin000