RFC: URIs for internal links

Open dtdesign opened this issue 1 year ago • 5 comments

Links to internal pages, such as quotes, use absolute URLs. These work fine in general but are a real pain when moving to a different domain. It also means that a simple search & replace of URLs can be quite dangerous and error-prone.

We could change the internal link format into something that can be resolved at runtime, that follows a strict schema, and that makes it easy to detect dead links. This would even allow us to dynamically strip links that point to resources that no longer exist.

Custom Protocol Based on RFC 7595

RFC 3986 permits a limited set of characters for the scheme, but RFC 7595 section 3.8 explicitly recommends the use of a reverse domain name for private schemes.

com.woltlab.v<version>://<packageIdentifier>/<object>/<optionalParameterForId>?optional=queryString
  • version is an integer that allows changes to the spec later on.
  • packageIdentifier matches the name found in the package.xml.
  • object is the name of the object, similar to objectType.xml.
  • optionalParameterForId is an optional, case-sensitive alphanumeric string ([a-zA-Z0-9]+).
  • The query string can contain arbitrary arguments, for example, to resolve smileys based on their code.

Extra parameters are allowed through the optional query string.

Examples

# Permalink to a post
com.woltlab.v1://com.woltlab.wbb/post/12345

# Link to a user profile
com.woltlab.v1://com.woltlab.wcf/user/12345

# Resolve a smiley by its code
com.woltlab.v1://com.woltlab.wbb/smiley/?code=%3Ajoy%3A
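
To illustrate how strict the format is, here is a minimal parsing sketch. The function name `parseInternalUri()` and the exact character classes for the package and object names are assumptions made for this example and are not part of the proposal.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: splits an internal URI into its components or returns
 * null when the format is not recognized.
 *
 * @return ?array{version: int, package: string, object: string, id: ?string, parameters: array<string, mixed>}
 */
function parseInternalUri(string $uri): ?array
{
    // Assumed character classes; the RFC only fixes the id to [a-zA-Z0-9].
    $pattern = '~^com\.woltlab\.v(?<version>[0-9]+)://'
        . '(?<package>[a-z0-9]+(?:\.[a-z0-9]+)+)/'
        . '(?<object>[A-Za-z][A-Za-z0-9]*)'
        . '(?:/(?<id>[A-Za-z0-9]*))?'
        . '(?:\?(?<query>.*))?$~';

    if (\preg_match($pattern, $uri, $matches) !== 1) {
        return null;
    }

    // `parse_str()` also decodes percent-encoded values such as `%3Ajoy%3A`.
    $parameters = [];
    \parse_str($matches['query'] ?? '', $parameters);

    $id = $matches['id'] ?? '';

    return [
        'version' => (int)$matches['version'],
        'package' => $matches['package'],
        'object' => $matches['object'],
        'id' => $id !== '' ? $id : null,
        'parameters' => $parameters,
    ];
}
```

With this sketch, the post example yields the package `com.woltlab.wbb`, the object `post` and the id `12345`, while the smiley example yields no id and the decoded parameter `code` with the value `:joy:`.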

Implementation

Data Structure for Internal Links

final class InternalLink implements \Stringable
{
    public static function fromString(string $url): ?self
    {
        // If the link is not recognized, return `null`.
    }

    /**
     * @param ?array<string, string> $parameters
     */
    public static function fromValues(
        int $version,
        string $package,
        string $object,
        ?string $id = null,
        ?array $parameters = []
    ): self {}

    public function __toString(): string {}

    private function __construct(/* yada yada yada */) {}
}
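
As a rough idea of how the serialization side could look, the following sketch fills in `fromValues()` and `__toString()` only. The class name `InternalLinkSketch`, the validation rules and the use of PHP 8.1 readonly properties are assumptions for illustration; the skeleton above deliberately leaves these details open.

```php
<?php

declare(strict_types=1);

// Sketch only: a simplified stand-in for the proposed `InternalLink`.
final class InternalLinkSketch implements \Stringable
{
    /**
     * @param array<string, string> $parameters
     */
    private function __construct(
        public readonly int $version,
        public readonly string $package,
        public readonly string $object,
        public readonly ?string $id,
        public readonly array $parameters,
    ) {}

    /**
     * @param array<string, string> $parameters
     */
    public static function fromValues(
        int $version,
        string $package,
        string $object,
        ?string $id = null,
        array $parameters = [],
    ): self {
        // Assumed validation rules, derived from the format description.
        if ($version < 1) {
            throw new \InvalidArgumentException('The version must be a positive integer.');
        }
        if ($id !== null && \preg_match('~^[a-zA-Z0-9]+$~', $id) !== 1) {
            throw new \InvalidArgumentException('The id must be alphanumeric.');
        }

        return new self($version, $package, $object, $id, $parameters);
    }

    public function __toString(): string
    {
        $uri = \sprintf(
            'com.woltlab.v%d://%s/%s/%s',
            $this->version,
            $this->package,
            $this->object,
            $this->id ?? ''
        );

        if ($this->parameters !== []) {
            // RFC 3986 encoding turns `:joy:` into `%3Ajoy%3A`.
            $uri .= '?' . \http_build_query($this->parameters, '', '&', \PHP_QUERY_RFC3986);
        }

        return $uri;
    }
}
```

Calling `InternalLinkSketch::fromValues(1, 'com.woltlab.wbb', 'smiley', null, ['code' => ':joy:'])` and casting the result to a string produces `com.woltlab.v1://com.woltlab.wbb/smiley/?code=%3Ajoy%3A`, which matches the smiley example.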

Link Recognition

Links must be recognized when rebuilding data or when processing a message from the client. Every registered handler must provide a link template similar to AbstractHtmlInputNodeProcessorListener::getRegexFromLink(); however, it is up to the handler to decide what the regex matches.

All matching links will be collected and passed to the handler, asking for a result for each link. If the supplied values are insufficient to map this to an actual object, null must be returned to keep the link as-is.

use wcf\system\Regex;

interface InternalLinkProcessor
{
    public function getLinkRegex(): Regex;

    /**
     * @param array<string, string|int> $matches
     */
    public function evaluate(array $matches): ?InternalLink;

    /**
     * Return a value for each link passed to it. If the link matches an object,
     * but the user has no access to it, return a link that does not expose any
     * data beyond what is already contained in the `InternalLink`. If the
     * referenced object no longer exists, return `null` to have it stripped
     * from the output.
     *
     * @param list<InternalLink> $links 
     * @return list<?string>
     */
    public function resolve(array $links): array;
}
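
To make the flow a bit more tangible, here is a self-contained sketch of a handler for posts. It is not an implementation of the interface above: plain strings stand in for `Regex` and `InternalLink`, the class name `PostLinkProcessorSketch` and the public URL layout are made up, and an in-memory array replaces the batched database lookup a real handler would perform.

```php
<?php

declare(strict_types=1);

// Sketch only: plain strings stand in for `Regex` and `InternalLink`, and an
// in-memory array stands in for a batched database lookup.
final class PostLinkProcessorSketch
{
    /**
     * @param array<int, string> $postUrls hypothetical map of post id => public URL
     */
    public function __construct(private readonly array $postUrls) {}

    public function getLinkRegex(): string
    {
        return '~^https://www\.example\.com/forum/post/(?<id>[0-9]+)/?$~';
    }

    /**
     * @param array<int|string, string> $matches
     */
    public function evaluate(array $matches): ?string
    {
        if (!isset($matches['id'])) {
            // Not enough information to identify a post: keep the link as-is.
            return null;
        }

        return 'com.woltlab.v1://com.woltlab.wbb/post/' . $matches['id'];
    }

    /**
     * @param list<string> $links
     * @return list<?string>
     */
    public function resolve(array $links): array
    {
        return \array_map(function (string $link): ?string {
            if (\preg_match('~/post/([0-9]+)$~', $link, $matches) !== 1) {
                return null;
            }

            // An unknown id means the post no longer exists.
            return $this->postUrls[(int)$matches[1]] ?? null;
        }, $links);
    }
}
```

The result of `resolve()` has one entry per input link, in the same order, so the caller can strip every link whose entry is `null`.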

Unresolved Problems

Resolving a link requires knowledge about the user who is about to access the link. This is easy at runtime for a viewer because the current user is known and implementations rely on WCF::getUser() to derive that.

However, this is a problem in situations where the links need to be processed but the target user differs from the current user. Examples are notifications and emails, which are sent asynchronously and therefore must not rely on WCF::getUser().

This is a problem because checking for access is required to avoid leaking any sensitive information. Right now you could paste a crafted link in a message and it would be output as-is. The new system would allow you to paste a crafted link containing only the object id, and the link generation would yield a link that may contain sensitive information, for example, the title of the object.
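
The leak can be shown with a small sketch. Everything in it is hypothetical: the function `resolveThreadLink()`, the shape of the thread record and the URL layout exist only for this example. The viewer is passed in explicitly purely to make the dependency on the user visible; this is not a proposed solution.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical resolver: `$viewerId` is an explicit argument instead of being
 * read from global state such as `WCF::getUser()`.
 *
 * @param array{title: string, readers: list<int>} $thread hypothetical record
 */
function resolveThreadLink(int $threadId, array $thread, int $viewerId): string
{
    $base = 'https://www.example.com/forum/thread/' . $threadId;

    // Without access, expose nothing beyond the id the link already contains.
    if (!\in_array($viewerId, $thread['readers'], true)) {
        return $base . '/';
    }

    // With access, the title may become part of the URL.
    $slug = \trim((string)\preg_replace('~[^a-z0-9]+~', '-', \strtolower($thread['title'])), '-');

    return $base . '-' . $slug . '/';
}
```

If such a function were resolved for the wrong user, for example the sender of a notification instead of its recipient, the title would end up in a link shown to someone without access.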

dtdesign avatar May 11 '24 14:05 dtdesign

I take it the intent with this is that on page load these links would be rewritten into https://www.example.com/forum/post/12345 or whatever it happens to be at the end?

I like the format of the woltlab:// urls. Curious to see where this goes.

Edit: Question regarding user UX. Say I want to use this to insert a permalink. As a technical user, I would like to use the woltlab:// protocol in links to threads and other content, and I would like to be able to use this in the ACP in various places which accept links. Is this an intended use case?

Also, what about non-technical users? Will manually created links from one thread to another, for example, be rewritten to the woltlab:// URL when the thread is posted, or will they be hard-coded and only updated when the thread content is updated?

TheBrambleShark avatar Jun 25 '24 14:06 TheBrambleShark

Yes, the replacement should take place at runtime on the server, resolving the logical links to their actual URLs. The same process takes place in reverse when processing input messages on the server, mapping actual URLs to their logical representation.

That said, these are meant to be a technical detail and not something you would ever face as a user. The best example I can come up with is BBCodes: those are stored as <woltlab-metacode> HTML elements on the server, but as a user you will either interact with their rendered representation or the familiar BBCode syntax.

The whole purpose is to avoid hardcoding URLs entirely so that they can easily be replaced at runtime or during an import, while at the same time using the semantics of URLs instead of inventing our own format. I can even see this expanded to things like attachments (woltlab://v1/core/attachment/12345?thumbnail=1) or smileys (woltlab://v1/core/smiley/?code=%3Ajoy%3A), although I am not entirely sure about the exact format.

Also, I think that the <object> part should use the plural form to align with the naming scheme for RPC endpoints.

dtdesign avatar Jun 25 '24 14:06 dtdesign

After talking to @TimWolla a bit, we came to the conclusion that it makes things a lot easier to move the version into the scheme and the package identifier into the hostname part. This simplifies things a lot and makes for an easy mapping.

RFC 3986 permits a limited set of characters for the scheme, but RFC 7595 section 3.8 explicitly recommends the use of a reverse domain name for private schemes.

com.woltlab.v1://<packageIdentifier>/<object>/<optionalParameterForId>?optional=queryString

Some example URIs:

# Permalink to a post
com.woltlab.v1://com.woltlab.wbb/post/12345

# Link to a user profile
com.woltlab.v1://com.woltlab.wcf/user/12345

# Resolve a smiley by its code
com.woltlab.v1://com.woltlab.wbb/smiley/?code=%3Ajoy%3A

dtdesign avatar Jul 05 '24 15:07 dtdesign

RFC 3986 permits a limited set of characters for the scheme, but RFC 7595 section 3.8 explicitly recommends the use of a reverse domain name for private schemes.

Looked over the RFCs. I like the idea of putting the version in the scheme.

RFC 3986 section 1.1.2 provides a versioned example that may be good to follow: urn:oasis:names:specification:docbook:dtd:xml:4.1.2

I might even suggest putting the package in the scheme, based on that example.

Examples

# Permalink to a post
wbb:com.woltlab:1.0.0://com.woltlab.api/post/12345

# Permalink to a support ticket (from an addon)
tickets:com.example:2.3.5://com.woltlab.api/ticket/12345

This would make the new scheme take the following format:

<package>:<namespace>:<version>://com.woltlab.api/<object>/<optionalParameterForId>?optional=queryString

Using the com.woltlab.api as the address specification is mostly just to help with aesthetics, and the exact text label can be discussed, but I feel this may be the best way to remain compliant with the specification.

TheBrambleShark avatar Jul 05 '24 19:07 TheBrambleShark

Thank you for your feedback, although I think we’re trying to solve different problems here. The versioning is meant to allow us to handle all URLs in a centralized location, and that requires the URI to be predictable. This means that all URIs will have an identical structure, but in case we need to make changes, we can do so by incrementing the version while maintaining backwards compatibility.

Also, it looks like you’re mixing up two different types of URIs: the colon-separated form and the :// form. You cannot mix and match them; they exist in parallel and solve different problems. The :// variant is used to map to a hierarchy, whereas the colon notation is more like a unique identifier.
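
A quick way to see the difference is to apply the scheme grammar from RFC 3986 (a letter followed by letters, digits, "+", "-" or ".") to the candidates. The helper names `extractScheme()` and `hasAuthority()` are made up for this sketch.

```php
<?php

declare(strict_types=1);

/** Returns the scheme according to the RFC 3986 grammar, or null if there is none. */
function extractScheme(string $uri): ?string
{
    return \preg_match('~^([A-Za-z][A-Za-z0-9+.-]*):~', $uri, $matches) === 1
        ? $matches[1]
        : null;
}

/** Checks whether the scheme is directly followed by `//`, i.e. an authority part. */
function hasAuthority(string $uri): bool
{
    return \preg_match('~^[A-Za-z][A-Za-z0-9+.-]*://~', $uri) === 1;
}
```

Because a colon cannot be part of the scheme, `wbb:com.woltlab:1.0.0://com.woltlab.api/post/12345` has the scheme `wbb` and no authority part, so `com.woltlab.api` would not be treated as a hostname. In contrast, `com.woltlab.v1://com.woltlab.wbb/post/12345` has the scheme `com.woltlab.v1` followed by an authority.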

dtdesign avatar Jul 23 '24 12:07 dtdesign