pkl icon indicating copy to clipboard operation
pkl copied to clipboard

Inconsistent handling of HTTP redirects and file symlinks

Open translatenix opened this issue 11 months ago • 8 comments

I just did some manual testing. Currently, Pkl handles HTTP redirects and file symlinks differently:

  1. HTTP redirects: A program imports both https://example.com/bar.pkl and https://example.com/redirect-to-bar.pkl.
  • Each import fetches/parses/evaluates the module separately.
  • The resulting modules are incompatible. Their source code could be the same or different.
  • Error and trace messages contain separate URLs.
  1. File symlinks: A program imports both path/to/bar.pkl and path/to/symlink-to-bar.pkl.
  • The module is fetched/parsed/evaluated only once.
  • The resulting modules are compatible. They are the same module.
  • Error and trace messages contain the URL of the import that was accessed first.

In other words: Files are canonicalized, HTTP URLs aren't. I think both should be.

translatenix avatar Feb 26 '24 22:02 translatenix

Hm, good catch. I think the correct behavior here is to treat them as different modules.

Let's say you have a hard link:

// /path/to/a.pkl
foo: String
// /path/to/b.pkl
foo: String

These are treated as different modules:

import "file:///path/to/a.pkl"
import "file:///path/to/b.pkl"

res: a = b // error; expected a but found b

It doesn't make sense to me that making one a symbolic link for the other will make this error go away. The symlink is an implementation detail of the file system, and should be opaque to the language itself.

bioball avatar Feb 27 '24 05:02 bioball

You actually mean hard link, i.e., a.pkl and b.pkl point to the same inode? It's normal and expected for hard links and symbolic links to have different semantics. On the other hand, not canonicalizing symbolic links and HTTP URLs will result in unnecessary work, unnecessary inconsistencies (e.g., file/URL content changes while program is running), and unnecessary errors (e.g., incompatible types). I feel it's just asking for trouble for no real gain, especially in large, distributed programs. If canonicalization of symbolic links caused major confusion, you'd have heard of it already.

Here is another question: Should https://pkg.pkl-lang.org/github.com/jamesward/pklamper/[email protected] and https://github.com/jamesward/pklamper/releases/download/[email protected]/[email protected] be treated as unrelated packages?

Somewhat related: https://go.dev/doc/go1.4#canonicalimports

translatenix avatar Feb 27 '24 07:02 translatenix

Here is another question: Should https://pkg.pkl-lang.org/github.com/jamesward/pklamper/[email protected] and https://github.com/jamesward/pklamper/releases/download/[email protected]/[email protected] be treated as unrelated packages?

Yeah, I think those should be different packages.

The core problem that I see here is that, just by looking at source code, you should be able to tell whether something is treated as the same module (and the same type) or not. If we treated symlinks as in-language aliases, you'd have a pretty hard time figuring out whether the snippet above would error or not.

bioball avatar Feb 29 '24 00:02 bioball

I can't think of a concrete problem caused by having one module per source of truth. I can think of several problems caused by having multiple modules per source of truth, especially at scale and in a nominally typed language. Users won't be surprised if foo.bar.Baz and foo.bar.Baz are compatible, they will be surprised if they aren't. Aliases are already everywhere in the language, and tooling can help to make sense of them.

Looking at the current code, if the goal is to treat redirect-to-url and url as independent modules, it doesn't make sense to do an additional security check for url when accessing redirect-to-url. And the additional resolved URI cache does nothing but create the very problem you're trying to avoid, namely to have the same module for different unresolved URIs.

translatenix avatar Feb 29 '24 02:02 translatenix

FYI: the module name is not actually used for type checking. The module URI is the actual "name" of the type.

You can test this via:

FooA.pkl

module Foo

name: String

FooB.pkl (no symlink)

module Foo

name: String

test.pkl

import "FooA.pkl"
import "FooB.pkl"

res: FooA = FooB

Produces error:

–– Pkl Error ––
Module version conflict: Expected value of type `Foo` defined by module `file:///Users/danielchao/code/apple/pkl/.dan-scripts/FooA.pkl`, but got type `Foo` defined by module `file:///Users/danielchao/code/apple/pkl/.dan-scripts/FooB.pkl`.


4 | res: FooA = FooB
         ^^^^
at test#res (/Users/danielchao/code/apple/pkl/.dan-scripts/test.pkl:4)

4 | res: FooA = FooB
                ^^^^
at test#res (/Users/danielchao/code/apple/pkl/.dan-scripts/test.pkl:4)

106 | text = renderer.renderDocument(value)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
at pkl.base#Module.output.text (https://github.com/apple/pkl/blob/0.25.2/stdlib/base.pkl#L106)

I can't think of a real-world scenario where treating symlinks as a different module is actually a problem. Can you illustrate one?

Looking at the current code, if the goal is to treat redirect-to-url and url as independent modules, it doesn't make sense to do an additional security check for url when accessing redirect-to-url.

This is a security concern; Pkl shouldn't import sources that aren't allowed by --allowed-modules. If there is a redirect, we want to check the redirect as well.

And the additional resolved URI cache does nothing but create the very problem you're trying to avoid, namely to have the same module for different unresolved URIs.

Yeah, maybe the modulesByResolvedUri cache should go away.

bioball avatar Feb 29 '24 17:02 bioball

Some tests:

Python

a/__init__.py

class Foo:
  def __init__(self):
    pass

b/__init__.py (symlink)

../a/__init__.py

test.py

from a import Foo
from b import Foo as Foo2

print Foo == Foo2
python test.py
# => False

Node.js

lib/a.mjs

export const bar = { a: "hello" }

lib/b.mjs (symlink)

a.mjs

test.mjs

import { bar } from "./lib/a.mjs";
import { bar as barB } from "./lib/b.mjs";

console.log(bar === barB);
node test.mjs
# => true

Go

a/a.go

package a

type Foo struct {
	Prop string
}

b/b.go (symlink)

../a/a.go

main.go

package main

import (
	"fmt"
	"foo/a"
	b "foo/b"
)

func main() {
	var prop1 a.Foo
	prop1 = b.Foo{}
	fmt.Sprintf("Foo: %s", prop1)
}

go.mod

module foo
$ go run main.go
# command-line-arguments
./main.go:11:10: cannot use b.Foo{} (value of type "foo/b".Foo) as "foo/a".Foo value in assignment

Ruby

lib/a.rb

class MyClass
end

lib/b.rb (symlink)

a.rb

test.rb

require_relative "lib/a"
first_my_class = MyClass
require_relative "lib/b"
second_my_class = MyClass

puts first_my_class == second_my_class
$ ruby test.rb
# => true

bioball avatar Feb 29 '24 17:02 bioball

Seems like a toss-up; Go and Python treat symlinks as their own files; Node and Ruby treat them as the same file.

bioball avatar Feb 29 '24 17:02 bioball

Comparing with other languages is a good idea. Ideally we could compare with other nominally typed config languages.

I can't think of a real-world scenario where treating symlinks as a different module is actually a problem. Can you illustrate one?

I think symlinks and redirects are ultimately the same problem. Apart from the incompatible types problem that I'll discuss further below, I think that seeing multiple "versions" of a module in a single program run because the module's source code changes externally is never beneficial. Another concern is the time and memory cost of downloading and evaluating the same source of truth multiple times in a single program run.

This is a security concern; Pkl shouldn't import sources that aren't allowed by --allowed-modules. If there is a redirect, we want to check the redirect as well.

I feel that these two decisions don't go well together:

  • redirect-to-url should not be the same module as url because that would make things difficult to understand.
  • Users must understand that redirect-to-url redirects to url when specifying --allowed-modules.

Here is the kind of problem I expect to see at scale when not canonicalizing module/package URIs:

  • Your program uses package-a, which imports module-x via module-url and exports values of module-x types.
  • Your program uses package-b, which imports module-x via redirect-to-module-url and exports values of module-x types.

Result: You have no way of exchanging values of module-x types between package-a and package-b, and no way to enforce a single version for module-x, only because the two packages, which may not be aware of each other, don't use the same canonical URL for module-x (or its package). You spend the night debugging and trying to work around this.

I feel it's unnecessary to invite this kind of problem. (In a structurally typed language, you wouldn't notice as long as package-a and package-b import structurally compatible versions of module-x.)

Whether to canonicalize module and package URIs feels like a critical decision with far-reaching consequences. Would be great to hear more opinions.

translatenix avatar Feb 29 '24 22:02 translatenix