julia icon indicating copy to clipboard operation
julia copied to clipboard

RFC: Add `RelocPath` type for userdefined relocatable paths

Open fatteneder opened this issue 1 year ago • 6 comments

RelocPath(path) works by splitting path as depot/subpath based on DEPOT_PATH during construction.

When using RelocPath one explicitly opts into relocation, so it throws when path cannot be split.

A relocated path can be queried with String(r::RelocPath), which also throws on failure.


Fixes #54430 Alternative to #55146

fatteneder avatar Oct 08 '24 21:10 fatteneder

What's the idiomatic usage pattern for this look like?

topolarity avatar Oct 09 '24 13:10 topolarity

The example where @__RELOCDIR__ fails https://github.com/JuliaLang/julia/pull/55146#issuecomment-2302409741 would become now:

module MyPkg

const data = RelocPath(joinpath(@__DIR__, "myfile.data"))

function read_data()
    path = String(data)
    # load from path ...
end

end

fatteneder avatar Oct 09 '24 14:10 fatteneder

Is this how we get a Path type finally. I'd love that

gbaraldi avatar Oct 09 '24 18:10 gbaraldi

~~It would be nice to get this into Julia 1.12.~~

EDIT: Given the timeline, I don't think this can make it into 1.12. Maybe we can merge it into master early in 1.13, so that it gets some time to be tested out.

DilumAluthge avatar Oct 17 '24 23:10 DilumAluthge

Should we have a way of constructing a RelocPath by just providing the subpath?

DilumAluthge avatar Oct 18 '24 02:10 DilumAluthge

Is this how we get a Path type finally. I'd love that

FWIW there is already https://juliahub.com/ui/Packages/General/FilePathsBase and https://juliahub.com/ui/Packages/General/FilePaths which provide such types.

Should we have a way of constructing a RelocPath by just providing the subpath?

I am not sure.

fatteneder avatar Oct 18 '24 20:10 fatteneder

@vchuravy @gbaraldi and @topolarity Would you be able to review this PR?

As far as I can tell, this PR is an improvement compared to #55146, but I'm not particularly familiar with the relevant part of the codebase.

DilumAluthge avatar Oct 21 '24 00:10 DilumAluthge

@staticfloat IIRC, you also worked on the depot-relocatibility code; it would be great if you could review this PR.

DilumAluthge avatar Oct 21 '24 00:10 DilumAluthge

Is this how we get a Path type finally. I'd love that

The RelocPath constructor will error if the path isn't inside a depot, right? So IIUC this doesn't really cover the general "path type" request.

DilumAluthge avatar Oct 21 '24 00:10 DilumAluthge

Should we just call this DepotPath? I think that might be a clearer name for users

topolarity avatar Oct 26 '24 13:10 topolarity

Should we just call this DepotPath? I think that might be a clearer name for users

No strong feelings from my side.

fatteneder avatar Oct 28 '24 16:10 fatteneder

I also don't feel strongly between RelocPath and DepotPath.

We could also do RelocatablePath if we wanted something a little more unambiguous.

DilumAluthge avatar Oct 29 '24 09:10 DilumAluthge

xref: https://discourse.julialang.org/t/designing-a-paths-julep/124335

Specifically the p"@/... syntax that is intended to create a relocatable project-relative path (currently only in the design doc, not yet implemented).

tecosaur avatar Jan 03 '25 13:01 tecosaur

Specifically the p"@/... syntax that is intended to create a relocatable project-relative path (currently only in the design doc, not yet implemented).

Hmmm, that sounds slightly different that the RelocPath in this PR, right? The RelocPath in this PR is depot-relative (not project-relative).

DilumAluthge avatar Jan 03 '25 14:01 DilumAluthge

@fatteneder Can you rebase and resolve the merge conflicts?

DilumAluthge avatar Jan 03 '25 14:01 DilumAluthge

Hmmm, that sounds slightly different that the RelocPath in this PR, right? The RelocPath in this PR is depot-relative (not project-relative).

Functionally, there is a difference in the approaches taken here/proposed, but as far as I can tell there's very much overlap in the intended usage. Both fix the linked issue, and in this respect are alternatives.

tecosaur avatar Jan 03 '25 15:01 tecosaur

Hmmm, that sounds slightly different that the RelocPath in this PR, right? The RelocPath in this PR is depot-relative (not project-relative).

Functionally, there is a difference in the approaches taken here/proposed, but as far as I can tell there's very much overlap in the intended usage. Both fix the linked issue, and in this respect are alternatives.

I'm not sure I follow. I think I am probably misunderstanding what you mean by "project-relative path".

I'll try to highlight my confusion. Here's an example use case. Suppose you have two packages, PkgA and PkgB.

In the Git repo for PkgA, there is a file named PkgA/data/stuff.data. This file is some data that I ship in the PkgA repo.

In the Git repo for PkgA, this is PkgA/src/PkgA.jl:

module PkgA

const data = RelocPath(joinpath(dirname(@__DIR__), "data", "stuff.data")) # PkgA/data/myfile.data

function f()
    path = String(data)
    # load from path ...
end

end # module

This is based on the example posted in https://github.com/JuliaLang/julia/pull/55146#issuecomment-2302409741 and https://github.com/JuliaLang/julia/pull/56053#issuecomment-2402576269.

In the Git repo for PkgB, this is PkgB/Project.toml:

name = "PkgB"

[deps]
PkgA = "..."

In the Git repo for PkgB, this is PkgB/src/PkgB.jl:

module PkgB

import PkgA

function g()
    # ...
    x = PkgA.f()
    * ...
end

end # module

Suppose that I am running on machine_1.

On machine_1, my Julia depot is /home1/user1/foo/bar/mydepot.

On machine_1, I create a new Julia project at /home1/user1/hello/world/myproj/Project.toml:

# myproj/Project.toml

[deps]
PkgB = "..."

Now I build myproj into a custom sysimage. I want to give this custom sysimage to a different user, who will run it on their machine_2. So I put the sysimage in a tarball; I also put my Julia depot into that same tarball, because the user will need it.

So the user extracts the tarball on their machine_2, which obviously won't have the same paths as machine_1. Suppose they end up with the depot extracted to /home2/user2/blah/blah/blah/mydepot, and the project extracted to /home2/user2/various/things/here/myproj/Project.toml.

The user makes sure that /home2/user2/blah/blah/blah/mydepot is in their JULIA_DEPOT_PATH, and then they start Julia with the custom sysimage. And now they want to run the PkgB.g() function, which means that PkgA.f() will end up getting called. So PkgA needs to be able to find the PkgA/data/stuff.data file.

If we hadn't used RelocPath(), the old absolute path of this file on machine_1 (/home1/user1/foo/bar/mydepot/packages/PkgA/slug/data/stuff.data) would have been baked into the custom sysimage.

But we used RelocPath(). So, when PkgA.f() is run, the path = String(data::RelocParth) line will look through the depot path, and it will find that the desired file exists on machine_2 at location /home2/user2/blah/blah/blah/mydepot/packages/PkgA/slug/data/stuff.data.

So, if we use RelocPath() as implemented in this PR (depot-relative), then we find our file, and everything works.

Would the proposed p"@/... syntax have the same behavior, i.e. would it also find the file at the new location (/home2/user2/blah/blah/blah/mydepot/packages/PkgA/slug/data/stuff.data)?

My interpretation of the phrase "project-relative path" was that p"@... would be looking relative to the active project (/home1/user1/hello/world/myproj/Project.toml on the old machine, /home2/user2/various/things/here/myproj/Project.toml on the new machine). In this example, the directory containing the active project has no relationship to the stuff.data file on either the old machine or the new machine, so looking relative to the active project wouldn't yield any fruit. So that's the source of my confusion.

DilumAluthge avatar Jan 04 '25 04:01 DilumAluthge

@fatteneder In the current implementation in this PR, in the String(::RelocPath) method, as soon as we find a path where the subpath matches, we accept that path and return.

It occurs to me that, for correctness purposes, we should have more stringent requirements. Instead of only requiring that the subpath matches, I propose that we first check if the subpath matches, and then if the subpath matches, we also require that the checksum matches. If the checksum doesn't match, we reject the path.

This means that at RelocPath() construction time, we have to compute the checksum, and thus the RelocPath struct needs a field for storing the checksum:

struct RelocPath
    subpath::String
    checksum::Union{String, Nothing}
end

When constructing a RelocPath, if the input path is a file, then we just checksum the file. If the input path is a directory, we have to compute the hash of the directory - probably something like this:

artifact_hash = SHA1(GitTools.tree_hash(the_directory))

Which is taken from this section of Pkg.Artifacts.

If the input path is neither a file nor a directory (e.g. it's a symlink), then tbh idk - throw an error maybe? I don't know if that's a real use case that we need to support, at least not in the initial implementation.

I think we should allow opting out of this behavior at two different locations:

  1. At RelocPath construction time, the user can opt-out by doing RelocPath("..."; record_checksum = false). In this case, the newly constructed RelocPath object will have .checksum = nothing stored. For such a RelocPath, whenever String(::RelocPath) is called, we don't compute or verify any checksums, we just accept (and return) the path as soon as we find a path where the subpath matches - in other words, the current behavior of this PR.
    • The default value of the kwarg would be record_checksum = true.
  2. At the String(::RelocPath) callsite. In this case, the user can opt-out by doing String(reloc_path::RelocPath; verify_checksum = false). And if verify_checksum = false, then we don't compute or verify any checksums.
    • The default value of the kwarg would be verify_checksum = true.

What do you think? I feel like this would increase our confidence in the correctness of the file or directory that we return.

DilumAluthge avatar Jan 04 '25 04:01 DilumAluthge

I think this makes senses.

For include(), include_dependency() we already use the same logic for checksum computation, but with Base only functions: hash = isdir(path) ? _crc32c(join(readdir(path))) : open(_crc32c, path, "r"). I think this also works with symlinks, as they are first opened, although we do not have a test for that.

I would go for RelocPath(; track_content), like include_dependency, and then maybe String(; verify_content).

fatteneder avatar Jan 04 '25 16:01 fatteneder

We might also want to promote the object that RelocPath(; track_content=true) is referring to to an include_dependency, so that when PkgA/data/stuff.data changes we recompile PkgA to update the cached hash.

fatteneder avatar Jan 04 '25 17:01 fatteneder

Would the proposed p"@/... syntax have the same behavior, i.e. would it also find the file at the new location (/home2/user2/blah/blah/blah/mydepot/packages/PkgA/slug/data/stuff.data)?

The current idea is that it is relative to the parent project of the code file it occurs in. E.g. in PkgA/src/file.jl instead of doing

RelocPath(joinpath(dirname(@__DIR__), "data", "stuff.data"))

you do

p"@/data/stuf.data"

And the current intent (the Paths proposal is still being written), is that these would behave the same way.

tecosaur avatar Jan 04 '25 17:01 tecosaur

We might also want to promote the object that RelocPath(; track_content=true) is referring to to an include_dependency, so that when PkgA/data/stuff.data changes we recompile PkgA to update the cached hash.

I'm not sure that recompilation is what we want here, particularly in the sysimage case. I think we should just throw an error if there is a hash mismatch.

DilumAluthge avatar Jan 08 '25 23:01 DilumAluthge

I now implemented a track_content::Bool keyword.

I agreed with Timothy that a future path type could easily wrap around this RelocPath by giving it a suitable supertype and hide it behind a string macro.

What remains to be done:

  • Decide on a name, we these suggestions: RelocPath, DepotPath, RelocatablePath
  • Jameson asked for using OncePerProcess above, can't say much about it, someone else needs to judge if its worth.

fatteneder avatar Nov 08 '25 21:11 fatteneder