
Add CompUnit::Repository::Lib (or something like it) to core

Open ugexe opened this issue 1 year ago • 35 comments

CompUnit::Repository::Lib (which I'll call CURL) is a mix between CompUnit::Repository::FileSystem (which I'll call CURFS), in that it uses the same folder/file naming structure everyone is used to, and CompUnit::Repository::Installation, in that it allows for multiple distributions. The structure of the data on the file system looks like:

<some unique prefix 1>/META6.json
<some unique prefix 1>/lib/Foo.rakumod
<some unique prefix 2>/META6.json
<some unique prefix 2>/lib/Bar.rakumod

This solves a few of the issues CompUnit::Repository::Installation (which I'll call CURI) was created to solve, and it avoids trying to solve a few (arguably less important) others.


CURI and CURL Both Solve

  • Multiple versions of the same distribution

This is the primary problem that needed to be solved, and it speaks for itself.

  • Query-able

Kind of tied into the "multiple versions" problem is the need to be query-able. For instance when two different versions of a given module are installed and someone does use Foo;, the repository needs to be able to pick the proper one to be loaded (as no version was explicitly declared). Additionally there is the issue of multiple versions of bin scripts -- to have a single PATH entry for a given repository, the repository needs to be able to query itself to find e.g. the highest-versioned bin script to load.
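
For illustration, the two lookup forms a repository has to service look roughly like this (Foo, the version, and the auth value are all made up):

# No constraint: the repository must query itself and pick a candidate,
# typically the highest version it has installed
use Foo;

# Explicit constraints: only candidates matching the version/auth qualify
# use Foo:ver<2.3+>:auth<zef:someone>;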

CURI Solves, CURL Doesn't Solve

  • Unicode file names, case insensitive file systems

This is a hard problem, and I'd argue CURI only solves a small part of it. Indeed the hashing of file names means CURI can create module files that can be used via a unicode name. But CURI gets the files (and the data they contain) from the file system - you have to download and extract the given distribution to your file system before CURI can install it, and that isn't going to work right if those files are named in a way that doesn't work with a given file system. git doesn't solve this problem either: if you try to clone a repo whose file names don't map correctly onto your file system it will only give you a warning (and your git diff might show one file containing the data of another, similarly-cased file).

(Technically something like Distribution::Common::Remote can be passed to CURI so that the files to be installed don't need to be extracted to the file system first, but that would exclude anything that uses a build step / Makefile and anything that depends on something that uses a build step / Makefile. And currently there isn't a way to tell from metadata alone whether a build step needs to occur for an arbitrary distribution, so strategically using that isn't a good option in my mind.)

Renaming files (like how CURI renames things to their sha1 on installation) also breaks some things. Notably the OpenSSL dlls don't work when they have been renamed, and it also breaks web projects that want to put assets in resources and reference those files in html by their original names.

CURI Doesn't Solve, CURL Solves

  • As previously mentioned, renaming files can break e.g. dlls on windows and make it difficult to reference relative resources file paths in html/javascript. CURL doesn't have this problem as files retain their original names.

  • Users have a hard time understanding what is actually inside a repository full of sha1s. CURL does still use a sha1 to create the root directory of each distribution, but it doesn't have to, and even so it is relatively easy to find what is inside each of those directories because, again, CURL files retain their original names.


I suspect users would think the benefits (human readable installed files, easier to integrate with non-raku languages e.g. html and dlls) outweigh the drawbacks (can't theoretically install a distribution that contains both Foo.rakumod and foo.rakumod -- or is named with e.g. unicode characters -- on certain file systems).

Problems?

  • CURL currently greps each directory in its prefix, and lazily reads each META6.json until it finds the distribution it needs to load. It should probably use an index on module short names similar to CURI.

ugexe avatar Sep 24 '23 15:09 ugexe

Is the idea for CURL to replace CURI entirely, or would CURI still be available for situations where CURL won't work?

I ask because of the second drawback you mentioned: that users can't install Unicode-containing files on some file systems. I'd personally view that as a fairly large point against the idea. Given that many file systems do support Unicode characters and that many languages use non-ASCII characters fairly regularly, it's pretty easy to imagine a module author using a file name that works for them but fails for other users. And that'll only become more true if some of Raku's more ambitious internationalization plans work out (of the sort discussed at the core summit, I mean).

And, anyway, both S22 and the CompUnit docs make a fairly big deal out of Raku's support for Unicode file names. So, at the very least, it's not something we should give up lightly.

Would there be some way to mangle/normalize the file names such that they're still human readable (at least vaguely) but that avoids breaking on non-Unicode-supporting file systems?

codesections avatar Sep 24 '23 15:09 codesections

Is the idea for CURL to replace CURI entirely, or would CURI still be available for situations where CURL won't work?

CURI would still be available. However it would remain not as a sort of fallback for when CURL won't work, but simply because a bunch of stuff already uses it (slow-moving stuff, like packagers).

And, anyway, both S22 and the CompUnit docs make a fairly big deal out of Raku's support for Unicode file names.

I agree that in theory it is a great idea. In practice it hasn't really been used, and the way we allow it to happen at all has significant drawbacks. It is only theoretically possible in a very specific situation: installing a distribution using a custom Distribution that doesn't extract files to the file system and doesn't have any build steps. There isn't really a way for a language to solve the problem of extracting a downloaded distribution somewhere for installation if the file system is not capable of representing those files.

Would there be some way to mangle/normalize the file names such that they're still human readable (at least vaguely) but that avoids breaking on non-Unicode-supporting file systems?

Punycode is probably the best thing I can think of, but that doesn't handle the arguably bigger issue of case sensitivity. I'm not sure there is a way to have all of: human readable, unicode compatible, case sensitivity. Regardless, CURI does not preclude the use of some theoretical name normalization.

ugexe avatar Sep 24 '23 15:09 ugexe

Something else worth mentioning is that unicode module names can be used with CURL; it's only unicode file names that CURL doesn't support. This means if files are mapped using what is in a given META6.json (instead of naively concatting $module-name to lib/) -- and in the META6.json it maps to some non-unicode path -- then you can still refer to the module by its unicode name as you'd expect.
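
A rough sketch of what that looks like in practice (the module and file names here are made up):

{
    "name": "Überprüfung",
    "version": "0.0.1",
    "provides": {
        "Überprüfung": "lib/Ueberpruefung.rakumod"
    }
}

With a provides mapping like that, use Überprüfung; works as expected while the only name that ever has to exist on the file system is the ASCII one.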

ugexe avatar Sep 24 '23 17:09 ugexe

Something else worth mentioning is that unicode module names can be used with CURL; it's only unicode file names that CURL doesn't support.

That kind of makes it more of a footgun, though – if authors see unicode module names used by more experienced Rakoons, and then use unicode when naming their own modules (not realizing that they need a META6 mapping), then they'll have a module that works perfectly for them but breaks on different OSes.

But I like the overall idea of CURL and believe that we ought to be able to come up with a normalization scheme that is still mostly human-readable without sacrificing unicode support/case sensitivity.

How about this scheme, off the top of my head:

  1. Any ASCII lowercase letter is left as-is
  2. Any ASCII uppercase letter is lowercased and preceded by a _
  3. Any other character is replaced by a _ followed by that character's decimal Unicode value

So ResuméBuilder2 would become _resum_233_builder_50. That isn't 100% human-readable, but it's close enough that you'd be able to tell what module it meant. And I think that'd work on pretty much any file system. What do you think?
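
To make the scheme concrete, here's a rough Raku sketch of it (mangle-name is a made-up helper, not an existing API):

sub mangle-name(Str $name --> Str) {
    $name.comb.map(-> $char {
        $char ~~ /^ <[a..z]> $/ ?? $char                  # ASCII lowercase: unchanged
            !! $char ~~ /^ <[A..Z]> $/ ?? '_' ~ $char.lc  # ASCII uppercase: _ plus the lowercased letter
            !! '_' ~ $char.ord                            # anything else: _ plus the decimal codepoint
    }).join;
}

say mangle-name('ResuméBuilder2');   # OUTPUT: _resum_233_builder_50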

But, again, I like the general idea of moving to CURLs or something like them.

codesections avatar Sep 24 '23 18:09 codesections

I think ideally one could just extract an archive of a boring distribution somewhere and have it work. If we have to normalize 99% of all module filenames then I'm not sure it's a great alternative. And even then it doesn't avoid the issue regarding the naming of files inside of resources/ (the dll problem, and the html files issue in particular).

I kind of think the best we can do is to allow ecosystems to warn users against distributing code that isn't very system independent. For instance fez might warn users when they try to upload a module that has a Foo.rakumod and foo.rakumod, or when a user uses unicode in a file path. Raku would still let users use unicode file names on their own system if they want, but what gets distributed (and thus has a higher expectation of being written to work on other systems) is enforced by a given authority's policy.

ugexe avatar Sep 24 '23 19:09 ugexe

IMO, supporting Unicode filenames is pretty key to meaningfully supporting Unicode module names and, in turn, supporting Unicode module names is a core goal of Raku's whole approach to Distributions (it's first in the list of reasons for Raku's system in that docs page I linked earlier, for instance). And I think it'd be a shame to give up on that goal.

It also seems to me that fez or zef should be responsible for mapping existing, human-friendly names into names that work across platforms (maybe in a step that occurs before installation; maybe even at upload?) instead of asking users to do that mapping manually in their META6.json file. And when it comes to resources, I'm OK with a system that prevents people from referring to them by their original file name – so long as we have easy/well-documented methods that let them map from their source filename to their location. After all, in shell scripting we can write wc --lines /path/to/file but we can't write lines '/path/to/file' in Raku – we have to convert from the file name to an actual file, with something like lines '/path/to/file'.IO. Requiring devs to put the resources equivalent of that in HTML files doesn't strike me as too bad, especially since forgetting will generate an immediate and obvious error.

But all that's just my ¢2, and I know you've thought more deeply about this issue than I have. So, if no one else chimes in, I'm happy to defer to your judgment on this one.

codesections avatar Sep 24 '23 20:09 codesections

IMO, supporting Unicode filenames is pretty key to meaningfully supporting Unicode module names

To me, having to map Unicode file names in a META6.json is a pretty low bar for supporting this type of feature. The fact we even make it possible to have a Unicode module name is in line with making hard things possible. We don't need to make hard things easy, just possible. We even allow the user to just use the Unicode file name on their own system if they want.

It also seems to me that fez or zef should be responsible for mapping existing, human-friendly names into names that work across platforms

In a way that means every package system (zef, apt, etc) would be free to come up with their own complicated logic to do this. Users will still want to know how to get the mangled file name (similar to how users still want to get the sha1 file names even though they shouldn't and even though we supply users with all the tools to do things The Right Way), but they'll have no way of knowing which scheme any two distributions are using. Taken to the extreme one could say zef should ignore the META6.json file entirely (or rather just generate what it thinks is an appropriate META6.json at build time) to always do what is probably expected, but doing what is probably expected is not ideal for something with security implications. To some degree module authors have to be explicit, strict, etc if they want their code to work outside of their own systems.

maybe in a step that occurs before installation; maybe even at upload?

I think it would be a good idea to notify/warn/error the user at upload. Modifying the distribution at upload or after download, not so much (after all the uploader should be able to know the checksum before it is uploaded). Having fez/mi6 handle the Unicode file name in the META6.json at authoring time is also logical. But regarding installation time... how can e.g. zef install https://github.com/foo/bar.git work if it contains Unicode file names? As soon as the repository is cloned on a file system that doesn't support Unicode or case sensitivity the file will be lost - there is no chance to rename the actual file so it can exist on the file system.

I'm OK with a system that prevents people from referring to them by their original file name

But then we admit we can't support non-raku code that won't work with a different name. This is not an acceptable workaround, it is just the only workaround.

so long as we have easy/well-documented methods that let them map from their source filename to their location

Maybe I'm misunderstanding, but how would someone do this for html and javascript files practically? In a production environment you don't want to serve these type of files by going through some raku code to map the names, you want to let your e.g. reverse proxy just handle all your static content from some directory directly, which means you need to access them via the names as they exist on the file system. Even getting the names at runtime is probably going to be impossible unless it is built into Raku itself, because each packaging system would have their own methods of doing this (which means we would have distributions depending on a specific package manager).

But all that's just my ¢2, and I know you've thought more deeply about this issue than I have. So, if no one else chimes in, I'm happy to defer to your judgment on this one.

I have more thoughts on this than I'm capable of writing up unprompted in an initial github issue, so I'm happy (and would expect) to continue addressing people's concerns. Removing a feature is not an easy proposition to make.

ugexe avatar Sep 24 '23 22:09 ugexe

but what gets distributed (and thus has a higher expectation of being written to work on other systems) is enforced by a given authority's policy.

I'm not even sure if it must be enforced. A (suppressible) warning would be sufficient. Say, a distribution could declare itself as designed for a limited set of systems where Unicode is supported.

Besides, installation can reject a distribution, with a descriptive error message, if it finds out that not all file names are suitable for the local file system. (...saying nothing about too long paths on Windows...)

vrurg avatar Sep 24 '23 23:09 vrurg

if it finds out that not all file names are suitable for the local file system.

Yeah, and unfortunately I'm not sure there is a good way to do that outside of rakudo itself. A naive way would be for some program to try to write these various files and see what works and what does not work and use that knowledge to know when to generate such warnings prior to where zef passes the distribution to rakudo for the actual installation of files. But that would have to be done per volume/device/mount whatever, since e.g. two directories can be pointed at different file systems (and even then could change after-the-fact, so any "database" of this info is liable to become stale). Basically such a rejection would have to come from CURI.install(...) itself after it discovers that a file it created is not accessible by its stated file name. I agree that would be a good thing to have.

ugexe avatar Sep 25 '23 00:09 ugexe

I think the problems you listed for CURI can be solved within the implementation of CURI without requiring a full replacement. It's just that no one has given it a try so far. E.g. CURI can easily be changed to use subdirectories for keeping the files of multiple distros apart and use pure ASCII names as-is. It can even go a step further and simply test whether it can write a non-ASCII file name as-is and read it back. Nothing in CURI's interface requires it to rename all files or keep them in the current structure. The changes I mentioned can be done while retaining full backwards compatibility with existing installations.
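
A minimal sketch of such a probe (can-store-name is an invented name; a real implementation would need to probe per mount point and be more careful about cleanup and permissions):

sub can-store-name(IO::Path $dir, Str $name --> Bool) {
    my $probe = $dir.add($name);
    my $ok    = False;
    {
        $probe.spurt('probe');                        # try to create the file under the exact name
        $ok = $probe.e && $probe.slurp eq 'probe';    # and read it back under that same name
        CATCH { default { $ok = False } }             # any failure: the name can't be stored faithfully
    }
    try $probe.unlink;
    $ok;
}

say can-store-name($*TMPDIR, 'Fü.rakumod');   # False on file systems that can't represent the name

A similar but separate probe would be needed for case sensitivity, e.g. writing Foo.rakumod and then checking whether foo.rakumod also appears to exist.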

niner avatar Sep 25 '23 06:09 niner

Nothing in CURI's interface requires it to rename all files or keep them in the current structure.

This is only true in theory. In practice it breaks custom repository locations by hard coupling them to a specific rakudo version or higher. To explain for those who don't know, CURI has an upgrade-repository mechanism for changing the files/layout of a repository when building rakudo. So let's pretend CURI is updated to use this new format and I load some code with this new rakudo via raku -I /my/custom/libs -e 'use My::Custom::Libs' and see it works. Then you try to do rakubrew switch $some-previous-raku-version && raku -I /my/custom/libs -e 'use My::Custom::Libs' and suddenly nothing works, because the previous version of rakudo does not know anything about the new repository format. This same workflow (which I was using regularly) was broken for me the last time the upgrade mechanism was used.

So not only is updating CURI significantly more work (trying to maintain backwards compatibility over every tiny detail -- something I'm not even sure is practically possible anymore, only technically possible), but even done correctly it will break some existing valid workflows.

ugexe avatar Sep 25 '23 11:09 ugexe

~~Aren't new subdirectories created for a new Rakudo version in custom locations?~~ Ignore, messed up with precomps.

vrurg avatar Sep 25 '23 14:09 vrurg

Fixing these issues in CURI will most likely not require a change in repository version, as we record the path to the stored file for every internal name. Old rakudo versions would just follow that path and not care whether it's a SHAed file name or a directory + original name.

niner avatar Sep 25 '23 14:09 niner

One preliminary response, and then something that gets more at the heart of the issue. First, on the specific point:

how would someone do this for html and javascript files practically? In a production environment you don't want to serve these type of files by going through some raku code to map the names, you want to let your e.g. reverse proxy just handle all your static content from some directory directly, which means you need to access them via the names as they exist on the file system.

Right now, if I have a $path that I'd like to reference in my nginx.config, I would get Raku to tell me how to do so with .canonpath or similar. What I'm suggesting is that installed files should work similarly, with a convenience method that would display their exact path to let non-Raku code point to them.

Now on to the more general point:

The fact we even make it possible to have a Unicode module name is in line with making hard things possible. We don't need to make hard things easy, just possible. We even allow the user to just use the Unicode file name on their own system if they want.

Thanks, that's a really helpful comment – it clarifies how our perspectives differ. In my view, "giving a module the name I want" should be in the easy-things-should-be-easy category. I'm thinking partly of @finanalyst's work to create non-english versions of Raku (via slangs) or @alabamenhu's work to support multi-lingual error messages. But, even setting those projects aside, "naming a module in my native language" strikes me as something that we should make easy – pretty much everyone names modules, after all. And, of course, for pretty much any non-English speaker, using names from their native language requires Unicode support at least some of the time.

Conversely, I'm pretty willing to put "dynamically linking against a non-Raku program that requires a static name" in the hard-things-should-be-possible category. I'd venture a guess that the vast majority of Raku programs don't directly link against any non-Raku code, much less any that requires a static name. Of course, many more indirectly do so, but that's kind of my point: the interface between Raku and non-Raku code tends to be at the library level and, IMO, it's reasonable to expect library authors to do the hard thing of supporting linking via a static name.

Given that perspective, I wonder whether we could solve the issues with CURI from the other direction. What if we keep the current default (filename based on a hash) but allow module authors to specify a static filename in their META6.json (as a map of source-file-name → installed-file-name)? That way, anyone who needs a static name can have one (but bears the responsibility for naming it in a way that works on all target file systems). And anyone who doesn't need a static name gets full Unicode support.
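
To make that concrete, the META6.json might look something like this, where install-as is a completely made-up field name rather than anything CURI currently reads:

{
    "name": "My::SSL::Thing",
    "resources": [
        "libraries/thing",
        "images/logo.png"
    ],
    "install-as": {
        "resources/images/logo.png": "logo.png"
    }
}

Anything not listed in install-as would keep the current hashed-name behaviour.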

codesections avatar Sep 25 '23 15:09 codesections

I would get Raku to tell me how to do so with .canonpath or similar

Maybe I'm misunderstanding, but to clarify I'm talking about an installed distribution. The files don't exist with their original file names, so .canonpath isn't going to be useful besides potentially absolutifying the sha1 file name path. It doesn't help me point nginx at a specific distribution's resources directory and reference those files by their original name.

I'd venture a guess that the vast majority of Raku programs don't directly link against any non-Raku code, much less any that requires a static name.

To be fair I named OpenSSL (and thus IO::Socket::SSL and anything else that depends on it) specifically. And I would be willing to wager there is far more code written with OpenSSL as a dependency than there are modules using unicode file names (even if you filter down OpenSSL use to windows users only).

That way, anyone who needs a static name can have one

One of the problems this intends to solve is how our current method is not at all human readable. I'm not sure a scenario where many users are requesting various module authors to have all files be explicitly mapped to human readable names is something we would really want.

I'm thinking partly of @finanalyst's work to create non-english versions of Raku (via slangs) or @alabamenhu's work to support multi-lingual error messages

These problems don't exactly face the same constraints though. File systems themselves are at the core of this problem, and we would be wise to consider that abstraction when designing an interface around it. I have a strong hunch that if we asked users if they would prefer A) the ability to use unicode file names for their modules or B) the ability to untar a distribution directory into its install location and have it largely Just Work, users would choose B. Remember, Option A really does preclude option B because the files still have to be extracted from a tar file, git repository, etc onto the potentially problematic file system before zef or rakudo can rename them.

ugexe avatar Sep 25 '23 15:09 ugexe

Fixing these issues in CURI will most likely not require a change in repository version, as we record the path to the stored file for every internal name. Old rakudo versions would just follow that path and not care whether it's a SHAed file name or a directory + original name.

Even if this is technically true, it also seems a bit off. For all intents and purposes the repository format has indeed changed. In the future when the repository format needs to change again it seems like it would need to know what state the repository is actually in to do a meaningful upgrade, but it won't know if it's using the flat directory format, this new proposed format, or some mix of both.

ugexe avatar Sep 25 '23 15:09 ugexe

When considering how Unicode file names should work, think of how to solve this workflow:

  1. User downloads UnicodeNamedDist.tar.gz
  2. User extracts UnicodeNamedDist.tar.gz to ./UnicodeNamedDist
  3. User goes to install ./UnicodeNamedDist, but precompilation fails because the distribution seems to be missing a file listed in provides

By the time we've reached 3 it is already too late - the archive has been extracted but the file does not exist. There is no point where raku can give it an alternative name before it touches the file system for the first time...

...or rather no core-friendly way. https://github.com/ugexe/Raku-CompUnit--Repository--Tar (which S22 also references) can actually do this by extracting single files to stdout and piping that data into a file path that raku provides. But I'm not sure every version of tar supports this, nor would I suggest something in the core that shells out to e.g. tar. If we had core tar.gz extraction support like golang it could be an alternative option though.
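
A rough sketch of that approach, assuming a tar that understands --to-stdout and auto-detects compression (GNU tar and bsdtar both do); the sub name and the paths are made up:

# Stream one archive member through tar's stdout and write it to a path that
# Raku chooses, so the member's original name never has to exist on disk.
sub extract-member-as(IO::Path $archive, Str $member, IO::Path $target) {
    my $proc = run 'tar', '--extract', '--to-stdout',
                   '--file', $archive.absolute, $member, :out, :bin;
    $target.spurt: $proc.out.slurp(:close, :bin);
}

# e.g. extract-member-as('UnicodeNamedDist.tar.gz'.IO, 'UnicodeNamedDist/lib/Füü.rakumod', 'lib/mangled.rakumod'.IO);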

ugexe avatar Sep 25 '23 16:09 ugexe

I only have a shallow understanding, so apologies if I've made incorrect assumptions.

How about a solution that retains SHA install naming but adds a file-system layer of link/junction satisfying the human desire for meaningful names? The meaningful links/junctions could sit in a distinct hierarchy - i.e. not be mixed directly in the install folders. Perhaps tooling to create/maintain a human-meaningful shadowed hierarchy on the file-system from an existing installation.

This restricts file-system short-comings to the representational side and works safely alongside the universal SHA naming. There would be nothing preventing several shadow representations existing simultaneously - ASCII, English, French, Kanji - all linking to the same SHA hierarchy.

On the flip side - add META for files/folders where Install should create links. Example: files in this folder should be installed as usual (SHA naming), and then representational links created in [non-clashing install location]. If a supplied file (.DLL?) cannot be linked but requires install at a fixed place on the file-system, you likely have significant chance of security/crash/problem - so I'm not sure if this is worth supporting. SHA-then-link isolates issues with representation to the link layer on non-supported filesystems, and can be reported at attempted install.

Separately: IMHO Unicode module and file naming is important to developers, and should be easy. I expect filesystems will add Unicode support over time, and feel that it would be a step backwards to encourage non-Unicode in core. I'd rather see a Unicode / Case-sensitive approach that errors meaningfully when a file-system has limitations.

jaguart avatar Sep 25 '23 16:09 jaguart

@jaguart even if we implement that level of complexity, how could it solve the workflow I outlined in my previous comment? For a large percentage of people those files can't get onto the file system in the first place to even begin creating sha1 files with their data.

ugexe avatar Sep 25 '23 16:09 ugexe

Maybe I'm misunderstanding, but to clarify I'm talking about an installed distribution. The files don't exist with their original file names, so .canonpath isn't going to be useful besides potentially absolutifying the sha1 file name path. It doesn't help me point nginx at a specific distribution's resources directory and reference those files by their original name.

Maybe I'm the one misunderstanding. In that example, why would you want to be able to use the original file names with nginx? I would think that you'd want to point nginx at the installed file – after all the original file is basically part of the source code and could be changed/deleted at any point.

To be fair I named OpenSSL (and thus IO::Socket::SSL and anything else that depends on it) specifically. And I would be willing to wager there is far more code written with OpenSSL as a dependency than there are modules using unicode file names (even if you filter down OpenSSL use to windows users only).

Yeah, that's exactly the point I was trying to get at by drawing a distinction between programs that directly link to non-Raku code and those that only link indirectly (that is, because a dependency does the actual linking). For a Raku program that depends on IO::Socket::SSL, the developer doesn't need to care at all about how OpenSSL manages to link to dlls on Windows. That's an implementation detail that's abstracted away by the library. Thus, I'm OK with it being a "hard thing"; it only needs to be solved once, at the library level. (Of course, it does need to be a possible thing to solve or else all the dependencies are in serious trouble…)

I have a strong hunch that if we asked users if they would prefer A) the ability to use unicode file names for their modules or B) the ability to untar a distribution directory into its install location and have it largely Just Work, users would choose B.

I don't share that hunch. I agree that, if we're pulling from current Raku users, there wouldn't be a huge contingent of people clamoring for option A. But I expect/hope that will change as Raku becomes more international and UTF-8 everywhere becomes more of a reality.

But my hunch is that the group insisting on option B would be even smaller. I don't, generally speaking, expect that process to work for any software; instead, I expect that I'll need to install the software in whatever way is customary for that software/ecosystem (e.g., ./configure; make; make install, or cargo install, a program-specific wizard on Windows, etc). And I have the same problem with the workflow you mention in the following comment: Yes, that workflow isn't well supported. But why is it a common enough workflow that we should prioritize it? This might be slightly flippant, but it seems to me that Raku offers a way to install software and users who don't want to use that way are Doing It Wrong™.

@jaguart wrote:

Separately: IMHO Unicode module and file naming is important to developers, and should be easy. I expect filesystems will add Unicode support over time, and feel that it would be a step backwards to encourage non-Unicode in core.

Agreed.

codesections avatar Sep 25 '23 17:09 codesections

But why is it a common enough workflow that we should prioritize it?

It is essentially the only workflow to install modules. What is the alternative workflow to do so that isn't based on shelling out to a system dependent outside program (tar, etc)?

To install a distribution (that isn't already on the file system) you download a single file in some way (tar file, git, etc). Then you have to extract it. Then Raku can do something. If the file can't be extracted on a given file system, and Raku lacks core support for whatever archive algorithm is used, then there is no reason for Raku to try to make it work on that system because it can never reach that point in the first place. In other words - we can support those Unicode filenames but distributions using them still can't be saved/extracted (and thus installed) for any practical purposes on the systems we implement the sha1-ing for in the first place. And indeed for systems where they can e.g. extract a unicode name to the file system, we don't have to do anything extra for Raku to support it with CURL.

Maybe I'm the one misunderstanding. In that example, why would you want to be able to use the original file names with nginx?

I want to point my reverse proxy at a directory (potentially an installed distribution's resources directory) and have it serve the files there under their original names (since the html files in that distribution would be written to reference the original file names similar to as if it was being loaded by CURFS).

ugexe avatar Sep 25 '23 17:09 ugexe

To install a distribution (that isn't already on the file system) you download a single file in some way (tar file, git, etc).

No, to install a distribution I type zef install Some::Raku::Code 😁

That's a somewhat flippant answer, but it gets at a more serious point: I don't see any problem with having Zef (or some other tool) be the "blessed" way to install Raku packages and to say that other installation methods may require more work. In fact, I'd bet that pretty much the only people who might want to install Raku packages without Zef are package maintainers for Linux distros (or BSDs, I guess). And those folks are both ① unlikely to have difficulty with Unicode and ② familiar enough with using tar and other Linux tools in their build process to be willing and able to use the rename-via-stdout method you described.

If we start from the perspective that Zef is the way regular users install Raku programs, then the problem gets easier. Instead of needing to make a workflow easy for everyone, we just need to make it hard-but-possible for Zef to be able to extract the contents of an archive regardless of filesystem constraints. And you've explained why that poses challenges when the archive contains files with names that the OS considers illegal. Indeed, you might be correct that there's no way to do this without either shelling out to tar or implementing at least some level of extraction support in Raku (though I'm not sure how far we'd have to go with that implementation – once we get to the .tar stage (as opposed to the tar.gz stage) we can read the filenames from tar header).
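
(For what it's worth, a minimal sketch of that peek: each 512-byte ustar header starts with the member name in its first 100 bytes, so names can be inspected without extracting anything. Long names via the GNU/PAX extensions are ignored here, and dist.tar is a made-up path.)

my $fh   = 'dist.tar'.IO.open(:r, :bin);
my $name = $fh.read(100).decode('utf8-c8').trans("\0" => '');   # name field of the first header
$fh.close;
say $name;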

Where I disagree is with the idea that we'd need that support in core. Since the goal is "only" to enable Zef to install packages, it seems like we could have whatever support we need in user land, and Zef could depend on that. And, of course, that distribution wouldn't use any Unicode module names.

I realize that this might seem like a "simple matter of programming"™ type suggestion. But I'm describing what (IMO) it makes sense to aim for in the longer term. In the near/medium term I personally don't have an issue with shelling out to tar. True, it's not pure bootstrapping, but tar is so widely available – and we'd be using such a limited set of features – that it doesn't seem like a large issue, especially if we plan to move away at some point.

codesections avatar Sep 25 '23 19:09 codesections

I want to point my reverse proxy at a directory (potentially of an installed distribution's resource directory) and have it serve the files there under their original names (since the html files in that distribution would be written to reference the original file names similar to as if it was being loaded by CURFS).

But why would the HTML files be written to reference the original file names? That's what I'm not understanding. If the HTML files are generated by the Raku distribution, then (IMO) that distribution should be able to generate them with names pointing to the installed files. If they're external to the Raku distribution, then I should be able to edit them to point to the installed files. I'm just not understanding why the name of the source-code file (as opposed to the name of the installed file) should ever need to be in my HTML.

(I feel like I might be missing something basic here; my apologies if I'm being dense)

codesections avatar Sep 25 '23 19:09 codesections

No, to install a distribution I type zef install Some::Raku::Code 😁

Behind the scenes this downloads a tar file, uses git, or downloads that distribution in some way.

That's a somewhat flippant answer, but it gets at a more serious point: I don't see any problem with having Zef (or some other tool) be the "blessed" way to install Raku packages and to say that other installation methods may require more work. In fact, I'd bet that pretty much the only people who might want to install Raku packages without Zef are package maintainers for Linux distros (or BSDs, I guess). And those folks are both ① unlikely to have difficulty with Unicode and ② familiar enough with using tar and other Linux tools in their build process to be willing and able to use the rename-via-stdout method you described.

This requires name resolution to happen deterministically, which should be in core since zef isn't responsible for module loading, or at least exist in some capacity within the CUR* classes -- right now everything is SHA'd in CURI but not in the others.

If we start from the perspective that Zef is the way regular users install Raku programs, then the problem gets easier. Instead of needing to make a workflow easy for everyone, we just need to make it hard-but-possible for Zef to be able to extract the contents of an archive regardless of filesystem constraints. And you've explained why that poses challenges when the archive contains files with names that the OS considers illegal. Indeed, you might be correct that there's no way to do this without either shelling out to tar or implementing at least some level of extraction support in Raku (though I'm not sure how far we'd have to go with that implementation – once we get to the .tar stage (as opposed to the tar.gz stage) we can read the filenames from tar header).

What ugexe is trying to say is that if the OS can't handle unicode names for files then this is where zef's agency ends. It can't shell out to tar or gzip and it can't continue installation in the current state. The suggested fix is creating a CUR that can handle this mutation in both extraction and resolution -- not to say the CUR needs to handle the extraction itself, but it must deterministically determine what the name would have been mutated to <- this is the key.

Where I disagree is with the idea that we'd need that support in core. Since the goal is "only" to enable Zef to install packages, it seems like we could have whatever support we need in user land, and Zef could depend on that. And, of course, that distribution wouldn't use any Unicode module names.

The rub is when rakudo attempts to load/resolve unicode file names on a non-unicode file system.

tony-o avatar Sep 25 '23 19:09 tony-o

Behind the scenes this downloads a tar file, uses git, or downloads that distribution in some way.

Yeah, I get that of course (hence the grin). What I was trying to get at is that, since this is done behind the scenes, it's fine for it to be hard-but-possible, which is much easier than trying to come up with a solution that fits into the workflow for typical uses. You know, the typical "torment the implementer" sort of thing…

What ugexe is trying to say is that if the OS can't handle unicode names for files then this is where zef's agency ends. It can't shell out to tar or gzip and it can't continue installation in the current state.

I don't follow this. I understood @ugexe to have said (in a previous comment) that Zef could handle that situation by using the approach taken by CompUnit::Repository::Tar, at least if shelling out to tar is acceptable. Did I misunderstand that comment?

codesections avatar Sep 25 '23 19:09 codesections

But why would the HTML files be written to reference the original file names? That's what I'm not understanding. If the HTML files are generated by the Raku distribution, then (IMO) that distribution should be able to generate them with names pointing to the installed files.

Because inside of html like resources/mypage.html you might do something like <img src="myfile.png"> which would work in CURFS, but when loaded from CURI it would 404 because the file would now be called 58F5FC9AA510E61F7A2C619903AEA1C929D9E007.png. Those files aren't generated by the Raku distribution, they are only distributed with it.

ugexe avatar Sep 25 '23 19:09 ugexe

a simple matter of programming

I'm not sure how that can work with the various nativecall distributions that use e.g. Build.rakumod and/or Makefiles. Those files have to be extracted. If all the files in the archive are extracted then potentially some files fail to get created because the file system doesn't support them. If, somehow, only the files required for the Makefile are extracted then they need to also be re-archived into a new .tar.gz file to be installed (and that is ignoring that the Makefile might need to access the actual Raku module files, leading back to saying everything needs to be extracted). Furthermore, if hooks (as mentioned in S22) are ever implemented, they too would likely require all the files to be extracted pre-install.

ugexe avatar Sep 25 '23 19:09 ugexe

Because inside of html like resources/mypage.html you might do something like <img src="myfile.png"> which would work in CURFS, but when loaded from CURI it would 404 because the file would now be called 58F5FC9AA510E61F7A2C619903AEA1C929D9E007.png.

I understand that part. But what I don't understand is why the author of the distribution wouldn't just put <img src="58F5FC9AA510E61F7A2C619903AEA1C929D9E007.png"> in the HTML file. True, that would require the distribution author to introspect enough to generate that hash, but that seems like a reasonable step to take for production code – using the hashed name clarifies that it is production code and ensures that it points to the correct file. That second point isn't as relevant for a png, but matters more for js/css; indeed, IME many js/css files are already renamed with a hash for cache-busting purposes as part of the build process.

None of that is to say that there couldn't be a situation in which someone really wants to have <img src="myfile.png"> in their HTML. But if that does come up, it seems like the sort of edge case that'd be addressed by letting developers specify that resources/myfile.png should be mapped to myfile.png (and accepting the responsibility for ensuring that myfile.png is a valid filename).

I'm not sure how that can work with the various nativecall distributions that use e.g. Build.rakumod and/or Makefiles.

Yeah, I can see how that'd be an issue. But, as in the OpenSSL case, nativecall distributions tend to be pretty low-level and written by fairly experienced Rakoons. And, almost by necessity, they deal with OS-specific issues. So I wouldn't mind a solution that requires nativecall-distribution developers to avoid Unicode filenames when they're targeting non-Unicode-supporting OSs. Or one that required them to add a field to their META6.json. Or, at least, I'd prefer that they deal with that complexity rather than have someone's first Raku module run into an issue because its name includes an umlaut.

codesections avatar Sep 25 '23 20:09 codesections

I understand that part. But what I don't understand is why the author of the distribution wouldn't just put <img src="58F5FC9AA510E61F7A2C619903AEA1C929D9E007.png"> in the HTML file

Because it then would not work when loaded by CURFS (or some other external repository class that doesn’t use sha1). The sha1 is an implementation detail of a specific repository type.

ugexe avatar Sep 25 '23 20:09 ugexe

Behind the scenes this downloads a tar file, uses git, or downloads that distribution in some way.

Yeah, I get that of course (hence the grin). What I was trying to get at is that, since this is done behind the scenes, it's fine for it to be hard-but-possible, which is much easier than trying to come up with a solution that fits into the workflow for typical uses. You know, the typical "torment the implementer" sort of thing…

What ugexe is trying to say is that if the OS can't handle unicode names for files then this is where zef's agency ends. It can't shell out to tar or gzip and it can't continue installation in the current state.

I don't follow this. I understood @ugexe to have said (in a previous comment) that Zef could handle that situation by using the approach taken by CompUnit::Repository::Tar, at least if shelling out to tar is acceptable. Did I misunderstand that comment?

Tar can handle it since it's just bytes in a file and does not necessarily need to be extracted anywhere. In this way TAR is handling the mutation that needs to happen to the filenames (by making it unnecessary).

tony-o avatar Sep 25 '23 20:09 tony-o