netatalk Support remapping the \r in Icon\r to something that can be named in .hidden files

CLARIFYING EDIT: Every time I write \r, I mean a literal 0x0D byte (A.K.A. ^M in Vim), not the string \r.

Is your feature request related to a problem? Please describe.

Currently, my local Linux file manager's view of /srv/retro is cluttered up with Icon empty document entries that I can't get rid of because either the local equivalent to Samba's hide files = ... option (which is putting a .hidden file inside the folder with one filename per line) or Dolphin's parsing of it wasn't designed with filenames like Icon\r in mind. (Neither Icon? nor Icon\r nor Icon<literal CR> work and I don't want to try Icon*.)

Describe the solution you'd like

Given that the trailing \r is probably going to be eaten by every "just do what I mean" implementation of line-splitting under the sun, including ones built into some programming languages, I think it would be an endless slog to identify and fix every implementer of .hidden that doesn't special-case Icon\r, so I think it needs to be fixed on the Netatalk side.

The only option I can think of which wouldn't break Samba interop (In fact, it would fix it for those who want that. More on that later.) would be to add a config file option that allows remapping \r to something else.

(I'd go with generic support for remapping a list of Unicode code points to another list of Unicode code points since that feels like it'd be the best balance of concerns.)
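As a rough illustration of the "list of code points to another list of code points" idea, here is a minimal Python sketch. It is not Netatalk code (Netatalk is C), and the names TO_DISK, FROM_DISK, to_disk and from_disk, as well as the U+F00D target, are assumptions made up for the example; the real replacement would be whatever the config option specifies.

```python
# Hypothetical mapping: source code points -> replacement code points.
# U+F00D is only an illustrative private-use target, not a known Netatalk default.
TO_DISK = {0x0D: 0xF00D}
# Inverse mapping, applied when handing names back to the AFP client.
FROM_DISK = {dst: src for src, dst in TO_DISK.items()}


def to_disk(name: str) -> str:
    """Translate a client-supplied filename into its on-disk form."""
    return name.translate(TO_DISK)


def from_disk(name: str) -> str:
    """Translate an on-disk filename back into what the client expects."""
    return name.translate(FROM_DISK)


on_disk = to_disk("Icon\r")
```

With a mapping like this, the on-disk name no longer ends in a carriage return, so it can be written on its own line in .hidden, while the client still sees the original name after the reverse translation.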

Now about how it would fix Samba interop...

I noticed that I was able to set up my /srv/retro with per-platform icons like this:

Linux (.directory file inside each folder. Only works remotely if mounted because of how Dolphin's "When is it safe to thumbnail?" logic handles the smb:// KIOSlave.)
Windows XP (desktop.ini. As long as icon sizes above 48x48px are undefined in the .ico, sizes 48x48px or below will be used.)
Windows 7 (desktop.ini. Windows 7 will downscale the 128x128px icon at any icon size above 48x48px.)
Mac OS 9 (Icon resource over AFP, sizes 32x32px and below.)
Mac OS 10.4 (Icon resource over AFP, size 128x128px)
Mac OS 10.13 (Icon resource over SMB)

At first, I just assumed that OSX had changed how it stored the icon resources (inside .DS_Store?) when using the SMB client... until I noticed that I had two un-hidden files in the folder... Icon\r and Icon.

I didn't see any option to turn it off in the smb.conf docs and I honestly like being able to give different icons to PPC-era and Intel-era Mac OS, but either OSX or Samba apparently does remap \r to  when storing Icon\r over SMB.

The important thing is that I can name files containing  in .hidden, so, aside from inspiring this solution, we're back to "Netatalk's representation of Icon\r is the only thing I can't hide using .hidden".

(At the moment, my solution has been to just sudo chattr +i **/Icon$'\r' **/._Icon$'\r' to make sure I don't accidentally delete them from the Linux side of my usually-read only = yes /srv/retro share. ...there's also some interaction between the different heterogeneous clients that I haven't tracked down the source of yet that occasionally causes the Icon\r files to become un-invisibled, but praise-be-to-ad set -f V, I just have a cleanup.sh script which recursively hides all the Icon\r and Icon files and the share is normally read-only anyway.)

Describe alternatives you've considered

You could specifically remap just Icon\r, but it feels inelegant and potentially vulnerable to "You did that. Why can't you do this too?" scope creep.
You could implement a generic filename remapping option but that also feels inelegant and prone to users using it to stick dots on names like TheVolumeSettingsFolder, which I assume would break things like Samba.
You could implement a generic solution for automatically remapping the Invisible attribute to a prepended dot and searching for a dotfile too when the client asks for a file without a prepended dot... but I don't think I need to see how the complexity of addressing all the edge cases would spiral.

Additional context

In case you aren't sold on the importance of retaining the ability to keep Netatalk's Icon\r separate from SMB's Icon, have a screenshot of my in-progress icon set for my /srv/retro:

(Yeah. I haven't started on the 10.4 icons yet, since I'm not yet tooled up for a streamlined workflow on perspective transforms, so I'm just using a couple of standard icons to demonstrate the principle, and you'll just have to take my word for it that Dolphin isn't just honoring desktop.ini's icon assignments until I get around to making variants to fit with KDE's Breeze theme.)

preview

Aug 08 '24 22:08 ssokolow

How do you have extended attributes set up on the Netatalk side? Is it set to ea=auto or ea=samba? The latter is supposed to improve compatibility with Samba's vfs_streams_xattr, but the docs are a bit thin otherwise. The other issue is that Samba has seen continuous development and might have had changes that broke Netatalk 3.x compatibility over the years. Here are additional tips on setting up Samba with Netatalk 3.x: https://www.samba.org/samba/docs/current/man-html/vfs_fruit.8.html

Aug 08 '24 23:08 NJRoadfan

[DEFAULT]
; Make extended attributes compatible
ea = samba

...but if that's supposed to collapse Icon\r and Icon together, maybe I should change it so things don't break when I upgrade and it gets fixed.

EDIT: ...on the other hand, I remember the manpage saying all it's supposed to do is append a null byte to each xattr for compatibility with Samba's implementation.

Aug 08 '24 23:08 ssokolow

...and as a reminder, this isn't a feature request to make OSX AFP and OSX SMB share the same icons. It's a feature request to make Netatalk reproduce Mac filenames less faithfully for better compatibility with the .hidden feature of Linux file managers.

Aug 08 '24 23:08 ssokolow

The other issue is that Samba has seen continuous development and might have had changes that broke Netatalk 3.x compatibility over the years. Here are additional tips on setting up Samba with Netatalk 3.x: https://www.samba.org/samba/docs/current/man-html/vfs_fruit.8.html

OK, I found an answer on that front. The thing I need to stay far away from if I want to ensure I can have separate custom folder icons for AFP and SMB is fruit:encoding = native.

Still, even if I wanted fruit:encoding = native, all that would do is make it so that OSX over SMB requires the requested feature too.

Aug 08 '24 23:08 ssokolow

OK, I think I know what is happening. Somewhere Netatalk is converting a literal 0x0D to \r when storing the file. The code likely assumes that 0x0D is not a valid character in a filename (generally a smart thing to do) and converts it to an escaped equivalent.

Aug 08 '24 23:08 NJRoadfan

No, it's writing a literal 0x0D to disk... Dolphin just parses .hidden in a way that means a file ending in a literal 0x0D will never match its line in that file.

If it were writing the string \r to disk then I wouldn't need the requested feature because a literal Icon\r in .hidden and a literal Icon\r in a filename would match each other.

Aug 08 '24 23:08 ssokolow

The problem is that the modern way to hide non-dotfiles in file managers on XDG desktops (i.e. Linux, *BSD) is a newline-delimited list, and "universal" newline splitting algorithms interpret the trailing carriage return as a delimiter instead of part of the filename.

Think trying to represent a filename containing a comma in a variant of CSV that doesn't support quoting or escaping.

If it were literally any byte other than 0x0d or 0x0a, there would be no problem.

Depending on how such a splitting algorithm is implemented, it may even only take issue with terminal carriage returns. (eg. I've seen ones where "universal" just means "DOS or UNIX" and they just implemented something like lines = [x.rstrip('\r') for x in raw.split('\n')])
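To make the failure concrete, here is a small self-contained sketch built around that exact list comprehension. The function name split_dos_or_unix and the sample .hidden contents are made up for illustration; this is not the code of any particular file manager.

```python
def split_dos_or_unix(raw: str) -> list[str]:
    """'Universal' splitting in the 'DOS or UNIX' sense quoted above."""
    return [x.rstrip('\r') for x in raw.split('\n')]


# A .hidden file with UNIX line endings whose second entry is Icon<0x0D>.
hidden = "Thumbs.db\nIcon\r\n"
entries = split_dos_or_unix(hidden)

# The name as it actually exists on disk.
real_name = "Icon\r"
is_hidden = real_name in entries
```

The trailing carriage return is stripped as if it were half of a DOS line ending, so the parsed entry is Icon and the comparison against the real filename fails.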

Aug 08 '24 23:08 ssokolow

MacOS X natively stores the icon files on HFS+ drives using 0x0D, so it's no surprise that Netatalk replicates that. What is writing an Icon file with a literal \r on the end then? Samba? I can't see that being the default though since the backslash is a big no-no on Windows systems.

Aug 09 '24 00:08 NJRoadfan

Nothing is writing an Icon with a literal \r on the end.

The feature request is for support for translating Icon<0x0D> (A.K.A. Icon^M) into something else (like Samba does by default) because file managers parsing the .hidden file interpret a trailing CR followed by the intended LF delimiter as a stray DOS-style CRLF line ending and normalize it away. (Which then means an Icon != Icon<0x0D> result and no hiding of the Icon<0x0D> file.)

Trailing carriage return bytes are literally unrepresentable in a "Newline-delimited list, using DOS or UNIX line endings" file unless the parser is smart enough to follow Vim's stateful approach where only the first line's delimiter is heuristically detected and all following lines are assumed to use the same delimiter type.
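A minimal sketch of that stateful approach, assuming the delimiter is taken from the first line terminator found in the file and then used for every following line. The function names are hypothetical and this is neither Vim's nor any file manager's actual code.

```python
import re


def detect_delimiter(raw: bytes) -> bytes:
    """Return the first line terminator in the data (CRLF, CR or LF)."""
    match = re.search(rb"\r\n|\r|\n", raw)
    return match.group(0) if match else b"\n"


def split_stateful(raw: bytes) -> list[bytes]:
    """Split on the detected delimiter only, keeping other CR/LF bytes intact."""
    parts = raw.split(detect_delimiter(raw))
    if parts and parts[-1] == b"":
        parts.pop()  # drop the empty piece after the final terminator
    return parts


multi = split_stateful(b"Desktop.ini\nIcon\r\n")
single = split_stateful(b"Icon\r\n")
```

In the multi-entry file the first terminator is a bare LF, so Icon<0x0D> survives as an entry. A file whose only entry is Icon<0x0D> is still read as a one-line CRLF file, so even this approach cannot represent that case.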

Aug 09 '24 00:08 ssokolow

...and yes, I checked. Dolphin's parser is both universal enough to recognize classic mac-style CR-only line endings and stateless enough to allow you to mix different kinds in the same file, so it'll parse this .hidden file the same way whether using DOS, UNIX, or Mac line endings...

Desktop.ini⏎
DESKTOP.INI⏎
Icon⏎
Icon<0x0D>⏎

...like this:

Desktop.ini⏎
DESKTOP.INI⏎
Icon⏎
Icon⏎
⏎
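The same effect can be reproduced with a stateless universal splitter. This is a sketch of the general technique, not Dolphin's or Qt's implementation; the regular expression, the function name split_universal and the choice to drop empty entries are assumptions.

```python
import re

# Any of CRLF, CR or LF counts as a line terminator, decided per occurrence.
UNIVERSAL = re.compile(r"\r\n|\r|\n")


def split_universal(raw: str) -> list[str]:
    """Split on every terminator style and discard empty entries."""
    return [entry for entry in UNIVERSAL.split(raw) if entry]


names = ["Desktop.ini", "DESKTOP.INI", "Icon", "Icon\r"]
results = {
    label: split_universal("".join(name + eol for name in names))
    for label, eol in (("DOS", "\r\n"), ("UNIX", "\n"), ("Mac", "\r"))
}
```

All three line-ending styles produce the same entries, and none of them is Icon<0x0D>.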

Aug 09 '24 00:08 ssokolow

OK, so you are manually creating/naming a file named Icon\r and Samba converts that to Icon<0x0D> when reading and delivering the directory entry to an SMB client?

Netatalk 2.x used to support encoding illegal characters using CAP encoding (Icon:0d in your case), but that appears to have gone away with 3.x. The code for this still appears in the tree though. I don't know if it would even work in this case since Icon<0x0D> is legal on *NIX platforms.

Aug 09 '24 00:08 NJRoadfan

I'm manually creating a file named Icon<0x0D>... by using Command-C and Command-V between two Get Info dialogs in Mac OS 9.2.2 and Mac OS 10.4 to set a custom icon on a folder.

Finder stores custom folder icons inside the resource fork of an invisible file named Icon<0x0D> inside the folder in question... but Mac OS hides it using the HFS Invisible flag, while Linux/BSD file managers are completely ignorant of the user.org.netatalk.Metadata xattr that Netatalk uses to store that.

This feels like I'm being faulted for "allowing" Windows Explorer to splash Thumbs.db files all over the place.

EDIT: The difference is, I can add a line containing Thumbs.db to a text file named .hidden inside the folder and the file manager will then treat it as if it were named .Thumbs.db instead... but .hidden is newline-delimited and the file manager's parser interprets Icon<0x0D> as Icon followed by a delimiter because <0x0D> is a valid Macintosh line delimiter and <0x0D> followed by a standard UNIX line delimiter is how you write a standard DOS/Windows line delimiter.

Aug 09 '24 00:08 ssokolow

To make it clear, Netatalk is faithfully reproducing the filename used by Finder's behind-the-scenes mechanism for custom folder icons... and that's the problem.

What I'm asking for is a solution that involves patching Netatalk once instead of tracking down and writing a PR for every single Linux file manager that implements support for the .hidden file, both now and in the future, to smarten up its line-splitting algorithm and extend/add the regression test so that these dozens of file managers recognize that a CRLF in an otherwise LF-delimited .hidden file is probably a Netatalk-created file rather than a stray DOS/Windows line delimiter.

That latter approach would be about as viable as getting every Linux and BSD file manager in existence to add support for parsing the HFS Invisible flag out of user.org.netatalk.Metadata and honoring it... and it still wouldn't fix it for cases where your .hidden file contains exactly one entry and it's Icon<0x0D> because that'd be indistinguishable from a one-entry CRLF-delimited file.

EDIT: And why did Apple use that name to begin with? It's as if Microsoft chose to name desktop.ini as desktop.ini<CR><LF> or if XDG decided the standard name for .directory (The Linux/BSD analogue to desktop.ini) was .directory<LF>.

Aug 09 '24 00:08 ssokolow

Just a shot in the dark, the wildcard 'Icon'$'\r' doesn't work? Just noticed that when doing an ls of a directory. Using CAP notation does NOT work for this use case unfortunately.

Aug 09 '24 00:08 NJRoadfan

$'\r' is bash/zsh shell syntax. If .hidden doesn't support Icon? (which belongs to the simplest common subset of shell globs), then it's not going to support something that's effectively an embedded subset of shell scripting itself.

(Icon? and Icon* are available to every programming language, essentially for free, via the glob and fnmatch functions in the C standard library, which can be assumed to be present by any portable POSIX application because every Unixoid platform except Linux has ABI-unstable kernel syscalls and treats its libc as the ABI stability boundary.)
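For reference, this is what those two patterns would match if a .hidden implementation did pass its entries through fnmatch-style matching. The sketch uses Python's fnmatch module (a pure-Python implementation of the same glob rules, not the C library function), and the sample filenames are made up.

```python
from fnmatch import fnmatchcase

# Sample directory contents; only "Icon\r" is the Finder-created file.
names = ["Icon", "Icon\r", "Icons.png", "IconCache.db"]

# "?" matches exactly one character, including a carriage return.
question = [n for n in names if fnmatchcase(n, "Icon?")]

# "*" matches any run of characters, including none.
star = [n for n in names if fnmatchcase(n, "Icon*")]
```

Icon? would pick out only the file ending in a carriage return, while Icon* also catches unrelated names that merely start with Icon, which is why it isn't an attractive fallback.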

Aug 09 '24 01:08 ssokolow

Yeah. I just checked. Dolphin is literally just delegating to the behaviour imposed by Qt's QIODevice::Text flag when you call QTextStream::readLine(), which is documented as:

When reading, the end-of-line terminators are translated to '\n'. When writing, the end-of-line terminators are translated to the local encoding, for example '\r\n' for Win32.

EDIT: And there's no spec for .hidden. It's literally just "Some GNOME person banged this together for Nautilus and people using other file managers are asking for us to have it too. Let's just copy the 'one file per line' high-level description and call it a de facto standard."

Aug 09 '24 01:08 ssokolow

PCManFM (the GTK-based version) also interprets 0x0D as a line terminator but invokes the "newline inside filename" rendering behaviour... making Finder's custom folder icons even uglier on Linux. I believe that's how all GTK-based things which don't implement .hidden yet will show it.

Screenshot_20240809_110810

(Ignore the two Blank DVD+R Disc entries in the sidebar. It's my workaround for a decade-long, across-multiple-systems bug I don't know how to narrow down enough to report where something in Linux's stack causes "On second thought, veto that tray open" commands to pile up in optical drives if you leave the tray empty.)

EDIT: Oh, and no. That's not an uncharacteristically old version of it. I'm running Kubuntu 22.04 LTS, so that would have been current at the time of feature freeze.

Aug 09 '24 15:08 ssokolow

...and, as I suspected for something that doesn't have a clearly defined spec (.hidden), there's no consistency on "Why would anyone sane ever do that?" edge-cases like Finder's decision to put a Macintosh newline at the end of a filename.

Dolphin built against Qt 5.x (system packages) and Dolphin built against Qt 6.x (Flatpak) both delegate to Qt's text-mode line-splitting algorithm (or at least that's the original "implement .hidden" commit's reason for the observed behaviour) and Icon<0x0D> doesn't get hidden because it treats Icon<0x0D><0x0A> as Icon followed by a stray DOS/Windows line ending.

The only reason they render the displayed filename as Icon is that they used to do what GTK does until they added some kind of special-case code to prevent newlines in filenames from messing up rendering.
KDE's standard Open/Save dialog may or may not do something different: whether Icon<0x0D> gets hidden varies between applications, despite all tested applications supposedly using the same XDG File Chooser Portal, /srv/retro folder, and .hidden file.
PCManFM doesn't implement .hidden at all and, as a result, the GTK table view used for detailed list views renders Icon<0x0D> as a two-line entry consisting of Icon on one line and a blank line below it.
GTK's standard Open/Save dialog for whatever version of GTK 3 the Audacious Media Player Flatpak uses for its "GTK (legacy)" frontend apparently does something like "The only valid line terminator is \n... if you use DOS/Windows line endings, it's your problem, not ours" and actually does match Icon<0x0D> to its .hidden entry correctly but will break the list entry across two lines when not hidden.
The Basilisk II emulator uses what I can only assume is a vendored copy of the GTK+ 1.2.x file chooser and it not only doesn't honor .hidden, but renders Icon<0x0D> as a two-line list entry, interpreting <0x0D> as a line terminator.
The SheepShaver emulator uses the GTK+ 2.x file chooser and it apparently does implement .hidden in a way that matches Icon<0x0D> but renders Icon<0x0D> as a two-line list entry, interpreting <0x0D> as a line terminator, when it's not being hidden.
If I whip up a quick PyQt test script and add dialog.setOptions(QFileDialog.Option.DontUseNativeDialog), it appears Qt's Windows-inspired built-in file chooser dialog doesn't honor .hidden but also does prevent line separator characters from breaking a filename's rendering across multiple lines.
Tk (via Python's Tkinter binding) makes it difficult to specify a file dialog filter that includes extensionless files but excludes dotfiles, but, when I just let it show everything, it does prevent Icon<0x0D> from breaking the rendering.

I can't think of any other general-purpose file-view implementations that I have installed. (eg. Geeqie only lists directories and files with known-supported image extensions and wxWidgets just delegates to GTK.)

In summary, for any folder where someone uses Finder to set a custom icon, so long as Netatalk doesn't support translating the <0x0D> to something else, Linux users will see the following results:

Dolphin: Ugly Icon file. Cannot be hidden by any means so long as the filename ends with <0x0D>.
GTK File Chooser (vendored 1.2.x): Icon<0x0D> cannot be hidden and the dialog renders it as a two-line list entry among a collection of one-line list entries.
GTK File Choosers (v2+): Icon<0x0D> can be hidden using .hidden but the dialog renders it as a two-line list entry among a collection of one-line list entries if you don't do that.
KDE 5 File Chooser: Ugly Icon file that may or may not be possible to hide so long as the filename ends with <0x0D>.
PCManFM: Ugly Icon file that cannot be hidden by any means, even if <0x0D> isn't there, and is even more eye-catching because GTK renders it as a two-line list entry among a collection of one-line list entries.
Qt Non-Native File Chooser: Ugly Icon file that cannot be hidden.
Tk File Chooser: The extensionless nature of Icon means that it doesn't get included in the *.* filter generally used for "All Files" in Tk, since * is an invalid filter.

Basically, as-is, the only way for Macintosh-over-Netatalk and Linux/BSD users to coexist comfortably on the same Linux/BSD filesystem is to either forbid the Macintosh users from setting custom icons (eg. via an inotify hook to delete them as soon as they get set) or to force the Linux/BSD users to access it via Samba so veto files can be used to hide them from non-Netatalk clients... or to write a FUSE proxy filesystem which does something similar to the hidden virtual .zfs folder where snapshots live on ZFS filesystems, where Icon<0x0D> doesn't get returned by opendir/readdir but fopening it will succeed.
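The core of that proxy idea is just a directory-listing filter. The sketch below is not a FUSE filesystem (a real one would need a FUSE binding and a full set of filesystem operations); it only illustrates the "omitted from listings but still openable by name" behaviour on a temporary directory. The names HIDDEN_NAMES and filtered_listdir are hypothetical, and it assumes a filesystem that allows carriage returns in filenames, as Linux ones generally do.

```python
import os
import tempfile

# Names to leave out of directory listings (an assumption for this sketch).
HIDDEN_NAMES = {"Icon\r"}


def filtered_listdir(path: str) -> list[str]:
    """List a directory as the proxy would present it to local applications."""
    return sorted(n for n in os.listdir(path) if n not in HIDDEN_NAMES)


with tempfile.TemporaryDirectory() as root:
    # Create one Finder-style icon file and one ordinary file.
    for name in ("Icon\r", "ReadMe.txt"):
        with open(os.path.join(root, name), "w") as handle:
            handle.write("x")

    listing = filtered_listdir(root)
    # The file is absent from the listing but still reachable by its exact name.
    still_openable = os.path.exists(os.path.join(root, "Icon\r"))
```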

Other non-dotfile entries like TheVolumeSettingsFolder only show up in the root of the Netatalk share, so they can be ignored in the few applications that don't honor .hidden, same as the lost+found folder used by fsck, but Icon<0x0D> shows up in every single folder with a custom icon and, without an equivalent to FAT/NTFS Hidden attributes or HFS/HFS+ Invisible flags, there's no way to reliably hide something with such an edge-case filename.

Aug 09 '24 16:08 ssokolow

The behavior of using CAP style encoding (storing the filename as Icon:0d on the Linux side) can likely be patched in. I would not lean to that being the default behavior for creation of this file, as it would break Samba inter-op. The changes would have to be made in charcnv.c. This may cause trouble in other areas of the CNID code though as only / is currently treated as a special case in this matter.

EDIT: Scratch that. Netatalk 3.x doesn't support CAP encoding of filenames anymore as it was removed with this commit: https://github.com/Netatalk/netatalk/commit/f03f4b3ee3b8c423f1b48e3fd5a226db95ce428f

Aug 10 '24 20:08 NJRoadfan

Bear in mind that Samba interop is already broken by default because Samba will do its own analogue to CAP style encoding by default.

(Users must opt into preferring interop with Netatalk at the cost of breaking interop with Linux GUI file managers by adding vfs_fruit and setting fruit:encoding = native... and I'd been using Samba for 20 years without ever discovering that vfs_fruit existed.)

Aug 10 '24 22:08 ssokolow

To be fair, vfs_fruit didn't exist 20 years ago. I think Apple added their protocol extensions after they switched from Samba to their in-house SMB implementation, which was around the time they started deprecating AFP.

Aug 11 '24 01:08 NJRoadfan

Did the "Samba remaps characters like <0x0D> unless you add a VFS filter to override it" behaviour come later?

...because FAT, exFAT, the Win32 personality of NTFS, and DOS, Win16, and Win32 APIs were forbidding all characters in the 0x1..=0x1f (Rust syntax) range within filenames for the entire lifespan of AFP and it wouldn't make sense for them to start doing a remapping that's clearly for the benefit of being able to manipulate the files through Windows Explorer only after vfs_fruit added an option to turn it off.

Aug 11 '24 01:08 ssokolow

I don't know for sure as I haven't followed Samba development. Those characters were likely prohibited at the SMB protocol level and Apple did filename mangling on the client side when working with shares that do not support the extensions. FWIW, Windows 11 doesn't seem to care. MacOS seemingly writes the Icon files as Icon<0x0D>. Windows stores the image data in ADSes. Explorer shows the file as Icon, so it appears to do the same private area translation of invalid characters as Samba does. The file attributes are set to 'hidden'.

Aug 11 '24 02:08 NJRoadfan

So why is it Samba's responsibility to go out of its way to provide a non-default vfs_fruit option that writes them in a form that breaks the only mechanism for hiding non-dotfiles in Linux/BSD GUI file managers instead of Netatalk's responsibility to translate them into a form compatible with Samba as well as any program which goes the default route of using the line-splitting/newline-trimming logic in programming languages like Python or Rust and libraries like Qt?

(Even if, personally, I'd prefer the option to translate it differently so I can keep that accidental "different icons for AFP and SMB" feature while also having working .hidden.)

...I suppose I could try spending an afternoon writing an applefix FUSE proxy filesystem and adjust my afp.conf to point to /srv/.retro_applefix instead of /srv/retro... but I shouldn't need to.

Aug 11 '24 02:08 ssokolow

At one point, Samba and Netatalk were under the same management team. Being able to export the same share via multiple protocols is desirable, so some coordination on how filenames and metadata were stored on the host file system was needed. Being able to store a filename as close to the original requested name as possible is always the desired outcome. Since Netatalk was always UNIX based, it tends to be very flexible with filename storage. It didn't have to bend over backwards in this case, so it didn't.

Aug 11 '24 03:08 NJRoadfan

And yet, by Finder's weird naming choice and Netatalk adopting a "No hidden/invisible filesystem attribute? Not my problem." design, the emergent result is a worst-case for UNIX/Unixoid platforms that aren't accessing it through something like Samba's veto files.

If Samba bends over backwards for Windows that far, isn't it only fair that Netatalk make a small nod to address a bug that emerges from an odd Macintosh design choice (Icon<0x0D>) slamming into UNIX↔Windows interop ("Universal newline parsing") now that Unixy platforms are finally gaining a way to hide files without renaming them?

It really does feel unreasonable that I need either a network filesystem or a FUSE filesystem to both have custom folder icons in /srv/retro on Macintosh and not have them cluttering up my file manager on Linux, just because Netatalk adopts a posture I'd characterize as "rude" in the Jargon File, Definition 3 sense.

Anything that manipulates a shared resource without regard for its other users in such a way as to cause a (non-fatal) problem. Examples: programs that change tty modes without resetting them on exit, or windowing programs that keep forcing themselves to the top of the window stack.

Aug 11 '24 03:08 ssokolow

@ssokolow I appreciate all the know-how and research you're sharing in this thread. If I could ask you one favor: Please keep a positive and constructive tone in your messages. The arguably poor design decisions in Netatalk are at least two decades old, and if I read @NJRoadfan's intentions correctly, he is describing the current state rather than defending it.

I'm personally not ruling out changing Netatalk's filename mangling behavior. Obviously, a change in such a core part of the application will require careful coding and thorough testing. The absolute best way to get traction here would be for you to fork Netatalk, do the requisite code changes, and file a PR back to the project so that we can proceed with code review & testing. We seriously consider all code contributions that adhere to the coding guidelines.

Cheers!

Aug 11 '24 04:08 rdmark

Sorry. I guess my frustration led me to slip on evaluating my phrasing.

As for forking and PRs, unfortunately, I don't trust myself to write in a memory-unsafe language for anything long-running, exposed to the network, or more complex than a little MS-DOS (or, when I can make time to resume learning, Classic Macintosh) utility and Netatalk is all three... especially when I'm currently struggling with the effects of bad sleep habits and am more dependent than ever on the Rust compiler to catch my mistakes.

"Careful coding and thorough testing" is the last thing I trust myself to do at this point in time.

Aug 11 '24 04:08 ssokolow

No worries; thanks for being open to constructive criticism. :)

Doing any kind of substantive change to this C codebase is absolutely terrifying for all of us, with 0% unit test coverage and complex code paths all over the place. But we have at least SonarCloud static analysis and cross-platform CI builds (and human code reviews) to protect us against some of the more obvious bugs. If you ever change your mind, we'll be awaiting your contribution eagerly.

BTW, I have only cursory understanding of Rust, but I wonder how memory safety would be achieved for a multi-process / multi-threaded application like Netatalk? How can the compiler anticipate all potential states?

Aug 11 '24 04:08 rdmark

BTW, I have only cursory understanding of Rust, but I wonder how memory safety would be achieved for a multi-process / multi-threaded application like Netatalk? How can the compiler anticipate all potential states?

It's basically the same sort of situation as asking how a type system like C's can anticipate everything. You make certain unlikely-to-be-correct programs (eg. storing integers in two registers and then performing FADD on them without first translating them from integer form to floating-point form) more difficult in exchange for making testing the correctness of some property of the vast majority of correct programs tractable.

In Rust's case, it's mostly a superset of what's considered good practice in C++ these days, but built into the design of the language and standard library APIs so that you don't have the off-putting degree of annotation clutter and drudgework that would be involved in retrofitting a C codebase with something like splint.

There are things where you can't express them in "safe Rust"... but that's why the unsafe keyword exists to grant localized access to things like dereferencing raw pointers so you can build manually-audited, correct-by-construction abstractions (this is how things like Vec<T> in the standard library and safe FFI bindings are built)

For example...

Common to all architectures:

You need to be inside an unsafe block to be able to dereference a raw pointer, call a function marked unsafe (eg. FFI), and a couple of other things, which greatly limits the scope of code that needs to be audited to ensure memory-safety invariants hold in the rest of the codebase.
Rust has a powerful type system that the community likes to use to enforce as many invariants as is reasonably possible. (eg. Hyper uses the typestate pattern to ensure that things like trying to set an HTTP header after the request/response body has begun streaming is a compile-time error. The typestate pattern can enforce correct traversal of any finite state machine at compile time. Basically, "No such method .set_header() on type HttpRequest<BodyStreaming>"... though Hyper doesn't name their type that IIRC. What C++ can't do is use the borrow checker to keep you from holding onto a reference to a stale state object and using it... yes, Rust relies somewhat heavily on LLVM's optimizers to make this stuff zero-cost.)
The default scheme for managing heap memory is RAII, with an owning type like String or Vec<T> allocating in its constructor and deallocating in its destructor. If you need more complex lifetimes involving multiple owners (surprisingly less common than you'd think), you use the Rc<T> or Arc<T> reference-counted smart pointer types which are equivalent to C++'s std::shared_ptr but with an explicit choice for whether they use atomic instructions instead of glibc's "we'll decide whether to link the version using atomic instructions based on whether libpthread gets linked".
Assignment moves values (non-overridable memcpy and compile-time checked inability to observe or manipulate the old location) unless the Copy marker trait (interface) has been implemented and you cannot implement Copy on types with destructors, in accordance with Rust's "make costs explicit" philosophy.
References (what Rust calls pointers that aren't the raw kind you can only dereference inside unsafe) and slices ((pointer, length) tuples with nice APIs wrapped around them) implement what is essentially compile-time reader-writer locking. This means that you can't take an &mut ("mutable"... but technically "unique" would be more accurate, given that things like Mutex<T> provide a way to temporarily get an &mut reference from an & reference) reference while any & (shared) references are being held. (However, thanks to unsafe, they tend to provide manually audited methods like split_at, which takes one slice and gives you two non-overlapping ones. This compile-time reader-writer locking is also the mechanism which prevents iterator invalidation at compile time.)
If you need to go beyond what can be proven at compile time, then you add a Cell<T>/RefCell<T> (single-threaded) or Mutex<T>/RwLock<T> (multi-threaded) wrapper, which uses the type system to ensure you can't forget to take a lock. (You hand ownership of the object to the Mutex when creating it and then it lends out MutexGuard smart pointers when you lock it, which unlock the mutex when they go out of scope. There's also the Atomic* family of primitive-sized types like AtomicU8 which are less generic but don't need a mutex to be updated.)
The compiler verifies that references don't outlive what they point to, but there are limits to this so you may need to wrap something in a smart pointer like Rc<T> or Arc<T> (Rust's counterparts to C++'s shared_ptr), depending on what you want to achieve. (This is one of the biggest philosophical differences between C or C++ and Rust. Gödel's incompleteness theorems effectively say there will always be some code that can't be proven to be definitively correct or incorrect. C or C++ resolve this by deferring to "trust the programmer" while Rust instead chooses to go with "assume uncertain cases are incorrect", with unsafe as an escape hatch that grants access to additional language features that aren't provable at compile time.)
Rust provides compiler-level support for tagged unions (via the enum keyword because the functional programming world calls them "data-bearing enums") which are used for things like monadic error handling. (If a function returns String instead of Option<String>, you know it can't be null. If a function returns String instead of Result<String, SomethingError>, then you know the only kind of possible failures are the kind you'd use ASSERT for in C or C++.) It builds on this to require, at compile time, that you specify what should happen in the None/Err(E) case in order to get access to the data from the Some(T)/Ok(T) case. (Rust's match, which is like switch/case on steroids, makes this very comfortable.) NULL as C programmers know it only exists for raw pointers (the ones you need unsafe to dereference) and un-tagged unions (also unsafe) are only there for the C FFI support.
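A short sketch of how that plays out, assuming a made-up parse_port helper and a made-up PortError type (neither comes from any real library):

```rust
/// Hypothetical error type: a data-bearing enum with two variants.
#[derive(Debug, PartialEq)]
enum PortError {
    Empty,
    Invalid(String),
}

/// The signature alone tells the caller this can fail, and how.
fn parse_port(text: &str) -> Result<u16, PortError> {
    let trimmed = text.trim();
    if trimmed.is_empty() {
        return Err(PortError::Empty);
    }
    trimmed
        .parse::<u16>()
        .map_err(|_| PortError::Invalid(trimmed.to_string()))
}

/// `match` must cover every case before the `u16` becomes reachable.
/// Deleting any arm below is a compile error, not a run-time surprise.
fn describe(text: &str) -> String {
    match parse_port(text) {
        Ok(port) => format!("port {port}"),
        Err(PortError::Empty) => "no input".to_string(),
        Err(PortError::Invalid(raw)) => format!("not a port: {raw}"),
    }
}
```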
Unlike C++, Rust doesn't have you worry about whether a struct or enum is POD. It will never automatically insert a vtable. Dynamic dispatch is a property of the reference, not a property of the thing being referenced, and self in Rust methods is just syntactic sugar for free functions taking the type in question as their first argument. (Though, Rust does do automatic structure packing so, unless you use the #[repr(C)] annotation, it reserves the right to reorder your struct members.)
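To make the "property of the reference" point concrete, here is a sketch with an invented Shape trait and Circle struct. The size comparison reflects how current compilers lay out trait-object references (data pointer plus vtable pointer), so treat it as an observation rather than a language guarantee.

```rust
use std::mem::size_of;

struct Circle {
    radius: f64,
}

trait Shape {
    fn area(&self) -> f64;
}

impl Shape for Circle {
    fn area(&self) -> f64 {
        std::f64::consts::PI * self.radius * self.radius
    }
}

/// Static dispatch: the compiler generates a copy of this per concrete type.
fn area_static<S: Shape>(shape: &S) -> f64 {
    shape.area()
}

/// Dynamic dispatch: the vtable pointer travels with the reference.
fn area_dynamic(shape: &dyn Shape) -> f64 {
    shape.area()
}

/// Sizes of the struct, a plain reference to it, and a trait-object reference.
fn layout_facts() -> (usize, usize, usize) {
    (
        size_of::<Circle>(),
        size_of::<&Circle>(),
        size_of::<&dyn Shape>(),
    )
}
```

Implementing Shape didn't make Circle any bigger than its single f64, and circle.area() is just another way to write Shape::area(&circle).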
Unit testing is as simple as writing a function in the same source file (so it can see private members), annotating it with #[test], and running cargo test... though it's generally recommended to put them all inside a mod test { ... } annotated with #[cfg(test)] to avoid dead code warnings. Cargo also has integration testing, API documentation generation, and testing of code samples in documentation wired up just as simply.
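A minimal sketch of that layout, assuming a hypothetical private helper named strip_trailing_cr:

```rust
/// Private helper: not `pub`, but still visible to the tests below.
fn strip_trailing_cr(name: &str) -> &str {
    name.strip_suffix('\r').unwrap_or(name)
}

// Only compiled when running `cargo test`, so nothing in here triggers
// dead code warnings in a normal build.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn removes_one_trailing_cr() {
        assert_eq!(strip_trailing_cr("Icon\r"), "Icon");
    }

    #[test]
    fn leaves_other_names_alone() {
        assert_eq!(strip_trailing_cr("Icon"), "Icon");
    }
}
```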
Rust does support LLVM's sanitizers for checking your unsafe code. Rust also has its own tool named miri (after the Mid-Level IR Interpreter that it's wrapped around, originally written for evaluating const initializers) that, while it doesn't yet support networking APIs, can give a friendly, high-level explanation of observed rule-breaking in your unsafe code, including an experimental data-race detector. (It carries on Rust's tradition of extra-friendly error messages.)
The Rust ecosystem loves to provide tooling to make other forms of correctness verification equally comfortable, such as fuzzing, property testing, etc. (Differential fuzzing FTW for writing a new implementation of something.)
Rust has a big ecosystem of tools for writing language bindings without having to drop into unsafe. (I like to use Rust as a way to have a single codebase I can maintain once, then use it from my projects in various languages. For example, PyO3 lets me write Python extensions in Rust with compile-time correctness.)
(EDIT) All your native-code tools for C and C++ like gdb and Valgrind still work too.
(EDIT) As Bryan Cantrill said in Is It Time to Rewrite the Operating System in Rust? (please excuse it being from 5 years ago when Rust still had many more rough edges) the composability of Rust makes you reconsider solutions you wouldn't be willing to maintain in C. (He talked about an example where his "I'm pleading. I just want this to compile." Rust code outperformed his optimized C and, when he investigated, he found it was down to how his code spends a lot of time searching maps, the Rust standard library uses a B-Tree for its ordered map implementation, and he's not a brave enough man to be responsible for maintaining anything more complex than an AVL tree in C. I can also vouch for how, since v1.0, Rust has already swapped out its unordered map type (HashMap), its MPSC channel type, and its mutex with more performant versions imported from the larger ecosystem with no disruption to downstream consumers.)
[Feel free to ask more questions. I'm sure there are things I forgot to mention.]

Multi-threading:

Leveraging Rust's powerful type system and an "ABI-unstable-so-only-the-stdlib-can-use-it-outside-nightly-builds" modifier keyword for trait (interface) definitions named auto, all composite types (struct, enum, union, etc.) which contain only types marked with the Send trait will also be marked Send. Same for the Sync trait.
All APIs for creating or sending data to other threads are designed to enforce that only things marked Send may be sent to other threads and only things marked Sync may be shared between threads. (eg. Mutex<T> and RwLock<T> are marked Sync but RefCell<T> and Cell<T> don't use atomic instructions, so they're marked !Sync. Rc<T> doesn't use atomic instructions so it's !Send and only useful for reference-counting within a thread, but Arc<T> does, so it implements a type constraint along the lines of "I'm Send if my contents are Send and Sync". Mutex implements "I'm Sync as long as my contents are Send", so you can describe constraints like "This is a wrapper around a platform API that uses thread-local storage internally". The nice thing is that, so long as you're not using unsafe, you don't have to worry about getting any of this wrong. If you do, it won't compile. On multiple occasions, people have talked about refactoring a codebase for multi-threading by adding threading and then addressing compile errors until it works.)
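A sketch of how those markers surface in everyday code. The helper names assert_send, assert_sync, parallel_count and compile_time_checks are made up; everything else is from the standard library.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Empty generic functions that only compile for types with the given marker.
fn assert_send<T: Send>() {}
fn assert_sync<T: Sync>() {}

/// Spawns `workers` threads that each add `per_worker` to a shared total.
fn parallel_count(workers: usize, per_worker: usize) -> usize {
    let total = Arc::new(Mutex::new(0usize));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let total = Arc::clone(&total);
            // `thread::spawn` requires its closure to be `Send`, which is
            // why an `Arc<Mutex<_>>` is accepted here and an `Rc` is not.
            thread::spawn(move || {
                for _ in 0..per_worker {
                    *total.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let result = *total.lock().unwrap();
    result
}

fn compile_time_checks() {
    assert_send::<Arc<Mutex<usize>>>();
    assert_sync::<Mutex<usize>>();
    // Uncommenting either line below turns into a compile error:
    // assert_send::<std::rc::Rc<usize>>();
    // assert_sync::<std::cell::RefCell<usize>>();
}
```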
The design of the Rust ecosystem "encourages by making it easy" designs that involve message passing. For example, you can have a channel (threadsafe queue, if you're not familiar with the term from something like Go) that you use to move things from one thread to another and, if you're doing it in-process, you're only paying to copy the portion on the stack. For example, the stack portion of a Vec<T> is a tuple along the lines of (capacity, length, data_ptr).
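Here is a sketch of that using the standard library's std::sync::mpsc channel. The function name is invented, and comparing buffer addresses is just a way to make "only the stack portion is copied" observable.

```rust
use std::sync::mpsc;
use std::thread;

/// Sends a Vec to another thread and back through a channel, returning the
/// address of its heap buffer before and after, plus the Vec itself.
fn send_vec_through_channel(data: Vec<u8>) -> (usize, usize, Vec<u8>) {
    let address_before = data.as_ptr() as usize;
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let producer = thread::spawn(move || {
        // Ownership of `data` moves into the channel. Only the small
        // (capacity, length, pointer) header is copied.
        tx.send(data).unwrap();
    });
    let received = rx.recv().unwrap();
    producer.join().unwrap();
    let address_after = received.as_ptr() as usize;
    (address_before, address_after, received)
}
```

The two addresses should match because the heap allocation never moved, only the ownership of it did.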
If your goal is to express parallelism that can be represented as a scatter-gather on an iterable with no dependency between the elements, and you're already using the Iterator API, the Rayon crate (library) makes scheduling it on a thread pool as simple as replacing .iter() with .par_iter().
The loom crate adds permutation testing to your test suite so that you can shake any concurrency bugs out of code you wrote using lower-level concurrency APIs like the load and store methods on the Atomic* types.
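For reference, this is the kind of lower-level code in question, sketched with only the standard library (loom itself is not used here, and the function name atomic_count is made up):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Counts to `workers * per_worker` across threads without a mutex.
fn atomic_count(workers: usize, per_worker: usize) -> usize {
    let hits = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let hits = Arc::clone(&hits);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    // A single atomic read-modify-write. Writing this as a
                    // separate `load` followed by a `store` would compile
                    // but lose updates under contention, which is the sort
                    // of bug permutation testing is meant to surface.
                    hits.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    hits.load(Ordering::SeqCst)
}
```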
[Feel free to ask more questions. I'm sure there are things I forgot to mention.]

EDIT: For multi-process, there's a limit to how much any one language can do beyond the process-internal things, but there is the typestate pattern, which Rust's de facto standard HTTP implementation is using to good effect. Rust has also proven itself good for parsing and serialization/deserialization tasks. (See Serde for Rust's de facto standard framework for that, as well as building blocks that make things more comfortable like bytemuck and byteorder.) It also just generally helps if you can trust the compiler to watch your back, so you need to spend less of your energy scrutinizing other aspects of the code.
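A bare-bones sketch of the typestate pattern, with an invented Connection type. It is not modelled on any particular library's API; it only shows how the state lives in the type so that invalid sequences of calls don't compile.

```rust
use std::marker::PhantomData;

// Zero-sized marker types naming the states.
struct Closed;
struct Open;

struct Connection<State> {
    peer: String,
    sent: Vec<String>,
    _state: PhantomData<State>,
}

impl Connection<Closed> {
    fn new(peer: &str) -> Self {
        Connection { peer: peer.to_string(), sent: Vec::new(), _state: PhantomData }
    }

    /// Consumes the closed connection, so the old handle can't be reused.
    fn open(self) -> Connection<Open> {
        Connection { peer: self.peer, sent: self.sent, _state: PhantomData }
    }
}

impl Connection<Open> {
    /// Only exists on open connections. Calling `send` on a
    /// `Connection<Closed>` is a compile error, not a run-time check.
    fn send(&mut self, message: &str) {
        self.sent.push(message.to_string());
    }

    /// Consumes the open connection and hands back what was recorded.
    fn close(self) -> (String, Vec<String>) {
        (self.peer, self.sent)
    }
}
```

Because open and close take self by value, "use after close" is ruled out by the same ownership rules as everything else.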

Granted, it's not a panacea, but there's a reason a lot of people have described it as "makes programming fun again". (eg. leaking memory is safe and there's even an API for it (just call mem::forget on something with a heap allocation), because that's no more dangerous than what you can do with a list() in Python or an Array in JavaScript and it's a very difficult thing to prove at compile time. If you're using shared memory instead of message passing and you interact with multiple locking primitives at once, Rust won't magically remove the need to know and apply a solution to the dining philosophers problem. The public/private boundary is the module, so don't assume that auditing just the lines in your unsafe blocks is enough, async/await comes across as surprisingly skill-demanding compared to the rest of the language, etc.)

...and there is one place where unsafe is more dangerous than C for someone with C intuition and that's that you can't just perform operations on raw pointers without knowing which ones will create a temporary reference (&/&mut) because you're still subject to the rules for never aliasing references. (Basically, Rust makes liberal use of the LLVM IR construct that C's restrict translates to. Miri will tell you if you got it wrong. When in doubt, use std::ptr functions.)
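As a small sketch of "when in doubt, use std::ptr functions", here is a hypothetical swap_elements helper that never materializes two &mut references into the same slice:

```rust
use std::ptr;

/// Swaps two elements of a slice through raw pointers.
fn swap_elements(values: &mut [u32], i: usize, j: usize) {
    assert!(i < values.len() && j < values.len());
    // A single raw pointer derived from the one `&mut` borrow we were given.
    let base = values.as_mut_ptr();
    // SAFETY: both indices were bounds-checked above, and `ptr::swap` is
    // documented as allowing its two pointers to overlap, so `i == j` is fine.
    unsafe {
        ptr::swap(base.add(i), base.add(j));
    }
    // The C-intuition version would be something like
    //     let a = &mut *base.add(i);
    //     let b = &mut *base.add(j);
    // which creates two aliasing `&mut` references when `i == j` and is
    // undefined behaviour even if nothing visibly goes wrong.
}
```

Safe code would just call values.swap(i, j). The point is only to show where the temporary references would sneak in.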

As someone who avoids unsafe if at all possible (#![forbid(unsafe_code)] at the top of the source file that defines the root of the crate (library)), the main flaw of Rust I run into is that they did a bit too good a job of making costs explicit, so Rust has a tendency to lure you into premature optimization.

Aug 11 '24 06:08 ssokolow
