ruff [red-knot] Rename `FileSystem` to `System`

Summary

This PR is mainly a renaming PR to address the confusion around the existing FileSystem trait.

Today, Red Knot uses a FileSystem trait to abstract the system-dependent file-level operations. They're system-dependent because reading a file differs between the CLI (always goes to disk), the LSP (may return unsaved content), and WASM (has no file system). So far, this makes sense. However, it all became very confusing when we introduced the VendoredFileSystem, which doesn't implement FileSystem because the logic isn't system-dependent. Still, its name suggests that it should implement the FileSystem trait (it is a filesystem-like interface, after all).

This PR renames the FileSystem trait to System (and VfsPath::FileSystem to VfsPath::System, FileSystemPath to SystemPath, and FileSystemPathBuf to SystemPathBuf) to make the distinction more clear. The trait isn't about abstracting a way a file-system like API; it's solely for abstracting system-dependent logic, which, for now, is mainly IO-related, but I can envision more:

Reading a file as a notebook rather than a string
Getting information about the runtime environment: CLI/LSP (OS, architecture, etc), WASM: wasm engine in addition to the OS architecture, etc
Setting up watch tasks (I'm not 100% sure about this one yet, mainly because it requires tight integration into the host's main loop and, therefore, might not be possible to abstract away)
Traversing a directory
Getting the current working directory (again, WASM doesn't know the concept of current working directory)
Getting the configuration directory

The second rename that I've done as part of this PR is to rename Vfs to Files, VfsFile to File, and VfsPath to FilePath. While it is correct that Files emulates a virtual file system by supporting both vendors and system paths, it seemed to me that it over-emphasizes that fact. As far as all upstream code is concerned, these are just files that Red Knot can operate on. That's why I opted for a simpler name.

Should `VendoredFileSystem` and `System` have a shared `ReadOnlyFileSystem` trait?

I mulled over this for a long time. It is very intriguing: Both implement a file system-like API. However, I’m concluding that they should not have a shared base trait because:

We have no use case today for a generic operation over a file system (either using impl or dyn). Such an operation would have the form of fs.read_to_string(path) where fs can either be vendored or system.
Even if the two have a shared base trait, they couldn’t be used interchangeably because VendoredFileSystem uses VendoredPath and System uses SystemPath. We could remove the distinction between VendoredPath and SystemPath, but I think this distinction is more valuable than having a shared base trait. A shared base trait would widen the API of VendoredFileSystem because it suddenly must support FileType::Symlink and permissions.

I’m open to reconsidering introducing a shared base trait when we have a use case for it.

Exposing the `VendoredFileSystem`

Today, VendoredFileSystem is a field on Files (Vfs). This PR introduces a new vendored method on the Db. A trait method is necessary because the ruff_db function source_text and the Files struct operate on vendored files. but the vendored typeshed stubs only ship with red_knot_module_resolver.

For tests, I decided that each test stores its own VendoredFileSystem instance. This allows replacing the VendoredFileSystem with an ad-hoc built zip file and also guarantees full isolation between tests.

Should `Db` implement `System`?

An alternative to what's done in this PR is to remove the system method on Db and instead make Db extend System. The main advantage is that it removes one additional dynamic-dispatch (system() returns dyn System).

I opted for keeping the function because:

It's not entirely clear to me yet if we'll have code that needs to be generic over system that runs before constructing Program.
It's easier to share code between the different runtime environments. We can have a single Program that takes a &dyn System as argument instead of having a custom Program for each environment (although we may still want to do that, this is unclear to me)

Overall, I think this can easily be changed later if we decide that Db extending System is the right way.

Should it be named `Host`?

An alternative to System is Host. I prefer System but don't have a strong reasoning for it. I'm open to renaming it to Host (or any other name)` if someone feels strongly. I prefer not to rename it if we all think both are about as good, just to safe some time.

Test Plan

cargo test

Jul 06 '24 14:07 MichaReiser

CodSpeed Performance Report

Merging #12214 will improve performances by 4.09%

_{Comparing file-system-to-system (2047b82) with main (16a63c8)}

Summary

⚡ 1 improvements ✅ 32 untouched benchmarks

Benchmarks breakdown

	Benchmark	`main`	`file-system-to-system`	Change
⚡	`linter/all-with-preview-rules[unicode/pypinyin.py]`	2.5 ms	2.4 ms	+4.09%

Jul 06 '24 14:07 codspeed-hq[bot]

Hmm interesting. It seems that just including vendored file system causes a 60% perf regression because dropping it (the Zip has a ton of metadata) is so expensive.

Jul 06 '24 15:07 MichaReiser

Hmm interesting. It seems that just including vendored file system causes a 60% perf regression because dropping it (the Zip has a ton of metadata) is so expensive.

I was wondering about using a Lazy static variable for it, since it's immutable read-only data:

diff --git a/Cargo.lock b/Cargo.lock
index 1da9a1e54..f426a6cd2 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -1894,6 +1894,7 @@ dependencies = [
  "camino",
  "compact_str",
  "insta",
+ "once_cell",
  "path-slash",
  "ruff_db",
  "ruff_python_stdlib",
diff --git a/crates/red_knot_module_resolver/Cargo.toml b/crates/red_knot_module_resolver/Cargo.toml
index 99e69f35c..a6761665d 100644
--- a/crates/red_knot_module_resolver/Cargo.toml
+++ b/crates/red_knot_module_resolver/Cargo.toml
@@ -16,6 +16,7 @@ ruff_python_stdlib = { workspace = true }
 
 compact_str = { workspace = true }
 camino = { workspace = true }
+once_cell = { workspace = true }
 rustc-hash = { workspace = true }
 salsa = { workspace = true }
 tracing = { workspace = true }
diff --git a/crates/red_knot_module_resolver/src/lib.rs b/crates/red_knot_module_resolver/src/lib.rs
index d6ec501cc..0a1a77577 100644
--- a/crates/red_knot_module_resolver/src/lib.rs
+++ b/crates/red_knot_module_resolver/src/lib.rs
@@ -12,4 +12,4 @@ pub use module::{Module, ModuleKind};
 pub use module_name::ModuleName;
 pub use resolver::{resolve_module, set_module_resolution_settings, RawModuleResolutionSettings};
 pub use supported_py_version::TargetVersion;
-pub use typeshed::{TypeshedVersionsParseError, TypeshedVersionsParseErrorKind};
+pub use typeshed::{TypeshedVersionsParseError, TypeshedVersionsParseErrorKind, TYPESHED};
diff --git a/crates/red_knot_module_resolver/src/typeshed.rs b/crates/red_knot_module_resolver/src/typeshed.rs
index c8a36b462..718192d0d 100644
--- a/crates/red_knot_module_resolver/src/typeshed.rs
+++ b/crates/red_knot_module_resolver/src/typeshed.rs
@@ -1,22 +1,29 @@
 mod versions;
 
+use once_cell::sync::Lazy;
+
 pub(crate) use versions::{
     parse_typeshed_versions, LazyTypeshedVersions, TypeshedVersionsQueryResult,
 };
 pub use versions::{TypeshedVersionsParseError, TypeshedVersionsParseErrorKind};
 
+use ruff_db::vendored::VendoredFileSystem;
+
+// The file path here is hardcoded in this crate's `build.rs` script.
+// Luckily this crate will fail to build if this file isn't available at build time.
+const TYPESHED_ZIP_BYTES: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/zipped_typeshed.zip"));
+
+pub static TYPESHED: Lazy<VendoredFileSystem> =
+    Lazy::new(|| VendoredFileSystem::new(TYPESHED_ZIP_BYTES).unwrap());
+
 #[cfg(test)]
 mod tests {
     use std::io::{self, Read};
     use std::path::Path;
 
-    use ruff_db::vendored::VendoredFileSystem;
-    use ruff_db::vfs::VendoredPath;
+    use ruff_db::vendored::path::VendoredPath;
 
-    // The file path here is hardcoded in this crate's `build.rs` script.
-    // Luckily this crate will fail to build if this file isn't available at build time.
-    const TYPESHED_ZIP_BYTES: &[u8] =
-        include_bytes!(concat!(env!("OUT_DIR"), "/zipped_typeshed.zip"));
+    use super::*;
 
     #[test]
     fn typeshed_zip_created_at_build_time() {
@@ -39,7 +46,6 @@ mod tests {
     #[test]
     fn typeshed_vfs_consistent_with_vendored_stubs() {
         let vendored_typeshed_dir = Path::new("vendor/typeshed").canonicalize().unwrap();
-        let vendored_typeshed_stubs = VendoredFileSystem::new(TYPESHED_ZIP_BYTES).unwrap();
 
         let mut empty_iterator = true;
         for entry in walkdir::WalkDir::new(&vendored_typeshed_dir).min_depth(1) {
@@ -58,16 +64,16 @@ mod tests {
                 .unwrap_or_else(|_| panic!("Expected {relative_path:?} to be valid UTF-8"));
 
             assert!(
-                vendored_typeshed_stubs.exists(vendored_path),
+                TYPESHED.exists(vendored_path),
                 "Expected {vendored_path:?} to exist in the `VendoredFileSystem`!
 
                 Vendored file system:
 
-                {vendored_typeshed_stubs:#?}
+                {TYPESHED:#?}
                 "
             );
 
-            let vendored_path_kind = vendored_typeshed_stubs
+            let vendored_path_kind = TYPESHED
                 .metadata(vendored_path)
                 .unwrap_or_else(|| {
                     panic!(
@@ -75,7 +81,7 @@ mod tests {
 
                         Vendored file system:
 
-                        {vendored_typeshed_stubs:#?}
+                        {TYPESHED:#?}
                         "
                     )
                 })

Jul 06 '24 15:07 AlexWaygood

I was wondering about using a Lazy static variable for it, since it's immutable read-only data:

Hmm yeah, but we would also need to make it lazy in Program but it might be worth it. Anyway, I fixed the benchmarks by excluding the drop time :laughing:

Jul 06 '24 15:07 MichaReiser

Hmm yeah, but we would also need to make it lazy in Program

I don't think Program would need to own a reference to it if it was just a module-level static variable. Code that needs to lookup a file in the vendored typeshed stubs could just reference the existing static variable from the module resolver crate.

Jul 06 '24 15:07 AlexWaygood

I don't think Program would need to own a reference to it if it was just a module-level static variable. Code that needs to lookup a file in the vendored typeshed stubs could just reference the existing static variable from the module resolver crate.

That's correct. That works for Program but not for the TestDb databases where each test should get their own instance. We could use Cow there as well, but I think the real solution is to just make VendoredFileSystem lazy by having an inner enum which is either Raw(Cow<'static, [u8]>) or Zip(ZipArchive...). This way, the complexity is isolated into the VendoredFileSystem implementation without concerning the downstream code. I don't plan on changing this as part of this PR.

Jul 06 '24 16:07 MichaReiser

Why does each test need to have its own VendoredFileSystem instance? I thought we agreed that using --custom-typeshed-dir and the MemoryFileSystem was sufficient for mocking out typeshed in tests. I expect only integration tests to need to look up files from a VendoredFileSystem, except for the small number of tests we already have in ruff_db that are focused unit tests for the specific functionality of the VendoredFileSystem itself

Jul 06 '24 16:07 AlexWaygood

I thought we agreed that using --custom-typeshed-dir and the MemoryFileSystem was sufficient for mocking out typeshed in test

I understood that we agreed on that the ability to create ad-hoc zip files AND custom typeshed dir is sufficient. The former requires that tests have their own VendoredFileSystem.

There are also a few tests in ruff_db where we don't have the stubbed vendored files.

I don't think that storing a &'static VendoredFileSystem on Program would solve the initialization cost. The zip file still gets loaded eagerly when creating the Program. The main advantage that I see is that dropping a Program doesn't require deallocating the VendoredFileSystem. But we shouldn't make too big a deal out of it. The Zip file doesn't store that much data, at least not compared to the amount of data that we'll store when loading builtins.

The only other solution is to not store vendored on Program and instead make all code directly access a global static. I prefer not to have global variables because they make things harder to test. But more fundamentally, there's code in ruff_db that needs access to the vendored file system but the typeshed files are only available in red_knot_module_resolver and higher.

Either way. I prefer not to solve this as part of this PR.

Jul 06 '24 17:07 MichaReiser

Either way. I prefer not to solve this as part of this PR.

Sure — the reason for my questions was I was trying to understand the motivation for the changes you were making as part of this PR, which might not necessarily be desirable if we went in a slightly different direction :-)

But I agree we don't need to come to a conclusive answer on these questions now!

Jul 06 '24 18:07 AlexWaygood

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

Jul 07 '24 09:07 github-actions[bot]

I'll wait with merging to give @carljm a chance to look at it as well.

Jul 08 '24 10:07 MichaReiser

ruff ruff copied to clipboard

[red-knot] Rename `FileSystem` to `System`

Summary

Should VendoredFileSystem and System have a shared ReadOnlyFileSystem trait?

Exposing the VendoredFileSystem

Should Db implement System?

Should it be named Host?

Test Plan

CodSpeed Performance Report

Merging #12214 will improve performances by 4.09%

Summary

Benchmarks breakdown

ruff-ecosystem results

Linter (stable)

Linter (preview)

Formatter (stable)

Formatter (preview)

ruff
ruff copied to clipboard

Should `VendoredFileSystem` and `System` have a shared `ReadOnlyFileSystem` trait?

Exposing the `VendoredFileSystem`

Should `Db` implement `System`?

Should it be named `Host`?

`ruff-ecosystem` results