jj icon indicating copy to clipboard operation
jj copied to clipboard

FR: git core.autocrlf config support

Open 06393993 opened this issue 7 months ago • 6 comments

Let's use this issue to discuss the design decisions on git core.autocrlf support and how that support can integrate with EOL conversion in other backends, especially the fig/mercurial backend, instead of discussing design decision in an ad-hoc way through the review comments on https://github.com/jj-vcs/jj/pull/6728.

  • Should jj support the core.autocrlf flavor of EOL conversion for git backend or should it be available as a jj config, and available for all backends?
    • If the core.autocrlf flavor support is only for git backend, are we in favor of introducing an abstraction layer like the TargetEolStrategy trait in https://github.com/jj-vcs/jj/commit/e2293f6f994e29735a0b0f8df597c0bb00f0a5bb to hide the git related code as implementation details. If not, what's the way to hide the implementation details to avoid #[cfg(feature = "git")] from scattering everywhere?
    • If the core.autocrlf git config flavor support is not only for git backend,
      • Are we going to introduce a new jj setting to override the git config? If so, what are the key and valid values?
      • How should this support interact with other potential backends, especially the fig/mercurial backend that uses .hgeol?
  • Should the core.autocrlf config be part of the SnapshotOptions and the CheckoutOptions API or should be initialized locally in local_working_copy.rs?
    • If we are supporting the core.autocrlf flavor config for all backends, changing SnapshotOptions and CheckoutOptions seem to be the best choice, since we don’t have access to UserSettings in local_working_copy.rs.
    • SnapshotOptions and CheckoutOptions are public APIs, and adding fields to them may break downstream projects.
  • Handling binary files: we shouldn’t perform EOL conversion for binary files.
    • before we have .gitattributes
      • We probe the first N bytes(now 8KB hard coded in the source) from the content, and use the existing logic of gix_filter::eol::Stats::is_binary to decide whether the file is binary
      • For performance, we only probe the file if the configuration requires EOL conversion, i.e., core.autocrlf = true when snapshotting and updating or core.autocrlf=input when snapshotting. We don’t probe the file otherwise, i.e., core.autocrlf=false when snapshotting and updating or core.autocrlf=input when updating.
    • after we have .gitattributes, we follow the text attribute.
      • If the text attribute is unset, we won’t perform the EOL conversion regardless of the core.autocrlf config.
      • If the text attribute is set, when snapshotting, we always convert the EOL to LF, when updating we perform the EOL conversion according to the eol attribute or the core.eol git config.
      • If the text attribute is set to a string value “auto”, we perform the binary file probing described previously in the “before we have .gitattributes” section.
      • If the text attribute is unspecified, we perform the EOL conversion based on the core.autocrlf git config without probing whether the file is binary, which will change the behavior described in the “before we have .gitattributes” section.
  • Implement the EOL conversion in local_working_copy.rs
    • store -> disk EOL conversion
      • We only need LF to CRLF conversion, even if we take .gitattributes into consideration, the eol=lf attribute just means “uses the same line endings in the working directory as in the index when the file is checked out” in the gitattributes documentation.
      • In TreeState::update, where we handle the MaterializedTreeValue::File case, we query the decision on EOL conversion and pass that to TreeState::write_file as a new parameter.
      • In the MaterializedTreeValue::Symlink case in TreeState::update, we pass TargetEol::PassThrough to TreeState::write_file if the symbol link is not supported. This is the precise reason why we don’t move the logic to query the EOL conversion decision in TreeState::write_file, because in this case, we need to write without EOL conversion regardless of any configurations.
      • To implement EOL conversion in TreeState::write_file, we need to read the contents from the AsyncRead, and conduct the EOL conversion based on the content. We have several options:
        • A separate type, EolWriter, that implements the Write treat that converts the EOL on the fly. When the caller calls <EolWriter as std::io::Write>::write, it converts the EOL of the passed in data before writing to the inner writer, which in our case, the File to write. We still use the existing copy_async_to_sync function. To report the correct FileState::size, we need another utility type to count the actual bytes written to the file instead of the bytes consumed. The numbers can be different due to EOL conversion. This design allows the least memory footprint, especially if we are going to support other features that change the bytes when snapshotting or updating, e.g., the working-tree-encoding git attributes. However, the benefit is questionable given that we are supposed to only convert EOL for text files, which should never be too large. Therefore, we won’t adopt it based on the complexity.
        • Read all the contents asynchronous to a Vec<u8>. Use a function to convert the EOL for a Vec<u8>. Write the converted bytes back to the file system. With this implementation, we will change TreeState::write_file to an async function and get rid of copy_async_to_sync. Note that all the caller of TreeState::write_file is in TreeState::update which is already an async function. To report the correct FileState::size, we use the number of bytes after the EOL conversion is applied. We tend to adopt this design unless we don’t like changing TreeState::write_file to async.
        • Similar to the above, but use copy_async_to_sync to read the content of the file to a Vec<u8> synchronously. This allows us to keep TreeState::write_file away from async fn. The current implementation of copy_async_to_sync allocates another internal 16KB buffer. This implementation has the largest memory footprint. We tend not to adopt this design unless we want to keep async away from TreeState::write_file.
      • Should the EOL conversion function handle the logic for no-EOL conversion, and read from the store? Moving both logic into the EOL conversion function allows us to reduce the complexity of the TreeState::write_file, and unit test more paths. TreeState::write_file is almost not unit testable: one needs to set up Store to create TreeState. Testing a function whose parameters are impl AsyncRead, TargetEol is simpler.
    • disk -> store EOL conversion
      • We only need CRLF to LF conversion. No git configuration needs a LF to CRLF conversion when checking in code.
      • In FileSnapshotter::write_path_to_store, where we handle the merge conflicts, in addition to just reading the content of the file from the file system, we need to apply line conversion according to the configuration.
      • In FileSnapshotter::write_file_to_store, where we handle files that don’t have merge conflicts, we should pass a std::io::Read that reads the file content with EOL converted according to the configuration.
      • Should we implement another type that implements std::io::Read with EOL conversion from a File, or should we create a function that reads the content as a whole and do the EOL conversion in place?
        • A separate type is more complicated.
        • A separate type can implement the read with less memory footprint, especially if we are going to support other features that change the bytes when snapshotting or updating, e.g., the working-tree-encoding git attributes. However we are assumed to only read text files now, so the memory saved is questionable.
        • A separate type can delay the read to happen when the read is necessary. This is especially true of FileSnapshotter::write_file_to_store, which passes an AsyncRead to Backend::write_file, and may be able to yield the thread when waiting for IO. This benefit is also questionable, because currently we are using BlockingAsyncReader, which blocks the current thread when waiting for IO.
        • Therefore, we decide to go with a function that reads all the content and does the EOL conversion.
      • Should the EOL conversion function handle the logic for no-EOL conversion, and read from the files?
        • https://github.com/jj-vcs/jj/pull/6728#discussion_r2139307343 suggests that this function should handle the file read, and should only handle the case where an EOL conversion is needed.
        • Letting the function handle the case where no EOL conversion is needed can save an if at the caller site, FileSnapshotter::write_file_to_store or FileSnapshotter::write_path_to_store, which are complicated and difficult to unit test. In addition, it allows us to save 1 if, because we need to repeat the “check whether we need EOL conversion” logic in 2 places.
    • How to pass the EOL config to the function that does the EOL conversion?
      • According to different design decision to the “Should the core.autocrlf config be part of the SnapshotOptions and the CheckoutOptions API or should be initialized locally in local_working_copy.rs” question, the core.autocrlf config can be read at different places:
        • right at where the EOL conversion happens(e.g. FileSnapshotter::write_path_to_store);
        • when TreeState is initialized(i.e., TreeState::empty);
        • or when SnapshotOptions or the CheckoutOptions type is constructed(e.g., in WorkspaceCommandHelper).
      • Right at where the EOL conversion happens(e.g. FileSnapshotter::write_path_to_store) should not be used, because this triggers a git config read on every file to snapshot or update, and according to my performance test, can make snapshotting 30% slower. See details at https://github.com/jj-vcs/jj/pull/6728/commits/cf28eb289f51c9fa047082b1edfaeec511eed729.
      • If the git config is read when TreeState is initialized(i.e., TreeState::empty), we can put the type that carries the git config information as a field of TreeState, and FileSnapshotter. A new argument will be added to TreeState::write_file, because when we use this function to write a symbol link in update and symbol link is not supported, we need to ignore the EOL conversion and just write as is.
      • If the git config is read when SnapshotOptions or CheckoutOptions is constructed, we can pass the git config information through the function parameters, and as a field of FileSnapshotter. We will add a new parameter to TreeState::update, TreeState::write_file.
      • For the latter 2 designs, the git config read happens at a different place where the EOL conversion happens, so we need a data type that passes the config to functions that perform the EOL conversion, and we have different designs:
        • directly pass the value of the core.autocrlf config around;
        • pass a concrete type similar to GitIgnoreFiles that provides functions to get the git config;
        • or pass a trait object similar to the matcher parameter in TreeState::update.
      • Directly pass the core.autocrlf config value around. This design is suggested by https://github.com/jj-vcs/jj/pull/6728#discussion_r2137552726. This design should be the simplest design, however it could bring duplicate code. The logic to probe whether a file is binary may be repeated at various call sites. The future complicated .gitattributes logic that decides whether a file needs an EOL conversion may be also duplicated. It can be argued that we can wrap those logic in a function. But in that case, we lose the simplicity of the solution by introducing another layer of abstraction. Moreover, if we decide to only support the core.autocrlf git config for the git backend, this design leaks the implementation details of the git backend, resulting in a bad encapsulation. If we are going to support .hgeol from fig/mercurial, we will end up with another type and parameters that carries the .hgeol configuration.
      • Pass a concrete type similar to GitIgnoreFiles that provides functions to query the final decision of EOL conversion for a file. This design allows us to put common logic for binary probing and future .gitattributes interaction into methods. However, if we decide to only support the git autocrlf config on git backend only, the implementation details may be leaked, and can also cause #[cfg(feature = “git”)] to scatter everywhere. If we adopt this path, the type name will be TargetEolStrategy. We need 2 different functions to decide the target EOL: fn get_target_eol_from_disk(&self, file_path: &Path) -> TargetEol which probes the file based on the file on the disk, and fn get_target_eol_from_store(&self, repo_path: &RepoPath, file_id: &FileId) -> TargetEol which probes the file based on the file read from Store. To integrate with the future .gitattributes feature, we either add a GitAttributesFile parameter to those functions or we pass this TargetEolStrategy to an implicit backend function to decide the final EOL. The exact design decision is based on our answer to the “do we want to support .gitattributes for all backends” question.
      • Pass a trait object. This design has the highest complexity. It requires us to introduce a trait and a type that implements the trait. To be dyn-compatible, this trait can’t use async fn, and can only use fn that returns a boxed future. This can happen if we want to hide the binary file probe logic in the interfaces: to read a file from the store, we can only use the async Store::read_file function. However, it allows us to hide the implementation details. If we only want to support the git backend for the git core.autocrlf config, and want to provide a completely different implementation for .hgeol, this way provides the best abstraction. It can also minimize the use of #[cfg(feature = “git”)], because we only need that when implementing the trait. We don’t need that when we are calling methods from the trait. Similar to the concrete type design, fn get_target_eol_from_disk(&self, file_path: &Path) -> TargetEol and fn get_target_eol_from_store(&self, repo_path: &RepoPath, file_id: &FileId) -> TargetEol will be the required methods. To integrate with the future .gitattributes feature, we either add a GitAttributesFile parameter to those functions or we let the implementation keep an access to the underlying git .gitattributes implementation and query on demand. The exact design decision is based on our answer to the “do we want to support .gitattributes for all backends” question.
  • Implement the EOL conversion function
    • There will be 2 versions of the function: the async version, async fn convert_eol_async(content: impl AsyncRead, target_eol: TargetEol) ->Box<dyn AsyncRead>, and the non-async version fn convert_eol(content: impl Read, target_eol: TargetEol) -> Box<dyn Read>. The return value is the bytes with EOL converted as expected.
    • If target_eol indicates that no EOL conversion is needed, the passed in value will be wrapped in Box and returned.
    • If target_eol indicates that EOL conversion is needed, we call read_to_end to read all the bytes from the content parameter into a buf of the Vec<u8> type. If we encounter an error, return a Read/AsyncRead trait object that returns the error on the first call to read. This error handling design makes it easier for the caller to handle the error.
    • Create res which is another Vec<u8> with capacity set to the length of buf. We will fill res with EOL converted data later.
    • Iterating lines from buf through ByteSlice::lines seems ideal, but the last line becomes tricky: just from the iterator we can’t tell if the last line ends without LF, with CRLF, or with just LF. Therefore we have a design decision to make on whether to use ByteSlice::lines.
      • if we are using ByteSlice::lines, we use peekable to check if the current line is the last line. If it’s not the last line, we append the current line to res and append a target EOL to res by Vec::extend_from_slice. If it’s the last line, we append the current line to res, and check if the last byte of buf is \n or not. If so, we append the target EOL. If not, we do nothing.
      • if we are not using ByteSlice::lines, we use split with |c| c == b‘\n’ as the predicate to iterate over lines. We also use peekable to check if the current line is the last line. If it’s not the last line, we strip the last possible \r before we push the line to res, and we append the expected EOL. If it’s the last line, we push the line as is to res.
    • Wrap res in a Box<Cursor<Vec<u8>>, and coerce it to Box<dyn AsyncRead> or Box<dyn Read> as needed before we return.
  • File structure. We will add a lib/src/eol.rs file to store all the new types and functions added. I don’t want to call it eol_util suggested in https://github.com/jj-vcs/jj/pull/6728#discussion_r2137566057, because the _util suffix seems meaningless unless we have another naming convention we need to follow. A new lib/tests/test_eol.rs file will be added for integration tests. And if we introduce code in jj-cli(e.g., we construct SnapshotOptions or CheckoutOptions), a new cli/tests/test_eol.rs file will be added.
  • Tests: the goal is to have 100% line coverage for the newly added code, and we use unit tests as much as possible for high coverage.
    • To unit test convert_eol and convert_eol_async, we can implement Read or AsyncRead for a struct that returns error or simply use &[u8] if we need Read to return contents we need. We should at least cover the following cases:
      • convert the EOL to LF, texts have CRLF line ending, verify that CR is correctly removed.
      • convert the EOL to CRLF, texts have LF line ending, verify that CR is correctly added.
      • convert the EOL to LF or CRLF, the text doesn’t have any EOL, verify that the text is not changed
      • convert the EOL to LF or CRLF, the text has only one line which ends with some EOL, verify that the returned bytes also have only one line, and the EOL is correctly converted
      • if the input text contains empty lines(i.e., continuous LF or CRLF), those empty lines should be preserved with EOL correctly converted
      • if the Read parameter or the AsyncRead parameter returns an error when read is called, the returned value function should return the same error on read
      • if the target_eol parameter suggests that no EOL conversion should happen, the output bytes should be the same as the input bytes
      • if the target_eol parameter suggests that no EOL conversion should happen, and the Read parameter or the AsyncRead parameter returns an error when read is called. The returned value should result in an error.
    • Testing the implementation of TargetEolStrategy requires creating Store, and so will be tested via integration tests in lib/tests/test_eol.rs.
      • To set up the test, we need to create a git repo either through TestRepo::init_with_backend.
      • We modify the repo git core.autocrlf config to false, so that we can check in the file to test as is.
      • We then need to recreate the Store to pick up the git core.autocrlf config by implementing a new function. The implementation is similar to TestRepo::init_with_backend, but without creating a new TestEnvironment, and using GitBackend::init_external to replace the GitBackend::init_internal.
      • We use Store::write_file to create the file to test get_target_eol_from_store, one binary file(whose content should be the same as docs/images/favicon-96x96.png), one text file.
      • We modify the repo git core.autocrlf config to true, input, and false for different test cases and write back to the filesystem
      • We recreate the Store again.
      • We write a text file or a binary file to the filesystem to test get_target_eol_from_disk.
      • We use the Store to initialize the TargetEolStrategy, pass in the file to test, and verify the return value of get_target_eol_from_store and get_target_eol_from_disk.
      • In the future, with the .gitattributes feature’s implementation, this test will change accordingly.
    • TreeState is too complicated for unit tests. Modification to TreeState will be tested through integration tests. The test is set up similar to the TargetEolStrategy test.
      • The happy path test for snapshot: similar to TargetEolStrategy tests, but with actually snapshotting:
        • Initialize a test repo, set the autocrlf=true git config, recreate Store and create a Workspace. We can’t use the existing TestWorkspace::init_backend because currently it doesn’t support creating Store to pick up git config changes.
        • Create files with CRLF EOL, LF EOL, mixed EOL and a binary file for different test cases.
        • Create a snapshot. And create a snapshot again to verify that the working copy is clean.
        • Set the core.autocrlf=false git config, and recreate Store and Workspace to pick up the git config change, so that the later checkout won’t perform any EOL conversion.
        • Checkout an empty change and checkout the snapshot again to recreate the files.
        • Verify the file contents are expected: for text files, EOL should be converted to LF, for the binary file, no change should happen.
      • The happy path test for update: similar to snapshot, but we change the core.autocrlf git config to the value under test when check in, and change it to core.autocrlf=false when check out. And we don’t verify if the working copy is clean.
      • Test the case where we don’t perform EOL conversion for symbol link if symbol link is not enabled.
        • We need to introduce a new interface to override the TreeState::symlink_support field. We then have a design decision to make. I propose the 2 following approaches:
          • We pass that override information as a UserSettings, e.g. test.symlink-support-override, valid values are strings: “true”, “false”, or “auto”. If we have reserved config keys, we can use that key instead of test. The default is “auto”, which is the current behavior. “true” and “false” will override the TreeState::symlink_support value when creating. To pass that value to TreeState::empty, we store the test settings in Store, and add relevant interfaces to initialize and query the test settings in Store. We add a new pub(crate) get_test_settings(&’a self) -> &’a HashMap<String, String> method to Store to query the test settings, and a new pub(crate) new_with_test_settings(backend: Box<dyn Backend>, signer: Signer, test_settings: HashMap<String, String>) -> Arc<Self> method to Store. In lib/src/repo.rs, we introduce a new function that uses UserSettings::get_string with explicit keys(e.g., test.symlink-support-override) to create the test settings HashMap. And change both ReadonlyRepo::init and RepoLoader::init_from_file_system to use the new Store::new_with_test_settings function. When testing, we create a ConfigLayer with a toml string where the test.symlink-support-override setting set to the value we need, we add that layer to the base_user_config, and we use UserSettings::from_config to create the needed UserSettings that we use to create the Workspace. This solution is complicated, but doesn’t change any existing interfaces, and the public interfaces are mostly backward compatible.
          • Add a pub symlink_support_override: Option<bool> field to SnapshotOptions and CheckoutOptions, and pass that the override value through TreeState::update and another new field in FileSnapshotter. When passing the value, TreeState::symlink_support will be used if the override is None. This solution is simple, but it introduces a breaking change to the public API, and forces the users of SnapshotOptions and CheckoutOptions to fill a test field that they won’t ever care about.
        • In the test, we skip the test if the symbol link is not supported. Set the git config core.autocrlf to true. Override symbol link support to enabled. Create a file with a symbol link to the file. The name of the file will include LF without CR preceding. Create the snapshot. Checkout to an empty commit. Override symbol link support to disable. Set the git config core.autocrlf to true. Recreate the Store and Workspace to pick up the change. Checkout the commit with the symbol link. Read the content of the symbol link file, and verify that the content doesn’t change.
      • Test the case for merge conflicts.
        • Create 4 commits with the core.autocrlf git config set to false: A, the empty commit; B, and C, children of A, but add a new test-file with the content B\n, C\n respectively; D, child of B and C. A is created by create_tree(repo, &[]); B, and C are created by TestTreeBuilder::file and commit_with_tree; D is created by merge_commit_trees and commit_with_tree.
        • Set the core.autocrlf git config to true. Recreate Store and Workspace. Checkout D. Append E\r\n to test-file. Create a snapshot. Create a snapshot again to verify the working copy is clean. Create commit E based on the return value of the snapshot.
        • Set the core.autocrlf git config to false. Recreate Store and Workspace. Checkout the commit E. Verify that test-file doesn’t contain CRLF, ends with E\n, contains B\n, and contains C\n.
      • If we adopt the solution to add a target_eol: TargetEol field to SnapshotOptions and CheckoutOptions, and we construct this field in the jj-cli crate, like WorkspaceCommandHelper::base_ignores. We should add tests to the new cli/tests/test_eol.rs file to verify if it works correct for different git configs. The technique will be similar to TreeState tests, but will use run_jj_in and snapshot testing to verify if the correct EOL conversion is applied.
  • Performance consideration for binary file probing
    • Probing whether a file is binary will read the first N bytes from the file. Later check in or checkout may read the same N bytes again. One would suspect that we could have a performance optimization chance here by using BufReader like https://github.com/jj-vcs/jj/pull/6728#discussion_r2139307343. However, the reality isn’t that simple.
    • BufReader can discard the cache, if the file size is smaller than the number of bytes we probe. This Rust playground demonstrates this caveat. In fact, when the BufReader would discard the inner cache is implementation details that we can’t rely on. To guarantee that we don’t read twice, we should use a combination of Read::chain and VecDeque as https://github.com/jj-vcs/jj/commit/bfd71ae67ed8a8cb90d2b687c03b3764f47f087e.
    • Testing on the nix-pkgs repo doesn’t indicate any meaningful performance improvements. We should benchmark before introducing this complexity, so we should at least put the performance optimization to a different PR.

06393993 avatar Jun 15 '25 18:06 06393993

Thanks for working on this!

Here are comments on some parts of it. I haven't finished going through it yet.

  • Should jj support the core.autocrlf flavor of EOL conversion for git backend or should it be available as a jj config, and available for all backends?

Since we're implementing support for it only in the working copy, I think it should be independent of the backend.

  • If the core.autocrlf git config flavor support is not only for git backend,

    • Are we going to introduce a new jj setting to override the git config? If so, what are the key and valid values?

I suppose we should, but it wouldn't really be "override", because we probably wouldn't respect the value from the Git config at all. If we don't plan on adding support for CRLF conversion outside of the working copy (I think we shouldn't), then calling it working-copy.eol-conversion with values "none"/"input"/"input-output" or something like that sounds good to me (it might be good if the name indicated "convert to LF when shapshotting the working copy" and "convert to CRLF when updating the working copy" but that seems to get too long).

I think we should rename core.fsmonitor to working-copy.fsmonitor, btw.

* How should this support interact with other potential backends, especially the fig/mercurial backend that uses `.hgeol`?

The integrations at Google are actually independent of Fig/Mercurial. They will not support .hgeol (such files are not checked into the Google monorepo anyway).

The only other interaction with other backends I can think of is that we have a custom local-disk working copy implementation at Google, which is a thin wrapper around the default LocalWorkingCopy. So by adding support for EOL conversion upstream, that implementation will automatically get support for it. Maybe Windows developers using the Google monorepo will appreciate that.

  • Should the core.autocrlf config be part of the SnapshotOptions and the CheckoutOptions API or should be initialized locally in local_working_copy.rs?
    • If we are supporting the core.autocrlf flavor config for all backends, changing SnapshotOptions and CheckoutOptions seem to be the best choice, since we don’t have access to UserSettings in local_working_copy.rs.
    • SnapshotOptions and CheckoutOptions are public APIs, and adding fields to them may break downstream projects.

For things that are specific for a particular implementation of the working copy traits, I think we should not pollute SnapshotOptions and CheckoutOptions. It's probably sufficient to pass a &UserSettings into the WorkingCopyFactory functions for the EOL-conversion use case.

I think we should be able to remove the SnapshotOptions::fsmonitor_settings field too.

  • Handling binary files: we shouldn’t perform EOL conversion for binary files.

    • before we have .gitattributes

      • We probe the first N bytes(now 8KB hard coded in the source) from the content, and use the existing logic of gix_filter::eol::Stats::is_binary to decide whether the file is binary

Do you know if this is what Git does?

* To implement EOL conversion in `TreeState::write_file`, we need to read the contents from the `AsyncRead`, and conduct the EOL conversion based on the content. We have several options:
  
  * A separate type, `EolWriter`, that implements the `Write` treat that converts the EOL on the fly. When the caller calls `<EolWriter as std::io::Write>::write`, it converts the EOL of the passed in data before writing to the inner writer, which in our case, the `File` to write. We still use the existing `copy_async_to_sync` function. To report the correct `FileState::size`, we need another utility type to count the actual bytes written to the file instead of the bytes consumed. The numbers can be different due to EOL conversion. This design allows the least memory footprint, especially if we are going to support other features that change the bytes when snapshotting or updating, e.g., the `working-tree-encoding` git attributes. However, the benefit is questionable given that we are supposed to only convert EOL for text files, which should never be too large. Therefore, we won’t adopt it based on the complexity.

It sounds like it should be pretty simple to implement. Is it not?

  • disk -> store EOL conversion
    • A separate type is more complicated.

Same here: It sounds reasonably simple to me. Did you try it and it turned out more complicated than it seems?

martinvonz avatar Jun 16 '25 00:06 martinvonz

It sounds reasonably simple to me. Did you try it and it turned out more complicated than it seems?

Perhaps, it refers to my review comment on https://github.com/jj-vcs/jj/pull/6728/commits/3358779e242475e1ef3289f9cb7eeb98075b63cb#diff-d46f484c5051943f4e12a22cf14d8c16b58ace4c4c30b9657e0a657c3ea8931d

I haven't reviewed the implementation thoroughly, and I have no idea why we need this level of complexity.

yuja avatar Jun 16 '25 01:06 yuja

working-copy.eol-conversion with values "none"/"input"/"input-output"

Let's take those names.

I think we should rename core.fsmonitor to working-copy.fsmonitor, btw.

Will be in one of my future PRs then.

Maybe Windows developers using the Google monorepo will appreciate that.

I am the very Google Windows developer that can benefit from it, and our project currently is using .hgeol to tell which files are binary. We indeed doesn't check in the .hgeol file, but we create that file when we initialize the Google monorepo on our Windows machines. Therefore, I have this wrong assumption.

As long as we have a way to handle the EOL conversion for Google Windows developers, I am fine with the design decision. But of course, we need a way to specify the binary files with globs, which probably worth another settings and is more closely related to the text .gitattributes support, and we should discuss there.

It's probably sufficient to pass a &UserSettings into the WorkingCopyFactory functions for the EOL-conversion use case.

Currently, both functions of the WorkingCopyFactory trait doesn't accept UserSettings, and we have a design decision to make then. Basically 2 questions:

  • Do we want to expose the entire UserSettings to local_working_copy.rs or just the subset? If only the subset, do we want to expose those settings in a type loose way like HashMap<String, String> or a strict type WorkingCopySettings that lists all possible settings exhaustively?
  • How to pass the settings to WorkingCopyFactory? We can either add new parameters to the 2 functions of WorkingCopyFactory, or we can change let Store to carry that piece of information by changing how Store is initialized and adds an interface to Store to obtain the settings.

Do you know if this is what Git does?

Git has the text=auto .gitattributes: "Git decides by itself whether the file is text or binary". And since git has -text or binary in .gitattributes to opt-out the EOL conversion, so git doesn't need to guess whether a files is binary by default. This isn't an issue for git, but is an issue for jj until jj supports the text git attribute, which is another issue.

As of how git handles the text=auto case, git reads the entire file, and makes a statistics similar to gix_filter::eol::Stats if not identical. gitoxide is similar.

It sounds like it should be pretty simple to implement. Is it not? Did you try it and it turned out more complicated than it seems?

I tried. This is the on-the-fly EolReader, and this is the on-the-fly EolWriter(We don't need the LfWriter, so we can remove it). They are 80-120 SLOC. The implementation is certainly not extremely complicated, but is also not as intuitive as "read all and replace", which can be as simple as 6 lines. And there are many corner cases, so we have a large set of tests to maintain. The proposed line-by-line solution is inspired by https://github.com/jj-vcs/jj/pull/6728#discussion_r2137612795, which may be 10-20 SLOC to implement, but I haven't tried yet. I don't have preferences, especially I don't know whether a stream like interface is preferred.

git and gitoxide read the content to the end into a buffer, and perform EOL conversion, encoding/decoding(the working-tree-encoding git attribute), ident, and filters(clean/smudge) on the buffer. We can implement it differently in a stream-like way if we think that could matter.

06393993 avatar Jun 16 '25 01:06 06393993

I see, the read and write implementations get a bit more complex than I had thought (in particular the case where a read of CRLF gets split). Doing the eager reads seems like a good start.

martinvonz avatar Jun 16 '25 01:06 martinvonz

  • Do we want to expose the entire UserSettings to local_working_copy.rs or just the subset?

Yes, this is what I meant.

  • We can either add new parameters to the 2 functions of WorkingCopyFactory

Yes, I think that's best.

As of how git handles the text=auto case, git reads the entire file, and makes a statistics similar to gix_filter::eol::Stats if not identical. gitoxide is similar.

Thanks for checking. Yuya's suggestion of checking only the first 8kB or so sounds better to me then.

martinvonz avatar Jun 16 '25 02:06 martinvonz

Currently, both functions of the WorkingCopyFactory trait doesn't accept UserSettings, and we have a design decision to make then. Basically 2 questions:

  • Do we want to expose the entire UserSettings to local_working_copy.rs or just the subset? If only the subset, do we want to expose those settings in a type loose way like HashMap<String, String> or a strict type WorkingCopySettings that lists all possible settings exhaustively?

I would just pass &UserSettings to LocalWorkingCopy::init() and ::load(). A subset wouldn't be that useful because we can't validate the contents by upper layer anyway. FWIW, it might make sense to translate UserSettings into TreeStateSettings at LocalWorkingCopy::init()/load() because TreeState is used for different purpose by external merge tools.

  • How to pass the settings to WorkingCopyFactory? We can either add new parameters to the 2 functions of WorkingCopyFactory, or we can change let Store to carry that piece of information by changing how Store is initialized and adds an interface to Store to obtain the settings.

As function arguments, I think. Store is an object which the working copy uses, not an object who teaches working copy how it should behave.

BTW, if we add support for .gitattributes, we'll have to pass the root attributes differently. I don't think we can completely get rid of eol thingy from SnapshotOptions.

yuja avatar Jun 16 '25 02:06 yuja