FR: git core.autocrlf config support
Let's use this issue to discuss the design decisions on git core.autocrlf support and how that support can integrate with EOL conversion in other backends, especially the fig/mercurial backend, instead of discussing design decision in an ad-hoc way through the review comments on https://github.com/jj-vcs/jj/pull/6728.
- Should jj support the
core.autocrlfflavor of EOL conversion for git backend or should it be available as a jj config, and available for all backends?- If the
core.autocrlfflavor support is only for git backend, are we in favor of introducing an abstraction layer like theTargetEolStrategytrait in https://github.com/jj-vcs/jj/commit/e2293f6f994e29735a0b0f8df597c0bb00f0a5bb to hide the git related code as implementation details. If not, what's the way to hide the implementation details to avoid#[cfg(feature = "git")]from scattering everywhere? - If the
core.autocrlfgit config flavor support is not only for git backend,- Are we going to introduce a new jj setting to override the git config? If so, what are the key and valid values?
- How should this support interact with other potential backends, especially the fig/mercurial backend that uses
.hgeol?
- If the
- Should the
core.autocrlfconfig be part of theSnapshotOptionsand theCheckoutOptionsAPI or should be initialized locally inlocal_working_copy.rs?- If we are supporting the
core.autocrlfflavor config for all backends, changingSnapshotOptionsandCheckoutOptionsseem to be the best choice, since we don’t have access toUserSettingsinlocal_working_copy.rs. -
SnapshotOptionsandCheckoutOptionsare public APIs, and adding fields to them may break downstream projects.
- If we are supporting the
- Handling binary files: we shouldn’t perform EOL conversion for binary files.
- before we have
.gitattributes- We probe the first N bytes(now 8KB hard coded in the source) from the content, and use the existing logic of
gix_filter::eol::Stats::is_binaryto decide whether the file is binary - For performance, we only probe the file if the configuration requires EOL conversion, i.e.,
core.autocrlf = truewhen snapshotting and updating orcore.autocrlf=inputwhen snapshotting. We don’t probe the file otherwise, i.e.,core.autocrlf=falsewhen snapshotting and updating orcore.autocrlf=inputwhen updating.
- We probe the first N bytes(now 8KB hard coded in the source) from the content, and use the existing logic of
- after we have
.gitattributes, we follow thetextattribute.- If the
textattribute is unset, we won’t perform the EOL conversion regardless of thecore.autocrlfconfig. - If the
textattribute is set, when snapshotting, we always convert the EOL to LF, when updating we perform the EOL conversion according to theeolattribute or thecore.eolgit config. - If the
textattribute is set to a string value“auto”, we perform the binary file probing described previously in the “before we have.gitattributes” section. - If the
textattribute is unspecified, we perform the EOL conversion based on thecore.autocrlfgit config without probing whether the file is binary, which will change the behavior described in the “before we have.gitattributes” section.
- If the
- before we have
- Implement the EOL conversion in
local_working_copy.rs- store -> disk EOL conversion
- We only need LF to CRLF conversion, even if we take
.gitattributesinto consideration, the eol=lf attribute just means “uses the same line endings in the working directory as in the index when the file is checked out” in the gitattributes documentation. - In
TreeState::update, where we handle theMaterializedTreeValue::Filecase, we query the decision on EOL conversion and pass that toTreeState::write_fileas a new parameter. - In the
MaterializedTreeValue::Symlinkcase inTreeState::update, we passTargetEol::PassThroughtoTreeState::write_fileif the symbol link is not supported. This is the precise reason why we don’t move the logic to query the EOL conversion decision inTreeState::write_file, because in this case, we need to write without EOL conversion regardless of any configurations. - To implement EOL conversion in
TreeState::write_file, we need to read the contents from theAsyncRead, and conduct the EOL conversion based on the content. We have several options:- A separate type,
EolWriter, that implements theWritetreat that converts the EOL on the fly. When the caller calls<EolWriter as std::io::Write>::write, it converts the EOL of the passed in data before writing to the inner writer, which in our case, theFileto write. We still use the existingcopy_async_to_syncfunction. To report the correctFileState::size, we need another utility type to count the actual bytes written to the file instead of the bytes consumed. The numbers can be different due to EOL conversion. This design allows the least memory footprint, especially if we are going to support other features that change the bytes when snapshotting or updating, e.g., theworking-tree-encodinggit attributes. However, the benefit is questionable given that we are supposed to only convert EOL for text files, which should never be too large. Therefore, we won’t adopt it based on the complexity. - Read all the contents asynchronous to a
Vec<u8>. Use a function to convert the EOL for aVec<u8>. Write the converted bytes back to the file system. With this implementation, we will changeTreeState::write_fileto an async function and get rid ofcopy_async_to_sync. Note that all the caller ofTreeState::write_fileis inTreeState::updatewhich is already an async function. To report the correctFileState::size, we use the number of bytes after the EOL conversion is applied. We tend to adopt this design unless we don’t like changingTreeState::write_fileto async. - Similar to the above, but use
copy_async_to_syncto read the content of the file to aVec<u8>synchronously. This allows us to keepTreeState::write_fileaway from async fn. The current implementation ofcopy_async_to_syncallocates another internal 16KB buffer. This implementation has the largest memory footprint. We tend not to adopt this design unless we want to keepasyncaway fromTreeState::write_file.
- A separate type,
- Should the EOL conversion function handle the logic for no-EOL conversion, and read from the store? Moving both logic into the EOL conversion function allows us to reduce the complexity of the
TreeState::write_file, and unit test more paths.TreeState::write_fileis almost not unit testable: one needs to set upStoreto createTreeState. Testing a function whose parameters areimpl AsyncRead,TargetEolis simpler.
- We only need LF to CRLF conversion, even if we take
- disk -> store EOL conversion
- We only need CRLF to LF conversion. No git configuration needs a LF to CRLF conversion when checking in code.
- In
FileSnapshotter::write_path_to_store, where we handle the merge conflicts, in addition to just reading the content of the file from the file system, we need to apply line conversion according to the configuration. - In
FileSnapshotter::write_file_to_store, where we handle files that don’t have merge conflicts, we should pass astd::io::Readthat reads the file content with EOL converted according to the configuration. - Should we implement another type that implements
std::io::Readwith EOL conversion from aFile, or should we create a function that reads the content as a whole and do the EOL conversion in place?- A separate type is more complicated.
- A separate type can implement the read with less memory footprint, especially if we are going to support other features that change the bytes when snapshotting or updating, e.g., the
working-tree-encodinggit attributes. However we are assumed to only read text files now, so the memory saved is questionable. - A separate type can delay the read to happen when the read is necessary. This is especially true of
FileSnapshotter::write_file_to_store, which passes anAsyncReadtoBackend::write_file, and may be able to yield the thread when waiting for IO. This benefit is also questionable, because currently we are usingBlockingAsyncReader, which blocks the current thread when waiting for IO. - Therefore, we decide to go with a function that reads all the content and does the EOL conversion.
- Should the EOL conversion function handle the logic for no-EOL conversion, and read from the files?
- https://github.com/jj-vcs/jj/pull/6728#discussion_r2139307343 suggests that this function should handle the file read, and should only handle the case where an EOL conversion is needed.
- Letting the function handle the case where no EOL conversion is needed can save an if at the caller site,
FileSnapshotter::write_file_to_storeorFileSnapshotter::write_path_to_store, which are complicated and difficult to unit test. In addition, it allows us to save 1 if, because we need to repeat the “check whether we need EOL conversion” logic in 2 places.
- How to pass the EOL config to the function that does the EOL conversion?
- According to different design decision to the “Should the
core.autocrlfconfig be part of theSnapshotOptionsand theCheckoutOptionsAPI or should be initialized locally inlocal_working_copy.rs” question, thecore.autocrlfconfig can be read at different places:- right at where the EOL conversion happens(e.g.
FileSnapshotter::write_path_to_store); - when
TreeStateis initialized(i.e.,TreeState::empty); - or when
SnapshotOptionsor theCheckoutOptionstype is constructed(e.g., inWorkspaceCommandHelper).
- right at where the EOL conversion happens(e.g.
- Right at where the EOL conversion happens(e.g.
FileSnapshotter::write_path_to_store) should not be used, because this triggers a git config read on every file to snapshot or update, and according to my performance test, can make snapshotting 30% slower. See details at https://github.com/jj-vcs/jj/pull/6728/commits/cf28eb289f51c9fa047082b1edfaeec511eed729. - If the git config is read when
TreeStateis initialized(i.e.,TreeState::empty), we can put the type that carries the git config information as a field ofTreeState, andFileSnapshotter. A new argument will be added toTreeState::write_file, because when we use this function to write a symbol link in update and symbol link is not supported, we need to ignore the EOL conversion and just write as is. - If the git config is read when
SnapshotOptionsorCheckoutOptionsis constructed, we can pass the git config information through the function parameters, and as a field ofFileSnapshotter. We will add a new parameter toTreeState::update,TreeState::write_file. - For the latter 2 designs, the git config read happens at a different place where the EOL conversion happens, so we need a data type that passes the config to functions that perform the EOL conversion, and we have different designs:
- directly pass the value of the
core.autocrlfconfig around; - pass a concrete type similar to
GitIgnoreFilesthat provides functions to get the git config; - or pass a trait object similar to the
matcherparameter inTreeState::update.
- directly pass the value of the
- Directly pass the
core.autocrlfconfig value around. This design is suggested by https://github.com/jj-vcs/jj/pull/6728#discussion_r2137552726. This design should be the simplest design, however it could bring duplicate code. The logic to probe whether a file is binary may be repeated at various call sites. The future complicated.gitattributeslogic that decides whether a file needs an EOL conversion may be also duplicated. It can be argued that we can wrap those logic in a function. But in that case, we lose the simplicity of the solution by introducing another layer of abstraction. Moreover, if we decide to only support thecore.autocrlfgit config for the git backend, this design leaks the implementation details of the git backend, resulting in a bad encapsulation. If we are going to support.hgeolfrom fig/mercurial, we will end up with another type and parameters that carries the.hgeolconfiguration. - Pass a concrete type similar to
GitIgnoreFilesthat provides functions to query the final decision of EOL conversion for a file. This design allows us to put common logic for binary probing and future.gitattributesinteraction into methods. However, if we decide to only support the git autocrlf config on git backend only, the implementation details may be leaked, and can also cause#[cfg(feature = “git”)]to scatter everywhere. If we adopt this path, the type name will beTargetEolStrategy. We need 2 different functions to decide the target EOL:fn get_target_eol_from_disk(&self, file_path: &Path) -> TargetEolwhich probes the file based on the file on the disk, andfn get_target_eol_from_store(&self, repo_path: &RepoPath, file_id: &FileId) -> TargetEolwhich probes the file based on the file read fromStore. To integrate with the future.gitattributesfeature, we either add aGitAttributesFileparameter to those functions or we pass thisTargetEolStrategyto an implicit backend function to decide the final EOL. The exact design decision is based on our answer to the “do we want to support.gitattributesfor all backends” question. - Pass a trait object. This design has the highest complexity. It requires us to introduce a trait and a type that implements the trait. To be dyn-compatible, this trait can’t use async fn, and can only use fn that returns a boxed future. This can happen if we want to hide the binary file probe logic in the interfaces: to read a file from the store, we can only use the async
Store::read_filefunction. However, it allows us to hide the implementation details. If we only want to support the git backend for the gitcore.autocrlfconfig, and want to provide a completely different implementation for.hgeol, this way provides the best abstraction. It can also minimize the use of#[cfg(feature = “git”)], because we only need that when implementing the trait. We don’t need that when we are calling methods from the trait. Similar to the concrete type design,fn get_target_eol_from_disk(&self, file_path: &Path) -> TargetEolandfn get_target_eol_from_store(&self, repo_path: &RepoPath, file_id: &FileId) -> TargetEolwill be the required methods. To integrate with the future.gitattributesfeature, we either add aGitAttributesFileparameter to those functions or we let the implementation keep an access to the underlying git.gitattributesimplementation and query on demand. The exact design decision is based on our answer to the “do we want to support.gitattributesfor all backends” question.
- According to different design decision to the “Should the
- store -> disk EOL conversion
- Implement the EOL conversion function
- There will be 2 versions of the function: the async version,
async fn convert_eol_async(content: impl AsyncRead, target_eol: TargetEol) ->Box<dyn AsyncRead>, and the non-async versionfn convert_eol(content: impl Read, target_eol: TargetEol) -> Box<dyn Read>. The return value is the bytes with EOL converted as expected. - If
target_eolindicates that no EOL conversion is needed, the passed in value will be wrapped inBoxand returned. - If
target_eolindicates that EOL conversion is needed, we callread_to_endto read all the bytes from thecontentparameter into abufof theVec<u8>type. If we encounter an error, return aRead/AsyncReadtrait object that returns the error on the first call to read. This error handling design makes it easier for the caller to handle the error. - Create
reswhich is anotherVec<u8>with capacity set to the length ofbuf. We will fillreswith EOL converted data later. - Iterating lines from
bufthroughByteSlice::linesseems ideal, but the last line becomes tricky: just from the iterator we can’t tell if the last line ends without LF, with CRLF, or with just LF. Therefore we have a design decision to make on whether to useByteSlice::lines.- if we are using
ByteSlice::lines, we usepeekableto check if the current line is the last line. If it’s not the last line, we append the current line toresand append a target EOL toresbyVec::extend_from_slice. If it’s the last line, we append the current line tores, and check if the last byte ofbufis\nor not. If so, we append the target EOL. If not, we do nothing. - if we are not using
ByteSlice::lines, we usesplitwith|c| c == b‘\n’as the predicate to iterate over lines. We also usepeekableto check if the current line is the last line. If it’s not the last line, we strip the last possible\rbefore we push the line tores, and we append the expected EOL. If it’s the last line, we push the line as is tores.
- if we are using
- Wrap
resin aBox<Cursor<Vec<u8>>, and coerce it toBox<dyn AsyncRead>orBox<dyn Read>as needed before we return.
- There will be 2 versions of the function: the async version,
- File structure. We will add a
lib/src/eol.rsfile to store all the new types and functions added. I don’t want to call iteol_utilsuggested in https://github.com/jj-vcs/jj/pull/6728#discussion_r2137566057, because the_utilsuffix seems meaningless unless we have another naming convention we need to follow. A newlib/tests/test_eol.rsfile will be added for integration tests. And if we introduce code injj-cli(e.g., we constructSnapshotOptionsorCheckoutOptions), a newcli/tests/test_eol.rsfile will be added. - Tests: the goal is to have 100% line coverage for the newly added code, and we use unit tests as much as possible for high coverage.
- To unit test
convert_eolandconvert_eol_async, we can implementReadorAsyncReadfor a struct that returns error or simply use&[u8]if we needReadto return contents we need. We should at least cover the following cases:- convert the EOL to LF, texts have CRLF line ending, verify that CR is correctly removed.
- convert the EOL to CRLF, texts have LF line ending, verify that CR is correctly added.
- convert the EOL to LF or CRLF, the text doesn’t have any EOL, verify that the text is not changed
- convert the EOL to LF or CRLF, the text has only one line which ends with some EOL, verify that the returned bytes also have only one line, and the EOL is correctly converted
- if the input text contains empty lines(i.e., continuous LF or CRLF), those empty lines should be preserved with EOL correctly converted
- if the
Readparameter or theAsyncReadparameter returns an error when read is called, the returned value function should return the same error on read - if the
target_eolparameter suggests that no EOL conversion should happen, the output bytes should be the same as the input bytes - if the
target_eolparameter suggests that no EOL conversion should happen, and theReadparameter or theAsyncReadparameter returns an error when read is called. The returned value should result in an error.
- Testing the implementation of
TargetEolStrategyrequires creatingStore, and so will be tested via integration tests inlib/tests/test_eol.rs.- To set up the test, we need to create a git repo either through
TestRepo::init_with_backend. - We modify the repo git
core.autocrlfconfig tofalse, so that we can check in the file to test as is. - We then need to recreate the
Storeto pick up the gitcore.autocrlfconfig by implementing a new function. The implementation is similar toTestRepo::init_with_backend, but without creating a newTestEnvironment, and usingGitBackend::init_externalto replace theGitBackend::init_internal. - We use
Store::write_fileto create the file to testget_target_eol_from_store, one binary file(whose content should be the same asdocs/images/favicon-96x96.png), one text file. - We modify the repo git
core.autocrlfconfig totrue,input, andfalsefor different test cases and write back to the filesystem - We recreate the
Storeagain. - We write a text file or a binary file to the filesystem to test
get_target_eol_from_disk. - We use the
Storeto initialize theTargetEolStrategy, pass in the file to test, and verify the return value ofget_target_eol_from_storeandget_target_eol_from_disk. - In the future, with the
.gitattributesfeature’s implementation, this test will change accordingly.
- To set up the test, we need to create a git repo either through
-
TreeStateis too complicated for unit tests. Modification toTreeStatewill be tested through integration tests. The test is set up similar to theTargetEolStrategytest.- The happy path test for snapshot: similar to
TargetEolStrategytests, but with actually snapshotting:- Initialize a test repo, set the
autocrlf=truegit config, recreateStoreand create aWorkspace. We can’t use the existingTestWorkspace::init_backendbecause currently it doesn’t support creatingStoreto pick up git config changes. - Create files with CRLF EOL, LF EOL, mixed EOL and a binary file for different test cases.
- Create a snapshot. And create a snapshot again to verify that the working copy is clean.
- Set the
core.autocrlf=falsegit config, and recreateStoreandWorkspaceto pick up the git config change, so that the later checkout won’t perform any EOL conversion. - Checkout an empty change and checkout the snapshot again to recreate the files.
- Verify the file contents are expected: for text files, EOL should be converted to LF, for the binary file, no change should happen.
- Initialize a test repo, set the
- The happy path test for update: similar to snapshot, but we change the
core.autocrlfgit config to the value under test when check in, and change it tocore.autocrlf=falsewhen check out. And we don’t verify if the working copy is clean. - Test the case where we don’t perform EOL conversion for symbol link if symbol link is not enabled.
- We need to introduce a new interface to override the
TreeState::symlink_supportfield. We then have a design decision to make. I propose the 2 following approaches:- We pass that override information as a
UserSettings, e.g.test.symlink-support-override, valid values are strings: “true”, “false”, or “auto”. If we have reserved config keys, we can use that key instead oftest. The default is “auto”, which is the current behavior. “true” and “false” will override theTreeState::symlink_supportvalue when creating. To pass that value toTreeState::empty, we store the test settings inStore, and add relevant interfaces to initialize and query the test settings inStore. We add a newpub(crate) get_test_settings(&’a self) -> &’a HashMap<String, String>method toStoreto query the test settings, and a newpub(crate) new_with_test_settings(backend: Box<dyn Backend>, signer: Signer, test_settings: HashMap<String, String>) -> Arc<Self>method toStore. Inlib/src/repo.rs, we introduce a new function that usesUserSettings::get_stringwith explicit keys(e.g.,test.symlink-support-override) to create the test settingsHashMap. And change bothReadonlyRepo::initandRepoLoader::init_from_file_systemto use the newStore::new_with_test_settingsfunction. When testing, we create aConfigLayerwith a toml string where thetest.symlink-support-overridesetting set to the value we need, we add that layer to thebase_user_config, and we useUserSettings::from_configto create the neededUserSettingsthat we use to create theWorkspace. This solution is complicated, but doesn’t change any existing interfaces, and the public interfaces are mostly backward compatible. - Add a
pub symlink_support_override: Option<bool>field toSnapshotOptionsandCheckoutOptions, and pass that the override value throughTreeState::updateand another new field inFileSnapshotter. When passing the value,TreeState::symlink_supportwill be used if the override isNone. This solution is simple, but it introduces a breaking change to the public API, and forces the users ofSnapshotOptionsandCheckoutOptionsto fill a test field that they won’t ever care about.
- We pass that override information as a
- In the test, we skip the test if the symbol link is not supported. Set the git config
core.autocrlftotrue. Override symbol link support to enabled. Create a file with a symbol link to the file. The name of the file will include LF without CR preceding. Create the snapshot. Checkout to an empty commit. Override symbol link support to disable. Set the git configcore.autocrlftotrue. Recreate theStoreandWorkspaceto pick up the change. Checkout the commit with the symbol link. Read the content of the symbol link file, and verify that the content doesn’t change.
- We need to introduce a new interface to override the
- Test the case for merge conflicts.
- Create 4 commits with the
core.autocrlfgit config set tofalse: A, the empty commit; B, and C, children of A, but add a new test-file with the contentB\n,C\nrespectively; D, child of B and C. A is created bycreate_tree(repo, &[]); B, and C are created byTestTreeBuilder::fileandcommit_with_tree; D is created bymerge_commit_treesandcommit_with_tree. - Set the
core.autocrlfgit config totrue. RecreateStoreandWorkspace. Checkout D. AppendE\r\ntotest-file. Create a snapshot. Create a snapshot again to verify the working copy is clean. Create commit E based on the return value of the snapshot. - Set the
core.autocrlfgit config tofalse. RecreateStoreandWorkspace. Checkout the commit E. Verify thattest-filedoesn’t contain CRLF, ends withE\n, containsB\n, and containsC\n.
- Create 4 commits with the
- If we adopt the solution to add a
target_eol: TargetEolfield toSnapshotOptionsandCheckoutOptions, and we construct this field in thejj-clicrate, likeWorkspaceCommandHelper::base_ignores. We should add tests to the newcli/tests/test_eol.rsfile to verify if it works correct for different git configs. The technique will be similar toTreeStatetests, but will userun_jj_inand snapshot testing to verify if the correct EOL conversion is applied.
- The happy path test for snapshot: similar to
- To unit test
- Performance consideration for binary file probing
- Probing whether a file is binary will read the first N bytes from the file. Later check in or checkout may read the same N bytes again. One would suspect that we could have a performance optimization chance here by using
BufReaderlike https://github.com/jj-vcs/jj/pull/6728#discussion_r2139307343. However, the reality isn’t that simple. -
BufReadercan discard the cache, if the file size is smaller than the number of bytes we probe. This Rust playground demonstrates this caveat. In fact, when theBufReaderwould discard the inner cache is implementation details that we can’t rely on. To guarantee that we don’t read twice, we should use a combination ofRead::chainandVecDequeas https://github.com/jj-vcs/jj/commit/bfd71ae67ed8a8cb90d2b687c03b3764f47f087e. - Testing on the nix-pkgs repo doesn’t indicate any meaningful performance improvements. We should benchmark before introducing this complexity, so we should at least put the performance optimization to a different PR.
- Probing whether a file is binary will read the first N bytes from the file. Later check in or checkout may read the same N bytes again. One would suspect that we could have a performance optimization chance here by using
Thanks for working on this!
Here are comments on some parts of it. I haven't finished going through it yet.
- Should jj support the
core.autocrlfflavor of EOL conversion for git backend or should it be available as a jj config, and available for all backends?
Since we're implementing support for it only in the working copy, I think it should be independent of the backend.
If the
core.autocrlfgit config flavor support is not only for git backend,
- Are we going to introduce a new jj setting to override the git config? If so, what are the key and valid values?
I suppose we should, but it wouldn't really be "override", because we probably wouldn't respect the value from the Git config at all. If we don't plan on adding support for CRLF conversion outside of the working copy (I think we shouldn't), then calling it working-copy.eol-conversion with values "none"/"input"/"input-output" or something like that sounds good to me (it might be good if the name indicated "convert to LF when shapshotting the working copy" and "convert to CRLF when updating the working copy" but that seems to get too long).
I think we should rename core.fsmonitor to working-copy.fsmonitor, btw.
* How should this support interact with other potential backends, especially the fig/mercurial backend that uses `.hgeol`?
The integrations at Google are actually independent of Fig/Mercurial. They will not support .hgeol (such files are not checked into the Google monorepo anyway).
The only other interaction with other backends I can think of is that we have a custom local-disk working copy implementation at Google, which is a thin wrapper around the default LocalWorkingCopy. So by adding support for EOL conversion upstream, that implementation will automatically get support for it. Maybe Windows developers using the Google monorepo will appreciate that.
- Should the
core.autocrlfconfig be part of theSnapshotOptionsand theCheckoutOptionsAPI or should be initialized locally inlocal_working_copy.rs?
- If we are supporting the
core.autocrlfflavor config for all backends, changingSnapshotOptionsandCheckoutOptionsseem to be the best choice, since we don’t have access toUserSettingsinlocal_working_copy.rs.SnapshotOptionsandCheckoutOptionsare public APIs, and adding fields to them may break downstream projects.
For things that are specific for a particular implementation of the working copy traits, I think we should not pollute SnapshotOptions and CheckoutOptions. It's probably sufficient to pass a &UserSettings into the WorkingCopyFactory functions for the EOL-conversion use case.
I think we should be able to remove the SnapshotOptions::fsmonitor_settings field too.
Handling binary files: we shouldn’t perform EOL conversion for binary files.
before we have
.gitattributes
- We probe the first N bytes(now 8KB hard coded in the source) from the content, and use the existing logic of
gix_filter::eol::Stats::is_binaryto decide whether the file is binary
Do you know if this is what Git does?
* To implement EOL conversion in `TreeState::write_file`, we need to read the contents from the `AsyncRead`, and conduct the EOL conversion based on the content. We have several options: * A separate type, `EolWriter`, that implements the `Write` treat that converts the EOL on the fly. When the caller calls `<EolWriter as std::io::Write>::write`, it converts the EOL of the passed in data before writing to the inner writer, which in our case, the `File` to write. We still use the existing `copy_async_to_sync` function. To report the correct `FileState::size`, we need another utility type to count the actual bytes written to the file instead of the bytes consumed. The numbers can be different due to EOL conversion. This design allows the least memory footprint, especially if we are going to support other features that change the bytes when snapshotting or updating, e.g., the `working-tree-encoding` git attributes. However, the benefit is questionable given that we are supposed to only convert EOL for text files, which should never be too large. Therefore, we won’t adopt it based on the complexity.
It sounds like it should be pretty simple to implement. Is it not?
- disk -> store EOL conversion
- A separate type is more complicated.
Same here: It sounds reasonably simple to me. Did you try it and it turned out more complicated than it seems?
It sounds reasonably simple to me. Did you try it and it turned out more complicated than it seems?
Perhaps, it refers to my review comment on https://github.com/jj-vcs/jj/pull/6728/commits/3358779e242475e1ef3289f9cb7eeb98075b63cb#diff-d46f484c5051943f4e12a22cf14d8c16b58ace4c4c30b9657e0a657c3ea8931d
I haven't reviewed the implementation thoroughly, and I have no idea why we need this level of complexity.
working-copy.eol-conversion with values "none"/"input"/"input-output"
Let's take those names.
I think we should rename core.fsmonitor to working-copy.fsmonitor, btw.
Will be in one of my future PRs then.
Maybe Windows developers using the Google monorepo will appreciate that.
I am the very Google Windows developer that can benefit from it, and our project currently is using .hgeol to tell which files are binary. We indeed doesn't check in the .hgeol file, but we create that file when we initialize the Google monorepo on our Windows machines. Therefore, I have this wrong assumption.
As long as we have a way to handle the EOL conversion for Google Windows developers, I am fine with the design decision. But of course, we need a way to specify the binary files with globs, which probably worth another settings and is more closely related to the text .gitattributes support, and we should discuss there.
It's probably sufficient to pass a &UserSettings into the WorkingCopyFactory functions for the EOL-conversion use case.
Currently, both functions of the WorkingCopyFactory trait doesn't accept UserSettings, and we have a design decision to make then. Basically 2 questions:
- Do we want to expose the entire
UserSettingstolocal_working_copy.rsor just the subset? If only the subset, do we want to expose those settings in a type loose way likeHashMap<String, String>or a strict typeWorkingCopySettingsthat lists all possible settings exhaustively? - How to pass the settings to
WorkingCopyFactory? We can either add new parameters to the 2 functions ofWorkingCopyFactory, or we can change letStoreto carry that piece of information by changing howStoreis initialized and adds an interface toStoreto obtain the settings.
Do you know if this is what Git does?
Git has the text=auto .gitattributes: "Git decides by itself whether the file is text or binary". And since git has -text or binary in .gitattributes to opt-out the EOL conversion, so git doesn't need to guess whether a files is binary by default. This isn't an issue for git, but is an issue for jj until jj supports the text git attribute, which is another issue.
As of how git handles the text=auto case, git reads the entire file, and makes a statistics similar to gix_filter::eol::Stats if not identical. gitoxide is similar.
It sounds like it should be pretty simple to implement. Is it not? Did you try it and it turned out more complicated than it seems?
I tried. This is the on-the-fly EolReader, and this is the on-the-fly EolWriter(We don't need the LfWriter, so we can remove it). They are 80-120 SLOC. The implementation is certainly not extremely complicated, but is also not as intuitive as "read all and replace", which can be as simple as 6 lines. And there are many corner cases, so we have a large set of tests to maintain. The proposed line-by-line solution is inspired by https://github.com/jj-vcs/jj/pull/6728#discussion_r2137612795, which may be 10-20 SLOC to implement, but I haven't tried yet. I don't have preferences, especially I don't know whether a stream like interface is preferred.
git and gitoxide read the content to the end into a buffer, and perform EOL conversion, encoding/decoding(the working-tree-encoding git attribute), ident, and filters(clean/smudge) on the buffer. We can implement it differently in a stream-like way if we think that could matter.
I see, the read and write implementations get a bit more complex than I had thought (in particular the case where a read of CRLF gets split). Doing the eager reads seems like a good start.
- Do we want to expose the entire
UserSettingstolocal_working_copy.rsor just the subset?
Yes, this is what I meant.
- We can either add new parameters to the 2 functions of
WorkingCopyFactory
Yes, I think that's best.
As of how git handles the
text=autocase, git reads the entire file, and makes a statistics similar togix_filter::eol::Statsif not identical.gitoxideis similar.
Thanks for checking. Yuya's suggestion of checking only the first 8kB or so sounds better to me then.
Currently, both functions of the
WorkingCopyFactorytrait doesn't acceptUserSettings, and we have a design decision to make then. Basically 2 questions:
- Do we want to expose the entire
UserSettingstolocal_working_copy.rsor just the subset? If only the subset, do we want to expose those settings in a type loose way likeHashMap<String, String>or a strict typeWorkingCopySettingsthat lists all possible settings exhaustively?
I would just pass &UserSettings to LocalWorkingCopy::init() and ::load(). A subset wouldn't be that useful because we can't validate the contents by upper layer anyway. FWIW, it might make sense to translate UserSettings into TreeStateSettings at LocalWorkingCopy::init()/load() because TreeState is used for different purpose by external merge tools.
- How to pass the settings to
WorkingCopyFactory? We can either add new parameters to the 2 functions ofWorkingCopyFactory, or we can change letStoreto carry that piece of information by changing howStoreis initialized and adds an interface toStoreto obtain the settings.
As function arguments, I think. Store is an object which the working copy uses, not an object who teaches working copy how it should behave.
BTW, if we add support for .gitattributes, we'll have to pass the root attributes differently. I don't think we can completely get rid of eol thingy from SnapshotOptions.