chatgpt-source-watch
Explore AST based diff tools, diff minimisation, etc
There can be a lot of 'noise' when diffing minimised, bundled code, as the bundler will often change the minified variable names it uses between builds (even if the rest of the code hasn't changed).
We can attempt to reduce this by using non-default git diff modes such as patience / histogram / minimal:
- https://git-scm.com/docs/diff-options/2.6.7#Documentation/diff-options.txt---patience
- https://git-scm.com/docs/diff-options/2.6.7#Documentation/diff-options.txt---diff-algorithmpatienceminimalhistogrammyers
- https://stackoverflow.com/questions/4045017/what-is-git-diff-patience-for
- https://web.archive.org/web/20200128181055/http://git.661346.n2.nabble.com/Bram-Cohen-speaks-up-about-patience-diff-td2277041.html
- https://bryanpendleton.blogspot.com/2010/05/patience-diff.html
- https://alfedenzo.livejournal.com/170301.html
-
Patience Diff, a brief summary
-
Patience Diff also relies on the longest common subsequence problem, but takes a different approach. First, it only considers lines that are (a) common to both files, and (b) appear only once in each file. This means that most lines containing a single brace or a new line are ignored, but distinctive lines like a function declaration are retained. Computing the longest common subsequence of the unique elements of both documents leads to a skeleton of common points that almost definitely correspond to each other. The algorithm then sweeps up all contiguous blocks of common lines found in this way, and recurses on those parts that were left out, in the hopes that in this smaller context, some of the lines that were ignored earlier for being non-unique are found to be unique. Once this process is finished, we are left with a common subsequence that more closely corresponds to what humans would identify.
-
- https://fabiensanglard.net/git_code_review/diff.php
-
Git Source Code Review: Diff Algorithms
-
-
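To make the quoted summary of Patience Diff more concrete, here is a minimal Python sketch of the idea: find lines that are unique and common to both files, keep the longest run of them that is in order on both sides, then recurse into the gaps. The function names are hypothetical, and where a real implementation would fall back to another diff algorithm this sketch simply stops matching, so treat it as an illustration rather than how git implements it.

```python
from bisect import bisect_left
from collections import Counter


def unique_common_anchors(a, b):
    """Pairs (i, j) with a[i] == b[j], where the line occurs exactly once
    in each input, restricted to the longest run increasing in both i and j."""
    count_a, count_b = Counter(a), Counter(b)
    pos_b = {line: j for j, line in enumerate(b) if count_b[line] == 1}
    pairs = [(i, pos_b[line]) for i, line in enumerate(a)
             if count_a[line] == 1 and line in pos_b]

    # Longest increasing subsequence over j (pairs are already sorted by i).
    tails, tails_j = [], []          # pile tops: index into pairs, and its j
    prev = [None] * len(pairs)       # back-pointers to rebuild the sequence
    for idx, (_, j) in enumerate(pairs):
        k = bisect_left(tails_j, j)
        prev[idx] = tails[k - 1] if k > 0 else None
        if k == len(tails):
            tails.append(idx)
            tails_j.append(j)
        else:
            tails[k] = idx
            tails_j[k] = j

    result = []
    idx = tails[-1] if tails else None
    while idx is not None:
        result.append(pairs[idx])
        idx = prev[idx]
    return result[::-1]


def patience_matches(a, b, a_off=0, b_off=0):
    """Matched (index_in_a, index_in_b) pairs for two lists of lines."""
    matches = []
    # Sweep up the contiguous common prefix first.
    start = 0
    while start < len(a) and start < len(b) and a[start] == b[start]:
        matches.append((a_off + start, b_off + start))
        start += 1
    a, b = a[start:], b[start:]
    a_off, b_off = a_off + start, b_off + start

    anchors = unique_common_anchors(a, b)
    if not anchors:
        return matches  # a real implementation would fall back to e.g. Myers

    prev_i = prev_j = 0
    # The sentinel pair makes the loop also recurse into the trailing gap.
    for i, j in anchors + [(len(a), len(b))]:
        matches += patience_matches(a[prev_i:i], b[prev_j:j],
                                    a_off + prev_i, b_off + prev_j)
        if i < len(a):
            matches.append((a_off + i, b_off + j))
        prev_i, prev_j = i + 1, j + 1
    return matches
```

Lines that are left unmatched on either side are the ones a diff would report as removed or added. Note how the repeated brace lines are ignored when picking anchors, which is the property the summary above describes.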
⇒ git diff --diff-algorithm=default -- unpacked/_next/static/chunks/pages/_app.js | wc -l
116000
⇒ git diff --diff-algorithm=patience -- unpacked/_next/static/chunks/pages/_app.js | wc -l
35826
⇒ git diff --diff-algorithm=histogram -- unpacked/_next/static/chunks/pages/_app.js | wc -l
35835
⇒ git diff --diff-algorithm=minimal -- unpacked/_next/static/chunks/pages/_app.js | wc -l
35844
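The comparison above can also be scripted. This is a minimal Python sketch (the helper names diff_command and diff_line_count are hypothetical, not part of git) that runs git diff once per algorithm and counts the newlines in the output, much like piping to wc -l. It assumes git is on the PATH and that it is run from inside the repository.

```python
import subprocess

# Algorithm names accepted by `git diff --diff-algorithm=...`
ALGORITHMS = ("myers", "minimal", "patience", "histogram")


def diff_command(path, algorithm, revs=()):
    """Build the git diff argument list for one algorithm."""
    return ["git", "diff", f"--diff-algorithm={algorithm}", *revs, "--", path]


def diff_line_count(path, algorithm, revs=()):
    """Run git diff and count output lines (equivalent to `| wc -l`)."""
    result = subprocess.run(diff_command(path, algorithm, revs),
                            capture_output=True, check=True)
    # Count on raw bytes so unusual encodings in bundled code don't matter.
    return result.stdout.count(b"\n")


def compare_algorithms(path, revs=()):
    """Map each algorithm name to the size of the diff it produces."""
    return {name: diff_line_count(path, name, revs) for name in ALGORITHMS}
```

For example, compare_algorithms("unpacked/_next/static/chunks/pages/_app.js", revs=("HEAD~1", "HEAD")) would return a dict of line counts per algorithm for the last commit.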
Musings
⭐ Suggestion
It would be cool if ast-grep was able to show a diff between 2 files, but do it using the AST rather than just a raw text compare. Ideally we would be able to provide options to this, such as ignoring chunks where the only change is to a variable/function name (eg. for diffing minimised JavaScript webpack builds)
Ideally the output would still be text (not the AST tree), but the actual diffing could be done at the AST level.
💻 Use Cases
This would be really useful for minimising the noise when diffing minimised source builds looking for the 'real changes' between the builds (not just minimised variable names churning, etc)
Looking through current diff output formats shows all of the variable name changes as well, which equates to a lot of noise while looking for the relevant changes.
Some alternative potential workarounds I've considered are either pre-processing the files to standardize their variable/function names; and/or post-processing the diff output to try and detect when the only changes in a chunk are variable/function names, and then suppressing that chunk. Currently I'm just relying on git diff --diff-algorithm=minimal -- thefile.js
Originally posted by @0xdevalias in https://github.com/ast-grep/ast-grep/issues/901
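The post-processing workaround mentioned above could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not an existing tool: the names normalise_identifiers and rename_only are hypothetical, and the regex-based tokeniser is deliberately crude (it skips string literals, property accesses and numbers, but knows nothing about regex literals, template strings or comments). A robust version would work on a real AST.

```python
import re

# A subset of JavaScript reserved words that must never be treated as names.
KEYWORDS = frozenset(
    "async await break case catch class const continue default delete do "
    "else export extends false finally for function if import in instanceof "
    "let new null of return super switch this throw true try typeof "
    "undefined var void while yield".split()
)

TOKEN = re.compile(r"""
    "(?:\\.|[^"\\])*"        # double-quoted string literal (left untouched)
  | '(?:\\.|[^'\\])*'        # single-quoted string literal (left untouched)
  | \.[A-Za-z_$][\w$]*       # property access such as .foo (left untouched)
  | \d[\w.]*                 # numeric literal (left untouched)
  | [A-Za-z_$][\w$]*         # identifier or keyword
""", re.VERBOSE)


def normalise_identifiers(text):
    """Replace each distinct identifier with a placeholder numbered by the
    order of first appearance, so consistently renamed code compares equal."""
    names = {}

    def replace(match):
        token = match.group(0)
        if token[0] in "\"'." or token[0].isdigit() or token in KEYWORDS:
            return token
        return names.setdefault(token, f"_{len(names)}")

    return TOKEN.sub(replace, text)


def rename_only(removed_lines, added_lines):
    """True if a hunk's removed and added lines differ only in identifier names."""
    return (normalise_identifiers("\n".join(removed_lines))
            == normalise_identifiers("\n".join(added_lines)))
```

A diff post-processor could call rename_only on the '-' and '+' lines of each hunk and drop the hunks where it returns True. One known limitation of numbering names per side: swapping two different variables (a+b becoming b+a) also normalises to the same text, so this can hide a small class of real changes.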
See Also
- https://twitter.com/_devalias/status/1752257275585265862
-
Curious if anyone has some good ideas/tools for doing fancy things to minimise the diff noise when diffing very large bundled/minified JavaScript code?
Would appreciate any thoughts/tips/insights!
-
- https://github.com/ast-grep/ast-grep/issues/901
- https://github.com/Wilfred/difftastic
-
a structural diff that understands syntax
- https://difftastic.wilfred.me.uk/diffing.html
- https://difftastic.wilfred.me.uk/tree_diffing.html
-
This page summarises some of the other tree diffing tools available.
-
- https://github.com/Wilfred/difftastic/issues/631
-
- https://github.com/afnanenayet/diffsitter
-
A tree-sitter based AST difftool to get meaningful semantic diffs
-
You can also filter which tree sitter nodes are considered in the diff through the config file.
- https://github.com/afnanenayet/diffsitter#node-filtering
-
You can filter the nodes that are considered in the diff by setting include_nodes or exclude_nodes in the config file. exclude_nodes always takes precedence over include_nodes, and the type of a node is the kind of a tree-sitter node.
This feature currently only applies to leaf nodes, but we could exclude nodes recursively if there's demand for it.
-
- https://github.com/afnanenayet/diffsitter/issues/819
-
- https://github.com/dandavison/delta
-
A syntax-highlighting pager for git, diff, and grep output
- https://dandavison.github.io/delta/tips-and-tricks/export-to-html.html
-
Save output with colors to HTML/PDF etc
- https://gitlab.com/saalen/ansifilter
-
ANSI sequence filter Ansifilter handles text files containing ANSI terminal escape codes. The command sequences may be stripped or be interpreted to generate formatted output (HTML, RTF, TeX, LaTeX, BBCode, Pango)
-
-
-
- https://chat.openai.com/c/db393c4c-4131-4a0f-899f-acebcb28e813
- Private ChatGPT chat exploring options for minimising diff noise
- https://git-scm.com/docs/diff-format
- https://git-scm.com/docs/diff-format#_combined_diff_format
- https://github.com/waigani/diffparser
-
A Golang package for parsing git diffs
- https://pkg.go.dev/github.com/waigani/diffparser
-
- https://github.com/sergeyt/parse-diff
-
parse-diff Simple unified diff parser for JavaScript
-
- https://github.com/yeonjuan/parse-git-diff
-
parse-git-diff A parser for git diff
-
- https://github.com/ecomfe/gitdiff-parser
-
gitdiff-parser A fast and reliable git diff parser.
-
- https://github.com/kpdecker/jsdiff
-
A javascript text differencing implementation.
- http://incaseofstairs.com/jsdiff/ (playground)
- https://github.com/kpdecker/jsdiff#defining-custom-diffing-behaviors
-
- https://github.com/AsyncBanana/microdiff
-
A fast, zero dependency object and array comparison library. Significantly faster than most other deep comparison libraries and has full TypeScript support.
- https://news.ycombinator.com/item?id=29130661
-
The fastest object diff library in JavaScript
-
-
- https://github.com/trailofbits/graphtage
-
Graphtage A semantic diff utility and library for tree-like files such as JSON, JSON5, XML, HTML, YAML, and CSV.
- https://github.com/trailofbits/graphtage#why-does-graphtage-exist
-
Diffing tree-like structures with unordered elements is tough. Say you want to compare two JSON files. There are limited tools available, which are effectively equivalent to canonicalizing the JSON (e.g., sorting dictionary elements by key) and performing a standard diff. This is not always sufficient. For example, if a key in a dictionary is changed but its value is not, a traditional diff will conclude that the entire key/value pair was replaced by the new one, even though the only change was the key itself. See our documentation for more information.
-
-
- https://semanticdiff.com/
- https://semanticdiff.com/docs/understand-diff/
- https://app.semanticdiff.com/
- https://marketplace.visualstudio.com/items?itemName=semanticdiff.semanticdiff
- https://github.com/Sysmagine
- https://github.com/Sysmagine/SemanticDiff
- https://www.difflens.com/
-
The Developer's Diff Tool
-
Language Aware Diffs for your GitHub Pull Requests.
- https://github.com/marketplace/difflens
- https://marketplace.visualstudio.com/items?itemName=DiffLens.difflens
-
- https://code.visualstudio.com/docs/sourcecontrol/overview#_vs-code-as-git-editor
- https://github.com/mmueller2012/awesome-diff-tools
-
Awesome Diff Tools Awesome tools that show differences between files and folders.
-
Additional links to review
- https://www.monperrus.net/martin/tree-differencing
- https://handmade.network/p/366/diffest/
- https://x.com/NikosTsantalis/status/1767540305618780569
- https://arxiv.org/abs/2403.05939
- https://github.com/tsantalis/RefactoringMiner
- https://www.reddit.com/r/javascript/comments/2ydmtm/gumtreediff_diff_based_on_the_ast_of_the_code/
- https://courses.cs.vt.edu/cs6704/spring17/slides_by_students/CS6704_gumtree_Kijin_AN_Feb15.pdf
- https://github.com/GumTreeDiff/gumtree
- https://github.com/ganarajpr/ast-diff
- https://github.com/prettydiff/prettydiff
- https://github.com/SpoonLabs/gumtree-spoon-ast-diff
- https://github.com/balayette/ast-diff
- https://arxiv.org/pdf/2011.10268
- https://hal.science/hal-04423080/document
- https://tekin.co.uk/2020/10/better-git-diff-output-for-ruby-python-elixir-and-more
- https://news.ycombinator.com/item?id=24828509
- https://news.ycombinator.com/item?id=24831437
- https://bryanpendleton.blogspot.com/2010/04/more-study-of-diff-walter-tichys-papers.html
- https://www.researchgate.net/publication/220439403_The_String-to-String_Correction_Problem_With_Block_Moves
- https://news.ycombinator.com/item?id=39778412
- https://semanticdiff.com/blog/semanticdiff-vs-difftastic/
- https://codinuum.github.io/gallery-cca/
- https://github.com/codinuum/cca/tree/develop?tab=readme-ov-file
- https://ieeexplore.ieee.org/document/4656419
- https://github.com/ErnestMazurin/object-diff-ast
- https://codsen.com/os/ast-compare
- https://github.com/codsen/codsen/tree/main/packages/ast-compare#readme
- https://lib.rs/crates/diffsitter
- https://blog.atlantistech.com/semantic-diff-tool/
- https://github.com/atlantistechnology/sdt
-
Semantic Diff Tool (sdt)
-
The command-line tool sdt compares source files to identify which changes create semantic differences in the program operation, and specifically to exclude many changes which cannot be functionally important to the operation of a program or library.
-
- https://github.com/jhchen/fast-diff
-
Fast Diff
-
This is a simplified import of the excellent diff-match-patch library by Neil Fraser into the Node.js environment. The match and patch parts are removed, as well as all the extra diff options. What remains is incredibly fast diffing between two strings.
The diff function is an implementation of "An O(ND) Difference Algorithm and its Variations" (Myers, 1986) with the suggested divide and conquer strategy along with several optimizations Neil added.
-
- https://github.com/jonTrent/PatienceDiff
-
PatienceDiff & PatienceDiffPlus
-
A concise javascript implementation of the Patience Diff algorithm
Plus, an implementation of a new algorithm dubbed Patience Diff Plus, which in addition to the usual Patience Diff, identifies lines that moved.
-
- https://github.com/yoshuawuyts/playground-patience-diff
-
playground-patience-diff
-
Convert the Ruby patience diff algo into javascript.
- https://gist.github.com/yoshuawuyts/86b30f07ad30104edf4671b332883908
-
patience-diff.rb
-
-
- https://github.com/yoshuawuyts/changes
-
Changes
-
Compute the differences between two states.
-
- https://blog.jcoglan.com/2017/02/12/the-myers-diff-algorithm-part-1/
-
The Myers diff algorithm: part 1
-
- https://blog.jcoglan.com/2017/02/15/the-myers-diff-algorithm-part-2/
-
The Myers diff algorithm: part 2
-
- https://blog.jcoglan.com/2017/02/17/the-myers-diff-algorithm-part-3/
-
The Myers diff algorithm: part 3
-
- https://blog.jcoglan.com/2017/03/22/myers-diff-in-linear-space-theory/
-
Myers diff in linear space: theory
-
- https://blog.jcoglan.com/2017/04/25/myers-diff-in-linear-space-implementation/
-
Myers diff in linear space: implementation
-
- https://blog.jcoglan.com/2017/09/19/the-patience-diff-algorithm/
-
The patience diff algorithm
-
- https://blog.jcoglan.com/2017/09/28/implementing-patience-diff/
-
Implementing patience diff
-
- https://github.com/ltwlf/json-diff-ts
-
json-diff-ts
-
A diff tool for JavaScript written in TypeScript.
-
- https://github.com/ast-grep/ast-grep/issues/334
I would recommend difftastic for this! Actually rspack has already used it for checking diff between its output with that of webpack.
I would recommend difftastic for this
rspack has already used it for checking diff between its output with that of webpack
@HerringtonDarkholme Interesting.. do you know if they did so while suppressing the 'noise' of changed variables? Or was it more just generally to ensure they were doing compatible things. I had a quick google, but didn't seem to turn up anything specific beyond the repo/etc:
- https://github.com/web-infra-dev/rspack
-
A fast Rust-based web bundler
-
Edit: Opened the following issue on difftastic:
- https://github.com/Wilfred/difftastic/issues/631
Edit 2: And this one on diffsitter:
- https://github.com/afnanenayet/diffsitter/issues/819
I was also just re-reading through the diffsitter README and noticed this section that I somehow missed in the past; which sounds like it might be exactly like what I want:
- https://github.com/afnanenayet/diffsitter
-
A tree-sitter based AST difftool to get meaningful semantic diffs
-
You can also filter which tree sitter nodes are considered in the diff through the config file.
- https://github.com/afnanenayet/diffsitter#node-filtering
-
You can filter the nodes that are considered in the diff by setting include_nodes or exclude_nodes in the config file. exclude_nodes always takes precedence over include_nodes, and the type of a node is the kind of a tree-sitter node.
This feature currently only applies to leaf nodes, but we could exclude nodes recursively if there's demand for it.
-
-
Though playing with diffsitter just now, its output format seems to leave a lot to be desired compared to typical git diff unified output; and it doesn't show any context/etc currently:
- https://github.com/afnanenayet/diffsitter/issues/744
- https://github.com/afnanenayet/diffsitter/issues/744#issuecomment-1918317644
- https://github.com/afnanenayet/diffsitter/issues/627
- https://github.com/afnanenayet/diffsitter/issues/155
- https://github.com/afnanenayet/diffsitter/issues/349
- https://github.com/afnanenayet/diffsitter/issues/158
- https://github.com/afnanenayet/diffsitter/issues/152
- Usage with JavaScript: https://github.com/afnanenayet/diffsitter/issues/152#issuecomment-1916191285
- https://github.com/afnanenayet/diffsitter/issues/149
eg. On a very minimal example, diffsitter:
⇒ git difftool --tool diffsitter HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-AOnHKy/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-ahyGIo/_buildManifest.js
===================================================================================
80:
---
- "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],
80:
---
+ "/search": ["static/chunks/pages/search-d835393483b5432a.js"],
138:
----
+ "static/chunks/5054-e2060ddbea2abdb7.js"
138:
----
- "static/chunks/5054-8ad3d13d663a6185.js"
Vs git diff (with delta):
⇒ git diff HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
diff --git a/unpacked/_next/static/[buildHash]/_buildManifest.js b/unpacked/_next/static/[buildHash]/_buildManifest.js
index 851a8f0..5004cc7 100644
--- a/unpacked/_next/static/[buildHash]/_buildManifest.js
+++ b/unpacked/_next/static/[buildHash]/_buildManifest.js
@@ -78,7 +78,7 @@
"/payments/success-trial": [
"static/chunks/pages/payments/success-trial-84597e34390c1506.js",
],
- "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],
+ "/search": ["static/chunks/pages/search-d835393483b5432a.js"],
"/share/e/[[...shareParams]]": [
"static/chunks/pages/share/e/[[...shareParams]]-899e50f90dac9ff5.js",
],
@@ -136,6 +136,6 @@
"static/chunks/5017-f7c5e142fc7f0516.js",
"static/chunks/3975-37a9301353b29c5d.js",
"static/chunks/3754-ae5dc2fb759ecfc1.js",
- "static/chunks/5054-8ad3d13d663a6185.js"
+ "static/chunks/5054-e2060ddbea2abdb7.js"
)),
self.__BUILD_MANIFEST_CB && self.__BUILD_MANIFEST_CB();
This is the diffsitter config I was using:
~/.config/diffsitter/config.json5
// Default: `diffsitter dump-default-config`
// See also: https://github.com/afnanenayet/diffsitter/blob/v0.8.1/assets/sample_config.json5
// Colours: `color256`, `black`, `red`, `green`, `yellow`, `blue`, `magenta`, `cyan`, `white`
{
"formatting": {
"default": "unified",
"unified": {
"addition": {
"highlight": "green",
"regular-foreground": "green",
"emphasized-foreground": "white",
"bold": true,
"underline": false,
"prefix": "+ "
},
"deletion": {
"highlight": "red",
"regular-foreground": "red",
"emphasized-foreground": "white",
"bold": true,
"underline": false,
"prefix": "- "
}
},
"json": {
"pretty_print": false
},
"custom": {}
},
"grammar": {
"dylib-overrides": null,
"file-associations": {
"js": "typescript",
"jsx": "tsx"
},
},
"input-processing": {
"split-graphemes": true,
// You can exclude different tree sitter node types - this rule takes precedence over `include_kinds`.
"exclude-kinds": null,
// "exclude-kinds": ["string"],
// You can specifically allow only certain tree sitter node types
"include-kinds": null
// "include-kinds": ["method_definition"],
},
// Specify a fallback command if diffsitter can't parse the given input
// files. This is invoked by diffsitter as:
//
// ```sh
// ${fallback_cmd} ${old} ${new}
// ```
"fallback-cmd": null,
// "fallback-cmd": "diff",
}
And this is the .gitconfig I was using to run it as a git difftool:
# https://github.com/afnanenayet/diffsitter
[difftool "diffsitter"]
cmd = diffsitter "$LOCAL" "$REMOTE"
# https://github.com/afnanenayet/diffsitter
[difftool "diffsitter-debug"]
cmd = diffsitter --debug "$LOCAL" "$REMOTE"
Running it with --debug allows us to see the steps it takes during processing, and how long they take. On our small example file above, the output looks like this:
diffsitter debug output from minimal example
⇒ git difftool --tool diffsitter-debug --color=always HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
2024-01-30T07:09:30.627Z DEBUG diffsitter > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js can be parsed
2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript
2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript
2024-01-30T07:09:30.627Z DEBUG diffsitter > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js can be parsed
2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript
2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript
2024-01-30T07:09:30.627Z DEBUG diffsitter > Extensions for both input files are supported
2024-01-30T07:09:30.627Z DEBUG libdiffsitter > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js to string
2024-01-30T07:09:30.627Z INFO libdiffsitter > Will deduce filetype from file extension
2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript
2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript
2024-01-30T07:09:30.627Z DEBUG libdiffsitter::parse > Constructed parser
2024-01-30T07:09:30.628Z DEBUG libdiffsitter::parse > Parsed AST
2024-01-30T07:09:30.628Z INFO TimerFinished > parse::parse_file(), Elapsed=968.896µs
2024-01-30T07:09:30.628Z DEBUG libdiffsitter > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js to string
2024-01-30T07:09:30.628Z INFO libdiffsitter > Will deduce filetype from file extension
2024-01-30T07:09:30.628Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
2024-01-30T07:09:30.628Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript
2024-01-30T07:09:30.628Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript
2024-01-30T07:09:30.628Z DEBUG libdiffsitter::parse > Constructed parser
2024-01-30T07:09:30.629Z DEBUG libdiffsitter::parse > Parsed AST
2024-01-30T07:09:30.629Z INFO TimerFinished > parse::parse_file(), Elapsed=590.376µs
2024-01-30T07:09:30.629Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=388.582µs
2024-01-30T07:09:30.630Z INFO TimerFinished > ast::process(), Elapsed=1.083248ms
2024-01-30T07:09:30.630Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=309.698µs
2024-01-30T07:09:30.631Z INFO TimerFinished > ast::process(), Elapsed=997.049µs
2024-01-30T07:09:30.631Z INFO TimerFinished > diff::compute_edit_script(), Elapsed=127.357µs
2024-01-30T07:09:30.631Z INFO libdiffsitter::render::unified > Using stack style vertical for title
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 80 - 80)
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 4
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 80
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 80
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 80 - 80)
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 80 - 80)
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 4
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 80
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 80
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 80 - 80)
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 138 - 138)
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 5
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 138
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 138
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138)
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 138 - 138)
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 5
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 138
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 138
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138)
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js
===================================================================================
80:
---
- "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],
80:
---
+ "/search": ["static/chunks/pages/search-d835393483b5432a.js"],
138:
----
+ "static/chunks/5054-e2060ddbea2abdb7.js"
138:
----
- "static/chunks/5054-8ad3d13d663a6185.js"
Reducing that output to just show the relevant timing events:
// Start
2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:09:30.628Z INFO TimerFinished > parse::parse_file(), Elapsed=968.896µs
// ..snip..
2024-01-30T07:09:30.629Z INFO TimerFinished > parse::parse_file(), Elapsed=590.376µs
2024-01-30T07:09:30.629Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=388.582µs
2024-01-30T07:09:30.630Z INFO TimerFinished > ast::process(), Elapsed=1.083248ms
2024-01-30T07:09:30.630Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=309.698µs
2024-01-30T07:09:30.631Z INFO TimerFinished > ast::process(), Elapsed=997.049µs
2024-01-30T07:09:30.631Z INFO TimerFinished > diff::compute_edit_script(), Elapsed=127.357µs
// ..snip..
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138)
// Finish
We can then see where the real performance hit is when running this against a MUCH larger file/diff (the file being diffed is ~8.4MB; ~249,128 lines long)
diffsitter debug output from large example
⇒ du -sh unpacked/_next/static/chunks/pages/_app.js
8.4M unpacked/_next/static/chunks/pages/_app.js
⇒ cat unpacked/_next/static/chunks/pages/_app.js | wc -l
249128
⇒ git difftool --tool diffsitter-debug HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js
2024-01-30T07:07:58.282Z DEBUG diffsitter > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-wpgXXK/_app.js can be parsed
2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript
2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript
2024-01-30T07:07:58.282Z DEBUG diffsitter > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-6fi10i/_app.js can be parsed
2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript
2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript
2024-01-30T07:07:58.282Z DEBUG diffsitter > Extensions for both input files are supported
2024-01-30T07:07:58.287Z DEBUG libdiffsitter > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-wpgXXK/_app.js to string
2024-01-30T07:07:58.287Z INFO libdiffsitter > Will deduce filetype from file extension
2024-01-30T07:07:58.291Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
2024-01-30T07:07:58.291Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript
2024-01-30T07:07:58.291Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript
2024-01-30T07:07:58.291Z DEBUG libdiffsitter::parse > Constructed parser
2024-01-30T07:07:59.790Z DEBUG libdiffsitter::parse > Parsed AST
2024-01-30T07:07:59.791Z INFO TimerFinished > parse::parse_file(), Elapsed=1.50364822s
2024-01-30T07:07:59.795Z DEBUG libdiffsitter > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-6fi10i/_app.js to string
2024-01-30T07:07:59.795Z INFO libdiffsitter > Will deduce filetype from file extension
2024-01-30T07:07:59.800Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
2024-01-30T07:07:59.800Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript
2024-01-30T07:07:59.800Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript
2024-01-30T07:07:59.800Z DEBUG libdiffsitter::parse > Constructed parser
2024-01-30T07:08:01.160Z DEBUG libdiffsitter::parse > Parsed AST
2024-01-30T07:08:01.160Z INFO TimerFinished > parse::parse_file(), Elapsed=1.364231801s
2024-01-30T07:08:01.985Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=825.767557ms
2024-01-30T07:08:02.972Z INFO TimerFinished > ast::process(), Elapsed=1.812168007s
2024-01-30T07:08:03.814Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=841.781925ms
2024-01-30T07:08:04.787Z INFO TimerFinished > ast::process(), Elapsed=1.815188386s
2024-01-30T07:20:22.478Z INFO TimerFinished > diff::compute_edit_script(), Elapsed=737.685108096s
2024-01-30T07:20:22.502Z INFO libdiffsitter::render::unified > Using stack style horizontal for title
2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 52698 - 52698)
2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Title string has length of 7
2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Printing line 52698
2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > End line 52698
// ..snip 17,524 lines..
2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96300
2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > End line 96300
2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301
// I'm not sure if this actually finished successfully, or just silently crashed at this point... I think it might have crashed, as I would have expected to see a log line like:
// DEBUG libdiffsitter::render::unified > End hunk
// As well as the actual diff output
Reducing that output to just show the relevant timing events:
// Start
2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:07:59.791Z INFO TimerFinished > parse::parse_file(), Elapsed=1.50364822s
// ..snip..
2024-01-30T07:08:01.160Z INFO TimerFinished > parse::parse_file(), Elapsed=1.364231801s
2024-01-30T07:08:01.985Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=825.767557ms
2024-01-30T07:08:02.972Z INFO TimerFinished > ast::process(), Elapsed=1.812168007s
2024-01-30T07:08:03.814Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=841.781925ms
2024-01-30T07:08:04.787Z INFO TimerFinished > ast::process(), Elapsed=1.815188386s
2024-01-30T07:20:22.478Z INFO TimerFinished > diff::compute_edit_script(), Elapsed=737.685108096s
// ..snip..
2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301
// Finish
We can see that ~12.29 minutes was spent in diff::compute_edit_script(), and then it looks like the script might have silently crashed, as I didn't see the diff output, nor the final DEBUG libdiffsitter::render::unified > End hunk line that I would have expected to see.
Note: I was piping this command to subl (Sublime Text), despite what the code block above was edited to look like; I haven't tried running this again yet printing directly to STDOUT/a file. I also haven't tried running this without debug mode to see if it somehow doesn't crash that way.
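The reduction to timing events, and the ~12.29 minute figure, can be reproduced from a saved debug log with a short script. This is a minimal Python sketch that assumes the TimerFinished line format shown in the output above; the function name timer_totals and the unit table are my own, and the unit suffixes other than s, ms and µs are guesses at what else might appear.

```python
import re

# Seconds per unit suffix. Both the micro sign (U+00B5) and the Greek mu
# (U+03BC) are accepted, since either can appear depending on the terminal.
UNIT_SECONDS = {
    "s": 1.0,
    "ms": 1e-3,
    "\u00b5s": 1e-6,
    "\u03bcs": 1e-6,
    "us": 1e-6,
    "ns": 1e-9,
}

TIMER = re.compile(
    r"TimerFinished > (?P<name>[\w:]+)\(\), Elapsed=(?P<value>[\d.]+)(?P<unit>\S+)"
)


def timer_totals(log_lines):
    """Sum the elapsed seconds per timer name across all TimerFinished lines."""
    totals = {}
    for line in log_lines:
        match = TIMER.search(line)
        if match is None:
            continue  # not a timing event
        seconds = float(match["value"]) * UNIT_SECONDS[match["unit"]]
        totals[match["name"]] = totals.get(match["name"], 0.0) + seconds
    return totals
```

Calling timer_totals(open("diffsitter-debug.log", encoding="utf-8")) (the file name is just an example) and dividing the diff::compute_edit_script entry by 60 gives the time in minutes.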
As a point of comparison, difftastic was seemingly able to diff this file in ~6.5sec of wall-clock time (~2.63sec of user CPU time):
⇒ time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD -- 2.63s user 0.46s system 47% cpu 6.494 total
subl 0.01s user 0.02s system 0% cpu 6.746 total
Though with a lot of this printed among the diff output:
_app.js --- 1/674 --- Text (8.39 MiB exceeded DFT_BYTE_LIMIT)
- https://github.com/Wilfred/difftastic/blob/master/CHANGELOG.md#044-released-2nd-march-2023
-
If a file exceeds DFT_BYTE_LIMIT, difftastic now displays its size in the header.
-
- https://github.com/Wilfred/difftastic/blob/master/CHANGELOG.md#042-released-15th-january-2023
-
Fixed an issue with unwanted underlines with textual diffing when DFT_BYTE_LIMIT is reached.
-
- https://github.com/Wilfred/difftastic/blob/master/CHANGELOG.md#020-released-20th-february-2022
-
difftastic will now use a text diff for large files that are too big to parse in a reasonable amount of time. This threshold is configurable with --byte-limit and DFT_BYTE_LIMIT.
-
It seems that when DFT_BYTE_LIMIT is exceeded, difftastic falls back to just doing a basic text diff, so the timings above aren't really a fair comparison. We can override this with --byte-limit or DFT_BYTE_LIMIT:
⇒ difftastic -h
Difftastic 0.52.0
..snip..
OPTIONS:
..snip..
--byte-limit <LIMIT> Use a text diff if either input file exceeds this size. [env:
DFT_BYTE_LIMIT=] [default: 1000000]
..snip..
--graph-limit <LIMIT> Use a text diff if the structural graph exceed this number of
nodes in memory. [env: DFT_GRAPH_LIMIT=] [default: 3000000]
Let's set this to something much higher than we need for now:
20*1024*1024 = 20971520
Updating our .gitconfig:
# https://github.com/Wilfred/difftastic
[difftool "difftastic"]
cmd = difft --byte-limit 20971520 "$LOCAL" "$REMOTE"
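As a quick sanity check on that number, here is a minimal sketch of how a byte limit could be derived from a file size instead of being picked by hand. The helper name byte_limit_for and the headroom factor are assumptions made up for illustration; difftastic itself only takes the final integer.

```python
import math

MIB = 1024 * 1024


def byte_limit_for(size_bytes: int, headroom: float = 2.0) -> int:
    """Return size_bytes * headroom, rounded up to a whole number of MiB.

    Hypothetical helper: 'headroom' is just a safety factor chosen here,
    not anything difftastic defines.
    """
    return math.ceil(size_bytes * headroom / MIB) * MIB


# The value used in the .gitconfig entry above.
print(20 * MIB)  # 20971520

# A limit sized for a file of roughly 8.39 MiB, with 2x headroom.
print(byte_limit_for(int(8.39 * MIB)))
```

Either value could then be passed to --byte-limit or exported as DFT_BYTE_LIMIT.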
Running that command again:
⇒ time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD -- 12.42s user 1.10s system 79% cpu 17.043 total
subl 0.01s user 0.02s system 0% cpu 17.248 total
It takes a bit longer, but then we hit a different error/limit:
_app.js --- 1/674 --- Text (2 JavaScript parse errors, exceeded DFT_PARSE_ERROR_LIMIT)
From the help:
⇒ difft --help | subl
..snip..
--parse-error-limit <LIMIT>
Use a text diff if the number of parse errors exceeds this value.
[env: DFT_PARSE_ERROR_LIMIT=]
[default: 0]
There seem to be 674 matches of DFT_PARSE_ERROR_LIMIT in the output, which corresponds to the 674 chunks of output.
I didn't see any specific errors output related to the parsing though.. I wonder how we can see what the actual issue was?
It seems we can dump the AST either from tree-sitter directly, or the difftastic filtered version of it, using the debug options:
⇒ difft --help | subl
..snip..
DEBUG OPTIONS:
--dump-syntax <PATH>
Parse a single file with tree-sitter and display the difftastic syntax tree.
--dump-ts <PATH>
Parse a single file with tree-sitter and display the tree-sitter parse tree.
Full tree-sitter AST:
⇒ time difft --dump-ts unpacked/_next/static/chunks/pages/_app.js > _app.js-difftastic-tree-sitter.ast
difft --dump-ts unpacked/_next/static/chunks/pages/_app.js > 5.08s user 9.49s system 95% cpu 15.312 total
⇒ du -h _app.js-difftastic-tree-sitter.ast
224M _app.js-difftastic-tree-sitter.ast
⇒ cat _app.js-difftastic-tree-sitter.ast| wc -l
2533687
difftastic 'syntax' AST:
⇒ time difft --dump-syntax unpacked/_next/static/chunks/pages/_app.js > _app.js-difftastic.ast
difft --dump-syntax unpacked/_next/static/chunks/pages/_app.js > 434.80s user 43.33s system 93% cpu 8:29.31 total
⇒ du -h _app.js-difftastic.ast
2.2G _app.js-difftastic.ast
⇒ cat _app.js-difftastic.ast | wc -l
10571815
After the above explorations, I ended up taking a different tack and started exploring the 'post processing git diff' approach.
Initial PoC tests seemed to show some merit, so I hacked them together into a script that can filter an existing git diff passed in from a file, or STDIN:
- https://twitter.com/_devalias/status/1752260576066301956
-
Some alternative potential workarounds I've considered are: pre-processing the files to standardize their variable/function names; and/or post-processing the diff output to try and detect when the only changes in a chunk are variable/function names; then suppressing that chunk.
-
- https://twitter.com/_devalias/status/1752646128372535530
-
After exploring/playing around with a bunch of tools yesterday, and a hyperfocus of hacking out some PoC code this afternoon/evening, I think I have the elements of a workable solution! 🎉
Just need to pull it all together and refine it once the 🧠 is refreshed/rested..
-
- https://twitter.com/_devalias/status/1753322272293896385
-
The code is still pretty hacky.. but pulled the PoC's together into a usable script for filtering the noise in the diff output!
Still not perfect.. but I am really liking these stats so far!
For an 8.4mb _app.js file (250,022 lines):
Original diff: 33,399
Filtered: 7,516
-
- https://twitter.com/_devalias/status/1753322275984867513
-
There are still a bunch of areas where my partial AST parsing is getting errors, which means more noise will slip through till I can fix those.. but this is already super useful!
Hopefully will get a version of it cleaned up/committed/pushed sooner rather than later.
-
- https://twitter.com/_devalias/status/1753711267137892684
-
A few more updates that I posted elsewhere as I was going; from 7100, to 6455, to 4268, and now down to 3913 lines.
-
Added a few more ‘pre AST parsing’ fixups that get the diff down even further: from the previous 7100 lines to 6455 lines
-
Trialled parsing the AST with a more lenient parser as a pre-step; filtering out unknown bits, then generating that back into code; that I then apply some of my previous ‘preparation hacks’ to, before parsing it with the more strict parser and doing the identifier normalisation.
That’s currently reduced the 6455 lines down to 4268 lines; and there are still leftover parsing errors that could potentially get that number lower still that I’m chipping away at.
Another alternative could be to completely replace the stricter parser with the more lenient one; but then I would have to rewrite a bunch of other code as well.
I could also potentially try a different lenient parser; or pre-process the code being passed to this one; as I believe it’s failing to parse some statements even; which is potentially leading to more of that noise filtering through..
-
I added a pre-processing step to the lenient parser which improved things a bit; then I added an actual fix to the AST from the lenient parser to help fix more complex ternaries that were challenging to fix in my simpler string append pre/post fix methods.
With those, we’re now down to 3913 lines.
Being able to properly detect and fix object properties that are missing their wrapping object would probably fix a bunch more of the issues.
Though maybe when I look at this next; I’ll just see whether using the lenient parser on its own gives a quicker/better result.
-
-
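The post-processing idea from the tweets above (suppress a chunk when the only changes are variable/function names) can be sketched roughly as below. This is not the diff-minimiser.js script: it uses a small regex tokenizer rather than AST parsing, the keyword list is deliberately incomplete, template literals, regex literals and comments aren't handled, and the names normalise / only_identifiers_changed are made up for illustration.

```python
import re

# Incomplete set of JavaScript reserved words that must never be renamed.
JS_KEYWORDS = {
    "break", "case", "catch", "class", "const", "continue", "default",
    "delete", "do", "else", "export", "extends", "false", "finally", "for",
    "function", "if", "import", "in", "instanceof", "let", "new", "null",
    "return", "super", "switch", "this", "throw", "true", "try", "typeof",
    "var", "void", "while", "with", "yield",
}

# Strings are matched first so identifiers inside them are left alone.
TOKEN_RE = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
    |(?P<number>\d[\w.]*)
    |(?P<ident>[A-Za-z_$][\w$]*)
    |(?P<punct>\S)
    """,
    re.VERBOSE,
)


def normalise(code: str) -> list[str]:
    """Tokenise code, replacing each identifier with id0, id1, ... in order
    of first appearance. Property names (anything after a '.') are kept,
    since minifiers normally don't rename them."""
    names: dict[str, str] = {}
    out: list[str] = []
    prev = ""
    for m in TOKEN_RE.finditer(code):
        text = m.group()
        if m.lastgroup == "ident" and text not in JS_KEYWORDS and prev != ".":
            text = names.setdefault(text, f"id{len(names)}")
        out.append(text)
        prev = m.group()
    return out


def only_identifiers_changed(removed: list[str], added: list[str]) -> bool:
    """True if the removed and added lines of a chunk are identical once
    identifiers are normalised, so the chunk could be suppressed."""
    return normalise("\n".join(removed)) == normalise("\n".join(added))


print(only_identifiers_changed(
    ['function a(e,t){return e.foo(t)+"x"}'],
    ['function b(n,r){return n.foo(r)+"x"}'],
))  # True
```

Because whitespace is dropped by the tokenizer, purely cosmetic reformatting inside a chunk is ignored as well, which may or may not be what you want.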
With the current PoC implementations in the diff minimiser, we're grouping by the diff chunks, and then by added/removed lines within those chunks.
Sometimes a section of code will churn and move a large number of lines from one chunk to another, which the minimiser currently won't notice.
It would be good if we could process these in a similar way to git diff's color-moved, with our 'ignore identifier changes', so that we can suppress both the 'added' and 'removed' chunks if they're otherwise unchanged (or minimise the diff between them to only show what has changed, rather than showing them as large 'disconnected blocks')
Edit: See also:
- https://github.com/Wilfred/difftastic/issues/508
--
VSCode recently added a new feature called move code detection. The algorithm is surprisingly simple: after normal diff, calculate similarity(deletion, insertion) for every deletion & insertion pair. The code can be found here https://github.com/microsoft/vscode/blob/166097a20cbd06d10d255ef561837c439f372de3/src/vs/editor/common/diff/defaultLinesDiffComputer/computeMovedLines.ts#L44-L85.
But intuitively demonstrating the moving relationship on TUI might be a challenge.
Originally posted by @QuarticCat in https://github.com/Wilfred/difftastic/issues/508#issuecomment-1734691395
--
Have you looked into git diff's --color-moved at all?
- https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---color-movedltmodegt
Personally I tend to use the zebra mode of it. From my git aliases:
- https://github.com/0xdevalias/dotfiles/blob/a59d60e18b58069fe1d2c21dc5ef42a1c0afe797/git/gitconfig.symlink#L248-L259
# https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---color-movedltmodegt
# https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---color-moved-wsltmodesgt
# https://git-scm.com/docs/git-config#Documentation/git-config.txt-color
# https://git-scm.com/docs/git-config#Documentation/git-config.txt-colordiffltslotgt
diff-refactor = \
  -c color.diff.oldMoved='white dim' \
  -c color.diff.oldMovedAlternative='white dim' \
  -c color.diff.newMoved='white dim' \
  -c color.diff.newMovedAlternative='white dim' \
  -c color.diff.newMovedDimmed='white dim' \
  -c color.diff.newMovedAlternativeDimmed='white dim' \
  diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal
Originally posted by @0xdevalias in https://github.com/Wilfred/difftastic/issues/539#issuecomment-1916033305
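The 'score every deletion/insertion pair by similarity' idea quoted above could be sketched like this. It uses Python's difflib.SequenceMatcher over lines as the similarity measure, which is an assumption and not necessarily what VSCode's computeMovedLines.ts does; the 0.8 threshold and the function names are also hypothetical.

```python
from difflib import SequenceMatcher


def similarity(deletion: list[str], insertion: list[str]) -> float:
    """Ratio in [0, 1] of matching lines between two blocks, ignoring
    leading/trailing whitespace on each line."""
    a = [line.strip() for line in deletion]
    b = [line.strip() for line in insertion]
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def find_moves(deletions, insertions, threshold=0.8):
    """Pair each deleted block with its most similar unused inserted block.

    Returns (deletion_index, insertion_index, score) tuples for pairs whose
    score is at least 'threshold'. This is O(len(deletions) * len(insertions))
    similarity computations, so it gets slow on very large diffs.
    """
    moves = []
    used = set()
    for d_idx, deletion in enumerate(deletions):
        best_idx, best_score = None, threshold
        for i_idx, insertion in enumerate(insertions):
            if i_idx in used:
                continue
            score = similarity(deletion, insertion)
            if score >= best_score:
                best_idx, best_score = i_idx, score
        if best_idx is not None:
            used.add(best_idx)
            moves.append((d_idx, best_idx, best_score))
    return moves


deletions = [["function f(){", "  return 1", "}"], ["let x = 2"]]
insertions = [["const y = 3"], ["function f(){", "return 1", "}"]]
print(find_moves(deletions, insertions))  # [(0, 1, 1.0)]
```

To get the 'ignore identifier changes' behaviour, the lines could be passed through some identifier normalisation step before being compared, at the cost of parsing or tokenising every candidate block.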
A lot of this will probably be uninteresting/useless noise, but I just made a repo and pushed all of my hacky WIP/PoC AST exploration scripts to it so they're captured in one place:
- https://github.com/0xdevalias/poc-ast-tools
PoC scripts and tools for working with (primarily JavaScript) ASTs.
What might be of interest though, is that I pushed the current hacky state of my diff minimiser code there too. I haven't checked it for a good few weeks, so can't even remember if it's in a runnable state currently, but if nothing else, there might be some ideas hidden away in there that help explain what I was referring to above:
- https://github.com/0xdevalias/poc-ast-tools/commit/9f7071f0bfea198e3bfe3a251b67d3340d76bd6c
- https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser.js
- https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser-poc-acorn.js
When I get a chance to get back to them, I plan to finish my research/refinements on the best method(s), and then clean it all up back to a single useful script that I will commit to this repo. But until then, those may be of some interest to someone.
Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2024566283
Here's a fun little hack I just thought up (while the diff minimiser script still breaks --color-moved):
- Diff 2 files, run them through the minimiser, save that diff to a file
- Apply the output of the minimised diff to the original file, save that to a file
- Re-run diff with --color-moved between the original file, and the original+minimisedDiffPatch file
Currently this doesn't work fully, as I think I'm not properly updating the line metadata in the diff hunks when I've removed sections from them in the diff minimiser.. but something like this (in theory):
⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | \
  sed 's|a/unpacked/_next/static/chunks/pages/_app.js|a/_app.original.js|g' | \
  sed 's|b/unpacked/_next/static/chunks/pages/_app.js|b/_app.patched.js|g' \
  > _app.minimised.diff
⇒ git show 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js > _app.original.js
⇒ git apply _app.minimised.diff
⇒ git -c color.diff.oldMoved='white dim' \
  -c color.diff.oldMovedAlternative='white dim' \
  -c color.diff.newMoved='white dim' \
  -c color.diff.newMovedAlternative='white dim' \
  -c color.diff.newMovedDimmed='white dim' \
  -c color.diff.newMovedAlternativeDimmed='white dim' \
  diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js
Edit: I wrote a new scripts/fix-diff-headers.js (https://github.com/0xdevalias/chatgpt-source-watch/commit/f7deec74cca9127bc2f18279d83f7a088f3f35f1) script that seems to get it closer to working..
⇒ ./scripts/fix-diff-headers.js _app.minimised.diff _app.fixed.diff
Reading diff from _app.minimised.diff and writing corrected diff to _app.fixed.diff
Diff successfully corrected and written to _app.fixed.diff
⇒ patch _app.original.js _app.fixed.diff -o _app.patched.js
patching file _app.original.js
6 out of 90 hunks failed--saving rejects to _app.patched.js.rej
⇒ git -c color.diff.oldMoved='white dim' \
  -c color.diff.oldMovedAlternative='white dim' \
  -c color.diff.newMoved='white dim' \
  -c color.diff.newMovedAlternative='white dim' \
  -c color.diff.newMovedDimmed='white dim' \
  -c color.diff.newMovedAlternativeDimmed='white dim' \
  diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js
That way, instead of just seeing a giant chunk of 'removed' and a giant chunk of 'added' when this module moved; we can see the dimmed lines which were moved (but otherwise unchanged), and spend less time/effort having to consider them:
An even better version of this would be a diff --color-moved implementation that was able to ignore identifier names changing entirely (in which case I suspect it would mark this entire block as 'moved but not changed'); but as far as I am currently aware, the only way to do that would be to basically re-implement the whole diff --color-moved logic ourselves; which would basically end up with us writing our own AST differ I believe.
Though the way I was originally intending on fixing/improving the diff minimiser script was to try just maintaining the ANSI colouring of the input lines, and outputting the lines with the same colouring at the end.
Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2073982692
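The 'line metadata' problem mentioned above largely comes down to keeping each hunk's @@ -start,count +start,count @@ header consistent with the hunk body after lines have been dropped from it. A minimal sketch of recomputing those headers is below. It is not the fix-diff-headers.js script linked above (I'm not describing its internals here); it assumes a unified diff whose old-side start lines are still correct, and the function name is made up for illustration.

```python
import re

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@(.*)$")


def fix_hunk_headers(diff_text: str) -> str:
    """Recompute the line counts and new-side start of every hunk header
    from the hunk bodies, keeping each hunk's old-side start as-is."""
    lines = diff_text.split("\n")
    out = []
    offset = 0  # net lines added by earlier hunks in the current file
    i = 0
    while i < len(lines):
        m = HUNK_RE.match(lines[i])
        if not m:
            if lines[i].startswith("--- "):
                offset = 0  # a new file header resets the running offset
            out.append(lines[i])
            i += 1
            continue
        old_start = int(m.group(1))
        i += 1
        body = []
        # A hunk body is made of context, added, removed and
        # '\ No newline at end of file' lines.
        while i < len(lines) and lines[i][:1] in (" ", "+", "-", "\\"):
            body.append(lines[i])
            i += 1
        old_count = sum(1 for line in body if line[:1] in (" ", "-"))
        new_count = sum(1 for line in body if line[:1] in (" ", "+"))
        new_start = old_start + offset
        # Zero-length sides point at the line *before* the change.
        if old_count == 0:
            new_start += 1
        elif new_count == 0:
            new_start -= 1
        out.append(
            f"@@ -{old_start},{old_count} +{new_start},{new_count} @@{m.group(3)}"
        )
        out.extend(body)
        offset += new_count - old_count
    return "\n".join(out)
```

Note that this only makes the headers self-consistent. If the minimiser removed a -/+ pair without turning the removed line back into a context line, the patch can still fail to apply, which may be part of why some hunks were rejected above.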
From some Twitter chat with @michaelskyba, talking about git diff --color-moved and related:
- https://x.com/sucralose__/status/1866280320737132796
-
@michaelskyba (Dec 10) Wait, --color-moved just works. I'm pretty sure @0xdevalias was using something similar. Maybe it's fine to just remove all moved lines from the diff and call it a day
-
- https://x.com/sucralose__/status/1866282450197909758
-
@michaelskyba (Dec 10) Uh there's no option for that? Online I only see people using heuristics like "same line added and removed same number of times" but I feel like git is doing something more sophisticated, especially given the whitespace-related options
-
- https://x.com/_devalias/status/1866633367560872105
-
@0xdevalias (Dec 11) Yeah.. the code for it is open source, so you could go try and understand how the algorithm works. I wanted to, but I never got around to spending the time to figure it out.
I think I was going to see if I could port it out of git into something I could use in my own code
-
- https://x.com/sucralose__/status/1866839846901387687
-
@michaelskyba (Dec 12) Right now the time investment for that is lower ROI, but by EOY 2025, consumer LLMs will hopefully be able to just take the 600K lines of git's source code and make a nice single-paragraph summary of any individual feature. The input diffs will be 100x more complex too though
-
- https://x.com/_devalias/status/1867721773070135799
-
@0xdevalias (Dec 14) True.. but you can also probably find it pretty quickly with a couple of normal searches of the git source repo.
-
- https://x.com/_devalias/status/1867721812924412021
-
@0xdevalias (Dec 14) Tangentially related, just stumbled upon this again, I always forget that git ships with a bunch of extra contrib scripts that aren't exposed through the main CLI: https://github.com/git/git/tree/2ccc89b0c16c51561da90d21cfbb4b58cc877bf6/contrib/diff-highlight
-
- https://x.com/_devalias/status/1867723412082503694
-
@0xdevalias (Dec 14) Back to --color-moved: https://github.com/git/git/blob/2ccc89b0c16c51561da90d21cfbb4b58cc877bf6/diff.c#L5825-L5827
diff_opt_color_moved: https://github.com/git/git/blob/2ccc89b0c16c51561da90d21cfbb4b58cc877bf6/diff.c#L5214-L5233
Then it looks like options->color_moved is referenced a bunch throughout that diff.c
-
- https://x.com/_devalias/status/1867724495718690871
-
@0xdevalias (Dec 14) GitHub Copilot isn't the most useful.. at least from a first prompt
-
- https://x.com/_devalias/status/1867724970648121836
-
@0xdevalias (Dec 14) Claude 3.5 Sonnet does better, with that source file attached
- https://claude.ai/chat/768b1075-5517-49b3-be72-11867c0bac9e
-
- https://x.com/_devalias/status/1867725273720139901
-
@0xdevalias (Dec 14) As does ChatGPT 4o
- https://chatgpt.com/c/675ccd5a-dbf4-8008-b964-beb295ed07d0
-
- https://x.com/_devalias/status/1867727351490261458
-
@0xdevalias (Dec 14) Oh, I remember why I wanted to be able to re-implement --color-moved in my own code... it was because I wanted to be able to add an 'ignore identifiers' concept to it, based on parsing the code into an AST. Could also work around that by stabilising the identifiers before diffing
-
- https://x.com/_devalias/status/1867731895058149712
-
@0xdevalias (Dec 14) But parsing the AST's across so many huge files takes a lot of time, so I was leveraging git diff to 'lower the noise' before needing to do the AST parsing.. but I guess either way if I wanted to ignore the identifiers I would have to parse it at some point...
-
- https://x.com/_devalias/status/1867731965576872100
-
@0xdevalias (Dec 14) Jumping back to those AI summaries now that I've actually read them.. Claude's is nice as a higher level overview, but ChatGPT 4o is definitely getting into the nittier grittier details.
-
- https://x.com/_devalias/status/1867733682695680329
-
@0xdevalias (Dec 14) Well.. thats a side rabbithole distraction I don't have capacity for right now.. but based on the performance of those queries.. it seems way more plausible to have it generate a standalone implementation of --color-moved these days than it was when I was first exploring it
-
- https://x.com/_devalias/status/1867736368971231739
-
@0xdevalias (Dec 14) Lol.. I went to document all of the above on one of my GitHub comments, and then found this one in the same issue that talks about my color-moved thoughts: https://github.com/0xdevalias/chatgpt-source-watch/issues/3#issuecomment-1962228912
And it also references a vscode 'moved code detection' feature that might also be useful to look at:
- https://github.com/0xdevalias/chatgpt-source-watch/issues/3#issuecomment-1962228912
-
- https://x.com/_devalias/status/1867736959747330065
-
@0xdevalias (Dec 14) Lol.. apparently also the git diff -> minimiser -> patch -> apply on old version -> diff again with colour method I mentioned somewhere the other day:
- https://github.com/0xdevalias/chatgpt-source-watch/issues/3#issuecomment-2073984761
-
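For reference, the simple heuristic mentioned in the thread above ("same line added and removed same number of times") could be sketched as below. This is only that heuristic, not git's --color-moved algorithm (which looks at blocks of lines and has its own whitespace modes); the function name is hypothetical.

```python
from collections import Counter


def drop_balanced_lines(diff_lines: list[str]) -> list[str]:
    """Remove added/removed lines whose (whitespace-trimmed) text is removed
    exactly as many times as it is added across the whole diff."""

    def is_change(line: str, sign: str) -> bool:
        # Skip the '---' / '+++' file headers. This also skips real changes
        # whose content starts with '--' or '++', which is a known gap.
        return line.startswith(sign) and not line.startswith(sign * 3)

    removed = Counter(l[1:].strip() for l in diff_lines if is_change(l, "-"))
    added = Counter(l[1:].strip() for l in diff_lines if is_change(l, "+"))
    balanced = {
        text for text, count in removed.items()
        if text and added.get(text) == count
    }
    return [
        l for l in diff_lines
        if not (
            (is_change(l, "-") or is_change(l, "+"))
            and l[1:].strip() in balanced
        )
    ]


diff = [
    "--- a/f.js",
    "+++ b/f.js",
    "@@ -1,4 +1,4 @@",
    "-const a = 1;",
    " keep();",
    "+  const a = 1;",
    "-old();",
    "+new();",
]
print(drop_balanced_lines(diff))
```

The result is only meant for reading: the hunk headers are left untouched, so the output is no longer a patch that can be applied.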
Some more Twitter chat with @michaelskyba, talking about diff minimisation:
- https://x.com/sucralose__/status/1866270226951688481
-
@michaelskyba (Dec 10) hhh I had a seemingly robust setup from iterating with o1 but the diff was still too large. I thought the identifier differences might be fairly limited and detectable after all, but I still don't know the pattern yet. So for now I'm going with the simple, general normalization
-
- https://x.com/_devalias/status/1866629461061537982
-
@0xdevalias (Dec 11) Instead of trying to figure identifiers by regex/etc, I went with AST parsing of the diff lines in my diff minifier. I first tried using a 'strict' parser (babel), and coming up with a bunch of hacks to 'fix' the diff lines to make them parse; but that was super hacky.
-
- https://x.com/_devalias/status/1866629664518836679
-
@0xdevalias (Dec 11) I then tried doing it with less strict parser, and that seemed to work without needing any hacks.
It's possible I could also maybe just do it using the tokenizer part and not even need the parser, but I don't think I ever tested that part deeply.
-
- https://x.com/_devalias/status/1866630342742061226
-
@0xdevalias (Dec 11) Obviously there's this old issue of mine: https://github.com/0xdevalias/chatgpt-source-watch/issues/3
And this of yours: https://github.com/0xdevalias/chatgpt-source-watch/issues/10
Both of which I believe contain some of the notes/thoughts around this process, eg. from a quick search, this comment: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2022019229
-
- https://x.com/_devalias/status/1866630553367351470
-
@0xdevalias (Dec 11) And this comment seems to link to where I uploaded my PoC scripts: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2024566283
https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser-poc-acorn.js
https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser.js
-
- https://x.com/sucralose__/status/1866847921661956247
-
@michaelskyba (Dec 12) Hmm, even if hacky, the approach is interesting. It's good design on acorn's part that it separates its different parsing modes like this; if diff did the same, the moved lines interpretation to wrappers or jsdiff etc. would be simpler
-
- https://x.com/_devalias/status/1867739336651026611
-
@0xdevalias (Dec 14) nods true. Though acorn is designed as a library to be consumed by others, whereas git's diff internals aren't really.
-
- https://x.com/_devalias/status/1867739549176303779
-
@0xdevalias (Dec 14) I suspect if you dug into the c code there would likely be aspects you could rip out and sort of reuse as a library.. but at that point may as well reimplement the good bits; which ties back into our other thread:
- https://x.com/_devalias/status/1866633367560872105
-