Refactor `visitAllIdentifiers` to leverage `babel`'s built in scope tracking better
This might be my own naive misunderstanding of how visitAllIdentifiers works currently, but from a quick skim through it feels kind of convoluted in how it's using findScopes, hasVisited, markVisited, etc; when, if I remember correctly, babel's AST parser + @babel/traverse has built in support for scope tracking/renames/etc that I believe make most of this redundant?
- https://github.com/jehna/humanify/blob/main/src/plugins/local-llm-rename/visit-all-identifiers.ts
I haven't tested this specifically, but after a few rounds of iteration with GitHub Copilot, this is feeling pared down/closer to how I would have expected this feature to have been implemented, leveraging more of Babel's scope tracking/etc:
import { parseAsync, transformFromAstAsync, NodePath } from "@babel/core";
import * as babelTraverse from "@babel/traverse";
import { Identifier, toIdentifier } from "@babel/types";
const traverse: typeof babelTraverse.default.default = (
typeof babelTraverse.default === "function"
? babelTraverse.default
: babelTraverse.default.default
) as any; // eslint-disable-line @typescript-eslint/no-explicit-any -- This hack is because pkgroll fucks up the import somehow
type Visitor = (name: string, scope: string) => Promise<string>;
export async function visitAllIdentifiers(
code: string,
visitor: Visitor,
contextWindowSize: number
) {
const ast = await parseAsync(code, { sourceType: "unambiguous" });
if (!ast) {
throw new Error("Failed to parse code");
}
traverse(ast, {
Identifier: async (path) => {
const surroundingCode = await scopeToString(path, contextWindowSize);
const renamed = await visitor(path.node.name, surroundingCode);
if (renamed !== path.node.name) {
let safeRenamed = toIdentifier(renamed);
while (
path.scope.hasBinding(safeRenamed) ||
path.scope.getProgramParent().hasBinding(safeRenamed)
) {
safeRenamed = `_${safeRenamed}`;
}
path.scope.rename(path.node.name, safeRenamed);
}
}
});
const stringified = await transformFromAstAsync(ast);
if (stringified?.code == null) {
throw new Error("Failed to stringify code");
}
return stringified.code;
}
async function scopeToString(
path: NodePath<Identifier>,
contextWindowSize: number
) {
const surroundingPath = closestSurroundingContextPath(path);
const code = `${surroundingPath}`; // Implements a hidden `.toString()`
if (code.length < contextWindowSize) {
return code;
}
if (surroundingPath.isProgram()) {
const start = path.node.start ?? 0;
const end = path.node.end ?? code.length;
if (end < contextWindowSize / 2) {
return code.slice(0, contextWindowSize);
}
if (start > code.length - contextWindowSize / 2) {
return code.slice(-contextWindowSize);
}
return code.slice(
start - contextWindowSize / 2,
end + contextWindowSize / 2
);
} else {
return code.slice(0, contextWindowSize);
}
}
function closestSurroundingContextPath(
path: NodePath<Identifier>
): NodePath<Node> {
const programOrBindingNode = path.findParent(
(p) => p.isProgram() || path.node.name in p.getOuterBindingIdentifiers()
)?.scope.path;
return programOrBindingNode ?? path.scope.path;
}
That said, I could be missing/misunderstanding some nuance of how it's currently implemented that isn't handled properly in this new version, etc.
@babel/traverse has built in support for scope tracking/renames/etc that I believe make most of this redundant?
All the magic happens in smallestScope.scope.rename(smallestScopeNode.name, safeRenamed);, the rest of the code is only used for the LLM to get some context of how the variable is used
All the magic happens in
smallestScope.scope.rename(smallestScopeNode.name, safeRenamed);
@j4k0xb Hrmm, fair.
Skimming through the logic:
visitAllIdentifiers- calls
babel'sparseAsyncto get the AST - creates
Set's forrenames/visited - calls
findScopestraverse's the AST and for eachBindingIdentifiernode, callsclosestSurroundingContextPathand calculates thepathSize, then stores that inscopesas[nodePath: NodePath<Identifier>, scopeSize: number][]- sorts the
scopesby theirpathSize - then
.map's them to just theirnodePath
- extracts
numRenamesExpectedfromscopes.length - loops through
scopes- checks
hasVisited - gets the
surroundingCodewithscopeToString - calls the passed in
visitorfunction to get therenamedvariable - does some
safeRenamedchecking - calls
smallestScope.scope.rename - calls
markVisited - outputs the
onProgress
- checks
- calls
It's been a while since I was looking at this PoC code of mine, so I might be misremembering it, but in this one I was using Scopable with babel.traverse to output all of the scopes:
- https://github.com/0xdevalias/poc-ast-tools/blob/05a34a9e74be6e686a78a49a02b6750ce8796e01/babel_v1_0_scope_debug.js#L68-L72
-
babel.traverse(ast, { Scopable(path) { debugLogScopeBindings({ path }) } }); - https://babeljs.io/docs/babel-types
- https://babeljs.io/docs/babel-types#aliases
- https://babeljs.io/docs/babel-types#scopable
-
A cover of
FunctionParentandBlockParent. - https://babeljs.io/docs/babel-types#functionparent
-
A cover of AST nodes that start an execution context with new
VariableEnvironment. In other words, they define the scope ofvardeclarations.FunctionParentdid not includeProgramsince Babel 7.
-
- https://babeljs.io/docs/babel-types#blockparent
-
A cover of AST nodes that start an execution context with new
LexicalEnvironment. In other words, they define the scope ofletandconstdeclarations.
-
-
- https://babeljs.io/docs/babel-types#scopable
- https://babeljs.io/docs/babel-types#aliases
-
And there was also:
path.scope.bindingspath.scope.getAllBindings()
And the scopes seemed to have uid's and types as well:
binding.scope.uidbinding.scope.path.type
And path.scope.dump() was useful for debugging too:
- https://github.com/0xdevalias/poc-ast-tools/blob/05a34a9e74be6e686a78a49a02b6750ce8796e01/babel_v1_3.js#L146-L180
I can't remember off hand if there was a way to just directly access all of the scopes/bindings from the original parsed AST without even needing to traverse it.
I'd need to dig deeper into it to be certain, but my immediate gut feel here is that the current code here in humanify feels closer to my original explorations in my own code, where I was doing a lot of reinventing the wheel to try and process scope/rename things that babel already handled for me.
Like here is an earlier iteration where I was manually ensuring I only processed binding renames once:
- https://github.com/0xdevalias/poc-ast-tools/blob/05a34a9e74be6e686a78a49a02b6750ce8796e01/babel_v1_1.js#L55-L111
And yet in later versions I switched away from that because there was no need to:
- https://github.com/0xdevalias/poc-ast-tools/blob/05a34a9e74be6e686a78a49a02b6750ce8796e01/babel_v1_3_clean.js#L353-L400
Like if you take all of the debug logging out of that, I think the main implementation ends up being:
babel.traverse(ast, {
Scopable(path) {
Object.entries(path.scope.bindings).forEach(([name, binding]) => {
const newName = generateNewName(path, binding, name);
if (name !== newName) {
path.scope.rename(name, newName);
}
});
},
});
(Then obviously add back in the 'safe rename' checks using path.scope.hasBinding(newName)/similar)
See also:
- https://github.com/jamiebuilds/babel-handbook/blob/master/translations/en/plugin-handbook.md#scopes
- https://github.com/jamiebuilds/babel-handbook/blob/master/translations/en/plugin-handbook.md#toc-scope
- https://github.com/jamiebuilds/babel-handbook/blob/master/translations/en/plugin-handbook.md#generating-a-uid
path.scope.generateUidIdentifierpath.scope.generateUidIdentifierBasedOnNodepath.scope.renamepath.scope.rename("n", "x")-
Alternatively, you can rename a binding to a generated unique identifier
path.scope.rename("n")
- https://github.com/jamiebuilds/babel-handbook/blob/master/translations/en/plugin-handbook.md#generating-a-uid
Some tangentially related notes I wrote on email recently:
it is very slow due to the babel traverse
it takes ~90s on my machine locally with
phi-4-mini-iq3-xxswhen i run it through humanify (which uses@babel/traverse), as opposed to just running it through direct chat locally (prompt: "hey make this readable") takes like 5s. so i know@babel/traverseand thevisitAllIdentifiersfn is slowing things downSo I’m not certain about this off the top of my head; but I suspect that it might not be babel/traverse or the AST parsing stuff that makes things slow; but more that I believe the calls to the LLM are happening within that AST traversal, and chunks of the code are uploaded individually/sequentially; and blocking till that LLM call returns. I believe there’s another issue on the humanify repo (see https://github.com/jehna/humanify/issues/167) that talks a bit about parallelising things.
So I would first see if just stubbing out the actual LLM call part of that gives you the speedups you were hoping to see as a first test. If it’s still too slow even without that part; then my next guess would be that the code within that traverse that tries to decide what/how much context to send to the LLM may not be as efficient as it could be. I wasn’t necessarily thinking about speed when I opened the issue related to cleaning up that bit of the code; but just more that it seemed more complex than my gut feel said it needed to be; and so I found it harder to reason about than I should have. So it wouldn’t surprise me if there might be a performance degradation within that bit.
When you say it takes like 90s with traverse compared to like 5s directly in chat; how big is the code sample you’re providing? Is it like a real world web app, or a tiny snippet? I believe one of the reasons the code is parsed and traversed as it currently is is so that we can control how much context is given to the LLM at any one time; and not confuse/dilute it/overflow its context window/etc.
Another random guess as to something that might potentially impact the speed of things running through the traverse code vs standalone chat is if the humanify code isn’t running the model on the GPU for some reason, but is falling back to CPU/similar.
It’s sort of hard to guess/reason about things beyond that at this stage. Sort of feels like you’d need to try some of the things to narrow stuff down more/get more insight into exactly what part of it is making it slow/etc.
Any thought given to using the TypeScript Compiler API/LSP?
courtesy of ChatGPT
import * as vscode from 'vscode';
import ts from 'typescript';
import * as fs from 'fs';
export function activate(context: vscode.ExtensionContext) {
const disposable = vscode.commands.registerCommand('tsRenameAllIdentifiers.renameAll', async () => {
const editor = vscode.window.activeTextEditor;
if (!editor) {
vscode.window.showErrorMessage('No active editor.');
return;
}
const document = editor.document;
if (document.languageId !== 'typescript' && document.languageId !== 'typescriptreact') {
vscode.window.showErrorMessage('Not a TypeScript file.');
return;
}
const sourceCode = document.getText();
const fileName = document.uri.fsPath;
// Parse the file with TypeScript
const sourceFile = ts.createSourceFile(fileName, sourceCode, ts.ScriptTarget.Latest, true);
const positionsToRename: { name: string, offset: number }[] = [];
const seen = new Set<string>();
// Visit nodes recursively
function visit(node: ts.Node) {
if (ts.isIdentifier(node)) {
const name = node.text;
if (seen.has(name)) return;
seen.add(name);
positionsToRename.push({ name, offset: node.getStart() });
}
ts.forEachChild(node, visit);
}
visit(sourceFile);
const editsToApply: vscode.WorkspaceEdit[] = [];
for (const { name, offset } of positionsToRename) {
const pos = document.positionAt(offset);
const newName = `renamed_${name}`;
try {
const edit = await vscode.commands.executeCommand<vscode.WorkspaceEdit>(
'vscode.executeDocumentRenameProvider',
document.uri,
pos,
newName
);
if (edit) {
editsToApply.push(edit);
}
} catch (err) {
console.warn(`Skipping ${name}:`, err);
}
}
// Apply all collected edits sequentially
for (const edit of editsToApply) {
await vscode.workspace.applyEdit(edit);
}
vscode.window.showInformationMessage(`Renamed ${positionsToRename.length} identifiers.`);
});
context.subscriptions.push(disposable);
}
export function deactivate() {}
Any thought given to using the TypeScript Compiler API/LSP?
@brianjenkins94 For what specifically?
Resurrecting the link from your edit history to try and get more context:
- https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocument_rename
-
Rename Request
-
The rename request is sent from the client to the server to ask the server to compute a workspace change so that the client can perform a workspace-wide rename of a symbol.
-
Babel's AST parsing already provides support for renaming identifiers, which is what humanify is currently using; so I'm not sure what benefits/differences you're thinking would be possible with this?
courtesy of ChatGPT
@brianjenkins94 While I appreciate the intent behind this, based on a quick skim through it, I don't believe it does anything new/different/better than what we're already able to (and doing) directly with Babel's own AST parsing + identifier renaming.
You may find this breakdown / explanation / Babel based example helpful to understand this deeper:
- https://chatgpt.com/share/683d14c2-1f54-8008-a4a9-fec8f6717f7a
The current implementation doesn't rename class or static fields so I thought leveraging the TypeScript API could yield better, more contextually-aware results.
The current implementation doesn't rename class or static fields
@brianjenkins94 Ah, ok. This would seem to be the more relevant issue for that, so your comment on this one was somewhat disconnected from that context:
- https://github.com/jehna/humanify/issues/183
While you may have already seen it, my original exploration into the AST was here:
- https://github.com/jehna/humanify/issues/183#issuecomment-2665020171
And this comment seemed to suggest that Babel's default scope tracking doesn't cover private class fields/methods, but linked to one potential solution using Babel's own babel-helper-create-class-features-plugin's privateNameVisitorFactory:
- https://github.com/jehna/humanify/issues/183#issuecomment-2666094327
I haven't deeply explored that option, nor whether TypeScript's LSP/etc would do better in this case.
But in any case, if that were to be explored deeper, the other issue would be a more appropriate place for that to be done/discussed than here.
Edit: I have copied across the relevant parts of this discussion to that issue:
- https://github.com/jehna/humanify/issues/183#issuecomment-2928555080