range format
https://github.com/astral-sh/ruff/pull/9635
What's your idea about range specification (both API and CLI)?
A straightforward choice is to specify offsets in bytes only, but that is not friendly to CLI users.
Here are some examples from other formatters.
Prettier:
| Default | CLI Override | API Override |
|---|---|---|
| `0` | `--range-start <int>` | `rangeStart: <int>` |
| `Infinity` | `--range-end <int>` | `rangeEnd: <int>` |
Editor options:
--cursor-offset <int> Print (to stderr) where a cursor at the given position would move to after formatting.
Defaults to -1.
--range-end <int> Format code ending at a given character offset (exclusive).
The range will extend forwards to the end of the selected statement.
Defaults to Infinity.
--range-start <int> Format code starting at a given character offset.
The range will extend backwards to the start of the first line containing the selected statement.
Defaults to 0.
Clang-Format
--length=<uint> - Format a range of this length (in bytes).
Multiple ranges can be formatted by specifying
several -offset and -length pairs.
When only a single -offset is specified without
-length, clang-format will format up to the end
of the file.
Can only be used with one input file.
--lines=<string> - <start line>:<end line> - format a range of
lines (both 1-based).
Multiple ranges can be formatted by specifying
several -lines arguments.
Can't be used with -offset and -length.
Can only be used with one input file.
--offset=<uint> - Format a range starting at this byte offset.
Multiple ranges can be formatted by specifying
several -offset and -length pairs.
Can only be used with one input file.
--sort-includes - If set, overrides the include sorting behavior
determined by the SortIncludes style flag
Besides, could we offer an option to sort imports?
ruff formatter
Editor options:
--range <RANGE>
When specified, Ruff will try to only format the code in the given range.
It might be necessary to extend the start backwards or the end forwards, to fully enclose a logical line.
The `<RANGE>` uses the format `<start_line>:<start_column>-<end_line>:<end_column>`.
- The line and column numbers are 1 based.
- The column specifies the nth-unicode codepoint on that line.
- The end offset is exclusive.
- The column numbers are optional. You can write `--range=1-2` instead of `--range=1:1-2:1`.
- The end position is optional. You can write `--range=2` to format the entire document starting from the second line.
- The start position is optional. You can write `--range=-3` to format the first three lines of the document.
The option can only be used when formatting a single file. Range formatting of notebooks is unsupported.
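To make the `<start_line>:<start_column>-<end_line>:<end_column>` syntax concrete, here is a minimal sketch of a parser for it. It is not ruff's actual implementation: the names `Position`, `LineColumnRange`, `parse_position`, and `parse_range` are hypothetical, and it assumes that an omitted column defaults to 1 (as the `--range=1-2` example suggests) and that an omitted end means "to the end of the document".

```rust
/// A 1-based line/column position; `column` counts Unicode code points.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Position {
    pub line: usize,
    pub column: usize,
}

/// A parsed range; `end == None` stands for "until the end of the document".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LineColumnRange {
    pub start: Position,
    pub end: Option<Position>,
}

/// Parses `<line>` or `<line>:<column>`; a missing column defaults to 1.
fn parse_position(s: &str) -> Result<Position, String> {
    let (line_str, column_str) = match s.split_once(':') {
        Some((l, c)) => (l, Some(c)),
        None => (s, None),
    };
    let line: usize = line_str
        .parse()
        .map_err(|_| format!("invalid line number: {:?}", line_str))?;
    let column: usize = match column_str {
        Some(c) => c
            .parse()
            .map_err(|_| format!("invalid column number: {:?}", c))?,
        None => 1,
    };
    if line == 0 || column == 0 {
        return Err("line and column numbers are 1-based".to_string());
    }
    Ok(Position { line, column })
}

/// Parses `start-end`, where either side may be omitted.
pub fn parse_range(s: &str) -> Result<LineColumnRange, String> {
    let (start_str, end_str) = match s.split_once('-') {
        Some((a, b)) => (a, Some(b)),
        None => (s, None),
    };
    // A missing start means "from the beginning of the document".
    let start = if start_str.is_empty() {
        Position { line: 1, column: 1 }
    } else {
        parse_position(start_str)?
    };
    let end = match end_str {
        Some(e) if !e.is_empty() => Some(parse_position(e)?),
        _ => None,
    };
    Ok(LineColumnRange { start, end })
}
```

With this sketch, `parse_range("1-2")` and `parse_range("1:1-2:1")` yield the same value, while `parse_range("0:1-2")` is rejected because positions are 1-based.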
Sorry for the late reply.
> What's your idea about range specification (both API and CLI)?

We could start with the API only and not expose it to CLI users. The most important use case for range formatting should be LSP integration.
> A straightforward choice is to specify offsets in bytes only, but that is not friendly to CLI users.

For CLI usage, we could use an interface similar to those of clang-format and ruff. The line-and-column interface is much friendlier. But I really doubt whether CLI range formatting is useful.
Some questions about the implementation.
Take a piece of code from indenta as an example:
#let fix-indent(unsafe: false)={
return it=>{
let _is_block(e,fn)=fn==heading or (fn==math.equation and e.block) or (fn==raw and e.has("block") and e.block) or fn==figure or fn==block or fn==list.item or fn==enum.item or fn==table or fn==grid or fn==align or (fn==quote and e.has("block") and e.block)
// TODO: smallcaps returns styled(...)
let _is_inline(e,fn)=fn==text or fn==box or (fn==math.equation and not e.block) or (fn==raw and not (e.has("block") and e.block)) or fn==highlight or fn==overline or fn==smartquote or fn==strike or fn==sub or fn==super or fn==underline or fn==emph or fn==strong or fn==ref or (fn==quote and not (e.has("block") and e.block))
let st=2
for e in it.children{
let fn=e.func()
if fn==heading{
st=2
}else if _is_block(e,fn){
st=1
}else if st==1{
if e==parbreak(){st=2}
else if e!=[ ]{st=0}
}else if st==2 and not (_is_block(e,fn) or e==[ ] or e==parbreak()){
if unsafe or _is_inline(e,fn){context h(par.first-line-indent)}
st=0
}
e
}
}}
We select the line `let fn=e.func()` (whitespace matters).
1. How do we determine the node that matches our range?
Like prettier and ruff, the target node may span a larger range than the one given: we find the node with the smallest range that covers the given range.
2. Can whitespace be selected?
In prettier, if you select only a run of whitespace, nothing happens. In ruff, by contrast, formatting applies to the range that the whitespace lies in.
In the indenta example, if we do not ignore whitespace, we will format the entire `for` body instead of one line, since the leading and trailing whitespace is not counted as part of that let-binding.
However, we can still target just that statement by specifying a smaller range (described by columns).
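Points 1 and 2 can be illustrated together. The sketch below trims whitespace from both ends of the selection and then searches for the deepest node whose span still covers it. The `Node` type is a hypothetical stand-in for a real syntax tree, and `trim_range` assumes the given byte range lies on char boundaries.

```rust
use std::ops::Range;

/// A toy syntax node: a byte span plus child nodes. (Hypothetical; a real
/// formatter would walk its parser's syntax tree instead.)
#[derive(Debug)]
pub struct Node {
    pub kind: &'static str,
    pub span: Range<usize>,
    pub children: Vec<Node>,
}

/// Shrinks `range` so that it neither starts nor ends with whitespace.
/// An all-whitespace selection collapses to an empty range at its end.
pub fn trim_range(src: &str, range: Range<usize>) -> Range<usize> {
    let text = &src[range.clone()];
    let start = range.start + (text.len() - text.trim_start().len());
    let end = range.end - (text.len() - text.trim_end().len());
    start..end.max(start)
}

/// Returns the deepest node whose span fully contains `range`,
/// or `None` if even `node` itself does not cover it.
pub fn smallest_covering<'a>(node: &'a Node, range: &Range<usize>) -> Option<&'a Node> {
    if node.span.start > range.start || node.span.end < range.end {
        return None;
    }
    node.children
        .iter()
        .find_map(|child| smallest_covering(child, range))
        .or(Some(node))
}
```

If the whole line is selected, including its indentation and line break, the untrimmed range is only covered by the enclosing block. After `trim_range`, the let-binding itself becomes the smallest covering node.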
3. What is the indentation of the formatted code once it is placed back into the full source?
Since formatting can change the indentation level (e.g., chained access, parentheses), we cannot determine it precisely from the given part alone. Fortunately, unlike in Python, indentation does not affect semantics here, but the result should still look good.
There may be two options:
- Follow the indentation of its container. In this case, that is the 2 spaces of `for`. Tricky to implement.
- Estimate it conservatively: just add one indentation level when entering a content/code block. Easy to implement, but may look bad.
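One rough way to approximate the first option is to reuse the leading whitespace of the line where the selected node starts and prepend it to every line of the formatted snippet after the first. This is only a sketch under that assumption: `line_indent` and `reindent` are hypothetical helpers, and the snippet is assumed to have been formatted as if it started at column 0.

```rust
/// Returns the leading spaces/tabs of the line that contains byte offset `pos`.
pub fn line_indent(src: &str, pos: usize) -> &str {
    let line_start = src[..pos].rfind('\n').map_or(0, |i| i + 1);
    let line = &src[line_start..];
    let trimmed = line.trim_start_matches(|c: char| c == ' ' || c == '\t');
    &line[..line.len() - trimmed.len()]
}

/// Prefixes every non-empty line of `formatted`, except the first, with `indent`.
/// The first line is left alone because it is spliced in right after the
/// indentation that already exists in the source. A trailing line break in
/// `formatted` is dropped.
pub fn reindent(formatted: &str, indent: &str) -> String {
    let mut out = String::new();
    for (i, line) in formatted.lines().enumerate() {
        if i > 0 {
            out.push('\n');
            if !line.is_empty() {
                out.push_str(indent);
            }
        }
        out.push_str(line);
    }
    out
}
```

This keeps the snippet aligned with its surroundings when the rest of the file already uses the formatter's indentation width. It does not help when the original file uses a different width.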
The partially formatted code can be:
for e in it.children{
let fn = e.func()
if fn == heading {
st = 2
} else if _is_block(e, fn) {
st = 1
} else if st == 1 {
if e == parbreak() { st = 2 } else if e != [ ] { st = 0 }
} else if st == 2 and not (
_is_block(e, fn) or e == [ ] or e == parbreak()
) {
if unsafe or _is_inline(e, fn) { context h(par.first-line-indent) }
st = 0
}
e
}
And what's your idea about testing? I think a few (<=5) source files are enough, but for each file, we should give sufficient ranges (how to design?). Of course, various column widths are required.
For 1 and 2, we can simplify the problem in the initial implementation: we can assume that the user only selects full lines. This is "format changed lines" in VS Code.
For 3, indentation is hard and I don't have a good idea yet. For instance, if the original code uses tab=4, range formatting will never give a satisfying result.
For testing, I think we can focus on the logic specific to range formatting, since the general formatting capabilities are already well tested by other test cases.
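The full-line simplification for points 1 and 2 could be sketched as follows: whatever byte range comes in is widened to whole lines before any node lookup. `expand_to_full_lines` is a hypothetical helper, and it assumes the incoming range lies within the source and on char boundaries.

```rust
use std::ops::Range;

/// Widens `range` to cover whole lines, excluding the final line break.
pub fn expand_to_full_lines(src: &str, range: Range<usize>) -> Range<usize> {
    // Move the start back to just after the previous line break (or to 0).
    let start = src[..range.start].rfind('\n').map_or(0, |i| i + 1);
    // A non-empty selection ending right after a line break should not
    // pull in the following line, so probe from the line break itself.
    let probe = if range.end > range.start && src[..range.end].ends_with('\n') {
        range.end - 1
    } else {
        range.end
    };
    // Move the end forward to the next line break (or to the end of the source).
    let end = src[probe..].find('\n').map_or(src.len(), |i| probe + i);
    start..end
}
```

Selecting only `fn` inside a line, or the line together with its trailing line break, both resolve to the same full-line range.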
> For 1 and 2, we can simplify the problem in the initial implementation: we can assume that the user only selects full lines. This is "format changed lines" in VS Code.

So, will the leading and trailing whitespace in the selected line matter?
That's tricky. BTW, for the VS Code "format changed lines" function, I guess we need to make sure we only change the selected range and nothing outside it. (Or do we?)
Range specification and error handling
We may specify a byte range or a row-column range. In a row-column range, the column may be omitted. There are some possible errors:
- Char boundary error: An endpoint of the specified range is not on a valid char boundary.
- Source location error: The specified range does not lie within the source. We may just use the valid subrange and silence the error.
- Node error: We fail to find a node covering the given range. (This case may not exist.)
- Syntax error: We'd better return a `Result` instead of a clone of the original string to indicate the problem.
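The error kinds listed above could be modeled roughly as below. This is a sketch only: `FormatError` and `resolve_byte_range` are hypothetical names, not an existing API. It clamps out-of-source ranges silently, as suggested for the source location case, and reports char boundary violations.

```rust
use std::ops::Range;

/// Hypothetical error kinds for range formatting.
#[derive(Debug, PartialEq, Eq)]
pub enum FormatError {
    /// An endpoint of the byte range falls inside a multi-byte character.
    CharBoundary { offset: usize },
    /// No syntax node covers the requested range.
    NoCoveringNode,
    /// The source (or the selected subtree) contains a syntax error.
    Syntax,
}

/// Validates a byte range against `src`.
/// Out-of-source parts are clamped silently; bad char boundaries are errors.
pub fn resolve_byte_range(src: &str, range: Range<usize>) -> Result<Range<usize>, FormatError> {
    let end = range.end.min(src.len());
    let start = range.start.min(end);
    for offset in [start, end] {
        if !src.is_char_boundary(offset) {
            return Err(FormatError::CharBoundary { offset });
        }
    }
    Ok(start..end)
}
```

For example, with the source `é=1`, the range `1..3` is rejected because offset 1 lies inside the two-byte `é`, while `0..100` is clamped to `0..4`.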
When there is a syntax error, returning the original text as the formatting result can be meaningless. For in-place formatting in the CLI, we can simply leave the file untouched in this case instead of rewriting it with nothing changed. For library users and the tinymist integration, it is better to make users aware that errors exist, just like prettier, whose status bar item turns red in VS Code when it encounters errors.
In partial formatting, we only need to consider whether the selected subtree is error-free.
Makes sense to me. We might want to change our public API, switching `String` to `Result<String>` 🤔
It is sensible to describe the range by Unicode char index (from 1) instead of byte index, which is consistent with VS Code and the other formatters mentioned before. This way, we need not worry about char boundaries. Also, there should be no extra error kind for partial formatting.
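Converting a 1-based char index into a byte offset is straightforward with `str::char_indices`. The sketch below uses a hypothetical helper, `char_index_to_byte`, and assumes that the index one past the last char denotes the end of the source, so that exclusive range ends can be expressed.

```rust
/// Maps a 1-based Unicode char index to a byte offset in `src`.
/// The index `char_count + 1` maps to `src.len()`; anything else
/// outside the source (including 0) yields `None`.
pub fn char_index_to_byte(src: &str, index: usize) -> Option<usize> {
    let zero_based = index.checked_sub(1)?;
    src.char_indices()
        .map(|(byte, _)| byte)
        .chain(std::iter::once(src.len()))
        .nth(zero_based)
}
```

Because every offset produced this way is the start of a char or the end of the string, the result always lies on a char boundary.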