proc-macro2
Consider recalibrating how bits are divided in Span
Currently fallback spans store a pair of 32-bit low and high character indices.
https://github.com/dtolnay/proc-macro2/blob/fecb02df0e2966c7eb39e020dbf650fa8bafd0c5/src/fallback.rs#L491-L496
A span in which lo > hi is malformed, so right off the bat, approximately half of possible Span bit patterns are wasted.
Separately, tokens are usually small compared to the total amount of input parsed by a thread. If we switch to storing lo and hi - lo instead of lo and hi, then an even split of 32 bits each may not be the wisest allocation. For example, we could decide to give 36 bits to lo (supporting 64 GB input size) and 28 bits to hi - lo (limiting token size to 256 MB). Or some other uneven split.
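To make the uneven split concrete, here is a minimal sketch of a 64-bit span that stores lo in 36 bits and hi - lo in 28 bits. The names PackedSpan, LEN_BITS, MAX_LO and MAX_LEN are hypothetical and do not exist in proc-macro2; the 36/28 split is just the example from above, not a decided layout.

```rust
/// Hypothetical layout: the upper 36 bits hold `lo`, the lower 28 bits hold `hi - lo`.
const LEN_BITS: u32 = 28;
const LO_BITS: u32 = 64 - LEN_BITS;
const MAX_LO: u64 = (1 << LO_BITS) - 1; // largest representable start offset
const MAX_LEN: u64 = (1 << LEN_BITS) - 1; // largest representable token length

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PackedSpan(u64);

impl PackedSpan {
    /// Returns `None` if `lo > hi`, or if `lo` or `hi - lo` does not fit its field.
    fn new(lo: u64, hi: u64) -> Option<Self> {
        // Storing the length means a malformed span (lo > hi) has no encoding at all.
        let len = hi.checked_sub(lo)?;
        if lo > MAX_LO || len > MAX_LEN {
            return None;
        }
        Some(PackedSpan((lo << LEN_BITS) | len))
    }

    fn lo(self) -> u64 {
        self.0 >> LEN_BITS
    }

    fn hi(self) -> u64 {
        self.lo() + (self.0 & MAX_LEN)
    }
}
```

With this encoding every bit pattern decodes to a well-formed span, and PackedSpan::new(5_000_000_000, 5_000_000_010) round-trips even though both offsets exceed what a 32-bit index can hold.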
Rustc enforces a file size limit of 4 GB, so a token cannot be bigger than that.
use std::fs::File;
use std::io::Write as _;
fn main() {
    let buf = vec![b' '; 1024 * 1024];
    let mut file = File::create("spanoverflow.rs").unwrap();
    file.write_all(b"fn main() {\n").unwrap();
    for _ in 0..4100 {
        file.write_all(&buf).unwrap();
    }
    file.write_all(b"}\n").unwrap();
}
$ ls -lh spanoverflow.rs
-rw-r--r-- 1 dtolnay users 4.1G Oct 9 18:41 spanoverflow.rs
$ rustc spanoverflow.rs
fatal error: rustc does not support files larger than 4GB
There appears to be no limit on the total amount of text parsed by rustc, even though its internal representation for BytePos is 32 bits.
https://github.com/rust-lang/rust/blob/1.73.0/compiler/rustc_span/src/lib.rs#L2010-L2014
If you parse more than 2^32 bytes, it overflows and you get bogus spans referring to the wrong files.
use std::fs::File;
use std::io::Write as _;
fn main() {
    let buf = vec![b' '; 1024 * 1024];
    let mut file = File::create("spanoverflow.rs").unwrap();
    file.write_all(b"mod module;\n").unwrap();
    for _ in 0..2050 {
        file.write_all(&buf).unwrap();
    }
    file.write_all(b"fn main() {}\n").unwrap();
    let mut file = File::create("module.rs").unwrap();
    for _ in 0..2050 {
        file.write_all(&buf).unwrap();
    }
    file.write_all(b"pub fn f() {}\n").unwrap();
}
According to rustc -Zunpretty=ast-tree,expanded spanoverflow.rs, this is the location of the f function (wrong):
Item {
    attrs: [],
    id: NodeId(10),
    span: spanoverflow.rs:2:4194319: 2:4194332 (#0),
    ident: f#0,
    kind: Fn(
and this is the location of main (wrong):
Item {
    attrs: [],
    id: NodeId(12),
    span: /home/dtolnay/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/string.rs:2952:2144473580: 2952:2144473592 (#0),
    ident: main#0,
    kind: Fn(
The correct locations would be module.rs and spanoverflow.rs respectively, which you get if the files do not overflow 2^32 bytes total size.
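The mechanism can be illustrated with a toy model. This is a sketch under assumptions, not rustc's actual source map: ToySourceMap, add_file, lookup and truncate are hypothetical names, and the model only assumes that files are laid out at consecutive global offsets and that positions are truncated to 32 bits.

```rust
/// Hypothetical toy source map; not rustc's real data structure.
struct ToySourceMap {
    /// (file name, untruncated global start offset), sorted by start offset.
    starts: Vec<(String, u64)>,
    /// Global offset at which the next file will begin.
    next: u64,
}

impl ToySourceMap {
    fn new() -> Self {
        ToySourceMap { starts: Vec::new(), next: 0 }
    }

    /// Registers a file of `len` bytes and returns its global start offset.
    fn add_file(&mut self, name: &str, len: u64) -> u64 {
        let start = self.next;
        self.starts.push((name.to_owned(), start));
        self.next += len;
        start
    }

    /// Finds the last file whose start offset is <= `pos`.
    fn lookup(&self, pos: u32) -> Option<&str> {
        let pos = u64::from(pos);
        let idx = self.starts.partition_point(|(_, start)| *start <= pos);
        self.starts.get(idx.checked_sub(1)?).map(|(name, _)| name.as_str())
    }
}

/// Models storing a global offset in a 32-bit position: the high bits are discarded.
fn truncate(offset: u64) -> u32 {
    offset as u32
}
```

In this model, registering a 3 GiB a.rs followed by a 2 GiB b.rs places b.rs at global offset 3 << 30. A position 1 GiB + 100 bytes into b.rs has a true global offset just past 2^32, truncates to 100, and is resolved to a.rs, which is the same kind of misattribution as in the output above.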
For scale, currently there is 200 GB of Rust code published on crates.io. Looking at just the newest version of every crate, it is 16 GB of code. So a workload that involves parsing this, even on multiple threads, would currently hit overflow.