
Consider recalibrating how bits are divided in Span

Open dtolnay opened this issue 2 years ago • 3 comments

Currently fallback spans store a pair of 32-bit low and high character indices.

https://github.com/dtolnay/proc-macro2/blob/fecb02df0e2966c7eb39e020dbf650fa8bafd0c5/src/fallback.rs#L491-L496

A span in which lo > hi is malformed, so right off the bat, approximately half of possible Span bit patterns are wasted.

Separately, tokens are usually small compared to the total amount of input parsed by a thread. If we switch to storing lo and hi - lo instead of lo and hi, then an even split of 32 bits each may not be the wisest allocation. For example, we could decide to give 36 bits to lo (supporting 64 GB input size) and 28 bits to hi - lo (limiting token size to 256 MB). Or some other uneven split.
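As a rough illustration of the uneven split, here is a minimal sketch that packs `lo` and `hi - lo` into a single 64-bit value with 36 and 28 bits respectively. The `PackedSpan` type, its constants, and its methods are hypothetical names for this example, not anything that exists in proc-macro2.

```rust
/// Hypothetical packed span: the upper 36 bits hold `lo`,
/// the lower 28 bits hold the length `hi - lo`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PackedSpan(u64);

const LEN_BITS: u32 = 28;
const LO_BITS: u32 = 64 - LEN_BITS; // 36
const MAX_LO: u64 = (1 << LO_BITS) - 1;
const MAX_LEN: u64 = (1 << LEN_BITS) - 1;

impl PackedSpan {
    /// Returns None if `lo > hi`, or if either field does not fit.
    fn new(lo: u64, hi: u64) -> Option<Self> {
        let len = hi.checked_sub(lo)?;
        if lo > MAX_LO || len > MAX_LEN {
            return None;
        }
        Some(PackedSpan((lo << LEN_BITS) | len))
    }

    fn lo(self) -> u64 {
        self.0 >> LEN_BITS
    }

    fn len(self) -> u64 {
        self.0 & MAX_LEN
    }

    fn hi(self) -> u64 {
        self.lo() + self.len()
    }
}

fn main() {
    // A position past 4 GB, which a 32-bit `lo` could not represent.
    let span = PackedSpan::new(5_000_000_000, 5_000_000_010).unwrap();
    println!("lo = {}, hi = {}, len = {}", span.lo(), span.hi(), span.len());

    // A malformed span (lo > hi) has no encoding at all.
    assert!(PackedSpan::new(10, 5).is_none());
    // A token of 2^28 characters or more exceeds the length field.
    assert!(PackedSpan::new(0, 1 << 28).is_none());
}
```

Because the second field is a length rather than an endpoint, every bit pattern decodes to a well-formed span, so none of the encoding space is spent on the `lo > hi` case.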

dtolnay avatar Oct 10 '23 01:10 dtolnay

Rustc enforces a file size limit of 4 GB, so a token cannot be bigger than that.

use std::fs::File;
use std::io::Write as _;

fn main() {
    let buf = vec![b' '; 1024 * 1024];
    let mut file = File::create("spanoverflow.rs").unwrap();
    file.write_all(b"fn main() {\n").unwrap();
    for _ in 0..4100 {
        file.write_all(&buf).unwrap();
    }
    file.write_all(b"}\n").unwrap();
}

$ ls -lh spanoverflow.rs
-rw-r--r-- 1 dtolnay users 4.1G Oct  9 18:41 spanoverflow.rs

$ rustc spanoverflow.rs
fatal error: rustc does not support files larger than 4GB

dtolnay avatar Oct 10 '23 01:10 dtolnay

There appears to be no limit on the total amount of text parsed by rustc, even though its internal representation for BytePos is 32 bits.

https://github.com/rust-lang/rust/blob/1.73.0/compiler/rustc_span/src/lib.rs#L2010-L2014

If you parse more than 2^32 bytes, it overflows and you get bogus spans referring to the wrong files.

use std::fs::File;
use std::io::Write as _;

fn main() {
    let buf = vec![b' '; 1024 * 1024];

    let mut file = File::create("spanoverflow.rs").unwrap();
    file.write_all(b"mod module;\n").unwrap();
    for _ in 0..2050 {
        file.write_all(&buf).unwrap();
    }
    file.write_all(b"fn main() {}\n").unwrap();

    let mut file = File::create("module.rs").unwrap();
    for _ in 0..2050 {
        file.write_all(&buf).unwrap();
    }
    file.write_all(b"pub fn f() {}\n").unwrap();
}

According to rustc -Zunpretty=ast-tree,expanded spanoverflow.rs, this is the location of the f function (wrong):

                        Item {
                            attrs: [],
                            id: NodeId(10),
                            span: spanoverflow.rs:2:4194319: 2:4194332 (#0),
                            ident: f#0,
                            kind: Fn(

and this is the location of main (wrong):

        Item {
            attrs: [],
            id: NodeId(12),
            span: /home/dtolnay/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/string.rs:2952:2144473580: 2952:2144473592 (#0),
            ident: main#0,
            kind: Fn(

The correct locations would be module.rs and spanoverflow.rs respectively, which you get if the files do not overflow 2^32 bytes total size.
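To make the arithmetic concrete, here is a small sketch of what happens to a byte offset once the combined input exceeds 2^32 bytes. It assumes positions are assigned by laying the files end to end in one shared 32-bit address space. The helper names are hypothetical, and this models only the truncation, not rustc's actual source map logic.

```rust
/// Truncate a 64-bit byte offset to 32 bits, discarding the high bits.
fn wrapped(pos: u64) -> u32 {
    pos as u32
}

/// Checked alternative: refuse offsets that do not fit in 32 bits.
fn checked(pos: u64) -> Option<u32> {
    u32::try_from(pos).ok()
}

/// Combined size of the two files written by the generator program above.
fn total_input_bytes() -> u64 {
    const MIB: u64 = 1024 * 1024;
    let spanoverflow_rs =
        "mod module;\n".len() as u64 + 2050 * MIB + "fn main() {}\n".len() as u64;
    let module_rs = 2050 * MIB + "pub fn f() {}\n".len() as u64;
    spanoverflow_rs + module_rs
}

fn main() {
    let total = total_input_bytes();
    println!("total input bytes:   {}", total);
    println!("bytes beyond 2^32:   {}", total - (1u64 << 32));
    // The wrapped offset lands back near the start of the address space,
    // where some other, earlier file already lives.
    println!("end offset, wrapped: {}", wrapped(total));
    println!("end offset, checked: {:?}", checked(total));
}
```

Each file alone stays under the 4 GB per-file limit, but together they overshoot 2^32 by roughly 4 MiB, so offsets near the end of the input alias positions belonging to earlier files.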

dtolnay avatar Oct 10 '23 01:10 dtolnay

For scale, there are currently 200 GB of Rust code published on crates.io. Counting only the newest version of every crate, it is 16 GB of code. So a workload that involves parsing all of this, even spread across multiple threads, would currently hit overflow.

dtolnay avatar Oct 10 '23 02:10 dtolnay