
Poor performance with large files

pamburus opened this issue 10 months ago • 4 comments

I am experiencing extremely poor performance with large assembly source files. It takes 4 minutes for bat to process a 45-MiB assembly source file.

⇛ time bat target/release/deps/bench-f36188efc4eb690a.s --color always >/dev/null
bat target/release/deps/bench-f36188efc4eb690a.s --color always > /dev/null  237.31s user 10.16s system 102% cpu 4:00.49 total

⇛ lsd -lah target/release/deps/bench-f36188efc4eb690a.s
.rw-r--r-- pamburus staff 45 MB Sat Feb  8 18:30:04 2025 target/release/deps/bench-f36188efc4eb690a.s

⇛ sysctl machdep.cpu
machdep.cpu.cores_per_package: 10
machdep.cpu.core_count: 10
machdep.cpu.logical_per_package: 10
machdep.cpu.thread_count: 10
machdep.cpu.brand_string: Apple M1 Max

The average processing speed works out to about 190 KiB per second (45 MiB in ~240 s ≈ 192 KiB/s).

pamburus avatar Feb 08 '25 17:02 pamburus

Related: https://github.com/sharkdp/bat/issues/304#issuecomment-420416969

keith-hall avatar Feb 09 '25 07:02 keith-hall

Yes, this issue seems to be related, but the reasoning about performance bottlenecks in #304 does not seem to match reality.

Here is the reality (profiler screenshot; key figures summarized below):

About 92% of the time is spent in syntect::easy::HighlightLines::highlight_line, which in turn spends about 97% of that time in parsing.

Only about 3% of the time is spent in write_fmt, so buffering the output will hardly make any visible difference at this point.

It looks like this performance problem needs to be redirected upstream to syntect. It will probably not be easy to fix.
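
For reference, the hot call corresponds to syntect's documented per-line highlighting loop. Here is a minimal standalone sketch that reproduces the workload (this is not bat's actual code; the input path bench.s and the theme name are placeholders):

use syntect::easy::HighlightLines;
use syntect::highlighting::{Style, ThemeSet};
use syntect::parsing::SyntaxSet;
use syntect::util::{as_24_bit_terminal_escaped, LinesWithEndings};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let ps = SyntaxSet::load_defaults_newlines();
    let ts = ThemeSet::load_defaults();
    let syntax = ps
        .find_syntax_by_extension("s")
        .unwrap_or_else(|| ps.find_syntax_plain_text());
    let mut h = HighlightLines::new(syntax, &ts.themes["base16-ocean.dark"]);
    let source = std::fs::read_to_string("bench.s")?;
    for line in LinesWithEndings::from(&source) {
        // highlight_line re-runs the regex-based parser on each line;
        // per the profile above, this call accounts for ~92% of the time.
        let ranges: Vec<(Style, &str)> = h.highlight_line(line, &ps)?;
        print!("{}", as_24_bit_terminal_escaped(&ranges[..], false));
    }
    Ok(())
}

Profiling this loop in isolation should show the same parsing-dominated picture without any of bat's own machinery.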

pamburus avatar Feb 09 '25 11:02 pamburus

As an experiment, I tried replacing the output writer with std::io::sink. Surprisingly, this reduced the total processing time by 13% rather than the ~3% the profile suggested, but the bottleneck remains the same.
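
For anyone who wants to repeat the experiment, the idea is simply to swap the writer, roughly like this (a sketch, not bat's actual writer setup; the loop body stands in for highlighted output):

use std::io::{self, Write};

fn main() -> io::Result<()> {
    let to_sink = true; // flip to false to write to stdout for comparison
    let mut out: Box<dyn Write> = if to_sink {
        Box::new(io::sink()) // discards everything; formatting still runs
    } else {
        Box::new(io::stdout().lock())
    };
    for i in 0..1_000_000u32 {
        // Stand-in for a highlighted line with ANSI color escapes.
        writeln!(out, "\x1b[38;2;200;200;200mline {}\x1b[0m", i)?;
    }
    Ok(())
}

The difference between the two timings bounds how much the output path can possibly contribute.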

pamburus avatar Feb 09 '25 12:02 pamburus

I'm dealing with log files of around 30 MB and see similar, though not as bad, performance (probably ~10 s). I do recall that when these files were rotated at 10 MB it was a lot more tolerable (maybe 1-2 s). That's triple the size for roughly five to ten times the time, so there seems to be some kind of n^2 scaling coefficient here.

My current workaround is just not to use bat when the file is too big (I don't really need it; it was just being used as a catch-all pager), and I set the cutoff to 2 MB. E.g., my lf config:

cmd open ${{
    set -f
    exec 2>> ~/.lf_open.debug  # keep the trace
    set -x                     # remove when done

    mime=$(file --mime-type -Lb "$f")        # e.g. text/plain, application/octet-stream
    size=$(stat -c%s "$f" 2>/dev/null || stat -f%z "$f")

    # For regular files: page big files with plain less, otherwise let bat highlight.
    view_file() {
        if [ "${size:-0}" -gt 2097152 ]; then   # 2 MiB cutoff
            less -R "$1"
        else
            bat --paging=always "$1"
        fi
    }

    # Same, but reading from stdin (e.g. decompressed streams); "$1" only labels the output.
    view_stream() {
        if [ "${size:-0}" -gt 2097152 ]; then
            less -R
        else
            bat --paging=always --file-name "$1"
        fi
    }

    case "$mime" in
        application/gzip)
            gzip -dc "$f" | view_stream "$f" ;;
        application/brotli|application/x-brotli)
            brotli -dc "$f" | view_stream "$f" ;;
        application/json)
            view_file "$f" ;;
        text/*)
            view_file "$f" ;;
        *) # Default case for binary or other non-text files
            case "$f" in
                *.br)
                    brotli -dc "$f" | view_stream "$f" ;;
                *)
                    hexyl "$f" | less -R ;;
            esac ;;
    esac
}}

Now the point I want to emphasize is this:

bat does appear to give up on syntax highlighting after the first few pages. Presumably this is in service of performance, but then the ball is totally dropped on performance anyway, because processing time shoots off to Neptune starting at around the 35 MB mark, and we can see above that the work it's doing is highlighting-related.

unphased avatar Jul 10 '25 11:07 unphased