Buffered reader is default for jsonl
Describe the bug
When using the jsonl source, larger files might exceed the standard bufio limit:
% sq add 131467comments.json
sq: bufio.Scanner: token too long
To Reproduce
Read in a large file.
Expected behavior
Ideally we should cope with larger lines.
sq version
Paste the output of sq version --yaml into the code block below:
# $ sq version --yaml
version: v0.48.3
commit: 3ae4104
timestamp: 2024-03-12T02:55:43Z
latest_version: v0.48.3
host:
platform: darwin
arch: amd64
kernel: Darwin
kernel_version: 23.3.0
variant: macOS
variant_version: 14.3.1
Source details
The source didn't load, so there's no further detail.
Logs
21:46:07.829955 INF files.(*Files).doCacheSweep files/cache.go:448 Swept cache dir pid=60415 dir=/Users/mc/Library/Caches/sq/60f4d1f1 deleted_dirs=0
21:46:07.830702 ERR cli.PrintError cli/error.go:55 EXECUTION FAILED pid=60415 err.msg="bufio.Scanner: token too long" err.type=*errz.errz err.cause=*errors.errorString cmd=add
*errz.errz: *errz.errz: *errors.errorString
bufio.Scanner: token too long
github.com/neilotoole/sq/libsq/files.(*Files).detectType
    /Users/runner/work/sq/sq/libsq/files/detect.go:180
github.com/neilotoole/sq/libsq/files.(*Files).DetectType
    /Users/runner/work/sq/sq/libsq/files/detect.go:73
github.com/neilotoole/sq/cli.execSrcAdd
    /Users/runner/work/sq/sq/cli/cmd_add.go:217
github.com/neilotoole/sq/cli.addCmd.func2
    /Users/runner/work/sq/sq/cli/cli.go:355
github.com/spf13/cobra.(*Command).execute
    /Users/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
    /Users/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1115
github.com/spf13/cobra.(*Command).Execute
    /Users/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1039
github.com/spf13/cobra.(*Command).ExecuteContext
    /Users/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1032
github.com/neilotoole/sq/cli.ExecuteWith
    /Users/runner/work/sq/sq/cli/cli.go:217
github.com/neilotoole/sq/cli.Execute
    /Users/runner/work/sq/sq/cli/cli.go:76
main.main
    /Users/runner/work/sq/sq/main.go:56
runtime.main
    /Users/runner/go/pkg/mod/golang.org/[email protected]/src/runtime/proc.go:271
runtime.goexit
    /Users/runner/go/pkg/mod/golang.org/[email protected]/src/runtime/asm_amd64.s:1695
*errz.errz: *errors.errorString
bufio.Scanner: token too long
github.com/neilotoole/sq/cli.FinishRunInit.DetectJSONL.func5
    /Users/runner/work/sq/sq/drivers/json/ingest_jsonl.go:79
github.com/neilotoole/sq/libsq/files.(*Files).detectType.func3
    /Users/runner/work/sq/sq/libsq/files/detect.go:161
golang.org/x/sync/errgroup.(*Group).Go.func1
    /Users/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
    /Users/runner/go/pkg/mod/golang.org/[email protected]/src/runtime/asm_amd64.s:1695
Screenshots
n/a
Additional context
When using bufio.NewScanner, the default maximum token size is 64K:
https://pkg.go.dev/bufio#Scanner.Buffer
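For illustration, raising that limit looks roughly like the following minimal sketch (the file name and buffer sizes here are made up, not what sq uses):

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	f, err := os.Open("large.jsonl") // hypothetical file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	// Grow from a 1MB initial buffer up to a 64MB max per token.
	// Buffer must be called before the first call to Scan.
	sc.Buffer(make([]byte, 1024*1024), 64*1024*1024)
	for sc.Scan() {
		_ = sc.Bytes() // process one JSONL line
	}
	if err := sc.Err(); err != nil {
		// Still reports "token too long" if a line exceeds the max.
		fmt.Fprintln(os.Stderr, err)
	}
}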
I have used the approach from this example in code before with success:
https://stackoverflow.com/questions/39859222/golang-how-to-overcome-scan-buffer-limit-from-bufio/66876416#66876416
However, in this part of the code the file isn't present, so we can't really get the info from file.Stat():
https://github.com/neilotoole/sq/blob/master/drivers/json/ingest_jsonl.go#L38
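For reference, the Stack Overflow approach sizes the scanner's max buffer from the file size, roughly like this sketch (scanLines is a hypothetical helper, not sq code; assumes the usual bufio and os imports). It needs an *os.File, which is exactly what's missing at that point in the code:

// Hypothetical sketch of the Stack Overflow approach; not sq code.
func scanLines(f *os.File) error {
	fi, err := f.Stat()
	if err != nil {
		return err
	}
	sc := bufio.NewScanner(f)
	// No single line can be longer than the whole file, so the file
	// size is always a sufficient max.
	sc.Buffer(make([]byte, 64*1024), int(fi.Size())+1)
	for sc.Scan() {
		_ = sc.Bytes() // process one line
	}
	return sc.Err()
}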
@markcurtis1970 When I wrote that ingest code using bufio.Scanner, in my soul I knew I'd have to answer to that 64K limit some day...
If you have a specific (cleansed) file you can share that breaks the ingester, it could be added as a test case.
I don't think this is an insurmountable problem. The Scanner is basically only being used for line breaks.
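One possible direction (a sketch only, not a committed design): since only line splitting is needed, bufio.Reader.ReadBytes does the same job and grows its buffer as needed, with no fixed token limit:

package main

import (
	"bufio"
	"bytes"
	"io"
	"os"
)

func main() {
	// Works with any io.Reader; no *os.File or Stat required.
	r := bufio.NewReader(os.Stdin)
	for {
		line, err := r.ReadBytes('\n')
		if len(bytes.TrimSpace(line)) > 0 {
			_ = line // process one JSONL line
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
	}
}

The trade-off is that ReadBytes will allocate as much memory as the longest line demands, so some configurable cap would still be sensible.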
Note, in general: that jsonl ingester code is pretty awful (and the other JSON ingesters too). There's been very little attempt at any sort of optimization (or even doing it right). The whole thing should probably be thrown away and rewritten.
I'm very open to a PR, and contributors in general, not just for this part of the codebase, but for the entire project.
...in my soul I knew I'd have to answer to that 64K limit some day...
Haha, yeah, I know that feeling! (I'm gonna have to circle back to this.)
Sure, thanks - I was going to try a PR, but there wasn't a file object to get the stat from, and my "go-fu" is intermediate at best ;-) otherwise I would love to help!
I'll see if I can sanitise that file in the meantime!
Maybe this solution will solve your problem: click
@markcurtis1970: The latest sq version v0.48.5 addresses this issue.
- The default buffer max size is significantly larger.
- There's a new config option, tuning.scan-buffer-limit, that allows the user to increase the buffer limit even higher. E.g.:
  $ sq config set tuning.scan-buffer-limit 64MB   # or 1024B, 64KB, 1GB, etc.
Can you give the new version a try, and re-open this issue if there's still trouble?
Thanks for your patience.
OK, I tested this with the original file that gave the issue:
% sq version
sq v0.48.5
% sq config set tuning.scan-buffer-limit 1B
% sq add ./131467comments.json
sq: detect: jsonl: bufio.Scanner: token too long: maybe adjust config 'tuning.scan-buffer-limit'
% sq config set tuning.scan-buffer-limit 64MB
% sq add ./131467comments.json
@h131467comments jsonl 131467comments.json
%
Looks good to me, thanks!
Excellent, thank you @markcurtis1970 🙏