roc
roc copied to clipboard
CSV Decoding in pure Roc
This PR adds a new example that demonstrates parsing in pure Roc.
The parser is implemented as a parser combinator library, split into three modules:
Parser.Corefor the main primitive building blocks.Parser.Strfor all parsers dealing withStrandList U8as input.Parser.Csvfor all parsers dealing with CSV as input and parsers parsing CSV fromStr.
There still is a lot which could be improved, both in the implementation itself as well as in code quality and documentation, but my hope is that the parsing logic is at least finished enough to be used in @rtfeldman 's upcoming presentation.
This lives in examples/csv for now, together with a main.roc which is currently based on the hello-world platform but this should be swapped out for one of the interactive platforms so it can read from e.g. stdin.
main.roc contains a parser which parsers the "movie info" data which Richard is intending to use in his presentation.
The CSV parser follows RFC 4180, almost to the letter. Currently with two exceptions:
- CSV headers (and extracting fields based on the header names) are not currently supported.
- It is allowed to use plain
\nas record separator instead of\r\n(the RFC requirement).
@Qqwy Hi!
I'd like to write a parser for plsql in Roc. I have seen this code and it looks solid starting point.
Before I dwelve into it, I thought to ping you in what actual state is this Roc code?
I do need to use your host.c?
Thanks! :)
@Qqwy This PR looks fun, thank you! I look forward to forking this example for a few data analysis projects I'm working on 😃
@rtfeldman did you use this in one of your presentations?
@popara No, you don't need the special host.c. You can just integrate this Roc code in your project as-is. :blush:
@Qqwy looks like all this needs is a test, right? Anything else blocking it? I'd love to have a parser in the repo. Happy to help out.
@Qqwy looks like all this needs is a test, right? Anything else blocking it? I'd love to have a parser in the repo. Happy to help out.
Yes, I'm pretty sure that tests are all that is missing. The reason they were not added before was because before the new changes to nested lambda set compilation, compilation of the example was too slow to continue working on it.
Help is very welcome! My time to work on Roc is currently rather limited I'm afraid.
Roc have come long way since then!
Sent from my iPhone
On 24 Sep 2022, at 15:34, Qqwy @.***> wrote:
@Qqwy looks like all this needs is a test, right? Anything else blocking it? I'd love to have a parser in the repo. Happy to help out.
Yes, I'm pretty sure that tests are all that is missing. The reason they were not added before was because before the new changes to nested lambda set compilation, compilation of the example was too slow to continue working on it.
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.
Ok I can make a test, and maybe rename the example directory to "parsers" so we can reuse the library and platform for other formats.
This sounds like a very good plan! :blush:
Anyone got any ideas why this won't pass on CI? It passes locally.
OK I got it to fail locally with cargo test --release -p roc_cli csv. It crashes in the surgical linker, because memmove isn't defined.
But why isn't it defined? Other libc stuff is linked. And how does building the Roc compiler in release mode affect this?
@folkertdev @bhansconnect
brian@acer:roc$ RUST_BACKTRACE=1 cargo test --release -p roc_cli csv
Finished release [optimized] target(s) in 0.42s
Running unittests src/lib.rs (target/release/deps/roc_cli-bfc8018b40a1baba)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Running tests/cli_run.rs (target/release/deps/cli_run-e5c952eb2fcb209b)
running 1 test
test cli_run::parse_movies_csv ... FAILED
failures:
---- cli_run::parse_movies_csv stdout ----
thread 'cli_run::parse_movies_csv' panicked at '`roc` command had unexpected stderr: An internal compiler expectation was broken.
This is definitely a compiler bug.
Please file an issue here: https://github.com/roc-lang/roc/issues/new/choose
thread 'main' panicked at 'Undefined Symbol in relocation, (+9ac3, Relocation { kind: PltRelative, encoding: Generic, size: +20, target: Symbol(SymbolIndex(+5b)), addend: +fffffffffffffffc, implicit_addend: false }): Ok(Symbol { name: "memmove", address: +0, size: +0, kind: Unknown, section: Undefined, scope: Unknown, weak: false, flags: Elf { st_info: +10, st_other: +0 } })', crates/linker/src/elf.rs:1279:33
stack backtrace:
0: rust_begin_unwind
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panicking.rs:143:14
2: roc_linker::elf::surgery_elf_help
3: roc_linker::link_preprocessed_host
4: roc_cli::build::build_file
5: roc_cli::build
6: roc::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
', crates/cli/tests/cli_run.rs:131:13
stack backtrace:
0: rust_begin_unwind
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panicking.rs:143:14
2: cli_run::cli_run::run_roc_on
3: cli_run::cli_run::check_output_with_stdin
4: core::ops::function::FnOnce::call_once
5: core::ops::function::FnOnce::call_once
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
failures:
cli_run::parse_movies_csv
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 41 filtered out; finished in 4.00s
error: test failed, to rerun pass '-p roc_cli --test cli_run'
my guess is that memmove needs to be added here
fn collect_roc_definitions<'a>(object: &object::File<'a, &'a [u8]>) -> MutMap<String, u64> {
let mut vaddresses = MutMap::default();
for sym in object.symbols().filter(is_roc_definition) {
// remove potentially trailing "@version".
let name = sym
.name()
.unwrap()
.trim_start_matches('_')
.split('@')
.next()
.unwrap();
let address = sym.address() as u64;
// special exceptions for memcpy and memset.
if name == "roc_memcpy" {
vaddresses.insert("memcpy".to_string(), address);
} else if name == "roc_memset" {
vaddresses.insert("memset".to_string(), address);
}
vaddresses.insert(name.to_string(), address);
}
vaddresses
}
in the linker code. Optimization is relevant because "normal" code can be turned into a memmove by the optimizer.
It's not that simple though because we don't have roc_memset (should we?). But this is where we do some suspicious stuff with libc functions in the linker, so it's the first place I'd look.
I can look at this in more detail tomorrow.
Optimization is relevant because "normal" code can be turned into a memmove by the optimizer.
Hmm. Maybe I'm missing something. I can certainly see how optimization of the Roc app would be relevant. But my understanding is that cargo test --release gets us a release build of the Roc compiler. The compiler should then be functionally equivalent in debug or release, so it should apply exactly the same optimizations to the app as long as we give it the same command line arguments... But maybe there's an extra layer to this that I'm not thinking of?
Yeah I was also wondering if the roc_memcpy stuff was related but it doesn't seem to be called. But I can try defining it anyway and see what happens.
If this doesn't work we can maybe copy another working host from somewhere else and modify it.
Actually roc_memcpy and roc_memset are both defined in the host already and we don't have a roc_memmove.
when running tests with --release, our LLVM backend enables more llvm optimizations (because they are slower, we don't want to use full optimization during development). CI always runs with --release
I wonder if we can/should get llvm to stop introducing memmov, memcpy, memset, etc. Some llvm optimization pass must remove loops in favor of calls to those functions.
Otherwise, yeah, i guess every platform needs a roc_memmov cause llvm might add that call.
I guess every platform needs a roc_memmov cause llvm might add that call.
That sounds like a simple enough solution, I'll give it a go.
Well, damn. That didn't work.
I'm pretty sure it worked. This now looks to be a different issue. Linking now passes, but the example gave a bad exit code in CI.
yes
> cargo run -- --optimize examples/parser/parse-movies-csv.roc
Finished dev [unoptimized + debuginfo] target(s) in 0.23s
Running `target/debug/roc --optimize examples/parser/parse-movies-csv.roc`
🔨 Rebuilding platform...
fish: “cargo run -- --optimize example…” terminated by signal SIGSEGV (Address boundary error)
very confusing
==335201== Memcheck, a memory error detector
==335201== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==335201== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==335201== Command: ./parse-movies-csv
==335201==
==335201==
==335201== Process terminating with default action of signal 11 (SIGSEGV)
==335201== Bad permissions for mapped region at address 0x12000D
==335201== at 0x12000D: ??? (in /home/folkertdev/roc/roc/examples/parser/parse-movies-csv)
==335201== by 0x4BF12E7: ???
==335201== by 0x11FFBF: ??? (in /home/folkertdev/roc/roc/examples/parser/parse-movies-csv)
==335201==
==335201== HEAP SUMMARY:
==335201== in use at exit: 0 bytes in 0 blocks
==335201== total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==335201==
==335201== All heap blocks were freed -- no leaks are possible
==335201==
==335201== For lists of detected and suppressed errors, rerun with: -s
==335201== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
fish: “valgrind --leak-check=full --tr…” terminated by signal SIGSEGV (Address boundary error)
might still be that we mess something up in the linker?
Yeah, really weird behavior in my local testing. The host is running the success path and returning 0 from main, but somehow our runner is getting a nonzero exit code. The code is not even consistent. That being said the running is checking the output and ensuring it ends with "Parse success!\n". I think it is somehow related to how the host is writing to stdout.
this is what lldb says (it's not much)
> lldb parse-movies-csv
(lldb) target create "parse-movies-csv"
Current executable set to '/home/folkertdev/roc/roc/examples/parser/parse-movies-csv' (x86_64).
(lldb) run
Process 336055 launched: '/home/folkertdev/roc/roc/examples/parser/parse-movies-csv' (x86_64)
Process 336055 stopped
* thread #1, name = 'parse-movies-cs', stop reason = signal SIGSEGV: address access protected (fault address: 0x55555556c000)
frame #0: 0x000055555556c000 parse-movies-csv`__libc_csu_init + 64
parse-movies-csv`__libc_csu_init:
-> 0x55555556c000 <+64>: movq %r14, %rdx
0x55555556c003 <+67>: movq %r13, %rsi
0x55555556c006 <+70>: movl %r12d, %edi
0x55555556c009 <+73>: callq *(%r15,%rbx,8)
(lldb) bt
* thread #1, name = 'parse-movies-cs', stop reason = signal SIGSEGV: address access protected (fault address: 0x55555556c000)
* frame #0: 0x000055555556c000 parse-movies-csv`__libc_csu_init + 64
frame #1: 0x00007ffff7c4c010 libc.so.6`__libc_start_main(main=(parse-movies-csv`main), argc=1, argv=0x00007fffffffe1c8, init=(parse-movies-csv`__libc_csu_init), fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe1b8) at libc-start.c:264:6
frame #2: 0x000055555555825e parse-movies-csv`_start + 46
This fixed it locally for me. We'll see if it passes CI.
M1 CI failure is related to zig not being able to compile builtins due to the path being too long?
thread 'main' panicked at 'zig failed: error: unable to create compilation: NameTooLong
Cool, that worked! Though I have no idea what is wrong with the M1 build. @Anton-4 is this M1 problem a known issue in CI?
No, don't think I've ever seen that one before.
Is it working locally on M1 for you @bhansconnect?
this also still fails for me on linux. I am on commit 30dbad88ae81d3009cc2a6f2c885f49fe3bf3326, and
> cargo run -- --optimize examples/parser/parse-movies-csv.roc
Finished dev [unoptimized + debuginfo] target(s) in 0.21s
Running `target/debug/roc --optimize examples/parser/parse-movies-csv.roc`
🔨 Rebuilding platform...
fish: “cargo run -- --optimize example…” terminated by signal SIGSEGV (Address boundary error)
> cargo run -- --optimize --linker=legacy examples/parser/parse-movies-csv.roc
Finished dev [unoptimized + debuginfo] target(s) in 0.22s
Running `target/debug/roc --optimize --linker=legacy examples/parser/parse-movies-csv.roc`
🔨 Rebuilding platform...
2 movies were found:
The movie 'Airplane!' was released in 1980 and stars Robert Hays and Julie Hagerty
The movie 'Caddyshack' was released in 1980 and stars Chevy Chase, Rodney Dangerfield, Ted Knight, Michael O'Keefe and Bill Murray
Parse success!
fun thing: I ran zig translate-c host.c > host.zig and then everything works fine. Only difference between the host binaries seems that the C host executable has two RELRO program headers
RELRO off 0x0000000000047840 vaddr 0x0000000000049840 paddr 0x0000000000049840 align 2**0
filesz 0x0000000000002a40 memsz 0x0000000000003840 flags r--
STACK off 0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**0
filesz 0x0000000000000000 memsz 0x0000000001000000 flags rw-
NOTE off 0x000000000000030c vaddr 0x000000000000030c paddr 0x000000000000030c align 2**2
filesz 0x0000000000000020 memsz 0x0000000000000020 flags r--
RELRO off 0x0000000000037568 vaddr 0x0000000000038568 paddr 0x0000000000038568 align 2**0
filesz 0x0000000000002b18 memsz 0x0000000000002b18 flags r--
NULL off 0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**0
filesz 0x0000000000000000 memsz 0x0000000000000000 flags ---
and zig has just one
RELRO off 0x0000000000047810 vaddr 0x0000000000049810 paddr 0x0000000000049810 align 2**0
filesz 0x0000000000002a40 memsz 0x0000000000003870 flags r--
STACK off 0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**0
filesz 0x0000000000000000 memsz 0x0000000001000000 flags rw-
NOTE off 0x000000000000030c vaddr 0x000000000000030c paddr 0x000000000000030c align 2**2
filesz 0x0000000000000020 memsz 0x0000000000000020 flags r--
NULL off 0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**0
filesz 0x0000000000000000 memsz 0x0000000000000000 flags ---
NULL off 0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**0
filesz 0x0000000000000000 memsz 0x0000000000000000 flags ---
I think that means that the surgical linker will write over that second relro section, because it assumes that the last two are free to write to
I also found that even with the legacy linker, there is a large number of invalid reads. Probably refcounting issues. So those need to be fixed regardless and might actually be the real cause of the problems we're seeing
This passes CI now. Turns out fixing the valgrind issues was enough. There were two silly bugs, and the host did not free a roc string.
How should we proceed? I'd be happy to merge as-is, but I'm not sure what the state of the roc code is