roc icon indicating copy to clipboard operation
roc copied to clipboard

CSV Decoding in pure Roc

Open Qqwy opened this issue 3 years ago • 1 comments

This PR adds a new example that demonstrates parsing in pure Roc.

The parser is implemented as a parser combinator library, split into three modules:

  • Parser.Core for the main primitive building blocks.
  • Parser.Str for all parsers dealing with Str and List U8 as input.
  • Parser.Csv for all parsers dealing with CSV as input and parsers parsing CSV from Str.

There still is a lot which could be improved, both in the implementation itself as well as in code quality and documentation, but my hope is that the parsing logic is at least finished enough to be used in @rtfeldman 's upcoming presentation.

This lives in examples/csv for now, together with a main.roc which is currently based on the hello-world platform but this should be swapped out for one of the interactive platforms so it can read from e.g. stdin. main.roc contains a parser which parsers the "movie info" data which Richard is intending to use in his presentation.

The CSV parser follows RFC 4180, almost to the letter. Currently with two exceptions:

  • CSV headers (and extracting fields based on the header names) are not currently supported.
  • It is allowed to use plain \n as record separator instead of \r\n (the RFC requirement).

Qqwy avatar Jul 16 '22 21:07 Qqwy

@Qqwy Hi!

I'd like to write a parser for plsql in Roc. I have seen this code and it looks solid starting point.

Before I dwelve into it, I thought to ping you in what actual state is this Roc code? I do need to use your host.c?

Thanks! :)

popara avatar Jul 28 '22 16:07 popara

@Qqwy This PR looks fun, thank you! I look forward to forking this example for a few data analysis projects I'm working on 😃

kili-ilo avatar Sep 05 '22 05:09 kili-ilo

@rtfeldman did you use this in one of your presentations?

kili-ilo avatar Sep 05 '22 06:09 kili-ilo

@popara No, you don't need the special host.c. You can just integrate this Roc code in your project as-is. :blush:

Qqwy avatar Sep 05 '22 09:09 Qqwy

@Qqwy looks like all this needs is a test, right? Anything else blocking it? I'd love to have a parser in the repo. Happy to help out.

brian-carroll avatar Sep 24 '22 12:09 brian-carroll

@Qqwy looks like all this needs is a test, right? Anything else blocking it? I'd love to have a parser in the repo. Happy to help out.

Yes, I'm pretty sure that tests are all that is missing. The reason they were not added before was because before the new changes to nested lambda set compilation, compilation of the example was too slow to continue working on it.

Help is very welcome! My time to work on Roc is currently rather limited I'm afraid.

Qqwy avatar Sep 24 '22 14:09 Qqwy

Roc have come long way since then!

Sent from my iPhone

On 24 Sep 2022, at 15:34, Qqwy @.***> wrote:

 @Qqwy looks like all this needs is a test, right? Anything else blocking it? I'd love to have a parser in the repo. Happy to help out.

Yes, I'm pretty sure that tests are all that is missing. The reason they were not added before was because before the new changes to nested lambda set compilation, compilation of the example was too slow to continue working on it.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.

popara avatar Sep 24 '22 19:09 popara

Ok I can make a test, and maybe rename the example directory to "parsers" so we can reuse the library and platform for other formats.

brian-carroll avatar Sep 25 '22 13:09 brian-carroll

This sounds like a very good plan! :blush:

Qqwy avatar Sep 25 '22 17:09 Qqwy

Anyone got any ideas why this won't pass on CI? It passes locally.

brian-carroll avatar Sep 26 '22 20:09 brian-carroll

OK I got it to fail locally with cargo test --release -p roc_cli csv. It crashes in the surgical linker, because memmove isn't defined. But why isn't it defined? Other libc stuff is linked. And how does building the Roc compiler in release mode affect this?

@folkertdev @bhansconnect

brian@acer:roc$ RUST_BACKTRACE=1 cargo test --release -p roc_cli csv
    Finished release [optimized] target(s) in 0.42s
     Running unittests src/lib.rs (target/release/deps/roc_cli-bfc8018b40a1baba)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running tests/cli_run.rs (target/release/deps/cli_run-e5c952eb2fcb209b)

running 1 test
test cli_run::parse_movies_csv ... FAILED

failures:

---- cli_run::parse_movies_csv stdout ----
thread 'cli_run::parse_movies_csv' panicked at '`roc` command had unexpected stderr: An internal compiler expectation was broken.
This is definitely a compiler bug.
Please file an issue here: https://github.com/roc-lang/roc/issues/new/choose
thread 'main' panicked at 'Undefined Symbol in relocation, (+9ac3, Relocation { kind: PltRelative, encoding: Generic, size: +20, target: Symbol(SymbolIndex(+5b)), addend: +fffffffffffffffc, implicit_addend: false }): Ok(Symbol { name: "memmove", address: +0, size: +0, kind: Unknown, section: Undefined, scope: Unknown, weak: false, flags: Elf { st_info: +10, st_other: +0 } })', crates/linker/src/elf.rs:1279:33
stack backtrace:
   0: rust_begin_unwind
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panicking.rs:143:14
   2: roc_linker::elf::surgery_elf_help
   3: roc_linker::link_preprocessed_host
   4: roc_cli::build::build_file
   5: roc_cli::build
   6: roc::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
', crates/cli/tests/cli_run.rs:131:13
stack backtrace:
   0: rust_begin_unwind
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panicking.rs:143:14
   2: cli_run::cli_run::run_roc_on
   3: cli_run::cli_run::check_output_with_stdin
   4: core::ops::function::FnOnce::call_once
   5: core::ops::function::FnOnce::call_once
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    cli_run::parse_movies_csv

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 41 filtered out; finished in 4.00s

error: test failed, to rerun pass '-p roc_cli --test cli_run'

brian-carroll avatar Sep 26 '22 21:09 brian-carroll

my guess is that memmove needs to be added here

fn collect_roc_definitions<'a>(object: &object::File<'a, &'a [u8]>) -> MutMap<String, u64> {
    let mut vaddresses = MutMap::default();

    for sym in object.symbols().filter(is_roc_definition) {
        // remove potentially trailing "@version".
        let name = sym
            .name()
            .unwrap()
            .trim_start_matches('_')
            .split('@')
            .next()
            .unwrap();

        let address = sym.address() as u64;

        // special exceptions for memcpy and memset.
        if name == "roc_memcpy" {
            vaddresses.insert("memcpy".to_string(), address);
        } else if name == "roc_memset" {
            vaddresses.insert("memset".to_string(), address);
        }

        vaddresses.insert(name.to_string(), address);
    }

    vaddresses
}

in the linker code. Optimization is relevant because "normal" code can be turned into a memmove by the optimizer.

It's not that simple though because we don't have roc_memset (should we?). But this is where we do some suspicious stuff with libc functions in the linker, so it's the first place I'd look.

I can look at this in more detail tomorrow.

folkertdev avatar Sep 27 '22 07:09 folkertdev

Optimization is relevant because "normal" code can be turned into a memmove by the optimizer.

Hmm. Maybe I'm missing something. I can certainly see how optimization of the Roc app would be relevant. But my understanding is that cargo test --release gets us a release build of the Roc compiler. The compiler should then be functionally equivalent in debug or release, so it should apply exactly the same optimizations to the app as long as we give it the same command line arguments... But maybe there's an extra layer to this that I'm not thinking of?

Yeah I was also wondering if the roc_memcpy stuff was related but it doesn't seem to be called. But I can try defining it anyway and see what happens. If this doesn't work we can maybe copy another working host from somewhere else and modify it.

brian-carroll avatar Sep 27 '22 07:09 brian-carroll

Actually roc_memcpy and roc_memset are both defined in the host already and we don't have a roc_memmove.

brian-carroll avatar Sep 27 '22 07:09 brian-carroll

when running tests with --release, our LLVM backend enables more llvm optimizations (because they are slower, we don't want to use full optimization during development). CI always runs with --release

folkertdev avatar Sep 27 '22 07:09 folkertdev

I wonder if we can/should get llvm to stop introducing memmov, memcpy, memset, etc. Some llvm optimization pass must remove loops in favor of calls to those functions.

Otherwise, yeah, i guess every platform needs a roc_memmov cause llvm might add that call.

bhansconnect avatar Sep 27 '22 15:09 bhansconnect

I guess every platform needs a roc_memmov cause llvm might add that call.

That sounds like a simple enough solution, I'll give it a go.

brian-carroll avatar Sep 27 '22 20:09 brian-carroll

Well, damn. That didn't work.

brian-carroll avatar Sep 27 '22 21:09 brian-carroll

I'm pretty sure it worked. This now looks to be a different issue. Linking now passes, but the example gave a bad exit code in CI.

bhansconnect avatar Sep 27 '22 22:09 bhansconnect

yes

> cargo run -- --optimize examples/parser/parse-movies-csv.roc
    Finished dev [unoptimized + debuginfo] target(s) in 0.23s
     Running `target/debug/roc --optimize examples/parser/parse-movies-csv.roc`
🔨 Rebuilding platform...
fish: “cargo run -- --optimize example…” terminated by signal SIGSEGV (Address boundary error)

very confusing

==335201== Memcheck, a memory error detector
==335201== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==335201== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==335201== Command: ./parse-movies-csv
==335201== 
==335201== 
==335201== Process terminating with default action of signal 11 (SIGSEGV)
==335201==  Bad permissions for mapped region at address 0x12000D
==335201==    at 0x12000D: ??? (in /home/folkertdev/roc/roc/examples/parser/parse-movies-csv)
==335201==    by 0x4BF12E7: ???
==335201==    by 0x11FFBF: ??? (in /home/folkertdev/roc/roc/examples/parser/parse-movies-csv)
==335201== 
==335201== HEAP SUMMARY:
==335201==     in use at exit: 0 bytes in 0 blocks
==335201==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==335201== 
==335201== All heap blocks were freed -- no leaks are possible
==335201== 
==335201== For lists of detected and suppressed errors, rerun with: -s
==335201== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
fish: “valgrind --leak-check=full --tr…” terminated by signal SIGSEGV (Address boundary error)

might still be that we mess something up in the linker?

folkertdev avatar Sep 27 '22 22:09 folkertdev

Yeah, really weird behavior in my local testing. The host is running the success path and returning 0 from main, but somehow our runner is getting a nonzero exit code. The code is not even consistent. That being said the running is checking the output and ensuring it ends with "Parse success!\n". I think it is somehow related to how the host is writing to stdout.

bhansconnect avatar Sep 27 '22 22:09 bhansconnect

this is what lldb says (it's not much)

> lldb parse-movies-csv 
(lldb) target create "parse-movies-csv"
Current executable set to '/home/folkertdev/roc/roc/examples/parser/parse-movies-csv' (x86_64).
(lldb) run
Process 336055 launched: '/home/folkertdev/roc/roc/examples/parser/parse-movies-csv' (x86_64)
Process 336055 stopped
* thread #1, name = 'parse-movies-cs', stop reason = signal SIGSEGV: address access protected (fault address: 0x55555556c000)
    frame #0: 0x000055555556c000 parse-movies-csv`__libc_csu_init + 64
parse-movies-csv`__libc_csu_init:
->  0x55555556c000 <+64>: movq   %r14, %rdx
    0x55555556c003 <+67>: movq   %r13, %rsi
    0x55555556c006 <+70>: movl   %r12d, %edi
    0x55555556c009 <+73>: callq  *(%r15,%rbx,8)
(lldb) bt
* thread #1, name = 'parse-movies-cs', stop reason = signal SIGSEGV: address access protected (fault address: 0x55555556c000)
  * frame #0: 0x000055555556c000 parse-movies-csv`__libc_csu_init + 64
    frame #1: 0x00007ffff7c4c010 libc.so.6`__libc_start_main(main=(parse-movies-csv`main), argc=1, argv=0x00007fffffffe1c8, init=(parse-movies-csv`__libc_csu_init), fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe1b8) at libc-start.c:264:6
    frame #2: 0x000055555555825e parse-movies-csv`_start + 46

folkertdev avatar Sep 27 '22 22:09 folkertdev

This fixed it locally for me. We'll see if it passes CI.

bhansconnect avatar Sep 27 '22 23:09 bhansconnect

M1 CI failure is related to zig not being able to compile builtins due to the path being too long?

thread 'main' panicked at 'zig failed: error: unable to create compilation: NameTooLong

bhansconnect avatar Sep 27 '22 23:09 bhansconnect

Cool, that worked! Though I have no idea what is wrong with the M1 build. @Anton-4 is this M1 problem a known issue in CI?

bhansconnect avatar Sep 27 '22 23:09 bhansconnect

No, don't think I've ever seen that one before.

Anton-4 avatar Sep 28 '22 09:09 Anton-4

Is it working locally on M1 for you @bhansconnect?

Anton-4 avatar Sep 28 '22 10:09 Anton-4

this also still fails for me on linux. I am on commit 30dbad88ae81d3009cc2a6f2c885f49fe3bf3326, and

> cargo run -- --optimize examples/parser/parse-movies-csv.roc
    Finished dev [unoptimized + debuginfo] target(s) in 0.21s
     Running `target/debug/roc --optimize examples/parser/parse-movies-csv.roc`
🔨 Rebuilding platform...
fish: “cargo run -- --optimize example…” terminated by signal SIGSEGV (Address boundary error)

> cargo run -- --optimize --linker=legacy examples/parser/parse-movies-csv.roc
    Finished dev [unoptimized + debuginfo] target(s) in 0.22s
     Running `target/debug/roc --optimize --linker=legacy examples/parser/parse-movies-csv.roc`
🔨 Rebuilding platform...
2 movies were found:

The movie 'Airplane!' was released in 1980 and stars Robert Hays and Julie Hagerty
The movie 'Caddyshack' was released in 1980 and stars Chevy Chase, Rodney Dangerfield, Ted Knight, Michael O'Keefe and Bill Murray

Parse success!

folkertdev avatar Sep 28 '22 11:09 folkertdev

fun thing: I ran zig translate-c host.c > host.zig and then everything works fine. Only difference between the host binaries seems that the C host executable has two RELRO program headers

   RELRO off    0x0000000000047840 vaddr 0x0000000000049840 paddr 0x0000000000049840 align 2**0
         filesz 0x0000000000002a40 memsz 0x0000000000003840 flags r--
   STACK off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**0
         filesz 0x0000000000000000 memsz 0x0000000001000000 flags rw-
    NOTE off    0x000000000000030c vaddr 0x000000000000030c paddr 0x000000000000030c align 2**2
         filesz 0x0000000000000020 memsz 0x0000000000000020 flags r--
   RELRO off    0x0000000000037568 vaddr 0x0000000000038568 paddr 0x0000000000038568 align 2**0
         filesz 0x0000000000002b18 memsz 0x0000000000002b18 flags r--
    NULL off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**0
         filesz 0x0000000000000000 memsz 0x0000000000000000 flags ---

and zig has just one

   RELRO off    0x0000000000047810 vaddr 0x0000000000049810 paddr 0x0000000000049810 align 2**0
         filesz 0x0000000000002a40 memsz 0x0000000000003870 flags r--
   STACK off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**0
         filesz 0x0000000000000000 memsz 0x0000000001000000 flags rw-
    NOTE off    0x000000000000030c vaddr 0x000000000000030c paddr 0x000000000000030c align 2**2
         filesz 0x0000000000000020 memsz 0x0000000000000020 flags r--
    NULL off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**0
         filesz 0x0000000000000000 memsz 0x0000000000000000 flags ---
    NULL off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**0
         filesz 0x0000000000000000 memsz 0x0000000000000000 flags ---

I think that means that the surgical linker will write over that second relro section, because it assumes that the last two are free to write to

I also found that even with the legacy linker, there is a large number of invalid reads. Probably refcounting issues. So those need to be fixed regardless and might actually be the real cause of the problems we're seeing

folkertdev avatar Sep 28 '22 12:09 folkertdev

This passes CI now. Turns out fixing the valgrind issues was enough. There were two silly bugs, and the host did not free a roc string.

How should we proceed? I'd be happy to merge as-is, but I'm not sure what the state of the roc code is

folkertdev avatar Sep 28 '22 14:09 folkertdev