                        tests: binary data and us
We need to have a story for binary data in goblin, as it’s becoming more and more urgent to have a full test suite we can run against for compliance and regression.
We don’t want to store binary data in git, since this adds to clone time and is generally frowned upon.
There are a few binary files in goblin, or what is effectively binary data stored as vecs of u8s, but these are grandfathered in, and we don’t need to concern ourselves with them.
I’m open to any and all suggestions about what the best path forward is here :)
Content addressability is a must-have IMO. Test fixtures need to stay constant. Instead of saying "I want test_case_4.exe", it's better to say "I want the file with hash 012abc.", since that lets everyone be confident the test sees the same data unless/until that reference changes.
Proprietary bits are a problem, e.g. the Microsoft bits from the Microsoft symbol server in #183. In general I think that means we can expect to be able to retrieve things but not to redistribute them. If a central repository is prohibited due to licensing reasons, that suggests we're instead talking about a way to assemble a local copy of all the data needed for testing.
Maybe this is a manifest inside the goblin repo and a tool to retrieve it?
# SHA256	URL
b2025742b5f0025ace9821d5722de3f997eeeab21d2f381c9e307882df422579	https://msdl.microsoft.com/download/symbols/WSHTCPIP.DLL/4a5be0b77000/WSHTCPIP.DLL
Tool downloads everything in the manifest, verifies all the hashes, complains if anything can't be retrieved or has changed. Now goblin tests can File::open("<hash>") without worrying about where the data came from originally, without worrying that it changed between runs, and without copying it into the goblin repo.
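A minimal sketch of the manifest-parsing half of such a tool, assuming the hypothetical tab-separated `<sha256>\t<url>` format shown above (the actual download and hash verification would sit on top of this; no such tool exists in goblin today):

```rust
use std::collections::HashMap;

/// Parse a manifest of the shape sketched above: one `<sha256>\t<url>`
/// pair per line, `#` for comments. Hypothetical format, for illustration.
fn parse_manifest(text: &str) -> HashMap<String, String> {
    text.lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .filter_map(|l| {
            let mut parts = l.split_whitespace();
            Some((parts.next()?.to_string(), parts.next()?.to_string()))
        })
        .collect()
}

fn main() {
    let manifest = "\
# SHA256\tURL
b2025742b5f0025ace9821d5722de3f997eeeab21d2f381c9e307882df422579\thttps://msdl.microsoft.com/download/symbols/WSHTCPIP.DLL/4a5be0b77000/WSHTCPIP.DLL
";
    let entries = parse_manifest(manifest);
    // The fetch step would download each URL into a local cache keyed by
    // hash; tests then open files by hash, never by URL.
    for (hash, url) in &entries {
        println!("{} <- {}", hash, url);
    }
    assert_eq!(entries.len(), 1);
}
```

Keying the cache by hash is what makes the scheme content-addressable: the test never cares where a file came from, only that its bytes match the manifest.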
Bits we're allowed to redistribute could come from a GitHub release or blob in a separate goblin-test-data project, allowing the goblin project to control their fate. Non-redistributable bits could come directly from wherever they live, which puts goblin at the mercy of third parties, but that's unavoidable if goblin is not allowed to redistribute copies.
This is an interesting idea.
The tool could be a dev dependency, and perhaps we have a feature flag enabling a “regression suite”; as a setup step it downloads everything in the manifest, and then the tests run against it.
This way it could be opt-in only, so the burden isn’t put on others.
I also wonder whether another approach could be for a separate repo to depend on goblin somehow, run a test suite against the manifest, and then send results back to GitHub CI?
How about a cargo feature?
#[cfg(test)]
mod tests {
  use std::fs::File;
  // …
  #[test]
  #[cfg(feature = "external_data")]
  fn test_that_one_weird_thing() {
    let mut file = File::open("<hash>").unwrap();
    // …
  }
}
--features external_data could be a separate line in the build matrix.
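For concreteness, a sketch of what that might look like in Cargo.toml (assuming the `external_data` feature name from the snippet above; nothing like this exists in goblin yet):

```toml
# Cargo.toml (sketch): an empty feature so the external-data tests are
# opt-in and cost nothing for regular `cargo test` runs.
[features]
external_data = []
```

Regular contributors run `cargo test` as usual; CI adds a matrix row running `cargo test --features external_data` after the fetch tool has populated the local cache.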
Yeah, that was exactly what I was thinking.
Except it’s probably on a mod, and we put all manifest/external-file-based tests in there.
It might also be useful to collect in this issue a list of possible data sources we could use for binary files. I know there are several floating around GitHub at least, but I don’t know the legality of them.
I think what we want is essentially a symbol server that hosts binaries and a small utility that pulls them in. Since symbol servers generally encode some sort of unique identifier into the lookup path, that would already solve the identification issue.
The only thing that's missing then is a small script that iterates through a list of URLs and mirrors them into a local cache directory. We could even extract that list out of the test code if there's some syntax to detect it from, like a macro or a magic comment.
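A sketch of the "magic comment" extraction idea, assuming a made-up `// external-data: <url>` marker (the marker name and syntax are purely illustrative; nothing in goblin uses them today):

```rust
/// Collect external-data URLs from source text by scanning for a magic
/// comment of the form `// external-data: <url>`. Hypothetical marker,
/// for illustration only.
fn collect_urls(source: &str) -> Vec<&str> {
    source
        .lines()
        .filter_map(|line| line.trim().strip_prefix("// external-data:"))
        .map(str::trim)
        .collect()
}

fn main() {
    let src = r#"
// external-data: https://msdl.microsoft.com/download/symbols/WSHTCPIP.DLL/4a5be0b77000/WSHTCPIP.DLL
fn test_that_one_weird_thing() {}
"#;
    let urls = collect_urls(src);
    assert_eq!(urls.len(), 1);
    println!("{:?}", urls);
}
```

Deriving the URL list from the test code itself would keep the manifest from drifting out of sync with the tests that consume it.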
+1 on the feature flag approach.
As already mentioned in #183, I can see a lot of use cases for such a binary test data server outside of goblin, such as: gimli-rs/gimli, getsentry/symbolic, google/breakpad, willglynn/pdb, etc. For explicit test payloads that are not hosted on a public symbol server already, I'm pretty sure that we'll find a way to create a shared bucket somewhere.
As for the structure of that server, there are obviously multiple options. We've compiled a list of some of the more popular formats here: https://getsentry.github.io/symbolicator/advanced/symbol-server-compatibility/.
WDYT?
> hosts binaries
@jan-auer I'm concerned about the legality of re-hosting e.g. Microsoft's binaries. It's fine for them to serve their files, but I don't see how anyone except them can serve their files. That's why I think we need a tool that works with existing third-party servers instead of operating a new first-party server.
@willglynn That's fine. For Microsoft files we can literally point the script / tests at their symbol server and download from there. I was suggesting a similar setup for all custom files that might be interesting for testing ELF, MachO, or anything that's not provided by Microsoft.
@jan-auer What do you see as the ideal interface to this kind of external data? Not just goblin, but the whole ecosystem here. You've been through more of it than I have 😄
As a Linux vendor packager, I'm somewhat fond of the "handle these binary assets in a separate crate" idea; just know you'll have to exercise some kind of discipline so that goblin and the test data in this other crate either a) don't diverge too far or b) are impervious to problems caused by version mixing.
Just try to avoid anything that auto-fetches stuff inside the test/build phases, unless there's a clear way to have an external agent pre-provision these resources. (Build and test can be run in a network-isolated location, and our tooling is responsible for fetching all external resources before entering the network-isolated context and invoking the cargo/rust steps.)
I'd like to throw in the obligatory mention of test-assembler which is designed exactly for this use case. I originally wrote it as a port of the C++ implementation that Jim Blandy wrote for Breakpad so he could write a test suite for its DWARF parsing code.
I used it in my minidump crate to write a bunch of test helper types to create synthetic in-memory minidump files, the tests using them look like this.
A slightly simpler real-world example is the X86 stackwalker unit test that uses it to set up stack data.
gimli uses it for a bunch of tests as well. In Breakpad I wrote some helper types to generate synthetic ELF files for tests which worked out nicely. I thought about porting that to Rust a number of times but never got around to it.
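To make the synthetic-fixture idea concrete, here is a hand-rolled sketch of building a minimal ELF64 header in memory, in the spirit of test-assembler / the Breakpad synth-elf helpers but deliberately not using their APIs (field offsets and values follow the System V ABI ELF64 layout):

```rust
/// Build a minimal 64-byte ELF64 header by hand. This is an
/// illustration of synthesizing fixtures in memory, not the
/// test-assembler or synth-elf API.
fn minimal_elf64_header() -> Vec<u8> {
    let mut bytes = vec![0u8; 64]; // ELF64 header is 64 bytes
    bytes[..4].copy_from_slice(&[0x7f, b'E', b'L', b'F']); // e_ident magic
    bytes[4] = 2; // EI_CLASS: ELFCLASS64
    bytes[5] = 1; // EI_DATA: little-endian
    bytes[6] = 1; // EI_VERSION: EV_CURRENT
    bytes[16..18].copy_from_slice(&2u16.to_le_bytes()); // e_type: ET_EXEC
    bytes[18..20].copy_from_slice(&62u16.to_le_bytes()); // e_machine: EM_X86_64
    bytes[20..24].copy_from_slice(&1u32.to_le_bytes()); // e_version
    bytes
}

fn main() {
    let header = minimal_elf64_header();
    assert_eq!(&header[..4], b"\x7fELF");
    assert_eq!(header.len(), 64);
    println!("synthesized {} header bytes", header.len());
}
```

A test suite built this way needs no binary files in git and no network at all, which sidesteps both the redistribution and the clone-time concerns for cases that can be synthesized.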
Hey, I've gone ahead and ported the Breakpad ELF synthesization code that @luser mentioned to Rust here. I needed it for testing some code that uses goblin for parsing in a context where I can't allocate, so I had to do a few things by hand.
As noted at the top of the synth-elf lib, I really think this code should be available from goblin itself for these kinds of use cases, so that people can synthesize their own test cases instead of committing and/or retrieving binary data for tests. I'd be happy to make a PR to integrate this synth-elf code into goblin if people think this approach is OK; if not, I can just release the crate separately.
Similarly, Gankra recently refactored the synth-minidump code I had written in the minidump crate out to a separate (test-only) crate so it could be more easily used across the various crates in the workspace:
https://github.com/luser/rust-minidump/tree/master/synth-minidump