velociraptor
velociraptor copied to clipboard
Bug: Artifact validation does not cover source naming
Moving from discussion that started in https://github.com/Velocidex/velociraptor/pull/1790#issuecomment-1136675724.
The documentation at https://docs.velociraptor.app/docs/vql/artifacts/ states that an Artifact with multiple sources must have a unique name for each one. The artifact validation code in Repository.LoadProto doesn't check any of the naming for sources and as a result artifacts that violate these rules have landed over time.
I'd like to use this issue to discuss what we expect the naming rules to actually be since fixing artifacts to adhere to the stated rule may result in user-visible breakage.
After a few test failures that exited after the first issue was encountered, I wrote up a quick little script to sanity check the existing pool of artifacts. The following is what shook out. There are some cases where we need to define what the rules actually are and a few obvious bugs.
Multiple unnamed sources with different preconditions:
Generic.Client.StatsGeneric.Forensic.Timeline
The preconditions are intended to be mutually exclusive but at least with Generic.Forensic.Timeline don't appear to be. Unfortunately, we don't have a programmatic way to decide whether the precondition is mutually exclusive. I'm not sure what to do about this case since there's an obvious argument for the simplicity of using just the Artifact name in all cases, but it's easy to break. Perhaps the answer is just reminding the user not to shoot themselves in the foot.
A mix of named and unnamed sources:
Windows.Applications.MegaSyncWindows.Applications.SBECmdWindows.Forensics.BulkExtractorWindows.System.Powershell.PSReadline
The pattern seems to be to use an unnamed source to do most of the work and a named source to do the cleanup. We should probably continue to allow that.
This artifact could probably use some deeper review since it just looks like perhaps there might've been a bad merge at some point but there's no evidence of it in the git log.
Generic.Collectors.SQLECmd- multiple sources named "Drobpox" with no precondition.
- multiple sources named "Windows EventTranscript.db BrowsingHistory" with no precondition.
- multiple sources named "Windows EventTranscript.db Device Connectivity and Configuration" with no precondition.
- multiple sources named "Windows EventTranscript.db Inking Typing and Speech Utterance" with no precondition.
- multiple sources named "Windows EventTranscript.db_ProductandServicePerformance" with no precondition.
- multiple sources named "Windows EventTranscript.db Product and Service Usage" with no precondition.
- multiple sources named "Windows EventTranscript.db Software Setup and Inventory" with no precondition.
- multiple sources named "Firefox Downloads" with no precondition.
- multiple sources named "Windows Notifications" with no precondition.
- a mix of named and unnamed sources.
Thanks for raising this issue for discussion :-)
So just a bit of background for the reasoning behind these restrictions on naming etc. Since Velociraptor is effectively a VQL engine, the shape of the data (meaning the columns and their types) is determined by the VQL query itself.
So for example, if we have a query SELECT X, Y, Z FROM ... we expect to have three columns, X, Y, and Z.
The server basically collects the result into a result set file (basically JSONL) named after the artifact name / source name if there is a source or just artifact name if the source name is empty.
Generally when we end up reading these files, we want the output format to be consistent - we dont want for example, the first half of the file having columns X, Y, Z and the second half to have columns A, B, C. This is important for a variety of reasons as well as that the GUI can not handle the switching of columns in the middle of a table - the GUI starts to render the table with columns based on the first row then keeps adding rows to that table. So it will end up missing columns as it reaches the second half of the table.
From a data perspective it probably does not matter too much if there are missing fields - each row of the JSONL is a JSON object in its own right so maybe other tools don't care (for example feeding into elastic makes no difference).
So this is the reason for the validation restriction - they are not very strong reasons just trying to make sure data is more or less consistent. We can impose a hard validation on the artifacts but this might end up being overly restrictive.
The example you brought up is a perfect illustration - a number of sources that have the same name but different preconditions. We hope one will be active at any time but there is no way for us to ensure this is the case other than a manual inspection.
The other example was Generic.Collectors.SQLECmd which is autogenerated from the SQLECmd repository and the names there do not necessary mean different things. As it stands right now all those sources of the same name will end up in the same file in some random order. But this is not necessarily the best because we do not have any assurance that the shape of the data in each source is consistent.
So I think it is worth changing the source names of Generic.Collectors.SQLECmd to make it consistent. The script that generates it is here https://github.com/Velocidex/velociraptor/blob/master/scripts/sqlecmd_convert.py
This certainly uncovers some issues - maybe we need to convert this validation into a warning with some way to add exceptions for known misuses of these broad rules?
The GUI these days can handle changing the data shape through the table so it doesnt matter too much.
Maybe we should just say it is best practice and not enforce rigid validation.