ocsf-schema
[ Schema Extension ] FileObject: Extend File Type with “Executable”
Needs
The ability to query data stores encoded with the OCSF schema for files whose type is executable, specifically executable images on Windows, Linux, and macOS, i.e., PE32, PE32+, ELF, MACHO.
The new enum value will let search workflows, for both machines and humans, use a single predicate against a single column, such as:
-- Example query with desired extension of FileType
select file_name, file_sha2
from some_table
where ocsf_file_type_id = 8 or ocsf_file_type = 'Executable'
Current FileObject type_id enum values
"type_id": {
  "description": "The file type ID.",
  "enum": {
    "0": {
      "caption": "Unknown"
    },
    "1": {
      "caption": "Regular File"
    },
    "2": {
      "caption": "Folder"
    },
    "3": {
      "caption": "Character Device"
    },
    "4": {
      "caption": "Block Device"
    },
    "5": {
      "caption": "Local Socket"
    },
    "6": {
      "caption": "Named Pipe"
    },
    "7": {
      "caption": "Symbolic Link"
    },
    "99": {
      "caption": "Other"
    }
  },
  "requirement": "required"
}
Good suggestion.
One wrinkle is that file.type_id seems to have been used to indicate the "filesystem file type", where an executable would fall under "Regular File".
mime_type also exists, but that doesn't meet the need where you want to ask about all executable files.
How do you feel about a new field like regular_file_type_id that is set if file.type_id == 1?
Another option could be a new is_executable field.
How do you think library files (static or dynamic linking) should be handled?
Would checking mime_type against common executable types (e.g., based on file/magic) be satisfactory?
E.g.,
- application/x-mach-binary (Apple Mach-O)
- application/x-executable, application/x-pie-executable, application/x-sharedlib (ELF)
- application/x-dosexec (DOS PE)
- application/vnd.microsoft.portable-executable (Windows PE)
I suggest mime_type because (1) it is already there, and (2) whether a file is executable is sometimes ambiguous. For example, the DMG format mentioned is not actually an executable format; it is a disk image container file that the macOS Disk Image Mounter opens. However, it is an installer format and is often of security interest in the same way executable files are. It may be better for the query to specify exactly which file types it is interested in.
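As a rough sketch of what the mime_type approach asks of a consumer, the check below tests a value against the types listed above. The function name `looks_executable` is hypothetical, and the set is only the list from this comment, not an exhaustive or official one.

```python
# MIME types copied from the list in this comment; illustrative, not exhaustive.
EXECUTABLE_MIME_TYPES = frozenset({
    "application/x-mach-binary",                      # Apple Mach-O
    "application/x-executable",                       # ELF
    "application/x-pie-executable",                   # ELF
    "application/x-sharedlib",                        # ELF
    "application/x-dosexec",                          # DOS PE
    "application/vnd.microsoft.portable-executable",  # Windows PE
})


def looks_executable(mime_type):
    """Return True if mime_type is one of the listed executable types."""
    if not mime_type:
        return False
    # Drop any parameters (e.g. "; charset=binary") and normalise case.
    base = mime_type.split(";", 1)[0].strip().lower()
    return base in EXECUTABLE_MIME_TYPES
```

A disk image type such as application/x-apple-diskimage is deliberately absent from the set, which mirrors the ambiguity described above: the caller has to decide which types count.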
@antchan2 - Just changed to MACHO, thanks for catching that.
@mlmitch
I want to focus the ask on extending the Enum to type_id = 8 and corresponding type_name = Executable.
This is better for the existing developer experience: the enum can be passed around, my local OCSF clients need minimal code changes, and the type system enforces the constraint through type_id.
Additionally, I don't want to extend existing Parquet files or columns to accommodate a new field when we already have an enum that serves this purpose efficiently. For example, if I cap my Parquet files at 300 MB, a new column for mime_type consumes part of that budget unnecessarily.
@mlmitch - regarding your good question on linking (static/dynamic): I think this ask is about what the file is, whereas linking feels like how it gets loaded for later execution. Along those lines, activity types such as memory activity feel closer to that discussion. I'm happy to chime in on a separate GH discussion, and maybe that could lead to a new ask via issues here. WDYT?
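To illustrate the "type system enforces the constraint" point, here is a minimal sketch of how a client might model the enum. The class name `FileTypeId` and helper `type_name` are hypothetical, not part of any OCSF client library, and the value 8 is the one proposed in this issue rather than one present in the enum quoted above.

```python
from enum import IntEnum


class FileTypeId(IntEnum):
    """Client-side mirror of file.type_id (names are illustrative)."""
    UNKNOWN = 0
    REGULAR_FILE = 1
    FOLDER = 2
    CHARACTER_DEVICE = 3
    BLOCK_DEVICE = 4
    LOCAL_SOCKET = 5
    NAMED_PIPE = 6
    SYMBOLIC_LINK = 7
    EXECUTABLE = 8  # proposed in this issue; an assumption until merged upstream
    OTHER = 99


def type_name(type_id: int) -> str:
    """Derive the sibling caption from the enum member; unknown IDs raise ValueError."""
    return FileTypeId(type_id).name.replace("_", " ").title()
```

With this shape, adding the new value is a one-line change on the client, and any ID outside the enum is rejected when the record is constructed rather than at query time.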
On the linking stuff, I mean how do we treat library files that aren't directly executable but still contain executable code?
@mlmitch
It is an executable, whether it is a .so, .dll, .rlib, or object file.
@antchan2 on the use of mime_type, I agree that it is a feasible solution.
I have two high-level thoughts on the matter.
First, "Regular File" is a ridiculous file.type_id value (no offence to whoever made it - I get what you're going for) and I would like to fix it.
Second, I think mime_type will end up with a poor experience when it comes to dataset querying.
Considering the ask in this issue, a dataset user wants to ask for executable files. Doing so with mime_type requires querying with several string equality conditions and it relies on the datasource correctly setting / providing the type information. When we consider mapping existing endpoint data, that information isn't often reported, but certain events will imply that the file is executable (e.g. library load, file execution). So I think it will be easier for the community at large to set file.type_id.
We don't need to boil the ocean on this either. We can leave "Regular File" as a sort of fallback/default and add in additional file types as the need arises, starting with "Executable".
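The querying difference described above can be sketched with an in-memory SQLite table. The column names follow the example query in the issue, `ocsf_file_mime_type` is a hypothetical column, the rows are made-up sample data, and type_id 8 is the proposed value rather than an existing one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE some_table ("
    "file_name TEXT, ocsf_file_type_id INTEGER, ocsf_file_mime_type TEXT)"
)
rows = [
    ("calc.exe", 8, "application/vnd.microsoft.portable-executable"),
    # A library-load event implies executable content, but no MIME type was reported.
    ("libfoo.so", 8, None),
    ("notes.txt", 1, "text/plain"),
]
conn.executemany("INSERT INTO some_table VALUES (?, ?, ?)", rows)

# Single predicate on the enum column.
by_type_id = [
    r[0]
    for r in conn.execute(
        "SELECT file_name FROM some_table "
        "WHERE ocsf_file_type_id = 8 ORDER BY file_name"
    )
]

# Equivalent intent via mime_type: one string comparison per known type.
mime_types = (
    "application/x-mach-binary",
    "application/x-executable",
    "application/x-pie-executable",
    "application/x-sharedlib",
    "application/x-dosexec",
    "application/vnd.microsoft.portable-executable",
)
placeholders = ",".join("?" * len(mime_types))
by_mime = [
    r[0]
    for r in conn.execute(
        "SELECT file_name FROM some_table "
        f"WHERE ocsf_file_mime_type IN ({placeholders}) ORDER BY file_name",
        mime_types,
    )
]
```

In this sample, the mime_type query misses the row whose source never reported a MIME type, which is the mapping concern raised above.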
@mlmitch:
> "Regular File" is a ridiculous file.type_id value
OCSF's current file.type_id definition appears to be based on the Unix definition of a file. Through that lens, "regular file" and the other values may make more sense as they map directly to the File Type (S_IFMT) bit definitions in POSIX file stat:
S_IFMT: Type of file.
    S_IFBLK: Block special.
    S_IFCHR: Character special.
    S_IFIFO: FIFO special.
    S_IFREG: Regular.
    S_IFDIR: Directory.
    S_IFLNK: Symbolic link.
    S_IFSOCK: Socket.
That said, I completely understand that when we talk about file type in a cybersecurity context, we are typically referring to the file content type, so I can relate to the surprise.
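To make the correspondence concrete, here is a minimal sketch that maps a POSIX st_mode to the current file.type_id values using Python's standard `stat` module. The pairing of S_IFSOCK with "Local Socket" and S_IFIFO with "Named Pipe" is my reading of the captions, and falling back to 0 (Unknown) is an assumption; the function name is hypothetical.

```python
import stat

# S_IFMT file-type bits -> file.type_id, as read from the enum captions above.
_MODE_TO_TYPE_ID = {
    stat.S_IFREG: 1,   # Regular File
    stat.S_IFDIR: 2,   # Folder
    stat.S_IFCHR: 3,   # Character Device
    stat.S_IFBLK: 4,   # Block Device
    stat.S_IFSOCK: 5,  # Local Socket
    stat.S_IFIFO: 6,   # Named Pipe
    stat.S_IFLNK: 7,   # Symbolic Link
}


def ocsf_file_type_id(mode: int) -> int:
    """Map a POSIX st_mode to a type_id; unrecognised types fall back to 0 (Unknown)."""
    return _MODE_TO_TYPE_ID.get(stat.S_IFMT(mode), 0)
```

For a real path, the mode would come from `os.lstat(path).st_mode` (lstat rather than stat, so that symbolic links are reported as links). Note that nothing in the mode's type bits distinguishes an ELF binary from a text file: both are S_IFREG.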
> mime_type will end up with a poor experience when it comes to dataset querying... add in additional file types as the need arises, starting with "Executable".
If mime_type is too heavy for this case, does adding an optional boolean is_executable provide an equally good dataset querying experience as a new type_id value?
If so, that may be more attractive, as there is precedent for this approach (e.g., is_encrypted, is_system) and it would not disrupt an existing required field where all defined values are mutually exclusive and do not require judgement.
Well I've certainly got some egg on my face. TIL (or today I relearned) regarding the POSIX file stat.
POSIX file stat and Unix's interest in regular files are definitely under the hood. Glad pointing out the possible origin was helpful.
@antchan2 @mlmitch
POSIX or not, I don't believe OCSF aims at describing OS internal structures (😀); I thought OCSF was a schema for expressing the cyber vocabulary. There's no such thing as a Regular File.
Do we see this ask as something that can be applied in OCSF?
To be crystal clear, my ask is to extend the existing enum with TypeId == 8 and a corresponding TypeName == “Executable”.
Regular File aside (we can't change the label without making a breaking change, because of the convention of populating the sibling attribute with the label), we do not want to add an executable file type to the enum list. I would suggest it is not strictly a file type but an attribute of the file, and a file can have a combination of attributes, as evidenced by its bitmask. For example, a file can also be read-only, and we don't want a file type for each combination of attributes of this sort.
I suggest we add an is_executable boolean field that can be applied along with the file type, and we should likely add an is_readonly boolean as well. Today we have is_encrypted and is_system as well as is_deleted, so this would be consistent with that approach. Note that there is also an attributes integer attribute for flags, but admittedly it is too cumbersome for normal query predicates and it isn't standardized (by us, anyway).
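A minimal sketch of the boolean-attribute shape, assuming a simplified record: `FileRecord` is a hypothetical class, `is_executable` and `is_readonly` are the fields proposed in this comment rather than existing ones, and the rows are made-up sample data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileRecord:
    """Simplified stand-in for a file object; not the real OCSF definition."""
    name: str
    type_id: int                          # 1 == Regular File
    is_executable: Optional[bool] = None  # proposed here, hypothetical
    is_readonly: Optional[bool] = None    # proposed here, hypothetical
    is_system: Optional[bool] = None      # already exists, per the comment


files = [
    FileRecord("kernel32.dll", 1, is_executable=True, is_readonly=True, is_system=True),
    FileRecord("tool.exe", 1, is_executable=True, is_readonly=False),
    FileRecord("notes.txt", 1, is_executable=False, is_readonly=True),
]

# Attributes combine freely without a dedicated enum value per combination.
readonly_executables = [
    f.name for f in files if f.type_id == 1 and f.is_executable and f.is_readonly
]
```

The trade-off is the one debated in this thread: each attribute is an extra field (and column) instead of a new value in the existing one.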
@pagbabian-splunk - I am not asking for executables based on bitmask values, such as the attribute set by chmod +x foo.
I also don't want is_executable, since that forces me to add a new column of bloated truthy/falsy values to my Parquet tables.
I was hoping, given that I am using type_id today, to get a new number and type_name. My logic stays the same, my Parquet tables don't expand with new columns, and my data retrieval logic stays the same.
I am already using this locally with my own extension; I thought this ask could result in an upstream change so that I don't deviate from the schema too much. Worst case, I can continue to use my local implementation of the enum with type_id == 8.
All the suggestions and thoughts are noble and can serve other use cases; for me, the desired efficiencies are in that enum value extension.
I don't want to complicate this or ask for too much effort; we can close this issue as won't fix.
Thank you @dfirence. You have brought up an important gap in this object, and we will still need to address it. I now understand your request, given your existing schema.
@dfirence, I have been thinking about this some more, and given your desire to keep this to a single field or column, I think there is a good rationale for distinguishing a Regular File from an Executable File, since the construction of those files is different (taking advantage of the perhaps odd-sounding "Regular" adjective, as mentioned).
I'm going to re-open your issue for now, create a PR with an additional type_id, and add you as a reviewer.
@dfirence Could this be closed out since the PR #1438 was merged?