ocsf-schema
What is the value of Account Type?
In order to promote interoperability, OCSF must define a "schema", not just a "schema framework". The data that goes into logging information must be defined across vendors, not just "captioned".
Consider dictionary.json:
"account_type": {
"caption": "Account Type ID",
"description": "The user account type (e.g. AWS, LDAP, Windows account, etc.).",
"type": "string_t"
},
"account_type_id": {
"caption": "Account Type ID",
"description": "The user account type identifier (e.g. AWS, LDAP, Windows account, etc.).",
"enum": {
"-1": {
"caption": "Other",
"description": "The user account type is not mapped."
},
"0": {
"caption": "Unknown",
"description": "The user account type is unknown."
},
"1": {
"caption": "LDAP Account"
},
"2": {
"caption": "Windows Account"
},
"3": {
"caption": "AWS IAM Account"
},
"4": {
"caption": "GCP Account"
},
"5": {
"caption": "Azure AD Account"
}
},
"type": "integer_t"
},
This is a framework for an enumeration, but OCSF defines no values for the string_t attribute "account_type". An information model (abstract schema) does define enumerations:
ID | Name | Description |
---|---|---|
-1 | ? | Other: The user account type is not mapped. |
0 | ? | Unknown: The user account type is unknown. |
1 | ? | LDAP Account: |
2 | ? | Windows Account: |
3 | ? | AWS IAM Account: |
4 | ? | GCP Account: |
5 | ? | Azure AD Account: |
The name column (the string_t account_type) is undefined, which means that for Splunk logs, for example, OCSF provides no guidance on what to write:
<TS> phonenumber=333-444-4444, app=angrybirds, installdate=xx/xx/xx, acct=Windows Account
<TS> phonenumber=333-444-4444, app=facebook, installdate=yy/yy/yy, acct=Azure AD Account
Using captions might work for comma-separated data fields (assuming captions prohibit commas), but it definitely will not work for space-separated data:
<TS>
USER ACCT PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
Root Windows Account 41 21.9 1.7 3233968 143624 ?? Rs 7Jul11 48:09.67 /System/Library/foo
Rdas Azure AD Account 790 4.5 0.4 4924432 32324 ?? S 8Jul11 9:00.57 /System/Library/baz
Enumeration names enable interchangeable logging data:
ID | Name | Description |
---|---|---|
-1 | other | Other: The user account type is not mapped. |
0 | unknown | Unknown: The user account type is unknown. |
1 | ldap | LDAP Account: |
2 | windows | Windows Account: |
3 | aws_iam | AWS IAM Account: |
4 | gcp | GCP Account: |
5 | azure_ad | Azure AD Account: |
With these names, the space-separated listing parses unambiguously:
<TS>
USER ACCT PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
Root windows 41 21.9 1.7 3233968 143624 ?? Rs 7Jul11 48:09.67 /System/Library/foo
Rdas azure_ad 790 4.5 0.4 4924432 32324 ?? S 8Jul11 9:00.57 /System/Library/baz
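The difference between the two listings above can be checked with a minimal sketch. It assumes the consumer splits each row on whitespace and pairs the fields with the header columns; `parse_row` is a hypothetical helper written for this illustration, not part of OCSF or Splunk.

```python
HEADER = "USER ACCT PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND"

# Rows copied from the two listings in this thread.
CAPTION_ROW = "Root Windows Account 41 21.9 1.7 3233968 143624 ?? Rs 7Jul11 48:09.67 /System/Library/foo"
TOKEN_ROW = "Root windows 41 21.9 1.7 3233968 143624 ?? Rs 7Jul11 48:09.67 /System/Library/foo"


def parse_row(header: str, row: str) -> dict:
    """Pair whitespace-separated header names with whitespace-separated values."""
    columns = header.split()
    values = row.split()
    if len(values) != len(columns):
        # A caption containing spaces yields extra fields and lands here.
        raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
    return dict(zip(columns, values))
```

Under that assumption, `parse_row(HEADER, TOKEN_ROW)["ACCT"]` yields `windows`, while `parse_row(HEADER, CAPTION_ROW)` raises `ValueError` because the caption contributes two fields instead of one.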
Defining enumerated strings is the rationale for formatting "enum" entries with both a property name and an integer id, as proposed in Issue #214:
"account_type_id": {
"caption": "Account Type ID",
"name": "AccountType",
"description": "The user account type identifier (e.g. AWS, LDAP, Windows account, etc.).",
"enum": {
"other": {
"caption": "Other",
"description": "The user account type is not mapped.",
"id": -1
},
...
An example schema containing just Enumerated data types defined in the OCSF enums folder is available here. The OCSF files could easily be updated to define both datatype names and property names.
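As a rough sketch of how a consumer might read the proposed shape, the snippet below builds name-to-id and id-to-name lookups. The JSON fragment is a hypothetical completion written for this illustration (three entries only, names taken from the table earlier in this thread), and `build_lookups` is an assumed helper, not an existing OCSF tool.

```python
import json

# Hypothetical fragment in the proposed shape: enum keys are names and the
# integer moves into an "id" member. Only three entries are shown.
PROPOSED = """
{
  "account_type_id": {
    "caption": "Account Type ID",
    "name": "AccountType",
    "enum": {
      "other": {"caption": "Other", "id": -1},
      "unknown": {"caption": "Unknown", "id": 0},
      "windows": {"caption": "Windows Account", "id": 2}
    },
    "type": "integer_t"
  }
}
"""


def build_lookups(attribute: dict) -> tuple[dict, dict]:
    """Return (name -> id, id -> name), rejecting duplicate ids."""
    name_to_id = {name: entry["id"] for name, entry in attribute["enum"].items()}
    id_to_name = {}
    for name, value in name_to_id.items():
        if value in id_to_name:
            raise ValueError(f"id {value} used by both {id_to_name[value]!r} and {name!r}")
        id_to_name[value] = name
    return name_to_id, id_to_name


attribute = json.loads(PROPOSED)["account_type_id"]
name_to_id, id_to_name = build_lookups(attribute)
```

Rejecting duplicate ids keeps the mapping invertible, so a producer can emit either the name or the integer and a consumer can recover the other.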
The text-based enum values must be translated to the integer values; otherwise it will be very confusing to have two sets of values that represent the same thing. The caption is just the user-friendly name of the integer value.
An enumeration is a 1:1 equivalence between a text string and an integer. The mapping goes both ways, just as with C preprocessor constants:
#define O_RDONLY 00000000 /* Read Only */
#define O_WRONLY 00000001 /* Write Only */
#define O_RDWR 00000002 /* Read and Write */
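The same two-way equivalence can be sketched in Python with `enum.IntEnum`. The member names below are the proposed names from the table earlier in this thread; they are a proposal under discussion, not values OCSF defines.

```python
from enum import IntEnum


class AccountType(IntEnum):
    # Names follow the proposed table in this thread; ids match dictionary.json.
    other = -1
    unknown = 0
    ldap = 1
    windows = 2
    aws_iam = 3
    gcp = 4
    azure_ad = 5


# Integer -> name and name -> integer are both direct lookups.
name_for_5 = AccountType(5).name
id_for_aws_iam = AccountType["aws_iam"].value
```

Here `name_for_5` is `azure_ad` and `id_for_aws_iam` is `3`; every member round-trips through both lookups.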
The problem with a caption is that it is not an identifier, which is why it doesn't work in the Splunk log example shown above or as the name in a #define. Captions exist in the natural-language space; text identifiers are in the human-readable computer-language space.
"Read and Write" and "Azure AD Account" are natural-language captions in unrestricted strings; O_RDWR and azure_ad are text identifiers with a defined lexical form.
Correct: the enum value is the identifier, and the caption is a user-friendly name for the integer value.
Regarding the example above, the raw values found in the logs must be translated to the OCSF enum values. Otherwise, depending on who logged the data, different values may represent the same data.
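A minimal sketch of that translation step might look like the following. The raw spellings in `RAW_TO_ID` are hypothetical examples of what different producers could log, and `normalize_account_type` is an assumed helper; only the integer ids and the meanings of -1 (not mapped) and 0 (unknown) come from dictionary.json.

```python
OTHER_ID = -1    # "The user account type is not mapped."
UNKNOWN_ID = 0   # "The user account type is unknown."

# Hypothetical raw spellings, keyed in lowercase, mapped to OCSF ids.
RAW_TO_ID = {
    "windows": 2,
    "windows account": 2,
    "azure_ad": 5,
    "azuread": 5,
    "azure ad account": 5,
}


def normalize_account_type(raw):
    """Translate a raw log value to an account_type_id integer."""
    if raw is None or not raw.strip():
        return UNKNOWN_ID  # nothing was logged
    # Anything logged but not recognized falls back to Other.
    return RAW_TO_ID.get(raw.strip().lower(), OTHER_ID)
```

With this table, `"Windows Account"` and `"windows"` both normalize to `2`, so events from different producers compare equal on `account_type_id`.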
The caption represents the enum string value. I don't see an issue with the current enum definition.
I believe the issue is that Caption is not considered a discrete value, or token, for example in a switch statement. If we want to have dual mode enums (integers <-> string token) we would need to add the token to the enum definitions and the caption would never be used to populate an event, it would only be for documentation. If there is a desire for the dual mode enum, e.g. because a token might be easier to remember for a consumer doing an ad hoc query, we would need to go through every enum and assign a (memorable, consistent) token.
String enum siblings have been addressed in #450.