
parse_xml issue

Open juju4 opened this issue 9 months ago • 11 comments

Which OpenObserve functionalities are the source of the bug?

functions

Is this a regression?

Yes

Description

Not sure if a regression (no option "Don't know")

I'm trying to reformat sysmonforlinux logs, ingested into OpenObserve with otel-collector, using parse_xml, but it is not working. Currently on v0.14.5-rc2.

Sample log

{"_timestamp":1740278298956019,"body___cursor":"s=1f1c1569ecaf4433b2af2adb4ecd7ab8;i=3631dffc;b=a8b664d9874f45a0ae605b44aab17725;m=33e41b94a6;t=62ec61dd1d0f3;x=a6c989aa8bba9fb6","body___monotonic_timestamp":"222870344870","body__boot_id":"a8b664d9874f45a0ae605b44aab17725","body__cap_effective":"1c3a1a061d3","body__cmdline":"/opt/sysmon/sysmon -i /opt/sysmon/config.xml -service","body__comm":"sysmon","body__exe":"/opt/sysmon/sysmon","body__gid":"0","body__hostname":"myhostname","body__machine_id":"f7e6787db2d84830854e33af0a1338b8","body__pid":"26407","body__selinux_context":"unconfined\n","body__source_realtime_timestamp":"1740278298953418","body__systemd_cgroup":"/system.slice/sysmon.service","body__systemd_invocation_id":"b8bf958180274ce7a503e5251ec15954","body__systemd_slice":"system.slice","body__systemd_unit":"sysmon.service","body__transport":"syslog","body__uid":"0","body_message":"<Event><System><Provider Name=\"Linux-Sysmon\" Guid=\"{ff032593-a8d3-4f13-b0d6-01fc615a0f97}\"/><EventID>1</EventID><Version>5</Version><Level>4</Level><Task>1</Task><Opcode>0</Opcode><Keywords>0x8000000000000000</Keywords><TimeCreated SystemTime=\"2025-02-23T02:38:18.953287000Z\"/><EventRecordID>22924914</EventRecordID><Correlation/><Execution ProcessID=\"26407\" ThreadID=\"26407\"/><Channel>Linux-Sysmon/Operational</Channel><Computer>myhostname</Computer><Security UserId=\"0\"/></System><EventData><Data Name=\"RuleName\">-</Data><Data Name=\"UtcTime\">2025-02-23 02:38:18.961</Data><Data Name=\"ProcessGuid\">{f7e6787d-8a1a-67ba-306e-20d452560000}</Data><Data Name=\"ProcessId\">37605</Data><Data Name=\"Image\">/bin/bin/sleep</Data><Data Name=\"FileVersion\">-</Data><Data Name=\"Description\">-</Data><Data Name=\"Product\">-</Data><Data Name=\"Company\">-</Data><Data Name=\"OriginalFileName\">-</Data><Data Name=\"CommandLine\">sleep 1</Data><Data Name=\"CurrentDirectory\">/path/to/current/dir</Data><Data Name=\"User\">root</Data><Data 
Name=\"LogonGuid\">{f7e6787d-0000-0000-0000-000000000000}</Data><Data Name=\"LogonId\">0</Data><Data Name=\"TerminalSessionId\">4294967295</Data><Data Name=\"IntegrityLevel\">no level</Data><Data Name=\"Hashes\">-</Data><Data Name=\"ParentProcessGuid\">{00000000-0000-0000-0000-000000000000}</Data><Data Name=\"ParentProcessId\">880</Data><Data Name=\"ParentImage\">-</Data><Data Name=\"ParentCommandLine\">-</Data><Data Name=\"ParentUser\">-</Data></EventData></Event>","body_priority":"6","body_syslog_facility":"1","body_syslog_identifier":"sysmon","body_syslog_timestamp":"Feb 23 02:38:18 ","dropped_attributes_count":0,"host_name":"myhostname","os_type":"linux","severity":0}

$ echo "<Event><System><Provider Name=\"Linux-Sysmon\" Guid=\"{ff032593-a8d3-4f13-b0d6-01fc615a0f97}\"/><EventID>1</EventID><Version>5</Version><Level>4</Level><Task>1</Task><Opcode>0</Opcode><Keywords>0x8000000000000000</Keywords><TimeCreated SystemTime=\"2025-02-23T02:38:18.953287000Z\"/><EventRecordID>22924914</EventRecordID><Correlation/><Execution ProcessID=\"26407\" ThreadID=\"26407\"/><Channel>Linux-Sysmon/Operational</Channel><Computer>myhostname</Computer><Security UserId=\"0\"/></System><EventData><Data Name=\"RuleName\">-</Data><Data Name=\"UtcTime\">2025-02-23 02:38:18.961</Data><Data Name=\"ProcessGuid\">{f7e6787d-8a1a-67ba-306e-20d452560000}</Data><Data Name=\"ProcessId\">37605</Data><Data Name=\"Image\">/bin/bin/sleep</Data><Data Name=\"FileVersion\">-</Data><Data Name=\"Description\">-</Data><Data Name=\"Product\">-</Data><Data Name=\"Company\">-</Data><Data Name=\"OriginalFileName\">-</Data><Data Name=\"CommandLine\">sleep 1</Data><Data Name=\"CurrentDirectory\">/path/to/current/dir</Data><Data Name=\"User\">root</Data><Data Name=\"LogonGuid\">{f7e6787d-0000-0000-0000-000000000000}</Data><Data Name=\"LogonId\">0</Data><Data Name=\"TerminalSessionId\">4294967295</Data><Data Name=\"IntegrityLevel\">no level</Data><Data Name=\"Hashes\">-</Data><Data Name=\"ParentProcessGuid\">{00000000-0000-0000-0000-000000000000}</Data><Data Name=\"ParentProcessId\">880</Data><Data Name=\"ParentImage\">-</Data><Data Name=\"ParentCommandLine\">-</Data><Data Name=\"ParentUser\">-</Data></EventData></Event>" | xmllint --format -
<?xml version="1.0"?>
<Event>
  <System>
    <Provider Name="Linux-Sysmon" Guid="{ff032593-a8d3-4f13-b0d6-01fc615a0f97}"/>
    <EventID>1</EventID>
    <Version>5</Version>
    <Level>4</Level>
    <Task>1</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="2025-02-23T02:38:18.953287000Z"/>
    <EventRecordID>22924914</EventRecordID>
    <Correlation/>
    <Execution ProcessID="26407" ThreadID="26407"/>
    <Channel>Linux-Sysmon/Operational</Channel>
    <Computer>myhostname</Computer>
    <Security UserId="0"/>
  </System>
  <EventData>
    <Data Name="RuleName">-</Data>
    <Data Name="UtcTime">2025-02-23 02:38:18.961</Data>
    <Data Name="ProcessGuid">{f7e6787d-8a1a-67ba-306e-20d452560000}</Data>
    <Data Name="ProcessId">37605</Data>
    <Data Name="Image">/bin/bin/sleep</Data>
    <Data Name="FileVersion">-</Data>
    <Data Name="Description">-</Data>
    <Data Name="Product">-</Data>
    <Data Name="Company">-</Data>
    <Data Name="OriginalFileName">-</Data>
    <Data Name="CommandLine">sleep 1</Data>
    <Data Name="CurrentDirectory">/path/to/current/dir</Data>
    <Data Name="User">root</Data>
    <Data Name="LogonGuid">{f7e6787d-0000-0000-0000-000000000000}</Data>
    <Data Name="LogonId">0</Data>
    <Data Name="TerminalSessionId">4294967295</Data>
    <Data Name="IntegrityLevel">no level</Data>
    <Data Name="Hashes">-</Data>
    <Data Name="ParentProcessGuid">{00000000-0000-0000-0000-000000000000}</Data>
    <Data Name="ParentProcessId">880</Data>
    <Data Name="ParentImage">-</Data>
    <Data Name="ParentCommandLine">-</Data>
    <Data Name="ParentUser">-</Data>
  </EventData>
</Event>

Current VRL

ret1 = parse_xml!(.body_message)
. = merge!(., ret1)
del(.body_message)
del(.body___cursor)
del(.body___monotonic_timestamp)
del(.body__boot_id)
del(.body__cap_effective)
del(.body__cmdline)
del(.body__exe)
del(.body__gid)
del(.body__hostname)
del(.body__pid)
del(.body__systemd_cgroup)
del(.body__systemd_slice)
del(.body__systemd_unit)
del(.body__runtime_scope)
del(.body__selinux_context)
del(.body__source_realtime_timestamp)
del(.body__systemd_invocation_id)
del(.body__transport)
del(.body__uid)
del(.body_priority)
del(.body_syslog_facility)
del(.body_syslog_identifier) 
del(.body_syslog_pid)
del(.dropped_attributes_count) 
del(.severity)
.

My current result:

    {
        "_timestamp": 1740278298956019,
        "body__comm": "sysmon",
        "body__machine_id": "f7e6787db2d84830854e33af0a1338b8",
        "body_syslog_timestamp": "Feb 23 02:38:18 ",
        "event_eventdata_data": "[{\"@Name\":\"RuleName\",\"text\":\"-\"},{\"@Name\":\"UtcTime\",\"text\":\"2025-02-23 02:38:18.961\"},{\"@Name\":\"ProcessGuid\",\"text\":\"{f7e6787d-8a1a-67ba-306e-20d452560000}\"},{\"@Name\":\"ProcessId\",\"text\":37605},{\"@Name\":\"Image\",\"text\":\"/bin/bin/sleep\"},{\"@Name\":\"FileVersion\",\"text\":\"-\"},{\"@Name\":\"Description\",\"text\":\"-\"},{\"@Name\":\"Product\",\"text\":\"-\"},{\"@Name\":\"Company\",\"text\":\"-\"},{\"@Name\":\"OriginalFileName\",\"text\":\"-\"},{\"@Name\":\"CommandLine\",\"text\":\"sleep 1\"},{\"@Name\":\"CurrentDirectory\",\"text\":\"/path/to/current/dir\"},{\"@Name\":\"User\",\"text\":\"root\"},{\"@Name\":\"LogonGuid\",\"text\":\"{f7e6787d-0000-0000-0000-000000000000}\"},{\"@Name\":\"LogonId\",\"text\":0},{\"@Name\":\"TerminalSessionId\",\"text\":4294967295},{\"@Name\":\"IntegrityLevel\",\"text\":\"no level\"},{\"@Name\":\"Hashes\",\"text\":\"-\"},{\"@Name\":\"ParentProcessGuid\",\"text\":\"{00000000-0000-0000-0000-000000000000}\"},{\"@Name\":\"ParentProcessId\",\"text\":880},{\"@Name\":\"ParentImage\",\"text\":\"-\"},{\"@Name\":\"ParentCommandLine\",\"text\":\"-\"},{\"@Name\":\"ParentUser\",\"text\":\"-\"}]",
        "event_system_channel": "Linux-Sysmon/Operational",
        "event_system_computer": "myhostname",
        "event_system_eventid": 1,
        "event_system_eventrecordid": 22924914,
        "event_system_execution__processid": "26407",
        "event_system_execution__threadid": "26407",
        "event_system_keywords": "0x8000000000000000",
        "event_system_level": 4,
        "event_system_opcode": 0,
        "event_system_provider__guid": "{ff032593-a8d3-4f13-b0d6-01fc615a0f97}",
        "event_system_provider__name": "Linux-Sysmon",
        "event_system_security__userid": "0",
        "event_system_task": 1,
        "event_system_timecreated__systemtime": "2025-02-23T02:38:18.953287000Z",
        "event_system_version": 5,
        "host_name": "myhostname",
        "os_type": "linux"
    }

Expected result from docs https://vector.dev/docs/reference/vrl/functions/#parse_xml

    {
        "_timestamp": 1740278298956019,
        "body__comm": "sysmon",
        "body__machine_id": "f7e6787db2d84830854e33af0a1338b8",
        "body_syslog_timestamp": "Feb 23 02:38:18 ",
        "event_eventdata_data": {
             "@RuleName": "-",
             "@UtcTime": "2025-02-23 02:38:18.961",
             "@Image": "/bin/bin/sleep",
             ...
        },
        "event_system_channel": "Linux-Sysmon/Operational",
        "event_system_computer": "myhostname",
        "event_system_eventid": 1,
        "event_system_eventrecordid": 22924914,
        "event_system_execution__processid": "26407",
        "event_system_execution__threadid": "26407",
        "event_system_keywords": "0x8000000000000000",
        "event_system_level": 4,
        "event_system_opcode": 0,
        "event_system_provider__guid": "{ff032593-a8d3-4f13-b0d6-01fc615a0f97}",
        "event_system_provider__name": "Linux-Sysmon",
        "event_system_security__userid": "0",
        "event_system_task": 1,
        "event_system_timecreated__systemtime": "2025-02-23T02:38:18.953287000Z",
        "event_system_version": 5,
        "host_name": "myhostname",
        "os_type": "linux"
    }

See https://github.com/vectordotdev/vrl/discussions/1287

Please provide a link to a minimal reproduction of the bug

No response

Please provide the exception or error you saw


Please provide the version you discovered this bug in (check about page for version information)

v0.14.5-rc2

Anything else?

OpenObserve's vrl is 0.19.0 per https://github.com/openobserve/openobserve/blob/main/Cargo.toml#L374. No changes to parse_xml are listed after 0.19.0 per https://github.com/vectordotdev/vrl/blob/main/CHANGELOG.md. parse_xml works as expected in the VRL playground (Vector Version: 02343258, VRL Version: 0.22.0).

juju4 avatar Mar 09 '25 19:03 juju4

@juju4 In 0.14.5-rc4 vrl was bumped to 0.22.0, try that one. Although I think your issue is related to how O2 flattens objects. It doesn't convert arrays; instead they become one long string. This also happens with Elastic Beats data.

Ping @hengfeiyang

gaby avatar Mar 22 '25 17:03 gaby

More like this is a limitation of DataFusion. There's a contrib package that helps with handling JSON in DataFusion. It's made by the pydantic folks. It would need to be added into O2 though.

https://github.com/datafusion-contrib/datafusion-functions-json

gaby avatar Mar 22 '25 17:03 gaby

I updated to 0.14.5-rc4 and it does not change anything. Yes, that's an array issue. Just to check, I tried increasing the flattening level (3 to 5), but that does not help. Is there a way to match the VRL playground behavior, which works?

As a workaround, I also tried parse_regex on the string, but it fails...

# error[E101]: invalid regular expression
ret2 = parse_regex!(.event_eventdata_data, r'{"@Name":"User","text":"(?P<user>.*)"}')
ret2 = parse_regex!(.event_eventdata_data, r'{\\"@Name":\\"User\\",\\"text\\":\\"(?P<user>.*)\\"}')
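
The E101 error is probably caused by the unescaped `{` and `}`: the Rust regex engine behind VRL reads braces as repetition syntax, so literal braces need a backslash. A minimal, untested sketch, assuming the field reaches the function as the JSON string shown in the result above (that is, the `\"` in the display is JSON escaping and not part of the value):

```
# Assumption: .event_eventdata_data holds the raw JSON text of the array.
# Literal braces are escaped; [^"]* keeps the capture inside a single value.
ret2 = parse_regex!(.event_eventdata_data, r'\{"@Name":"User","text":"(?P<user>[^"]*)"\}')
.user = ret2.user
.
```

Entries whose `text` is numeric (for example ProcessId) are not quoted in the string, so they would need a pattern without the quotes around the captured value.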

juju4 avatar Mar 23 '25 20:03 juju4

@juju4 it's not a vrl limitation, it's a DataFusion limitation like I said above. OpenObserve would need to add support for https://github.com/datafusion-contrib/datafusion-functions-json

gaby avatar Mar 23 '25 22:03 gaby

@gaby FYI, we have had these JSON functions since v0.14.2.

hengfeiyang avatar Mar 24 '25 03:03 hengfeiyang

@hengfeiyang That's awesome. Is there any way to have O2 deal with array fields? When ingested by O2 they look like one long string instead, which makes them impossible to use.

gaby avatar Mar 24 '25 03:03 gaby

@juju4

"event_eventdata_data": "[{\"@Name\":\"RuleName\",\"text\":\"-\"},{\"@Name\":\"UtcTime\",\"text\":\"2025-02-23 02:38:18.961\"},{\"@Name\":\"ProcessGuid\",\"text\":\"{f7e6787d-8a1a-67ba-306e-20d452560000}\"},{\"@Name\":\"ProcessId\",\"text\":37605},{\"@Name\":\"Image\",\"text\":\"/bin/bin/sleep\"},{\"@Name\":\"FileVersion\",\"text\":\"-\"},{\"@Name\":\"Description\",\"text\":\"-\"},{\"@Name\":\"Product\",\"text\":\"-\"},{\"@Name\":\"Company\",\"text\":\"-\"},{\"@Name\":\"OriginalFileName\",\"text\":\"-\"},{\"@Name\":\"CommandLine\",\"text\":\"sleep 1\"},{\"@Name\":\"CurrentDirectory\",\"text\":\"/path/to/current/dir\"},{\"@Name\":\"User\",\"text\":\"root\"},{\"@Name\":\"LogonGuid\",\"text\":\"{f7e6787d-0000-0000-0000-000000000000}\"},{\"@Name\":\"LogonId\",\"text\":0},{\"@Name\":\"TerminalSessionId\",\"text\":4294967295},{\"@Name\":\"IntegrityLevel\",\"text\":\"no level\"},{\"@Name\":\"Hashes\",\"text\":\"-\"},{\"@Name\":\"ParentProcessGuid\",\"text\":\"{00000000-0000-0000-0000-000000000000}\"},{\"@Name\":\"ParentProcessId\",\"text\":880},{\"@Name\":\"ParentImage\",\"text\":\"-\"},{\"@Name\":\"ParentCommandLine\",\"text\":\"-\"},{\"@Name\":\"ParentUser\",\"text\":\"-\"}]",

This is an array; it can't be converted to an object.

"event_eventdata_data": {
             "@RuleName": "-",
             "@UtcTime": "2025-02-23 02:38:18.961",
             "@Image": "/bin/bin/sleep",
             ...
        }

If you need to convert it to an object, maybe you need to set the value to the first element of the array; then it will be an object like you expect.

hengfeiyang avatar Mar 24 '25 03:03 hengfeiyang

An array is just an array. There is no good solution at ingestion because we don't know how many elements the array has, so we can't flatten it; doing so would create a lot of meaningless fields. Maybe you can try https://github.com/datafusion-contrib/datafusion-functions-json at query time.
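
For the specific shape above, where every element is a Name/text pair, one possible way around the unknown array length is to reshape the array into an object inside the VRL function, before OpenObserve flattens the record. A minimal, untested sketch, assuming `for_each` and `set` are available in the bundled VRL version and that `parse_xml` produces the `@Name` and `text` keys seen in the result above (the variable names `parsed`, `items`, and `eventdata` are made up for illustration):

```
# Parse the XML into a local variable instead of merging it straight away.
parsed = parse_xml!(.body_message)

# Data is expected to be an array of {"@Name": ..., "text": ...} objects.
items = array!(parsed.Event.EventData.Data)

# Build one object keyed by the lowercased Name attribute.
eventdata = {}
for_each(items) -> |_index, item| {
  name = downcase(string!(item."@Name"))
  eventdata = set!(eventdata, [name], item.text)
}

# Replace the array with the object, then merge as in the original function.
parsed.Event.EventData = eventdata
. = merge!(., parsed)
del(.body_message)
.
```

If this behaves as intended, the record no longer contains an array at that position, so flattening should presumably produce one field per Data name rather than a single string. This has not been verified on v0.14.5.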

hengfeiyang avatar Mar 24 '25 03:03 hengfeiyang

I will try using the datafusion json tomorrow, I didn't know O2 had support for it.

gaby avatar Mar 24 '25 03:03 gaby

I don't know if you were able to resolve this, but this is the function I used for sysmon.

if .log_type == "windows_event.sysmon" {
  .sysmon = parse_key_value!(
    (.body_message),
    field_delimiter: "\n",
    key_value_delimiter: ":",
  )
  .sysmon.RuleName = parse_key_value!(
    (.sysmon.RuleName),
    field_delimiter: ",",
    key_value_delimiter: "=")
}
.

And this is what I have parsed so far

Image

eminentv avatar Mar 26 '25 20:03 eminentv

Thanks @eminentv. Yes, that seems to work better than parse_xml or regex. On my side, with sysmonforlinux, the message has no "\n", so I changed it to:

   sysmon2 = parse_key_value!( (.body_message), field_delimiter: "</Data>", key_value_delimiter: "\">", )
   # sysmon2 = parse_key_value!( (.body_message), field_delimiter: "(<EventData>|</Data>)", key_value_delimiter: "\">", )  # NOK
   # sysmon2 is a local variable, so it is referenced without a leading dot
   .commandline = sysmon2."<Data Name=\"CommandLine"
   .company = sysmon2."<Data Name=\"Company"
   .currentdirectory = sysmon2."<Data Name=\"CurrentDirectory"
   .description = sysmon2."<Data Name=\"Description"
   .fileversion = sysmon2."<Data Name=\"FileVersion"
   .hashes = sysmon2."<Data Name=\"Hashes"
   .image = sysmon2."<Data Name=\"Image"
   .integritylevel = sysmon2."<Data Name=\"IntegrityLevel"
   .logonguid = sysmon2."<Data Name=\"LogonGuid"
   .logonid = sysmon2."<Data Name=\"LogonId"
   .originalfilename = sysmon2."<Data Name=\"OriginalFileName"
   .parentcommandline = sysmon2."<Data Name=\"ParentCommandLine"
   .parentimage = sysmon2."<Data Name=\"ParentImage"
   .parentprocessguid = sysmon2."<Data Name=\"ParentProcessGuid"
   .parentprocessid = sysmon2."<Data Name=\"ParentProcessId"
   .parentuser = sysmon2."<Data Name=\"ParentUser"
   .processguid = sysmon2."<Data Name=\"ProcessGuid"
   .processid = sysmon2."<Data Name=\"ProcessId"
   .product = sysmon2."<Data Name=\"Product"
   .terminalsessionid = sysmon2."<Data Name=\"TerminalSessionId"
   .user = sysmon2."<Data Name=\"User"
   .utctime = sysmon2."<Data Name=\"UtcTime"
   # and more depending on eventid
   # as first variable has very long/variable name, separate extract
   #.rulename_tmp = parse_regex!(.body_message, r'<Data Name="RuleName">(.*)<\/Data>')  # NOK. returns {}

This should be easier: both JSON and XML have simpler ways to do complex extraction (JMESPath, XPath...).
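
A possible shortcut for the per-field assignments, sketched with `parse_regex_all` and named capture groups. The commented-out `parse_regex!` attempt likely returns `{}` because VRL only returns named groups unless `numeric_groups: true` is passed. This is an untested sketch that assumes `parse_regex_all` and `for_each` are available in the bundled VRL version; the group names `name` and `value` and the variable `fields` are made up for illustration:

```
# One match per <Data Name="...">...</Data> element; named groups only.
matches = parse_regex_all!(
  .body_message,
  r'<Data Name="(?P<name>[^"]+)">(?P<value>[^<]*)</Data>'
)

# Collect the pairs into one object keyed by the lowercased Name.
fields = {}
for_each(matches) -> |_index, m| {
  fields = set!(fields, [downcase(string!(m.name))], m.value)
}

# Add the extracted fields to the event.
. = merge!(., fields)
.
```

Values stay XML-escaped with this approach (for example `&lt;` in a command line), since the XML is matched as plain text instead of being parsed.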

juju4 avatar Mar 30 '25 20:03 juju4