feat(codecs): Add syslog encoder
Summary
This pull request introduces a new syslog encoder. This work is a continuation of the feature originally started in PR #21307.
The encoder is designed to be lean and performant, expecting users to perform complex data shaping with an upstream remap transform. It correctly handles both RFC 5424 and RFC 3164 standards, including specific field length limitations, character sanitization, and security escaping.
Key Features
Simple Configuration: The configuration uses a standard `Option<ConfigTargetPath>` for all fields.
Flexible Parsing: `facility` and `severity` values read from the event are parsed intelligently, accepting either a string name (e.g., "user") or a number (e.g., 16), with case-insensitive matching for names.
Strict RFC Compliance:
Added logic to truncate app_name, proc_id, and msg_id to their specified maximum lengths for RFC 5424.
Implemented robust truncation for the RFC 3164 TAG field to ensure it never exceeds 32 characters.
Added a sanitization step for RFC 3164 messages to remove non-printable ASCII characters.
Implemented correct character escaping (`\`, `"`, `]`) for structured data parameter values to prevent log injection (see the sketch after this list).
Unit tests: covering parsing, truncation, sanitization, and escaping.
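To make the compliance items above concrete, here is a minimal, hypothetical sketch of those steps (helper names and exact rules are illustrative, not the PR's actual code):

```rust
/// RFC 3164: the TAG must not exceed 32 characters.
/// Assumes ASCII-only input; real code must respect char boundaries.
fn truncate_tag(tag: &str) -> &str {
    &tag[..tag.len().min(32)]
}

/// RFC 3164: strip non-printable ASCII from the message body.
fn sanitize_msg(msg: &str) -> String {
    msg.chars().filter(|c| (' '..='~').contains(c)).collect()
}

/// RFC 5424: escape '\', '"' and ']' in SD-PARAM values (RFC 5424 §6.3.3).
fn escape_sd_param_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        if matches!(c, '\\' | '"' | ']') {
            out.push('\\');
        }
        out.push(c);
    }
    out
}
```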
Vector configuration
[sinks.example]
type = "socket"
inputs = ["example_parse_encoding"]
address = "logserver:514"
mode = "xyz"
[sinks.example.encoding]
codec = "syslog"
rfc = "rfc5424"
app_name = ".app"
proc_id = ".pid"
msg_id = ".mid"
facility = ".fac"
severity = ".sev"
payload_key = ".message"
How did you test this PR?
This plan covers the basic functionality of the syslog encoder for both RFC5424 and RFC3164, focusing on dynamic field resolution from a JSON source.
Note: All tests assume the stdin source is configured with decoding.codec = "json" to parse the input. Expected timestamps and hosts are illustrative.
data_dir = "./data"
[sources.input]
type = "stdin"
[sources.input.decoding]
codec = "json"
[sinks.console]
type = "console"
inputs = [ "input" ]
target = "stdout"
[sinks.console.buffer]
type = "disk"
max_size = 268_435_488
when_full = "block"
Test Case 1: RFC 5424 - field references
Verify that all configured fields are correctly read from a JSON event. Config:
[sinks.console.encoding]
codec = "syslog"
[sinks.console.encoding.syslog]
rfc = "rfc5424"
app_name = ".app"
proc_id = ".pid"
msg_id = ".mid"
facility = ".fac"
severity = ".sev"
Input:
{"host": "my-host", "@timestamp": "2025-10-23T19:00:00.123456Z", "message": "hello world", "app": "my_app", "pid": "987", "mid": "REQ-1", "fac": "daemon", "sev": 3}
Expected Output:
<27>1 2025-10-23T19:00:00.123456Z my-host my_app 987 REQ-1 - hello world
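The PRI values in these tests follow the standard formula PRIVAL = facility × 8 + severity (RFC 5424 §6.2.1, RFC 3164 §4.1.1); a quick sanity check:

```rust
// PRIVAL = Facility * 8 + Severity.
// Test Case 1: facility "daemon" = 3,  severity 3 (err)     => <27>
assert_eq!(3 * 8 + 3, 27);
// Test Case 2: facility "user"   = 1,  severity 5 (notice)  => <13>
assert_eq!(1 * 8 + 5, 13);
// Test Case 3: facility "local1" = 17, severity 4 (warning) => <140>
assert_eq!(17 * 8 + 4, 140);
```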
Test Case 2: RFC 3164 - field references
Verify that all configured fields are correctly read for the legacy format. Config:
[sinks.console.encoding]
codec = "syslog"
[sinks.console.encoding.syslog]
rfc = "rfc3164"
app_name = ".app"
proc_id = ".pid"
facility = ".fac"
severity = ".sev"
Input:
{"host": "my-host", "@timestamp": "2025-10-23T19:00:00Z", "message": "hello legacy", "app": "legacy_app", "pid": "456", "fac": "user", "sev": 5}
Expected Output:
<13>Oct 23 19:00:00 my-host legacy_app[456]: hello legacy
Test Case 3: Field Parsing
Verify facility and severity are parsed from names (case-insensitive) and numbers. Config:
[sinks.console.encoding.syslog]
rfc = "rfc5424"
facility = ".fac"
severity = ".sev"
Input 1 (Name): {"fac": "local1", "sev": "warning"}
Output 1: <140>1 ...
Input 2 (Number): {"fac": 17, "sev": 4}
Output 2: <140>1 ... (same PRI)
Input 3 (Uppercase): {"fac": "LOCAL1", "sev": "WARNING"}
Output 3: <140>1 ... (same PRI)
Input 4 (Mix): {"fac": "LOCAL1", "sev": 4}
Output 4: <140>1 ... (same PRI)
Test Case 4: Default Fallbacks
Verify the encoder uses defaults. Config:
[sinks.console.encoding]
codec = "syslog"
Input: {"host": "my-host", "@timestamp": "2025-10-23T19:00:00Z", "message": "hello default"}
Expected Output: <14>1 2025-10-23T19:00:00.000000Z my-host vector - - - hello default
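(Again per the PRI formula: the default <14> corresponds to facility user (1) and severity info (6): 1 × 8 + 6 = 14.)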
Change Type
- [ ] Bug fix
- [x] New feature
- [ ] Non-functional (chore, refactoring, docs)
- [ ] Performance
Is this a breaking change?
- [ ] Yes
- [x] No
Does this PR include user facing changes?
- [ ] Yes. Please add a changelog fragment based on our guidelines.
- [ ] No. A maintainer will apply the `no-changelog` label to this PR.
References
Notes
- Please read our Vector contributor resources.
- Do not hesitate to use `@vectordotdev/vector` to reach out to us regarding this PR.
- Some CI checks run only after we manually approve them.
- We recommend adding a `pre-push` hook, please see this template. Alternatively, we recommend running the following locally before pushing to the remote branch: `make fmt`, `make check-clippy` (if there are failures it's possible some of them can be fixed with `make clippy-fix`), `make test`.
- After a review is requested, please avoid force pushes to help us review incrementally.
- Feel free to push as many commits as you want. They will be squashed into one before merging. For example, you can run `git merge origin master` and `git push`.
- If this PR introduces changes to Vector dependencies (modifies `Cargo.lock`), please run `make build-licenses` to regenerate the license inventory and commit the changes (if any). More details here.
TL;DR: It looks like this PR is trying to mix in an additional feature for the encoder config to support referencing LogEvent data with different keys, when a VRL remap transform should be capable of abstracting that out? (If it's something worthwhile to contribute, perhaps extract that out to a separate PR/discussion after landing the encoder support?)
My feedback below is regarding changes related to the TL;DR concern raised above, when comparing to my stale PR and its review feedback.
EDIT: The concern appears to have been raised internally, so further review should wait until the refactor is pushed. I can also update my PR if that'd help with changes I note below.
Regex for RFC 3164 tag encoding
- RFC3164 Tag Handling: The logic for generating the RFC3164 TAG was improved with a regex to prevent incorrectly duplicating the proc_id if the app_name field was already formatted.
Could you please explain when the app_name would also carry the proc_id, such that this regex check was necessary?
I assume this is specifically for the use-case where you want the static encoder config to additionally support referencing a different LogEvent field? Or something else specific to how you're providing the data?
You have this as:
```rust
use std::sync::LazyLock;

use regex::Regex;

// This regex pattern checks for "something[digits]".
// It's created lazily to be compiled only once.
static RFC3164_TAG_REGEX: LazyLock<Regex> =
    LazyLock::new(|| Regex::new(r"^\S+\[\d+]$").unwrap());

impl Tag {
    fn encode_rfc_3164(&self) -> String {
        if RFC3164_TAG_REGEX.is_match(&self.app_name) {
            // If it's already formatted, pass it through as-is.
            format!("{}:", self.app_name)
        } else if let Some(proc_id) = self.proc_id.as_deref() {
            format!("{}[{}]:", self.app_name, proc_id)
        } else {
            format!("{}:", self.app_name)
        }
    }
    // ...
}
```
Compared to my implementation:
```rust
impl Tag {
    // Roughly equivalent - RFC 5424 fields can compose the start of
    // an RFC 3164 MSG part (TAG + CONTENT fields):
    // https://datatracker.ietf.org/doc/html/rfc5424#appendix-A.1
    fn encode_rfc_3164(&self) -> String {
        let Self { app_name, proc_id, msg_id } = self;
        match proc_id.as_deref().or(msg_id.as_deref()) {
            Some(context) => [app_name.as_str(), "[", context, "]:"].concat(),
            None => [app_name.as_str(), ":"].concat(),
        }
    }
    // ...
}
```
NOTE: I don't use the convenience of format!() since IIRC performance was worse, so concat on a fixed array of strings was better 😅
From the looks of it with your regex addition, you're adding support for when app_name has bundled the proc_id formatting but without the `:`, but that's more of an issue/workaround for how you may be providing the log data to the encoder on your end? app_name really shouldn't have `[` (or `:`) in its value to begin with.
As this doesn't actually seem like a valid requirement for the encoder to support (see reference links in next section for both RFCs on this encoding step), I would discourage it and defer to correction of the input before it is processed by the encoder.
Serde alias tag for app_name field (Tag struct)
- added `#[serde(alias = "tag")]` to the `app_name` field, allowing users to use the more familiar `tag` option for RFC3164 configurations.

```rust
/// App Name. Can be a static string or a dynamic field path like "$.app".
/// `tag` is supported as an alias for `app_name` for RFC3164 compatibility.
#[serde(alias = "tag")]
app_name: Option<String>,
```
I don't see why the alias is needed?
In my PR I have relevant comment links:
- RFC 3164 section 4.1.3 explains the `MSG` part, which is comprised of the `TAG` and `CONTENT` fields.
  > The value in the `TAG` field will be the name of the program or process that generated the message. The `CONTENT` contains the details of the message.
- RFC 5424 - Appendix A.1 explains that compared to RFC 3164:
  > In this document, `MSG` is what was called `CONTENT` in RFC 3164.
  > - The `TAG` is now part of the header, but not as a single field.
  > - The `TAG` has been split into `APP-NAME`, `PROCID`, and `MSGID`.
Why is it important to serialize the app_name from a tag field?
Is it not redundant with your desire for config values to already support this custom referencing of alternative keys from LogEvent data? That feature is itself unclear in why it is required, when a Remap transform can be used in the Vector config.
Personally, with Vector already having its own standard approach to transforming input, it would be helpful to understand what value these two alternatives provide that isn't specifically to support your personal usage, when some config changes could keep the implementation simple on Vector's side.
except_fields config feature
- support `except_fields`; this allows users to remove specified fields from the LogEvent before it is processed, which is useful for stripping internal or sensitive data before it's included in the final payload.

This feature might have a bit of extra convenience, but it sounds more like a generic feature than one specific to syslog support. How much of a value add is that over the equivalent config in VRL? (A remap transform should be able to do the equivalent fairly easily via `del()` / `remove()`; see the sketch below.)
AFAIK, the encoder should just focus on its task of encoding the input it receives. The config is meant to support that, but any need to transform data beforehand to fit the encoder's expected input shape is a separate task that should be generic, not encoder-specific.
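For comparison, a hedged sketch of the equivalent upstream remap transform (the field names here are hypothetical):

```toml
[transforms.strip_internal]
type = "remap"
inputs = ["input"]
source = '''
# Drop internal/sensitive fields before they reach the syslog encoder.
del(._internal)
del(.secret_token)
'''
```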
DynamicOrStatic<T> enum
a new `DynamicOrStatic<T>` enum was introduced to allow `facility` and `severity` to be configured in multiple ways:
- as a static name (e.g., "user")
- as a static number (e.g., 16)
- as a dynamic field path (e.g., "$.level")

custom serde deserializers were implemented to handle this complex, multi-format input, providing clear error messages for invalid values.
```rust
/// A configuration value that can be either a static value or a dynamic path.
#[configurable_component]
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum DynamicOrStatic<T: 'static> {
    /// A static, fixed value.
    Static(T),
    /// A dynamic value read from a field in the event using `$.` path syntax.
    Dynamic(String),
}
```
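For reference, my understanding of how the three forms would look in config (illustrative values; the deserializer shown below reads them all as strings, and only one line would be active at a time):

```toml
[sinks.console.encoding.syslog]
facility = "user"    # static name (case-insensitive)
# facility = "16"    # static number
# facility = "$.fac" # dynamic field path
```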
Dynamic variant support aside (since concerns for that were covered above already), this has deviated quite a bit from what my PR had, was there a reason to do so?
I did have the Dynamic variant support working fine IIRC; if not in the current iteration of the PR, then a few commits back (when the config struct handled a string dynamic key lookup for these enums in the decant method).
This is what you have for the deserializing support:
```rust
// Generic helper.
fn deserialize_syslog_code<'de, D, T>(
    deserializer: D,
    type_name: &'static str,
    max_value: usize,
    from_repr_fn: fn(usize) -> Option<T>,
) -> Result<DynamicOrStatic<T>, D::Error>
where
    D: Deserializer<'de>,
    T: FromStr + VariantNames,
{
    let s = String::deserialize(deserializer)?;
    if s.starts_with("$.") {
        Ok(DynamicOrStatic::Dynamic(s))
    } else {
        parse_syslog_code(&s, from_repr_fn)
            .map(DynamicOrStatic::Static)
            .ok_or_else(|| {
                serde::de::Error::custom(format!(
                    "Invalid {type_name}: '{s}'. Expected a name, integer 0-{max_value}, or path."
                ))
            })
    }
}

fn parse_syslog_code<T>(s: &str, from_repr_fn: fn(usize) -> Option<T>) -> Option<T>
where
    T: FromStr,
{
    if let Ok(value_from_name) = s.parse::<T>() {
        return Some(value_from_name);
    }
    if let Ok(value_from_number) = s.parse::<u64>() {
        return from_repr_fn(value_from_number as usize);
    }
    None
}

fn deserialize_facility<'de, D>(deserializer: D) -> Result<DynamicOrStatic<Facility>, D::Error>
where
    D: Deserializer<'de>,
{
    deserialize_syslog_code(deserializer, "facility", 23, Facility::from_repr)
}

fn deserialize_severity<'de, D>(deserializer: D) -> Result<DynamicOrStatic<Severity>, D::Error>
where
    D: Deserializer<'de>,
{
    deserialize_syslog_code(deserializer, "severity", 7, Severity::from_repr)
}
```
And here is what I had with my existing PR, with some review feedback applied (switch from akin crate to plain macro_rules + some minor revision):
```rust
macro_rules! deserialize_impl {
    ($enum:ty) => {
        impl $enum {
            fn deserialize<'de, D>(deserializer: D) -> Result<Self, D::Error>
            where
                D: Deserializer<'de>,
            {
                let value = NumberOrString::deserialize(deserializer)?;
                Self::try_from(value).map_err(D::Error::custom)
            }
        }

        impl TryFrom<NumberOrString> for $enum {
            type Error = StrumParseError;

            fn try_from(value: NumberOrString) -> Result<Self, <Self as TryFrom<NumberOrString>>::Error> {
                let variant: Option<Self> = match &value {
                    NumberOrString::Number(num) => Self::from_repr(*num),
                    NumberOrString::String(s) => Self::from_str(&s.to_ascii_lowercase()).ok(),
                };

                variant.with_context(|| InvalidVariantSnafu {
                    input: value.to_string(),
                    variants: Self::VARIANTS.join("`, `"),
                })
            }
        }
    };
}

deserialize_impl!(Facility);
deserialize_impl!(Severity);

// An intermediary container to deserialize config value into.
// Ensures that a string number is properly deserialized to the `usize` variant.
#[derive(derive_more::Display, Deserialize)]
#[serde(untagged)]
enum NumberOrString {
    Number(#[serde(deserialize_with = "deserialize_number_from_string")] usize),
    String(String),
}

#[derive(Debug, Snafu)]
enum StrumParseError {
    #[snafu(display("Unknown variant `{input}`, expected one of `{variants}`"))]
    InvalidVariant { input: String, variants: String },
}
```
NOTES:
- I split out logic from my current PR's deserializer to separate out a `TryFrom` impl as shown above, as I think I intended to support that conversion somewhere where serde wasn't used.
- I had also moved the deserializer annotation to the `Pri` struct, with the parent config struct deserializing with flattened fields:

```rust
#[configurable_component]
#[derive(Clone, Debug, Default)]
#[serde(default)]
pub struct SyslogSerializerOptions {
    /// RFC
    rfc: SyslogRFC,
    #[serde(flatten)]
    #[configurable(derived)]
    priority: Pri,
    #[serde(flatten)]
    #[configurable(derived)]
    tag: Tag,
}

/// Priority Value
#[derive(Clone, Default, Debug)]
#[configurable_component]
#[serde(default)]
struct Pri {
    /// Facility
    #[serde(deserialize_with = "Facility::deserialize")]
    facility: Facility,
    /// Severity
    #[serde(deserialize_with = "Severity::deserialize")]
    severity: Severity,
}
```

- Usage of the `serde(untagged)` attribute for `NumberOrString` could AFAIK be replaced in favor of more performance, via some additional verbosity with the `serde-untagged` crate instead, or similar to your approach with `parse_syslog_code()`.
This is also technically an inconsistency in your `SyslogSerializerOptions`, since `DynamicOrStatic<T>` and `Option<String>` are both being used with the same intent of a dynamic/static string value.
So our approaches are mostly similar, except mine keeps the type consistent as the enums in both the config struct and the later derived decant/encoding struct. Doing so is, I think, easier to grok/maintain?
Thank you for the thorough feedback @polarathene.
Some general comments from me to help move things forward:
- I prefer starting with the simplest possible version, e.g. `except_fields` can be added later if we deem it necessary.
- When giving review feedback prefer multiple comments (one per topic) and when possible attach them on the code itself. It's hard to track what was discussed and what was resolved.
- I would not worry about micro-optimizations such as `concat` vs `format` at this stage.
> When giving review feedback prefer multiple comments (one per topic) and when possible attach them on the code itself. It's hard to track what was discussed and what was resolved.
I normally do :)
However I was mostly focused on the PR description statements and comparing to my own equivalent snippets from my earlier PR. Along with the understanding that the current PR was going to see some notable refactoring... it was easier for me to structure it as I did for my own reference.
As such it was more of a discussion of the PR author's decisions, to get on the same page rather than review the existing PR. IIRC, when reviewing inline on proposed changes, it doesn't always convey the scope of lines a comment is for (it only highlights the last one and a few above it). So without a suggested change to propose, it was clearer for me to structure it this way as a reference for comparison, especially after any rework is done; I'm not particularly interested in the extra leg work of digging up the old context 😅
> I would not worry about micro-optimizations such as `concat` vs `format` at this stage.
I was more curious about the deviation from what I already had there.
Hi @polarathene and @pront,
First, apologies for the delayed response; I was tied up with other work last week.
Thank you for the feedback. I also reviewed the comments from the related PR, and I see the key design question is whether this encoder should be:
- a simple, focused component that depends on upstream remap transforms, or
- a more flexible encoder with built-in handling for common syslog patterns.
To help finalize the design, here are the two approaches we've explored:
Option 1: Simple, Focused Encoder (Aligns with Review Feedback)
Design: Minimal configuration; expects correctly shaped data via field references. Any data shaping (e.g., removing _internal fields, parsing facility/severity) is done upstream with remap.
Pros: Consistent with the "remap-first" philosophy, easy to maintain.
Cons: Less convenient for common syslog use cases; users would need VRL even for simple tasks.
Option 2: Flexible Encoder (Current PR)
Design: Includes DynamicOrStatic and except_fields. Handles facility/severity from names, numbers, or dynamic paths, and can remove redundant fields.
Pros: Better out-of-the-box UX, more declarative and self-contained.
Cons: More internal complexity, adds features that could also be handled with remap.
Our internal experience has shown the flexibility of Option 2 to be useful, which is why we opened the PR in this form. That said, we respect the goal of keeping components simple and maintainable.
To avoid unnecessary work or future refactoring, could you let us know which direction you’d prefer for this initial implementation (Option 1, Option 2, or something in between)? We’re happy to adjust accordingly.
Thanks again for your guidance!
/cc @jcantrill
Option 1: Simple, Focused Encoder (Aligns with Review Feedback)
@vparfonov my expectation for this PR, based upon earlier feedback, is that Option 1 is more in line with the design of existing encoders and the fastest path to adoption in the upstream. This option is my preference. We can rework our implementation to rely upon transforms once syslog merges. I would prefer simple.
Hello, can you `git merge origin/master` and resolve the conflicts? Thanks.
> Hello, can you `git merge origin/master` and resolve the conflicts? Thanks.
done
Hi @vparfonov, thanks for the PR! This is missing both a changelog and a test plan. Something very useful you could provide is an input and expected (raw) output for example (one output for RFC3164 and one for RFC5424).
Also another important thing, when running this with only a minimal set of features I get lots of compilation errors. You can run the command by yourself to see:
cargo clippy --workspace --no-default-features --features=sinks-papertrail
I also suggest running this to test (no errors occur here)
cargo clippy --workspace --no-default-features --features=sources-syslog
Thanks for checking @thomasqueirozb, I've fixed the compilation errors and added a test plan and changelog.
Hello @polarathene and @pront, thank you for the review! I have addressed all mentioned comments and incorporated the suggested changes.
- applied `#[serde(deny_unknown_fields)]` to `SyslogSerializerOptions` to prevent silent configuration errors caused by obsolete or mistyped fields.
- removed the obsolete `payload_key` field. The encoder now exclusively uses the standard event `message` field, simplifying the configuration interface as requested.
- implemented semantic application name fallback. The `app_name` lookup now prioritizes the explicit configuration, then falls back to `log.get_by_meaning("service")`, and finally to the default value (a rough sketch follows this list).
- fixed the RFC3164 `TAG` truncation logic, ensuring the 32-character limit is maintained
- added edge case tests
- bumped `derive_more` to v2.0.1
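A rough sketch of that resolution order (hypothetical helper; assumes Vector's `LogEvent` and path types are in scope, and the real accessor signatures may differ):

```rust
// Hypothetical illustration of the described lookup order:
// configured path -> semantic "service" meaning -> static default.
fn resolve_app_name(log: &LogEvent, configured: Option<&OwnedTargetPath>) -> String {
    configured
        .and_then(|path| log.get(path))
        .or_else(|| log.get_by_meaning("service"))
        .map(|value| value.to_string_lossy().into_owned())
        .unwrap_or_else(|| "vector".to_string())
}
```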
Hey @vparfonov and @pront , does this open up the door to a dedicated syslog sink?
> Hey @vparfonov and @pront , does this open up the door to a dedicated syslog sink?
Not yet, but after merging it will be possible to pair it with the socket sink, something like:
[sinks.example]
type = "socket"
inputs = ["example_parse_encoding"]
address = "logserver:514"
mode = "tcp"
keepalive.time_secs = 60
[sinks.example.encoding]
codec = "syslog"
rfc = "rfc5424"
Sweet! Thank you
Hi @vparfonov. I tried to fix the failing checks but it seems like I don't have push permissions to this branch. I suggest `git fetch && git merge origin/master`, and doing `git checkout origin/master -- Cargo.lock && cargo check` when you get a conflict, to resolve it. Also, once master is merged you'll also need to run `make generate-component-docs` and `cargo vdev build licenses`.
@thomasqueirozb, thanks for pointing this out. I've attempted to run the generation commands, but I am hitting environment issues that I can't resolve quickly.
`cargo vdev build licenses`: Failed. I am getting:

```
> cargo vdev build licenses
Error: No such file or directory (os error 2)
```

> `cargo vdev build licenses`: Failed. I am getting: `Error: No such file or directory (os error 2)`
This error message has been on my todo list to fix since forever. You're missing `dd-rust-license-tool`. You can install it by running `cargo install dd-rust-license-tool --version 1.0.4` and then running `cargo vdev build licenses` again.
It also looks like the changes to `website/cue/reference/components/sinks/generated/greptimedb_logs.cue` need to be reverted.
> This error message has been on my todo list to fix since forever. You're missing `dd-rust-license-tool`. You can install it by running `cargo install dd-rust-license-tool --version 1.0.4` and then running `cargo vdev build licenses` again.
got it, it works now, thanks
> It also looks like the changes to `website/cue/reference/components/sinks/generated/greptimedb_logs.cue` need to be reverted.
reverted, but it's strange why it failed; this was the only change observed:

```diff
- examples: [{}]
+ examples: [{},
+ ]
```
> reverted, but it's strange why it failed; this was the only change observed
I have run into that before with this same file. Not sure what is going on there; it might be a difference between how formatting occurs inside the CI and how `make generate-component-docs` works locally.