feat(codecs): Add syslog encoder
Summary
This pull request introduces a new syslog encoder. This work is a continuation of the feature originally started in PR #21307.
The encoder is designed to be lean and performant, expecting users to perform complex data shaping with an upstream remap transform. It correctly handles both RFC 5424 and RFC 3164 standards, including specific field length limitations, character sanitization, and security escaping.
Key Features
Simple Configuration: The configuration uses a standard `Option<ConfigTargetPath>` for all fields.
Flexible Parsing: `facility` and `severity` values read from the event are parsed intelligently, accepting either a string name (e.g., "user") or a number (e.g., 16), with case-insensitive matching for names.
Strict RFC Compliance:
Added logic to truncate app_name, proc_id, and msg_id to their specified maximum lengths for RFC 5424.
Implemented robust truncation for the RFC 3164 TAG field to ensure it never exceeds 32 characters.
Added a sanitization step for RFC 3164 messages to remove non-printable ASCII characters.
Implemented correct character escaping (`\`, `"`, `]`) for structured data parameter values to prevent log injection (see the sketch after this list).
Unit tests: covering parsing, truncation, sanitization, and escaping.
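To make the compliance items above concrete, here is a minimal, hypothetical sketch of those steps (helper names and exact rules are illustrative, not the PR's actual code):

```rust
/// RFC 3164: the TAG must not exceed 32 characters.
/// Assumes ASCII-only input; real code must respect char boundaries.
fn truncate_tag(tag: &str) -> &str {
    &tag[..tag.len().min(32)]
}

/// RFC 3164: strip non-printable ASCII from the message body.
fn sanitize_msg(msg: &str) -> String {
    msg.chars().filter(|c| (' '..='~').contains(c)).collect()
}

/// RFC 5424: escape '\', '"' and ']' in SD-PARAM values (RFC 5424 §6.3.3).
fn escape_sd_param_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        if matches!(c, '\\' | '"' | ']') {
            out.push('\\');
        }
        out.push(c);
    }
    out
}
```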
Vector configuration
[sinks.example]
type = "socket"
inputs = ["example_parse_encoding"]
address = "logserver:514"
mode = "xyz"
[sinks.example.encoding]
codec = "syslog"
rfc = "rfc5424"
app_name = ".app"
proc_id = ".pid"
msg_id = ".mid"
facility = ".fac"
severity = ".sev"
payload_key = ".message"
How did you test this PR?
This plan covers the basic functionality of the syslog encoder for both RFC5424 and RFC3164, focusing on dynamic field resolution from a JSON source.
Note: All tests assume the stdin source is configured with decoding.codec = "json" to parse the input. Expected timestamps and hosts are illustrative.
data_dir = "./data"
[sources.input]
type = "stdin"
[sources.input.decoding]
codec = "json"
[sinks.console]
type = "console"
inputs = [ "input" ]
target = "stdout"
[sinks.console.buffer]
type = "disk"
max_size = 268_435_488
when_full = "block"
Test Case 1: RFC 5424 - field references
Verify that all configured fields are correctly read from a JSON event. Config:
[sinks.console.encoding]
codec = "syslog"
[sinks.console.encoding.syslog]
rfc = "rfc5424"
app_name = ".app"
proc_id = ".pid"
msg_id = ".mid"
facility = ".fac"
severity = ".sev"
Input:
{"host": "my-host", "@timestamp": "2025-10-23T19:00:00.123456Z", "message": "hello world", "app": "my_app", "pid": "987", "mid": "REQ-1", "fac": "daemon", "sev": 3}
Expected Output:
<27>1 2025-10-23T19:00:00.123456Z my-host my_app 987 REQ-1 - hello world
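The PRI values in these tests follow the standard formula PRIVAL = facility × 8 + severity (RFC 5424 §6.2.1, RFC 3164 §4.1.1); a quick sanity check:

```rust
// PRIVAL = Facility * 8 + Severity.
// Test Case 1: facility "daemon" = 3,  severity 3 (err)     => <27>
assert_eq!(3 * 8 + 3, 27);
// Test Case 2: facility "user"   = 1,  severity 5 (notice)  => <13>
assert_eq!(1 * 8 + 5, 13);
// Test Case 3: facility "local1" = 17, severity 4 (warning) => <140>
assert_eq!(17 * 8 + 4, 140);
```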
Test Case 2: RFC 3164 - field references
Verify that all configured fields are correctly read for the legacy format. Config:
[sinks.console.encoding]
codec = "syslog"
[sinks.console.encoding.syslog]
rfc = "rfc3164"
app_name = ".app"
proc_id = ".pid"
facility = ".fac"
severity = ".sev"
Input:
{"host": "my-host", "@timestamp": "2025-10-23T19:00:00Z", "message": "hello legacy", "app": "legacy_app", "pid": "456", "fac": "user", "sev": 5}
Expected Output:
<13>Oct 23 19:00:00 my-host legacy_app[456]: hello legacy
Test Case 3: Field Parsing
Verify facility and severity are parsed from names (case-insensitive) and numbers. Config:
[sinks.console.encoding.syslog]
rfc = "rfc5424"
facility = ".fac"
severity = ".sev"
Input 1 (Name): {"fac": "local1", "sev": "warning"}
Output 1: <140>1 ...
Input 2 (Number): {"fac": 17, "sev": 4}
Output 2: <140>1 ... (same PRI)
Input 3 (Uppercase): {"fac": "LOCAL1", "sev": "WARNING"}
Output 3: <140>1 ... (same PRI)
Input 4 (Mix): {"fac": "LOCAL1", "sev": 4}
Output 4: <140>1 ... (same PRI)
Test Case 4: Default Fallbacks
Verify the encoder uses defaults. Config:
[sinks.console.encoding]
codec = "syslog"
Input: {"host": "my-host", "@timestamp": "2025-10-23T19:00:00Z", "message": "hello default"}
Expected Output: <14>1 2025-10-23T19:00:00.000000Z my-host vector - - - hello default
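(Again per the PRI formula: the default <14> corresponds to facility user (1) and severity info (6): 1 × 8 + 6 = 14.)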
Change Type
- [ ] Bug fix
- [x] New feature
- [ ] Non-functional (chore, refactoring, docs)
- [ ] Performance
Is this a breaking change?
- [ ] Yes
- [x] No
Does this PR include user facing changes?
- [ ] Yes. Please add a changelog fragment based on our guidelines.
- [ ] No. A maintainer will apply the `no-changelog` label to this PR.
References
Notes
- Please read our Vector contributor resources.
- Do not hesitate to use `@vectordotdev/vector` to reach out to us regarding this PR.
- Some CI checks run only after we manually approve them.
- We recommend adding a `pre-push` hook, please see this template. Alternatively, we recommend running the following locally before pushing to the remote branch: `make fmt`, `make check-clippy` (if there are failures it's possible some of them can be fixed with `make clippy-fix`), `make test`.
- After a review is requested, please avoid force pushes to help us review incrementally.
- Feel free to push as many commits as you want. They will be squashed into one before merging. For example, you can run `git merge origin master` and `git push`.
- If this PR introduces changes to Vector dependencies (modifies `Cargo.lock`), please run `make build-licenses` to regenerate the license inventory and commit the changes (if any). More details here.
TL;DR: It looks like this PR is trying to mix in an additional feature for the encoder config to support referencing LogEvent data with different keys, when a VRL remap transform should be capable of abstracting that out? (If it's something worthwhile to contribute, perhaps extract that out to a separate PR/discussion after landing the encoder support?)
My feedback below is regarding changes related to the TL;DR concern raised above, when comparing to my stale PR and its review feedback.
EDIT: The concern appears to have been raised internally, so further review should wait until the refactor is pushed. I can also update my PR if that'd help with changes I note below.
Regex for RFC 3164 tag encoding
- RFC3164 Tag Handling: The logic for generating the RFC3164 TAG was improved with a regex to prevent incorrectly duplicating the proc_id if the app_name field was already formatted.
Could you please explain when the app_name would also carry the proc_id, such that this regex check was necessary?
I assume this is specifically for the use-case where you want the static encoder config to additionally support referencing a different LogEvent field? Or something else specific to how you're providing the data?
You have this as:
```rust
use std::sync::LazyLock;

use regex::Regex;

// This regex pattern checks for "something[digits]".
// It's created lazily to be compiled only once.
static RFC3164_TAG_REGEX: LazyLock<Regex> =
    LazyLock::new(|| Regex::new(r"^\S+\[\d+]$").unwrap());

impl Tag {
    fn encode_rfc_3164(&self) -> String {
        if RFC3164_TAG_REGEX.is_match(&self.app_name) {
            // If it's already formatted, pass it through as-is.
            format!("{}:", self.app_name)
        } else if let Some(proc_id) = self.proc_id.as_deref() {
            format!("{}[{}]:", self.app_name, proc_id)
        } else {
            format!("{}:", self.app_name)
        }
    }
    // ...
}
```
Compared to my implementation:
```rust
impl Tag {
    // Roughly equivalent - RFC 5424 fields can compose the start of
    // an RFC 3164 MSG part (TAG + CONTENT fields):
    // https://datatracker.ietf.org/doc/html/rfc5424#appendix-A.1
    fn encode_rfc_3164(&self) -> String {
        let Self { app_name, proc_id, msg_id } = self;
        match proc_id.as_deref().or(msg_id.as_deref()) {
            Some(context) => [app_name.as_str(), "[", context, "]:"].concat(),
            None => [app_name.as_str(), ":"].concat(),
        }
    }
    // ...
}
```
NOTE: I don't use the convenience of format!() since IIRC performance was worse, so concat on a fixed array of strings was better 😅
From the looks of it with your regex addition, you're adding support for when app_name has bundled the proc_id formatting but without the `:`, but that's more of an issue/workaround for how you may be providing the log data to the encoder on your end? app_name really shouldn't have `[` (or `:`) in its value to begin with.
As this doesn't actually seem like a valid requirement for the encoder to support (see reference links in next section for both RFCs on this encoding step), I would discourage it and defer to correction of the input before it is processed by the encoder.
Serde alias tag for app_name field (Tag struct)
- added `#[serde(alias = "tag")]` to the `app_name` field, allowing users to use the more familiar `tag` option for RFC3164 configurations.

```rust
/// App Name. Can be a static string or a dynamic field path like "$.app".
/// `tag` is supported as an alias for `app_name` for RFC3164 compatibility.
#[serde(alias = "tag")]
app_name: Option<String>,
```
I don't see why the alias is needed?
In my PR I have relevant comment links:
- RFC 3164 section 4.1.3 explains the `MSG` part, which is comprised of the `TAG` and `CONTENT` fields.
  > The value in the `TAG` field will be the name of the program or process that generated the message. The `CONTENT` contains the details of the message.
- RFC 5424 - Appendix A.1 explains that compared to RFC 3164:
  > In this document, `MSG` is what was called `CONTENT` in RFC 3164.
  > - The `TAG` is now part of the header, but not as a single field.
  > - The `TAG` has been split into `APP-NAME`, `PROCID`, and `MSGID`.
Why is it important to serialize the app_name from a tag field?
Is it not redundant with your desire for config values to already support this custom referencing of alternative keys from LogEvent data? That feature is itself unclear in why it is required, when a Remap transform can be used in the Vector config.
Personally, with Vector already having its own standard approach to transforming input, it would be helpful to understand what value these two alternatives provide that isn't specifically to support your personal usage, when some config changes could keep the implementation simple on Vector's side.
except_fields config feature
- support `except_fields`; this allows users to remove specified fields from the LogEvent before it is processed, which is useful for stripping internal or sensitive data before it's included in the final payload.

This feature might have a bit of extra convenience, but it sounds more like a generic feature than one specific to syslog support. How much of a value add is that over the equivalent config in VRL? (A remap transform should be able to do the equivalent fairly easily via `del()` / `remove()`; see the sketch below.)
AFAIK, the encoder should just focus on its task of encoding the input it receives. The config is meant to support that, but any need to transform data beforehand to fit the encoder's expected input shape is a separate task that should be generic, not encoder-specific.
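For comparison, a hedged sketch of the equivalent upstream remap transform (the field names here are hypothetical):

```toml
[transforms.strip_internal]
type = "remap"
inputs = ["input"]
source = '''
# Drop internal/sensitive fields before they reach the syslog encoder.
del(._internal)
del(.secret_token)
'''
```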
DynamicOrStatic<T> enum
a new `DynamicOrStatic<T>` enum was introduced to allow `facility` and `severity` to be configured in multiple ways:
- as a static name (e.g., "user")
- as a static number (e.g., 16)
- as a dynamic field path (e.g., "$.level")

custom serde deserializers were implemented to handle this complex, multi-format input, providing clear error messages for invalid values.
```rust
/// A configuration value that can be either a static value or a dynamic path.
#[configurable_component]
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum DynamicOrStatic<T: 'static> {
    /// A static, fixed value.
    Static(T),
    /// A dynamic value read from a field in the event using `$.` path syntax.
    Dynamic(String),
}
```
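For reference, my understanding of how the three forms would look in config (illustrative values; the deserializer shown below reads them all as strings, and only one line would be active at a time):

```toml
[sinks.console.encoding.syslog]
facility = "user"    # static name (case-insensitive)
# facility = "16"    # static number
# facility = "$.fac" # dynamic field path
```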
Dynamic variant support aside (since concerns for that were covered above already), this has deviated quite a bit from what my PR had, was there a reason to do so?
I did have the Dynamic variant support working fine IIRC; if not in the current iteration of the PR, then a few commits back (when the config struct handled a string dynamic key lookup for these enums in the decant method).
This is what you have for the deserializing support:
```rust
// Generic helper.
fn deserialize_syslog_code<'de, D, T>(
    deserializer: D,
    type_name: &'static str,
    max_value: usize,
    from_repr_fn: fn(usize) -> Option<T>,
) -> Result<DynamicOrStatic<T>, D::Error>
where
    D: Deserializer<'de>,
    T: FromStr + VariantNames,
{
    let s = String::deserialize(deserializer)?;
    if s.starts_with("$.") {
        Ok(DynamicOrStatic::Dynamic(s))
    } else {
        parse_syslog_code(&s, from_repr_fn)
            .map(DynamicOrStatic::Static)
            .ok_or_else(|| {
                serde::de::Error::custom(format!(
                    "Invalid {type_name}: '{s}'. Expected a name, integer 0-{max_value}, or path."
                ))
            })
    }
}

fn parse_syslog_code<T>(s: &str, from_repr_fn: fn(usize) -> Option<T>) -> Option<T>
where
    T: FromStr,
{
    if let Ok(value_from_name) = s.parse::<T>() {
        return Some(value_from_name);
    }
    if let Ok(value_from_number) = s.parse::<u64>() {
        return from_repr_fn(value_from_number as usize);
    }
    None
}

fn deserialize_facility<'de, D>(deserializer: D) -> Result<DynamicOrStatic<Facility>, D::Error>
where
    D: Deserializer<'de>,
{
    deserialize_syslog_code(deserializer, "facility", 23, Facility::from_repr)
}

fn deserialize_severity<'de, D>(deserializer: D) -> Result<DynamicOrStatic<Severity>, D::Error>
where
    D: Deserializer<'de>,
{
    deserialize_syslog_code(deserializer, "severity", 7, Severity::from_repr)
}
```
And here is what I had with my existing PR, with some review feedback applied (switch from akin crate to plain macro_rules + some minor revision):
```rust
macro_rules! deserialize_impl {
    ($enum:ty) => {
        impl $enum {
            fn deserialize<'de, D>(deserializer: D) -> Result<Self, D::Error>
            where
                D: Deserializer<'de>,
            {
                let value = NumberOrString::deserialize(deserializer)?;
                Self::try_from(value).map_err(D::Error::custom)
            }
        }

        impl TryFrom<NumberOrString> for $enum {
            type Error = StrumParseError;

            fn try_from(value: NumberOrString) -> Result<Self, <Self as TryFrom<NumberOrString>>::Error> {
                let variant: Option<Self> = match &value {
                    NumberOrString::Number(num) => Self::from_repr(*num),
                    NumberOrString::String(s) => Self::from_str(&s.to_ascii_lowercase()).ok(),
                };

                variant.with_context(|| InvalidVariantSnafu {
                    input: value.to_string(),
                    variants: Self::VARIANTS.join("`, `"),
                })
            }
        }
    };
}

deserialize_impl!(Facility);
deserialize_impl!(Severity);

// An intermediary container to deserialize config value into.
// Ensures that a string number is properly deserialized to the `usize` variant.
#[derive(derive_more::Display, Deserialize)]
#[serde(untagged)]
enum NumberOrString {
    Number(#[serde(deserialize_with = "deserialize_number_from_string")] usize),
    String(String),
}

#[derive(Debug, Snafu)]
enum StrumParseError {
    #[snafu(display("Unknown variant `{input}`, expected one of `{variants}`"))]
    InvalidVariant { input: String, variants: String },
}
```
NOTES:
- I split out logic from my current PR's deserializer to separate out a `TryFrom` impl as shown above, as I think I intended to support that conversion somewhere where serde wasn't used.
- I had also moved the deserializer annotation to the `Pri` struct, with the parent config struct deserializing with flattened fields:

```rust
#[configurable_component]
#[derive(Clone, Debug, Default)]
#[serde(default)]
pub struct SyslogSerializerOptions {
    /// RFC
    rfc: SyslogRFC,
    #[serde(flatten)]
    #[configurable(derived)]
    priority: Pri,
    #[serde(flatten)]
    #[configurable(derived)]
    tag: Tag,
}

/// Priority Value
#[derive(Clone, Default, Debug)]
#[configurable_component]
#[serde(default)]
struct Pri {
    /// Facility
    #[serde(deserialize_with = "Facility::deserialize")]
    facility: Facility,
    /// Severity
    #[serde(deserialize_with = "Severity::deserialize")]
    severity: Severity,
}
```

- Usage of the `serde(untagged)` attribute for `NumberOrString` could AFAIK be replaced in favor of more performance, via some additional verbosity with the `serde-untagged` crate instead, or similar to your approach with `parse_syslog_code()`.
This is also technically an inconsistency in your `SyslogSerializerOptions`, since `DynamicOrStatic<T>` and `Option<String>` are both being used with the same intent of a dynamic/static string value.
So our approaches are mostly similar, except mine keeps the type consistent as the enums in both the config struct and the later derived decant/encoding struct. Doing so is, I think, easier to grok/maintain?
Thank you for the thorough feedback @polarathene.
Some general comments from me to help move things forward:
- I prefer starting with the simplest possible version, e.g. `except_fields` can be added later if we deem it necessary.
- When giving review feedback prefer multiple comments (one per topic) and when possible attach them on the code itself. It's hard to track what was discussed and what was resolved.
- I would not worry about micro-optimizations such as `concat` vs `format` at this stage.
> When giving review feedback prefer multiple comments (one per topic) and when possible attach them on the code itself. It's hard to track what was discussed and what was resolved.
I normally do :)
However I was mostly focused on the PR description statements and comparing to my own equivalent snippets from my earlier PR. Along with the understanding that the current PR was going to see some notable refactoring... it was easier for me to structure it as I did for my own reference.
As such it was more of a discussion of the PR author's decisions, to get on the same page rather than review the existing PR. IIRC, when reviewing inline on proposed changes, it doesn't always convey the scope of lines a comment is for (it only highlights the last one and a few above it). So without a suggested change to propose, it was clearer for me to structure it this way as a reference for comparison, especially after any rework is done; I'm not particularly interested in the extra leg work of digging up the old context 😅
> I would not worry about micro-optimizations such as `concat` vs `format` at this stage.
I was more curious about the deviation from what I already had there.
Hi @polarathene and @pront,
First, apologies for the delayed response; I was tied up with other work last week.
Thank you for the feedback. I also reviewed the comments from the related PR, and I see the key design question is whether this encoder should be:
- a simple, focused component that depends on upstream remap transforms, or
- a more flexible encoder with built-in handling for common syslog patterns.
To help finalize the design, here are the two approaches we've explored:
Option 1: Simple, Focused Encoder (Aligns with Review Feedback)
Design: Minimal configuration; expects correctly shaped data via field references. Any data shaping (e.g., removing _internal fields, parsing facility/severity) is done upstream with remap.
Pros: Consistent with the "remap-first" philosophy, easy to maintain.
Cons: Less convenient for common syslog use cases; users would need VRL even for simple tasks.
Option 2: Flexible Encoder (Current PR)
Design: Includes DynamicOrStatic and except_fields. Handles facility/severity from names, numbers, or dynamic paths, and can remove redundant fields.
Pros: Better out-of-the-box UX, more declarative and self-contained.
Cons: More internal complexity, adds features that could also be handled with remap.
Our internal experience has shown the flexibility of Option 2 to be useful, which is why we opened the PR in this form. That said, we respect the goal of keeping components simple and maintainable.
To avoid unnecessary work or future refactoring, could you let us know which direction you’d prefer for this initial implementation (Option 1, Option 2, or something in between)? We’re happy to adjust accordingly.
Thanks again for your guidance!
/cc @jcantrill
Option 1: Simple, Focused Encoder (Aligns with Review Feedback)
@vparfonov my expectation for this PR, based upon earlier feedback, is that Option 1 is more in line with the design of existing encoders and the fastest path to adoption in the upstream. This option is my preference. We can rework our implementation to rely upon transforms once syslog merges. I would prefer simple.
Hello, can you `git merge origin/master` and resolve the conflicts? Thanks.
> Hello, can you `git merge origin/master` and resolve the conflicts? Thanks.
done
Hi @vparfonov, thanks for the PR! This is missing both a changelog and a test plan. Something very useful you could provide is an input and expected (raw) output for example (one output for RFC3164 and one for RFC5424).
Also another important thing, when running this with only a minimal set of features I get lots of compilation errors. You can run the command by yourself to see:
cargo clippy --workspace --no-default-features --features=sinks-papertrail
I also suggest running this to test (no errors occur here)
cargo clippy --workspace --no-default-features --features=sources-syslog
Thanks for checking @thomasqueirozb, I've fixed the compilation errors and added a test plan and changelog.
Hello @polarathene and @pront, thank you for the review! I have addressed all mentioned comments and incorporated the suggested changes.
- applied `#[serde(deny_unknown_fields)]` to `SyslogSerializerOptions` to prevent silent configuration errors caused by obsolete or mistyped fields.
- removed the obsolete `payload_key` field. The encoder now exclusively uses the standard event `message` field, simplifying the configuration interface as requested.
- implemented semantic application name fallback. The `app_name` lookup now prioritizes the explicit configuration, then falls back to `log.get_by_meaning("service")`, and finally to the default value (a rough sketch follows this list).
- fixed the RFC3164 `TAG` truncation logic, ensuring the 32-character limit is maintained
- added edge case tests
- bumped `derive_more` to v2.0.1
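A rough sketch of that resolution order (hypothetical helper; assumes Vector's `LogEvent` and path types are in scope, and the real accessor signatures may differ):

```rust
// Hypothetical illustration of the described lookup order:
// configured path -> semantic "service" meaning -> static default.
fn resolve_app_name(log: &LogEvent, configured: Option<&OwnedTargetPath>) -> String {
    configured
        .and_then(|path| log.get(path))
        .or_else(|| log.get_by_meaning("service"))
        .map(|value| value.to_string_lossy().into_owned())
        .unwrap_or_else(|| "vector".to_string())
}
```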
Hey @vparfonov and @pront , does this open up the door to a dedicated syslog sink?
> Hey @vparfonov and @pront , does this open up the door to a dedicated syslog sink?
Not yet, but after merging it will be possible to pair it with the socket sink, something like:
[sinks.example]
type = "socket"
inputs = ["example_parse_encoding"]
address = "logserver:514"
mode = "tcp"
keepalive.time_secs = 60
[sinks.example.encoding]
codec = "syslog"
rfc = "rfc5424"
Sweet! Thank you
Hi @vparfonov. I tried to fix the failing checks but it seems like I don't have push permissions to this branch. I suggest `git fetch && git merge origin/master`, and doing `git checkout origin/master -- Cargo.lock && cargo check` when you get a conflict, to resolve it. Also, once master is merged you'll also need to run `make generate-component-docs` and `cargo vdev build licenses`.
@thomasqueirozb, thanks for pointing this out. I've attempted to run the generation commands, but I am hitting environment issues that I can't resolve quickly.
`cargo vdev build licenses`: Failed. I am getting:

```
> cargo vdev build licenses
Error: No such file or directory (os error 2)
```

> `cargo vdev build licenses`: Failed. I am getting: `Error: No such file or directory (os error 2)`
This error message has been on my todo list to fix since forever. You're missing `dd-rust-license-tool`. You can install it by running `cargo install dd-rust-license-tool --version 1.0.4` and then running `cargo vdev build licenses` again.
It also looks like the changes to `website/cue/reference/components/sinks/generated/greptimedb_logs.cue` need to be reverted.
> This error message has been on my todo list to fix since forever. You're missing `dd-rust-license-tool`. You can install it by running `cargo install dd-rust-license-tool --version 1.0.4` and then running `cargo vdev build licenses` again.
got it, it works now, thanks
> It also looks like the changes to `website/cue/reference/components/sinks/generated/greptimedb_logs.cue` need to be reverted.
reverted, but it's strange why it failed; this was the only change observed:

```diff
- examples: [{}]
+ examples: [{},
+ ]
```
> reverted, but it's strange why it failed; this was the only change observed
I have run into that before with this same file. Not sure what is going on there; it might be a difference between how formatting occurs inside the CI and how `make generate-component-docs` works locally.