vector icon indicating copy to clipboard operation
vector copied to clipboard

Document `u8`s that represent ASCII characters in config fields as characters in the docs

Open jszwedko opened this issue 2 years ago • 4 comments

For example, https://github.com/vectordotdev/vector/pull/18320, introduced a few new fields that take u8s that are supposed to be characters. These end up being documented as unsigned integers like so:

Screenshot 2023-09-25 at 12 17 52 PM

Even though in configuration they would be specified as character literals (i.e. , rather than 44).

jszwedko avatar Sep 25 '23 19:09 jszwedko

Is there a good way of representing u8 as characters in the documentation? I can not think of anything else then changing the datatype in the config to strings an converting them to u8 later when creating the actual serializer. In any case i can take over this task.

scMarkus avatar Nov 30 '23 14:11 scMarkus

@jszwedko any thoughts on how to best solve this? I am by no means an rust or vector expert but I am happy to put in the work given some thought through guidance :)

scMarkus avatar Apr 29 '24 19:04 scMarkus

I discovered that the delimiter is actually already marked as an "ASCII character" in the code:

https://github.com/vectordotdev/vector/blob/056c2df60a7f776eca2fad6a4686deee35f9b340/lib/codecs/src/encoding/format/csv.rs#L77-L83

However this is not correctly translated in the cue definition:

https://github.com/vectordotdev/vector/blob/056c2df60a7f776eca2fad6a4686deee35f9b340/website/cue/reference/components/sinks/base/socket.cue#L153-L156

It should be using the ascii_char type mentioned here:

https://github.com/vectordotdev/vector/blob/056c2df60a7f776eca2fad6a4686deee35f9b340/website/cue/reference.cue#L500

So I think the issue might lie in the translation from the configurable definition to the cue files (and potentially, past that, to the website rendering of the config options). The translation happens in a couple of places:

  • https://github.com/vectordotdev/vector/tree/master/lib/vector-config-macros is the macro that generates JSON schema (via vector generate-schema)
  • https://github.com/vectordotdev/vector/blob/master/scripts/generate-component-docs.rb is what takes the JSON schema and writes out the cue files

This is a nuanced issue 😅 If you are interested in digging into it, those would be some places to start though. Let me know if you have additional questions!

jszwedko avatar Apr 30 '24 18:04 jszwedko

@jszwedko thanks a lot for those hints. what I found out so far (in the end I will ask for your intuition), is the generated json schema file, which is then fed into the ruby script does contain the undesired type already. so the faulty behavior happens before that.

regarding the generated json schema I have not yet identified the code doing the conversion but I found this list of valid types for the json schema (I could imagine based on that assumptions are made about u8 not being a string but a number): https://github.com/vectordotdev/vector/blob/c2765f45e243636b40d22a819d1496549eb40ed4/lib/vector-config-common/src/schema/json_schema.rs#L537-L550

looking through the quite well written documentation I found this peace of insight, which basically hints on the type conversion being just inherently complicated...: https://github.com/vectordotdev/vector/blob/c2765f45e243636b40d22a819d1496549eb40ed4/lib/vector-config-macros/src/ast/field.rs#L63-L95

going one step further finding this hint which explicetly states the #[serde(with = "...")] syntax. maybe line 278 contains some type casting magic? I have to admit not fully understating what is happening there: https://github.com/vectordotdev/vector/blob/c2765f45e243636b40d22a819d1496549eb40ed4/lib/vector-config-macros/src/ast/util.rs#L243-L281

I would be willing to look into this more deeply but first i like to check back if I am on the right track?

scMarkus avatar May 04 '24 19:05 scMarkus

@scMarkus 👋🏻 I'm the engineer who originally wrote all of the Configurable macro machinery and Jesse asked me to take a look at this issue.

Overall, while delegated types are sort of related to the problem, there's more of a fundamental underlying issue here: both Configurable and the documentation generation script don't actually handle ascii_char, or have a concept of it, really.

Essentially, there's no type in the standard library that represents an ASCII character. UTF-8 characters? Sure, char. Those may or may not be ASCII, though. We have two approaches we can take: simply annotate all these u8 fields being used as a way to handle an ASCII character, or create a newtype wrapper to enforce it.

The simpler approach is just annotating the u8 fields, because it solves half of the problem which is getting the right type into the documentation generation script. We can do this by adding the following annotation to the field: #[configurable(metadata(docs::type_override = "ascii_char"))].

This adds metadata that tells the documentation generation script to ignore determining the field type itself, and to instead just use the type we've given it. That gets us generating these fields as being ascii_char, to follow the logic in the Cue reference schema.

The second part we have to solve, likely, is actually having the documentation render that default value -- e.g. 44 -- as an ASCII character. My guess is that it may just end up rendering it as an integer, or maybe not at all, even after making the change to override the field type to ascii_char.

That's something you'll likely need to test, but we can also work through that as part of any PR you open.

tobz avatar May 14 '24 13:05 tobz

Hi @tobz thanks a lot for your input. you seam to be quite right. I naively added your idea.

    /// The field delimiter to use when writing CSV.
    #[configurable(metadata(docs::type_override = "ascii_char"))]
    #[serde(
        default = "default_delimiter",
        with = "vector_core::serde::ascii_char",
        skip_serializing_if = "vector_core::serde::is_default"
    )]
    pub delimiter: u8,

and looked at the intermediate json output. which changes from

            "delimiter": {
              "description": "The field delimiter to use when writing CSV.",
              "default": 44,
              "type": "integer",
              "maximum": 255.0,
              "minimum": 0.0,
              "_metadata": {
                "docs::numeric_type": "uint",
                "docs::human_name": "Delimiter"
              }
            },

to

            "delimiter": {
              "description": "The field delimiter to use when writing CSV.",
              "default": 44,
              "type": "integer",
              "maximum": 255.0,
              "minimum": 0.0,
              "_metadata": {
                "docs::numeric_type": "uint",
                "docs::type_override": "ascii_char",
                "docs::human_name": "Delimiter"
              }
            },

which makes the sinks cue file change but not website/cue/reference/remap/functions/parse_csv.cue file

diff --git a/website/cue/reference/components/sinks/base/aws_s3.cue b/website/cue/reference/components/sinks/base/aws_s3.cue
index b8cb9be0d..769c2d4d7 100644
--- a/website/cue/reference/components/sinks/base/aws_s3.cue
+++ b/website/cue/reference/components/sinks/base/aws_s3.cue
@@ -421,7 +421,7 @@ base: components: sinks: aws_s3: configuration: {
                                        delimiter: {
                                                description: "The field delimiter to use when writing CSV."
                                                required:    false
-                                               type: uint: default: 44
+                                               type: ascii_char: {}
                                        }
                                        double_quote: {
                                                description: """

I do agree to opening and working on a PR.

EDIT: rereading previous comment I noticed there being a hint of maybe not just use ascii chars but support utf8 as well in the documentation. in my opinion utf8 would in general be nice for vector but then probably should be string anyway? in case of the csv parsing the underlying lib asks for a u8 specifically.

scMarkus avatar May 14 '24 14:05 scMarkus

Ah, yeah. The documentation for remap functions is not generated automatically through all of the Configurable stuff, so it would have to be edited manually no matter what.

tobz avatar May 20 '24 14:05 tobz

@jszwedko is this still open? I am asking because #20498 was merged.

scMarkus avatar Aug 05 '24 07:08 scMarkus

@scMarkus thanks for noticing! I'll close this.

jszwedko avatar Aug 05 '24 14:08 jszwedko