dd-trace-rb icon indicating copy to clipboard operation
dd-trace-rb copied to clipboard

Quantize URLs in HTTP headers

Open dasch opened this issue 6 years ago • 11 comments

We'd like to track these headers, but need to strip out PII.

dasch avatar Sep 11 '18 12:09 dasch

@dasch While I look at this, since this might be urgent on your end, I do want to offer a possible workaround using our processing pipeline. You should be able to modify the tags with our quantization functions with the following in your Datadog initializer file:

tags_to_url_quantize = [
  'http.request.headers.referer',
  'http.response.headers.location'
]

Datadog::Pipeline.before_flush(
  Datadog::Pipeline::SpanProcessor.new do |span|
    if span.name == 'rack.request'
      tags_to_url_quantize.each do |tag|
        if value = span.get_tag(tag)
          span.set_tag(
            tag,
            Datadog::Quantization::HTTP.url(value, Datadog.configuration[:rack][:quantize])
          )
        end
      end
    end
  end
)

delner avatar Sep 11 '18 20:09 delner

Although it has a broken test (which wouldn't be hard to fix), I think this would work just fine.

If I may ask, if this applies to these headers, would it reasonably apply to other tags as well? If so, it might be reasonable to expect more requests for quantizing specific content.

I think this illuminates an interesting set of improvements we could make for quantization, by enabling greater control over what and how content is quantized. I might suggest we expand upon what you have here and add a more accessible, generic way of describing quantization in the configuration.

What would you think if you were offered a quantization configuration API like the following? Just trying to feel out what an expansion of quantization might look like, if it were to be a good fit for the problem at hand.

Datadog.configure do |c|
  c.use :rack do |rack|
    rack.quantize do
      resource do |value|
        # Return a quantized value
      end

      # Quantize the tag as a URL with default options
      tag 'http.request.headers.referer', as: :url

      # Quantize the tag as a URL with custom options
      tag 'http.request.headers.referer', as: :url, options: { query: { exclude: :all } }

      # Quantize the tag using a custom block
      tag 'http.request.headers.referer' do |value|
        Datadog::Quantization::HTTP.url(value, query: { exclude: :all })
      end
    end
  end
end

We also could consider something of a more Rack-specific interface as well:

Datadog.configure do |c|
  c.use :rack do |rack|
    rack.quantize do
      # Rack-specific quantization option for 'Referer' request header
      request_header 'Referer', as: :url, options: { query: { exclude: :all } }
    end
  end
end

delner avatar Sep 11 '18 20:09 delner

I think a better solution would be to have a notion of data types for tags, with URLs being one type. So you'd do e.g. trace.set_tag("foo.url", "http://hdhfdsh.com", type: :url). Would be nice for dates, time durations, etc. as well.

I'd rather not keep track of which exact tags are used in a centralized place, but would prefer to compose my configuration of domain specific integrations.

dasch avatar Sep 12 '18 07:09 dasch

@delner would your suggested pipeline-based approach not incur a significant overhead?

dasch avatar Sep 12 '18 07:09 dasch

@dasch It shouldn't be a significant amount using the pipeline workaround. We can certainly make it perform better still by applying the change within Rack, as we're looking to do in this PR.

I do really like the idea of having types on tags or resources. There are a number of scenarios in which having types like this would be valuable when applying more sophisticated quantization, or displaying content in the UI. Some challenges this might face could include compatibility issues given how it changes a very stable API, particularly with community standards like OpenTracing which don't really have a concept of type on tags either.

Ultimately, I think it'd be super cool to implement something like this, and I've suggested this to our team for further consideration. In the pragmatic scheme of things, I think it could be a while before we're able to add this type dimension to tags.

On a different note, would it make sense to quantize all URLs the same way? Or in practice, would some scenarios require some URLs be quantized differently than others?

Could you also expand on what you mean by composing a configuration of domain specific integrations?

delner avatar Sep 12 '18 15:09 delner

On a different note, would it make sense to quantize all URLs the same way? Or in practice, would some scenarios require some URLs be quantized differently than others?

Referer, for instance, may contain private information that would be covered by SOC2, so that would perhaps need extra pruning.

Could you also expand on what you mean by composing a configuration of domain specific integrations?

Just that putting the logic into the integration (Rack) makes more sense than trying to centrally manage things.

dasch avatar Sep 13 '18 09:09 dasch

If URLs then (say between the resource and Referer) require different quantization settings, would it require different configuration settings to match?

I think introducing quantization to this tag is great, but in the same vein, I suppose I'm anticipating that additional tags or headers might appear in new Issues/PRs that require the same kind of URL quantization, but with slightly different behavior. If possible, a more general fit could be really useful for users who'd like to quantize headers would be empowered to do so.

delner avatar Sep 13 '18 15:09 delner

I mean... the HTTP spec only defines so many headers that contain URLs.

dasch avatar Sep 14 '18 08:09 dasch

Yes, that's true. I'm also thinking of the case if users were to send their own custom headers containing URLs or other sensitive data.

delner avatar Sep 14 '18 15:09 delner

I think it's fine to support tooling for that, but out of the box, the default headers should "just work". In general, I'm more interested in the default config being sound than us being able to do fancy things – at our scale, we want to avoid fragmentation.

dasch avatar Sep 14 '18 15:09 dasch

For sure, I can understand that. I think we can have the headers you've suggested receive quantization out of the box as a default so it doesn't require as much configuration as it might otherwise.

I'm going to check in with the team and see how we can incorporate this in.

delner avatar Sep 14 '18 15:09 delner