Refactor metric samples to use Atlas for tags and have time series
Integrated https://github.com/mstoykov/atlas as the internal data structure for `SampleTags`. Atlas is a B-tree implementation whose nodes store the tag pairs as `[2]string` arrays. It mostly uses pointers for comparisons, so it's very fast and read locks can be avoided. It allocates a specific tag pair only once. The memory reduction in my basic benchmarks is significant: for example, a benchmark of a similar function reduces the overall allocated memory from 600MB to 87MB.
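For illustration, here is a minimal sketch of the deduplication property described above, assuming only the public atlas API (`New`, `AddLink`) from the linked repository:

```go
package main

import (
	"fmt"

	"github.com/mstoykov/atlas"
)

func main() {
	root := atlas.New()

	// Branching the same tag pair out of the same parent node twice
	// returns the already-allocated node instead of a new one...
	n1 := root.AddLink("method", "GET")
	n2 := root.AddLink("method", "GET")

	// ...so equality is a cheap pointer comparison and no read lock
	// is needed once a node exists.
	fmt.Println(n1 == n2) // expected: true
}
```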
Closes #1831
Hmm, I'd definitely want to see more benchmarks and tests before we merge something like this :thinking: Considering we'll have just a single copy of the tags per `TimeSeries`, I don't think we need to replace a slice with such a much more complicated struct; the memory savings won't be as impressive as what you claim. It can probably reduce the allocations if you construct the final tag set gradually, at the cost of lock contention, but not as it's currently used in this PR (since you have already constructed the map you pass to `NewSampleTags()`).

Besides, I have only skimmed the atlas code, but I am not sure it works the way we'd need it to. A tag set with keys `{a: b, c: d}` should be equivalent to a tag set with the same keys in a different order, `{c: d, a: b}`, which I don't think is the case now.
> Hmm, I'd definitely want to see more benchmarks and tests before we merge something like this

We are definitely not just merging this. I would even expect us to rewrite most of the code to not have a global root, but to keep it in the `metrics.Registry`, before we even think of merging.

I would also expect at least some of the code to use the root node directly (the HTTP path definitely) instead of `NewSampleTags`, as that will also be a lot more performant than creating a map and then turning it into an atlas node.
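The difference between the two construction paths would look roughly like this (a hedged sketch; `root` stands for the shared atlas root node):

```go
// Direct construction: chain from the root; each AddLink either reuses
// an existing node or allocates the tag pair exactly once.
node := root.AddLink("method", "GET").AddLink("status", "200")

// Via NewSampleTags: a throwaway map is allocated first, and converting
// it still has to walk the same links afterwards, so the map is pure
// overhead on the hot path (and its iteration order is randomized).
tags := metrics.NewSampleTags(map[string]string{
	"method": "GET",
	"status": "200",
})
_, _ = node, tags
```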
> Considering we'll have just a single copy of the tags per `TimeSeries`

How do we build and keep only one copy? From the code I have seen so far, we always build a new slice of tags, hash it, and then fetch the canonical one from the registry. So we still made a new copy; we just threw it away in the meantime.
> Besides, I have only skimmed the atlas code, but I am not sure it works the way we'd need it to. A tag set with keys `{a: b, c: d}` should be equivalent to a tag set with the same keys in a different order, `{c: d, a: b}`, which I don't think is the case now.

It should, but you can test it and open an issue if it doesn't ;)
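A minimal version of that test could look like this (assuming only the `New` and `AddLink` API from the atlas repository):

```go
package main

import (
	"fmt"

	"github.com/mstoykov/atlas"
)

func main() {
	root := atlas.New()

	// {a: b, c: d} built in two different insertion orders from the same root.
	n1 := root.AddLink("a", "b").AddLink("c", "d")
	n2 := root.AddLink("c", "d").AddLink("a", "b")

	// If atlas normalizes the insertion order internally, both routes end
	// at the exact same node, so a pointer comparison is sufficient.
	fmt.Println(n1 == n2) // should print true if the sets are equivalent
}
```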
To be honest, I would expect the reduction in memory to be negligible for most real use cases, but the reduction in GC time in those same cases should be significant once we use this directly, building from `lib.State#Tags` instead of going from the root node up for each set.

One of the points of this structure is that it "memorizes" how it was constructed and optimizes the same route for the next time. So as long as the code that generates it takes the same route, it will be faster the following times. But building it from a map makes it more likely that the route is randomized.

I would still like to see whether it makes a difference even like that, and in which direction, with just a fairly basic test. I would expect this to be slightly ... worse, before we stop generating the whole map and instead use the `atlas.Node` for that as well.
@na-- @mstoykov @oleiade I pushed a new iteration with a basic Atlas integration. Please take a look and let me know your opinions. I expect several points of discussion, so I didn't complete the entire migration; in any case, the `js/http` package should work as expected.
Updates:

- Restored `lib.State.Tags` and renamed `metrics.TagMap` to `metrics.TagSet`
- `metrics.TagSet` is now always lock-free
- `lib.RunTags` changed from `SampleTags` to `map[string]string`
- Root set in `PreInitState`
- Resolved `RunTags` set in `TestState`
Note: as you can see from the failing xk6 test, we are breaking extensions by removing the ability to create a `SampleTags` from a map.
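For extensions that still start from a map, a small compatibility helper could bridge the gap. A hypothetical sketch (`tagSetFromMap` and the `Add` method are illustrative, not a final k6 API):

```go
// tagSetFromMap branches a new immutable set off the given root, adding
// the map's pairs one by one. Map iteration order is random, so this
// relies on equal sets built in different orders ending up at the same
// node (see the equivalence discussion above).
func tagSetFromMap(root *metrics.TagSet, m map[string]string) *metrics.TagSet {
	ts := root
	for k, v := range m {
		ts = ts.Add(k, v)
	}
	return ts
}
```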
@codebien, @mstoykov and I had a call to discuss some of the important details in this PR and here's what we decided:
- We won't have a `metrics.TagSet` with an internally mutating `atlas.Node`; instead we'll have an immutable `TagSet` that redefines `atlas.Node` like this (see the usage sketch after this list):

  ```go
  type TagSet atlas.Node

  func (tg *TagSet) Add(name, value string) *TagSet {
      return (*TagSet)((*atlas.Node)(tg).AddLink(name, value))
  }

  func (tg *TagSet) Delete(name string) *TagSet {
      // This has to go through the underlying atlas.Node (DeleteKey);
      // calling tg.Delete() here would just recurse into itself.
      return (*TagSet)((*atlas.Node)(tg).DeleteKey(name))
  }

  // ... + wrappers for other atlas.Node methods that return a *Node or have unsuitable names
  // ... + JSON methods from SampleTags
  ```
- We'll completely get rid of `metrics.SampleTags`; the JSON marshaling/unmarshaling methods will be added to `metrics.TagSet`, and the `metrics.Sample` struct will look somewhat like this (after non-indexable tags):

  ```go
  type Sample struct {
      Metric   *Metric
      Time     time.Time
      Tags     *TagSet
      Metadata map[string]string // could be nil
      Value    float64
  }
  ```
- Naming things in the `metrics.Registry` will be simpler because they are now immutable, e.g.:

  ```go
  type Registry struct {
      metrics    map[string]*Metric
      l          sync.RWMutex
      rootTagSet *atlas.Node
  }

  func NewRegistry() *Registry {
      return &Registry{
          metrics: make(map[string]*Metric),
          // All new TagSets must branch out from this root, otherwise
          // comparing them and using their Equals() method won't work correctly.
          rootTagSet: atlas.New(),
      }
  }

  // ...

  func (r *Registry) RootTagSet() *TagSet {
      return (*TagSet)(r.rootTagSet)
  }
  ```
- If we have a lot of tags being created somewhere, e.g. when emitting the HTTP metrics, we can easily create accumulator-like helpers for that task. Whether we export them or make them global will be decided on a case-by-case basis. For example, there is probably some benefit to having a helper with the system tags in the above situation:

  ```go
  tags := NewSystemTagAccumulator(t.tags, t.state.Options.SystemTags)
  // ...
  tags.AddSystemTag(metrics.TagMethod, unfReq.request.Method)
  // ...
  tags.AddSystemTag(metrics.TagStatus, unfReq.response.StatusCode)
  // ...
  ```
- `SortAndAddTags()` is another such helper; it is still needed and can be either a `TagSet` method or a standalone function.
@mstoykov, @codebien, am I missing anything?
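To make the above concrete, here is a hedged usage sketch of the immutable `TagSet` (following the method names sketched in the list, which aren't final):

```go
registry := metrics.NewRegistry()
root := registry.RootTagSet()

// Every Add returns a new *TagSet; the receiver is never mutated.
base := root.Add("scenario", "default")
perReq := base.Add("method", "GET").Add("status", "200")

// base still contains only the scenario tag, while perReq shares its
// nodes; two sets built from the same root with the same pairs are
// equal by a simple pointer comparison.
fmt.Println(base != perReq) // true
```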
As an addendum to the previous note, we realized that we can make a tiny change to `metrics.Sample` and group the `Metric` and `Tags` fields into their own simple `TimeSeries` struct and use that in `Sample`, like this:
```go
type TimeSeries struct {
    Metric *Metric
    Tags   *TagSet
}

type Sample struct {
    TimeSeries
    Time     time.Time
    Metadata map[string]string // could be nil
    Value    float64
}
```
This doesn't change anything significant for `Sample`, and the `TimeSeries` type is a plain naked struct without any methods. However, it will allow us to go a long way towards addressing https://github.com/grafana/k6/issues/2580. Because of the way comparison works in Go, you'd be able to use `==` to compare the `TimeSeries` values of two metric `Sample`s, or use them as map keys. From https://go.dev/ref/spec#Comparison_operators:

> Pointer values are comparable. Two pointer values are equal if they point to the same variable or if both have value nil. Pointers to distinct zero-size variables may or may not be equal. ... Struct values are comparable if all their fields are comparable. Two struct values are equal if their corresponding non-blank fields are equal.

We might still have a central registry for `TimeSeries` in the future (e.g. to be able to count exactly how many unique ones we have), but if we make this small change now, it's likely we won't need another breaking change to `metrics.Sample` in the future (:crossed_fingers: :sweat_smile:).
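For example, counting unique series becomes trivial (a sketch; `samples` is assumed to be a `[]Sample`):

```go
// TimeSeries has only comparable (pointer) fields, so it can be used as
// a map key directly, without us having to hash the tags ourselves.
counts := make(map[TimeSeries]uint64)
for _, s := range samples {
	counts[s.TimeSeries]++
}
fmt.Printf("unique time series: %d\n", len(counts))
```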
FYI, we almost inadvertently introduced a VU activation bug with the migration to atlas. Changes to `vu.tags` (from `k6/execution`) shouldn't persist between scenarios, and they don't in `master`, where we get our activation base tags from the main test's `runTags`: https://github.com/grafana/k6/blob/5b70e70b657750234651a337c1b736c177926855/js/runner.go#L613

But we changed it to get them from the `VU.State`: https://github.com/grafana/k6/blob/6fbde0ab9a35dd6520b83acbfd482477438e2147/js/runner.go#L616

And I perpetuated the problem in https://github.com/grafana/k6/pull/2638 with https://github.com/grafana/k6/blob/e1a641a1aa0c82d1d2f8db77611ccf4ba9bf301b/js/runner.go#L638-L639

I noticed this while working on the non-indexable tags. I've fixed it and added a test and an explicit comment; see the latest force-pushes (since I chose to fix the older commit that perpetuated the problem): https://github.com/grafana/k6/blob/1c8b7e9f758fcaecf6cb60a2e6938fcfd07c34b8/js/runner.go#L618-L621
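The shape of the fix is roughly this (a hedged sketch, not the committed code; `testRunTags` and `params` are stand-ins):

```go
// Base the activation tags on the immutable, test-wide run tags rather
// than on the VU's current state, so tags set via k6/execution in one
// scenario cannot leak into the next activation.
tags := testRunTags // *metrics.TagSet branched from the registry root
for k, v := range params.Tags {
	tags = tags.With(k, v) // returns a new set, leaving testRunTags intact
}
```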
I also added my opinion regarding the With/Without naming here: https://github.com/grafana/k6/pull/2594#discussion_r947724265
@olegbespalov I added your suggestions
@codebien there is a merge conflict
At this point I'm considering squashing when we merge, so I just merged the current master branch here, to avoid any possibility of breaking this branch in some way by rebasing. As a reminder, otherwise the `xk6-websockets` extension will be broken:

- from now on, rebasing this PR is not allowed
- to merge this PR, use the classic Merge feature to keep the commit history