hawktracer icon indicating copy to clipboard operation
hawktracer copied to clipboard

Friendly complementary project

Open windelbouwman opened this issue 4 years ago • 26 comments

Hi,

Thanks for the talk at fosdem! Interesting stuff. This might be interesting as a visualization tool for incoming data:

https://github.com/windelbouwman/lognplot

  • Windel

windelbouwman avatar Feb 02 '20 13:02 windelbouwman

@windelbouwman sorry for late reply. The project looks really cool, and the performance seems very promising. I see that your DataBase object (TsDB) supports time-integer pair of values; Do you think it'd be possible to extend that so the DB can handle unstructured data? HawkTracer event is user-defined data structure, so I'm just curious if your library would be able to handle it.

loganek avatar May 14 '20 19:05 loganek

No problemo! The TSDB is all in memory, so it might even be more a simple data structure, and it supports timestamp / f64 pairs as well as timestamp / string pairs. I guess you could always convert the data to a string as a catch all case?

Strings are rendered as text, f64 values as a chart.

windelbouwman avatar May 14 '20 20:05 windelbouwman

Thanks for the response. HawkTracer events are more like JSON structures, and we'd like to be able to render some of the fields on the graph. E.g. we can have event like:

{
   timestamp: 1234,
   seq_number: 4,
   cpu_usage: 37.5,
   mem_usage: 45
}

and we'd like to use timestamp as the timestamp in the graph, and we'd like to be able to choose which field will be visualized on the graph (cpu_usage, mem_usage, or both). If I serialize the whole event as a string, I'm afraid I won't be able to draw the values on the diagram easily. I guess I could have multiple TSDB instances and push data from each field of the event to separate instance, but I was wondering if the database could also handle the whole event rather than f64 values? Thanks

loganek avatar May 15 '20 08:05 loganek

Okay, now I understand what you mean by structured format. No, lognplot cannot handle this directly, the simplest solution to this would be to write a python adapter script which takes in this format and forwards it to the lognplot GUI. Take a look for example at this MQTT adapter script: https://github.com/windelbouwman/lognplot/blob/master/adapters/mqtt_adapter.py

For now, I deliberately kept the data ingestion API super simple, and made some external python adatper scripts, since python lends itself very well for this job.

In the case of hawktracer, one would process each structured event, take the timestamp, and process each field and forward it to the lognplot GUI tool.

So, no, the database does not support these kind of events as a whole. What would be the benefit of this vs splitting out the event in seperate traces?

windelbouwman avatar May 15 '20 12:05 windelbouwman

@windelbouwman thanks for explanation. I don't think there's any particular benefit of supporting unstructured data in the TSDB other than simplicity, but I see that having adapter for that would completely solve the problem. Would you be interested in building some kind of proof of concept integration of lognplot with HawkTracer? If that works well, we could then start thinking of having a simple UI for HawkTracer that uses lognplot widget for visualization and TSDB for storing event data.

loganek avatar May 17 '20 11:05 loganek

That would be an interesting idea. Currently my incentive is low, since I do not use hawktracer yet. My approach would be to take the hello world example, run the demo, and then collect the data as a json stream.

My question here would be: is there a mechanism to convert the binary stream to json while the stream is live? Of is this only possible as a post processing batch operation?

The idea would be this chain:

hello-world-c.exe --> TCP/IP (binary trace data) --> conversion from binary into JSON --> TCP/IP (json data) --> lognplot adapter script.py --> TCP/IP (lognplot binary protocol) --> lognplot GUI

Is there a script which converts the hawktracer binary data into a JSON stream? Or is there a python library to decode the incoming binary hawktracer data?

windelbouwman avatar May 17 '20 14:05 windelbouwman

Even though you don't use HawkTracer yet, I think it'd be valuable for the lognplot project to do the integration as it might bring quite a few new users to the project (at the moment HawkTracer does not have 1st class UI integraiton at all, so I'm sure many users, including myself, would like to try it out).

There's a python example for reading HawkTracer stream and convert it to python dictionary and then plots the data: https://github.com/amzn/hawktracer/blob/master/bindings/python3/plot-events.py I think it could easily be converted into an application that converts data to JSON and streams it over TCP/IP. Please let me know if you need any help with that, I'm happy to assist.

Many thanks for doing this, I think this integration will be beneficial for users of both projects!

loganek avatar May 17 '20 21:05 loganek

Allright, this example looks straight forward, and I know how to hook it up to lognplot GUI. This seems rather easy to take the applicable parts of the plot-events.py and the mqtt-adapter.py scripts.

windelbouwman avatar May 18 '20 06:05 windelbouwman

Okay, a first prototype is here:

https://github.com/amzn/hawktracer/pull/79

Things to note:

  • I used the hello-world-c example, and added a tcp listener like this ht_tcp_listener_register( ht_global_timeline_get(), 9876, 2048, &error_code);.

Not all events are handled, so there might be more interesting events.

hawktracer_hello_world

windelbouwman avatar May 18 '20 18:05 windelbouwman

That looks great, thanks for that! I'll have a look closer into the integration soon and update the issue then :)

loganek avatar May 22 '20 09:05 loganek

@windelbouwman I've looked closer into the project; am I right that python and rust are 2 completely separate implementations? Somehow I assumed that the python lognplot is just a wrapper for Rust, but from what I see in the source code, it's not. What's the reason of having 2 separate implementations rather than having a binding layer in Python? Did you measure the performance of both implementations? If so, how much slower/faster python is?

Many thanks

loganek avatar May 24 '20 12:05 loganek

Hi, yes, it's two fully redundant implementations. The python version lends itself for experiments, but is many times slower. I did not benchmark this, but my advice is to use the rust variant, since it has more performance and contains less bugs.

windelbouwman avatar May 24 '20 12:05 windelbouwman

Thanks I've been trying to compile rust project, but I'm getting an error:

error[E0277]: the trait bound `ndarray::ArrayBase<ndarray::OwnedRepr<f64>, ndarray::dimension::dim::Dim<[usize; 2]>>: std::convert::From<&ndarray::ArrayBase<ndarray::OwnedRepr<f64>, ndarray::dimension::dim::Dim<[usize; 2]>>>` is not satisfied
  --> lognplotgtk/src/io.rs:95:24
   |
95 |             sig1.write(&signal)?;
   |                        ^^^^^^^ the trait `std::convert::AsRef<[_]>` is not implemented for `ndarray::ArrayBase<ndarray::OwnedRepr<f64>, ndarray::dimension::dim::Dim<[usize; 2]>>`
   |
   = note: required because of the requirements on the impl of `std::convert::From<&ndarray::ArrayBase<ndarray::OwnedRepr<f64>, ndarray::dimension::dim::Dim<[usize; 2]>>>` for `ndarray::ArrayBase<ndarray::ViewRepr<&_>, ndarray::dimension::dim::Dim<[usize; 1]>>`
   = note: required because of the requirements on the impl of `std::convert::Into<ndarray::ArrayBase<ndarray::ViewRepr<&_>, ndarray::dimension::dim::Dim<[usize; 1]>>>` for `&ndarray::ArrayBase<ndarray::OwnedRepr<f64>, ndarray::dimension::dim::Dim<[usize; 2]>>`

I use latest stable rust compiler (rustc 1.43.1 (8d69840ab 2020-05-04)); should I use any specific version to build the project?

loganek avatar May 24 '20 13:05 loganek

Okay, this is interesting, the project should compile well. This might be caused by the fact that there is no Cargo.lock in the repository. Last time I faced this issue, it was a versioning issue of ndarray used in the hdf5 package. Which versions of hdf5 and ndarray do you use?

Latest rust version should be good.

windelbouwman avatar May 24 '20 14:05 windelbouwman

Thanks, I use hd25 version 0.6.1 and have 2 versions of ndarray: 0.12.1 and 0.13.1

loganek avatar May 24 '20 15:05 loganek

Okay, that explains. Please update the lognplotgtk/Cargo.toml file to use ndarray 0.13 instead of 0.12, that should fix the issue.

Btw, to understand this problem, hdf5 uses ndarray 0.13.1 and lognplotgtk uses 0.12.1. This is caused by the relax spec for the hdf5 version (and actually the missing Cargo.lock file). Having two versions of ndarray prevents from passing ndarray 0.12 data to ndarray 0.13 functions (in this case).

The HDF5 support was an option in the past, maybe it should be an build option again?

windelbouwman avatar May 24 '20 17:05 windelbouwman

That worked, thanks! I slightly modified the example so I could see values from different event types:

 timestamp = float(data["timestamp"]) / 1e9
        for field_name in data:
            if field_name == "timestamp":
                continue

            value = data[field_name]
            if not isinstance(value, numbers.Number):
                continue
            label = f'{event_type}.{field_name}'
            lognplot_client.send_sample(label, timestamp, float(value))

but that gave me a bit unexpected result; in the UI, instead of aggregating all the fields into the same label, there's a lot of duplicated labels with _BACKUP_* suffixes. Is that by design? Or I misuse the library?

loganek avatar May 24 '20 17:05 loganek

The _BACKUP_ labels are given when a trace receives a new data point with a timestamp earlier than the last data point in the trace. This was done to ensure that new data always has a later timestamp, or it must be a new signal, so the old trace is backed up. In other words, please check the timestamp value, and that the timestamp values are monotonically increasing.

windelbouwman avatar May 24 '20 17:05 windelbouwman

I see, thanks for the explanation. Is this feature configurable? In quite common scenario, HawkTracer creates per-thread buffers to avoid unnecessary locking, and those buffers might be flushed in different times (e.g. when one thread is more active, it might flush the buffer several times, while the less-active threda might only flush the buffer at the end of the program) - so in this case, we won't be able to achieve monotonically increasing timestamps. Would be great if the _BACKUP_ feature could be somehow disabled, but I'm not sure if that would fit the architecture of lognplot?

loganek avatar May 24 '20 18:05 loganek

Ah, that's interesting. For now for sure, the best fixing place would be in de adapter script in the pull request, simply ignoring events at an earlier time. The internal data structure of lognplot expects monotonically increasing timestamps for subsequent events, and this is not something that will change probably.

windelbouwman avatar May 24 '20 19:05 windelbouwman

I'm afraid this issue will be a blocker for integrating HawkTracer with the lognplot library as multi-threaded environment is quite a common usecase for the profiler, and ignoring some events comes with some implications:

  • we have to keep track of timestamp per event field for the latest datapoint, so we can ignore invalid events - this complicates implementation a bit, but I think it's not a big deal and we could live with that
  • we are loosing data, as events from different threads are equally important

As mentioned, the first problem is not a big one, but the second one sounds to me like an issue which will block the integration.
Is there any specific reason lognplot requires increasing timestamps? Is it because of the performance, or other reasons?

loganek avatar May 24 '20 19:05 loganek

The main reason is performance, since the data is appended in increasing time, a quick query can be made on intervals. This is one of the reasons why you are able to smoothly zoom / pan data sets with millions of points. This, combined with the fact that events occur at increasing timestamps (after one another) led to the current implementation.

I don't think discarding data is a good idea, all events should be preserved.

For my understanding of hawk tracer, I have some questions:

  • When are timestamps attached to events? At the flush of a buffer, or at the moment the event happens?
  • I checked the sourcecode, and I saw a clock implementation which looks like a monotonic clock, so this should be good
  • Can different threads emit the same event?

windelbouwman avatar May 24 '20 19:05 windelbouwman

  • When are timestamps attached to events? At the flush of a buffer, or at the moment the event happens?

Events are timestamped when the event happens in the system, not when the buffer gets flushed

  • I checked the sourcecode, and I saw a clock implementation which looks like a monotonic clock, so this should be good

Monotonic clock is used for internal implementation and timestamping events, but users (if they really want to) can override the timestamp and e.g. for specific group of events can use different clocks.

  • Can different threads emit the same event?

Yes, different threads can emit the same type of event (e.g. function foo() can be called in thread T1 or in thread T2) - user can attach thread ID to the event as well (and that's what we do e.g. for callstack events), but it's not a requirement.

Overall, HawkTracer is not very strict regarding the values of the events being passed around - that makes it more flexible and suitable for many usecases. We don't want to put more restrictions on that, as that could break quite a few existing usecases.

Since in most of the cases incomming events are (more or less) sorted by timestamp, would it be possible in lognplot to sort the data set on insert (I guess some kind of implementation of insertsort would be optimal, as the data is likely to be added somewhere to the end of the dataset)? It could be an optional feature so we could enable it only when really needed (e.g. for HawkTracer usecase)?

Thanks

loganek avatar May 25 '20 09:05 loganek

Events are timestamped when the event happens in the system, not when the buffer gets flushed

This is great news, so events will have increasing timestamps!

Yes, different threads can emit the same type of event (e.g. function foo() can be called in thread T1 or in thread T2) - user can attach thread ID to the event as well (and that's what we do e.g. for callstack events), but it's not a requirement.

So one solution would be to include the thread ID in the name of the event, as a sort of prefix? Or would this only work when users explicitly attach the thread ID?

Overall, HawkTracer is not very strict regarding the values of the events being passed around - that makes it more flexible and suitable for many usecases. We don't want to put more restrictions on that, as that could break quite a few existing usecases.

Agreed, this API should remain flexible.

Since in most of the cases incomming events are (more or less) sorted by timestamp, would it be possible in lognplot to sort the data set on insert (I guess some kind of implementation of insertsort would be optimal, as the data is likely to be added somewhere to the end of the dataset)? It could be an optional feature so we could enable it only when really needed (e.g. for HawkTracer usecase)?

Sure, I guess this could be added, but for the moment being I would rather implement this logic in the adapter script. My proposal would be to have a sort of ordered queue of events for each signal name, and only forward events when they have a certain age. This would mean in python something like this:

queues = []
name_to_queue = {}

def on_sample(trace_name, timestamp, value):
    q = name_to_queue[trace_name]
    q.insert_ordered((timestamp, value))

def once_in_a_while():
    for q in queues:
        while q.first.timestamp is old_enough:
             lognplot.send_value(trace_name, q.pop_first())

Something like that.

Is there an example application which has the behavior you describe? I do not know how to reproduce the issue with the backup signals. I guess it has to be a multithreaded demo?

windelbouwman avatar May 26 '20 20:05 windelbouwman

So one solution would be to include the thread ID in the name of the event, as a sort of prefix? Or would this only work when users explicitly attach the thread ID?

User has the flexibility to attach the thread (or not). Also, even if user decides to attach the thread, it's up to the user to come up with the field name (it doesn't have to be always thread_id).

Is there an example application which has the behavior you describe? I do not know how to reproduce the issue with the backup signals. I guess it has to be a multithreaded demo?

No, actually it doesn't even have to be multithreaded demo. If you consider the following example:

void foo()
{
   HT_G_TRACE_FUNCTION();
   bar();
}

void bar()
{
   HT_G_TRACE_FUNCTION();
   // some code
}

In this case, HawkTracer generates 2 events (one for measuring scope of bar, and one for foo). The callstack would look something like that:

+-------- foo --------+
     +-- bar --+

Since the bar finishes earier, HawkTracer pushes bar event first to the timeline (even though its timestamp is higher than foos timestamp), and when foo completes, it pushes the foo event.

You should be able to reproduce this by running one of the hello-world examples, e.g. https://github.com/amzn/hawktracer/blob/master/examples/tutorials/hello_world/hawktracer-hello-world.cpp

The proposal with keeping track of age would work for many cases, unfortunately, I saw quite a few cases where people do:

void main_function()
{
   HT_G_TRACE_FUNCTION();
   while (true) 
   {
      // main application logic goes here
   }
}

That means, that the main_function's tracepoint will always be pushed at the very end of the program, while all the newer events are already pushed to the database. This is just one example, but we might have other scenarios where keeping the age of events might not be enough.

One solution to overcome it would be to generate 2 events (start and stop) for the function, but that again is just a pre-defined HawkTracer event, that means that user might implement the old behaviour anyway. Also, we try to avoid having 2 events for one function call to save some bandwidth (in many cases, the bandwidth is critical).

I'm happy to consider other ideas as well, but at the moment nothing that works comes to my mind :)

loganek avatar May 26 '20 21:05 loganek

Aw snap, I understand the problem now. So the issue is, an event can be received with a timestamp much longer ago, since the timestamp will be of the start event. The value of the event will be the duration of that event, say the function call. One solution would be as you suggest, emit a start event and a stop event for functions. Another option would be to take the timestamp of the function end event as timestamp. Both options are pretty lame I must say.

Too bad, it would be nice to visualize hawktracer output in realtime, instead of via post processing. lognplot is realy intended for live data. One option would be to add code which allows measurements to be inserted in the middle of a trace, but this would add extra code, which I'm reluctant to add. The usecase for the backup signal is that if a trace starts over, say from zero, it is added as a new trace. This may happen in embedded systems where the clock starts at zero after a reset.

windelbouwman avatar May 27 '20 13:05 windelbouwman