Building End-to-End Diagnostics and Tracing: Diagnostic Events

Posts in this series:

Source Code

In the last post, we looked at the W3C Trace Context standard, and how we can extend NServiceBus to propagate these new headers using middleware. With many monitoring and observability tools, this can be enough for tracing tooling to stitch together the entire request timeline.

These tools rely on agents installed in your environments, either as background services or container sidecars to "listen in" on your traffic and report back to some centralized reporting store that we can then use to view traces. Although this approach can seem a bit intrusive, it does not require any changes to existing code.

One disadvantage to this approach is that it can be difficult to report more fine grained metrics - things like:

  • How long did this operation take?
  • What was the nature of this communication?
  • What was the cycle time/processing time?
  • Did this operation complete successfully? Was it retried? How many times?

For more complex observability scenarios, merely attaching headers to requests isn't going to suffice.

Instead, we need to surface diagnostics out of our system into "something". Eventually, that "something" will be OpenTelemetry, but first, we need to surface diagnostics.

Diagnostics Events

With the Activity API, we have a central place where we can store diagnostic information across requests, but the Activity API can do quite a bit more. We want to surface diagnostic information out to OpenTelemetry, but I don't want to tie my middleware directly to one kind of telemetry listener. Even though OpenTelemetry is an emerging standard, I'd like some indirection from my events to OpenTelemetry so that any tool that cares about diagnostic events can listen in to what's going on.

Similar to Activity, we need a fundamental component to expose diagnostic events. With .NET Core, that component is a DiagnosticListener. A DiagnosticListener exposes an observable (rather than pure .NET events) so that interested subscribers can listen in to diagnostic events, completely decoupled from our code raising the events.

Of course, we won't raise events in a vacuum - we actually need to design what these events look like.

In the .NET Core codebase, one can find many different examples of raising diagnostic events:

You can find others in the EF Core codebase, and some Azure SDKs as well. The general idea is we create a listener with a well-known name, and write to the listener at appropriate times (typically, when our activity starts/stops).

The Activity API and diagnostic listener API roughly correspond to the OpenTelemetry, but there is an open GitHub issue for .NET 5 to strengthen this bond.

Raising a diagnostic event is fairly straightforward, we call the Write method:

 _listener.Write("EventName", <event object>);

Now we need to figure out two things - what to call our event and what that event object should be!

Diagnostics Events

In some usages of raising the diagnostic events inside the .NET Core codebase, there wasn't any correlation between the activity name and event name, and worse - that event object was an anonymous type!

_listener.Write("My Cool Event", new {
    SomeValue = someValue,
    AnotherValue = anotherValue
});

This causes major headaches with consumers of the event who only receive something of type object but have to use reflection instead of casting to get the information out.

With our events, I don't want to fall into that trap, we want our consumers to have

  • Some type that they can cast the object to
  • That type should include all available information

We can find some more guidelines in the Diagnostic Source Users Guide, which recommends how we design and use diagnostic listeners and events.

We also need to decide what interesting points we want to raise the diagnostic events, but that's somewhat straightforward for us - when the activities we create start and stop. We can always add other diagnostic events in the future if we like. With that in mind, we can create two consumer and two producer events, representing the start/stop of the activity:

public class BeforeSendMessage
{
    public const string EventName = Constants.ProducerActivityName + ".Start";

    public BeforeSendMessage(IOutgoingPhysicalMessageContext context) 
        => Context = context;

    public IOutgoingPhysicalMessageContext Context { get; }
}

public class AfterSendMessage
{
    public const string EventName = Constants.ProducerActivityName + ".Stop";

    public AfterSendMessage(IOutgoingPhysicalMessageContext context)
        => Context = context;

    public IOutgoingPhysicalMessageContext Context { get; }
}

public class BeforeProcessMessage
{
    public const string EventName = Constants.ConsumerActivityName + ".Start";

    public BeforeProcessMessage(IIncomingPhysicalMessageContext context)
        => Context = context;

    public IIncomingPhysicalMessageContext Context { get; }
}

public class AfterProcessMessage
{
    public const string EventName = Constants.ConsumerActivityName + ".Stop";

    public AfterProcessMessage(IIncomingPhysicalMessageContext context) 
        => Context = context;

    public IIncomingPhysicalMessageContext Context { get; }
}

Rather than trying to parse and pluck individual pieces of information out of the currently executing context, I just include the entire Context of whatever behavior is executing. I don't have any additional information while processing, but if I did, I'd likely leverage the Activity instead of customizing the event message.

With the events defined, now I can move on to declaring and writing diagnostic events in my middleware.

Producing diagnostic events

Following the diagnostic source user guidelines, each of our middleware behaviors needs to declare a DiagnosticSource:

public class ConsumerDiagnostics : Behavior<IIncomingPhysicalMessageContext>
{
    private static readonly DiagnosticSource _diagnosticListener 
        = new DiagnosticListener(Constants.ConsumerActivityName);

I use the same name for my activity as I do for the listener name. Individual diagnostic events written by that DiagnosticSource will then have that name prepended to the event name, so I'll have NServiceBus.Diagnostics.Receive as my activity name and NServiceBus.Diagnostics.Receive.Start and NServiceBus.Diagnostics.Receive.Stop as the diagnostic event names.

With my listener created, I want to write the start and stop events where I used to start my Activity. However, the usage guidelines recommend only raising a diagnostic event when there are listeners to avoid the overhead. I'll use the StartActivity method as a convenience to start my activity if there is a listener:

if (_diagnosticListener.IsEnabled(BeforeProcessMessage.EventName, context))
{
    _diagnosticListener.StartActivity(activity, new BeforeProcessMessage(context));
}
else
{
    activity.Start();
}

The StartActivity and StopActivity methods on DiagnosticSource will take care of the work of starting/stopping the activity and raising the diagnostic event, named based on the Activity operation name and adding ".Start" or ".Stop" to the end.

Ending the activity now delegates to the diagnostics listener:

private static void StopActivity(Activity activity,
    IOutgoingPhysicalMessageContext context)
{
    _diagnosticListener.StopActivity(activity, new AfterSendMessage(context));
}

With this in place, I've now got diagnostic events firing as part of my middleware, and any interested listeners can pick those events up and do whatever they want!

Right now, I'm only interested in listening to diagnostic events for telemetry purposes, but diagnostic events are also useful for logging, debugging, testing, and anything else where I want to peak "under the hood" of whatever is going on.

In the next post, we'll look at leveraging the OpenTelemetry SDK (in alpha) to create listeners and collectors for these new events.