Refactoring Towards Resilience: Async Workflow Options

Other posts in this series:

In the last post, we looked at coupling options in our 3rd-party resources we use as part of "button-click" place order, and whether or not we truly needed that coupling or not. As a reminder, coupling is neither good nor bad, it's the side-effects of coupling to the business we need to evaluate in terms of desirable and undesirable, based on tradeoffs. We concluded that in terms of what we needed to couple:

  • Stripe: minimize checkout fallout rate; process offline
  • Sendgrid: Send offline
  • RabbitMQ: Send offline

Basically, none of our actions we determined to need to happen right at button click. This doesn't hold true for every checkout page, but we can make that assumption for this example.

Sidenote - in the real-life version of this, we opted for Stripe - undo, SendGrid - ignore, RabbitMQ - ignore, with offline manual re-sending based on alerts.

With this in mind, we can design a process that manages the side effects of an order placement separate from placing the order itself. This is going to make our process more complicated, but distributed system tend to be more complicated if we decide we don't want to blissfully ignore failures.

Starting the worfklow

Now that we've decided we can process our three resources out-of-band, the next question becomes "how do I signal to that back-end processing to do its work?" This largely depends on what your backend includes, if you're on-prem or cloud etc etc. Azure for example offers a number of options for us for "async processing" including:

  • Azure WebJobs
  • Azure Service Bus
  • Azure Scheduler

In my situation, we weren't deploying to Azure so that wasn't an option for us. For on-prem, we can look at:

From these three, I'm inclined towards Hangfire as it's easy to integrate into my Web API/MVC/ASP.NET Core app. A single background job executing say, once a minute, can check for any pending messages and send them along:

RecurringJob.AddOrUpdate(() => {
    using (var db = new CartContext()) {
        var unsent = db.OutboxMessages.ToList();
        foreach (var msg in unsent) {
            Bus.Send(msg);
            db.OutboxMessages.Delete(msg);
        }
        db.SaveChanges();
    }
}, "*/1 * * * *");

Not too complicated, and this will catch any unsent messages from our API that we tried to send after the DB transaction. Once a minute should be quick enough to catch unsent messages and still not have it seem like to the end user that they're missing emails.

Now that we've got a way to kick off our workflow, let's look at our workflow options themselves.

Workflow options

There's still some ordering I need to enforce on my external resource operations, as I don't want emails to be sent without payment success. Additionally, because of the resilience options we saw earlier, I don't really want to couple each operation together. Because of this, I really want to break my workflow in multiple steps:

In our case, we can look at three major workflows: Routing Slip, Saga, and Process Manager. The Process Manager pattern can further break down into more detailed patterns, from the Microservices book "Choreography" and "Orchestration", or as I detailed them a few years back, "Controller" and "Observer".

With these options in mind, let's look at each in turn to see if they would be appropriate to use for our workflow.

Routing Slip

Routing slip is an interesting pattern that allows each individual step in the process to be decoupled from the overall process flow. With a routing slip, our process would look like:

We start create a message that includes a routing slip, and include a mechanism to forward along:

Bus.Route(msg, new [] {"Stripe", "SendGrid", "RabbitMQ"} 

Where "RabbitMQ" is really just and endpoint at this point to publish a "OrderComplete" message.

From the business perspective, does this flow make sense? Going back to our coordination options for Stripe, under what conditions should we flow to the next step? Always? Only on successful payment? How do we handle failures?

The downside of the Routing Slip pattern is it's quite difficult to introduce logic to our workflow, handle failures, retries etc. We've used it in the past successfully, and I even built an extension to NServiceBus for it, but it tends to fall down in our scenario especially around Stripe where I might need to do an refund. Additionally, it's not entirely clear if we publish our "Order Complete" message when SendGrid is down. Right now, it doesn't look good.

Saga

In the saga pattern, we have a series of actions and compensations, the canonical example being a travel booking. I'm booking together a flight, car, and hotel, and only if I can book all 3 do I call my travel booking "complete":

Image CC 3.0 from http://vasters.com/archive/Sagas.html

Does a Saga make sense in my case? From our coordination examination, we found that only the Stripe resource had a capability of "Undo". I can't "Undo" a SendGrid call, nor can I "Undo" a message published.

For this reason, a Saga doesn't make much sense. Additionally, I don't really need to couple my process together like this. Sagas are great when I have an overall business transaction that I need to decompose into smaller, compensate-friendly transactions. That's clearly not what I have here.

Process Manager - Orchestration/Controller

Our third option is a process manager that acts as a controller, orchestrating a process/workflow from a central point:

Now, this still doesn't make much sense because we're coupling together several of our operations, making an assumption that our actions need to be coordinated in the first place.

So perhaps let's take a step back at our process, and examine what actually needs to be coordinated with what!

Process Examination

So far we've looked at process patterns and tried to bolt them onto our steps. Let's flip that, and go back to our original flow. We said our process had 4 main parts:

  1. Save order
  • Process payment
  • Email customer
  • Notify downstream

From a coupling perspective, we said that "Process Payment" must happen only after I save the order. We also said that "Email customer" must happen only if "Process Payment" was successful. Additionally, our "notify downstream" step must only happen if our order successfully processed payment.

Taking a step back, isn't the "email customer" a form of notifying downstream systems that the order was created? Can we just make an additional consumer of the OrderCreatedEvent be one that sends onwards? I think so!

But we still have the issue of payment failure, so our process manager can handle those cases as well. And since we've already made payments asynchronous, we need some way to signal to the help team that an order is in a failed payment state.

With that in mind, our process will be a little of both, orchestration AND choreography:

We treat the email as just another subscriber of our event, and our process manager now is only really concerned about completing the order.

In our final post, we'll look at implementing our process manager using NServiceBus.