Refactoring Towards Resilience: Evaluating Coupling

So far, we've been looking at our options on how to coordinate various services, using Hohpe as our guide:

  • Ignore
  • Retry
  • Undo
  • Coordinate

These options, valid as they are, assume that we need to coordinate our actions at a single point in time. One thing we haven't looked at is breaking the coupling between our actions, which greatly widens our ability to deal with failures. The types of coupling I encounter in distributed systems include, but are not limited to:

  • Behavioral
  • Temporal
  • Platform
  • Location
  • Process

In our code:

public async Task<ActionResult> ProcessPayment(CartModel model) {
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    await sendGridService.SendPaymentSuccessEmailAsync(order);
    await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");
}

Of the coupling types we see here, the biggest offender is Temporal coupling. As part of placing the order for the customer's cart, we also tie together several other actions at the same time. But do we really need to? Let's look at the three external services we interact with and see if we really need to have these actions happen immediately.

Stripe Temporal Coupling

First up is our call to Stripe. This is a bit of a difficult decision - when the customer places their order, are we expected to process their payment immediately?

This is a tough question, and one that really needs to be answered by the business. When I worked on the cart/checkout team of a Fortune 50 company, we never charged the customer immediately. In fact, we did very little validation beyond basic required fields. Why? Because if anything failed validation, it increased the chance that the customer would abandon the checkout process (we called this the fallout rate). For our team, it made far more sense to process payments offline, and if anything went wrong, we'd just call the customer.

We don't necessarily have to have a black-and-white choice here, either. We could try the payment, and if it fails, mark the order as needing manual processing:

public async Task<ActionResult> ProcessPayment(CartModel model) {
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    try {
        var payment = await stripeService.PostPaymentAsync(order);
    } catch (Exception e) {
        Logger.Exception(e, $"Payment failed for order {order.Id}");
        order.MarkAsPaymentFailed();
    }
    if (!order.PaymentFailed) {
        await sendGridService.SendPaymentSuccessEmailAsync(order);
    }
    await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");
}

There may also be business reasons why we can't process payment immediately. With orders that ship physical goods, we don't charge the customer until we've procured the product and it's ready to ship. Otherwise we might have to deal with refunds if we can't procure the product.

There are also valid business reasons why we'd want to process payments immediately, especially if what the customer is purchasing is digital (like a software license) or a finite resource (like movie tickets). It's still not a hard and fast rule; we can always build business rules around the boundaries (treat purchases as reservations, and confirm them when payment is complete).
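To make the reservation idea concrete, here's a minimal sketch of what such a rule might look like. Everything in it is hypothetical (the `Reservation` class, its statuses, and the expiry window are my assumptions for illustration, not something from our cart code): the seat is held when the order is placed and only confirmed once payment completes within the hold window.

```csharp
using System;

public enum ReservationStatus { Held, Confirmed, Released }

// Hypothetical model: a finite resource is held at checkout and
// only confirmed once payment has completed.
public class Reservation {
    public ReservationStatus Status { get; private set; } = ReservationStatus.Held;
    public DateTime ExpiresAtUtc { get; }

    // The hold window is an assumed business rule, e.g. 15 minutes.
    public Reservation(DateTime nowUtc, TimeSpan holdFor) {
        ExpiresAtUtc = nowUtc + holdFor;
    }

    // Called when payment completes; fails if the hold has lapsed.
    public bool TryConfirm(DateTime nowUtc) {
        if (Status != ReservationStatus.Held || nowUtc > ExpiresAtUtc) {
            return false;
        }
        Status = ReservationStatus.Confirmed;
        return true;
    }

    // Called by some background sweep to free up unpaid holds.
    public void ReleaseIfExpired(DateTime nowUtc) {
        if (Status == ReservationStatus.Held && nowUtc > ExpiresAtUtc) {
            Status = ReservationStatus.Released;
        }
    }
}
```

How long to hold, and what to tell the customer when a hold lapses, are exactly the kind of questions the business needs to answer.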

Regardless of which direction we go, it's imperative we involve the business in our discussions. We don't have to make things technical, but each option involves a tradeoff that directly affects the business. For our purposes, let's assume we want to process payments offline, and just record the information (naturally doing whatever we need to secure data at rest).

SendGrid Temporal Coupling

Our question now is, when we place an order, do we need to send the confirmation email immediately? Or sometime later?

From the user's perspective, email is already an asynchronous messaging system, so there's already an expectation that the email won't arrive synchronously. We do expect the email to arrive "soon", but typically there's some sort of delay. How much delay can we handle? That again depends on the transaction, but within a minute or two is my own personal expectation. I've had situations where we intentionally delayed the email, so as not to inundate the customer with emails.

We also need to consider what the email needs to be in response to. Does the email get sent as a result of successfully placing an order? Or posting the payment? If it's for posting the payment, we might be able to use Stripe Webhooks to send emails on successful payments. In our case, however, we really want to send the email on successful order placement, not order payment.

Again, this is a business decision about exactly when our email goes out (and how many, for what trigger). The wording of the message depends on the condition, as we might have a message for "thank you for your order" and "there was a problem with your payment".

But regardless, we can decouple our email from our button click.

RabbitMQ Coupling

RabbitMQ is a bit of a more difficult question to answer. I generally assume that my broker is up. Just the fact that I'm using messaging here means that I'm temporally decoupled from recipients of the message. And since I'm using an event, I'm behaviorally decoupled from consumers.

However, not all is well and good in our world, because if my database transaction fails, I can't un-send my message. In an on-premises world with high availability, I might opt for 2PC and coordinate, but we've already seen that RabbitMQ doesn't support 2PC. And if I ever go to the cloud, there are all sorts of reasons why I wouldn't want to coordinate there.

If we can't coordinate, what then? It turns out there's already a well-established pattern for this - the outbox pattern.

In this pattern, instead of sending our messages immediately, we simply record our messages in the same database as our business data, in an "outbox" table:

public async Task<ActionResult> ProcessPayment(CartModel model) {
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    await sendGridService.SendPaymentSuccessEmailAsync(order);
    dbContext.SaveMessage(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");
}

Internally, we'll serialize our message into a simple outbox table:

public class Message {
    public Guid Id { get; set; }
    public string Destination { get; set; }
    public byte[] Body { get; set; }
}
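As a rough sketch of what that serialization step might look like, here's one way to turn an event into an outbox row and back. The `OutboxSerializer` helper, the choice of JSON via `System.Text.Json`, and the `int` type for the order ID are all assumptions on my part for illustration; any serializer that round-trips the event would do.

```csharp
using System;
using System.Text.Json;

// Assumed shape of the event; the ID type is a guess for this sketch.
public class OrderCreatedEvent {
    public int Id { get; set; }
}

// Same shape as the outbox row shown above.
public class Message {
    public Guid Id { get; set; }
    public string Destination { get; set; }
    public byte[] Body { get; set; }
}

// Hypothetical helper, not part of any library.
public static class OutboxSerializer {
    // Serializes the event to UTF-8 JSON and wraps it in an outbox row.
    public static Message ToOutboxMessage<T>(T evt, string destination) {
        return new Message {
            Id = Guid.NewGuid(),
            Destination = destination,
            Body = JsonSerializer.SerializeToUtf8Bytes(evt)
        };
    }

    // Reverses the step above when the offline process (or a test) reads the row.
    public static T FromOutboxMessage<T>(Message message) {
        return JsonSerializer.Deserialize<T>(message.Body);
    }
}
```

The row's `Id` is generated once, when the message is recorded, which gives consumers something stable to de-duplicate on later.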

Each row holds the serialized message along with its destination. From there, we'll create some offline process that polls our table, sends each message, and deletes the original once it has been sent:

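while (true) {
    var unsentMessages = await dbContext.Messages.ToListAsync();
    foreach (var msg in unsentMessages) {
        // Send first, delete second: a crash in between means a duplicate
        // send on the next pass, never a lost message.
        await bus.SendAsync(msg);
        dbContext.Messages.Remove(msg);
        await dbContext.SaveChangesAsync();
    }
    // Pause between polls so an empty outbox doesn't hammer the database.
    await Task.Delay(TimeSpan.FromSeconds(1));
}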

With an outbox in place, we'd still want to de-duplicate our messages, or at the very least, ensure our handlers are idempotent. And if we're using NServiceBus, we can quite simply turn on Outbox as a feature.
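For a feel of what de-duplication on the receiving side could look like, here's a minimal sketch. The `DeduplicatingHandler` class is hypothetical, and the in-memory set stands in for what would really be a table of processed message IDs, written in the same transaction as the handler's own work.

```csharp
using System;
using System.Collections.Generic;

// Assumed shape of the event; the ID type is a guess for this sketch.
public class OrderCreatedEvent {
    public int Id { get; set; }
}

// Hypothetical consumer that ignores messages it has already seen,
// keyed on the outbox row's Id.
public class DeduplicatingHandler {
    // Stand-in for a durable "processed messages" table.
    private readonly HashSet<Guid> processedMessageIds = new HashSet<Guid>();

    public int HandledCount { get; private set; }

    // Returns true if the work was done, false if this was a duplicate delivery.
    public bool Handle(Guid messageId, OrderCreatedEvent evt) {
        // HashSet.Add returns false when the ID is already present.
        if (!processedMessageIds.Add(messageId)) {
            return false;
        }
        HandledCount++; // the real business work would go here
        return true;
    }
}
```

Delivering the same message twice then results in the work happening only once, which is what lets the outbox get away with at-least-once delivery.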

The outbox pattern lets us nearly mimic the 2PC coordination of messages and our database. Since this message is a critical one to send, the approach warrants serious consideration.

With all these options considered, we're now able to design a solution that properly decouples our different distributed resources, still satisfying the business goals at hand. Our next post - workflow options!