Optimal Deck Chair Rearranging: The Engineering Work I Did at Convoy

written by Zach Morrissey on 10/20/2023

R.I.P. Convoy!

#TruckYeah is no longer. I joined Convoy in Jul 2022, primarily out of a desire to build things that had a more direct impact on customers than my prior role in rapid software prototyping at Tableau. To be fully honest, I realized the dire economics of it long before the collapse. We all did. My wife and I were expecting our first child in Aug 2023 and it seemed worthwhile to risk sticking around for the generous paternity leave policy. Leaving would’ve meant I’d be sacrificing that, since most companies around here require you to be at the company for 12 months prior to taking leave. I don’t regret that decision at all (even though I only got half of it in the end!).

And let’s be real: It’s a little amusing to think of what our priorities were internally as the ship was sinking. Here are some short vignettes of my favorite parts of working there. As a bonus, I’m going to add a “Was it worth it?” judgment for each project.

Some of the information has been obscured a bit so as not to reveal the Convoy secret sauce.

1. Bidding - An Actually Good Use Case for Migrating Away from Lambda

Ragging on Lambda seems to be in vogue recently (as is bemoaning that trend), so I’ll pile on. First and foremost, I think AWS Lambda is incredible at what it does, but not everything is a good fit for it. I love using the Serverless framework; it’s incredibly seamless and makes deployments a breeze.

Here’s what our system looked like, roughly:

Why Lambda Was a Bad Fit

It really had nothing to do with Lambda’s usefulness as a technology. It was an expertise problem.

  1. Lacking AWS Lambda expertise - Everything else was deployed in k8s at Convoy. In the prior year we’d lost a lot of headcount, and with it most of the people who knew Lambda well. I’ve used Lambda extensively, but I was not interested in signing up to be the support staff for all Lambda deployments.
  2. Duplicated libraries - Since some of our internal libraries relied on the service’s Docker container, using them in Lambda required rewrites of shared libs for metrics/logging/etc. Lambda’s got some quirks that prevented easy reuse, and the duplicated copies became burdensome to maintain. Any time something went wrong with the Lambdas, you could guarantee it was going to take much longer to debug.

Maintaining 25+ lambdas for multiple different bidding integrations was taxing, and our iteration speed slowed.

Implementing the Alternative: Agent Processes + SQS

  1. Used Terraform modules to create a second, mirrored bid queue for each bidder process.
  2. Sent the bid payload to both queues. Ran the second one in parallel with the first.
  3. Used a runtime flag in LaunchDarkly to flip on either the agent or the Lambda. Whichever one the flag left off treated incoming messages as a no-op.
  4. Decommissioned the Lambda.

Once the initial shared code updates were finished, flipping one of our serverless processes over to an agent was maybe ~3-5 hours of work. Spread among 4 engineers, it was a relatively low cost to migrate.
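To make that concrete, here’s a minimal sketch of what one of those agent processes could look like: a long-polling SQS consumer gated by a LaunchDarkly flag, so both the Lambda and the agent can receive the mirrored payload while only one of them actually acts on it. The queue URL, flag key, and handler below are hypothetical stand-ins, not the actual Convoy code.

import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from "@aws-sdk/client-sqs";
import * as LaunchDarkly from "launchdarkly-node-server-sdk";

// Hypothetical names -- the real queue URLs, flag keys, and bid handling differed.
const QUEUE_URL = process.env.BID_QUEUE_URL!;
const FLAG_KEY = "use-agent-bidder";

const sqs = new SQSClient({});
const ld = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY!);

async function handleBid(payload: unknown): Promise<void> {
  // ...submit the bid to the integration, emit metrics, etc.
}

async function pollForever(): Promise<void> {
  await ld.waitForInitialization();

  while (true) {
    // Long-poll the mirrored queue created by the Terraform module.
    const { Messages } = await sqs.send(
      new ReceiveMessageCommand({
        QueueUrl: QUEUE_URL,
        MaxNumberOfMessages: 10,
        WaitTimeSeconds: 20,
      })
    );

    for (const message of Messages ?? []) {
      // The runtime flag decides whether the agent acts or no-ops;
      // the Lambda consumer checks the inverse of the same flag.
      const agentEnabled = await ld.variation(FLAG_KEY, { key: "bid-agent" }, false);
      if (agentEnabled) {
        await handleBid(JSON.parse(message.Body ?? "{}"));
      }

      // Delete either way so the mirrored queue doesn't back up.
      await sqs.send(
        new DeleteMessageCommand({ QueueUrl: QUEUE_URL, ReceiptHandle: message.ReceiptHandle })
      );
    }
  }
}

pollForever().catch((err) => {
  console.error("bid agent crashed", err);
  process.exit(1);
});

Once the flag had been fully flipped and left stable for a while, the Lambda side was safe to decommission (step 4 above).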

Was This Worth Doing? Likely not. When resources are tight, migrations are a bad idea.

2. Spot Freight Live Board with Instant Bidding

Utilizing the bidding architecture above, we put together a real-time spot freight board that allowed users to instantly bid a better price. In a company where a lot of our automated bids missed the mark for a variety of reasons (and sometimes wildly so), the thinking was that we could more easily enable account managers to make connections with shippers & facilities and haggle for freight with our available margin.

Imagine this is slightly downstream of the architecture above:

Reflecting on it, this tool was incredible in that it automated a lot of work for the account team, but I don’t think users were incentivized to take advantage of it. Spending an hour trying to win a single load didn’t translate into meaningful motivation, nor did the customer relationships garner enough future freight to justify the time investment. Sales usage never materialized and the tool was largely unused half a year later.

Was This Worth Doing? Maybe, but probably not. It was cheap. It didn’t take very long to develop this app (four people over the course of about three weeks) and it was based on reasonable industry trends. Might have been more useful earlier on in the Convoy lifecycle.

3. Rules Engine Conflict Detection

There’s plenty written on this subject elsewhere, but I got to implement conflict detection for our freight decision rules engine. This is one of those classic Martin Fowler-type enterprise patterns, so there was a lot of prior art available online.

This was entirely under-the-hood in our decisioning system, but it prevented some wonky unintended consequences from happening.

Example rule, in pseudo-yaml:

rule_1:
  criteria:
    shippers: ["ABC"]
    dropoffLocations: ["SEATTLE_WA"]
  outcome: Do not bid

However, adding another rule like this could create conflicts:

rule_2:
  criteria:
    shippers: ["ABC", "DEF"] # <-- includes the same 'ABC' shipper
    dropoffLocations: ["SEATTLE_WA"]
  outcome: Definitely bid # <-- Egad, that's bad!

What would happen here if we received an opportunity to bid on a shipment from shipper 'ABC' in SEATTLE_WA? It was undefined behavior as far as the system was concerned, since it wasn’t explicitly designed to handle this.

Solving conflicts

The rules for solving conflicts were surprisingly simple, but took a while to decipher (there’s a rough sketch in code after this list):

  1. Rules can only conflict if they contain the exact same set of criteria fields. If one rule is either broader (fewer criteria fields) or more specific (more criteria fields) than another, they don’t conflict. Example: disallowing all freight from a specific shipper (broad) doesn’t conflict with allowing that shipper’s freight for specific geographic areas (specific).
  2. All criteria field values were arrays. For a given criteria field, the set of values could not overlap:
    • Set intersection? Conflicting. Same set, subset, and superset all conflicted.
    • Disjoint sets? Not a conflict. Shipper criteria values ['ABC'] and ['DEF', 'GHI'] aren’t conflicts.
  3. Every criteria field between the two rules had to overlap. If you had overlapping shippers but different geos, it wasn’t a conflict.
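Boiled down, the check looks something like the TypeScript sketch below. The Rule shape and field names are illustrative, not the actual rules engine schema:

// Illustrative shape: each criteria field maps to an array of allowed values.
type Criteria = Record<string, string[]>;

interface Rule {
  id: string;
  criteria: Criteria;
  outcome: string;
}

// Two rules conflict only if they use the exact same set of criteria fields
// AND every field's value sets share at least one value.
function rulesConflict(a: Rule, b: Rule): boolean {
  const fieldsA = Object.keys(a.criteria).sort();
  const fieldsB = Object.keys(b.criteria).sort();

  // Different field sets means broader/more specific, not conflicting.
  if (fieldsA.length !== fieldsB.length || fieldsA.some((f, i) => f !== fieldsB[i])) {
    return false;
  }

  // Every field must have a non-empty intersection (same set, subset,
  // superset, or partial overlap all count).
  return fieldsA.every((field) => {
    const valuesB = new Set(b.criteria[field]);
    return a.criteria[field].some((value) => valuesB.has(value));
  });
}

// rule_1 and rule_2 from the examples above: same fields, overlapping
// shippers, identical dropoff locations -> conflict.
const rule1: Rule = {
  id: "rule_1",
  criteria: { shippers: ["ABC"], dropoffLocations: ["SEATTLE_WA"] },
  outcome: "Do not bid",
};
const rule2: Rule = {
  id: "rule_2",
  criteria: { shippers: ["ABC", "DEF"], dropoffLocations: ["SEATTLE_WA"] },
  outcome: "Definitely bid",
};

console.log(rulesConflict(rule1, rule2)); // true

The interesting part in practice was surfacing this as a conflict dialog at rule-creation time, rather than letting a conflicting rule get saved and leaving the system to its undefined behavior.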

Was This Worth Doing? Absolutely. System correctness usually pays reliable dividends, in my experience. It took me about three weeks (with some major help in code reviews by the incredibly talented duo of Sam Sonne and Alex Demeo) and prevented major headaches over the July 4th holiday weekend just four weeks later. The conflict dialog popped up several times when users tried to enter rules for that event, and we ended up helping them make more correct decisions.

4. Things Going Bump in the Night

As with any software, there were many times where things went wrong. I’ve got a masochistic enjoyment of jumping in when that happens. I, quite unabashedly, love this stuff.

Were These All Worth Doing? Yes, but that answer is a little bit of a cop out. Removing developer randomization was one of the best ways to squeeze out the remaining person-hours from our dwindling crew.

Here’s a quick look into a few of them.

1 - The Midnight Fuel Latency Spike

Each night, around 1:30AM, all of the alarms would go off and the system that decided whether to accept or decline freight would start firing 500s left and right. What was it?

  • One shipper had a massive batch process that would make all of their next day’s volume available at the same time, roughly somewhere between midnight and 1AM.
  • The decisioning system relied on multiple downstream systems to derive its inputs: pricing, fuel costs, etc. Sometimes evaluating a single shipment required multiple downstream calls, so there was an amplification effect: hundreds of requests would quickly become thousands (a problem for a different write-up).
  • For the system that determined fuel costs, the RDS Postgres instance would show 100% CPU usage for the periods that it was being queried. The tables already had seemingly optimal indices in place.

Solution? Redis cache. The fuel data was relatively static, so caching some of the initial queries alleviated the RDS CPU spike.
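In cache-aside terms, the fix looked something like the sketch below. The table, column, and key names are made up, and the TTL is a guess at what “relatively static” means here; the important part is that repeat fuel lookups stop hitting Postgres entirely.

import Redis from "ioredis";
import { Pool } from "pg";

// Hypothetical connection strings, table, and key names.
const redis = new Redis(process.env.REDIS_URL!);
const pg = new Pool({ connectionString: process.env.FUEL_DB_URL });
const TTL_SECONDS = 6 * 60 * 60; // fuel data changes rarely, so hours-long TTLs are fine

async function getFuelPrice(region: string): Promise<number> {
  const cacheKey = `fuel-price:${region}`;

  // Cache-aside: serve from Redis when we can...
  const cached = await redis.get(cacheKey);
  if (cached !== null) return Number(cached);

  // ...and only fall back to Postgres on a miss, then populate the cache.
  const { rows } = await pg.query(
    "SELECT price_per_gallon FROM fuel_prices WHERE region = $1",
    [region]
  );
  if (rows.length === 0) throw new Error(`no fuel price for region ${region}`);

  const price = Number(rows[0].price_per_gallon);
  await redis.set(cacheKey, String(price), "EX", TTL_SECONDS);
  return price;
}

The nightly batch still produced the same burst of decisions, but after the first few misses the fuel queries were served from memory, which is what took the pressure off RDS.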

2 - End-to-End Tests Flaking Out

Existing E2E tests relied on test state from other tests in an unreliable manner. This created race conditions, and the tests were flaky, failing with little predictability.

They largely took a form that looked something like:

describe("e2e", () => {
  const apiClient = new FakeAPIClient();

  it("Test 1", async () => {
    apiClient.setValue("First Value");

    // this is fake, but imagine something occurring here that takes time
    await new Promise((resolve) => setTimeout(resolve, 500));
    apiClient.setValue("Second Value");

    expect(apiClient.getValue()).toBe("Second Value");
  });

  /**
   * This test would await the outcome of the first one in a manner that wasn't
   * entirely deterministic.
   */
  it("Test 2", async () => {
    expect(apiClient.getValue()).toBe("First Value");
  });
});

Solution? Isolate test setup/teardown between individual tests. I made each test create its own data in a parameterized manner such that individual test runs wouldn’t clash or hit race conditions. We had a pretty comprehensive test data generation library, but these tests long predated it.
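The shape of the fix, reusing the fake client from the example above (the per-test key naming and the teardown step are hypothetical): each test builds its own uniquely-keyed data in beforeEach and cleans it up in afterEach, so ordering and timing stop mattering.

import { randomUUID } from "crypto";

describe("e2e (isolated)", () => {
  let apiClient: FakeAPIClient;
  let testId: string;

  beforeEach(() => {
    // Fresh client and a unique key per test: no shared state, no ordering assumptions.
    testId = randomUUID();
    apiClient = new FakeAPIClient();
  });

  afterEach(async () => {
    // Teardown would go here: delete whatever data `testId` created
    // (hypothetical -- the real cleanup depended on the test data library).
  });

  it("Test 1", async () => {
    apiClient.setValue(`first-${testId}`);
    expect(apiClient.getValue()).toBe(`first-${testId}`);
  });

  it("Test 2", async () => {
    // No dependency on Test 1's state; this passes regardless of run order.
    apiClient.setValue(`second-${testId}`);
    expect(apiClient.getValue()).toBe(`second-${testId}`);
  });
});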

3 - CircleCI Security Incident

Sometimes you need a good crisis to test your systems. On Jan 3, 2023 CircleCI had a breach that required rotating all secrets stored in the service: passwords, tokens, etc. This meant we effectively had to relaunch every single system that we owned.

Solution?

  • Create new passwords/tokens/etc for all systems in CircleCI, many of which were 3rd-party (example being the shipper websites above).
  • Revoke all of the existing tokens where possible.
  • Lots of systems had AWS state that had drifted from their Terraform definitions. We had to bring the TF code back in line with reality so that re-applying wouldn’t undo changes in the running systems. Running terraform plan and adjusting the TF code made it possible to reconcile the drift incrementally.
  • Relaunch all of the systems in a way that allowed for as little service disruption as possible. Downstream systems were updated prior to upstream ones, and relaunching was done at low-traffic periods. The service used canary deployments via Spinnaker, which helped immensely here.

And Lastly, A Coda for Convoy

I was holding out hope against hope right up until Convoy’s rapid demise.

If there’s one true sin Convoy committed, it was over-hiring. Once that inflated payroll dried up the funding, there wasn’t any going back. Still, this doesn’t strike me as an obviously fatal mistake. Plenty of startups at around the same ARR have similar or higher headcounts. It’s obvious that the company was built on VC float, but it also seems like it wasn’t that unreasonable to assume that it’d be possible to raise similar cash later on. Don’t take your investment advice from me, though.

With my hindsight-colored glasses securely on, some of my priorities for where I spent time seem a little capricious. At the time, the economics of it made sense. We aggressively minimized effort where the time investment made no sense. We prioritized things that delivered margin fastest. In that regard, we were successful.

I’d like things to have ended differently, of course. When you see Convoy on the list of the most valuable failed startups, give a little hat tip this way towards the few who went down with the ship.

I’m (barely) on twitter or bluesky if you found this useful.