You face plenty of stress running a team in software development. You want your team not only to survive but to thrive under pressure.
It’s hard to come up with even remotely viable solutions for this, let alone to make it work.
Building your engineering team to be antifragile sounds great in theory, but is it even possible to pull off?
Adam Wolff is currently VP of Engineering at Robinhood and former head of Engineering at Facebook. He’s built antifragile teams that did wonders under pressure. Implementing the principles of antifragility in practice led to the success he’s most proud of in his career so far.
Adam Wolff is VP of Engineering at Robinhood. Robinhood offers financial services that let users trade US stock equities, options, and crypto. They’re expanding while continuously building new financial products.
Before coming to Robinhood, Adam worked at Facebook for eight years. Facebook’s culture and his experiences there shaped his views about engineering management. Facebook runs a highly antifragile organization.
The main idea behind the book Antifragile by Nassim Taleb is that you want to do more than survive stress: you want to profit from it. You should aim to set yourself up so the bigger the pressure, the more you benefit.
Black swan events are a key concept. You can’t predict these events, but they put a lot of pressure on your team. Everyone encounters these sources of massive stress at least several times in their life.
A team can be in one of three states of fragility. It can be fragile, where any stress can be fatal to the team. The bigger the distress, the bigger its negative impact.
A team can be robust. When robust teams face stress, they tend to survive but return to the same state they were in before.
Antifragile teams get better under stress. A common example for this is muscles because they get bigger and stronger from stress.
You can’t always win and have a positive response to shock. Antifragility doesn't mean that when the shock happens, you immediately benefit. It means that you survive the shock and grow stronger for it.
You need to remember that antifragility and resilience don’t always feel good. They can feel crappy. The point is that as long as you make an effort to learn and grow, you will benefit and succeed.
I used to run a startup called Sharegrove, which suffered from the most common problem engineers face: it was a product no one wanted to use. You have a great idea, you build it, put it out there and wait for the users to roll in.
This is the lack of product-market fit.
This puts stress on you and your engineering team, and you have to keep changing things to stay alive. Sometimes you need to pivot your whole startup from one idea to another, because what you expected to work doesn't work.
If you get everything right, you’ll face a lot of stress coming from success. Robinhood faced this earlier this year, when we saw amazing growth in the engagement with our platform. We started scaling up engineering to meet the increased demands.
The internal communication at Robinhood relies on Slack, so it was scary to see an outage with them. They must have been under a huge load and did what they could to meet the demands, but it's hard to anticipate spikes, and it’s even harder to recover from them.
There are extrinsic forces that create shock or stress for engineering teams. The current pandemic is a great example: COVID-19 forced engineers out of the office. It pushed people to figure out how to conduct a whiteboard meeting remotely, how to organize their days and weeks, how to measure productivity, and so on.
This has been a huge shock to the system, and the technology industry is responding well.
Preserving optionality has a lot of implications on running an engineering team; some are counterintuitive. This is the biggest idea I took away from Antifragile. The most counterintuitive thing I’ve learned is that not committing to a course of action is valuable.
It's an unusual idea, because you need to make plans in software development. Organizing a team to build a huge project over months or years is hard, and if you work without a plan, you won’t get anywhere. However, when black swan events happen, the less rigid your plan is, the more options you have to respond.
I learned a way to make it work in practice from Facebook: the trick is incrementality.
I call a plan bad when it requires us to go in a direction for three months where everything will get worse, but it’ll all work out in the end. It’s bad, because I can’t tell if it's working.
I hate when doing something for a month doesn’t seem to work, and someone says it’s because you’re not doing it hard enough. Sometimes they’re right, and it’s the right direction, but you just need to wait or invest more.
My point is that a plan that takes six months with checkpoints along the way may be better than a plan that takes you three months, but you don't know how you're doing until the end. This matters all the more because you often need to redirect every few weeks along the way.
It’s tough for engineering managers to accept, but we have to be careful about how much we invest into planning. When things go wrong, we don’t want to allow it to happen the next time, so we want to make a better plan. The phrase “plans don't survive contact with the enemy” has been true in my experience, especially at a startup like Robinhood.
It’s hard to find balance in software engineering in general. You need to have a plan, because aiming to complete a milestone is valuable. It also allows you to hold yourself accountable to that.
Knowing what the next increment should be is valuable. You get in trouble when you look too far ahead or create too many plans.
The project I'm most proud of working on is “Relay Modern.” It was at Facebook, where I was responsible for a team that was working on GraphQL. The project wasn’t doing well for a long time; it was slow, complicated and it didn’t bring much value.
We made a decision to rebuild it. We called the old version Relay Classic and the new version Relay Modern. The team made a brilliant decision to sort out compatibility between them.
Usually when you build a new API, you throw away the old one and tell everyone they have some time to migrate. Other times, you take the old version and build forward compatibility into it, to make it easier for teams to switch incrementally. This team did neither.
They took the new version, made it more flexible than Relay Classic, and built backwards compatibility into it. You could adopt Relay Modern without changing your APIs, and you could slowly fix every call site to gain the benefits of Relay Modern. We managed to preserve optionality for the developers, and the team migrated a gigantic code base at Facebook while everyone kept moving forward.
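To make the pattern concrete, here is a minimal sketch of a new API that is flexible enough to host the old one as a thin compatibility layer. It is not Relay's actual API: `ModernClient`, `ClassicCompat`, `fetch`, `field_names`, and `get_record` are all hypothetical names chosen for illustration.

```python
class ModernClient:
    """Hypothetical new API: callers state exactly which fields they need."""

    def __init__(self, records):
        self._records = records

    def field_names(self, record_id):
        # List every field stored for a record.
        return list(self._records[record_id])

    def fetch(self, record_id, fields):
        # Return only the requested fields.
        record = self._records[record_id]
        return {name: record[name] for name in fields}


class ClassicCompat:
    """Hypothetical old API, reimplemented on top of the new client."""

    def __init__(self, modern):
        self._modern = modern

    def get_record(self, record_id):
        # Old call sites always asked for the whole record, so delegate
        # to the new API and request every field.
        all_fields = self._modern.field_names(record_id)
        return self._modern.fetch(record_id, all_fields)


# Hypothetical data for illustration.
records = {"u1": {"name": "Ada", "email": "ada@example.com", "plan": "gold"}}
modern = ModernClient(records)
classic = ClassicCompat(modern)

old_style = classic.get_record("u1")      # unchanged call site keeps working
new_style = modern.fetch("u1", ["name"])  # migrated call site asks for less
```

Because the old entry point is only a wrapper, each call site can move to the new API on its own schedule, which is the optionality the paragraph above describes.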
The idea of via negativa is to remove things. It’s about simplicity. Engineers intuitively understand that it’s better to take something away from a system than to add to it.
One of the best ways to make something antifragile is to make it simple.
The more complex something is, the more fragile it is. It’s intuitive: you know when an intricate crystal falls, it’ll break. A box of sand is antifragile; under enough pressure, it can even turn into a diamond. It's simple and low in entropy.
With your engineering team, you can start taking things out. This is an important engineering principle. The best systems are the simplest systems, but simple doesn’t mean easy.
As VP of Engineering at Robinhood, I had to learn to say no. Robinhood has a more complicated development pipeline, and the products have complicated requirements as well. We often have months of legal work before we can ship even a simple feature, so it’s important for engineering leaders to push back on overly ambitious ideas from the business side.
Engineers love new technologies, but you have to be careful, because they tend to be less effective than you expect. I love new database technologies. I’d personally get rid of PostgreSQL, which builds on a relational model dating from 1970, yet we made the decision to not only keep it, but double down on it. It brings industry-defining guarantees around transactionality, and that’s what Robinhood needs, so I’m saying no to myself.
This doesn’t mean your team can't innovate; it means that you have to be careful.
We use innovation tokens for our projects, and it's great to use a couple of them, but trying to innovate in every area simultaneously is a bad idea.
The third key concept is that you want the individuals to have skin in the game. It’s surprisingly difficult to implement for an engineering team, but it is essential to make everyone feel empowered and accountable.
This idea in the book may be a rip on academia, but Taleb makes good points. If you can sit and pontificate, it doesn’t matter whether you’re right. It’s unlikely you’ll make a good prediction in this situation, because there's nothing to correct you.
If I can succeed without helping the business, that makes the organization fragile. You want to line your teams up, so the highest level in leadership aims to do the same thing as individual contributors on the lower levels.
We had an interesting example of this at Robinhood: we recently introduced the position of tech leads. They operate beside engineering managers, who tend to be very insightful in technical matters.
Our situation is special, because we have mostly hired young engineers, brilliant fresh grads from top schools. We believe it’s important that a significant part of our engineering managers are trained internally.
The best individual contributors tend to transition from engineer to engineering manager. These people are familiar with the technical side of the work, so they tend to take responsibility for technical decisions, beyond the business impact of their team.
We need tech leads to have skin in the game, to provide them with both empowerment and accountability. They need authority and levers to pull; otherwise, it’s not fair to expect them to be responsible for the technical direction.
What I don't like about agile is that it's become more like a cult than a set of ideas. It has great basics, like having the team get together every day to discuss what they're doing, or running development in increments of a few weeks. Agile also preserves optionality well if development doesn’t go smoothly.
What I sometimes miss from agile software development is the longer arc.
Agile can be surprisingly short-sighted. You set a goal, and you work towards that in two-week increments. It lacks the moments where you consider a black swan event and the monthly checkpoints where you check whether you’re still building the right product.
You may be adding buttons and responding to customer requests, but are you actually going anywhere? Over-planning is bad, but you still need a plan. Agile development does many things right, but I think you need to add more to it.
You can artificially create problems to do chaos testing. We periodically put shocks in our system to see how we respond. Ideally, the bigger the shock, the more you benefit.
We don’t want to randomly fire five people to see if the team is okay with that. But we can randomly turn off 10 load balancers or fill up the disk to see what happens.
The idea of chaos testing is to simulate failure conditions that might happen in production.
You can time this when you've got everyone on deck and ready to respond. This way, you understand the root causes before you even begin, so you can observe what the failure conditions are. If you don't do this yourself, it will still happen, but you won’t see it coming.
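A chaos experiment of this kind can be sketched in a few lines. The example below is a simulation under stated assumptions, not Robinhood's tooling: the pool of load balancers, the capacity per instance, the demand figures, and the function name `run_chaos_experiment` are all hypothetical.

```python
import random


def run_chaos_experiment(backends, kill_count, capacity_each, demand, seed=None):
    """Disable `kill_count` random backends and check if the rest can carry `demand`."""
    rng = random.Random(seed)
    killed = set(rng.sample(backends, kill_count))
    healthy = [b for b in backends if b not in killed]
    remaining_capacity = len(healthy) * capacity_each
    return {
        "killed": sorted(killed),
        "healthy": healthy,
        "remaining_capacity": remaining_capacity,
        "still_serving": remaining_capacity >= demand,
    }


# Hypothetical pool of 12 load balancers, each handling 100 requests per second.
pool = [f"lb-{i}" for i in range(12)]

# Turn off 10 of them at a quiet time, when demand is low.
quiet = run_chaos_experiment(pool, kill_count=10, capacity_each=100, demand=150, seed=1)
print(quiet["still_serving"])  # True: 2 survivors * 100 >= 150

# The same failure at peak demand exposes the weakness.
peak = run_chaos_experiment(pool, kill_count=10, capacity_each=100, demand=500, seed=1)
print(peak["still_serving"])  # False: 200 < 500
```

In a real system the "kill" step would act on actual infrastructure and the check would come from monitoring, but the shape is the same: inject a known failure, then observe whether the system still meets demand.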
A similar idea is load testing. The idea behind this is that you don’t want your users to set a record for your maximum load. You want to set the record yourself at a time when there aren't many people on your site. If something breaks, you fix it.
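As a rough illustration of setting the record yourself, the sketch below ramps load against a stand-in service until the error rate exceeds a budget. Everything here is assumed for the example: `simulated_service`, its capacity of 450 requests per second, and the 1% error budget are hypothetical, and a real load test would send traffic to a real endpoint.

```python
def simulated_service(requests_per_second, capacity=450):
    """Stand-in for a real system: requests beyond `capacity` fail.

    Returns the fraction of requests that failed.
    """
    served = min(requests_per_second, capacity)
    return (requests_per_second - served) / requests_per_second


def find_max_load(service, start=100, step=100, limit=2000, max_error_rate=0.01):
    """Ramp load in steps; return the highest rate that stayed within the error budget."""
    best = 0
    for rate in range(start, limit + 1, step):
        if service(rate) > max_error_rate:
            break  # the service started failing; stop the ramp
        best = rate
    return best


record = find_max_load(simulated_service)
print(record)  # 400: the last step below the simulated capacity of 450
```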
You can test antifragility by knocking something off the shelf, and if it breaks, you'd better replace it.
Measurement is key to antifragility, but there are no clear roadmaps. Before I came to Robinhood, I hadn't realized how much I took the measurements in place at Facebook for granted. Tech people often build a new feature and only start collecting data once it’s in production, only to realize something important is missing.
Then you have to go through the process again. When you start measuring, you often realize you’ve been looking at the wrong metric. It can take weeks or months to get a metric right.
It always seems like you can do something more impactful to increase antifragility, but observability is key. Showing its values to both the business and the engineering side is an important role for engineering leaders.
If you don't know what to do as an engineering leader, improve the release process. Keep track of the time it takes from proposing a change to putting it into production, and aim to shrink that window.
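Tracking that window can start very simply. The sketch below computes the lead time from proposal to production for a handful of changes; the change records, timestamps, and function names are hypothetical, and in practice the data would come from your version control and deployment tooling.

```python
from datetime import datetime
from statistics import median

# Hypothetical change log: when each change was proposed and when it reached production.
changes = [
    {"id": "change-1", "proposed": datetime(2020, 6, 1, 9, 0), "deployed": datetime(2020, 6, 3, 9, 0)},
    {"id": "change-2", "proposed": datetime(2020, 6, 2, 10, 0), "deployed": datetime(2020, 6, 2, 16, 0)},
    {"id": "change-3", "proposed": datetime(2020, 6, 4, 9, 0), "deployed": datetime(2020, 6, 5, 9, 0)},
]


def lead_time_hours(change):
    """Hours between proposing a change and deploying it."""
    return (change["deployed"] - change["proposed"]).total_seconds() / 3600


def median_lead_time_hours(changes):
    """Median lead time across all changes, in hours."""
    return median([lead_time_hours(c) for c in changes])


print(median_lead_time_hours(changes))  # 24.0 (lead times are 48, 6 and 24 hours)
```

The median is used here rather than the mean so that one unusually slow change does not hide an otherwise fast release process; that choice is a judgment call, not something the interview prescribes.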
Continuous release should be the goal of every engineering team. You may intuitively disagree, but it’s the safest way to build software. It’s hard to push a bad change through, and even if it happens, you’ll know what’s wrong.
The more you rely on one engineer to do some magic before the release, the more fragile you are.
Release velocity is a fundamental metric of engineering productivity. Even if your team has no idea whether a change will be good, as long as you observe the results, you can keep whatever works and roll back the rest. This leads to progress.
This is how Facebook operates. It's a massive random experiment generator, and it’s great at finding the local maxima for anything. This approach has problems too, because you end up with complicated implementations, and it’s hard to move off the local maximum, but it’s a great mechanism for improving your team and product.
The idea of OKRs has been around, but Google institutionalized it. You pick an objective, for example, to grow the number of weekly active users. Then you check key results, like how many users visit your site five days out of seven.
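The key result in that example is straightforward to compute from visit logs. This is a minimal sketch assuming the log has already been filtered to a single seven-day window; the `visits` data and the function name `users_active_at_least` are hypothetical.

```python
from collections import defaultdict
from datetime import date


def users_active_at_least(visits, min_days=5):
    """Return the users who visited on at least `min_days` distinct days.

    `visits` is an iterable of (user, day) pairs, assumed to cover one week.
    """
    days_by_user = defaultdict(set)
    for user, day in visits:
        days_by_user[user].add(day)  # a set ignores repeat visits on the same day
    return {user for user, days in days_by_user.items() if len(days) >= min_days}


# Hypothetical week of visit logs.
visits = [("alice", date(2020, 6, d)) for d in range(1, 6)]  # 5 distinct days
visits += [
    ("bob", date(2020, 6, 1)),
    ("bob", date(2020, 6, 2)),
    ("bob", date(2020, 6, 2)),  # second visit on the same day
]

print(users_active_at_least(visits))  # {'alice'}
```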
The interesting thing about OKRs is that achieving your objective isn’t the highest priority. The key is to look at what you expected to make happen six months ago and see if you were moving in the right direction. You could be wrong about what to measure and about what the key results should be, or maybe the metric is busted, and you’ve been flying blind.
This helps you develop sensitivity and find the right level for you to make plans. This is different for a 10-person startup and a 10,000-person engineering organization.
My dream is doing workflow measurement. Robinhood isn’t ready for this yet, but we're working towards it.
My goal is to get every single person at Robinhood to track their work in Jira. There are a lot of interesting metrics you can generate this way, and I’m not talking about the number of bugs fixed. These metrics are terrible, because there are variables you can’t measure.
Good metrics may be:
Staying organized is necessary, and a tool like Jira can help a lot, especially when you’re managing a remote engineering team.
Making this happen has been a bigger challenge and more work than I imagined. The first step is convincing everybody at the company that this is important to invest in. A lot of people naturally feel like it’s not important and want to focus on their work, but engineers know that the second- and third-order things help to give us control.
The most fragile thing you can do is to direct your team in detail.
The leader has the least information about the details, and you have to consider what happens to your team when you’re not there. Your job is to set direction. Beyond that, the more you help your team identify what to do for themselves, the better you're doing.
The most antifragile thing you can do is to remove yourself and let your team work.
It takes courage, because we instinctively want to make ourselves important. We're compensated well, so we want to justify our role by doing more.
It’s tough to let go of this and to realize that the best you can do is to help your team fend for themselves. You can’t be absent, even though I like the phrase, “Leadership is what happens when your back is turned.”
It's a sign of good leadership when I'm like a program, in the sense that people know what I’d say.
This is why leaders, especially in large organizations, find themselves repeating the same message. It's not good if you're just saying the same thing over and over; it's good if you see something, react to it, then adjust your message and say it again.
We had big decision meetings at Robinhood for a while. We got 20 people together and 15 pages of documentation, and the team would bring a recommendation. The point was to come out with a decision.
It's a bad way to make decisions, because the higher you go up the corporate ladder, the less likely it is that people have the relevant knowledge to make the right decision.
My idea is that the more a decision hangs in the balance, the less important it is. If there was an obvious right or wrong choice, you’d know that already, so you pick from two roughly equal options.
You still want to preserve optionality, just in case. You don’t have enough information, and there’s a chance that you're heading in the wrong direction. The bigger the decision is, the more I look at the framing and the key factors of the decision.
A trick I use for this is nitpicking the language we use to describe a decision, because neutral phrasing can help remove emotions.
An engineering leader has to help people gain confidence, see clearly, and preserve optionality, so they can explore a path and turn back if necessary. It feels bad when you have to commit to something that isn’t clear enough. Sometimes you have to, and then it’s the leader’s job to take responsibility for the decision.
The bottom line is that your job is to continuously try to put yourself out of a job. It’s hard, because you hopefully like your job, but if you’re doing it right, you aren’t necessary.
I don’t have a universal answer, but I’ll tell a story from Robinhood.
Through 2019, we were running as a matrix organization, which means we were set up in functional teams. There was an iOS team, an Android team, and so on. This made it easy to place people in the organization and to share knowledge.
The bad part was that you had to discuss every project with 15 different teams. Also, the project leads had to hammer out technical agreements across a bunch of different teams.
At the end of last year, we reorganized into a vertical organization. This means that teams are set up to solve business problems. Instead of an iOS team, now we have a cache management team including iOS developers, Android developers, data scientists, designers, and so on.
It’s more antifragile than having the iOS team and the Android team individually prioritizing the tasks from the cache management effort. It also simplifies allocation, because we can add engineers to the cache management team when they’re falling behind.
This has been successful so far.
There's a simple conclusion for an organization to be antifragile which is also a truism: You invest in people, not in ideas. Things change, so the more flexible and capable your people are, the better you'll do.
The main thing engineering managers and leaders need to do is hire, retain and develop their talent. I consider this my number one job. Recruiting engineers is essential, and you also need to help your staff improve, which ranges from providing training to helping them confront limitations they place on themselves.
Any investment into your people is a good investment. It makes your team antifragile.
At Robinhood we look for grit in the interviewing process. There's often a moment at a job interview when you think it’s hopeless, especially in technical tests, where you think you can’t finish on time.
We're looking for people who double down and try something different at that moment. Even asking for help can be good, because we often gain strength from one another.
There's a famous story of this principle in action. Walt Bettinger, the CEO of Charles Schwab, invites candidates to breakfast, gets there early, and tells the waiter to mess up their order. Then he gets to see how this person handles that problem.
It’s a great example of grit and antifragility. You face disappointment, not immediate success. Antifragile places a low value on plans, so you have to be nimble and respond to what goes wrong.
This is an example of how you can test for antifragility.
The worst thing you can do in our interviews is give up. Another red flag is when a candidate does something wrong and tries to convince themselves that it was right. We don't want these qualities in our engineering team.
About the author:
Gabor Zold is a content marketer and tech writer, focusing on software development technologies and engineering management. He has extensive knowledge about engineering management-related topics and has been doing interviews with accomplished tech leaders for years. He is the audio wizard of the Level-up Engineering podcast.