Common Frameworks, Read Through Aware, Capable, Able, Willing (ACAW)

Ani Achugbue

A few years ago I joined a fairly large organization right as they were adopting the 9-Box framework. The new Head of HR and the leadership team were visibly excited because many of them had used a similar system at previous companies. For the leaders who knew it well, the calibration meeting was a chance to shine in front of the CEO. For the rest of the LT, who hadn’t used it before, the prospect of talking openly about their team members’ performance in front of peers, instead of in private with HR, was something to dread.

What became clear in the first calibration meeting was that three voices were always fighting for dominance in the room.

  • The CEO, who relied on their own gut about anyone or any team that came up. Regardless of what evidence was presented, they always returned to a personal feeling of where someone belonged. Usually they had a few champions they preferred, and everyone else was compared to those select individuals.
  • The HR team, focused on process and procedure since they didn’t actually know the ins and outs of most of the roles. When the discussion happened in the open, the HR team tended to side with the CEO because they had no other way to push back that didn’t come down to “they said, we said.”
  • The managers, team leads, and department heads, stuck in the most uncomfortable position. The high-performing leads, the ones whose teams clustered in the upper-right corner of the grid, were forced to push some of their own people into lower boxes just to show they were “being objective.”

The process created fast alignment inside the room. The results outside the room were a mess.

Inside the meeting, conclusions and decisions came easily. Outside the meeting, most employees couldn’t make sense of where they landed or why. People who were top performers by every measure of their actual job were called average. People who were underperforming in obvious ways were considered great because of their visibility to the CEO or their relationships with leadership.

As a manager, the hardest part was the conversation that followed. Someone would ask, “What do I need to do to get a raise this cycle?” and I had no honest answer for them. The truth was that they weren’t one of the people the room had decided to favor. Not because their work was lacking. Because of how the room had read them.

That’s when I came back to the four questions. Aware. Capable. Able. Willing. The 9-Box framework I had been forced to use wasn’t asking any of them. It was asking a different question entirely: “Where does this person sit on a 3x3 relative to everyone else in the room?” Beyond the 9-Box framework, I thought about all the other evaluation structures I had used and saw a pattern. Many of them suffered from similar issues. Instead of providing guidance for evaluating individuals and direction on how to grow them, most focus on comparative metrics that lighten the manager’s assessment load.

9-Box Grid

McKinsey designed the 9-box in the 1970s for General Electric’s business units, plotting industry attractiveness against competitive strength. HR repurposed the structure under Jack Welch by swapping the axes to performance and potential. No one is credited as the inventor of the people-evaluation version, which turns out to be the first useful clue.

The mechanics are simple. A 3x3 grid. Performance on one axis, potential on the other, each rated low, moderate, or high. The nine cells get labels such as Star, Core Player, and Risk. Each cell carries a default talent move (accelerate, retain, develop, exit). A calibration meeting argues placements.

Now read it through an ACAW lens. “Performance” is one rating that bundles all four questions together. A high-performance rating could mean any combination of aware, capable, able, and willing. A low-performance rating could mean failure on any one of those four. The grid cannot tell you which. It produces a single number where ACAW would produce four.

That single number quietly steers what gets measured. Faced with rating “performance” in one score, evaluators gravitate to what’s easy to count: tickets closed, calls completed, points shipped, deals booked. The four questions of ACAW would push a manager to ask about coordination, judgment, and conditions. The 9-box asks none of that. It rewards the legible activity over the harder-to-see outcome. Teams optimize accordingly. The cross-functional work that produces the actual result gets quietly devalued, because no column on the grid is built to count it.

“Potential” is more interesting. Potential is a guess at future ACAW with no current evidence. Gartner’s research found that only about 15% of high performers are also high potential. That number is what you’d expect when you ask managers to predict future ACAW from current ACAW: they default to the data they have, plot performance twice, and call the second plot potential.

What the 9-box skips entirely is the third question, Able. Conditions are invisible on the grid. A person blocked by their environment plots the same as a person who isn’t capable. The framework gives you no way to see the difference, so the difference doesn’t get acted on.
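
To make that concrete, here is a minimal Python sketch. Everything in it is my own illustration rather than part of any published 9-box method: the `AcawAnswers` record, the function names, and especially the assumed rule that a single “no” is enough to produce a low performance rating.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AcawAnswers:
    """Hypothetical record: one yes/no answer per ACAW question."""
    aware: bool
    capable: bool
    able: bool
    willing: bool


def grid_performance(answers: AcawAnswers) -> str:
    """Assumed collapse rule: the grid only sees whether the work got done."""
    delivered = all(asdict(answers).values())
    return "high" if delivered else "low"


def failed_questions(answers: AcawAnswers) -> list[str]:
    """The detail the four questions keep and the single rating discards."""
    return [name for name, ok in asdict(answers).items() if not ok]


# One person is blocked by conditions, the other lacks the skill.
blocked = AcawAnswers(aware=True, capable=True, able=False, willing=True)
unskilled = AcawAnswers(aware=True, capable=False, able=True, willing=True)

print(grid_performance(blocked), grid_performance(unskilled))  # low low
print(failed_questions(blocked), failed_questions(unskilled))  # ['able'] ['capable']
```

Two different problems, one identical cell. The first person needs their conditions changed. The second needs training. The single rating gives a manager no way to tell which one they are looking at.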

A second problem surfaces in calibration meetings. The 9-box forces managers and leaders to argue placements across functions they don’t operate inside. A head of sales weighs in on a senior engineer. A head of engineering weighs in on a senior account manager. In practice, each leader falls back on the proxy they understand from their own function: tickets closed, calls completed, deals booked. A senior person whose actual job is to elevate the team, whose value lives in the work that doesn’t show up as their own individual output, reads as low-performing through someone else’s lens. The grid doesn’t surface this distortion. It absorbs it.

The verdict: the 9-box is a bucketing tool, not an evaluation. It produces categories managers can act on without ever surfacing the diagnosis underneath. Used after a real evaluation, the grid is a placement device. Used as the evaluation, it launders bias as analysis. Gallup found that 64% of large companies use the 9-box and 9% think it works.

Top Grading

Industrial psychologist Brad Smart developed Top Grading while consulting for Welch’s GE in the 1980s and 1990s. He codified it in his 1999 book Topgrading. His son Geoff Smart co-developed it, founded the consulting firm ghSMART in 1995, and produced a leaner version in Who: The A Method for Hiring (2008). By lineage, Top Grading is the front-end-of-the-funnel cousin of Welch’s vitality curve.

The mechanics are heavy. A 12-step hiring process organized around a Job Scorecard, a Career History Form, and a 2-to-4-hour tandem CIDS interview that walks the candidate chronologically through every job they’ve ever held. Candidates are sorted into A, B, and C players, where A means top 10% at the price you’re paying. The framework also includes TORC, the Threat of Reference Check, where the candidate must personally arrange reference calls with all former bosses as a condition of advancing.

Read through ACAW, the framework’s central move is to treat A, B, and C as intrinsic properties of the person. The four questions of ACAW say something different: there is no such thing as an A player in the abstract. A person is aware, capable, able, and willing in this role, in this team, with these conditions. Move them, and the answers can change.

Harvard Business School professor Boris Groysberg’s Chasing Stars (2010) made this concrete. He tracked star Wall Street analysts as they moved between firms and found their results were firm-specific. The “A player” they had been at Goldman didn’t port to Merrill. What ported was their name. Their performance dropped because the conditions changed. In ACAW terms, “A” wasn’t a property of the analyst. It was a property of the system the analyst had been operating inside.

The same observation lands harder in enterprise software development. The A-player engineer who shipped at Google doesn’t necessarily ship at a 50-person startup. The system around them (code review culture, deployment infrastructure, team composition, technical debt level) was doing as much of the work as they were. Disciplines and functions exist inside an enterprise software organization for a reason. No one ships alone. The A-player myth treats output as the property of the engineer when it’s actually the property of the system.

What Top Grading skips, then, is the third question, Able. By treating “A player” as identity, the framework makes environmental issues invisible. A B player who would be an A player with different conditions reads identical to a B player who couldn’t reach A under any conditions. There’s no diagnostic for the difference. There can’t be: the framework refuses to look.

TORC adds a wrinkle. It’s an attempt to verify the fourth question, Willing, before the hire. The candidate has to demonstrate they’re willing to subject themselves to scrutiny. In practice, TORC filters for compliance rather than honesty, and it assumes a candid-reference culture that U.S. employers have legally retreated from for two decades. The signal it actually surfaces isn’t willingness. It’s docility.

The verdict: Top Grading treats hiring as the entire evaluation problem. The framework assumes that if you sort A, B, and C correctly at hire, ACAW takes care of itself afterward. That assumption breaks the moment conditions change, which they always do. By treating people as types, it can’t see that the same person ranges across all four ACAW answers depending on context. GE itself faced a class action seeking $500 million that alleged the A/B/C taxonomy was biased against women.

Stack Ranking

Stack ranking is the framework most directly associated with Jack Welch. He introduced GE’s “20-70-10” vitality curve in the early 1980s: top 20% rewarded, middle 70% retained, bottom 10% fired annually. The press called him “Neutron Jack.” Forced distribution is older than Welch, but Welch’s version is the one that spread through corporate America.

The mechanics. Corporate sets a distribution target assuming a bell curve, cascades the target down the hierarchy, and forces it again at every level. Calibration meetings pit managers against each other to “sell” their reports into top buckets. An employee’s rating depends on the manager’s political skill as much as the work. The bottom bucket goes onto a Performance Improvement Plan or out the door.

Read through ACAW, stack ranking is the strange one in this article. It doesn’t ask any of the four questions. It asks where you sit relative to the people next to you.

Take this seriously. A team where every member is aware, capable, able, and willing still has to be ranked top to bottom. The forced distribution manufactures low performers regardless of actual performance. A team where nobody is aware, capable, able, or willing still has top performers, relative to each other. The framework cannot see the absolute quality of the work. It only sees the gradient.
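
A short Python sketch shows the effect. The function, the team names, and the scores are all hypothetical, and the rounding rule that guarantees at least one person in each end bucket is my assumption about how a strict 20-70-10 target would be enforced on a small team.

```python
from collections import Counter


def forced_distribution(scores: dict[str, float],
                        top: float = 0.20,
                        bottom: float = 0.10) -> dict[str, str]:
    """Assign 20-70-10 buckets by rank alone; absolute scores are never consulted."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    # Assumption: the target is enforced even on small teams,
    # so each end bucket always holds at least one person.
    n_top = max(1, round(n * top))
    n_bottom = max(1, round(n * bottom))
    buckets = {}
    for position, name in enumerate(ranked):
        if position < n_top:
            buckets[name] = "top"
        elif position >= n - n_bottom:
            buckets[name] = "bottom"
        else:
            buckets[name] = "middle"
    return buckets


# Hypothetical scores out of 100: one uniformly strong team, one uniformly weak.
strong_team = {f"s{i}": 95 - i * 0.5 for i in range(10)}  # 95.0 down to 90.5
weak_team = {f"w{i}": 40 - i * 2 for i in range(10)}      # 40 down to 22

strong_buckets = forced_distribution(strong_team)
weak_buckets = forced_distribution(weak_team)

print(Counter(strong_buckets.values()))
print(Counter(weak_buckets.values()))
print(strong_buckets["s9"], weak_buckets["w0"])  # bottom top
```

Both teams come out as two top, seven middle, one bottom. The person scoring 90.5 lands in the bottom bucket and the person scoring 40 lands in the top one, because the only input the function ever reads is position.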

There’s a deeper problem with the framework’s premise. Stack ranking assumes everyone should be performing at the top. Healthy teams don’t work this way. Research on team composition consistently finds that effective teams need a mix of capabilities, not a homogeneous top tier. Belbin’s team-roles work and Reynolds and Lewis’s HBR research on cognitive diversity arrive at the same conclusion: a team of all “A players” underperforms a balanced team because no one is doing the connector work, the maintenance work, or the calm-when-things-explode work. Stack ranking can’t see any of this. It only sees the gradient.

Stack ranking has produced repeated age-discrimination class actions. Ford settled for $10.5 million in 2001. Goodyear dropped the practice in 2002 amid similar suits. Conoco and Microsoft were also sued. The pattern is consistent: a framework that ranks people on attributes it cannot define ends up filtering on attributes it shouldn’t.

The verdict: stack ranking measures relative position, not the diagnosis underneath that position. It’s a calculator that assumes the underlying evaluation has already happened somewhere else. In practice, that evaluation never happens. The ranking is the evaluation. The strongest possible signal is GE itself: Welch’s own successor, Jeff Immelt, walked away from Welch’s most famous management invention in 2015 and replaced it with PD@GE, a continuous-feedback application.

360 Reviews

Multi-rater feedback traces to the German Reichswehr around 1930, where psychologist Johann Baptist Rieffert built it as an officer-selection method. Civilian use begins at Esso in the 1950s. The modern instrument took shape in the late 1970s as a leadership-development tool, with the Center for Creative Leadership in Greensboro, North Carolina becoming its academic home. The term “360” went mainstream in the 1990s, propelled by Welch’s GE adoption and the rise of cheap desktop survey software.

The mechanics. Ratings come from manager, peers, direct reports, self, and sometimes external stakeholders. The instrument uses a competency framework with Likert scales and open comments. Peer and subordinate input is typically anonymized and aggregated. The results are best delivered through a coached debrief. The central design choice is whether the 360 is used developmentally (helping the person improve) or evaluatively (feeding compensation and promotion). Most research supports developmental use only.

Read through ACAW, 360 reviews try to gather data from multiple angles. The problem isn’t the multiple angles. The problem is that the framework doesn’t anchor the angles to specific questions tied to specific responsibilities. Each rater is asked something like “rate this person on collaboration on a scale of 1 to 5,” which means each rater silently answers a different question. One rater scores the person against their own model of collaboration. Another rater scores against a different model. The numbers come back looking comparable when they aren’t.

The empirical finding here is the most damning in the entire article. Scullen, Mount, and Goff published a study in the Journal of Applied Psychology in 2000 showing that 53% to 62% of the variance in 360 ratings comes from idiosyncratic rater effects, and only about 25% from actual ratee performance. Translated: a 360 measures the rater more than the ratee. Marcus Buckingham built two HBR pieces on this finding, The Fatal Flaw with 360 Surveys in 2011 and The Feedback Fallacy (with Ashley Goodall) in 2019, arguing that humans can only reliably rate themselves, not others’ abstract behaviors.

ACAW disagrees with the strong version of Buckingham’s claim. Humans can rate a specific question about a specific responsibility in a colleague. What we cannot reliably do is rate abstract competencies on a Likert scale. The four questions of ACAW give a 360 something it currently lacks: an anchor.

Imagine a 360 where every rater is asked, for each major responsibility on the person’s job description, four yes-or-no questions. Was this person aware of this responsibility? Capable of doing it? Able to do it under the conditions you observed? Willing to do it? With those questions, the multi-rater structure becomes useful. The manager, the peers, and the direct reports might disagree on which question failed, and that disagreement is itself a valuable signal. Without those questions, you get up to 62% rater idiosyncrasy.
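
Here is one way that anchored 360 could be tallied, as a hedged Python sketch. The response format, the `summarize` function, the rater roles, and the sample responsibility are all invented for illustration. No 360 tool I know of ships this structure.

```python
from collections import defaultdict

QUESTIONS = ("aware", "capable", "able", "willing")


def summarize(responses):
    """Group the "no" answers by responsibility and question.

    responses: iterable of (rater_role, responsibility, {question: bool}).
    Returns {responsibility: {question: [roles that answered no]}}.
    """
    summary = defaultdict(lambda: {q: [] for q in QUESTIONS})
    for role, responsibility, answers in responses:
        per_question = summary[responsibility]  # registers the responsibility
        for question in QUESTIONS:
            if not answers[question]:
                per_question[question].append(role)
    # Keep only the questions that at least one rater answered "no".
    return {
        responsibility: {q: roles for q, roles in per_question.items() if roles}
        for responsibility, per_question in summary.items()
    }


# Hypothetical responses about a single responsibility.
responses = [
    ("manager", "run the weekly release",
     {"aware": True, "capable": True, "able": True, "willing": False}),
    ("peer", "run the weekly release",
     {"aware": True, "capable": True, "able": False, "willing": True}),
    ("direct report", "run the weekly release",
     {"aware": True, "capable": True, "able": False, "willing": True}),
]

result = summarize(responses)
print(result)
# {'run the weekly release': {'able': ['peer', 'direct report'], 'willing': ['manager']}}
```

The manager reads a motivation problem. The two people closest to the work read a conditions problem. A Likert average would have blended those into a single middling number. Here the disagreement is the finding.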

The verdict: 360 reviews are unanchored. The framework is a delivery mechanism without a question. SHRM data from 2024 shows that 79% of workers would opt out of 360 reviews if they could and 74% see the results as unfair. Those numbers describe a framework whose users do not trust the output. Pair the 360 with ACAW’s four questions, and the multi-rater structure starts working. Without that pairing, the 360 measures who showed up to fill it out.

Annual Performance Reviews

Annual performance reviews are different from the other four frameworks in a structural way. They have no single inventor. They emerged.

The traceable threads run back to Robert Owen’s color-coded performance cubes at New Lanark in the early 1800s, U.S. military fitness reports (the Navy used 48 different formats between 1865 and 1956), and most importantly Walter Dill Scott’s WWI Army rating scales (1917-1918), which rated millions of soldiers on five-point trait scales and migrated into civilian industry through the 1920s and 1930s. Frederick Taylor’s scientific management provided the philosophical scaffolding (measure, rank, optimize) but not the mechanism. The decisive period is post-WWII consolidation in the 1940s and 1950s. Adoption among U.S. companies hit roughly 60% by the 1940s and 90% by the 1960s. The Performance Rating Act of 1950 locked annual reviews in for the U.S. federal government. Welch’s vitality curve at GE in the 1980s made the comp-linked, forced-distribution version dominant in the private sector.

The mechanics today are familiar. Self-assessment. Manager-written review. Calibration. Rating. Delivery conversation. Signed acknowledgment. Compensation linkage. The cycle runs annually, twice a year, or quarterly depending on the company. Many shops have added a “mid-year check-in” pattern as a relief valve.

Read through ACAW, annual reviews are a container, not a method. The container itself doesn’t enforce structure. The same review form gets filled with rigorous diagnosis at one company and recency-biased impressions at another. The 62% idiosyncratic rater effect that haunts 360 reviews shows up here too: without anchoring the questions to specific responsibilities, the manager defaults to writing about their own model of the person.

Compensation linkage changes the incentives of the evaluator themselves. The manager is no longer answering only whether the employee performed well. They’re also calculating budget impact. Saying “yes, willing” triggers a comp bump they may not have budget for, so the honest answer gets shaded toward what the budget can absorb. The framework changes the question being asked. ACAW would call this a fourth-question (Willing) corruption, except the corruption belongs to the manager, not the employee. The manager is being asked the wrong question by their own organization.

The high-profile defectors tell a useful story. Adobe moved to “Check-In” in 2012. Deloitte redesigned theirs in 2015 (documented in Reinventing Performance Management by Buckingham and Goodall). GE replaced annual reviews with PD@GE the same year. Accenture also dropped ratings in 2015 across its 330,000 employees. Microsoft killed stack ranking in 2013 but kept a modified review cycle, which is a useful reminder that “killing the annual review” usually means dropping the ratings, not the cycle. The 2014-2018 wave of “death of the annual review” articles reflected real institutional movement. Then the wave receded. Many companies that dropped ratings re-introduced them, because compensation decisions still needed inputs.

The Adobe move-away is worth a mention. Adobe published that their old annual review process consumed roughly 80,000 hours of manager time per year, equivalent to 40 full-time employees. That’s the hidden cost of a container without a method.
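
Those two figures are consistent with each other, which is easy to check. The sketch below is only arithmetic on the numbers quoted above. The 40-hour work week is my assumption, not something taken from Adobe.

```python
# Figures quoted in the paragraph above.
review_hours_per_year = 80_000
full_time_equivalents = 40

# Hours per full-time employee implied by the two figures.
hours_per_fte = review_hours_per_year / full_time_equivalents

# Assumption: a 40-hour work week.
weeks_per_fte = hours_per_fte / 40

print(hours_per_fte, weeks_per_fte)  # 2000.0 50.0
```

In other words, the review process was absorbing the full working year of forty people.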

The verdict: the annual review is the most flexible of the five frameworks and therefore the easiest to get wrong. With the four ACAW questions as the spine, an annual review can be a useful periodic checkpoint. Without the questions, it becomes a recency-biased writing exercise that costs a company tens of thousands of hours and produces evaluations correlated more with rater identity than ratee performance. The framework you have is probably an annual review. It can become a real evaluation by changing what gets asked inside it. That’s the cheapest fix in this entire article.

The Pattern

Five frameworks. Five distinct ACAW failure modes.

  • 9-Box Grid: conflates all four questions into “performance,” guesses at future ACAW with “potential,” skips Able.
  • Top Grading: treats answers as identity. A, B, and C are properties of the person. ACAW says they’re properties of the situation.
  • Stack Ranking: doesn’t ask the four questions. Substitutes a different question entirely (where do you sit relative to your peers?).
  • 360 Reviews: gathers multiple perspectives without anchoring them. Each rater silently answers a different question.
  • Annual Reviews: a container, not a method. Whether the questions get asked depends on the manager.

Read together, the pattern is hard to miss. Frameworks aren’t bad in themselves. They’re bad when the underlying questions aren’t being asked. The labels they produce (Star, A player, top 20%, “exceeds expectations”) have an authoritative feel because the framework had a process. The process isn’t the same thing as the diagnosis. The diagnosis is the four questions. The framework is whatever delivery mechanism you happen to have inherited.

What’s Changing Underneath

The five frameworks above share a quiet assumption. They all assume that performance is mostly visible. They assume there’s a stream of countable output (tickets closed, calls completed, points shipped, deals won) that an evaluator can read to infer how someone is doing. That assumption has been weakening for years. It is about to break.

AI is absorbing the measurable tier of knowledge work. The tasks that produced the clearest performance signal, the ones that fit neatly into a spreadsheet, are exactly the ones being automated first. What’s left for humans to do is the work that wasn’t legible to begin with: judgment under ambiguity, cross-function coordination, mentoring, the calm presence that keeps a team from fragmenting under pressure. None of these produce a tidy weekly number.

A framework that compresses performance into a single score will read this transition as a decline. The senior IC who used to ship 40 tickets a quarter and now ships 8 because they spend their week unblocking three other teams will plot lower on every grid in this article. The PM who replaced 20 hours of status reporting with a five-minute prompt and used the saved time to negotiate a cross-team launch will look less productive on paper. They are not less productive. They are producing the kind of value the framework was never built to see.

Read through ACAW, this isn’t a measurement problem. It’s a question problem. “How many of X did this person produce?” was always a proxy. AI is making the proxy useless faster than most evaluation systems can adapt. The four questions don’t depend on counting anything. Aware of the responsibility, capable of doing it, able under the current conditions, willing to do the work. Those answers hold whether the output is 40 tickets or three high-stakes decisions a quarter.

Every framework in this article will fail harder as this shift accelerates, unless the underlying questions get asked. ACAW isn’t a replacement. It’s the way to keep your evaluation honest when the metrics stop meaning what they used to.

Closing

If you came to this article hoping I’d recommend a framework to replace the one you’ve inherited, you have the wrong author. The framework isn’t the problem. The questions you’re not asking are.

Most evaluation frameworks can be repaired by anchoring them to ACAW. A 9-Box that names which question failed for each cell becomes useful. A 360 that asks four specific yes-or-no questions per responsibility becomes useful. An annual review with the four questions as the spine becomes a real evaluation. Stack ranking and Top Grading are harder to repair, because their core moves (rank against peers, sort into A/B/C identities) actively contradict the four questions. For those two, the most honest move is to use them only when the underlying ACAW evaluation has already been done, and never as the evaluation itself.

The next time you sit down to use whichever framework you have, ask the four questions for each major responsibility on the list. Aware. Capable. Able. Willing. Yes or no on each. The framework gives you a label. The four questions tell you whether the label is real.

Sources

Organized by section. URLs verified at time of publication.

9-Box Grid

  • Designing a Bias-Free Organization — Iris Bohnet, Harvard Business Review, 2016. The structural argument that bias must be designed out of evaluation processes.
  • 9-Box Model — Gartner HR Glossary. Source of the “only ~15% of high performers are also high-potential” finding (via the former Corporate Leadership Council research).
  • 9 Box Grid: How To Use It for Talent Reviews — AIHR. Source of the Gallup 64%/9% effectiveness gap.
  • Vitality Curve — Wikipedia. Reference for distinguishing the 9-box from Welch’s 20-70-10 forced ranking.

Top Grading

Stack Ranking

  • Microsoft’s Lost Decade — Kurt Eichenwald, Vanity Fair, August 2012. The foundational journalistic indictment.
  • Don’t Rate Your Employees on a Curve — Harvard Business Review, November 2013. Statistical and managerial argument against forced curves.
  • Why GE had to kill its annual performance reviews — Max Nisen, Quartz, August 2015. Documents GE’s abandonment of stack ranking under Immelt and the move to PD@GE.
  • Vitality Curve — Wikipedia. Reference timeline including the Ford ($10.5M, 2001), Goodyear (2002), Conoco, and Microsoft lawsuits.
  • Management Teams: Why They Succeed or Fail — Meredith Belbin, Butterworth-Heinemann, 1981 (3rd ed. 2010). The foundational team-roles work; argues effective teams require a mix of nine roles, not a stack of all-stars.
  • Teams Solve Problems Faster When They’re More Cognitively Diverse — Alison Reynolds and David Lewis, Harvard Business Review, March 2017. Empirical study showing cognitive diversity predicts speed of complex problem-solving.

360 Reviews

Annual Performance Reviews

Glossary

  • ACAW: Aware, Capable, Able, Willing. The four-question evaluation framework introduced in Article 1.
  • CCL: Center for Creative Leadership. Greensboro-based research and education institution that built the academic foundation of modern 360-degree feedback.
  • CIDS: Chronological In-Depth Structured Interview. The 2-to-4-hour interview at the heart of Top Grading.
  • GE: General Electric.
  • HBR: Harvard Business Review.
  • HBS: Harvard Business School.
  • HR: Human Resources.
  • JAP: Journal of Applied Psychology. Source of the foundational Scullen/Mount/Goff (2000) finding on rater idiosyncrasy in 360 ratings.
  • LT: Leadership team.
  • PD@GE: Performance Development at GE. The continuous-feedback application that replaced GE’s annual reviews and stack ranking in 2015.
  • PIP: Performance Improvement Plan. The formal document used to track an underperforming employee, often the off-ramp before termination.
  • SHRM: Society for Human Resource Management. The largest U.S. HR professional association; publishes employee-perception research.
  • TORC: Threat of Reference Check. The Top Grading practice that requires candidates to personally arrange reference calls with all former bosses.
  • WWI: World War I.