By Whitney Coggeshall
In my last article, I argued that skills are messy, contextual, and far harder to define and measure than most people realize. I made the case that while skills-based learning and assessment holds great promise, the reality is that defining what counts as a skill, designing ways to teach it, and proving someone actually has it are all enormously complex challenges.
That article was meant to shine a light on the messiness and the opportunity, and I will admit it left you with more questions than answers. This article takes the next step by providing a few frameworks that can get us closer to answers. If skills are as complicated as I argued, how do we move forward? How do we design and evaluate skills in a way that people can trust?
In my view, the best way forward is to rely on the frameworks that learning design and measurement already offer us. First, we need a design framework that makes our reasoning explicit. Second, we need a way to test and monitor whether that framework holds up in practice.
To understand what I mean, hang on for a minute while I walk through a (hopefully) useful analogy. Imagine, for a moment, that we are building a bridge.
Now imagine trying to build a bridge without a blueprint, throwing it together piece by piece much as my five-year-old does with his building blocks. When you inspect that bridge, two outcomes are possible. Maybe you get lucky and, despite the lack of planning, the bridge holds. More likely, it fails the first inspection entirely. A bridge built without a plan is much more likely to collapse than succeed. A blueprint does not guarantee success, but it dramatically increases the likelihood that the bridge is constructed in a way that can withstand real conditions and provide lasting benefits to society.
Another option is to build a bridge using a detailed blueprint and then put the bridge through stress tests. Again, two outcomes are possible. The tests could succeed, which is the happy path. Or, despite the planning, the bridge could still reveal weaknesses. A well-thought-out blueprint has a high likelihood of success, but inspections may still find problems. Even then, the consequences are usually far less catastrophic. A bridge built with a logical plan is unlikely to collapse completely. Instead, the issues that appear are more likely to be smaller, more manageable flaws, such as a crack in the pavement or a railing that needs reinforcement. These are problems you can fix after construction, strengthening the bridge without starting from scratch.
Similarly, when designing a performance task, simulation, or other authentic assessment, it is possible to get it right without a clear design framework, but to me that feels risky. A stronger approach is to maximize the probability of success by starting with a clear framework and then running the right kinds of tests to see whether the framework worked as intended. Fortunately, we do not need to invent these approaches from scratch. We can draw on existing frameworks in learning design and measurement that give us both the blueprint and the inspections we need to build trust.
The blueprint: evidence-centered design
If the blueprint is the design framework in our bridge analogy, then in assessment design that role is played by Evidence-Centered Design (ECD). ECD is a structured way of making sure we are explicit about what we are trying to measure, what will count as evidence of it, and what kinds of tasks will draw out that evidence.
At its core, ECD forces us to connect three things: the claims we want to make about a skill, the evidence that would support those claims, and the tasks that will give learners the chance to produce that evidence. To use an example from my first article, a task can be designed to determine the extent to which someone can synthesize information in the finance field. The evidence might be that they can take data from multiple sources, such as earnings reports, market news, and macroeconomic indicators, and integrate it into a coherent investment recommendation. The task could present the learner with a mix of raw financial statements, analyst reports, and current events, and ask them to produce a short memo advising whether to buy, sell, or hold a stock.
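To make that connection concrete, here is a minimal sketch, in Python, of how an ECD blueprint for that finance task might be captured as a structured specification. The class and field names are my own illustrative choices, not part of any standard ECD tooling; the point is simply that the claim, the evidence, and the task are written down explicitly and linked to one another.

```python
# A minimal sketch of an ECD blueprint captured as a structured specification.
# Class and field names are illustrative assumptions, not standard ECD tooling.
from dataclasses import dataclass, field


@dataclass
class ECDBlueprint:
    claim: str                        # what we want to say about the learner
    evidence: list[str]               # observable behaviors that support the claim
    task: str                         # the activity that elicits that evidence
    scoring_notes: list[str] = field(default_factory=list)


finance_synthesis = ECDBlueprint(
    claim="Can synthesize information from multiple financial sources",
    evidence=[
        "Integrates earnings reports, market news, and macroeconomic indicators",
        "Produces a coherent, well-supported investment recommendation",
    ],
    task=(
        "Given raw financial statements, analyst reports, and current events, "
        "write a short memo advising whether to buy, sell, or hold a stock"
    ),
    scoring_notes=[
        "Credit the reasoning that links sources to the recommendation, "
        "not just the final buy/sell/hold call",
    ],
)

print(finance_synthesis.claim)
```

Writing the design down in this form, whatever the format, is what makes the reasoning inspectable later: anyone can trace a score back to the evidence it was supposed to reflect and the claim it was supposed to support.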
This kind of design discipline is especially important in assessing skills, because these assessments are messy by nature, as I described in my previous article. Simulations, role-plays, and projects often generate highly variable performances where two people might succeed in very different ways, or demonstrate strengths in one context but not another. Without a clear framework, it is easy to mistake surface-level behaviors for deeper skill, or to let scoring drift into subjectivity. ECD helps prevent this by making assumptions explicit up front. It forces us to define which behaviors actually count as success, why they matter, and how they will be observed and scored.
In practice, there are a few ways to implement ECD. One option is to draw from the research literature, which can provide a useful foundation. But often the skills we are trying to measure, and the very specific contexts in which they are applied, do not fully generalize. Another option, and often the most efficient and practical, is to involve subject matter experts directly in the design process. This not only ensures the assessment reflects authentic tasks people encounter in the real world, but also builds stakeholder buy-in when practitioners see their expertise shaping the outcomes. For skills-based assessments to be credible, we need industry practitioners to help identify the tasks people often encounter in the field and the behaviors that actually indicate success. In a simulation for IT troubleshooting, for example, experts might agree that “success” requires a learner to systematically test hypotheses and document steps clearly, rather than just guessing commands until something works. ECD provides the structure to make those expectations explicit and tie them to scoring.
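As a rough illustration of what tying expert expectations to scoring can look like, here is a hedged sketch that encodes the expert-agreed behaviors from the IT troubleshooting example as a simple weighted rubric. The behaviors, weights, and observation flags are assumptions made up for this example, not a published rubric.

```python
# A minimal sketch of tying expert-defined behaviors to scoring, using the IT
# troubleshooting example. Behaviors, weights, and flags are illustrative only.
EXPERT_RUBRIC = {
    "tests hypotheses systematically": 0.5,
    "documents steps clearly": 0.3,
    "avoids trial-and-error guessing": 0.2,
}


def score_performance(observed: dict[str, bool]) -> float:
    """Sum the weights of the expert-defined behaviors that were observed."""
    return sum(
        weight
        for behavior, weight in EXPERT_RUBRIC.items()
        if observed.get(behavior, False)
    )


# One learner's simulation log, reduced to which behaviors were observed.
observed = {
    "tests hypotheses systematically": True,
    "documents steps clearly": True,
    "avoids trial-and-error guessing": False,
}
print(f"Rubric score: {score_performance(observed):.1f} out of 1.0")
```

Even a toy rubric like this makes the design choices visible: what counts, how much it counts, and what a given score actually means.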
This is what makes the approach scalable. Whether the setting is a classroom exam, a workplace certification, or a professional licensure task, ECD keeps the reasoning transparent. It ensures that success on an assessment is not just about memorizing facts or gaming the system, but about demonstrating the kinds of performances that experts and research agree represent real skill.
Like a blueprint for a bridge, ECD does not guarantee the structure will hold once it is in use. But it dramatically increases the odds by making the logic transparent and by surfacing the assumptions we will later need to test.
Inspection and maintenance: validity arguments
If ECD gives us the blueprint, then the equivalent of inspection and maintenance is what measurement experts call a validity argument. A blueprint might look elegant on paper, but the only way to know if it holds up is to build it and inspect it under different conditions and continue to maintain it over time. In the same way, a validity argument is the ongoing process of checking whether the assumptions in our assessment design actually work in practice.
Building that argument requires different kinds of evidence, which fall into five areas:
- Content. Do the tasks actually reflect the domain we care about? Much of this is built into the blueprint itself. ECD already forces us to define claims, evidence, and tasks in collaboration with domain experts. Validity evidence here is about confirming that those design choices still make sense in practice — for example, whether subject matter experts continue to agree that the scenarios are realistic and the scoring rubrics capture authentic success.
- Response process. Are learners engaging with the task in the way we intended? Here, qualitative evidence is critical. We might observe participants, conduct interviews, or run think-aloud studies to check whether people are approaching the simulation or role-play as designed. If they are not, that could signal a problem with the task design, the learning design, or even the user experience that makes the intended path confusing or unintuitive.
- Internal structure. Do the scores behave consistently? This is where quantitative data comes in. Key questions include whether raters agree with one another, whether rubric categories are working as expected, and whether scores generalize across different tasks. For example, in a role-play, if one rater interprets “effective teamwork” very differently from another, the scoring guidelines need refining. Or if a rubric category like “clarity of communication” is so broad that nearly every participant gets the same score, it may not be distinguishing skill levels effectively. A brief sketch of what checks like these can look like follows this list.
- Relations to other variables. Do scores connect with other indicators of performance? This could mean showing that high scorers on a simulation also perform better on the job, or that scores correlate with other trusted measures of the same skill. Without this kind of evidence, we cannot be confident that the performance in the assessment really transfers to the real world.
- Consequences. Are the results being used fairly and yielding the desired outcomes? This means checking whether the assessment disadvantages certain groups or creates unintended side effects. It also means ensuring scoring is culturally sensitive, since behaviors that signal communication or leadership in one culture may not look the same in another.
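To show what a couple of these checks can look like in practice, here is a minimal sketch using made-up data: it computes rater agreement with a weighted kappa (internal structure) and a correlation with a hypothetical on-the-job rating (relations to other variables). The scores and the external criterion are purely illustrative, and real studies would use far larger samples and more than a single statistic.

```python
# A minimal sketch of two quantitative validity checks, using fabricated data:
# rater agreement (internal structure) and correlation with an external
# measure (relations to other variables).
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical rubric scores (0-3) from two raters on the same ten role-plays.
rater_a = np.array([3, 2, 2, 1, 3, 0, 2, 1, 3, 2])
rater_b = np.array([3, 2, 1, 1, 3, 0, 2, 2, 3, 2])

# Weighted kappa gives partial credit for near-misses between raters.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Inter-rater agreement (quadratic weighted kappa): {kappa:.2f}")

# Hypothetical external criterion, e.g., a supervisor's on-the-job rating.
on_the_job = np.array([3.5, 2.8, 2.0, 1.5, 3.9, 0.8, 2.4, 1.9, 3.6, 2.7])
avg_score = (rater_a + rater_b) / 2

r = np.corrcoef(avg_score, on_the_job)[0, 1]
print(f"Correlation with external criterion: {r:.2f}")
```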
Together, these different checks form the inspection and maintenance we need for confidence. Even with a strong blueprint, we need to inspect whether the design is working as intended, whether scores behave consistently, and whether the outcomes are fair and meaningful. Validity arguments are not a one-time hurdle but an ongoing process. They help us identify flaws before they become failures, strengthen weak points, and maintain the trust of everyone who relies on the assessment.
Added inspections with AI
New technologies like AI make it possible to design assessments that feel more authentic, more scalable, and more responsive than ever before. AI can score essays in seconds, generate dynamic simulations, and provide instant feedback in ways that human raters alone could not. These innovations promise richer tasks and faster results, but they also introduce new complexities that must be inspected carefully.
When AI is part of the system, the validity argument needs to go further. In addition to the typical checks, we need evidence in areas such as:
- Scoring alignment. Do AI-generated scores match expert human judgment? Studies should compare AI outputs with ratings from trained experts to confirm that the AI is recognizing the same features of performance; a small sketch of this kind of comparison appears after this list.
- Bias and fairness. Is the AI scoring consistently across different demographic, cultural, or linguistic groups? Algorithms trained on historical data can reproduce or even amplify inequities, so bias audits are essential.
- Transparency. Can learners, educators, and employers understand why the AI gave a particular score? Black-box systems risk undermining trust if scoring decisions cannot be explained in human terms.
- Robustness. Does the AI continue to perform reliably when tasks, contexts, or populations change? A model trained on one kind of simulation data may not generalize well to another, so testing across contexts is critical.
- Ongoing monitoring. AI systems are not static; their behavior can drift as the data they encounter changes. Continuous monitoring is needed to ensure that performance remains accurate and fair over time.
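As with the earlier checks, a small sketch can make two of these inspections concrete: comparing AI scores against human expert scores, and looking for systematic scoring gaps across groups. The scores and group labels below are fabricated for illustration, and a real bias audit would go much further than a single gap statistic.

```python
# A minimal sketch of two added inspections for AI scoring, using made-up data:
# agreement with human experts, and a rough check for group-level score gaps.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([2, 3, 1, 2, 0, 3, 2, 1, 3, 2, 1, 2])
ai = np.array([2, 3, 1, 3, 0, 3, 2, 1, 2, 2, 1, 2])
group = np.array(["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"])

# Scoring alignment: does the AI broadly agree with trained human raters?
alignment = cohen_kappa_score(human, ai, weights="quadratic")
print(f"AI vs. human agreement (quadratic weighted kappa): {alignment:.2f}")

# Bias check (very rough): is the AI systematically harsher or more lenient
# for one group than another, relative to the human scores?
for g in np.unique(group):
    gap = (ai[group == g] - human[group == g]).mean()
    print(f"Mean AI-minus-human gap for group {g}: {gap:+.2f}")
```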
AI does not change the fundamental need for blueprints and inspection, but it raises the stakes by introducing new joints, moving parts, and vulnerabilities that require closer attention. Done carefully, AI can expand what is possible in skills-based learning and assessment. Done carelessly, it can undermine the very trust we are trying to build.
Assessment then learning
Throughout this article I’ve focused more on skills assessment than on teaching skills. That may feel counterintuitive, since we usually think about teaching a skill first and assessing it second. But with something as complex as skills, starting with assessment makes sense. If we cannot agree on how to measure a skill, we cannot design effective ways to teach it.
This idea is not new. In curriculum design it is often called Backward Design, where you begin by defining what success looks like and then plan instruction to help learners get there. Skills-based learning benefits from the same logic. Focusing on assessment first forces us to wrestle with the messy questions of what counts as evidence of skill, how we will capture it, and how we will know it when we see it.
Once we have a clear framework for measuring skills, we can align learning strategies to move the needle in meaningful ways. In other words, measurement gives us the target, and teaching becomes the path toward hitting it. We need to know what kind of bridge we’re building and how we’ll test it before we worry about guiding people across.
That is why beginning with skills assessment matters. It is not the whole story, but it is the foundation. Without clarity on how to assess skills, skills-based learning risks becoming another buzzword. With it, skills-based learning can become something learners, educators, and employers get behind.