February 23, 2026

AI adoption metrics are lying to your organisation

Many organisations are making AI usage visible before they can demonstrate real value. This article examines two cases to show why “adoption” is not one thing, and why evidence of use, value, and organisational capability must be treated differently.

I’ve been reflecting on a recurring pattern in conversations about AI adoption: many organisations are making usage visible long before they can demonstrate real value. And the more I look at it, the more I think this isn't accidental; it's structural.

The issue isn't that adoption is "fake"; it's more subtle than that. We’ve started using the word "adoption" as a catch-all, bundling together things that are fundamentally different: usage metrics, task-level wins, system-level value, and organisational learning. We label them as if they were the same thing, but they aren't.

And this isn’t just what appears in formal case studies; it’s what practitioners describe in real operating environments. Under immense pressure to show momentum, dashboards track activity simply because activity is easier to measure. Meanwhile, the harder work of redesigning workflows, clarifying accountability, and building capability is often delayed or sidelined. The bottlenecks aren't just technical; they are deeply organisational.

And that is precisely where the problem gets misframed. The question is no longer whether to adopt AI: licences can be bought, pilots can be launched, and usage can be encouraged or even enforced. The more uncomfortable, and more important, question is this: what kind of logic is being woven into the organisation, and what exactly are we accepting as evidence that it’s working?

It matters because it's becoming far easier to manufacture the appearance of progress than to deliver demonstrable value. Furthermore, it matters because organisations tend to conflate three very different things: actual learning, its documentation, and its institutionalisation. An AI programme can produce the first, and occasionally the second, while still failing entirely at the third.

This pattern is not limited to one firm or one department. Across public and private settings in the US, UK, Australia, and Germany, what changes is the adoption logic: some institutions optimise for visible mobilisation, others for bounded evaluation and legitimacy, and a smaller set try to sequence adoption as capability-building. The structural challenge, however, remains the same.

Two cases, one shared pressure

I'm writing this because I came across two cases I found especially revealing: one in the private sector and one in the public sector. Both make the same pressure visible through different mechanisms.

Accenture is a useful case in point, having reportedly linked leadership progression and promotion discussions to the regular use of internal AI tools, with adoption tracked through signals such as login activity. What makes this case so telling is how explicit the organisational message becomes: visible AI usage is treated as evidence of alignment. That may accelerate uptake, but it also exposes a familiar tension. When a signal is easy to count, it can begin to overshadow the value it is supposed to indicate.

The UK government's Department for Business and Trade (DBT) ran a Microsoft 365 Copilot pilot and formal evaluation to understand benefits, risks, and limits before making broader deployment decisions. The evaluation covered October 2024 to March 2025; 1,000 licences were allocated for a three-month pilot running from October to December 2024, and the evaluation used mixed methods, including usage data, diaries, interviews, and observed tasks. For public sector leaders navigating similar decisions, where accountability, legitimacy, and governance constraints shape every deployment choice, this case is particularly instructive: not as a template to copy, but as a window into what careful institutional experimentation actually looks like.

These are both AI adoption stories, certainly, but they are not the same kind of story. One is performance-linked adoption, where visible uptake is accelerated through incentives and career relevance. The other is evaluation-led adoption, where legitimacy and bounded evidence are built through careful experimentation. They use different mechanisms, yet they both run into the same wall when asked to justify value beyond what appears in a dashboard.

Four layers of evidence and why “adoption” is not one thing

Before examining each case in depth, it helps to separate AI adoption into four distinct layers of evidence. This isn't a universal taxonomy, but a practical one, shaped by observing how adoption unfolds across different organisational contexts. It makes the failure modes much easier to spot, as the sketch after the list illustrates.

  1. Evidence of use: At the most visible level, the question is simple: are people using the tool at all, and if so, how often, and in what contexts?
  2. Evidence of task-level benefit: The next layer moves from activity to immediate utility: does the tool measurably improve specific tasks, through time saved or better drafting, summarisation, or preparation?
  3. Evidence of system-level value/ROI: This is where the bar gets higher. Do task-level gains translate into measurable improvements at workflow, team, departmental, or organisational level?
  4. Evidence of organisational learning/capability retention: Beyond usage and performance, there is a longer-term question: is the organisation converting usage and experiments into durable capability, standards, playbooks, governance, role clarity, and better future decisions?
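
To make the separation concrete, here is a minimal sketch in Python of how a dashboard might tag each metric with the evidence layer it actually supports. The metric names are illustrative assumptions of mine, not figures from either case; the point is simply that the strongest claim a dashboard can back is capped by the deepest layer it measures.

```python
from enum import IntEnum

class EvidenceLayer(IntEnum):
    USE = 1            # logins, prompts, active users
    TASK_BENEFIT = 2   # time saved or quality gains on specific tasks
    SYSTEM_VALUE = 3   # workflow-, team-, or department-level improvement
    CAPABILITY = 4     # playbooks, standards, governance, role clarity

# Illustrative dashboard: metric name -> the layer it actually evidences.
dashboard = {
    "weekly_active_users": EvidenceLayer.USE,
    "prompts_per_user": EvidenceLayer.USE,
    "minutes_saved_per_draft": EvidenceLayer.TASK_BENEFIT,
    "workflow_cycle_time_change": EvidenceLayer.SYSTEM_VALUE,
    "playbooks_adopted_as_standard": EvidenceLayer.CAPABILITY,
}

def strongest_claim(metrics: dict) -> EvidenceLayer:
    """The strongest claim a dashboard supports is capped by the
    deepest evidence layer it actually measures."""
    return max(metrics.values())

usage_only = {k: v for k, v in dashboard.items() if v is EvidenceLayer.USE}
print(strongest_claim(usage_only).name)  # USE: adoption, not value
print(strongest_claim(dashboard).name)   # CAPABILITY
```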

The first layer is the easiest to generate and, unsurprisingly, the easiest to misread. This is exactly where Goodhart’s Law applies. When a measure becomes a target, it begins to lose its reliability as a measure. Once usage signals such as logins, prompts, or “regular use” are treated as performance targets, organisations can raise evidence of use while weakening its link to real value. A dashboard can improve while the signal itself becomes less trustworthy.
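
One way to see the Goodhart dynamic concretely is a toy simulation. This is my own illustration, not data from either case: when usage itself is the target, the usage line climbs regardless of whether task benefit moves with it, so the metric improves while its link to value decays.

```python
import random

random.seed(42)

def simulate(weeks: int = 12, usage_targeted: bool = True):
    """Toy model of one team's weekly usage metric and task benefit.
    When usage is a performance target, usage rises independently of
    benefit; when it isn't, usage grows only where the tool helps."""
    usage, benefit = 10.0, 10.0
    for week in range(1, weeks + 1):
        if usage_targeted:
            usage += random.uniform(2, 5)      # people log in to be seen
            benefit += random.uniform(-1, 1)   # value drifts, untouched
        else:
            gain = random.uniform(0, 2)        # use grows where it helps
            usage += gain
            benefit += gain * random.uniform(0.5, 1.0)
        yield week, usage, benefit

for week, usage, benefit in simulate(usage_targeted=True):
    print(f"week {week:2d}: usage={usage:6.1f}  benefit={benefit:6.1f}")
```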

This is why the question “are people using it?” ultimately stops being the most useful one. Better ones are: what are they using it for, what is it actually improving, and what capability is being built and retained in the process.

Case 1: Private-sector adoption under performance pressure (Accenture)

Accenture is useful here not simply because it is “using AI”, but because it makes a particular adoption logic visible. This isn't a task-level productivity experiment; it is an organisational mechanism that links visible AI usage to performance and career relevance. In that sense, it targets a real adoption bottleneck: not model capability, but rather organisational uptake, by making AI usage legible within the performance system.

The mechanism is relatively clear. Visible incentives tend to increase behavioural compliance, so if AI usage becomes relevant to progression, usage is more likely to rise, especially among senior staff and decision-makers whose behaviour shapes wider norms. From an execution standpoint, this can be coherent when the immediate priority is speed of alignment. But that coherence depends on what the organisation is actually trying to optimise; while it solves for visible mobilisation, it is not, by itself, evidence of value creation.

However, what the mechanism does not guarantee is the chain beyond the login. Ultimately, for it to produce real value, rather than just visible adoption, tools would need to demonstrably improve relevant tasks, workflows would need to be redesigned to absorb them, and managers would need to distinguish meaningful use from performative compliance. Without those conditions, the risk is Goodhart's Law in practice: the metric improves while its relationship to the underlying objective weakens. A team logs in more without improving decision quality, a manager absorbs more verification work invisibly, a dashboard shows momentum while the operational case remains ambiguous.

Beyond these practicalities, there is also a professional identity dimension that is easy to underestimate, particularly in senior roles. Resistance is not always resistance to technology; it can be resistance to degraded judgement, unclear responsibility, or pressure to signal tool use before the organisation has defined how AI should be used without lowering the standard of work. If an AI-assisted output fails, responsibility does not sit with the login metric; it sits with the person who signed off on the work, and ultimately with the firm.

These dynamics may carry consequences that adoption metrics are not designed to capture: reputational exposure at the individual level, erosion of professional confidence in roles where judgement is the core deliverable, and in some cases a broader cultural tension between the pressure to adopt visibly and the absence of clear standards for what trustworthy use actually looks like. Whether those consequences materialise will depend heavily on how the organisation manages the transition, but they represent a class of risk that sits outside the reach of the dashboard and deserves to be named.

In that sense, the adoption mechanism may accelerate visible tool use before the organisation has stabilised the conditions required for trustworthy use. That does not make the approach irrational, but it does make its inherent trade-offs visible.

Case 2: Public-sector adoption under evaluation pressure (UK Government/DBT)

The DBT Copilot evaluation makes a different adoption logic visible. Rather than linking usage to progression or performance signals, the intervention is structured as a bounded institutional experiment designed to assess risks, benefits, and deployment conditions before wider rollout. In that sense, the case is not primarily about forcing uptake; it is about producing decision-relevant evidence under public-sector constraints.

That distinction matters particularly for government and public institutions, where the cost of moving too fast is not just operational; it is political and reputational. The DBT approach reflects a logic many public sector leaders will recognise: you cannot defend a broad rollout without a defensible evidence base, and a well-designed pilot is how you build one. Politically, this also shifts the conversation from an abstract debate about AI to a discussion grounded in what the evidence actually shows, what counts as sufficient proof, and under what conditions scaling is justified.

The evaluation's value lies in what it reports honestly: task-level effects in some areas, uneven performance across tasks, and real accuracy concerns. It also records an important limitation: although users reported time savings in some activities, the evaluation did not find robust evidence that these translated into measurable productivity improvement at the overall departmental level. These results should not be framed as a failure of the pilot. Rather, they serve as a reminder that task-level benefit does not automatically become system-level value, and that a useful pilot is not a technology demonstration but a decision instrument.

These dynamics carry their own class of institutional risk, one that sits outside the evaluation framework itself. A rigorous and honest report creates external exposure: findings that are mixed by design can easily be read as failure by those who were never invested in the process. The more transparent the evaluation, the more visible the gap between what was learned and what can be claimed as transformed. And in public sector contexts, documented learning faces a structural vulnerability that private sector pilots rarely encounter: it must survive budget cycles, political transitions, and organisational restructuring to become durable capability. Whether the DBT findings were later embedded in workflow redesign and governance, or whether they remained largely contained within a well-produced document, is a question the public record does not answer. That uncertainty is not a criticism of the pilot; rather, it is a reminder that producing decision-relevant evidence and institutionalising it are two distinct acts, and only the first is guaranteed by good evaluation design.

It is here that the DBT case diverges most sharply from Accenture, though not in the way one might expect. The private-sector case risks treating momentum as proof; the public-sector case risks treating documentation as transformation. A well-produced evaluation report is not the same as institutionalised learning. The pilot provides the evidence to evaluate, but it does not, by its nature, redesign the system of work. That is the boundary condition both approaches share, even when the mechanisms look nothing alike.

What both cases reveal when the question changes

What makes these two cases useful together is that they surface the same pressure through different institutional logics. In both settings, it's relatively easy to point to signs of progress: people are logging in, tasks are being tested, users report satisfaction, some activities take less time. What is much harder is to demonstrate that these signals lead to system-level value, durable capability, and better institutional decisions over time. The conditions needed for trustworthy value creation don't always line up with the incentives that drive visible adoption.

The trade-off is not simply "fast adoption" versus "careful adoption." It is a question of which tension each organisation is choosing to resolve first, and which one it is quietly accepting as a residual risk. Both choices can be coherent, but neither removes the need to answer the same later question: what value is being created, for whom, under what conditions, and at what hidden cost.

Which decision routes actually open up?

If both cases point to the same structural challenge, then the practical question isn't whether to adopt AI, but rather which route an organisation is choosing and what it’s willing to trade off to get there. Three distinct routes tend to emerge. They aren't mutually exclusive, but they represent very different decision logics, each with its own failure mode.

The first route prioritises visible uptake. It optimises for speed of alignment and the organisational signal that something is changing. The risk is Goodhart's Law in practice: logins rise, but task quality and net time savings don't follow. The warning signs usually appear early: rising usage metrics, weaker evidence of better decisions, and a growing verification burden pushed onto individual staff. This route is coherent when the immediate priority is mobilisation. It is not, by itself, evidence of value creation, and should not be treated as such.

The second route prioritises bounded evaluation. It optimises for legitimacy and the quality of the deployment decision, building a credible basis for judging where tools help and where they fall short. This is closer to the public-sector logic. The common failure here is pilot theatre: rich reports, no clear scaling threshold, and task-level insights that never translate into workflow change. It only works if the pilot is designed from the start as a decision instrument with a clear threshold for scaling or stopping.

The third route treats adoption as staged capability-building. It deliberately sequences value creation before broad mobilisation, selecting workflows with real friction, building accountability into use from the start, and only then scaling. It optimises for operational value and sustainable adoption rather than visible momentum. The cost is that it demands significant managerial discipline and protection of design time, and the risk is fragmentation: small wins appearing in different teams without ever adding up to a shared capability. The evidence points in a consistent direction: organisations that approach adoption as capability-building, redesigning workflows rather than layering AI onto existing processes, are the ones beginning to show enterprise-level impact, though they remain a minority.

A fourth route is emerging, primarily in public-sector contexts where governments have both the regulatory capacity and the political mandate to pursue a governance-first approach to AI adoption. Rather than treating governance as a guardrail added after deployment, or as a lesson learned from a pilot, this route establishes institutional governance infrastructure as the precondition for any adoption at scale. Australia’s National AI Plan offers one of the clearest examples, with Chief AI Officers mandated across agencies to support adoption, capability, and governance, a dedicated AI Safety Institute to define testing and documentation standards, and an explicit message to organisations that regulators will ask not just whether AI is being used, but how it is being governed. Germany's positioning under the EU AI Act follows a similar logic: regulatory compliance is not the endpoint but the starting condition.

In the private sector, this route appears only partially, and mainly in industries where governance infrastructure already existed before AI arrived, such as banking, healthcare, and defence, where the question was never "should we govern this?" but "how do we extend existing frameworks to cover it?". The benefit is that it builds legitimacy and public trust at scale before reputational or operational failures force the issue. The cost is a slower pace: governance-first adoption takes time, needs sustained political will, and can leave behind frameworks that remain in place even after the tools have changed. The risk is a different kind of Goodhart problem, where compliance with governance requirements becomes the signal of responsible adoption rather than evidence that the governance is actually shaping how AI is used in practice. This route is coherent when the priority is long-term institutional trust and the organisation or government has the mandate and capacity to build infrastructure before use. It is not a realistic starting point for most private sector organisations, and it should not be mistaken for one.

None of these routes is inherently the right one. The more useful question is which route fits the organisation's current priorities and risk appetite, and whether leadership is being honest about what it can really prove at that stage.

The ultimate trap is claiming system-level value when you’ve only generated evidence of use, or claiming institutional capability when you’ve only produced a few reports. That is precisely where a weak adoption narrative turns into a costly and hard-to-reverse mistake.

What this analysis actually enables

If this analysis is to be genuinely useful, it needs to do more than sharpen the diagnosis. It should help an organisation make a clearer decision about how it wants to adopt AI, what evidence is enough at each stage, and what it can no longer claim to have proven.

In practice, support and resistance will shift depending on what is being prioritised. Senior leadership and transformation teams tend to favour routes that generate quick, visible movement, because speed signals alignment. Digital, data, and innovation functions tend to push for bounded evaluation and learning, whereas operations, quality, and risk teams tend to resist anything that moves faster than the evidence. That resistance is not an obstacle to innovation but a defence of quality, accountability, and genuine capability, and it deserves to be understood as a strategic signal rather than a problem to be managed away. Different roles are managing different risks across different time horizons, and recognising that distinction changes how the conversation inside the organisation needs to be structured.

For that reason, this is less a one-off choice and more a governance loop, one that needs revisiting as the evidence evolves. A route may make sense at the beginning, when the priority is simply to get people moving, but it can stop making sense if usage goes up while the quality of decisions remains unclear. Useful review cycles tend to follow three horizons: early signals at two to four weeks, workflow evidence at eight to twelve weeks, and scaling or stop decisions quarterly.

What should trigger an unplanned review is usually visible before it becomes critical. Such indicators include: rising usage without task-level benefit, task wins that never improve the workflow, growing rework or escalations, performative use, quality incidents, or a governance cost that is quietly exceeding the benefit being claimed. The point is not to keep debating, but rather to have the maturity to pause and reassess when the evidence no longer supports what the organisation is claiming.
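
For teams that want to make those triggers operational, a sketch like the following may help. The threshold values are assumptions for illustration, not recommendations; each organisation would calibrate its own.

```python
from dataclasses import dataclass

@dataclass
class AdoptionSignals:
    usage_trend: float         # week-on-week change in active use, %
    task_benefit_trend: float  # week-on-week change in measured task wins, %
    rework_rate: float         # share of AI-assisted outputs reworked, %
    quality_incidents: int     # incidents attributed to AI-assisted work
    governance_cost: float     # monthly cost of verification and governance
    claimed_benefit: float     # monthly benefit currently being claimed

def unplanned_review_triggers(s: AdoptionSignals) -> list:
    """Return the triggers that fired; an empty list means stay on the
    planned 2-4 week / 8-12 week / quarterly review cadence."""
    triggers = []
    if s.usage_trend > 5 and s.task_benefit_trend <= 0:
        triggers.append("rising usage without task-level benefit")
    if s.rework_rate > 15:
        triggers.append("growing rework or escalations")
    if s.quality_incidents > 0:
        triggers.append("quality incident in AI-assisted work")
    if s.governance_cost > s.claimed_benefit:
        triggers.append("governance cost exceeding claimed benefit")
    return triggers
```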

If no route is chosen and the organisation remains in deliberate ambiguity, the first tension to become unsustainable is the gap between adoption expectations and evidence of value. Pressure to "do AI" continues, yet without a clear criterion to distinguish use from benefit from systemic value, the outcome is fragmented deployments, inconsistent metrics, and optimistic narratives sitting alongside results that are hard to compare. That pattern tends to harden into organisational fatigue within two to four quarters, or sooner if a quality or reputational incident occurs in high-stakes work.

In many organisations, the gap between what looks like progress and what is actually improving the work can be bridged for a while with optimism and momentum narratives, until eventually the tension hardens. Either staff are pushed to perform adoption signals without stable conditions for trustworthy use, or leadership is left defending investment decisions with evidence too weak for the claims being made. In both cases, the pressure shifts from innovation to credibility.

That is why the status quo is never neutral. Even choosing not to decide is still a decision about which risks are allowed to accumulate, who is expected to carry them, and what kind of evidence will be accepted by default.

The evidence that actually matters

The most useful thing an organisation can do before committing to a route is gather a small set of evidence honestly. A workflow baseline (time, quality, rework, errors, escalations, approximate cost) captured before any deployment. Task-level comparison with real samples, not only self-reported perception. Traceability of impact: whether task-level wins translate into workflow improvement or remain local savings. The full cost of adoption, including licences, training, verification, support, governance, and rework. And signals of sustainability: persistent useful usage beyond the initial peak, stable quality, and clear accountability.
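
As a way of holding that evidence set to account, the list can be expressed as a simple pre-commitment checklist. The field names below are my own framing of the items above, not a standard instrument.

```python
from dataclasses import dataclass, fields

@dataclass
class RouteEvidence:
    workflow_baseline: bool       # time, quality, rework, errors, cost, pre-deployment
    task_level_comparison: bool   # real samples, not only self-reported perception
    impact_traceability: bool     # task wins traced to workflow improvement
    full_cost_of_adoption: bool   # licences, training, verification, governance, rework
    sustainability_signals: bool  # useful usage past the peak, stable quality, accountability

def missing_evidence(e: RouteEvidence) -> list:
    """List what still has to be gathered before committing to a route."""
    return [f.name for f in fields(e) if not getattr(e, f.name)]

# Example: an organisation with usage data but no baseline or costing.
print(missing_evidence(RouteEvidence(False, True, False, False, True)))
```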

Without that evidence, organisations risk deciding by hype, political pressure, or metrics that have already been Goodharted into unreliability.

This analysis does not produce a correct answer. It produces something more useful: an explicit decision about adoption logic, a way to revisit that decision, and an early warning about what the status quo is quietly costing. Without that, an organisation doesn't just move slowly; it moves without learning what it actually needed to learn.


If you are navigating AI adoption decisions in a private or public-sector context and want to assess which route makes sense for your organisation, get in touch here.

Gianina advises organisations on strategy, governance, and complex change through her advisory practice. She works across private and public-sector contexts, helping leaders navigate uncertainty, make better decisions, align organisations around shared priorities, and build long-term capability.
