Terence Tao’s post on blue and red teams crystallised several insights about software testing, and about a different class of AI product I have been building, that I hadn’t found the language to articulate until now.

He begins by describing the roles of blue and red teams: blue teams are builders who construct and defend order against chaos, while red teams are hunters and invaders who find the weakest link in a coherent whole and exploit it. In Tao’s words:

In the field of cybersecurity, a distinction is made between the “blue team” task of building a secure system, and the “red team” task of locating vulnerabilities in such systems. The blue team is more obviously necessary to create the desired product; but the red team is just as essential, given the damage that can result from deploying insecure systems.

The nature of these teams mirror each other; mathematicians would call them “dual”. The output of a blue team is only as strong as its weakest link: a security system that consists of a strong component and a weak component (e.g., a house with a securely locked door, but an open window) will be insecure (and in fact worse, because the strong component may convey a false sense of security).

His observation about the human dynamics of red teams is particularly insightful:

Dually, the contributions to a red team can often be additive: a red team report that contains both a serious vulnerability and a more trivial one is more useful than a report that only contains the serious issue, as it is valuable to have the blue team address both vulnerabilities.

Two unexpected insights about QA and testers emerged:

1) Red teams compound faster. Once a vulnerability is found, subsequent exploit attempts can build on it. This is quite different from blue teams, where each new feature or component is a fresh start with a clear boundary from neighboring components (see the toy sketch after this list).

2) Unconventional thinkers are better suited for red team roles.
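
A toy way to make the duality concrete (my own sketch, not Tao’s): model the blue team’s system as the minimum of its component strengths, and the red team’s report as the sum of its findings.

```python
# Toy illustration (mine, not Tao's): a defended system is only as strong
# as its weakest component, while red-team findings simply add up.
components = {"locked door": 9, "alarm system": 8, "open window": 1}
findings = {"SQL injection": 9, "verbose error page": 2}

blue_strength = min(components.values())  # weakest link: the open window -> 1
red_value = sum(findings.values())        # every finding counts -> 11

print(f"blue-team strength: {blue_strength}")
print(f"red-team report value: {red_value}")
```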

Today, in most software organizations, testers are treated as second-class citizens and are not given the same respect as developers. Research shows testers typically earn 25-33% less than software engineers with comparable experience. Worse, and all too commonly, testers are brought in as an afterthought to clean up after development is largely complete.

As AI’s code-generation capabilities advance, testers may become far more critical than they are today. Finding inconsistencies and ambiguities in the software would then be high-leverage work that protects the system’s integrity, creating far more business value than merely catching isolated bugs that developers might overlook.

This also requires a shift in how we recruit and hire testers. Instead of manual laborers content with repetitive tasks, we need people with exploratory and inquisitive mindsets.
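
One place this mindset already shows up is property-based testing, where the tester states invariants and lets the tool hunt for counterexamples. Here is a minimal sketch using Python’s hypothesis library; the slugify helper and its properties are hypothetical, chosen only to illustrate the style of thinking.

```python
# Exploratory, red-team-style testing: state invariants and let the tool
# search for inputs that break them. Uses the `hypothesis` library; the
# `slugify` helper below is a hypothetical example, not from a real codebase.
import re
from hypothesis import given, strategies as st


def slugify(title: str) -> str:
    """Lowercase, replace runs of non-alphanumerics with '-', trim dashes."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


@given(st.text())
def test_slug_contains_only_safe_characters(title):
    # Invariant: whatever the input, the slug is URL-safe.
    assert re.fullmatch(r"[a-z0-9-]*", slugify(title))


@given(st.text())
def test_slugify_is_idempotent(title):
    # Invariant: slugifying an already-slugified string changes nothing.
    once = slugify(title)
    assert slugify(once) == once
```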

Tao then applies this framework to AI products—an insightful perspective coming from a mathematician rather than a software engineer:

Many of the proposed use cases for AI tools try to place such tools in the “blue team” category, such as creating code, text, images, or mathematical arguments in some semi-automated or automated fashion, that is intended for use for some external application. However, in view of the unreliability and opacity of such tools, it may be better to put them to work on the “red team”, critiquing the output of blue team human experts but not directly replacing that output; “blue team” AI use should only be permitted up to the capability of one’s “red team” to catch and correct any errors generated. This approach not only plays to current AI strengths, such as breadth of exposure and fast feedback, but also mitigates the risks of deploying unverified AI output in high-stakes settings.

In my own personal experiments with AI, for instance, I have found it to be useful for providing additional feedback on some proposed text, argument, code, or slides that I have generated (including this current text). I might only agree with a fraction of the suggestions generated by the AI tool; but I find that there are still several useful comments made that I do agree with, and incorporate into my own output. This is a significantly less glamorous or intuitive use case for AI than the more commonly promoted “blue team” one of directly automating one’s own output, but one that I find adds much more reliable value.

This suggests a new category of AI products focused on coaching and feedback rather than direct output generation.
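
As a rough sketch of what the core loop of such a product might look like, here is a minimal example assuming the OpenAI Python SDK; the model name and prompt are placeholders, and the point is simply that the tool is only ever asked to critique, never to rewrite.

```python
# Minimal "red team AI" loop: the model critiques a human-written draft
# but is never asked to produce a replacement. Illustrative sketch only,
# assuming the OpenAI Python SDK; model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()

CRITIC_PROMPT = """You are a red-team reviewer. Do NOT rewrite the draft.
List concrete weaknesses: unstated assumptions, edge cases, ambiguities,
and counterarguments. Rank them from most to least serious."""


def red_team_review(draft: str, model: str = "gpt-4o-mini") -> str:
    """Return a ranked list of critiques for a human-authored draft."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": CRITIC_PROMPT},
            {"role": "user", "content": draft},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(red_team_review(open("draft.md").read()))
```

The human stays on the blue team, deciding which critiques to accept, which is exactly the division of labor Tao describes.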

alexdong/high-taste is my small experiment in this direction—using AI to develop coding judgment rather than generate code. Now imagine red team AI across every domain: tools that critique your arguments, challenge your assumptions, stress-test your strategies. Not to replace expertise, but to forge it.

AI’s capabilities remain frustratingly jagged, brilliant at some tasks and unreliable at others. But perhaps that’s exactly why red team AI works: it sidesteps AI’s weaknesses while amplifying what it does well. The companies building critique tools today may well discover a more constructive path through the AI landscape than those chasing perfect generation.