The decision moment
The AI is 94% sure
a $2M lamp is about to fail.
The engineer isn't.
3 sensors correlated
I designed the system that makes this decision possible.
The Problem
Sarah's Monday morning
Sarah is a field service engineer (FSE) with 7 years of experience. Her shift starts at 8 AM. She opens her inbox to 47 alerts that fired overnight. She doesn't know which are critical, which are noise, or which five are the same root cause repeating. She spends 45 minutes manually triaging before she even walks to a tool.
Meanwhile, production waits. That gap between having data and knowing what to do with it was what I set out to solve.
Discovery
Going forensic
12 interviews across four roles, a week embedded in the fab, 468,000 alarm records analyzed, and a fragmented tool landscape audited.
Three users, one platform
V1 Foundation
Building the data layer first
V1 was a unified fleet monitoring platform with health-gradient tiles, a Query Builder replacing a 4-6 hour data team cycle with self-serve access, and drill-down from Tool to Error. Shipped to four enterprise customers.
V1 solved data access. But it didn't solve the harder problem.
The Pivot
V1 shipped. It wasn't enough.
The results were real, but GlobalFoundries Dresden was still seeing 210 interrupts per week. Engineers had visibility. They were still drowning. The problem wasn't data access. It was cognitive overload.
The design question: How do you bring AI into a workflow where a wrong answer costs millions in damaged silicon without undermining the engineer's expertise? Our PM wanted full automation. I pushed back: 8 of 12 interviewees explicitly rejected autonomous decision-making. They wanted an advisor, not an autopilot. Built advisor prototypes. Customers validated. Leadership invested.
The System
Three agents, nine screens, three role views
I led the end-to-end design: research, interaction design, prototyping, and validation. The AI/ML engineers built the models; I defined what they should optimize for and how their outputs surface to users.
Three agents, intentionally separated so each agent's reasoning is independently traceable. The design principle: the AI's authority scales with its confidence, and human overrides strengthen the model rather than override it.
The full product: 9 screens, 3 workflows
Role-based views from one dropdown
The same platform serves all three personas through a role dropdown in the context bar. Same data, different information architecture:
Service Requests: 8D lifecycle
Service Requests follow the 8D methodology (the manufacturing standard for root-cause problem solving). The agent pre-populates the problem statement from the Diagnose screen. D7 (Prevent) triggers the Learning Agent's fleet-wide update.
D1 Team → D2 Problem → D3 Contain → D4 Root Cause → D5 Corrective → D6 Validate → D7 Prevent → D8 Review
Configuration: AI governance controls
What should be configurable versus fixed is a design position. I designed the Configuration screen to give operators control over agent behavior without requiring engineering changes:
Design decision: the 65% threshold. Below it, no hypothesis shown (State 3). Above it, diagnosis with uncertainty markers. Configurable per subsystem because a lamp failure and a wafer-in-chamber situation have different risk profiles.
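A per-subsystem threshold like this can be sketched in a few lines. This is a minimal illustration, not the shipped configuration: the class name, the subsystem keys, and the 0.80 override are all hypothetical, and treating a confidence exactly at the cutoff as "shown" is an assumption.

```python
from dataclasses import dataclass, field

DEFAULT_THRESHOLD = 0.65  # the 65% cutoff described above


@dataclass
class ThresholdConfig:
    """Per-subsystem confidence cutoffs (subsystem names are hypothetical)."""
    default: float = DEFAULT_THRESHOLD
    overrides: dict[str, float] = field(default_factory=dict)

    def threshold_for(self, subsystem: str) -> float:
        # Fall back to the default when no override is configured.
        return self.overrides.get(subsystem, self.default)

    def shows_hypothesis(self, subsystem: str, confidence: float) -> bool:
        # Below the cutoff the UI withholds the hypothesis (State 3).
        # Assumption: a confidence equal to the cutoff counts as shown.
        return confidence >= self.threshold_for(subsystem)


# Hypothetical override: a stricter cutoff for a higher-risk subsystem.
config = ThresholdConfig(overrides={"wafer_handling": 0.80})
print(config.shows_hypothesis("lamp", 0.70))            # True
print(config.shows_hypothesis("wafer_handling", 0.70))  # False
```

The same 70% confidence produces a visible hypothesis for one subsystem and a withheld one for another, which is the point of making the threshold configurable rather than fixed.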
Alert Intelligence
From 400+ alarms to 3 priorities
The Monitor Agent uses ISA-18.2 temporal correlation (the alarm management standard) to compress raw signals:
The FSE at shift start needs one thing: "what happened overnight, what's urgent, where do I start?" The Monitor Agent answers that in 10 seconds with a structured briefing and 24-hour timeline.
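To make the compression idea concrete, here is a minimal sketch of time-window grouping. It is not the Monitor Agent's actual algorithm and not a rule taken from ISA-18.2: the `Alarm` fields, the 60-second window, and the per-tool grouping are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    tool: str
    code: str
    timestamp: float  # seconds since an arbitrary reference point


def group_alarms(alarms: list[Alarm], window_s: float = 60.0) -> list[list[Alarm]]:
    """Group alarms on the same tool that fire within window_s of the previous one.

    The first alarm in each group is the first-out candidate.
    """
    groups: list[list[Alarm]] = []
    open_group: dict[str, list[Alarm]] = {}  # most recent group per tool
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        current = open_group.get(alarm.tool)
        if current and alarm.timestamp - current[-1].timestamp <= window_s:
            current.append(alarm)
        else:
            current = [alarm]
            groups.append(current)
            open_group[alarm.tool] = current
    return groups


# Hypothetical overnight alarms: four raw signals collapse into three groups.
alarms = [
    Alarm("T1", "LAMP_TEMP", 0),
    Alarm("T1", "LAMP_CURRENT", 20),
    Alarm("T2", "DOOR", 30),
    Alarm("T1", "LAMP_TEMP", 500),
]
groups = group_alarms(alarms)
print(len(groups))        # 3
print(groups[0][0].code)  # LAMP_TEMP
```

A real correlation layer would also need to reason across tools and subsystems; the sketch only shows why repeated signals from one root cause can be presented as a single item with a first-out alarm.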
Designing for Uncertainty
Five confidence states
From Toshiba repair logs, every incident follows: occurrence → response → repair start → repair complete → return to normal. I mapped these stages to five confidence states, each requiring a fundamentally different UI.
Shows hypothesis + evidence cascade
"Accept" isn't the default. The FSE must scroll through the evidence cascade first. Acceptance is informed, not automatic.
Design time: 2 days
Shows NO hypothesis. Prevents anchoring bias.
If the agent showed a 30% guess, the engineer would anchor to it. By showing nothing, the engineer approaches fresh. A wrong diagnosis means replacing the wrong part while the actual failure continues damaging wafers.
Design time: 2 weeks
Diagnosis blocked: missing data channels
Two of four data channels are unavailable. "Accept" is disabled. The FSE sees exactly which channels need restoration.
Agent was wrong: structured correction feeds Learning Agent
Not "disagree" or "provide feedback." Structured fields: 15 root cause options, resolution, repair time, parts, and 7 categories for why the agent was wrong. Data the Learning Agent can act on.
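The state logic described above can be sketched as a small selection function. This is an illustrative reading, not the production rules: the `MEDIUM` name for State 2, the 0.90 high-confidence cutoff, and the `min_active` channel requirement are assumptions; only the 65% threshold and the behavior of States 1, 3, 4, and 5 come from the text.

```python
from enum import Enum


class UiState(Enum):
    HIGH = 1        # hypothesis plus evidence cascade
    MEDIUM = 2      # assumed: diagnosis with uncertainty markers
    LOW = 3         # no hypothesis shown, to prevent anchoring
    BLOCKED = 4     # data channels missing, Accept disabled
    CORRECTION = 5  # the FSE has overridden the agent


def select_state(confidence: float, active_channels: int, overridden: bool = False,
                 show_threshold: float = 0.65, high_threshold: float = 0.90,
                 min_active: int = 3) -> UiState:
    """Pick the UI state; cutoffs other than 0.65 are hypothetical."""
    if overridden:
        return UiState.CORRECTION
    if active_channels < min_active:
        # Missing channels block the diagnosis regardless of confidence.
        return UiState.BLOCKED
    if confidence < show_threshold:
        return UiState.LOW
    if confidence >= high_threshold:
        return UiState.HIGH
    return UiState.MEDIUM


print(select_state(0.94, active_channels=4))  # UiState.HIGH
print(select_state(0.30, active_channels=4))  # UiState.LOW
print(select_state(0.94, active_channels=2))  # UiState.BLOCKED
```

Checking the blocked condition before the confidence score reflects the design intent: a 94% score built on two of four channels should not reach the Accept button.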
1. You're viewing State 1 · High Confidence (94%). Click Accept Diagnosis to see the Plan of Action.
2. Click State 3 · Low and notice that no hypothesis is shown (prevents anchoring bias).
3. Click State 4 · Blocked to see the channel stoplights (Active/Stale/Disconnected).
4. Click State 5 · Correction to see the structured override form (15 root causes, 7 "why wrong" categories).
5. Click Run Autotest (right panel) to watch the 12-point diagnostic run live.
The Learning Loop
One correction improves 47 tools
In every KB system I studied, corrections are unstructured feedback. The system doesn't learn. I designed override as input.
Loop stages: diagnosis → structured correction → updated fleet-wide → over time
Real example: From patterns across overrides, the Learning Agent adjusted the lamp threshold from 4,000h to 3,800h across all 47 tools. One FSE's correction improved preventive maintenance for the entire fleet.
Guardrails: Three safeguards prevent bad corrections from cascading fleet-wide: concordance thresholds, configurable staging windows, and contradiction detection.
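One way to picture how the three guardrails could combine is a single gate that a correction pattern must pass before it propagates. This is a sketch under stated assumptions, not the Learning Agent's implementation: the `Correction` fields, the minimum of 3 agreeing corrections, the 7-day staging window, and the 25% dissent limit are all hypothetical values.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Correction:
    tool: str
    root_cause: str     # one of the structured root-cause options
    submitted_day: int  # day index, used for the staging window


def ready_to_propagate(corrections: list[Correction], today: int,
                       min_concordant: int = 3, staging_days: int = 7,
                       max_dissent: float = 0.25) -> Optional[str]:
    """Return the root cause to roll out fleet-wide, or None to keep holding."""
    if not corrections:
        return None
    counts = Counter(c.root_cause for c in corrections)
    top_cause, top_count = counts.most_common(1)[0]
    # Concordance threshold: enough corrections must agree.
    if top_count < min_concordant:
        return None
    # Contradiction detection: too many corrections point elsewhere.
    if (len(corrections) - top_count) / len(corrections) > max_dissent:
        return None
    # Staging window: hold until the newest agreeing correction has aged.
    newest = max(c.submitted_day for c in corrections if c.root_cause == top_cause)
    if today - newest < staging_days:
        return None
    return top_cause


# Hypothetical corrections from three tools agreeing on one cause.
agreeing = [Correction(f"T{i}", "lamp_degradation", day) for i, day in enumerate([1, 2, 3])]
print(ready_to_propagate(agreeing, today=5))   # None (still in staging)
print(ready_to_propagate(agreeing, today=10))  # lamp_degradation
```

The gate only works because corrections are structured: a free-text comment could not be counted, compared, or checked for contradiction.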
Query Builder
Query Builder V2: natural language meets structured editing
In V1, engineers manually constructed boolean queries across four data channels (Autotest, FDC, Health Index, Metrology). In V2, the engineer types a natural-language question. The agent translates it into structured, editable field chips, each parameter individually adjustable. A "View SQL" toggle shows the raw query. One sentence replaces four manual conditions.
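The chip-to-SQL step can be illustrated with a short sketch. It covers only the rendering of already-parsed chips into a parameterized query, not the natural-language translation itself, and the `Chip` shape, the `readings` table, and the column names are hypothetical.

```python
from dataclasses import dataclass

ALLOWED_OPS = {"=", "<", ">", "<=", ">="}


@dataclass
class Chip:
    field: str    # column name (hypothetical in this sketch)
    op: str       # comparison operator, restricted to ALLOWED_OPS
    value: object


def chips_to_sql(chips: list[Chip], table: str = "readings") -> tuple[str, list]:
    """Render editable chips as a parameterized SQL string plus its parameters."""
    clauses, params = [], []
    for chip in chips:
        if chip.op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {chip.op}")
        if not chip.field.isidentifier():
            raise ValueError(f"invalid field name: {chip.field}")
        # Values travel as parameters, never interpolated into the SQL text.
        clauses.append(f"{chip.field} {chip.op} ?")
        params.append(chip.value)
    where = " AND ".join(clauses) if clauses else "1=1"
    return f"SELECT * FROM {table} WHERE {where}", params


chips = [Chip("lamp_hours", ">", 3800), Chip("health_index", "<", 0.7)]
sql, params = chips_to_sql(chips)
print(sql)     # SELECT * FROM readings WHERE lamp_hours > ? AND health_index < ?
print(params)  # [3800, 0.7]
```

Because each chip maps to exactly one clause, editing a chip changes one predicate in the "View SQL" output, which keeps the generated query auditable.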
1. Click Run Query to watch the agent reason through your question step by step.
2. Click View SQL to see the raw query the agent generated.
3. Expand a row (click › on L-B09) to drill into readings and the sparkline chart.
4. Switch to the Chart tab for the SPC trace, bar comparison, and configurable chart playground.
5. Toggle the V1: Boolean Builder tab to see what the same query looked like before AI.
Results
That $1.2M cost of inaction: this is the response.
| Metric | Before | | After |
|---|---|---|---|
| Defect resolution | 6 months | → | 2-3 days |
| Triage time | 45 minutes | → | Under 2 minutes |
| GF Dresden interrupts | 210 / week | → | 50 / week |
| Efficiency | Baseline | → | 30% improvement |
| Agent accuracy | N/A | → | 90%+ top-1 precision |
| Alarm fatigue | 85% ignore rate | → | Eliminated |
| Pre-sales impact | | → | 25% conversion · 4 customers |
Tested with 8 FSEs and 2 PMs: 80% positive. Key refinement: override path streamlined to be accessible from any state.
Sarah's Monday morning now starts with 3 priorities instead of 400+ alarms. She resolves two before walking to the fab floor.
Methodology: Top-1 precision against 200+ resolved SRs. We tracked precision over recall because a withheld diagnosis (State 3) is a designed outcome, not a failure.
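The difference between precision and recall here comes down to how withheld diagnoses are counted. A minimal sketch, assuming a withheld diagnosis is represented as `None` and using made-up labels rather than real service-request data:

```python
from typing import Optional


def top1_precision(predictions: list[Optional[str]], truths: list[str]) -> Optional[float]:
    """Share of answered cases where the top hypothesis matched the resolved cause.

    Withheld diagnoses (None) are excluded from the denominator, so declining
    to answer is not scored as an error.
    """
    answered = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    if not answered:
        return None  # nothing was answered, so precision is undefined
    correct = sum(1 for p, t in answered if p == t)
    return correct / len(answered)


# Hypothetical labels: one of four cases is withheld (State 3).
predictions = ["lamp", None, "pump", "lamp"]
truths = ["lamp", "valve", "pump", "valve"]
print(top1_precision(predictions, truths))  # 0.6666666666666666
```

Under a recall-style metric the withheld case would count against the agent, which would reward guessing at low confidence, the opposite of what State 3 is designed to do.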
Failure Modes
What happens when the AI is wrong
Designing for failure shaped more of this product than designing for success. Each failure mode was stress-tested during shadow deployment before any recommendation surfaced to FSEs.
Diagnose Agent: confidently wrong at 94%
The evidence cascade shows first-out alarm, downstream signals, and match percentages. The confidence score is context, not a command. Override is always accessible.
Cross-agent failure: cascading errors
The three-agent separation makes this traceable. Each agent logs independently; the Apps Engineer can audit the full chain.
Two additional failure modes (Monitor suppression, Learning propagation) were stress-tested with corresponding detection metrics.
Reflections
What I'd do differently
Suppressed alarm transparency
Alarms go from 400+ to 47, but the FSE has no visibility into what was filtered. I'd add a "353 alarms rationalized" view. Transparency about what the AI removed is as critical as what it shows.
Scalability beyond 217 tools
At 5,000+ tools across 12 fabs, the flat tile grid breaks down. I'd move to a fab, zone, bay hierarchy with aggregated health scores.
Accessibility in a fab environment
Designed and validated for cleanroom constraints: WCAG AA contrast throughout, color-blind safe encoding (text labels and directional arrows alongside color, never color alone), 44px touch targets for gloved interaction, ARIA semantics validated with the accessibility team, and monospace signal names sized for arm's length readability.
Design principles
Trust through transparency
Five states acknowledge the agent isn't always right. Override gives the FSE authority. The agent recommends, never commands.
Override as input, not feedback
Structured corrections enable retraining. A comment field gives text. Structured fields give data the Learning Agent can act on.
AI features feel native
Agent cards use identical styling to every other card. No glowing borders. The AI is a tool, not a feature demo.
The edge cases are the design
State 3 prevents anchoring. State 4 prevents premature commitment. State 5 captures knowledge. The happy path is obvious; the edge cases are where decisions matter.