Deployment & Operations
Real-time inference, alert systems, maintenance scheduling, and ROI
Deploying PdM in Production
Moving from a Jupyter notebook to a production PdM system involves significant engineering:
- Data pipeline: Real-time ingestion of FDC data from 1,000+ tools, cleaning, feature computation, and storage. Must handle missing data, sensor failures, and recipe changes.
- Model serving: Low-latency inference after each process run (seconds, not minutes). Models must handle multi-chamber, multi-recipe scenarios.
- Alert management: Converting model scores into actionable alerts. Too many false alarms cause alert fatigue, and engineers start ignoring them. Too few alerts mean missed failures.
- Integration with MES: Alerts flow into the Manufacturing Execution System for maintenance scheduling and wafer routing decisions.
- Model monitoring: Track model performance over time. Equipment changes (new PMs, recipe updates) can invalidate models — requiring retraining or adaptation.
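As an illustration of the "must handle missing data, sensor failures" requirement in the data pipeline, the sketch below computes summary features for one sensor trace and refuses to produce features when too many samples are missing. The function name `run_features`, the feature set, and the 20% cutoff are assumptions made for this example, not part of any specific FDC system.

```python
import math
from statistics import mean, pstdev


def run_features(trace, max_missing_frac=0.2):
    """Summarise one sensor trace from a single process run.

    `trace` is a list of floats in which None or NaN marks a dropped sample.
    Returns None when the trace is empty or too much of it is missing, so the
    caller can skip scoring instead of feeding the model unreliable features.
    """
    if not trace:
        return None

    # Keep only samples that are present and numeric.
    valid = [x for x in trace if x is not None and not math.isnan(x)]
    missing_frac = 1 - len(valid) / len(trace)

    # Hypothetical cutoff: treat the run as a sensor failure above 20% missing.
    if not valid or missing_frac > max_missing_frac:
        return None

    return {
        "mean": mean(valid),
        "std": pstdev(valid),
        "min": min(valid),
        "max": max(valid),
        "missing_frac": missing_frac,
    }
```

Returning `None` rather than imputing values keeps the "no prediction" case explicit; a real pipeline might instead raise a data-quality alert at this point.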
Key Concept: ROI of PdM
A successful PdM system typically delivers 5–15% reduction in unplanned downtime and 10–20% reduction in maintenance costs. For a large fab, this translates to $10–50M annual savings. The ROI is compelling, but achieving it requires strong data infrastructure and close collaboration between data scientists and equipment engineers.
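To make the arithmetic behind these ranges concrete, here is a rough back-of-the-envelope sketch. The baseline cost figures are purely hypothetical placeholders chosen for illustration; only the percentage ranges come from the text above.

```python
def annual_savings(downtime_cost, maintenance_cost, downtime_cut, maintenance_cut):
    """Annual savings = avoided downtime cost + avoided maintenance cost."""
    return downtime_cost * downtime_cut + maintenance_cost * maintenance_cut


# Hypothetical baseline, for illustration only (not figures from any real fab).
baseline_downtime_cost = 100e6    # USD per year lost to unplanned downtime
baseline_maintenance_cost = 80e6  # USD per year spent on maintenance

# Low end: 5% less downtime, 10% lower maintenance cost.
low = annual_savings(baseline_downtime_cost, baseline_maintenance_cost, 0.05, 0.10)
# High end: 15% less downtime, 20% lower maintenance cost.
high = annual_savings(baseline_downtime_cost, baseline_maintenance_cost, 0.15, 0.20)

print(f"Estimated savings: ${low / 1e6:.0f}M to ${high / 1e6:.0f}M per year")
```

With these assumed baselines the estimate comes out at roughly $13M to $31M per year, inside the $10–50M range quoted above. Substituting a fab's own downtime and maintenance costs gives a first-order estimate.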
Alert Thresholds, Drift, and the MES Feedback Loop
A PdM model is only as good as the decisions it triggers. Three operational pieces dictate whether the savings actually land.
1. Setting alert thresholds
A common pattern is a tiered alert scheme, typically a Yellow / Orange / Red triage:
| Tier | Trigger | Action |
|---|---|---|
| Yellow | Anomaly score > μ + 3σ on recent window | Engineer notified, no production stop |
| Orange | Predicted RUL < 24 h with >80% confidence | Schedule PM in next available slot |
| Red | Predicted RUL < 4 h or hard sensor limit breached | Tool placed in "PM hold" by MES |
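The table above can be expressed as a small decision function. This is a minimal sketch: the names `Tier` and `classify_alert` and the argument layout are assumptions for illustration, and the thresholds are copied directly from the table.

```python
from enum import Enum


class Tier(Enum):
    NONE = 0
    YELLOW = 1
    ORANGE = 2
    RED = 3


def classify_alert(score, window_mean, window_std, rul_hours, rul_confidence,
                   hard_limit_breached=False):
    """Map model outputs to an alert tier, checking the most severe tier first.

    score          -- anomaly score for the latest run
    window_mean/std -- mean and standard deviation of scores on the recent window
    rul_hours      -- predicted remaining useful life, in hours
    rul_confidence -- model confidence in the RUL prediction, in [0, 1]
    """
    # Red: RUL under 4 h, or a hard sensor limit was breached.
    if hard_limit_breached or rul_hours < 4:
        return Tier.RED
    # Orange: RUL under 24 h with more than 80% confidence.
    if rul_hours < 24 and rul_confidence > 0.80:
        return Tier.ORANGE
    # Yellow: anomaly score above mean + 3 sigma on the recent window.
    if score > window_mean + 3 * window_std:
        return Tier.YELLOW
    return Tier.NONE
```

Checking Red before Orange before Yellow ensures a run that satisfies several triggers is always reported at its highest severity.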
2. Model drift
Equipment evolves: new PMs, new chambers, recipe edits, target swaps. A model trained six months ago can quietly become useless. Monitor drift continuously:
from scipy import stats

def feature_drift(train_dist, recent_dist, alpha=0.01):
    """Two-sample KS test: return (drifted, ks_stat); drifted is True when p < alpha."""
    ks_stat, p_value = stats.ks_2samp(train_dist, recent_dist)
    return p_value < alpha, ks_stat

# Feature-drift retraining trigger.
# feature_history and schedule_retrain are assumed to be defined elsewhere.
drifted, score = feature_drift(
    train_dist=feature_history["chamber_pressure_mean"][:30_000],
    recent_dist=feature_history["chamber_pressure_mean"][-2_000:],
)
if drifted:
    schedule_retrain(model_id="etch_chamber_rul", reason=f"KS={score:.3f}")
3. The MES loop
The output of the PdM system is not a CSV — it is a structured event posted to the Manufacturing Execution System (MES):
- A predicted failure event creates a maintenance ticket in the CMMS
- The scheduler reserves the tool for PM at a low-WIP window
- Wafer routing is rebalanced to peer chambers
- Once PM is closed, the model receives a labelled failure or no-failure event for future retraining
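A structured event of this kind might look like the sketch below. The field names, the `PredictedFailureEvent` class, and the JSON encoding are all hypothetical; a real MES or CMMS integration defines its own schema and transport.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class PredictedFailureEvent:
    """Hypothetical event schema; real MES interfaces will differ."""
    tool_id: str
    chamber_id: str
    model_id: str
    tier: str                    # e.g. "YELLOW", "ORANGE", "RED"
    predicted_rul_hours: float
    confidence: float
    created_at: str              # ISO 8601 timestamp, UTC


def build_event(tool_id, chamber_id, model_id, tier, rul_hours, confidence, now=None):
    """Assemble an event, stamping it with the current UTC time by default."""
    now = now or datetime.now(timezone.utc)
    return PredictedFailureEvent(
        tool_id=tool_id,
        chamber_id=chamber_id,
        model_id=model_id,
        tier=tier,
        predicted_rul_hours=rul_hours,
        confidence=confidence,
        created_at=now.isoformat(),
    )


def to_payload(event):
    """Serialise the event to a JSON string ready to post downstream."""
    return json.dumps(asdict(event), sort_keys=True)
```

Carrying `model_id` and `created_at` on every event makes it possible to join the eventual PM outcome back to the exact prediction that triggered it.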
Key Concept: Closed-Loop Feedback
The single biggest lever for PdM accuracy isn't a fancier model — it's a clean closed-loop label pipeline. Every PM ticket should carry the actual failure mode (or "no-fault-found") back to the training set within hours, not weeks. Without that loop, model accuracy decays steadily.
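One way to picture the label pipeline is a function that turns each closed PM ticket into a training label and flags labels that arrive late. The ticket fields used here (`event_id`, `failure_mode`, `closed_at`, `labelled_at`) and the 24-hour latency budget are assumptions for this sketch, not a real CMMS schema.

```python
from datetime import datetime, timedelta


def label_from_ticket(ticket, max_latency=timedelta(hours=24)):
    """Convert a closed PM ticket (a dict) into a labelled training example.

    A failure mode of "no-fault-found" becomes a negative label; any other
    failure mode becomes a positive label. `late` marks labels that took
    longer than `max_latency` to reach the training set after PM closure.
    """
    failure_mode = ticket["failure_mode"]
    latency = ticket["labelled_at"] - ticket["closed_at"]
    return {
        "event_id": ticket["event_id"],
        "failed": failure_mode != "no-fault-found",
        "failure_mode": failure_mode,
        "label_latency_hours": latency.total_seconds() / 3600,
        "late": latency > max_latency,
    }
```

Tracking the share of `late` labels over time gives a simple health metric for the feedback loop itself, separate from model accuracy.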
Knowledge Check
What is the biggest operational challenge in deploying PdM?