Predictive Maintenance

Deployment & Operations

Real-time inference, alert systems, maintenance scheduling, and ROI

Deploying PdM in Production

Moving from a Jupyter notebook to a production PdM system involves significant engineering:

  • Data pipeline: Real-time ingestion of FDC data from 1,000+ tools, cleaning, feature computation, and storage. Must handle missing data, sensor failures, and recipe changes.
  • Model serving: Low-latency inference after each process run (seconds, not minutes). Models must handle multi-chamber, multi-recipe scenarios.
  • Alert management: Converting model scores into actionable alerts. Too many false alarms cause alert fatigue (engineers start ignoring alerts); too few alerts mean missed failures.
  • Integration with MES: Alerts flow into the Manufacturing Execution System for maintenance scheduling and wafer routing decisions.
  • Model monitoring: Track model performance over time. Equipment changes (new PMs, recipe updates) can invalidate models — requiring retraining or adaptation.
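
The alert-management trade-off in the list above can be made concrete by sweeping the alert threshold over historical scores. The following is a minimal sketch, assuming each past run has a model score and a label recording whether a real failure followed; the function name `alert_tradeoff` and the sample history are hypothetical, not part of any PdM library.

```python
def alert_tradeoff(scores, failed, threshold):
    """Count alert outcomes at a given score threshold.

    scores: model anomaly scores, one per run (higher = more suspicious)
    failed: booleans, True if the run was followed by a real failure
    """
    pairs = list(zip(scores, failed))
    caught = sum(1 for s, f in pairs if s >= threshold and f)
    false_alarms = sum(1 for s, f in pairs if s >= threshold and not f)
    missed = sum(1 for s, f in pairs if s < threshold and f)
    alerts = caught + false_alarms
    failures = caught + missed
    return {
        "false_alarms": false_alarms,
        "missed": missed,
        # None when the ratio is undefined (no alerts or no failures)
        "precision": caught / alerts if alerts else None,
        "recall": caught / failures if failures else None,
    }

# Hypothetical history, for illustration only
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.95]
failed = [False, False, False, True, False, True, True, True]

for t in (0.3, 0.5, 0.9):
    print(t, alert_tradeoff(scores, failed, t))
```

A low threshold catches every failure in this toy history at the price of false alarms, while a high threshold is quiet but misses most failures. In practice the operating point would be chosen with equipment engineers, weighing the cost of a missed failure against the cost of an unnecessary intervention.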

Key Concept: ROI of PdM

A successful PdM system typically delivers a 5–15% reduction in unplanned downtime and a 10–20% reduction in maintenance costs. For a large fab, this can translate to $10–50M in annual savings. The ROI is compelling, but achieving it requires strong data infrastructure and close collaboration between data scientists and equipment engineers.
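
The arithmetic behind such an estimate is simple enough to sketch. The example below is an illustration under stated assumptions: the two baseline cost figures are hypothetical placeholders (not data from any fab), and only the percentage ranges come from the text above.

```python
def annual_pdm_savings(downtime_cost, maintenance_cost,
                       downtime_reduction, maintenance_reduction):
    """Estimated gross yearly savings, in the same currency as the inputs."""
    return (downtime_cost * downtime_reduction
            + maintenance_cost * maintenance_reduction)

# Hypothetical baselines: placeholder figures, not measured data
baseline_downtime_cost = 100_000_000    # assumed yearly cost of unplanned downtime
baseline_maintenance_cost = 60_000_000  # assumed yearly maintenance spend

# Low and high ends of the reduction ranges quoted in the text
low = annual_pdm_savings(baseline_downtime_cost, baseline_maintenance_cost, 0.05, 0.10)
high = annual_pdm_savings(baseline_downtime_cost, baseline_maintenance_cost, 0.15, 0.20)

print(f"Estimated gross savings: ${low / 1e6:.1f}M to ${high / 1e6:.1f}M per year")
```

This is gross savings only. A real ROI figure would also subtract the cost of building and operating the PdM system itself (data infrastructure, model maintenance, engineering time).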

Alert Thresholds, Drift, and the MES Feedback Loop

A PdM model is only as good as the decisions it triggers. Three operational pieces dictate whether the savings actually land.

1. Setting alert thresholds

Most fabs adopt a tiered alert scheme — typically a Yellow / Orange / Red triage:

Tier    | Trigger                                            | Action
Yellow  | Anomaly score > μ + 3σ on recent window            | Engineer notified, no production stop
Orange  | Predicted RUL < 24 h with >80% confidence          | Schedule PM in next available slot
Red     | Predicted RUL < 4 h or hard sensor limit breached  | Tool placed in "PM hold" by MES
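
The tier rules can be expressed as a small classifier that checks the most severe tier first. This is a minimal sketch, assuming the anomaly score, a window of recent scores, a predicted remaining useful life (RUL) in hours, and a confidence value are already available; the function name and argument names are hypothetical.

```python
from statistics import mean, stdev

def classify_alert(score, recent_scores, rul_hours, rul_confidence,
                   hard_limit_breached=False):
    """Return "RED", "ORANGE", "YELLOW", or None (no alert).

    Tiers are evaluated from most to least severe so that a tool
    qualifying for several tiers receives the strongest one.
    recent_scores needs at least two values for the standard deviation.
    """
    # Red: RUL under 4 h or a hard sensor limit breached
    if hard_limit_breached or rul_hours < 4:
        return "RED"
    # Orange: RUL under 24 h with more than 80% confidence
    if rul_hours < 24 and rul_confidence > 0.80:
        return "ORANGE"
    # Yellow: anomaly score above mean + 3 sigma of the recent window
    mu, sigma = mean(recent_scores), stdev(recent_scores)
    if score > mu + 3 * sigma:
        return "YELLOW"
    return None

# Hypothetical recent window of anomaly scores
window = [1.0, 1.2, 0.9, 1.1, 1.0]
print(classify_alert(1.5, window, rul_hours=100, rul_confidence=0.5))
print(classify_alert(1.1, window, rul_hours=12, rul_confidence=0.9))
print(classify_alert(1.1, window, rul_hours=2, rul_confidence=0.1))
```

Checking Red before Orange before Yellow is a design choice of this sketch; it keeps a single, unambiguous tier per evaluation.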

2. Model drift

Equipment evolves: new PMs, new chambers, recipe edits, target swaps. A model trained six months ago can quietly become useless. Monitor drift continuously:

from scipy import stats

def feature_drift(train_dist, recent_dist, alpha=0.01):
    """Two-sample KS test. Return (drifted, ks_stat); drifted is True when p < alpha."""
    ks_stat, p_value = stats.ks_2samp(train_dist, recent_dist)
    return p_value < alpha, ks_stat

# Feature-drift retraining trigger.
# feature_history (per-run feature values, oldest first) and schedule_retrain
# are assumed to be provided by the surrounding pipeline.
# The history must be long enough that the two slices do not overlap.
drifted, score = feature_drift(
    train_dist=feature_history["chamber_pressure_mean"][:30_000],   # training-era baseline
    recent_dist=feature_history["chamber_pressure_mean"][-2_000:],  # most recent runs
)
if drifted:
    schedule_retrain(model_id="etch_chamber_rul", reason=f"KS={score:.3f}")

3. The MES loop

The output of the PdM system is not a CSV — it is a structured event posted to the Manufacturing Execution System (MES):

  • A predicted failure event creates a maintenance ticket in the computerized maintenance management system (CMMS)
  • The scheduler reserves the tool for PM at a low-WIP window
  • Wafer routing is rebalanced to peer chambers
  • Once PM is closed, the model receives a labelled failure or no-failure event for future retraining
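
To illustrate what a "structured event" might look like, here is a minimal sketch that builds a predicted-failure event and serializes it to JSON. The schema, field names, and identifiers such as `ETCH-07` are assumptions made for illustration; a real MES defines its own message format and transport, which are not shown here.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class PredictedFailureEvent:
    # Hypothetical schema; a real MES integration dictates the actual fields
    tool_id: str
    chamber_id: str
    model_id: str
    tier: str
    predicted_rul_hours: float
    confidence: float
    created_at: str  # ISO 8601 timestamp, UTC

def build_event(tool_id, chamber_id, model_id, tier, rul_hours, confidence, now=None):
    """Create an event, stamping it with the current UTC time unless `now` is given."""
    now = now or datetime.now(timezone.utc)
    return PredictedFailureEvent(
        tool_id=tool_id,
        chamber_id=chamber_id,
        model_id=model_id,
        tier=tier,
        predicted_rul_hours=rul_hours,
        confidence=confidence,
        created_at=now.isoformat(),
    )

event = build_event("ETCH-07", "CH-B", "etch_chamber_rul", "ORANGE", 18.5, 0.86)
payload = json.dumps(asdict(event), indent=2)
print(payload)  # this payload would then be posted to the MES
```

Carrying the model ID and tier in the event lets the ticket created downstream be traced back to the exact prediction that caused it, which the feedback step relies on.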

Key Concept: Closed-Loop Feedback

The single biggest lever for PdM accuracy isn't a fancier model — it's a clean closed-loop label pipeline. Every PM ticket should carry the actual failure mode (or "no-fault-found") back to the training set within hours, not weeks. Without that loop, model accuracy decays steadily.
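
A minimal sketch of the label pipeline, assuming predictions and PM tickets share an `event_id` and that closed tickets record a `failure_mode` (with "no-fault-found" as the negative outcome). The record layout and function names are hypothetical.

```python
def label_from_ticket(ticket):
    """Map a PM ticket to a training label.

    Returns 1 for a confirmed failure, 0 for no-fault-found,
    and None if the ticket is still open or has no recorded outcome.
    """
    mode = ticket.get("failure_mode")
    if ticket.get("status") != "closed" or not mode:
        return None
    return 0 if mode == "no-fault-found" else 1

def build_training_rows(predictions, tickets):
    """Join predictions to tickets by event_id and attach labels."""
    by_event = {t["event_id"]: t for t in tickets}
    rows = []
    for p in predictions:
        ticket = by_event.get(p["event_id"])
        label = label_from_ticket(ticket) if ticket else None
        if label is not None:  # skip predictions without a usable outcome
            rows.append({**p, "label": label, "failure_mode": ticket["failure_mode"]})
    return rows

# Hypothetical records, for illustration only
predictions = [
    {"event_id": "e1", "score": 0.91},
    {"event_id": "e2", "score": 0.74},
    {"event_id": "e3", "score": 0.88},
]
tickets = [
    {"event_id": "e1", "status": "closed", "failure_mode": "rf_generator_fault"},
    {"event_id": "e2", "status": "closed", "failure_mode": "no-fault-found"},
    {"event_id": "e3", "status": "open", "failure_mode": None},
]

rows = build_training_rows(predictions, tickets)
print(rows)
```

Keeping "no-fault-found" outcomes as explicit negatives matters: they are the examples that teach the model which alerts were false alarms.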

Knowledge Check

What is the biggest operational challenge in deploying PdM?