ML Models for PdM

Survival analysis, anomaly detection, RUL estimation, and deep learning

Types of PdM Models

Different ML approaches address different PdM questions:

Approach	Question Answered	Methods
Anomaly Detection	Is the tool behaving abnormally right now?	Isolation Forest, Autoencoders, PCA, One-Class SVM
Classification	Will this component fail within N hours?	Random Forest, XGBoost, Neural Networks
RUL Estimation	How many hours until failure?	LSTM, CNN on time series, survival models
Survival Analysis	What's the probability of survival past time T?	Cox PH, Weibull, Random Survival Forests

Key Concept: The Rare Failure Problem

In a well-maintained fab, actual failures are rare: 99.9% or more of observations can be normal, which is a severe class imbalance. This makes supervised learning hard, because a model can reach high accuracy while never predicting a single failure. Common approaches are unsupervised anomaly detection, synthetic oversampling (SMOTE), cost-sensitive learning, and semi-supervised methods that learn "normal" and flag deviations.
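
To make the imbalance concrete, here is a minimal sketch on synthetic labels (the 0.1% failure rate, the sample size, and all variable names are illustrative assumptions, not fab data). It shows that a model which always predicts "normal" scores very high accuracy with zero recall, and computes one common cost-sensitive weighting, the "balanced" heuristic n_samples / (n_classes * n_c) that scikit-learn uses for class_weight="balanced".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels: 1 = failure, 0 = normal, with ~0.1% failures.
n_runs = 100_000
y_true = (rng.random(n_runs) < 0.001).astype(int)

# A trivial "model" that always predicts normal.
y_pred = np.zeros(n_runs, dtype=int)

accuracy = float(np.mean(y_pred == y_true))
n_failures = int(np.sum(y_true == 1))
true_pos = int(np.sum((y_pred == 1) & (y_true == 1)))
recall = true_pos / n_failures if n_failures else float("nan")

# "Balanced" class weights: weight_c = n_samples / (n_classes * n_c).
# The rare failure class gets a much larger weight in the training loss.
counts = np.bincount(y_true, minlength=2)
class_weights = n_runs / (2 * counts)

print(f"accuracy={accuracy:.4f}, recall={recall:.2f}")
print(f"class weights (normal, failure) = {class_weights}")
```

For this reason, recall, precision, or a cost-weighted metric is usually more informative than accuracy when evaluating PdM classifiers.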

Deep Learning for PdM

Deep learning has shown promise for PdM, particularly for directly modeling raw sensor time series:

1D-CNNs: Convolutional networks applied to sensor time series can automatically learn relevant temporal patterns without manual feature engineering.
LSTMs/GRUs: Recurrent networks capture long-range dependencies across multiple process runs (e.g., slow drift over hundreds of runs).
Transformer-based models: Attention mechanisms can identify which time steps and which sensors are most predictive of impending failure.
Autoencoders: Learn a compressed representation of "normal" equipment behavior. Large reconstruction error = abnormal behavior.
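
The reconstruction-error idea can be sketched without a deep learning framework. The example below uses PCA as a linear stand-in for an autoencoder (a deliberate simplification, not a neural network): it learns a low-dimensional subspace from "normal" runs and flags runs whose reconstruction error exceeds a threshold. The sensor layout, noise level, fault size, and 99th-percentile threshold are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "normal" data: 500 runs x 6 sensors driven by 2 latent factors.
# Sensors 0-2 track the first factor, sensors 3-5 track the second.
latent = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
X_normal = latent @ mixing + 0.05 * rng.normal(size=(500, 6))

# Fit: center the data, then keep the top-2 principal directions.
mean = X_normal.mean(axis=0)
_, _, vt = np.linalg.svd(X_normal - mean, full_matrices=False)
components = vt[:2]  # shape (2, n_sensors)

def reconstruction_error(X):
    """Squared error after projecting onto the learned subspace and back."""
    centered = np.atleast_2d(X) - mean
    reconstructed = centered @ components.T @ components
    return np.sum((centered - reconstructed) ** 2, axis=1)

# Threshold: 99th percentile of the error on normal training runs.
threshold = np.quantile(reconstruction_error(X_normal), 0.99)

# A run that breaks the usual sensor correlations: offset one sensor only.
x_faulty = X_normal[0].copy()
x_faulty[3] += 3.0

print(f"threshold={threshold:.4f}")
print(f"faulty run error={reconstruction_error(x_faulty)[0]:.4f}")
```

A nonlinear autoencoder follows the same fit, reconstruct, and threshold pattern, with the projection replaced by a learned encoder and decoder.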

In practice, gradient-boosted trees (XGBoost, LightGBM) on engineered features often outperform deep learning in this domain due to limited training data and the effectiveness of domain-informed features.

Survival Analysis and RUL Estimation

Survival analysis is the statistical backbone of PdM. The central object is the survival function:

S(t) = P(T > t)

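i.e. the probability that a component is still working at time t. The complement is the cumulative failure probability F(t) = 1 − S(t). The instantaneous failure rate (hazard) is h(t) = f(t) / S(t), where f(t) = dF/dt is the failure-time density; h(t) is the failure rate at time t among units that have survived to t.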
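
These quantities determine one another. A short derivation, writing f(t) for the failure-time density and assuming S(0) = 1:

```latex
f(t) = \frac{dF}{dt} = -\frac{dS}{dt}, \qquad
h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\,\ln S(t), \qquad
S(t) = \exp\!\left(-\int_0^t h(u)\,du\right)
```

So specifying any one of S(t), F(t), f(t), or h(t) fixes the other three.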

1. The Weibull model — the workhorse

Fab equipment lifetimes are routinely fit with the two-parameter Weibull distribution:

S(t) = exp(−(t/η)^β)

h(t) = (β/η) · (t/η)^(β−1)

Shape parameter β	Meaning	Typical fab example
< 1	Decreasing hazard ("infant mortality")	New chamber after install — early-life bugs
= 1	Constant hazard (memoryless / exponential)	Random faults — power supply, sensor failures
> 1	Increasing hazard ("wear-out")	Heater coil, RF generator, focus ring
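
The three regimes in the table can be checked numerically. This is a minimal sketch using the Weibull hazard h(t) = (β/η)(t/η)^(β−1); the scale eta = 1000 hours and the three shape values are illustrative assumptions, not fitted fab parameters.

```python
import numpy as np

def weibull_hazard(t, eta, beta):
    """Weibull hazard h(t) = (beta/eta) * (t/eta)**(beta - 1), for t > 0."""
    t = np.asarray(t, dtype=float)
    return (beta / eta) * (t / eta) ** (beta - 1)

# Illustrative values only: eta = 1000 hours, one shape per table row.
t = np.array([100.0, 500.0, 1000.0, 2000.0])
for beta in (0.7, 1.0, 2.5):
    h = weibull_hazard(t, eta=1000.0, beta=beta)
    # Sign of successive differences: -1 falling, 0 flat, +1 rising.
    trend = np.sign(np.diff(h))
    print(f"beta={beta}: h(t)={h}, trend={trend}")
```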

2. Cox proportional hazards — using covariates

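The Cox model multiplies a baseline hazard h₀(t) by an exponential function of the covariates x. Here β is a vector of regression coefficients, unrelated to the Weibull shape parameter above: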

h(t | x) = h₀(t) · exp(β·x)

This lets you say, e.g., "a 10% higher RF reflected power doubles the instantaneous failure rate," without committing to a specific h₀ shape.
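
The arithmetic behind that statement: shifting a covariate by Δx multiplies the hazard by exp(β·Δx), independent of h₀(t). A small sketch, assuming x is measured in units of "10% above nominal reflected power" and that the fitted coefficient happens to be ln 2 per unit (a value chosen here to match the sentence above, not a real fit; hazard_ratio is a hypothetical helper name):

```python
import math

def hazard_ratio(coef: float, delta_x: float) -> float:
    """Hazard ratio exp(coef * delta_x) for a covariate shift delta_x."""
    return math.exp(coef * delta_x)

# Assumption: one unit of x = 10% higher RF reflected power,
# and the Cox coefficient for x is ln(2).
coef = math.log(2.0)

print(hazard_ratio(coef, 1.0))  # one 10% step: roughly 2
print(hazard_ratio(coef, 2.0))  # two 10% steps: roughly 4
```

Note that effects compound multiplicatively: two steps give a ratio of about 4, not 3.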

3. RUL from a Weibull HI model

Once you have an estimated S(t) and the component has already survived to time t_now, the expected Remaining Useful Life (RUL) is the conditional mean of the remaining lifetime:

RUL(t_now) = E[T − t_now | T > t_now] = ∫_{t_now}^∞ [S(u)/S(t_now)] du

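import numpy as np
from scipy.integrate import trapezoid  # np.trapezoid needs NumPy 2.0+
from scipy.special import gamma

def weibull_rul(t_now: float, eta: float, beta: float) -> float:
    """Mean remaining useful life under a Weibull lifetime distribution.

    Numerically integrates S(u)/S(t_now) from t_now up to a finite horizon.
    The unconditional mean lifetime, eta * Gamma(1 + 1/beta), is used only
    to set the width of the integration window.
    """
    if t_now < 0:
        raise ValueError("t_now must be non-negative")
    if eta <= 0 or beta <= 0:
        raise ValueError("eta and beta must be positive")
    # Integrate over a window of ~5x the mean lifetime, starting at t_now.
    # Anchoring the window at t_now keeps the grid increasing even when
    # t_now exceeds 5x the mean. Adequate for beta >= 1; heavy-tailed fits
    # (beta < 1) may need a wider window.
    horizon = t_now + 5 * eta * gamma(1 + 1 / beta)
    u = np.linspace(t_now, horizon, 4000)
    # Evaluate S(u)/S(t_now) as one exponential to avoid 0/0 for large t_now.
    ratio = np.exp((t_now / eta) ** beta - (u / eta) ** beta)
    return float(trapezoid(ratio, u))

# Example: focus ring with eta=1500 RF-hours, beta=2.5 (wear-out)
print(f"RUL at 800 RF-hr: {weibull_rul(800, 1500, 2.5):.0f} hours")
print(f"RUL at 1400 RF-hr: {weibull_rul(1400, 1500, 2.5):.0f} hours")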

Key Concept: Censored Data

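Most components on the floor right now haven't failed yet, so their lifetimes are right-censored: all we know is that the true lifetime exceeds the hours observed so far. Likelihood-based fitting handles this properly. In a Weibull fit, each failure contributes the density f(t) and each censored unit contributes the survival probability S(t); Cox models use the partial likelihood, in which censored units stay in the risk set until their censoring time. Use lifelines or scikit-survival in Python rather than dropping unfinished runs ad hoc, which biases lifetime estimates downward.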
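
To show what the censored likelihood looks like, here is a hand-rolled sketch of a Weibull maximum-likelihood fit with right-censoring, using scipy.optimize.minimize. It is meant only to expose the mechanics; a dedicated survival library remains the better choice in practice. The synthetic data (eta = 1500, beta = 2.5, observation cutoff at 1600 hours) and the function name weibull_neg_log_lik are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def weibull_neg_log_lik(params, t, observed):
    """Negative log-likelihood with right-censoring.

    params holds (log eta, log beta) so both stay positive.
    Failures contribute log f(t) = log h(t) + log S(t);
    censored units contribute log S(t) only.
    """
    eta, beta = np.exp(params)
    z = (t / eta) ** beta  # cumulative hazard, equal to -log S(t)
    log_h = np.log(beta / eta) + (beta - 1) * np.log(t / eta)
    return -(np.sum(observed * log_h) - np.sum(z))

# Synthetic data (assumed values): observation stops at the cutoff,
# so units still running at that point are censored there.
rng = np.random.default_rng(1)
true_eta, true_beta, cutoff = 1500.0, 2.5, 1600.0
lifetimes = true_eta * rng.weibull(true_beta, size=2000)
observed = lifetimes <= cutoff
t = np.minimum(lifetimes, cutoff)

start = np.log([t.mean(), 1.5])
fit = minimize(weibull_neg_log_lik, start, args=(t, observed),
               method="Nelder-Mead")
eta_hat, beta_hat = np.exp(fit.x)

# Naive alternative: drop censored units and fit the failures alone.
n_failed = int(observed.sum())
fit_naive = minimize(weibull_neg_log_lik, start,
                     args=(t[observed], np.ones(n_failed, dtype=bool)),
                     method="Nelder-Mead")
eta_naive, beta_naive = np.exp(fit_naive.x)

print(f"censoring-aware: eta={eta_hat:.0f}, beta={beta_hat:.2f}")
print(f"failures only:   eta={eta_naive:.0f}, beta={beta_naive:.2f}")
```

The failures-only fit sees just the short-lived units, so its scale estimate comes out lower than the censoring-aware one.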

Knowledge Check

Why is anomaly detection often preferred over supervised classification for PdM?

Actual equipment failures are rare, creating severe class imbalance for supervised learning
Anomaly detection is always more accurate
Supervised learning can't handle time-series data
Anomaly detection doesn't require any data