VM Model Building
Feature engineering from sensor traces, regression models, and rigorous validation for fab-grade predictions
Feature Engineering from Sensor Traces
The raw data from a process chamber is a collection of time-series traces — one per sensor, sampled at 1–100 Hz over the duration of the process step (often 30–300 seconds). Converting these into tabular features suitable for ML is the most critical step in VM.
Step-Based Segmentation
Semiconductor recipes have discrete steps (e.g., "strike plasma," "main etch," "over-etch," "purge"). Feature extraction should be done per step, not over the entire trace, because the physics and expected sensor behavior differ dramatically between steps.
import pandas as pd
import numpy as np
from scipy import stats

def extract_trace_features(trace: np.ndarray, sensor_name: str,
                           step_name: str) -> dict:
    """Extract summary statistics from a single sensor trace within one step."""
    features = {}
    prefix = f"{step_name}_{sensor_name}"

    # Basic statistics
    features[f"{prefix}_mean"] = np.mean(trace)
    features[f"{prefix}_std"] = np.std(trace)
    features[f"{prefix}_min"] = np.min(trace)
    features[f"{prefix}_max"] = np.max(trace)
    features[f"{prefix}_range"] = np.ptp(trace)
    features[f"{prefix}_median"] = np.median(trace)

    # Shape statistics
    features[f"{prefix}_skew"] = stats.skew(trace)
    features[f"{prefix}_kurtosis"] = stats.kurtosis(trace)

    # Trend features
    x = np.arange(len(trace))
    slope, intercept, r_value, _, _ = stats.linregress(x, trace)
    features[f"{prefix}_slope"] = slope
    features[f"{prefix}_r_squared"] = r_value ** 2

    # Integral (area under curve — proxy for total energy/dose)
    features[f"{prefix}_integral"] = np.trapz(trace)

    # Settling features (important for pressure, temperature)
    n10 = max(1, len(trace) // 10)  # guard against very short traces
    features[f"{prefix}_first_10pct_mean"] = np.mean(trace[:n10])
    features[f"{prefix}_last_10pct_mean"] = np.mean(trace[-n10:])
    features[f"{prefix}_settling_delta"] = (
        features[f"{prefix}_last_10pct_mean"] -
        features[f"{prefix}_first_10pct_mean"]
    )
    return features
# Example: 50 sensors × 4 steps × 14 features = 2,800 features per wafer
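Putting segmentation and extraction together: a small driver can flatten all per-step, per-sensor traces into one feature row per wafer. This is a sketch; the `{step: {sensor: trace}}` layout and the minimal stand-in extractor are assumptions for illustration, not a fixed fab data format.

```python
import numpy as np
import pandas as pd

# Stand-in for extract_trace_features() above; any function with the same
# (trace, sensor_name, step_name) -> dict signature slots in here.
def extract_features(trace, sensor_name, step_name):
    p = f"{step_name}_{sensor_name}"
    return {f"{p}_mean": np.mean(trace), f"{p}_std": np.std(trace)}

def wafer_feature_row(step_traces: dict) -> pd.Series:
    """Flatten {step_name: {sensor_name: trace}} (assumed layout) into one row."""
    row = {}
    for step_name, sensors in step_traces.items():
        for sensor_name, trace in sensors.items():
            row.update(extract_features(trace, sensor_name, step_name))
    return pd.Series(row)

# Synthetic traces: two steps, with different sensors active in each
rng = np.random.default_rng(0)
traces = {
    "main_etch": {"rf_power": rng.normal(500, 5, 200),
                  "pressure": rng.normal(10.0, 0.1, 200)},
    "over_etch": {"rf_power": rng.normal(300, 5, 100)},
}
row = wafer_feature_row(traces)
print(len(row))  # 3 sensor-step pairs x 2 features = 6 columns
```

Stacking these rows over many wafers yields the tabular X matrix the models below consume.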
Key Concept: Domain-Guided Features
Generic statistics work, but domain knowledge unlocks better features. For example: in a CVD process, the ratio of precursor flow to carrier gas flow matters more than either alone. In plasma etch, the time derivative of optical emission at a specific wavelength signals endpoint. Always collaborate with process engineers to identify physically meaningful derived features.
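The two examples in this paragraph can be sketched as derived features. The sensor names (`flow_precursor`, `oes_254nm`, etc.) and the sigmoid test trace are illustrative assumptions, not a standard naming scheme:

```python
import numpy as np

def domain_features(traces: dict) -> dict:
    """Physically motivated derived features. Sensor names are illustrative."""
    f = {}
    # CVD: precursor-to-carrier flow ratio matters more than either flow alone
    f["flow_ratio_mean"] = float(np.mean(traces["flow_precursor"] /
                                         traces["flow_carrier"]))
    # Plasma etch: the time derivative of optical emission at a chosen
    # wavelength signals endpoint; the steepest drop approximates its timing
    d_oes = np.gradient(traces["oes_254nm"])
    f["oes_endpoint_idx"] = int(np.argmin(d_oes))
    f["oes_max_slope"] = float(d_oes[np.argmin(d_oes)])
    return f

# Synthetic example: emission intensity drops sharply around sample 150
t = np.arange(300)
traces = {
    "flow_precursor": np.full(300, 50.0),
    "flow_carrier": np.full(300, 200.0),
    "oes_254nm": 1.0 / (1.0 + np.exp((t - 150) / 5.0)),
}
feats = domain_features(traces)
print(feats["flow_ratio_mean"])  # 0.25
```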
Choosing the Right Model
VM is fundamentally a supervised regression problem with high dimensionality and limited labels. Here's what works in practice:
Model Landscape for VM
- PLS (Partial Least Squares): The workhorse of traditional VM. Handles multicollinearity naturally, interpretable, fast. Often the baseline to beat.
- LASSO / Elastic Net: Automatic feature selection via L1 penalty. Excellent when only a handful of sensors truly matter.
- Gradient Boosting (XGBoost, LightGBM): Captures nonlinearities and interactions. Often the best accuracy, but requires more tuning and care with small datasets.
- Neural Networks: Generally overkill for tabular VM. Exception: when using raw time-series traces (1D-CNN or LSTM) instead of hand-crafted features.
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LassoCV, ElasticNetCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
import numpy as np

def compare_vm_models(X_train, y_train):
    """Compare common VM model architectures."""
    models = {
        'PLS (10 components)': PLSRegression(n_components=10),
        'LASSO (auto-alpha)': LassoCV(cv=5, max_iter=10000),
        'ElasticNet': ElasticNetCV(cv=5, max_iter=10000),
        'GBM': GradientBoostingRegressor(
            n_estimators=200, max_depth=4,
            learning_rate=0.05, subsample=0.8
        ),
    }
    for name, model in models.items():
        scores = cross_val_score(
            model, X_train, y_train,
            cv=5, scoring='neg_mean_absolute_error'
        )
        mae = -scores.mean()
        std = scores.std()
        print(f"{name:25s} MAE: {mae:.3f} ± {std:.3f} nm")

# Typical results for etch depth VM:
# PLS (10 components)       MAE: 0.45 ± 0.08 nm
# LASSO (auto-alpha)        MAE: 0.42 ± 0.07 nm
# ElasticNet                MAE: 0.41 ± 0.07 nm
# GBM                       MAE: 0.36 ± 0.09 nm
Analogy: The Right Tool for the Job
PLS is like a reliable sedan — it gets you there safely every time. LASSO is a sports car — fast and sleek but needs the right road. GBM is a pickup truck — powerful and versatile but burns more fuel. In VM, start with PLS as your baseline, then see if GBM's extra complexity buys meaningful improvement.
Validation Strategy for VM
Standard k-fold cross-validation can be dangerously optimistic for VM. Semiconductor data has temporal and grouping structure that must be respected.
Why Random Splits Fail
- Temporal leakage: If you randomly split, future wafers leak into training. In production, you only have past data to train on.
- Lot correlation: Wafers from the same lot are highly similar. Random splits put correlated wafers in both train and test.
- Chamber drift: A model validated on data from the same chamber state will look great — until the chamber is cleaned and all sensor baselines shift.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit
import pandas as pd
import numpy as np

def temporal_cv_for_vm(X, y, timestamps, n_splits=5):
    """Time-series cross-validation that respects temporal ordering."""
    # Sort by time (positional indices, independent of the Series index)
    sort_idx = np.argsort(timestamps.values)
    X_sorted = X.iloc[sort_idx]
    y_sorted = y.iloc[sort_idx]
    tscv = TimeSeriesSplit(n_splits=n_splits)
    results = []
    for fold, (train_idx, test_idx) in enumerate(tscv.split(X_sorted)):
        X_train = X_sorted.iloc[train_idx]
        y_train = y_sorted.iloc[train_idx]
        X_test = X_sorted.iloc[test_idx]
        y_test = y_sorted.iloc[test_idx]
        # Ensure no temporal leakage (<= allows ties in timestamps)
        train_end = timestamps.iloc[sort_idx[train_idx[-1]]]
        test_start = timestamps.iloc[sort_idx[test_idx[0]]]
        assert train_end <= test_start, "Temporal leakage detected!"
        model = GradientBoostingRegressor(n_estimators=200, max_depth=4)
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        mae = np.mean(np.abs(y_test - preds))
        results.append({'fold': fold, 'mae': mae, 'n_test': len(y_test)})
        print(f"Fold {fold}: MAE = {mae:.3f} nm (n={len(y_test)})")
    return pd.DataFrame(results)
# Additional validation: test on data AFTER a chamber clean event
# This is the hardest but most realistic test
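Lot correlation, the second failure mode above, calls for group-aware splitting on top of temporal ordering: every wafer from a lot must land on the same side of the split. A minimal sketch using scikit-learn's GroupKFold (the synthetic data and `lot_id` array are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic data: 20 lots of 25 wafers each; wafers within a lot are correlated
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=500)
lot_id = np.repeat(np.arange(20), 25)  # lot label per wafer

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=lot_id):
    # No lot ever appears on both sides of the split
    assert not set(lot_id[train_idx]) & set(lot_id[test_idx])
```

In production you would combine both constraints: split by lot, then verify the lot-level split also respects time order.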
Key Concept: Chamber Clean Boundary Test
The ultimate VM stress test: train on data from one "chamber life" (period between preventive maintenance events) and test on the next chamber life. If your model survives this, it's ready for production. Many models that look great on random splits collapse after a chamber clean because sensor baselines shift.
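The boundary test above amounts to a single train/test split at a maintenance event. A sketch, assuming your wafer table carries an integer chamber-life counter (the `chamber_life` column name is an assumption about your data layout):

```python
import numpy as np
import pandas as pd

def chamber_life_split(df: pd.DataFrame, life_col: str = "chamber_life"):
    """Train on one chamber life, test on the next.

    Assumes df has an integer counter that increments at each
    preventive-maintenance (clean) event.
    """
    lives = sorted(df[life_col].unique())
    assert len(lives) >= 2, "Need at least two chamber lives"
    train = df[df[life_col] == lives[-2]]
    test = df[df[life_col] == lives[-1]]
    return train, test

# Synthetic example: the pressure baseline shifts after the clean
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "pressure_mean": np.concatenate([rng.normal(10.0, 0.1, 100),
                                     rng.normal(10.4, 0.1, 100)]),
    "chamber_life": np.repeat([0, 1], 100),
})
train, test = chamber_life_split(df)
print(len(train), len(test))  # 100 100
```

A model fit on `train` that still meets its MAE target on `test` has demonstrated robustness to exactly the baseline shift that breaks randomly validated models.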
VM Performance Metrics
Fab engineers don't think in terms of RMSE or R². They care about whether VM predictions are good enough to replace or supplement physical metrology. Industry-standard metrics include:
Key Metrics
- MAE / RMSE: Basic accuracy. Must be well within the metrology tool's own repeatability (typically 0.5–2% of the measured value).
- MAPE (Mean Absolute Percentage Error): Normalized accuracy. Target: <1% for reliance VM.
- Cp/Cpk of residuals: Process capability indices applied to VM errors. Cpk > 1.33 is a common reliance threshold.
- Prediction interval coverage: Do the 95% confidence intervals actually contain 95% of true values?
import numpy as np

def vm_performance_report(y_true, y_pred, spec_limit=2.0):
    """Generate a fab-standard VM performance report."""
    residuals = y_true - y_pred
    n = len(residuals)
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    mape = np.mean(np.abs(residuals / y_true)) * 100
    r_squared = 1 - np.sum(residuals**2) / np.sum((y_true - y_true.mean())**2)

    # Cpk of residuals, assuming symmetric error limits at ±spec_limit around 0
    mu = np.mean(residuals)
    sigma = np.std(residuals)
    cpk = min(spec_limit - mu, spec_limit + mu) / (3 * sigma)

    # Coverage check for a 95% prediction interval built from residual sigma
    z_95 = 1.96
    lower = y_pred - z_95 * sigma
    upper = y_pred + z_95 * sigma
    coverage = np.mean((y_true >= lower) & (y_true <= upper)) * 100

    print("=== VM Performance Report ===")
    print(f"Samples: {n}")
    print(f"MAE: {mae:.4f} nm")
    print(f"RMSE: {rmse:.4f} nm")
    print(f"MAPE: {mape:.2f}%")
    print(f"R²: {r_squared:.4f}")
    print(f"Cpk (residuals): {cpk:.2f} {'PASS' if cpk > 1.33 else 'FAIL'}")
    print(f"95% PI Coverage: {coverage:.1f}% {'PASS' if coverage >= 93 else 'FAIL'}")
    return {'mae': mae, 'rmse': rmse, 'mape': mape,
            'r2': r_squared, 'cpk': cpk, 'coverage': coverage}
Did You Know?
Metrology tools themselves have measurement uncertainty. An ellipsometer measuring 50 nm film thickness might have a repeatability of ±0.2 nm. If your VM model achieves MAE of 0.3 nm, you're approaching the noise floor of the measurement itself — going below that is impossible without a better reference.
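One way to see this floor: if model error and metrology repeatability are independent Gaussian noise (an assumption), the residual you observe against the reference tool combines them in quadrature, so it can never fall below the repeatability alone.

```python
import math

def observed_error_floor(sigma_model: float, sigma_metrology: float) -> float:
    """Observed residual sigma when independent Gaussian errors add in quadrature."""
    return math.sqrt(sigma_model**2 + sigma_metrology**2)

# Even a perfect model (sigma_model = 0) still shows the tool's 0.2 nm repeatability
print(round(observed_error_floor(0.0, 0.2), 3))   # 0.2
print(round(observed_error_floor(0.25, 0.2), 3))  # 0.32
```

This is why a VM residual near the tool's repeatability should be read as "at the noise floor," not as an invitation to keep tuning.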
Knowledge Check
1. Why should feature extraction from sensor traces be done per recipe step rather than over the entire trace?