Data Sources for Yield
Inline metrology, WAT data, FDC, and merging heterogeneous data
The Yield Data Landscape
Yield prediction requires integrating multiple data sources across the entire manufacturing flow:
- Inline metrology: CD, overlay, film thickness, and other measurements taken during fabrication. Sparse sampling (5–20 sites per wafer, 5–10% of wafers).
- FDC (equipment sensor data): Process conditions for every wafer on every tool. Complete coverage but indirect — must be correlated to yield outcomes.
- Defect inspection: Defect counts, maps, and classifications from optical and e-beam inspection.
- WAT (Wafer Acceptance Test): Electrical measurements on test structures after fab completion — transistor parameters (Vt, Idsat, Ioff), resistances, capacitances.
- Sort/probe data: Die-level pass/fail and bin results from electrical testing.
- Design data: Die layout features — pattern density, metal coverage, critical design rules.
Key Concept: The Data Integration Challenge
Each data source has a different granularity (wafer-level, die-level, site-level), a different sampling rate, and a different schema. Merging them into a unified dataset is often 80% of the ML project effort. Lot ID and wafer ID are the typical join keys, with die coordinates added for die-level sources, but handling missing data and mismatched sampling is non-trivial.
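To make the sampling mismatch concrete, here is a minimal pandas sketch. The table contents and the column names `sort_yield` and `cd_nm` are invented for illustration; only the lot ID and wafer ID join keys come from the text above. It shows one way to keep missing metrology visible after the join instead of letting it disappear into NaNs.

```python
import pandas as pd

# Hypothetical wafer-level yield for one lot (every wafer is probed).
sort = pd.DataFrame({
    "lot_id": ["L1"] * 4,
    "wafer_id": [1, 2, 3, 4],
    "sort_yield": [0.91, 0.88, 0.93, 0.79],
})

# Hypothetical inline CD: only a sample of wafers is measured.
metro = pd.DataFrame({
    "lot_id": ["L1", "L1"],
    "wafer_id": [1, 3],
    "cd_nm": [45.2, 44.8],
})

# A left join keeps every probed wafer; indicator=True records which rows
# found a metrology match, so missingness is explicit rather than silent.
merged = sort.merge(metro, on=["lot_id", "wafer_id"], how="left", indicator=True)
merged["has_metro"] = merged["_merge"] == "both"
merged = merged.drop(columns="_merge")

# Fraction of probed wafers that actually carry a metrology value.
coverage = merged["has_metro"].mean()
print(merged)
print(f"metrology coverage: {coverage:.0%}")
```

Keeping an explicit `has_metro` flag lets a downstream model, or a human reviewer, tell "not measured" apart from "measured and unremarkable".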
Building the Wafer × Die Feature Table
Every yield model in production starts from one canonical artifact: a flat (wafer, die) feature table. Building it cleanly determines whether the model works.
Standard schema
| Column | Source | Granularity |
|---|---|---|
| lot_id, wafer_id, die_x, die_y | MES | die |
| route step IDs (etch_chamber, litho_chamber, …) | MES history | wafer × step |
| FDC summary stats per step (mean, std, slope) | FDC database | wafer × step |
| Inline metrology (CD, overlay, thickness) | Metrology DB | site (interpolated to die) |
| Defect counts in 0.5 mm neighborhood | Inspection DB | die |
| WAT params (Vt, Idsat, Ioff) at nearest test site | Test DB | site (interpolated) |
| Sort bin (label) | Probe DB | die |
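The FDC row compresses a full sensor trace into three numbers per wafer and step. As a rough illustration of what that reduction might look like, the sketch below computes mean, standard deviation, and a least-squares slope for one trace. The function name `summarize_trace` and the pressure values are hypothetical, not taken from any particular FDC system.

```python
import numpy as np

def summarize_trace(t_s, values):
    """Reduce one sensor trace for one wafer/step to mean, std, and slope."""
    t = np.asarray(t_s, dtype=float)
    v = np.asarray(values, dtype=float)
    # Slope of a least-squares straight-line fit, in sensor units per second.
    slope = np.polyfit(t, v, deg=1)[0]
    return {
        "mean": float(v.mean()),
        "std": float(v.std(ddof=1)),  # sample standard deviation
        "slope": float(slope),
    }

# Hypothetical chamber-pressure trace: 5 samples drifting upward over 4 s.
t_s = [0.0, 1.0, 2.0, 3.0, 4.0]
pressure = [10.0, 10.5, 11.0, 11.5, 12.0]
stats = summarize_trace(t_s, pressure)
print(stats)
```

The slope term is what separates a stable step from one that drifts during processing, which the mean and standard deviation alone can hide.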
Sketch of the build pipeline
```python
import pandas as pd

# NOTE: the load_* functions, interpolate_to_dies, and count_defects_per_die
# are site-specific helpers that this sketch assumes exist elsewhere.


def build_die_feature_table(lot_ids):
    """Join MES + FDC + metrology + defects + WAT + sort into a die-level table."""
    mes = load_mes_history(lot_ids)          # wafer × step
    fdc = load_fdc_summaries(lot_ids)        # wafer × step
    metro = load_inline_metrology(lot_ids)   # site
    defects = load_defect_records(lot_ids)   # die
    wat = load_wat(lot_ids)                  # site
    sort = load_sort_bins(lot_ids)           # die

    # 1. Wafer-level: route + FDC summaries, pivoted to one row per wafer
    wafer_df = mes.merge(fdc, on=["lot_id", "wafer_id", "step"])
    fdc_cols = [c for c in wafer_df.columns if c.startswith("fdc_")]
    wafer_df = wafer_df.pivot_table(
        index=["lot_id", "wafer_id"],
        columns="step",
        values=fdc_cols,
    )
    # Flatten the (fdc column, step) MultiIndex into "<fdc column>__<step>"
    wafer_df.columns = ["__".join(map(str, c)) for c in wafer_df.columns]
    wafer_df = wafer_df.reset_index()

    # 2. Site-level → die-level by nearest-neighbor on (x, y)
    metro_die = interpolate_to_dies(metro, key=("die_x", "die_y"))
    wat_die = interpolate_to_dies(wat, key=("die_x", "die_y"))

    # 3. Defect counts per die (0.5 mm radius)
    defect_die = count_defects_per_die(defects, radius_mm=0.5)

    # 4. Left-join everything onto the sort table so every probed die is kept
    die_keys = ["lot_id", "wafer_id", "die_x", "die_y"]
    die_df = sort.merge(wafer_df, on=["lot_id", "wafer_id"], how="left")
    die_df = die_df.merge(metro_die, on=die_keys, how="left")
    die_df = die_df.merge(wat_die, on=die_keys, how="left")
    die_df = die_df.merge(defect_die, on=die_keys, how="left")
    return die_df
```
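The sketch leaves `interpolate_to_dies` undefined. One possible shape for the nearest-neighbor step, for a single wafer, is shown below. It assumes site and die positions have already been converted to the same wafer-centered millimeter frame; the helper name `nearest_site_to_dies` and the columns `x_mm`, `y_mm`, and `cd_nm` are hypothetical.

```python
import numpy as np
import pandas as pd

def nearest_site_to_dies(sites, dies, value_cols):
    """Assign each die the values measured at its nearest site (one wafer).

    Assumes both frames carry x_mm / y_mm in the same wafer-centered frame.
    """
    site_xy = sites[["x_mm", "y_mm"]].to_numpy(dtype=float)
    die_xy = dies[["x_mm", "y_mm"]].to_numpy(dtype=float)
    # Pairwise squared distances, shape (n_dies, n_sites).
    d2 = ((die_xy[:, None, :] - site_xy[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    out = dies.reset_index(drop=True).copy()
    for col in value_cols:
        out[col] = sites[col].to_numpy()[nearest]
    # Keep the distance so far-away assignments can be filtered or flagged.
    out["site_dist_mm"] = np.sqrt(d2[np.arange(len(out)), nearest])
    return out

# Hypothetical wafer with two measurement sites and three dies.
sites = pd.DataFrame({"x_mm": [0.0, 50.0], "y_mm": [0.0, 0.0], "cd_nm": [45.0, 46.5]})
dies = pd.DataFrame({
    "die_x": [0, 1, 4],
    "die_y": [0, 0, 0],
    "x_mm": [0.0, 10.0, 40.0],
    "y_mm": [0.0, 0.0, 0.0],
})
metro_die = nearest_site_to_dies(sites, dies, ["cd_nm"])
print(metro_die)
```

The brute-force distance matrix is fine for a few dozen sites per wafer. With much denser site data, a spatial index such as `scipy.spatial.cKDTree` would likely be the better choice.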
Three things that go wrong
- Step-name drift: recipes get renamed, and you suddenly have two columns for the same physical step.
- Coordinate misalignment: inline metrology sites and probe dies use different coordinate systems, and mis-mapping between them invents fake correlations.
- Look-ahead leakage: WAT is measured after fab completion, so it is a valid feature for predicting sort yield. Feeding it to a model that has to run at an earlier process step is data leakage, because the measurement does not exist yet at prediction time.
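Two of these failure modes can be guarded against mechanically. The sketch below is one way to do it, not a standard recipe: an alias table collapses renamed steps onto one canonical name, and a small check lists features that only become available after the point where the model runs. All step names, stage names, and function names here are hypothetical.

```python
import pandas as pd

# Hypothetical alias table: renamed recipes mapped to one canonical step name.
STEP_ALIASES = {
    "GATE_ETCH_V2": "gate_etch",
    "GATE-ETCH": "gate_etch",
    "M1_LITHO_NEW": "m1_litho",
}

def canonicalize_steps(df, col="step"):
    """Replace known aliases; names without an alias pass through unchanged."""
    out = df.copy()
    out[col] = out[col].map(lambda s: STEP_ALIASES.get(s, s))
    return out

# Hypothetical ordering of when each data source becomes available.
STAGE_ORDER = {"inline": 0, "end_of_line_wat": 1, "sort": 2}

def leaking_features(feature_stage, decision_stage):
    """List features that only exist after the point where the model runs."""
    cutoff = STAGE_ORDER[decision_stage]
    return sorted(f for f, s in feature_stage.items() if STAGE_ORDER[s] > cutoff)

fdc = pd.DataFrame({
    "wafer_id": [1, 2, 3],
    "step": ["GATE_ETCH_V2", "gate_etch", "GATE-ETCH"],
    "fdc_pressure_mean": [10.1, 10.3, 9.9],
})
clean = canonicalize_steps(fdc)
print(clean["step"].unique())

feature_stage = {
    "fdc_pressure_mean": "inline",
    "cd_nm": "inline",
    "wat_vt": "end_of_line_wat",
}
# A sort-yield model that runs after WAT may use WAT features.
print(leaking_features(feature_stage, "end_of_line_wat"))
# A model that runs during the fab flow may not.
print(leaking_features(feature_stage, "inline"))
```

Running the canonicalization before any pivot on step name keeps the two spellings of one physical step from becoming two half-empty columns.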
Key Concept: Treat the Feature Table as a Product
The feature table is the central artifact for every downstream model — yield, defect, virtual metrology. Version it, test it, and document each column. Without a stable schema, every new model project re-builds it from scratch and gets a slightly different answer.
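As a sketch of what "test it" could mean in practice, the check below validates a few basic invariants of the die-level table: required columns are present, each (lot, wafer, die) key appears once, and every row has a label. The version tag, the `sort_bin` column name, and the function name are assumptions made for this example.

```python
import pandas as pd

SCHEMA_VERSION = "1.0.0"  # hypothetical version tag stored with the table
KEY_COLS = ["lot_id", "wafer_id", "die_x", "die_y"]
REQUIRED_COLS = KEY_COLS + ["sort_bin"]

def validate_feature_table(df):
    """Return a list of human-readable schema problems (empty means OK)."""
    problems = []
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need these columns
    if df.duplicated(subset=KEY_COLS).any():
        problems.append("duplicate (lot, wafer, die) rows")
    if df["sort_bin"].isna().any():
        problems.append("rows without a sort bin label")
    return problems

good = pd.DataFrame({
    "lot_id": ["L1", "L1"],
    "wafer_id": [1, 1],
    "die_x": [0, 1],
    "die_y": [0, 0],
    "sort_bin": [1, 7],
})
# Simulate a bad join that duplicated the first die.
bad = pd.concat([good, good.iloc[[0]]], ignore_index=True)

print(validate_feature_table(good))
print(validate_feature_table(bad))
```

Duplicate keys are worth checking first, since a many-to-one merge that silently becomes many-to-many inflates the row count and skews any yield statistic computed from the table.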
Knowledge Check
What fraction of the ML project effort is typically spent on data integration for yield prediction?