Yield Prediction & Optimization

Data Sources for Yield

Inline metrology, WAT data, FDC, and merging heterogeneous data

The Yield Data Landscape

Yield prediction requires integrating multiple data sources across the entire manufacturing flow:

  • Inline metrology: CD, overlay, film thickness, and other measurements taken during fabrication. Sparse sampling (5–20 sites per wafer, 5–10% of wafers).
  • FDC (equipment sensor data): Process conditions for every wafer on every tool. Complete coverage but indirect — must be correlated to yield outcomes.
  • Defect inspection: Defect counts, maps, and classifications from optical and e-beam inspection.
  • WAT (Wafer Acceptance Test): Electrical measurements on test structures after fab completion — transistor parameters (Vt, Idsat, Ioff), resistances, capacitances.
  • Sort/probe data: Die-level pass/fail and bin results from electrical testing.
  • Design data: Die layout features — pattern density, metal coverage, critical design rules.

Key Concept: The Data Integration Challenge

Each data source has a different granularity (wafer-level, die-level, site-level), sampling rate, and schema. Merging them into a unified dataset is often cited as roughly 80% of the ML project effort. Lot ID and wafer ID are the typical join keys (plus die coordinates for die-level joins), but handling missing data and mismatched sampling is non-trivial.
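
As a minimal sketch of the mismatched-sampling problem, the snippet below left-joins sparse wafer-level metrology onto die-level sort data and measures how many dies actually received a value. The column names and toy values are assumptions for illustration, not a real fab schema.

```python
import pandas as pd

# Toy die-level sort results for two wafers (hypothetical values).
sort = pd.DataFrame({
    "lot_id":   ["L1"] * 4,
    "wafer_id": ["W01", "W01", "W02", "W02"],
    "die_x":    [0, 1, 0, 1],
    "die_y":    [0, 0, 0, 0],
    "bin":      [1, 1, 7, 1],
})

# Wafer-level inline metrology; only W01 was sampled.
metro = pd.DataFrame({
    "lot_id":     ["L1"],
    "wafer_id":   ["W01"],
    "cd_mean_nm": [45.2],
})

# validate="m:1" raises if the metrology side has duplicate wafer keys,
# which would otherwise silently duplicate die rows.
merged = sort.merge(
    metro, on=["lot_id", "wafer_id"], how="left",
    validate="m:1", indicator=True,
)

# Fraction of dies that received a metrology value.
coverage = (merged["_merge"] == "both").mean()

# Keep missingness as an explicit flag instead of hiding it behind imputation.
merged["cd_measured"] = merged["_merge"] == "both"
merged = merged.drop(columns="_merge")
print(f"metrology coverage: {coverage:.0%}")
```

Tracking coverage per source makes it visible when a feature is mostly imputed rather than measured.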

Building the Wafer × Die Feature Table

Most production yield models start from one canonical artifact: a flat (wafer, die) feature table. How cleanly that table is built largely determines whether the model works.

Standard schema

Column                                            | Source        | Granularity
lot_id, wafer_id, die_x, die_y                    | MES           | die
route step IDs (etch_chamber, litho_chamber, …)   | MES history   | wafer × step
FDC summary stats per step (mean, std, slope)     | FDC database  | wafer × step
Inline metrology (CD, overlay, thickness)         | Metrology DB  | site (interpolated to die)
Defect counts in 0.5 mm neighborhood              | Inspection DB | die
WAT params (Vt, Idsat, Ioff) at nearest test site | Test DB       | site (interpolated)
Sort bin (label)                                  | Probe DB      | die
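
The FDC row of the schema (mean, std, slope per step) could be computed from raw sensor traces roughly as follows. This is a sketch under assumptions: the raw trace layout (t_s, value), the fdc_ column prefix, and the helper name summarize_trace are all hypothetical.

```python
import numpy as np
import pandas as pd

def summarize_trace(trace: pd.DataFrame) -> pd.Series:
    """Mean, std, and least-squares slope of one sensor trace."""
    t = trace["t_s"].to_numpy(dtype=float)
    v = trace["value"].to_numpy(dtype=float)
    # A slope needs at least two distinct time points.
    slope = np.polyfit(t, v, 1)[0] if np.unique(t).size > 1 else np.nan
    return pd.Series({
        "fdc_mean": v.mean(),
        "fdc_std": v.std(ddof=0),
        "fdc_slope": slope,
    })

# Hypothetical raw trace: one sensor sampled once per second during one step.
raw = pd.DataFrame({
    "lot_id": "L1", "wafer_id": "W01", "step": "etch",
    "t_s":   [0.0, 1.0, 2.0, 3.0],
    "value": [10.0, 10.5, 11.0, 11.5],
})

# One summary row per (lot, wafer, step).
fdc = (
    raw.groupby(["lot_id", "wafer_id", "step"])[["t_s", "value"]]
       .apply(summarize_trace)
       .reset_index()
)
print(fdc)
```

The slope captures drift within a step, which a mean alone would hide.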

Sketch of the build pipeline

import pandas as pd

def build_die_feature_table(lot_ids):
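    """Join MES + FDC + metrology + defects + WAT + sort into a die-level table.

    The load_* functions, interpolate_to_dies, and count_defects_per_die are
    site-specific helpers that are not defined in this sketch.
    """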
    mes      = load_mes_history(lot_ids)              # wafer × step
    fdc      = load_fdc_summaries(lot_ids)            # wafer × step
    metro    = load_inline_metrology(lot_ids)         # site
    defects  = load_defect_records(lot_ids)           # die
    wat      = load_wat(lot_ids)                      # site
    sort     = load_sort_bins(lot_ids)                # die

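    # 1. Wafer-level: pivot route + FDC summaries to one row per wafer.
    #    Assumes mes carries a "chamber_id" column (name is illustrative).
    keys = ["lot_id", "wafer_id"]
    route_wide = mes.pivot_table(index=keys, columns="step",
                                 values="chamber_id", aggfunc="first")
    route_wide.columns = [f"{step}_chamber" for step in route_wide.columns]
    fdc_wide = fdc.pivot_table(
        index=keys,
        columns="step",
        values=[c for c in fdc.columns if c.startswith("fdc_")],
    )  # default aggfunc="mean" collapses repeated visits to a step
    fdc_wide.columns = [f"{stat}__{step}" for stat, step in fdc_wide.columns]
    wafer_df = route_wide.join(fdc_wide, how="left").reset_index()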

    # 2. Site-level → die-level by nearest-neighbor on (x, y)
    metro_die  = interpolate_to_dies(metro,  key=("die_x", "die_y"))
    wat_die    = interpolate_to_dies(wat,    key=("die_x", "die_y"))

    # 3. Defect counts per die (0.5 mm radius)
    defect_die = count_defects_per_die(defects, radius_mm=0.5)

    # 4. Left-join everything onto the sort table (one row per probed die)
    die_df = sort.merge(wafer_df,  on=["lot_id", "wafer_id"], how="left")
    die_df = die_df.merge(metro_die, on=["lot_id", "wafer_id", "die_x", "die_y"], how="left")
    die_df = die_df.merge(wat_die,   on=["lot_id", "wafer_id", "die_x", "die_y"], how="left")
    die_df = die_df.merge(defect_die,on=["lot_id", "wafer_id", "die_x", "die_y"], how="left")
    return die_df
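
The pipeline above leaves the site-to-die step undefined. One possible version of it, as a sketch: nearest_site_to_dies is a hypothetical helper (its signature differs from interpolate_to_dies above), and it assumes both tables carry x_mm / y_mm in the same wafer coordinate frame.

```python
import numpy as np
import pandas as pd

def nearest_site_to_dies(sites, dies, value_cols):
    """Give each die the values of the nearest measured site on its wafer."""
    keys = ["lot_id", "wafer_id"]
    site_groups = {k: g for k, g in sites.groupby(keys)}
    out = []
    for wafer_key, die_grp in dies.groupby(keys):
        site_grp = site_groups.get(wafer_key)
        if site_grp is None:
            continue  # wafer was not sampled; a later left join yields NaN
        d_xy = die_grp[["x_mm", "y_mm"]].to_numpy(dtype=float)
        s_xy = site_grp[["x_mm", "y_mm"]].to_numpy(dtype=float)
        # Pairwise squared distances, shape (n_dies, n_sites).
        d2 = ((d_xy[:, None, :] - s_xy[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        res = die_grp[keys + ["die_x", "die_y"]].reset_index(drop=True)
        for col in value_cols:
            res[col] = site_grp[col].to_numpy()[idx]
        # Keep the distance so far-away assignments can be down-weighted.
        res["site_dist_mm"] = np.sqrt(d2.min(axis=1))
        out.append(res)
    if not out:
        return pd.DataFrame(columns=keys + ["die_x", "die_y", *value_cols, "site_dist_mm"])
    return pd.concat(out, ignore_index=True)

# Toy data (hypothetical): two measured sites, three dies on one wafer.
sites = pd.DataFrame({
    "lot_id": "L1", "wafer_id": "W01",
    "x_mm": [0.0, 10.0], "y_mm": [0.0, 0.0], "cd_nm": [45.0, 47.0],
})
dies = pd.DataFrame({
    "lot_id": "L1", "wafer_id": "W01",
    "die_x": [0, 1, 2], "die_y": [0, 0, 0],
    "x_mm": [1.0, 6.0, 9.0], "y_mm": [0.0, 0.0, 0.0],
})
metro_die = nearest_site_to_dies(sites, dies, ["cd_nm"])
print(metro_die)
```

The brute-force distance matrix is adequate for tens of sites per wafer; a spatial index would be the usual substitute at larger scale.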

Three things that go wrong

  • Step-name drift: recipes and steps get renamed over time, so the same physical step can suddenly appear as two separate columns.
  • Coordinate misalignment: inline metrology sites and probe dies use different coordinate systems; mis-mapping them invents fake correlations.
  • Look-ahead leakage: WAT is measured after fab completion, so it is a valid feature when predicting sort yield at end of line, but using it in a model that must predict mid-flow, before WAT exists, is data leakage.
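
One way to guard against look-ahead leakage is to record the stage at which each column first exists and filter features by the point in the flow where the prediction is made. The stage names, column names, and helper functions below are hypothetical, shown only as a sketch.

```python
# Flow order, earliest first (hypothetical stage names).
STAGE_ORDER = ["inline", "post_fab_inspection", "wat"]

# Hypothetical catalog: the stage at which each feature column first exists.
FEATURE_STAGE = {
    "fdc_mean__etch": "inline",
    "cd_nm": "inline",
    "defect_count": "post_fab_inspection",
    "wat_vt": "wat",
}

def allowed_features(predict_at, catalog=FEATURE_STAGE, order=STAGE_ORDER):
    """Columns that already exist when the prediction is made."""
    cutoff = order.index(predict_at)
    return sorted(c for c, stage in catalog.items() if order.index(stage) <= cutoff)

def assert_no_leakage(columns, predict_at):
    """Raise if any column is uncataloged or not yet available at predict_at."""
    leaked = sorted(set(columns) - set(allowed_features(predict_at)))
    if leaked:
        raise ValueError(f"columns not available at {predict_at!r}: {leaked}")

print(allowed_features("inline"))
print(allowed_features("wat"))
```

Calling assert_no_leakage(["cd_nm", "wat_vt"], "inline") raises, because wat_vt does not exist yet at an inline prediction point.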

Key Concept: Treat the Feature Table as a Product

The feature table is the central artifact for every downstream model — yield, defect, virtual metrology. Version it, test it, and document each column. Without a stable schema, every new model project re-builds it from scratch and gets a slightly different answer.
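
"Test it" can start with a few cheap schema checks run on every rebuild. The sketch below assumes the four-column die key from the standard schema; check_feature_table and the null-fraction threshold are illustrative choices, not an established convention.

```python
import pandas as pd

KEYS = ["lot_id", "wafer_id", "die_x", "die_y"]

def check_feature_table(df, required_cols, max_null_frac=0.5):
    """Return a list of human-readable schema violations (empty means OK)."""
    problems = []
    missing = [c for c in KEYS + list(required_cols) if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # later checks need these columns
    # Exactly one row per die.
    n_dup = int(df.duplicated(subset=KEYS).sum())
    if n_dup:
        problems.append(f"{n_dup} duplicate die rows")
    # Flag columns that are mostly empty after the joins.
    null_frac = df[list(required_cols)].isna().mean()
    for col, frac in null_frac.items():
        if frac > max_null_frac:
            problems.append(f"{col}: {frac:.0%} null")
    return problems

# Toy table (hypothetical) with one duplicated die and a sparse column.
df = pd.DataFrame({
    "lot_id": ["L1"] * 3, "wafer_id": ["W01"] * 3,
    "die_x": [0, 1, 1], "die_y": [0, 0, 0],
    "cd_nm": [45.0, None, None], "sort_bin": [1, 1, 7],
})
problems = check_feature_table(df, ["cd_nm", "sort_bin"])
print(problems)
```

Returning a list rather than raising on the first failure lets a build report show every violation at once.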

Knowledge Check

What fraction of the ML project effort is typically spent on data integration for yield prediction?