Data Sources for Yield

Inline metrology, WAT data, FDC, and merging heterogeneous data

The Yield Data Landscape

Yield prediction requires integrating multiple data sources across the entire manufacturing flow:

Inline metrology: CD, overlay, film thickness, and other measurements taken during fabrication. Sparse sampling (5–20 sites per wafer, 5–10% of wafers).
FDC (equipment sensor data): Process conditions for every wafer on every tool. Complete coverage but indirect — must be correlated to yield outcomes.
Defect inspection: Defect counts, maps, and classifications from optical and e-beam inspection.
WAT (Wafer Acceptance Test): Electrical measurements on test structures after fab completion — transistor parameters (Vt, Idsat, Ioff), resistances, capacitances.
Sort/probe data: Die-level pass/fail and bin results from electrical testing.
Design data: Die layout features — pattern density, metal coverage, critical design rules.

Key Concept: The Data Integration Challenge

Each data source has a different granularity (wafer-level, die-level, site-level), a different sampling rate, and a different schema. Merging them into a unified dataset is often around 80% of the effort in an ML project. Lot ID and wafer ID are the typical join keys, with die coordinates added for die-level sources, but handling missing data and mismatched sampling is non-trivial.
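
As a minimal sketch of the join problem, the following uses pandas with toy data and hypothetical column names (`cd_nm`, `bin`, `cd_measured`). It left-joins sparsely sampled wafer-level metrology onto die-level sort results and keeps an explicit flag for which dies actually received a measurement, rather than filling the gaps silently.

```python
import pandas as pd

# Die-level sort results: every die on every wafer (toy data).
sort = pd.DataFrame({
    "lot_id":   ["A1"] * 4,
    "wafer_id": [1, 1, 2, 2],
    "die_x":    [0, 1, 0, 1],
    "die_y":    [0, 0, 0, 0],
    "bin":      [1, 1, 7, 1],
})

# Wafer-level inline metrology: only wafer 1 was sampled.
metro = pd.DataFrame({
    "lot_id":   ["A1"],
    "wafer_id": [1],
    "cd_nm":    [45.2],
})

merged = sort.merge(
    metro,
    on=["lot_id", "wafer_id"],
    how="left",               # keep every die, measured or not
    validate="many_to_one",   # fail if metrology has duplicate wafer rows
    indicator=True,           # adds a _merge column: "both" or "left_only"
)

# Record missingness explicitly instead of imputing it away.
merged["cd_measured"] = merged["_merge"].eq("both")
merged = merged.drop(columns="_merge")

# Fraction of dies that have a real metrology value behind them.
coverage = merged["cd_measured"].mean()
```

Tracking `coverage` per source is one way to see how much of a feature column is measured and how much is missing before any imputation is chosen.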

Building the Wafer × Die Feature Table

Every yield model in production starts from one canonical artifact: a flat (wafer, die) feature table. Building it cleanly determines whether the model works.

Standard schema

Column	Source	Granularity
lot_id, wafer_id, die_x, die_y	MES	die
route step IDs (etch_chamber, litho_chamber, …)	MES history	wafer × step
FDC summary stats per step (mean, std, slope)	FDC database	wafer × step
Inline metrology (CD, overlay, thickness)	Metrology DB	site (interpolated to die)
Defect counts in 0.5 mm neighborhood	Inspection DB	die
WAT params (Vt, Idsat, Ioff) at nearest test site	Test DB	site (interpolated)
Sort bin (label)	Probe DB	die
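
The FDC row in the schema compresses each raw sensor trace into a few summary statistics. A minimal sketch of that reduction, assuming a long-format trace table with hypothetical columns `t_s` (seconds into the step) and `value` (sensor reading), could look like this:

```python
import numpy as np
import pandas as pd

KEYS = ["lot_id", "wafer_id", "step"]

def summarize_trace(t, v):
    """Mean, population std, and least-squares slope of one sensor trace."""
    t = np.asarray(t, dtype=float)
    v = np.asarray(v, dtype=float)
    # A slope needs at least two distinct time points.
    slope = np.polyfit(t, v, 1)[0] if np.unique(t).size >= 2 else np.nan
    return {"fdc_mean": v.mean(), "fdc_std": v.std(), "fdc_slope": slope}

# Toy traces: wafer 1 ramps by 2 units per second, wafer 2 is flat.
trace = pd.DataFrame({
    "lot_id":   ["A1"] * 8,
    "wafer_id": [1] * 4 + [2] * 4,
    "step":     ["etch"] * 8,
    "t_s":      [0.0, 1.0, 2.0, 3.0] * 2,
    "value":    [10.0, 12.0, 14.0, 16.0, 5.0, 5.0, 5.0, 5.0],
})

rows = []
for key, g in trace.groupby(KEYS):
    # key is a (lot_id, wafer_id, step) tuple because KEYS is a list.
    rows.append({**dict(zip(KEYS, key)), **summarize_trace(g["t_s"], g["value"])})

summary = pd.DataFrame(rows)   # one row per wafer × step
```

The `fdc_` prefix on the output columns matches the prefix that the build pipeline filters on when it pivots the summaries.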

Sketch of the build pipeline

import pandas as pd

def build_die_feature_table(lot_ids):
    """Join MES + FDC + metrology + defects + WAT + sort into a die-level table."""
    mes      = load_mes_history(lot_ids)              # wafer × step
    fdc      = load_fdc_summaries(lot_ids)            # wafer × step
    metro    = load_inline_metrology(lot_ids)         # site
    defects  = load_defect_records(lot_ids)           # die
    wat      = load_wat(lot_ids)                      # site
    sort     = load_sort_bins(lot_ids)                # die

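    # 1. Wafer-level: route + FDC summaries, pivoted to one row per wafer.
    #    Only fdc_* columns survive the pivot; duplicate (wafer, step) rows,
    #    e.g. from rework, are averaged by pivot_table's default aggfunc.
    wafer_df = mes.merge(fdc, on=["lot_id", "wafer_id", "step"])
    fdc_cols = [c for c in wafer_df.columns if c.startswith("fdc_")]
    wafer_df = wafer_df.pivot_table(
        index=["lot_id", "wafer_id"],
        columns="step",
        values=fdc_cols,
    )
    # Flatten the (stat, step) MultiIndex; str() guards against non-string step IDs.
    wafer_df.columns = ["__".join(map(str, c)) for c in wafer_df.columns]
    wafer_df = wafer_df.reset_index()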

    # 2. Site-level → die-level by nearest-neighbor on (x, y)
    metro_die  = interpolate_to_dies(metro,  key=("die_x", "die_y"))
    wat_die    = interpolate_to_dies(wat,    key=("die_x", "die_y"))

    # 3. Defect counts per die (0.5 mm radius)
    defect_die = count_defects_per_die(defects, radius_mm=0.5)

    # 4. Final left joins onto the sort table: one row per probed die
    die_df = sort.merge(wafer_df,  on=["lot_id", "wafer_id"], how="left")
    die_df = die_df.merge(metro_die, on=["lot_id", "wafer_id", "die_x", "die_y"], how="left")
    die_df = die_df.merge(wat_die,   on=["lot_id", "wafer_id", "die_x", "die_y"], how="left")
    die_df = die_df.merge(defect_die,on=["lot_id", "wafer_id", "die_x", "die_y"], how="left")
    return die_df
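
The pipeline leaves `interpolate_to_dies` undefined. One possible reading of its nearest-neighbor step, for a single wafer, is sketched below with NumPy. The function name `nearest_site_values` and the choice to work in physical coordinates (mm from wafer centre) are assumptions for illustration, not the original helper.

```python
import numpy as np

def nearest_site_values(die_xy, site_xy, site_values):
    """For each die centre, return the value at the closest measurement site.

    All coordinates must share one frame and one unit (here: mm from wafer centre).
    Also returns the distance to that site so far-away matches can be rejected.
    """
    die_xy = np.asarray(die_xy, dtype=float)          # shape (n_dies, 2)
    site_xy = np.asarray(site_xy, dtype=float)        # shape (n_sites, 2)
    site_values = np.asarray(site_values, dtype=float)

    # Pairwise squared distances, shape (n_dies, n_sites).
    d2 = ((die_xy[:, None, :] - site_xy[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    dist = np.sqrt(d2[np.arange(len(die_xy)), nearest])
    return site_values[nearest], dist

# Toy wafer: three dies, two metrology sites.
die_xy  = [(0.0, 0.0), (10.0, 0.0), (0.0, 40.0)]
site_xy = [(1.0, 0.0), (0.0, 50.0)]
cd_nm   = [45.0, 47.0]

values, dist_mm = nearest_site_values(die_xy, site_xy, cd_nm)
```

Returning the distance alongside the value lets the caller mask dies whose nearest site is too far away to be representative. That matters when only 5–20 sites are measured per wafer.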

Three things that go wrong

Step-name drift: recipes get renamed, and you suddenly have two columns for the same physical step.
Coordinate misalignment: inline metrology sites and probe dies use different coordinate systems, and mis-mapping them invents fake correlations.
Look-ahead leakage: WAT is measured after fab completion, so using it as a feature to predict sort yield is fine, but using it in a model that must predict before WAT exists (for example, at an inline process step) is data leakage.
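
Two of these failure modes can be guarded against mechanically. The sketch below is illustrative only: the alias map, the route order, and the feature names are all hypothetical. It maps raw step names to one canonical name, and it filters out features that would not yet exist at the point where the prediction is made.

```python
# Hypothetical alias map: every historical step/recipe name -> one canonical name.
STEP_ALIASES = {
    "ETCH_M1_V2": "etch_m1",
    "ETCH_M1_V3": "etch_m1",
    "LITHO_M1":   "litho_m1",
}

# Hypothetical route order; WAT comes after the last process step.
ROUTE_ORDER = ["litho_m1", "etch_m1", "wat"]

def canonical_step(raw_name):
    """Map a raw step name to its canonical name; fail loudly on unknown names."""
    try:
        return STEP_ALIASES[raw_name]
    except KeyError:
        raise KeyError(f"unmapped step name: {raw_name!r}") from None

def features_available_at(feature_steps, predict_after):
    """Keep only features produced at or before the prediction point."""
    cutoff = ROUTE_ORDER.index(predict_after)
    return [
        name for name, step in feature_steps.items()
        if ROUTE_ORDER.index(step) <= cutoff
    ]

# Each feature is tagged with the step that produces it.
feature_steps = {
    "fdc_mean__litho_m1": "litho_m1",
    "fdc_mean__etch_m1":  "etch_m1",
    "wat_vt":             "wat",
}

# A model that predicts right after etch must not see WAT.
inline_features = features_available_at(feature_steps, predict_after="etch_m1")
# A model that predicts sort yield after WAT may use everything.
post_wat_features = features_available_at(feature_steps, predict_after="wat")
```

Raising on an unmapped name, instead of passing it through, turns a silent duplicate column into an error at build time.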

Key Concept: Treat the Feature Table as a Product

The feature table is the central artifact for every downstream model: yield, defect, and virtual metrology. Version it, test it, and document each column. Without a stable schema, every new model project rebuilds it from scratch and gets a slightly different answer.
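
"Test it" can start small. The following is a minimal sketch of a schema check, assuming the key columns from the standard schema and a hypothetical label column named `sort_bin`. It returns a list of violations so it can run in a scheduled job or a unit test.

```python
import pandas as pd

KEY_COLS = ["lot_id", "wafer_id", "die_x", "die_y"]
REQUIRED_COLS = KEY_COLS + ["sort_bin"]   # sort_bin is an assumed label name

def check_feature_table(df):
    """Return human-readable schema violations; an empty list means OK."""
    problems = []
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        # Later checks need these columns, so stop here.
        return [f"missing columns: {missing}"]
    if df.duplicated(subset=KEY_COLS).any():
        problems.append("duplicate (lot, wafer, die) rows")
    if df[REQUIRED_COLS].isna().any().any():
        problems.append("nulls in key or label columns")
    return problems

good = pd.DataFrame({
    "lot_id": ["A1", "A1"], "wafer_id": [1, 1],
    "die_x": [0, 1], "die_y": [0, 0], "sort_bin": [1, 7],
})
# Same table with the first die repeated, as a bad join might produce.
bad = pd.concat([good, good.iloc[[0]]], ignore_index=True)

good_report = check_feature_table(good)
bad_report = check_feature_table(bad)
```

A duplicated die row is the typical symptom of a many-to-many join upstream, so the uniqueness check is usually the first one worth adding.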

Knowledge Check

What fraction of the ML project effort is typically spent on data integration for yield prediction?

~80%
~10%
~50%
~5%