PSI theory

Population Stability Index (PSI) is a statistical metric used to measure the distribution shift between two datasets. It is commonly used in model monitoring to detect data drift over time. A high PSI value indicates a significant change in the data distribution, potentially signaling model degradation.

PSI is particularly useful in machine learning and data science for:

  • Monitoring model input distributions to detect data drift.

  • Validating that training and production data remain similar.

  • Ensuring model predictions remain stable over time.

  • Detecting changes in customer behavior, economic conditions, or market trends.

PSI is calculated using the formula:

$$
\text{PSI}(Y_b, Y, B) = \sum_{i=1}^{B} \left( y_i - y_{b_i} \right) \ln\left( \frac{y_i}{y_{b_i}} \right)
$$

where:

  • $y_1, \dots, y_B$ are the proportions of observations in each bin $i$, collected at inference time

  • $y_{b_1}, \dots, y_{b_B}$ are the proportions of observations in each bin $i$, collected at training time

  • $B$ is the number of bins
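The formula can be sketched in a few lines of Python. This is a minimal illustration, not scouter's implementation: the function name `psi` and its arguments are hypothetical, and it assumes every bin proportion is strictly positive, since the logarithm is undefined when a bin is empty.

```python
import math

def psi(baseline_props, current_props):
    """PSI between two lists of bin proportions (hypothetical helper).

    baseline_props: training-time proportions, y_b_i in the formula.
    current_props:  inference-time proportions, y_i in the formula.
    Assumes all proportions are strictly positive.
    """
    if len(baseline_props) != len(current_props):
        raise ValueError("both inputs must have the same number of bins")
    total = 0.0
    for y_b, y in zip(baseline_props, current_props):
        # Each bin contributes (y_i - y_b_i) * ln(y_i / y_b_i), which is never negative.
        total += (y - y_b) * math.log(y / y_b)
    return total

identical = psi([0.5, 0.5], [0.5, 0.5])  # same distribution, no shift
shifted = psi([0.5, 0.5], [0.25, 0.75])  # a quarter of the mass moved to the second bin
```

Because both factors in each term change sign together, every bin's contribution is non-negative, and PSI is zero only when the two sets of proportions match.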

The rule of thumb for interpreting PSI values is:

| PSI Value | Interpretation |
|-----------|----------------|
| < 0.1 | No significant change |
| 0.1 - 0.25 | Moderate shift, monitor closely |
| > 0.25 | Significant shift, investigate |

PSI is computed in four steps:

  1. Bin the Data: Define bins (e.g., equal-width or quantile-based) and apply the same bins to both the expected and observed distributions.
  2. Calculate Proportions: Compute the expected proportion $y_{b_i}$ and the observed proportion $y_i$ for each bin.
  3. Apply PSI Formula: Sum the PSI contributions across all bins.
  4. Analyze Results: If PSI is high, investigate the underlying cause of the drift.
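The four steps can be sketched end to end as follows. This is an illustrative outline under stated assumptions, not scouter's code: it uses equal-width bins, all function names are hypothetical, the sample data is made up, and the `eps` floor for empty bins is one common workaround rather than part of the PSI definition.

```python
import math

def proportions(values, lo, hi, n_bins):
    """Steps 1-2: equal-width bins over [lo, hi], then per-bin proportions."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - lo) / (hi - lo) * n_bins)
        counts[min(max(k, 0), n_bins - 1)] += 1  # clamp out-of-range values into the end bins
    return [c / len(values) for c in counts]

def psi(baseline_props, current_props, eps=1e-6):
    """Step 3: sum the per-bin contributions.

    eps is an assumed floor that keeps an empty bin from producing log(0).
    """
    total = 0.0
    for y_b, y in zip(baseline_props, current_props):
        y_b, y = max(y_b, eps), max(y, eps)
        total += (y - y_b) * math.log(y / y_b)
    return total

def interpret(value):
    """Step 4: map a PSI value onto the rule-of-thumb bands."""
    if value < 0.1:
        return "No significant change"
    if value <= 0.25:
        return "Moderate shift, monitor closely"
    return "Significant shift, investigate"

# Made-up samples for illustration only.
training = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
production = [2, 6, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12]

# The same bin definition is applied to both samples.
train_p = proportions(training, lo=1, hi=12, n_bins=3)
prod_p = proportions(production, lo=1, hi=12, n_bins=3)
score = psi(train_p, prod_p)
label = interpret(score)
```

Here the production sample is concentrated in the top bin, so `label` falls in the "investigate" band.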

Consider an expected distribution and an observed distribution with the following bin frequencies:

| Bin Range | Expected Count | Observed Count |
|-----------|----------------|----------------|
| 0-10 | 1000 | 800 |
| 10-20 | 1500 | 1400 |
| 20-30 | 1200 | 1600 |
| 30-40 | 1300 | 1200 |

Convert these into proportions:

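| Bin Range | Expected Proportion | Observed Proportion |
|-----------|---------------------|---------------------|
| 0-10 | 0.2 | 0.16 |
| 10-20 | 0.3 | 0.28 |
| 20-30 | 0.24 | 0.32 |
| 30-40 | 0.26 | 0.24 |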

Applying the PSI formula:

$$
\text{PSI} = (0.2 - 0.16) \ln\left(\frac{0.2}{0.16}\right) + (0.3 - 0.28) \ln\left(\frac{0.3}{0.28}\right) + (0.24 - 0.32) \ln\left(\frac{0.24}{0.32}\right) + (0.26 - 0.24) \ln\left(\frac{0.26}{0.24}\right) \approx 0.035
$$

Each term is unchanged if the expected and observed proportions are swapped, so the order used here does not affect the result. The value is below 0.1, which indicates no significant change.
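The arithmetic in this example can be checked with a short script. It is a sketch that uses only the counts from the table above; the variable names are illustrative.

```python
import math

# Bin counts from the worked example (bins 0-10, 10-20, 20-30, 30-40).
expected_counts = [1000, 1500, 1200, 1300]
observed_counts = [800, 1400, 1600, 1200]

# Convert counts to proportions.
expected = [c / sum(expected_counts) for c in expected_counts]
observed = [c / sum(observed_counts) for c in observed_counts]

# One PSI contribution per bin, in the same order as the worked equation.
contributions = [(e - o) * math.log(e / o) for e, o in zip(expected, observed)]
psi_value = sum(contributions)

print(round(psi_value, 4))  # 0.0349
```

The 20-30 bin contributes roughly two thirds of the total, which makes it the first place to look if the shift were larger.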

Currently, scouter PSI supports the decile binning approach, which is widely recognized as an industry standard and has been shown to perform well in most use cases. We are actively working on additional binning strategies to offer more flexibility across different scenarios.
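To illustrate what decile binning means, the sketch below derives nine cut points from a baseline sample and reuses them to bin a later sample. It does not reflect scouter's internals: the function names and data are hypothetical, and it relies on `statistics.quantiles` from the Python standard library (Python 3.8+).

```python
import bisect
import statistics

def decile_edges(baseline):
    """Nine cut points that split the baseline sample into ten equal-frequency bins."""
    return statistics.quantiles(baseline, n=10)

def decile_proportions(values, edges):
    """Share of values in each of the ten bins defined by the baseline cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        # bisect_right gives the index of the bin whose upper edge is the first one above v.
        counts[bisect.bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]

baseline = list(range(1, 101))        # illustrative training sample: 1..100
current = [v + 20 for v in baseline]  # illustrative production sample, shifted by 20

edges = decile_edges(baseline)
baseline_props = decile_proportions(baseline, edges)  # about 0.1 in every bin
current_props = decile_proportions(current, edges)    # mass piles up in the top bin
```

By construction the baseline holds roughly 10% of its observations in each bin, so any departure from 0.1 in the production proportions reflects drift. In this example the two lowest bins are empty for the shifted sample, which is a case where the PSI formula needs a small floor on the proportions to stay defined.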

PSI is a powerful tool for detecting data drift in production models. By regularly monitoring PSI values, data scientists and engineers can proactively maintain model performance and stability.