PSI theory
Population Stability Index (reference)
Introduction
Population Stability Index (PSI) is a statistical metric used to measure the distribution shift between two datasets. It is commonly used in model monitoring to detect data drift over time. A high PSI value indicates a significant change in the data distribution, potentially signaling model degradation.
Why Use PSI?
PSI is particularly useful in machine learning and data science for:
- Monitoring model input distributions to detect data drift.
- Validating that training and production data remain similar.
- Ensuring model predictions remain stable over time.
- Detecting changes in customer behavior, economic conditions, or market trends.
Mathematical Definition of PSI
PSI is calculated using the formula:
$$
\mathrm{PSI} = \sum_{i=1}^{n} (p_i - q_i) \, \ln\!\left(\frac{p_i}{q_i}\right)
$$
where:
- $p_i$ is the proportion of observations in bin $i$ collected at inference time
- $q_i$ is the proportion of observations in bin $i$ collected at training time
- $n$ is the number of bins
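The definition can be sketched in a few lines of Python: for each bin, take the inference proportion minus the training proportion and multiply by the natural log of their ratio, then sum over bins. The function name `psi` and the `eps` floor are assumptions made for illustration; the floor is one common way to keep the logarithm defined when a bin is empty, and it is not part of the definition itself.

```python
import math

def psi(expected, observed, eps=1e-6):
    """Population Stability Index between two lists of bin proportions.

    expected: per-bin proportions from training (reference) data
    observed: per-bin proportions from inference (production) data
    eps:      hypothetical floor for empty bins so log and division stay defined
    """
    if len(expected) != len(observed):
        raise ValueError("both inputs must have the same number of bins")
    total = 0.0
    for e, o in zip(expected, observed):
        e = max(e, eps)
        o = max(o, eps)
        # Each bin contributes (observed - expected) * ln(observed / expected)
        total += (o - e) * math.log(o / e)
    return total
```

Every bin contribution is non-negative, so PSI is zero only when the two sets of proportions match, and swapping the two inputs gives the same value.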
Interpreting PSI Values
The rule of thumb for interpreting PSI values is:
| PSI Value | Interpretation |
|---|---|
| < 0.1 | No significant change |
| 0.1 - 0.25 | Moderate shift, monitor closely |
| > 0.25 | Significant shift, investigate |
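In a monitoring job, these thresholds might be applied with a small helper like the one below. The function name `interpret_psi` is hypothetical, and assigning the boundary values 0.1 and 0.25 to the middle band is an assumption, since the rule of thumb does not say which side they belong to.

```python
def interpret_psi(value):
    """Map a PSI value to the rule-of-thumb interpretation."""
    if value < 0.1:
        return "No significant change"
    if value <= 0.25:
        # Assumption: both boundary values fall in the middle band
        return "Moderate shift, monitor closely"
    return "Significant shift, investigate"
```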
How PSI Works
- Bin the Data: Define bins (e.g., equal-width or quantile-based) for both the expected and observed distributions.
- Calculate Proportions: Compute the expected (training) and observed (inference) proportions for each bin.
- Apply PSI Formula: Sum the PSI contributions across all bins.
- Analyze Results: If PSI is high, investigate the underlying cause of the drift.
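The first three steps above can be sketched end to end on raw samples. This is a minimal illustration using equal-width bins derived from the reference data; all function names and the `eps` floor for empty bins are assumptions, and this is not scouter's implementation.

```python
import bisect
import math

def equal_width_edges(reference, n_bins):
    """Step 1: bin edges spanning the range of the reference sample."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def bin_proportions(values, edges):
    """Step 2: share of values per bin; out-of-range values go to the end bins."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        idx = bisect.bisect_right(edges, v) - 1
        idx = min(max(idx, 0), n_bins - 1)  # clamp to a valid bin index
        counts[idx] += 1
    return [c / len(values) for c in counts]

def psi_from_samples(reference, current, n_bins=10, eps=1e-6):
    """Step 3: sum the per-bin PSI contributions."""
    edges = equal_width_edges(reference, n_bins)
    expected = bin_proportions(reference, edges)
    observed = bin_proportions(current, edges)
    total = 0.0
    for e, o in zip(expected, observed):
        e, o = max(e, eps), max(o, eps)  # hypothetical floor for empty bins
        total += (o - e) * math.log(o / e)
    return total

reference = list(range(100))
shifted = [v + 20 for v in reference]
print(psi_from_samples(reference, reference))  # identical samples give 0.0
print(psi_from_samples(reference, shifted))    # a shifted sample gives a large PSI
```

The edges are computed once from the reference sample and reused for the current sample, so both sets of proportions refer to the same bins.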
Example Calculation
Consider an expected distribution and an observed distribution with the following bin frequencies:
| Bin Range | Expected Count | Observed Count |
|---|---|---|
| 0-10 | 1000 | 800 |
| 10-20 | 1500 | 1400 |
| 20-30 | 1200 | 1600 |
| 30-40 | 1300 | 1200 |
Convert these into proportions:
| Bin Range | Expected Proportion | Observed Proportion |
|---|---|---|
| 0-10 | 0.2 | 0.16 |
| 10-20 | 0.3 | 0.28 |
| 20-30 | 0.24 | 0.32 |
| 30-40 | 0.26 | 0.24 |
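Applying the PSI formula, with each term computed as (observed − expected) × ln(observed / expected):
$$
\begin{aligned}
\mathrm{PSI} &= (0.16 - 0.20)\ln\frac{0.16}{0.20} + (0.28 - 0.30)\ln\frac{0.28}{0.30} + (0.32 - 0.24)\ln\frac{0.32}{0.24} + (0.24 - 0.26)\ln\frac{0.24}{0.26} \\
&\approx 0.0089 + 0.0014 + 0.0230 + 0.0016 \\
&\approx 0.035
\end{aligned}
$$
A PSI of about 0.035 is below 0.1, so this example shows no significant change.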
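The same calculation can be checked with a short script that starts from the bin counts in the table. The variable names are chosen here for illustration only.

```python
import math

labels = ["0-10", "10-20", "20-30", "30-40"]
expected_counts = [1000, 1500, 1200, 1300]
observed_counts = [800, 1400, 1600, 1200]

# Convert counts to proportions
expected = [c / sum(expected_counts) for c in expected_counts]
observed = [c / sum(observed_counts) for c in observed_counts]

# Per-bin contribution: (observed - expected) * ln(observed / expected)
contributions = [(o - e) * math.log(o / e) for e, o in zip(expected, observed)]
psi_value = sum(contributions)

for label, c in zip(labels, contributions):
    print(f"{label}: {c:.4f}")
print(f"PSI = {psi_value:.4f}")  # PSI = 0.0349
```

The 20-30 bin contributes about two thirds of the total, which shows how the per-bin terms can point to where a shift is concentrated.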
Binning Strategies
Currently, scouter PSI supports the decile binning approach (ten quantile-based bins), which is widely recognized as an industry standard and has been shown to perform well in most use cases. We are actively working on expanding the library to support additional binning strategies, offering more flexibility to handle various scenarios.
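To make the idea of decile binning concrete, here is a minimal standard-library sketch: nine cut points are taken from the reference data so that each of the ten bins holds roughly a tenth of it, and new values are then assigned to those bins. The function names are hypothetical, and this illustrates the general technique rather than scouter's internal implementation.

```python
import bisect
import statistics

def decile_cut_points(reference):
    """Nine cut points that split the reference sample into ten bins."""
    return statistics.quantiles(reference, n=10)

def decile_proportions(values, cuts):
    """Share of values in each of the ten bins defined by the cut points."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        # bisect_right returns how many cut points are <= v, i.e. the bin index
        counts[bisect.bisect_right(cuts, v)] += 1
    return [c / len(values) for c in counts]

reference = list(range(1, 101))
cuts = decile_cut_points(reference)
print(decile_proportions(reference, cuts))  # each bin holds about 10% of the reference
```

Because the cut points come from the reference data, the expected proportions are close to 0.1 in every bin, and drift shows up as production data piling into some bins and emptying others.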
Conclusion
PSI is a powerful tool for detecting data drift in production models. By regularly monitoring PSI values, data scientists and engineers can proactively maintain model performance and stability.