Population Stability Index (reference)¶
Introduction¶
Population Stability Index (PSI) is a statistical metric used to measure the distribution shift between two datasets. It is commonly used in model monitoring to detect data drift over time. A high PSI value indicates a significant change in the data distribution, potentially signaling model degradation.
Why Use PSI?¶
PSI is particularly useful in machine learning and data science for:
- Monitoring model input distributions to detect data drift.
- Validating that training and production data remain similar.
- Ensuring model predictions remain stable over time.
- Detecting changes in customer behavior, economic conditions, or market trends.
Mathematical Definition of PSI¶
PSI is calculated using the formula:

\[
\mathrm{PSI} = \sum_{i=1}^{B} \left(y_i - y_{b_i}\right) \ln\left(\frac{y_i}{y_{b_i}}\right)
\]

where:

- \(y_1, \ldots, y_B\) are the proportions of observations falling in each bin \(i\) at inference time
- \(y_{b_1}, \ldots, y_{b_B}\) are the proportions of observations falling in each bin \(i\) at training time
- \(B\) is the number of bins
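As a minimal sketch, the sum over bins can be written directly in Python. The function name `psi_from_proportions` is hypothetical and chosen for illustration; it is not part of any library API. The sketch assumes both inputs are already per-bin proportions and that every proportion is strictly positive, since the logarithm is undefined at zero.

```python
import math


def psi_from_proportions(inference, training):
    """Sum (y_i - y_b_i) * ln(y_i / y_b_i) over all bins.

    `inference` holds the y_i values and `training` the y_b_i values.
    Illustrative helper only, not a library function.
    """
    if len(inference) != len(training):
        raise ValueError("both inputs must have one proportion per bin")
    total = 0.0
    for y, y_b in zip(inference, training):
        # ln is undefined at zero, so empty bins must be handled by the caller
        if y <= 0 or y_b <= 0:
            raise ValueError("proportions must be strictly positive")
        total += (y - y_b) * math.log(y / y_b)
    return total


# Identical distributions contribute nothing to the sum
identical = psi_from_proportions([0.25, 0.25, 0.5], [0.25, 0.25, 0.5])

# A two-bin example where mass moves from the second bin to the first
shifted = psi_from_proportions([0.5, 0.5], [0.25, 0.75])
```

Each term is non-negative because the difference and the log ratio always share the same sign, so PSI is zero only when the two sets of proportions match bin for bin.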
Interpreting PSI Values¶
The rule of thumb for interpreting PSI values is:
PSI Value | Interpretation |
---|---|
< 0.1 | No significant change |
0.1 - 0.25 | Moderate shift, monitor closely |
> 0.25 | Significant shift, investigate |
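These thresholds are easy to encode for alerting. The sketch below uses a hypothetical helper name, `interpret_psi`, and assumes the boundary values 0.1 and 0.25 both belong to the middle band, which the rule of thumb leaves unspecified.

```python
def interpret_psi(psi):
    """Map a PSI value to the rule-of-thumb bands (illustrative helper)."""
    if psi < 0.1:
        return "No significant change"
    # Assumption: 0.1 and 0.25 themselves are treated as "moderate"
    if psi <= 0.25:
        return "Moderate shift, monitor closely"
    return "Significant shift, investigate"


labels = [interpret_psi(v) for v in (0.03, 0.18, 0.40)]
```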
How PSI Works¶
- Bin the Data: Define bins (e.g., equal-width or quantile-based) for both the expected and observed distributions.
- Calculate Proportions: Compute the training proportion \( y_{b_i} \) and the inference proportion \( y_i \) for each bin.
- Apply PSI Formula: Sum the PSI contributions across all bins.
- Analyze Results: If PSI is high, investigate the underlying cause of the drift.
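The four steps above can be sketched end to end in plain Python. This is an illustration under stated assumptions, not a library implementation: the helper names are hypothetical, the bins are equal-width with edges taken from the expected data, and empty bins are floored at a small `eps` so the logarithm stays defined.

```python
import bisect
import math


def equal_width_edges(values, n_bins):
    """Interior bin edges spanning the range of the expected data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


def bin_proportions(values, edges, eps=1e-6):
    """Share of values per bin; the outer bins are open-ended."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    # Assumption: floor empty bins at eps to avoid ln(0) or division by zero
    return [max(c / len(values), eps) for c in counts]


def psi(expected_values, observed_values, n_bins=4):
    edges = equal_width_edges(expected_values, n_bins)       # step 1: bin
    expected = bin_proportions(expected_values, edges)       # step 2: proportions
    observed = bin_proportions(observed_values, edges)
    return sum((o - e) * math.log(o / e)                     # step 3: sum terms
               for o, e in zip(observed, expected))


expected_values = list(range(100))                 # 0..99, evenly spread
observed_values = [v + 20 for v in expected_values]  # same shape, shifted up
drift = psi(expected_values, observed_values)      # step 4: compare to thresholds
```

The edges are derived from the expected data only and then reused for the observed data, so both sets of proportions refer to the same bins.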
Example Calculation¶
Consider an expected distribution and an observed distribution with the following bin frequencies:
Bin Range | Expected Count | Observed Count |
---|---|---|
0-10 | 1000 | 800 |
10-20 | 1500 | 1400 |
20-30 | 1200 | 1600 |
30-40 | 1300 | 1200 |
Convert these into proportions:
Bin Range | Expected Proportion | Observed Proportion |
---|---|---|
0-10 | 0.20 | 0.16 |
10-20 | 0.30 | 0.28 |
20-30 | 0.24 | 0.32 |
30-40 | 0.26 | 0.24 |
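Applying the PSI formula, with the observed proportions as \(y_i\) and the expected proportions as \(y_{b_i}\):

\[
\begin{aligned}
\mathrm{PSI} &= (0.16 - 0.20)\ln\frac{0.16}{0.20} + (0.28 - 0.30)\ln\frac{0.28}{0.30} + (0.32 - 0.24)\ln\frac{0.32}{0.24} + (0.24 - 0.26)\ln\frac{0.24}{0.26} \\
&\approx 0.0089 + 0.0014 + 0.0230 + 0.0016 \\
&\approx 0.035
\end{aligned}
\]

A PSI of roughly 0.035 is below 0.1, so this example falls in the "no significant change" range.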
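The arithmetic for this example can be reproduced from the raw counts with a short script. The variable names are illustrative; the counts are the ones in the table above.

```python
import math

# Bin counts from the example table (bins 0-10, 10-20, 20-30, 30-40)
expected_counts = [1000, 1500, 1200, 1300]
observed_counts = [800, 1400, 1600, 1200]

expected_total = sum(expected_counts)
observed_total = sum(observed_counts)

terms = []
for e, o in zip(expected_counts, observed_counts):
    p_exp = e / expected_total   # expected (training) proportion
    p_obs = o / observed_total   # observed (inference) proportion
    terms.append((p_obs - p_exp) * math.log(p_obs / p_exp))

psi = sum(terms)
print([round(t, 4) for t in terms])
print(round(psi, 4))
```

The 20-30 bin contributes about two thirds of the total, which points to where the shift is concentrated.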
Binning Strategies¶
Currently, scouter PSI supports the decile binning approach (ten quantile-based bins), which is widely recognized as an industry standard and has been shown to perform well in most use cases.
We are actively working on expanding the library to support additional binning strategies, offering more flexibility to handle various scenarios.
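To illustrate the idea behind decile binning, the sketch below derives nine cut points from a reference sample with the standard library and checks that each of the ten bins holds about 10% of that sample. It is not scouter's implementation; the function names are hypothetical, and the interpolation rule is whatever `statistics.quantiles` uses by default.

```python
import bisect
import statistics


def decile_edges(reference):
    """Nine interior cut points that split the reference data into ten bins."""
    # statistics.quantiles with n=10 returns the 9 cut points between deciles
    return statistics.quantiles(reference, n=10)


def decile_proportions(values, edges):
    """Share of values per decile bin, using edges fixed on the reference."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]


reference = list(range(1, 1001))           # illustrative reference sample
edges = decile_edges(reference)
reference_props = decile_proportions(reference, edges)
```

Because the edges come from quantiles of the reference data, the reference proportions are close to 0.1 in every bin, so no bin starts out nearly empty. Heavily tied values can still make some bins uneven.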
Conclusion¶
PSI is a powerful tool for detecting data drift in production models. By regularly monitoring PSI values, data scientists and engineers can proactively maintain model performance and stability.