Data Shift/Drift

#machine-learning

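Data shift (also called data drift) is a change in the data a deployed ML model sees relative to its training data: in the input distribution, the output distribution, or the relationship between inputs and outputs. When the shift is large enough to degrade performance, the model usually has to be retrained.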

A feedback loop is one way to handle data shift: it monitors the deployed model for changes in performance and retrains it on newly collected data.
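A minimal sketch of the detection half of such a feedback loop, assuming labels for production predictions arrive later and that accuracy is the metric being watched. The class name `DriftMonitor`, the `tolerance` threshold and the `window` size are hypothetical choices for illustration, not part of any library.

```python
from collections import deque


class DriftMonitor:
    """Tracks rolling accuracy on labelled production data (hypothetical helper)."""

    def __init__(self, baseline_accuracy, tolerance=0.05, window=100):
        self.baseline_accuracy = baseline_accuracy  # accuracy measured at deployment
        self.tolerance = tolerance                  # allowed drop before flagging
        # Only the most recent `window` outcomes are kept.
        self.outcomes = deque(maxlen=window)

    def record(self, prediction, actual):
        """Store whether a prediction matched the label that arrived later."""
        self.outcomes.append(prediction == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """True once a full window has accuracy below baseline - tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return self.rolling_accuracy() < self.baseline_accuracy - self.tolerance
```

When `needs_retraining()` returns `True`, the retraining step would be triggered on the newly collected data; how that step looks depends on the model and pipeline in use.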

The downside of retraining is that the newly collected data can introduce bias into the model.

Source: Correcting Dataset Shift in Machine Learning | Engineering Education (EngEd) Program | Section

Causes of Data drift

  • Sample selection bias
  • Change of environments (difference in training/test environments)

Types of Data drift

  • Covariate shift: The distribution of the input variables changes, while the relationship between inputs and target, P(y | x), stays the same
  • Prior probability shift: The distribution of the target variable changes, while the distribution of the inputs given the target, P(x | y), stays the same
  • Concept drift: The relationship between the input and output variables, P(y | x), changes. It can occur even when the input distribution and the class distribution stay the same.
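The first two types can be checked without a model by comparing a reference sample (for example, the training data) against recent data. The sketch below is one possible approach, not the source's method: it uses the two-sample Kolmogorov-Smirnov statistic for a numeric input feature and the total variation distance for class frequencies. The function names are hypothetical, and any alert threshold would need to be chosen per use case.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The maximum gap is always reached at one of the observed values.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def label_shift(reference_labels, current_labels):
    """Total variation distance between two class-frequency distributions."""
    ref, cur = Counter(reference_labels), Counter(current_labels)
    n_ref, n_cur = len(reference_labels), len(current_labels)
    classes = set(ref) | set(cur)
    return 0.5 * sum(abs(ref[c] / n_ref - cur[c] / n_cur) for c in classes)
```

A large `ks_statistic` on an input feature hints at covariate shift, and a large `label_shift` hints at prior probability shift. Concept drift cannot be seen from inputs or labels alone; it shows up only when predictions are compared against actual outcomes.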

Correcting Data drift

  • Dropping biased features
  • Adversarial search: framing training as a game between two competing agents (the model and an adversary), so the model learns to stay robust to the adversary's changes
  • Importance reweighting: upweighting training samples that resemble the test data
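Reweighting can be sketched for a single numeric feature as follows, under the assumption that it means importance reweighting of training samples: each training sample gets a weight approximating p_test(x) / p_train(x), so samples that look like the test data count more. The histogram-based density estimate, the function name `importance_weights` and the `n_bins` parameter are illustrative choices, not the source's method.

```python
def importance_weights(train_values, test_values, n_bins=10):
    """Per-sample weights approximating p_test(x) / p_train(x) for one feature."""
    lo = min(min(train_values), min(test_values))
    hi = max(max(train_values), max(test_values))
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_index(x):
        # Clamp so that the maximum value falls into the last bin.
        return min(int((x - lo) / width), n_bins - 1)

    train_counts = [0] * n_bins
    test_counts = [0] * n_bins
    for x in train_values:
        train_counts[bin_index(x)] += 1
    for x in test_values:
        test_counts[bin_index(x)] += 1

    weights = []
    for x in train_values:
        b = bin_index(x)
        p_train = train_counts[b] / len(train_values)
        p_test = test_counts[b] / len(test_values)
        weights.append(p_test / p_train)  # p_train > 0 because x is in this bin
    return weights
```

The resulting list could then be passed as per-sample weights to a training routine that supports them. With many features, a histogram quickly becomes impractical and a different density-ratio estimate would be needed.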

Example: Loan Application Model

YouTube: ML Drift: Identifying Issues Before You Have a Problem