Dataset Splitting

A dataset has to be split into one for training a ML model, and one to test/evaluate it.

Random splitting … use specific ratio for splitting

  • i.e 30/70 ratio for testing/training data
  • 10% of user data will be used for testing
  • Given k: k items from each user will be used for training
  • All but k: k items from each user will be used for testing

Time based splitting: choose Cut-Off point in time to separate testing/training data

  • Global: fixed point in time for all users
  • Given-n: train with the first n items, test with the rest

K-Fold Cross Validation: split dataset in (1/k) different fragments

  • Evaluation takes place k times, where each fragment is used once for testing
  • k often is 5 or 10