ML Reinforcement Learning

Like supervised learning, but without a pre-existing dataset. The system generates its own data by interacting with an environment (e.g. the playing field of a game). The goal is to learn a policy for acting in that environment, such as how to play the game.

The machine perceives the state of the environment as a feature vector. The policy function takes the feature vector of a state and predicts the best action; actions are chosen to maximize the expected reward.

Definitions

  • State s
  • Action a
  • Reward Function R(s, a, s')
  • Transition Function T(s, a, s'): the probability that taking action a in state s leads to s', i.e. T(s, a, s') = P(s' | s, a)
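
As a minimal sketch of these definitions, the model of a small MDP can be written down as plain Python dictionaries. The states, actions, and numbers below are made up purely for illustration:

```python
# Hypothetical toy MDP: two states, two actions.
STATES = ["cool", "warm"]      # states s
ACTIONS = ["slow", "fast"]     # actions a

# Transition function T(s, a, s') = P(s' | s, a)
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"warm": 1.0},
}

# Reward function R(s, a, s')
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "warm"): -10.0,
}
```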

Active and Passive RL

When we don’t know the model, meaning the reward function R(s,a,s') and the transition function T(s,a,s'), we talk about model-free learning.

Active RL:

  • Fundamental trade-off: exploration vs. exploitation
  • learn the policy
  • agent makes choices

Passive RL:

  • fixed policy
  • learn the state values
  • agent is along for the ride

Model Free Learning

We can use Monte-Carlo (direct) evaluation: we let the agent run many episodes under its policy and average the observed returns; with enough samples, the estimated V value converges to the real V value.
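
A minimal sketch of this Monte-Carlo evaluation, assuming a hypothetical sampler env_step(s, a) -> (s', r) for the unknown environment, a fixed policy function, and illustrative constants:

```python
def mc_evaluate(env_step, policy, start_state, episodes=1000, gamma=0.9, horizon=50):
    """Monte-Carlo evaluation: average the observed discounted returns over
    many episodes; the average converges to the value of start_state."""
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = start_state, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)               # follow the fixed policy
            s, r = env_step(s, a)       # sample a transition from the environment
            ret += discount * r         # accumulate the discounted return
            discount *= gamma
        total += ret
    return total / episodes             # estimated V(start_state)
```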

Value Iteration

Value iteration computes V_{k+1}(s) from V_k(s), starting with V_0(s) = 0, by repeatedly applying the Bellman update until the values converge to the optimal values:

V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]

Note that this update uses the model (T and R); TD-learning below approximates it from samples instead.
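
A minimal sketch of the value-iteration loop, assuming the model is given as dictionaries T and R like the ones sketched under Definitions (the function name and convergence threshold are my own):

```python
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Repeated Bellman updates:
    V_{k+1}(s) = max_a sum_{s'} T(s,a,s') * [R(s,a,s') + gamma * V_k(s')]."""
    V = {s: 0.0 for s in states}                     # V_0(s) = 0
    while True:
        V_new = {}
        for s in states:
            V_new[s] = max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in T.get((s, a), {}).items())
                for a in actions
            )
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new                             # approximately converged to V*
        V = V_new
```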

TD-Learning (passive)

A model-free way to do policy evaluation: given a fixed policy, it mimics the Bellman updates with running sample averages. After each observed transition (s, a, r, s'), the value estimate is moved toward the sample:

V(s) ← (1 - α) V(s) + α [ r + γ V(s') ]
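
A minimal sketch of this TD(0) update, again assuming a hypothetical sampler env_step(s, a) -> (s', r), a fixed policy function, and illustrative constants:

```python
def td_evaluate(env_step, policy, states, start_state,
                episodes=5000, alpha=0.1, gamma=0.9, horizon=50):
    """TD(0): after each observed transition (s, a, r, s'), move V(s) toward
    the sample r + gamma * V(s') with a running average (learning rate alpha)."""
    V = {s: 0.0 for s in states}
    for _ in range(episodes):
        s = start_state
        for _ in range(horizon):
            a = policy(s)                       # fixed policy (passive RL)
            s2, r = env_step(s, a)              # sample from the unknown model
            sample = r + gamma * V[s2]
            V[s] = (1 - alpha) * V[s] + alpha * sample
            s = s2
    return V
```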

In order to derive a new policy, we have to learn Q-values instead of V-values: extracting a greedy policy from V(s) requires the model (T and R), whereas with Q(s, a) the agent can simply pick the action with the highest Q-value.

Q-Learning (active)

Q-Learning converges to the optimal policy even if the agent acts suboptimally while learning (off-policy learning), as long as every state-action pair keeps being explored and the learning rate decays appropriately.
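
A minimal sketch of tabular Q-learning with epsilon-greedy exploration; env_step(s, a) -> (s', r, done) is a hypothetical sampler for the unknown environment and all constants are illustrative, not prescribed by these notes:

```python
import random

def q_learning(env_step, states, actions, start_state, episodes=5000,
               alpha=0.1, gamma=0.9, epsilon=0.1, horizon=50):
    """Off-policy TD control: after each transition (s, a, r, s'), update
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * [r + gamma * max_a' Q(s', a')]."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = start_state
        for _ in range(horizon):
            # Epsilon-greedy: explore (act suboptimally) with probability epsilon.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env_step(s, a)
            target = r if done else r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
            if done:
                break
    # Greedy policy extracted from the learned Q-values.
    policy = {s: max(actions, key=lambda a_: Q[(s, a_)]) for s in states}
    return Q, policy
```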