ML Reinforcement Learning

Like supervised learning, but without a pre-existing dataset. The system generates its own data by interacting with an environment (e.g. the playing field of a game). The goal is to learn a policy for acting in that environment, such as how to play the game.

The machine perceives the state of the environment as a feature vector. The policy function takes the feature vector of a state and predicts the best action; actions are chosen to maximize the expected reward.

Definitions

  • State s
  • Action a
  • Reward Function R(s, a, s')
  • Transition Function T(s, a, s'): the probability that taking action a in state s leads to s', i.e. T(s, a, s') = P(s' | s, a)
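
As a minimal sketch of these definitions, the model of a small MDP can be written down as plain Python dictionaries. The states, actions, and numbers below are made up purely for illustration:

```python
# Hypothetical toy MDP: two states, two actions.
STATES = ["cool", "warm"]      # states s
ACTIONS = ["slow", "fast"]     # actions a

# Transition function T(s, a, s') = P(s' | s, a)
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"warm": 1.0},
}

# Reward function R(s, a, s')
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "warm"): -10.0,
}
```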

Active and Passive RL

When we don’t know the model, meaning the reward function R(s,a,s') and the transition function T(s,a,s'), we talk about model-free learning.

Active RL:

  • Fundamental trade-off: exploration vs. exploitation
  • learn the policy
  • agent makes choices

Passive RL:

  • fixed policy
  • learn the state values
  • agent is along for the ride

Model Free Learning

We can use Monte-Carlo (direct) evaluation: we let the agent run many episodes under its policy and average the observed returns; with enough samples, the estimated V value converges to the real V value.
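
A minimal sketch of this Monte-Carlo evaluation, assuming a hypothetical sampler env_step(s, a) -> (s', r) for the unknown environment, a fixed policy function, and illustrative constants:

```python
def mc_evaluate(env_step, policy, start_state, episodes=1000, gamma=0.9, horizon=50):
    """Monte-Carlo evaluation: average the observed discounted returns over
    many episodes; the average converges to the value of start_state."""
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = start_state, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)               # follow the fixed policy
            s, r = env_step(s, a)       # sample a transition from the environment
            ret += discount * r         # accumulate the discounted return
            discount *= gamma
        total += ret
    return total / episodes             # estimated V(start_state)
```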

Value Iteration

Value iteration computes V_{k+1}(s) from V_k(s), starting with V_0(s) = 0, by repeatedly applying the Bellman update until the values converge to the optimal values:

V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]

Note that this update uses the model (T and R); TD-learning below approximates it from samples instead.
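
A minimal sketch of the value-iteration loop, assuming the model is given as dictionaries T and R like the ones sketched under Definitions (the function name and convergence threshold are my own):

```python
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Repeated Bellman updates:
    V_{k+1}(s) = max_a sum_{s'} T(s,a,s') * [R(s,a,s') + gamma * V_k(s')]."""
    V = {s: 0.0 for s in states}                     # V_0(s) = 0
    while True:
        V_new = {}
        for s in states:
            V_new[s] = max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in T.get((s, a), {}).items())
                for a in actions
            )
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new                             # approximately converged to V*
        V = V_new
```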

TD-Learning (passive)

A model-free way to do policy evaluation: given a fixed policy, it mimics the Bellman updates with running sample averages. After each observed transition (s, a, r, s'), the value estimate is moved toward the sample:

V(s) ← (1 - α) V(s) + α [ r + γ V(s') ]
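
A minimal sketch of this TD(0) update, again assuming a hypothetical sampler env_step(s, a) -> (s', r), a fixed policy function, and illustrative constants:

```python
def td_evaluate(env_step, policy, states, start_state,
                episodes=5000, alpha=0.1, gamma=0.9, horizon=50):
    """TD(0): after each observed transition (s, a, r, s'), move V(s) toward
    the sample r + gamma * V(s') with a running average (learning rate alpha)."""
    V = {s: 0.0 for s in states}
    for _ in range(episodes):
        s = start_state
        for _ in range(horizon):
            a = policy(s)                       # fixed policy (passive RL)
            s2, r = env_step(s, a)              # sample from the unknown model
            sample = r + gamma * V[s2]
            V[s] = (1 - alpha) * V[s] + alpha * sample
            s = s2
    return V
```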

In order to derive a new policy, we have to learn Q-values instead of V-values: extracting a greedy policy from V(s) requires the model (T and R), whereas with Q(s, a) the agent can simply pick the action with the highest Q-value.

Q-Learning (active)

Q-Learning converges to the optimal policy even if the agent acts suboptimally while learning (off-policy learning), as long as every state-action pair keeps being explored and the learning rate decays appropriately.
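
A minimal sketch of tabular Q-learning with epsilon-greedy exploration; env_step(s, a) -> (s', r, done) is a hypothetical sampler for the unknown environment and all constants are illustrative, not prescribed by these notes:

```python
import random

def q_learning(env_step, states, actions, start_state, episodes=5000,
               alpha=0.1, gamma=0.9, epsilon=0.1, horizon=50):
    """Off-policy TD control: after each transition (s, a, r, s'), update
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * [r + gamma * max_a' Q(s', a')]."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = start_state
        for _ in range(horizon):
            # Epsilon-greedy: explore (act suboptimally) with probability epsilon.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env_step(s, a)
            target = r if done else r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
            if done:
                break
    # Greedy policy extracted from the learned Q-values.
    policy = {s: max(actions, key=lambda a_: Q[(s, a_)]) for s in states}
    return Q, policy
```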