· Reinforcement Learning

Deep Dive
End-to-End Masterclass

From paradigms & equations to Python demos, mock data, and real-world applications — a complete RL presentation.

Q-Learning SARSA DQN Self-Driving Auto-Trader 3x3 Puzzle

Supervised vs RL vs Unsupervised

Feature	Supervised SL	Unsupervised UL	Reinforcement RL
Data	Labeled (X, y)	Unlabeled raw data	Trajectories (State, Action, Reward)
Feedback	Immediate error (Loss)	No explicit feedback	Delayed scalar Reward
Goal	Predict labels	Find hidden structure	Maximize cumulative reward
Decision	Static (i.i.d)	Static (patterns)	Sequential (actions affect future)
Analogy	Teacher with answer key	Librarian organizing books	Training a dog with treats

Classic ML Examples: Decision Tree, Ensemble, K-Means

Decision Tree Supervised

Recursively split data on best feature (Gini/Entropy).

# Pseudocode
def build_tree(data, features):
    if pure or max_depth: return leaf
    best = select_best_split(data, features)
    tree = {best: {}}
    for val in unique(data[best]):
        sub = data[data[best] == val]
        tree[best][val] = build_tree(sub, features - {best})
    return tree

# Python (sklearn)
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)

Ensemble (Random Forest) Supervised

Bootstrap samples + multiple trees + majority vote.

# Pseudocode
def random_forest(data, features, n):
    forest = []
    for i in range(n):
        boot = sample_with_replacement(data)
        tree = build_tree(boot, random_subset(features))
        forest.append(tree)
    return majority_vote(forest)

# Python (sklearn)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
preds = rf.predict(X_test)

K-Means Unsupervised

Cluster data by minimizing intra-cluster variance.

# Pseudocode
def kmeans(data, k):
    centroids = random_k_points(data, k)
    while not converged:
        clusters = assign_to_nearest(data, centroids)
        new_centroids = mean(clusters)
        if new_centroids == centroids: break
        centroids = new_centroids
    return centroids, clusters

# Python (sklearn)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)
labels = kmeans.labels_
centers = kmeans.cluster_centers_

Components of RL (MDP)

Agent

The learner / decision-maker.

Environment

External system (road, market, maze).

State (S)

Current situation (pixels, prices).

Action (A)

What the agent can do (steer, buy/sell).

Reward (R)

Immediate feedback (+/-).

Policy (π)

Strategy: State → Action mapping.

Approaches & Types (w/ Pseudocode)

Model-Free · Value-Based

Learn Q-values directly from experience.

# Q-Learning Pseudocode
Initialize Q(s,a) arbitrarily
For each episode:
    Initialize state s
    For each step:
        Choose a from s using ε-greedy(Q)
        Take a, observe r, s'
        Q(s,a) += α * (r + γ * max Q(s',a') - Q(s,a))
        s = s'
    Until s is terminal

Policy-Based · Actor-Critic

Learn policy directly, often with a Critic for variance reduction.

# Policy Gradient (REINFORCE) Pseudocode
Initialize policy π_θ
For each episode:
    Generate trajectory {s_t, a_t, r_t}
    For each step t:
        G = Σ γ^(k-t) r_k   # Return
        θ += α * ∇θ log π(a_t|s_t) * G
    Update policy

Model-Based: Dyna-Q, AlphaGo Model-Free: DQN, PPO, SAC Actor-Critic: A2C, DDPG

Core Equations

Bellman Expectation:
$ V^\pi(s) = \mathbb{E}_\pi [ R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s ] $

Q-Learning Update (Off-policy):
$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] $

Bellman Optimality:
$ V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right] $

Policy Gradient Objective:
$ \nabla J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla \log \pi_\theta(a|s) \, Q^\pi(s,a) \right] $

Pros & Cons

Advantages

Tackles complex sequential decisions (robotics, games).
Superhuman performance (Go, StarCraft).
No need for labeled data; generates own experience.
Adapts to dynamic, non-linear environments.

Disadvantages

Sample inefficient: Millions of interactions needed.
Reward engineering: Poor rewards lead to catastrophic cheating.
Unstable training: Hyperparameter sensitive.
Safety: Dangerous exploration in real-world.

Lab Demo: Q-Learning · Maze Candy Problem

Q-Learning on a 4x4 grid. 🍬 Candy +10 🕳️ Pit -10

🚀

💀

🍬

Optimal Path: (0,0)→(1,0)→(2,0)→(2,1)→(2,2)→(3,2)→(3,3)

import numpy as np
import random

# 4x4 Grid: 0=Empty, 1=Candy, -1=Pit
grid = [[0,0,0,0],[0,-1,0,0],[0,0,0,0],[0,0,0,1]]
goal = (3,3); actions = [(-1,0),(1,0),(0,-1),(0,1)]
Q = np.zeros((16,4)); alpha, gamma, eps = 0.8, 0.95, 0.1

def step(s, a):
    r,c = divmod(s,4); dr,dc = actions[a]
    nr,nc = max(0,min(3,r+dr)), max(0,min(3,c+dc))
    ns = nr*4+nc
    reward = -1
    if grid[nr][nc] == 1: reward = 10
    if grid[nr][nc] == -1: reward = -10
    return ns, reward, grid[nr][nc] != 0

for ep in range(500):
    s = 0; done = False
    while not done:
        a = random.randint(0,3) if random.random() < eps else np.argmax(Q[s])
        ns, r, done = step(s, a)
        Q[s,a] += alpha * (r + gamma * np.max(Q[ns]) - Q[s,a])
        s = ns

print("Trained Q-Table (first 5 states):\n", Q[:5])
# Test: Greedy path
s, path = 0, [(0,0)]
while s != 15:
    a = np.argmax(Q[s]); s, _, _ = step(s, a)
    path.append(divmod(s,4))
print("Path:", path)  # Optimal route!

Lab Demo: SARSA & Deep Q-Network (DQN)

On‑policy SARSA and Deep Q‑Learning with neural network on the same Candy Maze.

SARSA (On‑policy)

Update rule uses the actual next action $a'$ (not max).

# SARSA update (replace Q-learning update)
# Choose a' using same ε-greedy policy
a_prime = epsilon_greedy(Q, s_prime, eps)
Q[s,a] += alpha * (r + gamma * Q[s_prime, a_prime] - Q[s,a])

# Full training loop (same environment)
for ep in range(500):
    s = 0; a = epsilon_greedy(Q, s, eps)
    done = False
    while not done:
        ns, r, done = step(s, a)
        a_next = epsilon_greedy(Q, ns, eps)
        Q[s,a] += alpha * (r + gamma * Q[ns, a_next] - Q[s,a])
        s, a = ns, a_next
# Path will be slightly different due to on-policy nature.

Deep Q-Network (DQN)

Neural network approximates Q(s,a) for large state spaces. Uses experience replay & target network.

import torch, torch.nn as nn, torch.optim as optim
import numpy as np
from collections import deque

# Simple NN: input 16 (one-hot state) + 4 actions? Better: state -> 4 Q-values
class DQN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(16, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 4)
        )
    def forward(self, x): return self.fc(x)

# State representation: one-hot of state index (16)
def state_to_tensor(s): 
    x = np.zeros(16); x[s]=1; return torch.tensor(x, dtype=torch.float32)

model = DQN(); target = DQN(); target.load_state_dict(model.state_dict())
optimizer = optim.Adam(model.parameters(), lr=0.001)
replay = deque(maxlen=2000); batch_size=32; gamma=0.95; eps=0.1

for ep in range(500):
    s = 0; done = False
    while not done:
        # ε-greedy using model
        if np.random.rand() < eps:
            a = np.random.randint(4)
        else:
            with torch.no_grad():
                q_vals = model(state_to_tensor(s))
                a = torch.argmax(q_vals).item()
        ns, r, done = step(s, a)
        replay.append((s, a, r, ns, done))
        s = ns
        # Train if enough samples
        if len(replay) >= batch_size:
            batch = np.random.choice(len(replay), batch_size, replace=False)
            # ... compute loss (MSE) and update model periodically
            # Target network update every 10 episodes
    # Update target network every 10 episodes
    if ep % 10 == 0: target.load_state_dict(model.state_dict())

# After training, greedy path with model

Both algorithms solve the maze; DQN is essential when state space is large (e.g., raw pixels). SARSA is on‑policy, more conservative.

Mock Data Samples (5 Records)

#	Domain	State	Action	Reward	Next State / Info
1	Maze	State 5 (row=1, col=1)	Down	-1	State 9 (row=2, col=1)
2	Maze	State 10 (row=2, col=2)	Right	+10	Goal! (Candy found)
3	Self-Driving	Steering=0.2, Speed=12, Lane=0.1	Throttle 0.3	+5.2	Lane=0.05, Speed=13
4	FX Trader	RSI=72, Price=1.1050, Vol=12	SELL (size 0.5)	+$23.40	Price drops to 1.1042
5	3x3 Puzzle	Board [1,2,3,4,0,5,6,7,8]	Blank Down	-1	Board [1,2,3,4,7,5,6,0,8]

These samples represent typical transitions logged during RL training — used by Q‑Learning, SARSA, and DQN.

Applications Deep-Dive

3x3 Sliding Puzzle

State: 9-digit array (181k states).
Action: Move blank (Up/Down/Left/Right).
Reward: -1 per step, +100 if solved.
Algo: DQN (Deep Q-Network) for large state space.

Self-Driving Car

State: LIDAR / Camera pixels.
Action: Continuous (Steering, Accel).
Reward: Distance - Collision - Deviation.
Algo: PPO / SAC (continuous action spaces).

Auto-FX Trader

State: Price window + Indicators (RSI, MACD).
Action: Buy/Sell/Hold (discrete) or size (continuous).
Reward: Realized PnL - Transaction Costs.
Algo: Recurrent PPO (LSTM + PPO) for time-series.

Summary & Takeaways

Core Thesis

RL is sequential decision-making under uncertainty, trading off exploration vs. exploitation with delayed feedback.

When to use RL

Clear goal, but no labeled data.
Actions influence future outcomes (robotics, games, pricing).
Interactive simulation available.

When NOT to use RL

Static prediction (use Supervised ML).
Clustering (use Unsupervised ML).
Data collection is expensive/dangerous.

Modern Trends

Deep RL (Neural Nets) unlocks high-dim spaces.
Offline RL & Imitation Learning address sample inefficiency.
Multi-Agent RL (MARL) for competitive/cooperative scenarios.

The MDP Framing

Every problem (Maze, Puzzle, Car, Trader) reduces to the same Agent, Environment, State, Action, Reward loop. The math (Bellman) stays constant; only the function approximator scales.

"RL is the first computational theory of intelligence."

Deep Dive End-to-End Masterclass