· Reinforcement Learning

Deep Dive
End-to-End Masterclass

From paradigms & equations to Python demos, mock data, and real-world applications — a complete RL presentation.

Q-Learning SARSA DQN Self-Driving Auto-Trader 3x3 Puzzle

Supervised vs RL vs Unsupervised

FeatureSupervised SLUnsupervised ULReinforcement RL
DataLabeled (X, y)Unlabeled raw dataTrajectories (State, Action, Reward)
FeedbackImmediate error (Loss)No explicit feedbackDelayed scalar Reward
GoalPredict labelsFind hidden structureMaximize cumulative reward
DecisionStatic (i.i.d)Static (patterns)Sequential (actions affect future)
AnalogyTeacher with answer keyLibrarian organizing booksTraining a dog with treats

Classic ML Examples: Decision Tree, Ensemble, K-Means

Decision Tree Supervised
Recursively split data on best feature (Gini/Entropy).
# Pseudocode
def build_tree(data, features):
    if pure or max_depth: return leaf
    best = select_best_split(data, features)
    tree = {best: {}}
    for val in unique(data[best]):
        sub = data[data[best] == val]
        tree[best][val] = build_tree(sub, features - {best})
    return tree

# Python (sklearn)
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
Ensemble (Random Forest) Supervised
Bootstrap samples + multiple trees + majority vote.
# Pseudocode
def random_forest(data, features, n):
    forest = []
    for i in range(n):
        boot = sample_with_replacement(data)
        tree = build_tree(boot, random_subset(features))
        forest.append(tree)
    return majority_vote(forest)

# Python (sklearn)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
preds = rf.predict(X_test)
K-Means Unsupervised
Cluster data by minimizing intra-cluster variance.
# Pseudocode
def kmeans(data, k):
    centroids = random_k_points(data, k)
    while not converged:
        clusters = assign_to_nearest(data, centroids)
        new_centroids = mean(clusters)
        if new_centroids == centroids: break
        centroids = new_centroids
    return centroids, clusters

# Python (sklearn)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)
labels = kmeans.labels_
centers = kmeans.cluster_centers_

Components of RL (MDP)

Agent
The learner / decision-maker.
Environment
External system (road, market, maze).
State (S)
Current situation (pixels, prices).
Action (A)
What the agent can do (steer, buy/sell).
Reward (R)
Immediate feedback (+/-).
Policy (π)
Strategy: State → Action mapping.

Approaches & Types (w/ Pseudocode)

Model-Free · Value-Based

Learn Q-values directly from experience.

# Q-Learning Pseudocode
Initialize Q(s,a) arbitrarily
For each episode:
    Initialize state s
    For each step:
        Choose a from s using ε-greedy(Q)
        Take a, observe r, s'
        Q(s,a) += α * (r + γ * max Q(s',a') - Q(s,a))
        s = s'
    Until s is terminal

Policy-Based · Actor-Critic

Learn policy directly, often with a Critic for variance reduction.

# Policy Gradient (REINFORCE) Pseudocode
Initialize policy π_θ
For each episode:
    Generate trajectory {s_t, a_t, r_t}
    For each step t:
        G = Σ γ^(k-t) r_k   # Return
        θ += α * ∇θ log π(a_t|s_t) * G
    Update policy
Model-Based: Dyna-Q, AlphaGo Model-Free: DQN, PPO, SAC Actor-Critic: A2C, DDPG

Core Equations

Bellman Expectation:
\( V^\pi(s) = \mathbb{E}_\pi [ R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s ] \)
Q-Learning Update (Off-policy):
\( Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \)
Bellman Optimality:
\( V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right] \)
Policy Gradient Objective:
\( \nabla J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla \log \pi_\theta(a|s) \, Q^\pi(s,a) \right] \)

Pros & Cons

Advantages

  • Tackles complex sequential decisions (robotics, games).
  • Superhuman performance (Go, StarCraft).
  • No need for labeled data; generates own experience.
  • Adapts to dynamic, non-linear environments.

Disadvantages

  • Sample inefficient: Millions of interactions needed.
  • Reward engineering: Poor rewards lead to catastrophic cheating.
  • Unstable training: Hyperparameter sensitive.
  • Safety: Dangerous exploration in real-world.

Lab Demo: Q-Learning · Maze Candy Problem

Q-Learning on a 4x4 grid. 🍬 Candy +10 🕳️ Pit -10

🚀
💀
🍬

Optimal Path: (0,0)→(1,0)→(2,0)→(2,1)→(2,2)→(3,2)→(3,3)

import numpy as np
import random

# 4x4 Grid: 0=Empty, 1=Candy, -1=Pit
grid = [[0,0,0,0],[0,-1,0,0],[0,0,0,0],[0,0,0,1]]
goal = (3,3); actions = [(-1,0),(1,0),(0,-1),(0,1)]
Q = np.zeros((16,4)); alpha, gamma, eps = 0.8, 0.95, 0.1

def step(s, a):
    r,c = divmod(s,4); dr,dc = actions[a]
    nr,nc = max(0,min(3,r+dr)), max(0,min(3,c+dc))
    ns = nr*4+nc
    reward = -1
    if grid[nr][nc] == 1: reward = 10
    if grid[nr][nc] == -1: reward = -10
    return ns, reward, grid[nr][nc] != 0

for ep in range(500):
    s = 0; done = False
    while not done:
        a = random.randint(0,3) if random.random() < eps else np.argmax(Q[s])
        ns, r, done = step(s, a)
        Q[s,a] += alpha * (r + gamma * np.max(Q[ns]) - Q[s,a])
        s = ns

print("Trained Q-Table (first 5 states):\n", Q[:5])
# Test: Greedy path
s, path = 0, [(0,0)]
while s != 15:
    a = np.argmax(Q[s]); s, _, _ = step(s, a)
    path.append(divmod(s,4))
print("Path:", path)  # Optimal route!

Lab Demo: SARSA & Deep Q-Network (DQN)

On‑policy SARSA and Deep Q‑Learning with neural network on the same Candy Maze.

SARSA (On‑policy)

Update rule uses the actual next action \(a'\) (not max).
# SARSA update (replace Q-learning update)
# Choose a' using same ε-greedy policy
a_prime = epsilon_greedy(Q, s_prime, eps)
Q[s,a] += alpha * (r + gamma * Q[s_prime, a_prime] - Q[s,a])

# Full training loop (same environment)
for ep in range(500):
    s = 0; a = epsilon_greedy(Q, s, eps)
    done = False
    while not done:
        ns, r, done = step(s, a)
        a_next = epsilon_greedy(Q, ns, eps)
        Q[s,a] += alpha * (r + gamma * Q[ns, a_next] - Q[s,a])
        s, a = ns, a_next
# Path will be slightly different due to on-policy nature.

Deep Q-Network (DQN)

Neural network approximates Q(s,a) for large state spaces. Uses experience replay & target network.
import torch, torch.nn as nn, torch.optim as optim
import numpy as np
from collections import deque

# Simple NN: input 16 (one-hot state) + 4 actions? Better: state -> 4 Q-values
class DQN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(16, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 4)
        )
    def forward(self, x): return self.fc(x)

# State representation: one-hot of state index (16)
def state_to_tensor(s): 
    x = np.zeros(16); x[s]=1; return torch.tensor(x, dtype=torch.float32)

model = DQN(); target = DQN(); target.load_state_dict(model.state_dict())
optimizer = optim.Adam(model.parameters(), lr=0.001)
replay = deque(maxlen=2000); batch_size=32; gamma=0.95; eps=0.1

for ep in range(500):
    s = 0; done = False
    while not done:
        # ε-greedy using model
        if np.random.rand() < eps:
            a = np.random.randint(4)
        else:
            with torch.no_grad():
                q_vals = model(state_to_tensor(s))
                a = torch.argmax(q_vals).item()
        ns, r, done = step(s, a)
        replay.append((s, a, r, ns, done))
        s = ns
        # Train if enough samples
        if len(replay) >= batch_size:
            batch = np.random.choice(len(replay), batch_size, replace=False)
            # ... compute loss (MSE) and update model periodically
            # Target network update every 10 episodes
    # Update target network every 10 episodes
    if ep % 10 == 0: target.load_state_dict(model.state_dict())

# After training, greedy path with model

Both algorithms solve the maze; DQN is essential when state space is large (e.g., raw pixels). SARSA is on‑policy, more conservative.

Mock Data Samples (5 Records)

#DomainStateActionRewardNext State / Info
1MazeState 5 (row=1, col=1)Down-1State 9 (row=2, col=1)
2MazeState 10 (row=2, col=2)Right+10Goal! (Candy found)
3Self-DrivingSteering=0.2, Speed=12, Lane=0.1Throttle 0.3+5.2Lane=0.05, Speed=13
4FX TraderRSI=72, Price=1.1050, Vol=12SELL (size 0.5)+$23.40Price drops to 1.1042
53x3 PuzzleBoard [1,2,3,4,0,5,6,7,8]Blank Down-1Board [1,2,3,4,7,5,6,0,8]

These samples represent typical transitions logged during RL training — used by Q‑Learning, SARSA, and DQN.

Applications Deep-Dive

3x3 Sliding Puzzle
State: 9-digit array (181k states).
Action: Move blank (Up/Down/Left/Right).
Reward: -1 per step, +100 if solved.
Algo: DQN (Deep Q-Network) for large state space.
Self-Driving Car
State: LIDAR / Camera pixels.
Action: Continuous (Steering, Accel).
Reward: Distance - Collision - Deviation.
Algo: PPO / SAC (continuous action spaces).
Auto-FX Trader
State: Price window + Indicators (RSI, MACD).
Action: Buy/Sell/Hold (discrete) or size (continuous).
Reward: Realized PnL - Transaction Costs.
Algo: Recurrent PPO (LSTM + PPO) for time-series.

Summary & Takeaways

Core Thesis

RL is sequential decision-making under uncertainty, trading off exploration vs. exploitation with delayed feedback.


When to use RL

  • Clear goal, but no labeled data.
  • Actions influence future outcomes (robotics, games, pricing).
  • Interactive simulation available.

When NOT to use RL

  • Static prediction (use Supervised ML).
  • Clustering (use Unsupervised ML).
  • Data collection is expensive/dangerous.

Modern Trends

  • Deep RL (Neural Nets) unlocks high-dim spaces.
  • Offline RL & Imitation Learning address sample inefficiency.
  • Multi-Agent RL (MARL) for competitive/cooperative scenarios.

The MDP Framing

Every problem (Maze, Puzzle, Car, Trader) reduces to the same Agent, Environment, State, Action, Reward loop. The math (Bellman) stays constant; only the function approximator scales.

"RL is the first computational theory of intelligence."