Reinforcement Learning (RL)
What is Reinforcement Learning?
Reinforcement Learning (RL) is a branch of machine learning where an agent learns by interacting with an environment and making decisions over time. The agent receives rewards or penalties as feedback and aims to learn an optimal policy that maximizes the cumulative (long-term) reward.
Unlike supervised learning, RL does not use labeled data, and unlike unsupervised learning, it focuses on goal-directed behavior through sequential decision-making.
A core challenge in RL is the exploration vs. exploitation trade-off (illustrated in the short sketch after this list):
- Exploration: Trying new actions to discover better rewards
- Exploitation: Using known actions that yield high rewards
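As a concrete illustration, here is a minimal ε-greedy selection rule in Python; the function name and example values are illustrative, not taken from any particular library:

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon, explore a random action; otherwise exploit the best-known one
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))   # exploration
    return int(np.argmax(q_values))               # exploitation

# Example: estimated action values for one state; action 1 is selected most of the time
print(epsilon_greedy(np.array([0.2, 0.5, 0.1]), epsilon=0.1))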
How Reinforcement Learning Works
Core Components
- Agent: Learner or decision-maker
- Environment: External system the agent interacts with
- State (S): Current situation of the environment
- Action (A): Choices available to the agent
- Reward (R): Feedback signal after taking an action
- Policy (π): Strategy mapping states to actions
- Value Function (V / Q): Expected future reward
- Model (optional): Agent’s understanding of environment dynamics
Learning Process
- Observation: Agent observes the current state
- Action Selection: Agent selects an action using its policy
- Interaction: Action is executed in the environment
- Reward: Agent receives a reward and next state
- Policy Update: Policy/value function is updated
- Iteration: The process repeats until convergence or termination (a minimal interaction-loop sketch follows)
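A minimal sketch of this loop, using a random policy as a stand-in for a learned one; it assumes a recent gym version whose reset() returns (observation, info) and whose step() returns (observation, reward, terminated, truncated, info):

import gym

env = gym.make("CartPole-v1")
state, _ = env.reset()                      # 1. observe the initial state

for t in range(100):
    action = env.action_space.sample()      # 2. select an action (random policy as a placeholder)
    next_state, reward, terminated, truncated, _ = env.step(action)   # 3-4. interact and receive a reward
    # 5. a real agent would update its policy / value function here
    state = next_state                      # 6. iterate
    if terminated or truncated:
        state, _ = env.reset()

env.close()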
Types of Reinforcement Learning
1. Value-Based Methods
- Learn value functions to guide action selection
- Aim to maximize expected future rewards
Examples:
- Q-Learning
- SARSA
2. Policy-Based Methods
- Learn policies directly without estimating value functions
- Suitable for continuous action spaces
Examples:
- Policy Gradient
- REINFORCE
3. Actor-Critic Methods
- Combine both value-based (critic) and policy-based (actor) approaches
- More stable and sample-efficient
Examples:
- A2C (Advantage Actor-Critic)
- A3C (Asynchronous Advantage Actor-Critic)
- DDPG (Deep Deterministic Policy Gradient)
4. Model-Based Methods
- Learn a model of the environment
- Use planning and simulation (a short Dyna-Q sketch follows the examples below)
Examples:
- Dyna-Q
- Monte Carlo Tree Search (MCTS)
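Neither Dyna-Q nor MCTS is covered by the worked examples later in this article, so here is a rough Dyna-Q sketch; it assumes a tabular environment exposing step(state, action) -> (next_state, reward, done), such as the GridWorld class defined in Example 1 below:

import random
import numpy as np

def dyna_q(env, n_states, n_actions, episodes=500, planning_steps=10,
           lr=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    model = {}                                   # learned model: (s, a) -> (reward, s')
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = random.randrange(n_actions) if random.random() < epsilon else int(np.argmax(Q[s]))
            s2, r, done = env.step(s, a)         # real experience
            Q[s, a] += lr * (r + gamma * np.max(Q[s2]) - Q[s, a])
            model[(s, a)] = (r, s2)              # update the learned model
            for _ in range(planning_steps):      # planning: replay simulated transitions from the model
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                Q[ps, pa] += lr * (pr + gamma * np.max(Q[ps2]) - Q[ps, pa])
            s = s2
    return Q

# Example usage (with the GridWorld class from Example 1): Q = dyna_q(GridWorld(size=4), 16, 4)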
Key Characteristics of Reinforcement Learning
- Interactive Learning: Agent learns through trial and error
- Reward-Driven: Learning guided by rewards, not labels
- Sequential Decision-Making: Actions affect future states
- Exploration vs Exploitation: Core RL challenge
- Temporal Credit Assignment: Delayed rewards must be attributed to the earlier actions that caused them
- No Labeled Data: No predefined correct outputs
Common Reinforcement Learning Algorithms
1. Q-Learning
- Learns optimal action-value function (Q-function)
- Off-policy: Learns the value of the greedy policy while following an exploratory behavior policy
- Updates are based on the Bellman optimality equation (see the one-line sketch below)
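Written out, the tabular update that Example 1 implements looks like this; all values below are placeholders for a single transition (s, a, r, s'):

import numpy as np

# Illustrative one-step Q-learning update: 16 states, 4 actions, placeholder transition
Q = np.zeros((16, 4))
alpha, gamma = 0.1, 0.9                     # learning rate and discount factor
s, a, r, next_s = 0, 3, -1.0, 1             # one observed transition (s, a, r, s')
Q[s, a] += alpha * (r + gamma * np.max(Q[next_s]) - Q[s, a])   # off-policy: bootstraps from the max over next actions
print(Q[s, a])                              # -0.1 for this initial, all-zero table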
2. SARSA
- On-policy counterpart of Q-learning
- Updates use the action actually taken by the current policy
- More conservative, but often safer while exploring
3. Deep Q-Networks (DQN)
- Uses deep neural networks to approximate Q-values
- Key techniques:
- Experience Replay
- Target Networks
4. Policy Gradient Methods
- Optimize policy directly using gradient ascent
- Effective for continuous action spaces
5. Actor-Critic Methods
- Actor learns policy
- Critic evaluates actions using value function
- Reduces variance and improves stability
✅ Example 1: Q-Learning for Grid World Navigation
✔ What this demonstrates
- Tabular Q-Learning
- Discrete state/action space
- Policy visualization
import numpy as np
import matplotlib.pyplot as plt
import random

# -------------------------
# Grid World Environment
# -------------------------
class GridWorld:
    def __init__(self, size=4):
        self.size = size
        self.states = size * size
        self.actions = 4  # up, down, left, right
        self.goal = self.states - 1
        self.obstacles = [5, 7]

    def get_state(self, row, col):
        return row * self.size + col

    def step(self, state, action):
        row, col = divmod(state, self.size)
        if action == 0: row -= 1    # up
        elif action == 1: row += 1  # down
        elif action == 2: col -= 1  # left
        elif action == 3: col += 1  # right
        row = max(0, min(self.size - 1, row))
        col = max(0, min(self.size - 1, col))
        next_state = self.get_state(row, col)
        reward = -1
        done = False
        if next_state in self.obstacles:
            reward = -10
        elif next_state == self.goal:
            reward = 10
            done = True
        return next_state, reward, done
# -------------------------
# Q-Learning Agent
# -------------------------
class QLearningAgent:
    def __init__(self, states, actions, lr=0.1, gamma=0.9, epsilon=0.1):
        self.q_table = np.zeros((states, actions))
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon
        self.actions = actions

    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            return random.randint(0, self.actions - 1)
        return np.argmax(self.q_table[state])

    def learn(self, state, action, reward, next_state):
        predict = self.q_table[state, action]
        target = reward + self.gamma * np.max(self.q_table[next_state])
        self.q_table[state, action] += self.lr * (target - predict)
# -------------------------
# Training
# -------------------------
env = GridWorld(size=4)
agent = QLearningAgent(states=env.states, actions=env.actions)
episodes = 1000

for episode in range(episodes):
    state = 0
    total_reward = 0
    while True:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(state, action)
        agent.learn(state, action, reward, next_state)
        state = next_state
        total_reward += reward
        if done:
            break
    if episode % 100 == 0:
        print(f"Episode {episode}, Reward: {total_reward}")
# -------------------------
# Visualize Policy
# -------------------------
policy = np.argmax(agent.q_table, axis=1)
policy_grid = policy.reshape(env.size, env.size)

plt.figure(figsize=(6, 6))
for i in range(env.size):
    for j in range(env.size):
        s = env.get_state(i, j)
        if s == env.goal:
            plt.text(j, i, "G", ha="center", va="center", fontsize=16)
        elif s in env.obstacles:
            plt.text(j, i, "X", ha="center", va="center", fontsize=16)
        else:
            arrows = ['↑', '↓', '←', '→']
            plt.text(j, i, arrows[policy_grid[i, j]], ha="center", va="center")
plt.xlim(-0.5, env.size - 0.5)   # text does not autoscale the axes, so set limits explicitly
plt.ylim(-0.5, env.size - 0.5)
plt.grid()
plt.title("Learned Policy (Q-Learning)")
plt.gca().invert_yaxis()
plt.show()
Sample training output:
Episode 0, Reward: -132
Episode 100, Reward: 5
Episode 200, Reward: 5
Episode 300, Reward: 5
Episode 400, Reward: 5
Episode 500, Reward: 3
Episode 600, Reward: 5
Episode 700, Reward: 5
Episode 800, Reward: 5
Episode 900, Reward: 5
✅ Example 2: Deep Q-Network (DQN) for CartPole
✔ What this demonstrates
- Deep Q-Learning
- Experience replay
- Neural network function approximation
import gym
import numpy as np
import random
from collections import deque
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)   # experience replay buffer
        self.gamma = 0.95                  # discount factor
        self.epsilon = 1.0                 # initial exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.model = self._build_model()

    def _build_model(self):
        model = Sequential([
            Dense(24, input_dim=self.state_size, activation='relu'),
            Dense(24, activation='relu'),
            Dense(self.action_size, activation='linear')
        ])
        model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
        return model

    def act(self, state):
        # epsilon-greedy action selection
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        return np.argmax(self.model.predict(state, verbose=0)[0])

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def replay(self, batch_size):
        # sample a random minibatch and fit Q-values toward the Bellman target
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target += self.gamma * np.amax(self.model.predict(next_state, verbose=0)[0])
            target_f = self.model.predict(state, verbose=0)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
# -------------------------
# Training
# -------------------------
env = gym.make("CartPole-v1")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = DQNAgent(state_size, action_size)

episodes = 200
batch_size = 32

for e in range(episodes):
    state, _ = env.reset()
    state = np.reshape(state, [1, state_size])
    total_reward = 0
    for time in range(500):
        action = agent.act(state)
        # Gym's newer API returns (obs, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
        if done:
            print(f"Episode {e}, Score: {total_reward}, Epsilon: {agent.epsilon:.2f}")
            break
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)

env.close()
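The agent above uses experience replay but, for brevity, not the target network listed under "Key techniques". Below is a hedged sketch of how one could be bolted onto this agent; the subclass, attribute, and parameter names are my own, and the code assumes the DQNAgent class and imports from the example above are in scope:

import random
import numpy as np

class DQNAgentWithTarget(DQNAgent):          # reuses the DQNAgent class defined above
    def __init__(self, state_size, action_size, target_update_every=10):
        super().__init__(state_size, action_size)
        self.target_model = self._build_model()
        self.target_model.set_weights(self.model.get_weights())
        self.target_update_every = target_update_every
        self.replay_count = 0

    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                # bootstrap from the slowly-updated target network, not the online network
                target += self.gamma * np.amax(self.target_model.predict(next_state, verbose=0)[0])
            target_f = self.model.predict(state, verbose=0)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
        # periodically copy online weights into the target network to stabilize learning
        self.replay_count += 1
        if self.replay_count % self.target_update_every == 0:
            self.target_model.set_weights(self.model.get_weights())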
✅ Example 3: Policy Gradient (REINFORCE) for MountainCar
✔ What this demonstrates
- Policy Gradient method
- Stochastic policy learning
- Continuous state space
import gym
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

class PolicyGradientAgent:
    def __init__(self, state_size, action_size, gamma=0.99):
        self.gamma = gamma
        self.model = self._build_model(state_size, action_size)
        self.optimizer = tf.keras.optimizers.Adam(0.01)

    def _build_model(self, state_size, action_size):
        model = tf.keras.Sequential([
            layers.Dense(24, activation='relu', input_shape=(state_size,)),
            layers.Dense(24, activation='relu'),
            layers.Dense(action_size, activation='softmax')
        ])
        return model

    def select_action(self, state):
        probs = self.model(np.expand_dims(state, axis=0)).numpy()[0]
        probs = probs / probs.sum()  # renormalize to guard against float32 rounding
        return np.random.choice(len(probs), p=probs)

    def train(self, states, actions, rewards):
        discounted_rewards = self._discount_rewards(rewards)
        with tf.GradientTape() as tape:
            probs = self.model(states)
            action_masks = tf.one_hot(actions, probs.shape[1])
            log_probs = tf.reduce_sum(action_masks * tf.math.log(probs + 1e-10), axis=1)
            loss = -tf.reduce_mean(log_probs * discounted_rewards)
        grads = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))

    def _discount_rewards(self, rewards):
        # compute discounted returns, then normalize to reduce gradient variance
        discounted = np.zeros_like(rewards, dtype=np.float32)
        running = 0
        for t in reversed(range(len(rewards))):
            running = running * self.gamma + rewards[t]
            discounted[t] = running
        discounted = (discounted - discounted.mean()) / (discounted.std() + 1e-8)
        return discounted
# -------------------------
# Training
# -------------------------
env = gym.make("MountainCar-v0")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = PolicyGradientAgent(state_size, action_size)
episodes = 1000

for episode in range(episodes):
    state, _ = env.reset()
    states, actions, rewards = [], [], []
    total_reward = 0
    while True:
        action = agent.select_action(state)
        # Gym's newer API returns (obs, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        total_reward += reward
        state = next_state
        if done:
            break
    agent.train(np.array(states), np.array(actions), np.array(rewards))
    if episode % 50 == 0:
        print(f"Episode {episode}, Reward: {total_reward}")

env.close()
✅ Example 4: Actor–Critic (A2C-style) for Acrobot
✔ Key concepts demonstrated
- Actor–Critic architecture
- Temporal Difference (TD) learning
- Continuous state, discrete action space
import gym
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# -------------------------
# Actor-Critic Agent
# -------------------------
class ActorCriticAgent:
    def __init__(self, state_size, action_size, gamma=0.99):
        self.gamma = gamma
        self.action_size = action_size
        # Actor network: outputs action probabilities
        self.actor = tf.keras.Sequential([
            layers.Dense(64, activation='relu', input_shape=(state_size,)),
            layers.Dense(64, activation='relu'),
            layers.Dense(action_size, activation='softmax')
        ])
        # Critic network: outputs the state value V(s)
        self.critic = tf.keras.Sequential([
            layers.Dense(64, activation='relu', input_shape=(state_size,)),
            layers.Dense(64, activation='relu'),
            layers.Dense(1)
        ])
        self.actor_optimizer = tf.keras.optimizers.Adam(0.001)
        self.critic_optimizer = tf.keras.optimizers.Adam(0.002)

    def get_action(self, state):
        state = np.expand_dims(state, axis=0)
        probs = self.actor(state).numpy()[0]
        probs = probs / probs.sum()  # renormalize to guard against float32 rounding
        return np.random.choice(self.action_size, p=probs)

    def train(self, state, action, reward, next_state, done):
        state = np.expand_dims(state, axis=0)
        next_state = np.expand_dims(next_state, axis=0)
        with tf.GradientTape(persistent=True) as tape:
            # Values
            state_value = self.critic(state)[0, 0]
            next_state_value = self.critic(next_state)[0, 0]
            # TD target is treated as fixed (semi-gradient TD), hence the stop_gradient
            td_target = tf.stop_gradient(reward + self.gamma * next_state_value * (1 - int(done)))
            td_error = td_target - state_value
            # Actor loss: log-probability of the taken action weighted by the TD error
            action_probs = self.actor(state)[0]
            action_one_hot = tf.one_hot(action, self.action_size)
            actor_loss = -tf.math.log(tf.reduce_sum(action_probs * action_one_hot) + 1e-10) * td_error
            # Critic loss: squared TD error
            critic_loss = td_error ** 2
        # Apply gradients
        actor_grads = tape.gradient(actor_loss, self.actor.trainable_variables)
        critic_grads = tape.gradient(critic_loss, self.critic.trainable_variables)
        self.actor_optimizer.apply_gradients(zip(actor_grads, self.actor.trainable_variables))
        self.critic_optimizer.apply_gradients(zip(critic_grads, self.critic.trainable_variables))
        del tape
        return td_error.numpy()
# -------------------------
# Training
# -------------------------
env = gym.make("Acrobot-v1")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = ActorCriticAgent(state_size, action_size)

episodes = 500
max_steps = 500

for episode in range(episodes):
    state, _ = env.reset()
    total_reward = 0
    for step in range(max_steps):
        action = agent.get_action(state)
        # Gym's newer API returns (obs, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        td_error = agent.train(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
        if done:
            break
    if episode % 50 == 0:
        print(f"Episode {episode}, Reward: {total_reward}, TD Error: {td_error:.2f}")

env.close()
✅ Example 5: SARSA for FrozenLake
✔ Key concepts demonstrated
- On-policy TD control
- Exploration vs exploitation
- Discrete state & action spaces
import gym
import numpy as np

# -------------------------
# SARSA Agent
# -------------------------
class SARSAAgent:
    def __init__(self, states, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q_table = np.zeros((states, actions))
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon

    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.q_table.shape[1])
        return np.argmax(self.q_table[state])

    def learn(self, state, action, reward, next_state, next_action):
        predict = self.q_table[state, action]
        target = reward + self.gamma * self.q_table[next_state, next_action]
        self.q_table[state, action] += self.alpha * (target - predict)
# -------------------------
# Training
# -------------------------
env = gym.make("FrozenLake-v1", is_slippery=True)
agent = SARSAAgent(
    states=env.observation_space.n,
    actions=env.action_space.n
)

episodes = 10000
max_steps = 100

for episode in range(episodes):
    state, _ = env.reset()
    action = agent.choose_action(state)
    total_reward = 0
    for step in range(max_steps):
        # Gym's newer API returns (obs, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_action = agent.choose_action(next_state)
        agent.learn(state, action, reward, next_state, next_action)
        state = next_state
        action = next_action
        total_reward += reward
        if done:
            break
    if episode % 1000 == 0:
        print(f"Episode {episode}, Reward: {total_reward}")

env.close()
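Because FrozenLake's reward is sparse (1 only on reaching the goal), the per-episode rewards printed above say little on their own. One way to check what was learned is to evaluate the greedy policy over many episodes; the snippet below is a sketch that assumes the trained agent from the code above is still in scope and re-creates the environment since it was closed:

# Evaluate the greedy policy (no exploration) over many episodes
eval_env = gym.make("FrozenLake-v1", is_slippery=True)
successes = 0
eval_episodes = 1000
for _ in range(eval_episodes):
    state, _ = eval_env.reset()
    for _ in range(100):
        action = int(np.argmax(agent.q_table[state]))   # purely greedy action
        state, reward, terminated, truncated, _ = eval_env.step(action)
        if terminated or truncated:
            successes += reward                          # reward is 1 only on reaching the goal
            break
eval_env.close()
print(f"Greedy success rate: {successes / eval_episodes:.2%}")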
🌍 Reinforcement Learning: Real-World Applications
1. Autonomous Driving 🚗
Input:
- Camera images
- LiDAR & radar data
- Vehicle speed, lane position
Output:
- Steering angle
- Acceleration / braking
Algorithms:
- Deep Q-Networks (DQN)
- Policy Gradient methods (PPO, DDPG)
Use Case:
Self-driving cars learn safe navigation, lane keeping, and obstacle avoidance through continuous interaction with simulated and real environments.
2. Game Playing (AlphaGo, Chess AI) ♟️
Input:
- Board state or game configuration
Output:
- Optimal move selection
Algorithms:
- Monte Carlo Tree Search (MCTS)
- Deep Q-Networks
- Policy + Value Networks
Use Case:
Agents master complex games via self-play, learning strategies beyond human intuition.
3. Robotics & Control Systems 🤖
Input:
- Sensor readings
- Joint angles, velocities
Output:
- Motor commands
Algorithms:
- Actor–Critic
- Policy Gradients
Use Case:
Robots learn grasping, walking, and manipulation tasks through trial and error instead of explicit programming.
4. Recommendation Systems 🎯
Input:
- User interaction history
- Item features
- Context (time, device, location)
Output:
- Personalized recommendations
Algorithms:
- Contextual Bandits
- Q-Learning
Use Case:
Netflix, Amazon, YouTube adapt recommendations in real time based on user feedback.
5. Resource Management in Data Centers ⚙️
Input:
- Server load
- Energy usage
- Temperature data
Output:
- Resource allocation decisions
Algorithms:
- Deep reinforcement learning control policies
Use Case:
Google reduced energy usage in data center cooling systems using RL-based control policies.
✅ Best Practices & Key Considerations in RL
1. Exploration vs Exploitation
- Use ε-greedy, softmax, or UCB
- Reduce exploration gradually as learning improves (see the decay sketch below)
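A common way to reduce exploration gradually is an exponential ε decay, as the DQN example above already does; a standalone sketch with illustrative schedule values:

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

for episode in range(1000):
    # ... run one episode with epsilon-greedy action selection here ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)   # decay after every episode

print(f"Final epsilon: {epsilon:.3f}")   # clamped at the 0.01 floor after roughly 920 episodes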
2. Reward Shaping
- Design rewards that guide desired behavior
- Avoid overly sparse or misleading rewards
3. Experience Replay
- Store transitions (state, action, reward, next_state)
- Sample random minibatches to break temporal correlations and improve stability and efficiency
4. Target Networks
- Use separate target networks in DQN
- Update periodically to stabilize learning
5. Normalization
- Normalize inputs and rewards
- Prevent unstable gradients and slow convergence (see the running-statistics sketch below)
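One simple approach is to normalize observations with running statistics before feeding them to the network. The class below is an illustrative sketch (the name RunningNormalizer and the Welford-style update are my own choices, not from a specific library):

import numpy as np

class RunningNormalizer:
    """Tracks a running mean/variance of observations and normalizes new ones."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.eps = eps

    def normalize(self, x):
        # Welford-style incremental update of mean and variance
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count
        return (x - self.mean) / (np.sqrt(self.var) + self.eps)

# Usage: obs = normalizer.normalize(raw_obs) before feeding the network
normalizer = RunningNormalizer(shape=(4,))
print(normalizer.normalize(np.array([0.1, -0.2, 0.05, 0.3])))   # the first sample normalizes to ~0 by construction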
6. Hyperparameter Tuning
Key parameters:
- Learning rate
- Discount factor (γ)
- Exploration rate (ε)
Use grid search or Bayesian optimization (a small grid-search sketch follows).
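As a sketch, a grid search over these three parameters could look like the following; it assumes the GridWorld and QLearningAgent classes from Example 1 are in scope, and uses the average reward of the last 100 training episodes as the selection metric (one reasonable choice among many):

import itertools
import numpy as np

def run_q_learning(lr, gamma, epsilon, episodes=500):
    env = GridWorld(size=4)
    agent = QLearningAgent(env.states, env.actions, lr=lr, gamma=gamma, epsilon=epsilon)
    returns = []
    for _ in range(episodes):
        state, total, done = 0, 0, False
        while not done:
            action = agent.choose_action(state)
            next_state, reward, done = env.step(state, action)
            agent.learn(state, action, reward, next_state)
            state, total = next_state, total + reward
        returns.append(total)
    return np.mean(returns[-100:])           # average reward over the final 100 episodes

grid = itertools.product([0.05, 0.1, 0.5],   # learning rates
                         [0.9, 0.99],        # discount factors
                         [0.05, 0.1, 0.3])   # exploration rates
best = max(grid, key=lambda params: run_q_learning(*params))
print("Best (lr, gamma, epsilon):", best)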
7. Environment Design
- Use realistic simulations
- Ensure environments reflect real-world constraints
8. Safety Considerations ⚠️
- Apply safety constraints
- Test extensively in simulation before deployment
- Add fallback policies for critical systems
