
Learning to Play 2048

I built an RL agent that learns to play 2048 using Stable-Baselines3 (SAC & A2C) on a custom Gym-style environment.

  • Stack: Python, Stable-Baselines3 (SAC/A2C), NumPy
  • Focus: Reward shaping, training stability, evaluation
  • Outcome: Reached the 512 tile (two doublings short of 2048)
Training graph

Motivation

2048 looks simple, but it’s a great testbed for RL: stochastic tile spawns, sparse/delayed rewards, and a long horizon. My goal was to build a clean environment, compare algorithms, and iterate on a reward that encourages long-term board health over short-term merges.

Repo Map

  • game_2048_env.py: Custom environment and game logic
  • train_sac_sb3.py: SAC training pipeline (SB3)
  • train_a2c_sb3.py: A2C training pipeline (SB3)
  • conf/train_sb3.yaml: Config for training runs
  • use_model.py: Load and run a trained agent
  • play_human.py: Play the game with keyboard controls

Journey

1) Building the Environment

I implemented a Gym-like environment that exposes a compact state and 4 discrete actions (up, down, left, right). The first milestone was getting the rules correct and efficient: merging tiles once per move, handling random spawns, and detecting terminal states.
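For concreteness, here is a minimal sketch of such an environment using the classic Gym API; the class and method names are illustrative and may differ from the actual game_2048_env.py.

import numpy as np
import gym
from gym import spaces


class Game2048Env(gym.Env):
    """4x4 board; actions: 0=up, 1=down, 2=left, 3=right."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(4)
        # Compact state: log2-encoded tiles, 0 for empty cells.
        self.observation_space = spaces.Box(low=0, high=16, shape=(16,), dtype=np.float32)
        self.board = np.zeros((4, 4), dtype=np.int64)

    def reset(self):
        self.board[:] = 0
        self._spawn_tile()
        self._spawn_tile()
        return self._obs()

    def step(self, action):
        moved, merge_score = self._apply_move(int(action))
        if moved:
            self._spawn_tile()                    # stochastic spawn after a legal move
        done = not self._any_move_possible()      # terminal: no legal move left
        return self._obs(), float(merge_score), done, {}

    def _obs(self):
        return np.log2(np.maximum(self.board, 1)).astype(np.float32).flatten()

    def _spawn_tile(self):
        empty = np.argwhere(self.board == 0)
        r, c = empty[np.random.randint(len(empty))]
        self.board[r, c] = 2 if np.random.rand() < 0.9 else 4

    def _slide_row_left(self, row):
        # Slide and merge one row to the left; each tile merges at most once.
        tiles = [t for t in row if t != 0]
        out, score, i = [], 0, 0
        while i < len(tiles):
            if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
                out.append(tiles[i] * 2)
                score += tiles[i] * 2
                i += 2
            else:
                out.append(tiles[i])
                i += 1
        out += [0] * (4 - len(out))
        return np.array(out, dtype=np.int64), score

    def _apply_move(self, action):
        # Rotate so every move reduces to "slide left", then rotate back.
        k = {0: 1, 1: 3, 2: 0, 3: 2}[action]      # up, down, left, right
        rotated = np.rot90(self.board, k)
        rows, score = [], 0
        for row in rotated:
            new_row, s = self._slide_row_left(row)
            rows.append(new_row)
            score += s
        new_board = np.rot90(np.array(rows), -k)
        moved = not np.array_equal(new_board, self.board)
        self.board = new_board
        return moved, score

    def _any_move_possible(self):
        if (self.board == 0).any():
            return True
        # A merge remains possible if any adjacent tiles are equal.
        return bool((self.board[:, :-1] == self.board[:, 1:]).any()
                    or (self.board[:-1, :] == self.board[1:, :]).any())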

Early board-state visualizations from the environment.

2) Reward Shaping

Naive rewards (raw score increases alone) led to greedy behavior and early dead-ends. I iterated on a composite signal that balances immediate merges with long-term board health; a sketch of how the pieces combine follows the list.

  • Merges/Score: Reward tile merges to provide a dense learning signal.
  • Empty Tiles: Encourage keeping space on the board for future moves.
  • Monotonicity/Smoothness: Prefer ordered boards that reduce deadlocks.
  • Death Penalty: Penalize terminal states to prefer survivability.
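A sketch of how these terms might combine, with illustrative weights; the repository's actual weights and exact terms may differ.

import numpy as np

def shaped_reward(board, merge_score, done,
                  w_merge=1.0, w_empty=0.1, w_mono=0.05, death_penalty=10.0):
    # Dense merge signal, damped so huge merges don't dominate.
    reward = w_merge * np.log1p(merge_score)
    # Board health: free cells and ordered (monotone) rows/columns.
    reward += w_empty * np.count_nonzero(board == 0)
    reward += w_mono * monotonicity(board)
    # Terminal penalty: prefer surviving over a slightly higher final merge.
    if done:
        reward -= death_penalty
    return float(reward)

def monotonicity(board):
    # Count adjacent pairs that are non-increasing left-to-right and top-to-bottom.
    rows = (board[:, :-1] >= board[:, 1:]).sum()
    cols = (board[:-1, :] >= board[1:, :]).sum()
    return float(rows + cols)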

3) Algorithms & Training

I compared A2C and SAC from Stable-Baselines3. A2C trained quickly but plateaued; SAC delivered steadier improvements with the shaped reward. Runs were configured via conf/train_sb3.yaml, as sketched below.
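A minimal sketch of the training entry point, assuming illustrative config keys (the actual conf/train_sb3.yaml schema may differ); the A2C version is shown, and train_sac_sb3.py follows the same pattern with SAC.

import yaml
from stable_baselines3 import A2C

from game_2048_env import Game2048Env   # custom environment from this repo

with open("conf/train_sb3.yaml") as f:
    cfg = yaml.safe_load(f)

env = Game2048Env()
model = A2C(
    "MlpPolicy",
    env,
    learning_rate=cfg.get("learning_rate", 7e-4),
    ent_coef=cfg.get("ent_coef", 0.01),   # entropy bonus encourages exploration
    gamma=cfg.get("gamma", 0.99),
    verbose=1,
)
model.learn(total_timesteps=cfg.get("total_timesteps", 500_000))
model.save(cfg.get("model_path", "a2c_2048"))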

4) Evaluation & Results

I evaluated average episode score, peak tile, and survival length across multiple seeds (the evaluation loop is sketched after the results).

  • Peak Tile: 512
  • Avg Score: 2500
  • Survival Steps: 1000
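The multi-seed evaluation loop referenced above roughly follows this shape; function and attribute names are illustrative assumptions.

import numpy as np
from stable_baselines3 import A2C

from game_2048_env import Game2048Env

def evaluate(model_path, seeds=(0, 1, 2, 3, 4)):
    model = A2C.load(model_path)          # SAC.load for the SAC runs
    scores, peak_tiles, lengths = [], [], []
    for seed in seeds:
        np.random.seed(seed)              # tile spawns in this sketch use np.random
        env = Game2048Env()
        obs, done, total, steps = env.reset(), False, 0.0, 0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _ = env.step(int(action))
            total += reward
            steps += 1
        scores.append(total)
        peak_tiles.append(int(env.board.max()))
        lengths.append(steps)
    return np.mean(scores), max(peak_tiles), np.mean(lengths)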
Example run: the agent reached the 256 tile here.

Hardships & Fixes

  • Unstable Learning: Fixed by reward rebalancing and entropy tuning.
  • Sparse Feedback: Added auxiliary signals for denser guidance.
  • Exploration vs. Survival: Tuned action noise/entropy and terminal penalties.
  • Overfitting Seeds: Evaluated across multiple seeds and tracked variance.

What I Learned

  • Reward shaping can unlock learning when signals are sparse.
  • Stability often beats peak performance for practical results.
  • Clear evaluation (metrics + seeds) keeps progress honest.
  • Small, testable changes beat sweeping rewrites.

How to Run

# Train SAC
python train_sac_sb3.py --config conf/train_sb3.yaml

# Train A2C
python train_a2c_sb3.py --config conf/train_sb3.yaml

# Use a trained model (a sketch of this script follows the command block)
python use_model.py --model_path <your_model.zip>

# Play as a human
python play_human.py
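For reference, a hypothetical sketch of what use_model.py might look like internally; only the --model_path flag is taken from the command above, the rest is assumed.

import argparse

from stable_baselines3 import A2C

from game_2048_env import Game2048Env

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", required=True, help="Path to a saved SB3 model (.zip)")
args = parser.parse_args()

model = A2C.load(args.model_path)   # use SAC.load for models trained with SAC
env = Game2048Env()
obs, done = env.reset(), False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, _, done, _ = env.step(int(action))
    print(env.board, end="\n\n")    # crude text rendering of the 4x4 board
print("Peak tile:", int(env.board.max()))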

Next Steps

  • Prioritized experience replay (PER) experiments
  • Curriculum or self-play style difficulty ramps
  • Better state featurization and tile encodings