GEM, aka General Experience Maker
<aside>
Contributors: Zichen Liu*, Anya Sims*, Keyu Duan*, Changyu Chen*
Advisors: Diyi Yang, Wee Sun Lee, Min Lin
</aside>
<aside>
We're entering the era of experience [1], in which LLM training moves beyond static datasets and towards LLM agents learning from experience gathered in complex, expressive environments. As a step in this direction, we introduce GEM, our open-source effort to build a General Experience Maker.
Inspired by OpenAI Gym's role in traditional RL [2], GEM serves as a dedicated, flexible environment simulator for the age of LLMs. In contrast to existing codebases [3,4], GEM deliberately decouples the environment from the training framework, making it easy to integrate with popular RL training frameworks (Oat, Verl, etc.) through clean, standardized interfaces. In addition, GEM features tool integration, flexible and easy-to-modify wrappers, asynchronous vectorized environment execution to maximize throughput, multi-environment training, and more: everything you need to make RL training for LLM agents simple.
The highlights of GEM's first release are:
</aside>
* Equal contribution, with order decided by a dice roll.
Figure 1. Learning curves of Qwen3-based agents across diverse environments from 5 categories: game (language games); rg (ReasoningGym); code (coding tasks); math (Python-integrated math questions); qa (search-integrated general questions). All agents are trained via a simple yet general multi-turn algorithm based on REINFORCE (Algorithm 1).
Figure 2. The standard agent-environment loop from Sutton & Barto [11]. GEM implements the 'Environment' side, providing a standardized testbed over a wide range of tasks. GEM decouples the environment from the 'Agent' side, allowing researchers to easily plug in, train, and benchmark their own LLM-based agents with maximum flexibility.
OpenAI Gym [2] has been instrumental in RL development for many years, providing a standard API for communicating between agents and environments, as well as a suite of environments compliant with the API for developing and benchmarking new algorithms. GEM brings Gym into the LLM era, with a standardized interface that closely follows Gym's, along with a diverse suite of environments. The main methods for each environment are:
- reset(seed): samples an initial environment state (e.g. a math question or a hidden word in Wordle) and returns the first observation.
- step(action): executes the action, including any tool calls, and returns the next observation, reward, and done flags (terminated and truncated, indicating whether the interaction is finished).

A simple example:
```python
import gem

# List all supported environments
gem.print_envs()

# Initialize the environment
env = gem.make("game:GuessTheNumber-v0")

# Reset the environment to generate the first observation
observation, info = env.reset()

# Start the agent-environment loop
while True:
    action = env.sample_random_action()  # insert policy here, e.g.,
    # (pseudocode) action = llm.generate(observation)

    # Apply the action and receive the next observation, reward,
    # and whether the episode has ended
    next_observation, reward, terminated, truncated, info = env.step(action)
    print("OBS", observation)
    print("ACT", action)

    # Update the policy (online) here, e.g.,
    # policy = learn(policy, observation, action, reward, info)
    observation = next_observation

    # Exit when the episode terminates
    if terminated or truncated:
        break
```
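The loop above yields per-step (observation, action, reward) transitions, which is all a REINFORCE-style update needs. As a minimal sketch of the idea behind such an algorithm (not GEM's actual Algorithm 1; the LLM log-prob plumbing is abstracted into plain floats here), the per-episode objective can be written as:

```python
def discounted_returns(rewards, gamma=1.0):
    """Compute the return-to-go G_t for every step of one episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def reinforce_loss(log_probs, rewards, gamma=1.0):
    """REINFORCE objective for one episode: -sum_t log pi(a_t|s_t) * G_t.

    log_probs: log-probability of each action under the policy
    rewards:   reward received after each action
    """
    returns = discounted_returns(rewards, gamma)
    return -sum(lp * g for lp, g in zip(log_probs, returns))
```

With a sparse terminal reward (common in GEM's games), every action in a successful episode shares credit through the return-to-go, which is what makes the multi-turn setting work with such a simple estimator.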
GEM's core components are Tasks and Tools. Each combination of a task and an optional set of tools constitutes an environment that can be used to challenge (and RL-tune) an LLM's capabilities in reasoning, multi-step planning, tool use, and strategic exploration.
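To illustrate the task-plus-tools composition (the class names and action format below are hypothetical sketches, not GEM's actual API), a tool can be modeled as a wrapper that intercepts tool-call actions before they reach the underlying task and feeds the tool output back as the next observation:

```python
class PythonTool:
    """Hypothetical tool: evaluates expressions the agent tags for execution."""
    TAG = "<python>"

    def applies(self, action):
        return action.strip().startswith(self.TAG)

    def run(self, action):
        code = action.strip()[len(self.TAG):]
        try:
            return str(eval(code))  # sketch only; a real tool would sandbox this
        except Exception as e:
            return f"error: {e}"

class ToolWrapper:
    """Hypothetical wrapper: routes tool-call actions to the tool instead of
    the task, so the episode continues with the tool output as observation."""
    def __init__(self, env, tool):
        self.env, self.tool = env, tool

    def reset(self, seed=None):
        return self.env.reset(seed)

    def step(self, action):
        if self.tool.applies(action):
            observation = self.tool.run(action)
            return observation, 0.0, False, False, {}  # episode continues
        return self.env.step(action)
```

The design point this sketch captures is the one stated above: the task stays unchanged, and tools compose around it, so the same task can be run with or without tool access.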
GEM features five main categories of tasks:
<aside>
Math: Solve math problems with chain-of-thought reasoning.
</aside>
<aside>
Game: Diverse multi-step text-based games adapted from TextArena [5].
</aside>
<aside>
Question-Answering: Perform knowledge-intensive retrieval and answer generation.
</aside>
<aside>
Code: Generate and validate Python code with a live interpreter.
</aside>
<aside>
ReasoningGym: A lightweight wrapper exposing ReasoningGym [6] through the same unified Gym-like interface, to facilitate easy integration with various training frameworks.
</aside>
GEM provides an easy-to-use interface to add more environments! Math, Code, and Question-Answering tasks can be added by simply specifying a new dataset. New game environments are also easy to add: simply inherit from GEM's environment base class, define the state transitions and reward logic, and plug it into the training loop using our examples as a guide.