Specification

How to author a PostTrain Arena environment

An environment package is four pieces in one directory: a task.md, an environment/, a verifier/, and an oracle/. A competition entry is a corpus of these, bounded per team. This page is the authoring reference for one package; the body of each section is what reviewers compare against.

Overview

PostTrain Arena environments are self-contained: a fresh checkout plus Docker is enough to build the environment, run an agent, and score the result. There is no hidden runtime config; every limit and scoring rule lives in the package directory.

The format is built on task.md — a single human-authored file with YAML frontmatter (limits, metadata, optional multi-agent structure) and a Markdown body (the prompt, plus optional per-scene or per-role guidance). It expands through defaults into the same canonical contract the managed pipeline consumes, so a minimal author file stays small but a richer one can spell out scenes, roles, and a simulated user without leaving the file. One package is the unit of authoring; a team submits a whole corpus of them — see Tracks & scoring.

Tracks & scoring

Submissions are bounded by team. The unit of entry is a team’s corpus on one track, not an individual task. One entry is one directory under submissions/ with a submission.yaml manifest; teams may enter both tracks as separate entries.

Track 2 — Environments. Task packages (task.md + environment/ + verifier/ + oracle/). Per entry: 50 minimum, 100 recommended, 200 maximum.
Track 1 — Skill Learning. SKILL.md packages. Per entry: 20 minimum, 50 recommended, 100 maximum.

Scoring (Track 2, headline). The managed pipeline regenerates teacher trajectories, runs SFT then GRPO on Qwen3-8B over your environments, and evaluates the checkpoint on BenchFlow Signals — a private held-out suite (a small public sample is released for sanity checks). Your score is the held-out Delta over a fixed reference baseline trained with the identical recipe, with paired bootstrap confidence intervals. Track 1 packages are evaluated by pass@1 of a frozen reference agent — no training, no internet.
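To make the "Delta with paired bootstrap confidence intervals" idea concrete, here is a minimal sketch of a percentile paired bootstrap over per-item scores. It is not the managed pipeline's code: the function name, the resample count, the 95% level, and the sample scores are all assumptions chosen for illustration.

```python
import random
from statistics import mean


def paired_bootstrap_delta(candidate, baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Return (delta, lo, hi) for mean(candidate - baseline) over paired items.

    Hypothetical helper: the real pipeline's resampling details are not
    specified on this page.
    """
    if len(candidate) != len(baseline) or not candidate:
        raise ValueError("scores must be paired and non-empty")
    # Pairing means we resample per-item differences, not each side separately.
    diffs = [c - b for c, b in zip(candidate, baseline)]
    rng = random.Random(seed)
    n = len(diffs)
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = resampled[int((alpha / 2) * n_resamples)]
    hi = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi


# Made-up pass/fail scores for the same eight held-out items.
candidate = [1, 1, 0, 1, 1, 0, 1, 1]
baseline = [1, 0, 0, 1, 0, 0, 1, 0]
delta, lo, hi = paired_bootstrap_delta(candidate, baseline)
print(f"delta={delta:.3f}, interval=[{lo:.3f}, {hi:.3f}]")
```

Resampling the per-item differences rather than each score list independently is what makes the interval "paired": both checkpoints are always compared on the same items.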

Phases. Phase 0 (warm-up): public sample only, leaderboard hidden. Phase 1 (development): full entries accepted, public-sample scoring shown live. Phase 2 (final): submissions frozen, private-suite evaluation. Entries may be withdrawn until the Phase 2 freeze, and grading is blind to author identity — the author_* frontmatter fields are used for credit after scoring, not during it.

Package layout

A team entry lives under submissions/; each environment package is one directory inside it. The package itself is the same four-part contract everywhere — in the starting-kit template, in the worked examples, and in your entry.

filesystem

submissions/
└─ your-team/
   ├─ submission.yaml                  # team_name, contact_email, track: environments
   └─ envs/
      └─ your-env-name/
         ├─ task.md                    # required — frontmatter + prompt
         ├─ environment/
         │  ├─ Dockerfile              # required — built fresh per task
         │  ├─ <seed-data>             # optional — any files the agent reads
         │  └─ skills/                 # optional — agent-discoverable skill packages
         ├─ verifier/
         │  ├─ test.sh                 # required — pytest runner shim (boilerplate)
         │  ├─ test_outputs.py         # required — the actual checks
         │  ├─ verifier.md             # required — strategy + rubric declaration
         │  └─ rubrics/
         │     └─ verifier.md          # required — plain-language pass criteria
         └─ oracle/
            └─ solve.sh                # required — produces a passing trial

Naming convention: <env-or-domain>-<short-description> — for example gmail-workflow-delegation or skillsbench-weighted-gdp-calc. Category, modality, and any safety qualifier live in the frontmatter, not the directory name.

The fastest way to start is cp -R starting-kit/template submissions/your-team/envs/your-env-name from a fresh checkout. The worked examples under starting-kit/examples/ — including the three skillsbench-* ports — exercise every part of the contract; copy the parts you need from them.
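As a quick illustration of the layout contract, the sketch below lists which required files are missing from a package directory. It only mirrors the tree shown above; it is not the repository's check_task.py, and the names REQUIRED_FILES and missing_files are hypothetical.

```python
from pathlib import Path

# Required paths, taken from the layout tree on this page.
REQUIRED_FILES = (
    "task.md",
    "environment/Dockerfile",
    "verifier/test.sh",
    "verifier/test_outputs.py",
    "verifier/verifier.md",
    "oracle/solve.sh",
)


def missing_files(package_dir):
    """Return the required entries that are absent from package_dir."""
    package = Path(package_dir)
    missing = [rel for rel in REQUIRED_FILES if not (package / rel).is_file()]
    # The rubrics directory must exist and hold at least one file.
    rubrics = package / "verifier" / "rubrics"
    if not rubrics.is_dir() or not any(rubrics.iterdir()):
        missing.append("verifier/rubrics/<at least one rubric>")
    return missing


if __name__ == "__main__":
    # Path is the placeholder used throughout this page.
    print(missing_files("submissions/your-team/envs/your-env-name"))
```

An empty list means every required file is present; anything else names what still needs to be copied from the template.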

task.md

Two halves separated by a --- fence: YAML frontmatter on top, Markdown body below. Frontmatter declares limits and metadata; the body holds the prompt and, optionally, per-scene or per-role instructions and a user persona.

Minimal example

task.md

---
version: "1.0"
metadata:
  author_name: Ada Lovelace
  author_email: ada@example.com
  category: natural-science   # a real glossary slug — see the domain note
  difficulty: medium          # easy | medium | hard
  tags: [calculation, spreadsheet]
agent:
  timeout_sec: 900
verifier:
  timeout_sec: 180
environment:
  build_timeout_sec: 600
  cpus: 1
  memory_mb: 2048
  storage_mb: 10240
  allow_internet: false
---

## prompt

State the task here. One short paragraph that the agent reads first,
followed by any structured detail it needs — paths, output schema, rules.
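The two-halves structure can be illustrated with a small splitter. This is a sketch under the assumption that the file starts with a --- fence and that the next --- line closes the frontmatter; split_task_md is a hypothetical name, not part of the starting kit.

```python
def split_task_md(text):
    """Split task.md text into (frontmatter_text, body_text)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("task.md must start with a '---' fence")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).strip()
            return frontmatter, body
    raise ValueError("closing '---' fence not found")


sample = """---
version: "1.0"
agent:
  timeout_sec: 900
---

## prompt

State the task here.
"""
frontmatter, body = split_task_md(sample)
print(frontmatter)
print(body)
```

The frontmatter string can then be handed to any YAML parser (for example PyYAML's yaml.safe_load, if it is installed), while the body stays plain Markdown.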

Frontmatter reference

The required top-level keys are version, metadata, agent, verifier, and environment; inside metadata, author_name, author_email, and category are required. Limits default to safe values (verifier 600s, 1 CPU, 2 GB RAM, 10 GB storage); agent.timeout_sec has no default, so set it. Field names use snake_case throughout.

version (required) — schema version, currently "1.0".
metadata.author_name + author_email (required) — contact for credit; grading is blind to these.
metadata.category (required) — the task’s domain, written as a glossary slug (see the domain note below).
metadata.difficulty — easy | medium | hard.
metadata.tags — short string list, used for routing and filtering.
agent.timeout_sec — wall-clock budget for the agent (no default — set it).
verifier.timeout_sec — budget for the scoring run after the agent finishes (default 600).
environment.cpus / memory_mb / storage_mb / build_timeout_sec — sandbox limits (defaults 1 / 2048 / 10240 / 600).
environment.allow_internet — boolean; defaults to true. Set false to sandbox the agent off the network, which most tasks should.
agents.roles — optional named roles when the task is multi-agent (see below).
scenes — optional ordered list of scenes, each referencing one or more roles.
user — optional simulated-user model + stop rule when the agent needs a counterpart.
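The required keys and defaults listed above can be sketched as a small expansion step over an already-parsed frontmatter dict. This is an illustration, not the managed pipeline's expansion logic: expand_frontmatter and the constant names are hypothetical, and only the defaults stated on this page are applied.

```python
import copy

REQUIRED_TOP_LEVEL = ("version", "metadata", "agent", "verifier", "environment")
REQUIRED_METADATA = ("author_name", "author_email", "category")
# Defaults as documented in the frontmatter reference.
DEFAULTS = {
    "verifier": {"timeout_sec": 600},
    "environment": {
        "cpus": 1,
        "memory_mb": 2048,
        "storage_mb": 10240,
        "build_timeout_sec": 600,
        "allow_internet": True,
    },
}


def expand_frontmatter(frontmatter):
    """Return (expanded_copy, problems) without mutating the input."""
    expanded = copy.deepcopy(frontmatter)
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in expanded:
            problems.append(f"missing top-level key: {key}")
    metadata = expanded.get("metadata") or {}
    for key in REQUIRED_METADATA:
        if key not in metadata:
            problems.append(f"missing metadata.{key}")
    if "timeout_sec" not in (expanded.get("agent") or {}):
        problems.append("agent.timeout_sec has no default; set it")
    # Fill in documented defaults only where the author left a field out.
    for section, defaults in DEFAULTS.items():
        target = expanded.get(section) or {}
        expanded[section] = target
        for name, value in defaults.items():
            target.setdefault(name, value)
    return expanded, problems


minimal = {
    "version": "1.0",
    "metadata": {
        "author_name": "Ada Lovelace",
        "author_email": "ada@example.com",
        "category": "natural-science",
    },
    "agent": {"timeout_sec": 900},
    "verifier": {},
    "environment": {"allow_internet": False},
}
expanded, problems = expand_frontmatter(minimal)
print(problems)
print(expanded["environment"])
```

Note that an explicit allow_internet: false survives expansion, while the omitted limits pick up their documented defaults.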

The eight domains. The competition spans eight under-served domains, listed by their display names: Sciences, Industrial & Energy Operations, Cybersecurity, Finance & Economics, Office & Knowledge Work, Media & Multimodal Content, AI/ML & Agentic Systems, and Software Engineering. The metadata.category value is a SkillsBench glossary slug (for example natural-science, software-engineering, or cybersecurity). The exact slug-to-domain mapping is finalized in the starting kit; copy a current slug from starting-kit/template or one of the examples rather than inventing one.

Body sections

The body is Markdown with a small set of well-known headings. Anything not matching a known heading is passed through as authoring notes and ignored by the runtime.

## prompt (required) — what the agent sees first.
## scene:<name> — guidance shown when that scene starts, if you declared scenes.
## role:<name> — guidance shown to that role, if you declared agents.roles.
## user-persona — the simulated user’s mindset, if you declared user.
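The heading rules above can be sketched as a small body parser. This is a guess at the behavior rather than the runtime's implementation: parse_body is a hypothetical name, the characters allowed in scene and role names are an assumption, and treating everything under an unknown heading as notes is one reading of "passed through as authoring notes".

```python
import re

# Known headings from the list above; allowed name characters are an assumption.
KNOWN_HEADING = re.compile(r"^##\s+(prompt|user-persona|scene:[\w-]+|role:[\w-]+)\s*$")


def parse_body(body):
    """Return (sections, notes): known sections by heading, the rest as notes."""
    sections, notes = {}, []
    current = None
    for line in body.splitlines():
        match = KNOWN_HEADING.match(line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif line.startswith("## "):
            current = None  # unknown heading: what follows is authoring notes
            notes.append(line)
        elif current is None:
            notes.append(line)
        else:
            sections[current].append(line)
    cleaned = {name: "\n".join(lines).strip() for name, lines in sections.items()}
    return cleaned, "\n".join(notes).strip()


sample_body = """## prompt

Refactor the tiny service.

## scene:plan

Write a concise plan.

## reviewer-notes

Not a known heading, so this is treated as notes.
"""
sections, notes = parse_body(sample_body)
if "prompt" not in sections:
    raise SystemExit("## prompt is required")
print(sorted(sections))
```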

Multi-agent example

The richer form composes roles, scenes, and a simulated user in one file — the same authoring document the eval pipeline consumes directly:

task.md

---
version: "1.0"
metadata:
  author_name: benchflow
  author_email: team@example.com
  category: software-engineering   # real glossary slug — see the domain note
  difficulty: hard
  tags: [multi-agent, planning]
agent:
  timeout_sec: 1200
verifier:
  timeout_sec: 240
environment:
  build_timeout_sec: 600
  cpus: 2
  memory_mb: 4096
  storage_mb: 10240
  allow_internet: false
agents:
  roles:
    planner:
      agent: claude-agent-acp
      model: claude-sonnet-4-6
    executor:
      agent: codex-acp
      model: gpt-5.5
      reasoning_effort: high
scenes:
  - name: plan
    turns: [{ role: planner }]
  - name: implement
    turns: [{ role: executor }]
user:
  model: claude-haiku
  stop_rule: satisfied-or-5-rounds
---

## prompt

Refactor the tiny service so it keeps the same public behavior while splitting
request parsing, business logic, and output formatting into separate modules.

## scene:plan

Read the task, inspect the code, and write a concise implementation plan.

## scene:implement

Apply the plan. Run the verifier before finishing.

## user-persona

You are impatient and only reveal the order id if the agent asks for it
specifically.

environment/

The environment directory holds the Dockerfile and any seed data the agent needs at trial time. Start from a plain Ubuntu base and add only what the task needs — the dogfooded SkillsBench tasks all begin with FROM ubuntu:24.04.

environment/Dockerfile

# environment/Dockerfile
FROM ubuntu:24.04
ENV DEBIAN_FRONTEND=noninteractive

# Install Python + anything else the task needs.
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /root

# Task-specific deps. Pin versions.
RUN pip3 install --break-system-packages \
    requests==2.32.3

# Seed data the agent reads at trial time.
COPY data.json /root/data.json

# Optional: bundle a skill package the agent can discover.
# COPY skills /root/skills

The agent works inside $BENCHFLOW_WORKSPACE (default /root). Anything the verifier needs to see must be written under that path before the agent exits. Trial sandboxes start fresh per attempt.

An optional environment/skills/ directory follows the Agent Skills spec: each subdirectory is one skill with its own SKILL.md, scripts, and references. Bundling skills with the task is the usual way to teach an agent the domain-specific moves your verifier expects.

verifier/

The verifier runs after the agent finishes and writes three artifacts to /logs/verifier/:

reward.txt — a single float, one of 0.0 or 1.0 for binary checks (or a real number in [0,1] for graded ones).
reward.json — { "reward": <float> }.
ctrf.json — pytest's CTRF details for per-check breakdowns.
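For a quick sanity check of the first two artifacts after a local run, a reader could load them as sketched below. The helper name read_reward is hypothetical, and the requirement that reward.txt and reward.json agree is an assumption based on the descriptions above, not a documented rule.

```python
import json
from pathlib import Path


def read_reward(log_dir="/logs/verifier"):
    """Read reward.txt and reward.json and check they describe one reward."""
    log = Path(log_dir)
    text_reward = float((log / "reward.txt").read_text().strip())
    json_reward = float(json.loads((log / "reward.json").read_text())["reward"])
    if not 0.0 <= text_reward <= 1.0:
        raise ValueError(f"reward {text_reward} is outside [0, 1]")
    if abs(text_reward - json_reward) > 1e-9:
        raise ValueError("reward.txt and reward.json disagree")
    return text_reward
```

Calling read_reward() with no arguments reads the default /logs/verifier/ location; pass another directory when inspecting copied-out logs.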

verifier/test.sh is a small shim that runs pytest against test_outputs.py and writes those artifacts — copy it verbatim from the template. Put your real checks in test_outputs.py:

verifier/test_outputs.py

# verifier/test_outputs.py — pytest tests the trial's outputs.
# Passing → reward 1.0; any failure → reward 0.0. Use one test class
# per check group so partial-credit cases stay readable.
import json
import os
from pathlib import Path

import pytest

WORKSPACE = Path(os.environ.get("BENCHFLOW_WORKSPACE", "/root"))
ANSWER_FILE = WORKSPACE / "answer.json"

class TestAnswerFileExists:
    def test_file_exists(self):
        assert ANSWER_FILE.exists(), f"Answer file not found at {ANSWER_FILE}"

    def test_file_is_valid_json(self):
        with open(ANSWER_FILE) as f:
            try:
                json.load(f)
            except json.JSONDecodeError as e:
                pytest.fail(f"Answer file is not valid JSON: {e}")

class TestAnswerContent:
    def test_result_matches_expected(self):
        with open(ANSWER_FILE) as f:
            data = json.load(f)
        assert "result" in data, "Answer file has no 'result' key"
        assert data["result"] == 42, f"Expected 42, got {data['result']}"

The verifier.md file alongside the tests declares the rubric the pytest results roll up into, and gives reviewers a plain-language description for edge cases. Schema:

verifier/verifier.md

---
document_version: '0.3'
verifier:
  name: your-task-verifier
  default_strategy: pytest
  strategies:
    pytest:
      type: script
      command: ./test.sh
  rubric:
    combine: weighted_sum
    dimensions:
      correctness:
        weight: 1.0
        source: pytest
  outputs:
    reward_text: /logs/verifier/reward.txt
    reward_json: /logs/verifier/reward.json
    details_json: /logs/verifier/ctrf.json
---

## role:reviewer

State what a passing trial looks like in plain language. Human reviewers
read this when grading edge cases the pytest tests can't decide alone.
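To show how a weighted_sum rubric might roll per-dimension scores into one reward, here is a minimal sketch. The exact combination rule is not spelled out on this page, so dividing by the total weight is an assumption, as are the function name and the second "quality" dimension with an "llm-judge" source.

```python
def weighted_sum(dimensions, scores):
    """Combine per-source scores into one reward using dimension weights.

    dimensions: {name: {"weight": float, "source": str}}
    scores:     {source: float in [0, 1]}
    Normalizing by the total weight is an assumption, not documented behavior.
    """
    total_weight = sum(dim["weight"] for dim in dimensions.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(dim["weight"] * scores[dim["source"]] for dim in dimensions.values())
    return weighted / total_weight


# The single-dimension rubric from the schema above.
single = {"correctness": {"weight": 1.0, "source": "pytest"}}
print(weighted_sum(single, {"pytest": 1.0}))

# A hypothetical two-dimension rubric.
mixed = {
    "correctness": {"weight": 0.75, "source": "pytest"},
    "quality": {"weight": 0.25, "source": "llm-judge"},
}
print(weighted_sum(mixed, {"pytest": 1.0, "llm-judge": 0.0}))
```

With a single dimension of weight 1.0, as in the schema above, the reward is simply the pytest result.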

The verifier runs after the agent with no network access by default. If scoring needs to call an LLM judge or external API, opt in through the strategy you declare in verifier.md (for example an llm-judge strategy) rather than widening the agent’s own network. The starting kit shows the supported pattern.

oracle/

The oracle is a reference solution. Its job is to prove the task is reachable: when run under the same environment, it must score a passing reward (1.0 for binary checks). Reviewers re-run the oracle as part of acceptance, and CI re-runs it on every change to the environment image.

oracle/solve.sh

#!/bin/bash
# oracle/solve.sh — reference solution. Runs in the same container the
# agent uses. Must achieve a passing score so reviewers can confirm the
# task is solvable; CI re-runs it on every image bump.
set -euo pipefail
WORKSPACE="${BENCHFLOW_WORKSPACE:-/root}"
mkdir -p "$WORKSPACE"

cat > "$WORKSPACE/answer.json" << 'JSON'
{
  "result": 42
}
JSON

Keep the oracle as simple as the task allows — it is documentation as much as a regression check. A long, clever oracle usually means the task is doing too much.

Validate locally

The contributor path is self-contained: everything runs with just python3 and Docker, and no benchflow install is needed. Two structural checks, an oracle replay, and an empty trial cover the contract:

shell

# Structural — fast, no Docker required
python3 scripts/check_task.py submissions/your-team/envs
python3 scripts/check_submission.py

# Oracle replay — build the image, run the oracle, score it
scripts/run_local.sh submissions/your-team/envs/your-env-name

# Empty trial — prove the verifier rejects a do-nothing run
scripts/run_local.sh submissions/your-team/envs/your-env-name --skip-oracle

check_task.py catches frontmatter typos, missing required fields, and missing files (including that verifier/rubrics/ holds at least one rubric). check_submission.py validates the team manifest and that the entry count sits within the track bounds: it warns below the minimum and fails above the maximum. The oracle replay must score 1.0; the empty trial must not. Get all four green before opening a PR; CI runs the same checks.
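The warn-below, fail-above rule can be pictured with a short sketch. It is not the code in check_submission.py: the function name is hypothetical, the bounds are copied from Tracks & scoring, and the "skill-learning" track key is an assumed slug (only "environments" appears in the manifest example on this page).

```python
# (minimum, maximum) entries per track, from Tracks & scoring.
TRACK_BOUNDS = {
    "environments": (50, 200),
    "skill-learning": (20, 100),  # assumed slug for Track 1
}


def check_entry_count(track, count):
    """Return (status, message), where status is 'ok', 'warn', or 'fail'."""
    minimum, maximum = TRACK_BOUNDS[track]
    if count > maximum:
        return "fail", f"{count} packages exceeds the maximum of {maximum}"
    if count < minimum:
        return "warn", f"{count} packages is below the minimum of {minimum}"
    return "ok", f"{count} packages is within [{minimum}, {maximum}]"


for n in (49, 50, 200, 201):
    print(n, check_entry_count("environments", n))
```

Treating a short corpus as a warning rather than a failure lets a team validate a partial entry during development while still blocking an oversized one.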

Submit

Open a pull request against the posttrainarena repo adding or updating your team entry under submissions/. Paste the tail of both run_local.sh runs in the description so reviewers can see the image built, the oracle passed, and the empty trial did not. Discussion happens on Discord and in the PR thread.

Accepted environments are released openly. The managed pipeline regenerates teacher trajectories, runs SFT then GRPO on Qwen3-8B over your corpus, and scores the checkpoint’s held-out Delta on BenchFlow Signals against a fixed reference baseline. Standings publish once Phase 1 opens; Phase 2 freezes submissions for the private-suite evaluation.
