Overview
This dashboard estimates the carbon footprint of large language model (LLM) inference running on the National Research Platform (NRP) Nautilus cluster. All measurements are derived passively from existing telemetry — no instrumentation of the LLM services is required.
Key principle: Carbon = Energy × Grid Intensity. We measure GPU energy directly from hardware sensors already reporting to the cluster's Prometheus instance, then multiply by published grid carbon intensity for each node's location.
Step 1 — Measuring GPU Power
GPU power draw is read from NVIDIA DCGM Exporter (Data Center GPU Manager), which runs as a DaemonSet on every GPU node in the NRP cluster. DCGM reads the hardware power sensor via NVML (nvmlDeviceGetPowerUsage) and exports it to Prometheus as:
DCGM_FI_DEV_POWER_USAGE{namespace, container, Hostname, ...} # watts, per GPU
We aggregate all GPUs belonging to each LLM pod using a PromQL sum:
sum by (namespace, container, Hostname) (
avg_over_time(DCGM_FI_DEV_POWER_USAGE{namespace=~"nrp-llm|sdsc-llm"}[5m])
)
The avg_over_time(...[5m]) wrapper smooths short-lived spikes into a 5-minute
rolling average. This gives total GPU power in watts for each deployed model.
Measurements are scraped every 30 seconds.
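For illustration, the power expression above can be issued as a Prometheus instant query over the HTTP API. The base URL below is a placeholder, not the cluster's actual endpoint:

```python
from urllib.parse import urlencode

# PromQL from above: per-pod GPU power, smoothed over a 5-minute window
POWER_QUERY = (
    'sum by (namespace, container, Hostname) ('
    'avg_over_time(DCGM_FI_DEV_POWER_USAGE{namespace=~"nrp-llm|sdsc-llm"}[5m])'
    ')'
)

def power_query_url(prometheus_base: str) -> str:
    """Build an instant-query URL for the per-pod GPU power expression."""
    return f"{prometheus_base}/api/v1/query?" + urlencode({"query": POWER_QUERY})

# Example against a hypothetical Prometheus endpoint:
url = power_query_url("https://prometheus.example.edu")
```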
Limitation: We measure GPU power only. CPU, RAM, storage, and networking power are not included. For GPU-heavy inference workloads this typically represents 70–85% of total server power. A full datacenter PUE factor (usually 1.1–1.5) is also not applied. Our figures are therefore conservative underestimates of total facility energy.
Step 2 — Token Throughput (Prompt + Generation)
Token throughput is read from vLLM's built-in Prometheus metrics. Two counters are tracked separately and summed into a total token rate:
# Output (decode) tokens generated per second:
sum by (namespace, container, model_name) (
rate(vllm:generation_tokens_total[5m])
)
# Input (prefill) tokens processed per second:
sum by (namespace, container, model_name) (
rate(vllm:prompt_tokens_total[5m])
)
Both phases consume GPU energy: the prefill phase runs attention over every input token before the first output token is produced; the decode phase generates one output token at a time. Counting only output tokens would therefore misrepresent the true cost of inference.
Why this matters for agentic AI: In agentic workloads, a single user-visible response may involve tens of thousands of input tokens across multiple hidden tool calls — input tokens can account for 90–95% of all tokens processed. If CO₂/token were computed over output tokens only, it would overstate the per-token cost by 10–20×. The dashboard uses total tokens (input + output) as the denominator, so reported CO₂/token reflects the full energy cost amortized across every token the GPU processed.
The CO₂ per token metric is only reported when total throughput (input + output) is at least 5 tok/s. Below this threshold the ratio is dominated by near-idle power draw rather than the model's efficiency under real load. The dashboard cards break down throughput into Output tok/s, Input tok/s, and Total tok/s for transparency.
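A minimal sketch of the throughput accounting and the 5 tok/s reporting threshold described above (function names are illustrative, not taken from the dashboard's code):

```python
MIN_TOKENS_PER_SEC = 5.0  # below this, the ratio reflects idle power, not efficiency

def total_token_rate(prompt_tok_s: float, generation_tok_s: float) -> float:
    """Total throughput: prefill (input) plus decode (output) tokens per second."""
    return prompt_tok_s + generation_tok_s

def co2_per_token_reportable(prompt_tok_s: float, generation_tok_s: float) -> bool:
    """CO2/token is only shown when total throughput clears the threshold."""
    return total_token_rate(prompt_tok_s, generation_tok_s) >= MIN_TOKENS_PER_SEC
```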
Step 3 — Carbon Intensity by Grid Location
NRP nodes are hosted at universities and research institutions across the United States and internationally, and grid carbon intensity varies substantially by location. We use EPA eGRID 2022 subregion averages matched to each node's Hostname label. When the hostname doesn't match, a secondary lookup by Kubernetes namespace prefix is attempted (e.g. sdsc-llm maps to California). If neither matches, the fallback is the California CAMX intensity (0.198 kg CO₂/kWh), since the majority of NRP Nautilus nodes are hosted at SDSC in San Diego:
| Institution / Region | Hostname pattern | eGRID Subregion | Intensity (kg CO₂/kWh) |
|---|---|---|---|
| California (SDSC, CSUS, Caltech, Humboldt, UCSD, UCLA, UCSB, CalIT2, CSUMB) | *.sdsc.*, *.csus.*, *.humboldt.*, *.caltech.*, *.ucsd.*, *.ucla.*, *.ucsb.*, *.calit2.*, csumb.*, nautilus-*, sdsc-* | CAMX | 0.198 |
| NYU (New York) | *.nyu.* | NYUP | 0.174 |
| UNL (Nebraska) | *.unl.* | MROW | 0.531 |
| UT Austin / TACC (Texas) | *.utexas.*, *.tacc.* | ERCO | 0.393 |
| Clemson (South Carolina) | *.clemson.* | SRSO | 0.423 |
| University of Hawaii | *.hawaii.* | HIOA | 0.702 |
| K-State (Kansas) | *.ksu.* | SPSO | 0.555 |
| KREONET (South Korea) | *.kreonet.* | Korean grid | 0.459 |
| All other nodes | — | CAMX (California default) | 0.198 |
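The lookup chain can be sketched as follows. The real table lives in Go (internal/carbon/intensity.go); this Python version abbreviates the pattern set and is illustrative only:

```python
import fnmatch

# Abbreviated subset of the intensity table above (kg CO2/kWh).
HOSTNAME_INTENSITY = {
    "*.sdsc.*": 0.198, "*.nyu.*": 0.174, "*.unl.*": 0.531,
    "*.utexas.*": 0.393, "*.hawaii.*": 0.702,
}
NAMESPACE_INTENSITY = {"sdsc-llm": 0.198}  # namespace-prefix fallback
DEFAULT_INTENSITY = 0.198  # CAMX: most Nautilus nodes are hosted at SDSC

def grid_intensity(hostname: str, namespace: str) -> float:
    """Hostname glob match first, then namespace prefix, then CAMX default."""
    for pattern, kg_per_kwh in HOSTNAME_INTENSITY.items():
        if fnmatch.fnmatch(hostname, pattern):
            return kg_per_kwh
    for prefix, kg_per_kwh in NAMESPACE_INTENSITY.items():
        if namespace.startswith(prefix):
            return kg_per_kwh
    return DEFAULT_INTENSITY
```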
Step 4 — Carbon Calculations
Three derived metrics are computed:
CO₂ grams per hour
CO₂_g/hr = P_watts × intensity_kg/kWh
        = P_watts × intensity   # P watts for 1 h = P/1000 kWh; × (kg/kWh) × (1000 g/kg) = P × intensity in g/h
CO₂ milligrams per token (total tokens)
total_tokens_per_sec = prompt_tokens_per_sec + generation_tokens_per_sec
CO₂_mg/token = P_watts × intensity_kg/kWh × (1e6 mg/kg) / (3.6e6 J/kWh) / total_tokens_per_sec
= P_watts × intensity × 0.2778 / total_tokens_per_sec
The denominator is total tokens processed (prompt + generation), not output tokens alone. This is the correct normalization: the GPU expends energy on every token it touches — both the input context during prefill and each generated token during decode.
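The two formulas above, as a runnable sketch. With the 783 W, 0.198 kg/kWh, 18 tok/s figures used in the example later in this page, they give about 155 g/hr and about 2.4 mg/token:

```python
def co2_grams_per_hour(power_watts: float, intensity_kg_per_kwh: float) -> float:
    """P watts for one hour = P/1000 kWh; the kg->g and /1000 factors cancel."""
    return power_watts * intensity_kg_per_kwh

def co2_mg_per_token(power_watts: float, intensity_kg_per_kwh: float,
                     total_tokens_per_sec: float) -> float:
    """Energy per token (J) -> kWh -> kg CO2 -> mg CO2."""
    MG_PER_KG = 1e6
    JOULES_PER_KWH = 3.6e6
    return (power_watts * intensity_kg_per_kwh * MG_PER_KG
            / JOULES_PER_KWH / total_tokens_per_sec)
```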
Commercial frontier comparison — per-model physics estimate
Rather than comparing against a fixed hypothetical workload, the dashboard computes a per-model, apples-to-apples power comparison: for each NRP model's actual observed token throughput (prompt and generation rates), what total power would a commercial frontier model draw to serve the same tokens at the same rate?
Key principle: The comparison holds everything fixed except model size — same tokens, same traffic level, same grid. The frontier model's CO₂/token is computed using the same grid carbon intensity as the NRP model it's compared against. This isolates the model size/energy tradeoff: a larger frontier model may produce higher-quality output, but requires substantially more hardware — and therefore more energy — to host and run. Because grid intensity is the same on both sides, the CO₂ ratio equals the watts ratio.
Commercial frontier model specification
We model a commercial frontier LLM as a ~1.5 T parameter Mixture-of-Experts (MoE) with ~300 B active parameters per token, served in FP8 on H100-80GB GPUs:
| Parameter | Value | Derivation |
|---|---|---|
| Total parameters | ~1.5 T | Frontier MoE estimate (2026) |
| Active parameters/token | ~300 B | Typical MoE activation ratio |
| Weight memory (FP8) | ~1.5 TB | 1 byte/param × 1.5 T params |
| Minimum GPUs | 24× H100-80GB | 1.92 TB total HBM for weights + KV cache + activations |
| Idle power | 225 W/GPU × 24 = 5,400 W | H100 SXM measured idle |
Marginal token energy (above idle)
| Phase | Marginal energy | Derivation |
|---|---|---|
| Prefill (input tokens) | 0.5 J/token | Compute-bound: 600 GFLOPs/token, 24× H100 at 60% util → 23,700 tok/s; additional power (700−225) W × 24 = 11,400 W → 0.48 J/tok |
| Decode (output tokens) | 6 J/token | Memory-bandwidth-bound at low batch (B≈1–4, matching NRP traffic): must load 300 GB active weights per step; 80.4 TB/s → 270 tok/s; ~3,000 W additional → ~11 J at B=1, ~4 J at B=4 |
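The derivations in this table can be checked with a few lines of arithmetic (all inputs are the assumed frontier spec values above, not measurements):

```python
# Prefill: compute-bound. 24 H100s at 60% utilization serve ~23,700 tok/s;
# power above idle is (700 - 225) W per GPU across 24 GPUs.
prefill_watts_above_idle = (700 - 225) * 24              # 11,400 W
prefill_j_per_token = prefill_watts_above_idle / 23_700  # ~0.48 J/token

# Decode: memory-bandwidth-bound at batch 1. Each step loads ~300 GB of
# active weights; 24 GPUs x 3.35 TB/s = 80.4 TB/s aggregate HBM bandwidth.
decode_steps_per_sec = 80.4e12 / 300e9                   # ~268 tok/s at B=1
decode_j_per_token_b1 = 3_000 / decode_steps_per_sec     # ~11 J/token at B=1
```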
Per-model comparison formula
frontier_watts = 24 × 225 # idle floor: 5,400 W
+ prompt_tokens_per_sec × 0.5 # marginal prefill
+ generation_tokens_per_sec × 6.0 # marginal decode
ratio = frontier_watts / nrp_measured_watts
This formula is applied to each NRP model using its actual observed prompt and generation token rates. The comparison is shown as a watts bar on each model card and as a CO₂/token bar chart (colored by energy ratio) across all models.
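As code, with the idle floor and marginal-energy constants from the tables above (note that J/token multiplied by tok/s gives watts directly):

```python
IDLE_FLOOR_WATTS = 24 * 225    # 5,400 W: 24x H100 at measured idle
PREFILL_J_PER_TOKEN = 0.5      # marginal prefill energy (assumed spec)
DECODE_J_PER_TOKEN = 6.0       # marginal decode energy at low batch (assumed spec)

def frontier_watts(prompt_tok_s: float, generation_tok_s: float) -> float:
    """Power the frontier deployment would draw to serve the same token rates."""
    return (IDLE_FLOOR_WATTS
            + prompt_tok_s * PREFILL_J_PER_TOKEN       # J/token x token/s = W
            + generation_tok_s * DECODE_J_PER_TOKEN)

def energy_ratio(prompt_tok_s: float, generation_tok_s: float,
                 nrp_measured_watts: float) -> float:
    return frontier_watts(prompt_tok_s, generation_tok_s) / nrp_measured_watts
```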
Example: Qwen3.5-397B at low traffic
NRP measured: 783 W (8× A100, real DCGM reading)
Prompt rate: 3 tok/s
Generation rate: 15 tok/s
Commercial frontier equivalent:
Idle floor: 24 × 225 = 5,400 W
Prefill: 3 × 0.5 = 2 W
Decode: 15 × 6.0 = 90 W
Total: 5,492 W
Ratio: 5,492 / 783 = 7.0×. The NRP deployment uses roughly one-seventh the energy to serve the same tokens.
Why this is fair: Both sides include hosting costs. The NRP model draws ~783 W to keep 8× A100 GPUs powered; the commercial frontier would draw ~5,400 W just to keep 24× H100 GPUs powered. At low traffic, both clusters are underutilized — the comparison reflects the real energy cost of maintaining each capability, not an idealized marginal-only estimate for one side.
What this doesn't measure: NRP figures are GPU power only (no CPU, DRAM, PUE). Applying a typical 1.5–2× system-overhead correction to both sides would preserve the ratio. Commercial cloud PUE (~1.1–1.3) vs. research facility PUE may differ slightly but does not change the order-of-magnitude comparison.
Each card also displays the observed prompt token percentage — what fraction of total tokens are input (prefill) vs output (decode), for transparency about the workload driving the comparison.
Dashboard visualizations
The dashboard presents the frontier comparison in two ways:
- CO₂/token bar chart: Bars show the 24-hour average CO₂ per token (mg) for each active NRP model. Green diamond (◆) markers show the current 5-minute rate. Bar color ranges from green (low carbon per token) to red (high) — reflecting both energy efficiency and grid carbon intensity at the model's location.
- Model card "N× less energy" metric: Each model card shows how many times more power the commercial frontier would require to serve the same tokens (e.g. "7.0× less energy"). This is frontier_watts / nrp_measured_watts.
- Power comparison bar (per card): A horizontal bar comparing NRP measured watts (green) against the frontier equivalent (purple band), with a breakdown of idle floor vs. marginal power.
Energy efficiency across models (J/token view)
The bar chart can be toggled to show J/token (joules per token) — a hardware- and grid-neutral measure of how efficiently each model converts GPU power into tokens:
J_per_token = power_watts / total_tokens_per_sec
Unlike CO₂/token, this metric is independent of grid carbon intensity, making it directly comparable across models regardless of where they are hosted. A model with high J/token is underutilized relative to its GPU allocation — the GPUs draw near-idle power while producing few tokens. Key drivers of high J/token include:
- Low traffic: GPU idle power dominates when few tokens are being processed
- Oversized allocation: More GPUs than needed for the model's actual demand
- Model architecture: Dense models activate all parameters on every token, while Mixture-of-Experts (MoE) models route each token through only a small subset of their weights. A dense 31B model may therefore have a higher J/token than a 397B MoE that activates only ~60 B parameters per token
Note: J/token only appears for models with >5 tok/s total throughput. Below this threshold the metric is dominated by idle power and not meaningful as an efficiency measure. Bar color ranges from green (most efficient) to red (least efficient) relative to the other active NRP models. Tooltips show the underlying watts, tok/s, and GPU hardware for context.
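A minimal sketch of the J/token computation and its reporting threshold (a watt is one joule per second, so watts divided by tokens per second is joules per token):

```python
MIN_TOKENS_PER_SEC = 5.0  # below this, idle power dominates the ratio

def joules_per_token(power_watts: float, total_tokens_per_sec: float):
    """Grid-neutral efficiency metric; None below the reporting threshold."""
    if total_tokens_per_sec < MIN_TOKENS_PER_SEC:
        return None  # not meaningful as an efficiency measure
    return power_watts / total_tokens_per_sec
```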
Cumulative CO₂ and cluster CO₂/token (time-series view)
For the time-series charts, we query the Prometheus range API at adaptive resolution (5-minute steps for 24 h, hourly for 7 d, 6-hourly for 30 d), apply the same per-node intensity to each sample, and integrate:
CO₂_kg_cumulative = Σ (P_watts_i × intensity_i × Δt_hrs / 1000)
# Total token rate used as denominator for cluster-wide CO₂/token:
sum by (namespace, container) (
rate(vllm:generation_tokens_total[5m]) + rate(vllm:prompt_tokens_total[5m])
)
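The integration step can be sketched as a Riemann sum over the sampled series (the sample data here is synthetic, not cluster telemetry):

```python
def cumulative_co2_kg(samples, step_hours: float) -> float:
    """samples: (power_watts, intensity_kg_per_kwh) pairs at a fixed step.
    Each sample contributes P x intensity x dt / 1000 kg of CO2."""
    return sum(p * i * step_hours / 1000.0 for p, i in samples)

# 24 h of a constant 1,000 W load at 0.2 kg/kWh, sampled every 5 minutes,
# should integrate to 24 kWh x 0.2 kg/kWh = 4.8 kg.
day = [(1000.0, 0.2)] * (24 * 12)
```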
Limitations and Future Work
- CPU, DRAM, and network power are excluded (typically adds 15–30% to total)
- Datacenter PUE (Power Usage Effectiveness) is not applied (adds 10–50%)
- Carbon intensity values are annual averages; real-time grid mix varies hour-to-hour
- Nodes not matching hostname or namespace patterns fall back to the California (CAMX) average
- DCGM reports a single power reading per GPU; we cannot decompose energy into prefill vs decode phases on the NRP side
- The Commercial frontier model spec is an informed estimate, not a known architecture; the comparison is illustrative of the size/energy tradeoff, not a precise measurement of any specific commercial model
- Decode marginal energy (6 J/token) assumes low batch matching typical NRP traffic; commercial providers batching hundreds of concurrent requests achieve lower per-token decode costs, but also require the same or greater hosting floor
- The eGRID intensity table in internal/carbon/intensity.go will be updated as new annual data is published
For a more complete lifecycle analysis including training, hardware manufacturing, and cooling overhead, see the references below.
References
| Source | Key contribution |
|---|---|
| Luccioni, Jernite & Strubell (2024). Power Hungry Processing. ACM FAccT. | Per-query energy & CO₂ measurements for open LLMs (BLOOMz, Flan-T5, etc.) |
| Delavande, Pierrard & Luccioni (2025). Small Talk, Big Impact. arXiv. | Per-token energy decomposition: prefill vs. decode scaling; basis for J/token envelope estimates |
| Patterson et al. (2021). Carbon Emissions and Large Neural Network Training. arXiv. | GPT-3 training energy (1,287 MWh); location & hardware efficiency analysis |
| Li et al. (2023). Making AI Less Thirsty. arXiv. | GPT-3 inference water and energy footprint estimates |
| Shehabi et al. (2024). Powering Intelligence. Lawrence Berkeley National Laboratory. | US AI data center energy projections; efficiency trajectories |
| IEA (2024). Electricity 2024. | Global data center energy (460 TWh 2022, >1000 TWh by 2026); ChatGPT ≈ 10× Google search |
| US EPA eGRID (2022). | Regional grid carbon intensity values used in this dashboard |
| Lottick et al. — CodeCarbon project. | Open-source framework for tracking ML carbon emissions; inspiration for this work |
Source Code
This dashboard is open source: github.com/boettiger-lab/nrp-carbon-api. The carbon intensity lookup table is in internal/carbon/intensity.go.