Overview
This dashboard estimates the carbon footprint of large language model (LLM) inference running on the National Research Platform (NRP) Nautilus cluster. All measurements are derived passively from existing telemetry — no instrumentation of the LLM services is required.
Key principle: Carbon = Energy × Grid Intensity. We measure GPU energy directly from hardware sensors already reporting to the cluster's Prometheus instance, then multiply by published grid carbon intensity for each node's location.
Step 1 — Measuring GPU Power
GPU power draw is read from NVIDIA DCGM Exporter (Data Center GPU Manager), which runs as a DaemonSet on every GPU node in the NRP cluster. DCGM reads the hardware power sensor via NVML (nvmlDeviceGetPowerUsage) and exports it to Prometheus as:
DCGM_FI_DEV_POWER_USAGE{namespace, container, Hostname, ...} # watts, per GPU
We aggregate all GPUs belonging to each LLM pod using a PromQL sum:
sum by (namespace, container, Hostname) (
avg_over_time(DCGM_FI_DEV_POWER_USAGE{namespace=~"nrp-llm|sdsc-llm"}[5m])
)
The avg_over_time(...[5m]) wrapper smooths short-lived spikes into a 5-minute
rolling average. This gives total GPU power in watts for each deployed model.
Measurements are scraped every 30 seconds.
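For illustration, the power expression above can be issued as a Prometheus instant query over the HTTP API. The base URL below is a placeholder, not the cluster's actual endpoint:

```python
from urllib.parse import urlencode

# PromQL from above: per-pod GPU power, smoothed over a 5-minute window
POWER_QUERY = (
    'sum by (namespace, container, Hostname) ('
    'avg_over_time(DCGM_FI_DEV_POWER_USAGE{namespace=~"nrp-llm|sdsc-llm"}[5m])'
    ')'
)

def power_query_url(prometheus_base: str) -> str:
    """Build an instant-query URL for the per-pod GPU power expression."""
    return f"{prometheus_base}/api/v1/query?" + urlencode({"query": POWER_QUERY})

# Example against a hypothetical Prometheus endpoint:
url = power_query_url("https://prometheus.example.edu")
```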
Limitation: We measure GPU power only. CPU, RAM, storage, and networking power are not included. For GPU-heavy inference workloads this typically represents 70–85% of total server power. A full datacenter PUE factor (usually 1.1–1.5) is also not applied. Our figures are therefore conservative underestimates of total facility energy.
Step 2 — Token Throughput (Prompt + Generation)
Token throughput is read from vLLM's built-in Prometheus metrics. Two counters are tracked separately and summed into a total token rate:
# Output (decode) tokens generated per second:
sum by (namespace, container, model_name) (
rate(vllm:generation_tokens_total[5m])
)
# Input (prefill) tokens processed per second:
sum by (namespace, container, model_name) (
rate(vllm:prompt_tokens_total[5m])
)
Both phases consume GPU energy: the prefill phase runs attention over every input token before the first output token is produced; the decode phase generates one output token at a time. Counting only output tokens would therefore misrepresent the true cost of inference.
Why this matters for agentic AI: In agentic workloads, a single user-visible response may involve tens of thousands of input tokens across multiple hidden tool calls — input tokens can account for 90–95% of all tokens processed. If CO₂/token were computed over output tokens only, it would overstate the per-token cost by 10–20×. The dashboard uses total tokens (input + output) as the denominator, so reported CO₂/token reflects the full energy cost amortized across every token the GPU processed.
The CO₂ per token metric is only reported when total throughput (input + output) is at least 5 tok/s. Below this threshold the ratio is dominated by near-idle power draw rather than the model's efficiency under real load. The dashboard cards break down throughput into Output tok/s, Input tok/s, and Total tok/s for transparency.
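A minimal sketch of the throughput accounting and the 5 tok/s reporting threshold described above (function names are illustrative, not taken from the dashboard's code):

```python
MIN_TOKENS_PER_SEC = 5.0  # below this, the ratio reflects idle power, not efficiency

def total_token_rate(prompt_tok_s: float, generation_tok_s: float) -> float:
    """Total throughput: prefill (input) plus decode (output) tokens per second."""
    return prompt_tok_s + generation_tok_s

def co2_per_token_reportable(prompt_tok_s: float, generation_tok_s: float) -> bool:
    """CO2/token is only shown when total throughput clears the threshold."""
    return total_token_rate(prompt_tok_s, generation_tok_s) >= MIN_TOKENS_PER_SEC
```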
Step 3 — Carbon Intensity by Grid Location
NRP nodes are hosted at universities and research institutions across the United States and internationally, and grid carbon intensity varies substantially by location. We use EPA eGRID 2022 subregion averages matched to each node's Hostname label. When the hostname doesn't match, a secondary lookup by Kubernetes namespace prefix is attempted (e.g. sdsc-llm maps to California). If neither matches, the fallback is the California CAMX intensity (0.198 kg CO₂/kWh), since the majority of NRP Nautilus nodes are hosted at SDSC in San Diego:
| Institution / Region | Hostname pattern | eGRID Subregion | Intensity (kg CO₂/kWh) |
|---|---|---|---|
| California (SDSC, CSUS, Caltech, Humboldt, UCSD, UCLA, UCSB, CalIT2, CSUMB) | *.sdsc.*, *.csus.*, *.humboldt.*, *.caltech.*, *.ucsd.*, *.ucla.*, *.ucsb.*, *.calit2.*, csumb.*, nautilus-*, sdsc-* | CAMX | 0.198 |
| NYU (New York) | *.nyu.* | NYUP | 0.174 |
| UNL (Nebraska) | *.unl.* | MROW | 0.531 |
| UT Austin / TACC (Texas) | *.utexas.*, *.tacc.* | ERCO | 0.393 |
| Clemson (South Carolina) | *.clemson.* | SRSO | 0.423 |
| University of Hawaii | *.hawaii.* | HIOA | 0.702 |
| K-State (Kansas) | *.ksu.* | SPSO | 0.555 |
| KREONET (South Korea) | *.kreonet.* | Korean grid | 0.459 |
| All other nodes | — | CAMX (California default) | 0.198 |
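The lookup chain can be sketched as follows. The real table lives in Go (internal/carbon/intensity.go); this Python version abbreviates the pattern set and is illustrative only:

```python
import fnmatch

# Abbreviated subset of the intensity table above (kg CO2/kWh).
HOSTNAME_INTENSITY = {
    "*.sdsc.*": 0.198, "*.nyu.*": 0.174, "*.unl.*": 0.531,
    "*.utexas.*": 0.393, "*.hawaii.*": 0.702,
}
NAMESPACE_INTENSITY = {"sdsc-llm": 0.198}  # namespace-prefix fallback
DEFAULT_INTENSITY = 0.198  # CAMX: most Nautilus nodes are hosted at SDSC

def grid_intensity(hostname: str, namespace: str) -> float:
    """Hostname glob match first, then namespace prefix, then CAMX default."""
    for pattern, kg_per_kwh in HOSTNAME_INTENSITY.items():
        if fnmatch.fnmatch(hostname, pattern):
            return kg_per_kwh
    for prefix, kg_per_kwh in NAMESPACE_INTENSITY.items():
        if namespace.startswith(prefix):
            return kg_per_kwh
    return DEFAULT_INTENSITY
```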
Step 4 — Carbon Calculations
Three derived metrics are computed:
CO₂ grams per hour
CO₂_g/hr = P_watts × intensity_kg/kWh
        = P_watts × intensity   # P watts for 1 h = P/1000 kWh; × (kg/kWh) × (1000 g/kg) = P × intensity in g/h
CO₂ milligrams per token (total tokens)
total_tokens_per_sec = prompt_tokens_per_sec + generation_tokens_per_sec
CO₂_mg/token = P_watts × intensity_kg/kWh × (1e6 mg/kg) / (3.6e6 J/kWh) / total_tokens_per_sec
= P_watts × intensity × 0.2778 / total_tokens_per_sec
The denominator is total tokens processed (prompt + generation), not output tokens alone. This is the correct normalization: the GPU expends energy on every token it touches — both the input context during prefill and each generated token during decode.
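The two formulas above, as a runnable sketch. With the 783 W, 0.198 kg/kWh, 18 tok/s figures used in the example later in this page, they give about 155 g/hr and about 2.4 mg/token:

```python
def co2_grams_per_hour(power_watts: float, intensity_kg_per_kwh: float) -> float:
    """P watts for one hour = P/1000 kWh; the kg->g and /1000 factors cancel."""
    return power_watts * intensity_kg_per_kwh

def co2_mg_per_token(power_watts: float, intensity_kg_per_kwh: float,
                     total_tokens_per_sec: float) -> float:
    """Energy per token (J) -> kWh -> kg CO2 -> mg CO2."""
    MG_PER_KG = 1e6
    JOULES_PER_KWH = 3.6e6
    return (power_watts * intensity_kg_per_kwh * MG_PER_KG
            / JOULES_PER_KWH / total_tokens_per_sec)
```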
Commercial frontier comparison — per-model physics estimate
Rather than comparing against a fixed hypothetical workload, the dashboard computes a per-model, apples-to-apples power comparison: for each NRP model's actual observed token throughput (prompt and generation rates), what total power would a commercial frontier model draw to serve the same tokens at the same rate?
Key principle: The comparison holds everything fixed except model size — same tokens, same traffic level, same grid. The frontier model's CO₂/token is computed using the same grid carbon intensity as the NRP model it's compared against. This isolates the model size/energy tradeoff: a larger frontier model may produce higher-quality output, but requires substantially more hardware — and therefore more energy — to host and run. Because grid intensity is the same on both sides, the CO₂ ratio equals the watts ratio.
Commercial frontier model specification
We model a commercial frontier LLM as a ~1.5 T parameter Mixture-of-Experts (MoE) with ~300 B active parameters per token, served in FP8 on H100-80GB GPUs:
| Parameter | Value | Derivation |
|---|---|---|
| Total parameters | ~1.5 T | Frontier MoE estimate (2026) |
| Active parameters/token | ~300 B | Typical MoE activation ratio |
| Weight memory (FP8) | ~1.5 TB | 1 byte/param × 1.5 T params |
| Minimum GPUs | 24× H100-80GB | 1.92 TB total HBM for weights + KV cache + activations |
| Idle power | 225 W/GPU × 24 = 5,400 W | H100 SXM measured idle |
Marginal token energy (above idle)
| Phase | Marginal energy | Derivation |
|---|---|---|
| Prefill (input tokens) | 0.5 J/token | Compute-bound: 600 GFLOPs/token, 24× H100 at 60% util → 23,700 tok/s; additional power (700−225) W × 24 = 11,400 W → 0.48 J/tok |
| Decode (output tokens) | 6 J/token | Memory-bandwidth-bound at low batch (B≈1–4, matching NRP traffic): must load 300 GB active weights per step; 80.4 TB/s → 270 tok/s; ~3,000 W additional → ~11 J at B=1, ~4 J at B=4 |
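The derivations in this table can be checked with a few lines of arithmetic (all inputs are the assumed frontier spec values above, not measurements):

```python
# Prefill: compute-bound. 24 H100s at 60% utilization serve ~23,700 tok/s;
# power above idle is (700 - 225) W per GPU across 24 GPUs.
prefill_watts_above_idle = (700 - 225) * 24              # 11,400 W
prefill_j_per_token = prefill_watts_above_idle / 23_700  # ~0.48 J/token

# Decode: memory-bandwidth-bound at batch 1. Each step loads ~300 GB of
# active weights; 24 GPUs x 3.35 TB/s = 80.4 TB/s aggregate HBM bandwidth.
decode_steps_per_sec = 80.4e12 / 300e9                   # ~268 tok/s at B=1
decode_j_per_token_b1 = 3_000 / decode_steps_per_sec     # ~11 J/token at B=1
```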
Per-model comparison formula
frontier_watts = 24 × 225 # idle floor: 5,400 W
+ prompt_tokens_per_sec × 0.5 # marginal prefill
+ generation_tokens_per_sec × 6.0 # marginal decode
ratio = frontier_watts / nrp_measured_watts
This formula is applied to each NRP model using its actual observed prompt and generation token rates. The comparison is shown as a watts bar on each model card and as a CO₂/token bar chart (colored by energy ratio) across all models.
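As code, with the idle floor and marginal-energy constants from the tables above (note that J/token multiplied by tok/s gives watts directly):

```python
IDLE_FLOOR_WATTS = 24 * 225    # 5,400 W: 24x H100 at measured idle
PREFILL_J_PER_TOKEN = 0.5      # marginal prefill energy (assumed spec)
DECODE_J_PER_TOKEN = 6.0       # marginal decode energy at low batch (assumed spec)

def frontier_watts(prompt_tok_s: float, generation_tok_s: float) -> float:
    """Power the frontier deployment would draw to serve the same token rates."""
    return (IDLE_FLOOR_WATTS
            + prompt_tok_s * PREFILL_J_PER_TOKEN       # J/token x token/s = W
            + generation_tok_s * DECODE_J_PER_TOKEN)

def energy_ratio(prompt_tok_s: float, generation_tok_s: float,
                 nrp_measured_watts: float) -> float:
    return frontier_watts(prompt_tok_s, generation_tok_s) / nrp_measured_watts
```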
Example: Qwen3.5-397B at low traffic
NRP measured: 783 W (8× A100, real DCGM reading)
Prompt rate: 3 tok/s
Generation rate: 15 tok/s
Commercial frontier equivalent:
Idle floor: 24 × 225 = 5,400 W
Prefill: 3 × 0.5 = 2 W
Decode: 15 × 6.0 = 90 W
Total: 5,492 W
Ratio: 5,492 / 783 = 7.0×. The NRP deployment uses roughly one-seventh the energy to serve the same tokens.
Why this is fair: Both sides include hosting costs. The NRP model draws ~783 W to keep 8× A100 GPUs powered; the commercial frontier would draw ~5,400 W just to keep 24× H100 GPUs powered. At low traffic, both clusters are underutilized — the comparison reflects the real energy cost of maintaining each capability, not an idealized marginal-only estimate for one side.
What this doesn't measure: NRP figures are GPU power only (no CPU, DRAM, PUE). Applying a typical 1.5–2× system-overhead correction to both sides would preserve the ratio. Commercial cloud PUE (~1.1–1.3) vs. research facility PUE may differ slightly but does not change the order-of-magnitude comparison.
Each card also displays the observed prompt token percentage — what fraction of total tokens are input (prefill) vs output (decode), for transparency about the workload driving the comparison.
Dashboard visualizations
The dashboard presents the frontier comparison in two ways:
- CO₂/token bar chart: Bars show the 24-hour average CO₂ per token (mg) for each active NRP model. Green diamond (◆) markers show the current 5-minute rate. Bar color ranges from green (low carbon per token) to red (high) — reflecting both energy efficiency and grid carbon intensity at the model's location.
- Model card "N× less energy" metric: Each model card shows how many times more power the commercial frontier would require to serve the same tokens (e.g. "7.0× less energy"). This is frontier_watts / nrp_measured_watts.
- Power comparison bar (per card): A horizontal bar comparing NRP measured watts (green) against the frontier equivalent (purple band), with a breakdown of idle floor vs. marginal power.
Energy efficiency across models (J/token view)
The bar chart can be toggled to show J/token (joules per token) — a hardware- and grid-neutral measure of how efficiently each model converts GPU power into tokens:
J_per_token = power_watts / total_tokens_per_sec
Unlike CO₂/token, this metric is independent of grid carbon intensity, making it directly comparable across models regardless of where they are hosted. A model with high J/token is underutilized relative to its GPU allocation — the GPUs draw near-idle power while producing few tokens. Key drivers of high J/token include:
- Low traffic: GPU idle power dominates when few tokens are being processed
- Oversized allocation: More GPUs than needed for the model's actual demand
- Model architecture: Dense models activate all parameters on every token, while Mixture-of-Experts (MoE) models route each token through only a small subset of their weights. A dense 31B model may therefore have a higher J/token than a 397B MoE that activates only ~60 B parameters per token
Note: J/token only appears for models with >5 tok/s total throughput. Below this threshold the metric is dominated by idle power and not meaningful as an efficiency measure. Bar color ranges from green (most efficient) to red (least efficient) relative to the other active NRP models. Tooltips show the underlying watts, tok/s, and GPU hardware for context.
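A minimal sketch of the J/token computation and its reporting threshold (a watt is one joule per second, so watts divided by tokens per second is joules per token):

```python
MIN_TOKENS_PER_SEC = 5.0  # below this, idle power dominates the ratio

def joules_per_token(power_watts: float, total_tokens_per_sec: float):
    """Grid-neutral efficiency metric; None below the reporting threshold."""
    if total_tokens_per_sec < MIN_TOKENS_PER_SEC:
        return None  # not meaningful as an efficiency measure
    return power_watts / total_tokens_per_sec
```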
Cumulative CO₂ and cluster CO₂/token (time-series view)
For the time-series charts, we query the Prometheus range API at adaptive resolution (5-minute steps for 24 h, hourly for 7 d, 6-hourly for 30 d), apply the same per-node intensity to each sample, and integrate:
CO₂_kg_cumulative = Σ (P_watts_i × intensity_i × Δt_hrs / 1000)
# Total token rate used as denominator for cluster-wide CO₂/token:
sum by (namespace, container) (
rate(vllm:generation_tokens_total[5m]) + rate(vllm:prompt_tokens_total[5m])
)
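The integration step can be sketched as a Riemann sum over the sampled series (the sample data here is synthetic, not cluster telemetry):

```python
def cumulative_co2_kg(samples, step_hours: float) -> float:
    """samples: (power_watts, intensity_kg_per_kwh) pairs at a fixed step.
    Each sample contributes P x intensity x dt / 1000 kg of CO2."""
    return sum(p * i * step_hours / 1000.0 for p, i in samples)

# 24 h of a constant 1,000 W load at 0.2 kg/kWh, sampled every 5 minutes,
# should integrate to 24 kWh x 0.2 kg/kWh = 4.8 kg.
day = [(1000.0, 0.2)] * (24 * 12)
```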
Limitations and Future Work
- CPU, DRAM, and network power are excluded (typically adds 15–30% to total)
- Datacenter PUE (Power Usage Effectiveness) is not applied (adds 10–50%)
- Carbon intensity values are annual averages; real-time grid mix varies hour-to-hour
- Nodes not matching hostname or namespace patterns fall back to the California (CAMX) average
- DCGM reports a single power reading per GPU; we cannot decompose energy into prefill vs decode phases on the NRP side
- The Commercial frontier model spec is an informed estimate, not a known architecture; the comparison is illustrative of the size/energy tradeoff, not a precise measurement of any specific commercial model
- Decode marginal energy (6 J/token) assumes low batch matching typical NRP traffic; commercial providers batching hundreds of concurrent requests achieve lower per-token decode costs, but also require the same or greater hosting floor
- The eGRID intensity table in internal/carbon/intensity.go will be updated as new annual data is published
For a more complete lifecycle analysis including training, hardware manufacturing, and cooling overhead, see the references below.
References
| Source | Key contribution |
|---|---|
| Luccioni, Jernite & Strubell (2024). Power Hungry Processing. ACM FAccT. | Per-query energy & CO₂ measurements for open LLMs (BLOOMz, Flan-T5, etc.) |
| Delavande, Pierrard & Luccioni (2025). Small Talk, Big Impact. arXiv. | Per-token energy decomposition: prefill vs. decode scaling; basis for J/token envelope estimates |
| Patterson et al. (2021). Carbon Emissions and Large Neural Network Training. arXiv. | GPT-3 training energy (1,287 MWh); location & hardware efficiency analysis |
| Li et al. (2023). Making AI Less Thirsty. arXiv. | GPT-3 inference water and energy footprint estimates |
| Shehabi et al. (2024). Powering Intelligence. Lawrence Berkeley National Laboratory. | US AI data center energy projections; efficiency trajectories |
| IEA (2024). Electricity 2024. | Global data center energy (460 TWh 2022, >1000 TWh by 2026); ChatGPT ≈ 10× Google search |
| US EPA eGRID (2022). | Regional grid carbon intensity values used in this dashboard |
| Lottick et al. — CodeCarbon project. | Open-source framework for tracking ML carbon emissions; inspiration for this work |
Source Code
This dashboard is open source: github.com/boettiger-lab/nrp-carbon-api. The carbon intensity lookup table is in internal/carbon/intensity.go.