NRP Carbon — Methodology

Overview

This dashboard estimates the carbon footprint of large language model (LLM) inference running on the National Research Platform (NRP) Nautilus cluster. All measurements are derived passively from existing telemetry — no instrumentation of the LLM services is required.

Key principle: Carbon = Energy × Grid Intensity. We measure GPU energy directly from hardware sensors already reporting to the cluster's Prometheus instance, then multiply by published grid carbon intensity for each node's location.

Step 1 — Measuring GPU Power

GPU power draw is read from NVIDIA DCGM Exporter (Data Center GPU Manager), which runs as a DaemonSet on every GPU node in the NRP cluster. DCGM reads the hardware power sensor via NVML (nvmlDeviceGetPowerUsage) and exports it to Prometheus as:

DCGM_FI_DEV_POWER_USAGE{namespace, container, Hostname, ...}  # watts, per GPU

We aggregate all GPUs belonging to each LLM pod using a PromQL sum:

sum by (namespace, container, Hostname) (
  avg_over_time(DCGM_FI_DEV_POWER_USAGE{namespace=~"nrp-llm|sdsc-llm"}[5m])
)

The avg_over_time(...[5m]) wrapper smooths short-lived spikes into a 5-minute rolling average. This gives total GPU power in watts for each deployed model. Measurements are scraped every 30 seconds.
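The smoothed power query can be issued against the Prometheus HTTP instant-query API and its response flattened into a per-pod map. A minimal sketch — the response shape follows Prometheus's `/api/v1/query` JSON format, and the sample values below are illustrative, not real measurements:

```python
# Sketch: the per-model GPU power query, plus a parser for the
# Prometheus instant-query JSON response it returns.
POWER_QUERY = (
    'sum by (namespace, container, Hostname) ('
    'avg_over_time(DCGM_FI_DEV_POWER_USAGE{namespace=~"nrp-llm|sdsc-llm"}[5m])'
    ')'
)

def parse_power(response: dict) -> dict:
    """Map (namespace, container, Hostname) -> watts from a
    Prometheus /api/v1/query response body."""
    out = {}
    for series in response["data"]["result"]:
        m = series["metric"]
        key = (m.get("namespace", ""), m.get("container", ""), m.get("Hostname", ""))
        out[key] = float(series["value"][1])  # value is [timestamp, "value-as-string"]
    return out

# Illustrative response (not a real measurement):
sample = {"data": {"result": [
    {"metric": {"namespace": "nrp-llm", "container": "vllm", "Hostname": "node-a"},
     "value": [1700000000, "783.2"]},
]}}
```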

Limitation: We measure GPU power only. CPU, RAM, storage, and networking power are not included. For GPU-heavy inference workloads this typically represents 70–85% of total server power. A full datacenter PUE factor (usually 1.1–1.5) is also not applied. Our figures are therefore conservative underestimates of total facility energy.

Step 2 — Token Throughput (Prompt + Generation)

Token throughput is read from vLLM's built-in Prometheus metrics. Two counters are tracked separately and summed into a total token rate:

# Output (decode) tokens generated per second:
sum by (namespace, container, model_name) (
  rate(vllm:generation_tokens_total[5m])
)

# Input (prefill) tokens processed per second:
sum by (namespace, container, model_name) (
  rate(vllm:prompt_tokens_total[5m])
)

Both phases consume GPU energy: the prefill phase runs attention over every input token before the first output token is produced; the decode phase generates one output token at a time. Counting only output tokens would therefore misrepresent the true energy cost of inference.

Why this matters for agentic AI: In agentic workloads, a single user-visible response may involve tens of thousands of input tokens across multiple hidden tool calls — input tokens can account for 90–95% of all tokens processed. If CO₂/token were computed over output tokens only, it would overstate the per-token cost by 10–20×. The dashboard uses total tokens (input + output) as the denominator, so reported CO₂/token reflects the full energy cost amortized across every token the GPU processed.

The CO₂ per token metric is only reported when total throughput (input + output) is at least 5 tok/s. Below this threshold the ratio is dominated by near-idle power draw rather than the model's efficiency under real load. The dashboard cards break down throughput into Output tok/s, Input tok/s, and Total tok/s for transparency.
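The card breakdown and the 5 tok/s gate can be sketched as a small helper; the function name and return shape here are illustrative, not the dashboard's actual code:

```python
def throughput_summary(prompt_tps: float, gen_tps: float,
                       min_total_tps: float = 5.0):
    """Summarize throughput as shown on the dashboard cards.

    Returns None below the 5 tok/s activity threshold, where per-token
    ratios would be dominated by near-idle power draw.
    """
    total = prompt_tps + gen_tps
    if total < min_total_tps:
        return None
    return {
        "input_tps": prompt_tps,
        "output_tps": gen_tps,
        "total_tps": total,
        "prompt_pct": 100.0 * prompt_tps / total,  # share of input tokens
    }
```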

Step 3 — Carbon Intensity by Grid Location

NRP nodes are hosted at universities and research institutions across the United States and internationally. Grid carbon intensity varies substantially by location. We use EPA eGRID 2022 subregion averages matched to each node's Hostname label. When the hostname doesn't match, a secondary lookup by Kubernetes namespace prefix is attempted (e.g. sdsc-llm maps to California). If neither matches, the fallback is the California CAMX intensity (0.198 kg CO₂/kWh), since the majority of NRP Nautilus nodes are hosted at SDSC in San Diego:

| Institution / Region | Hostname pattern | eGRID subregion | Intensity (kg CO₂/kWh) |
|---|---|---|---|
| California (SDSC, CSUS, Caltech, Humboldt, UCSD, UCLA, UCSB, CalIT2, CSUMB) | `*.sdsc.*`, `*.csus.*`, `*.humboldt.*`, `*.caltech.*`, `*.ucsd.*`, `*.ucla.*`, `*.ucsb.*`, `*.calit2.*`, `csumb.*`, `nautilus-*`, `sdsc-*` | CAMX | 0.198 |
| NYU (New York) | `*.nyu.*` | NYUP | 0.174 |
| UNL (Nebraska) | `*.unl.*` | MROW | 0.531 |
| UT Austin / TACC (Texas) | `*.utexas.*`, `*.tacc.*` | ERCO | 0.393 |
| Clemson (South Carolina) | `*.clemson.*` | SRSO | 0.423 |
| University of Hawaii | `*.hawaii.*` | HIOA | 0.702 |
| K-State (Kansas) | `*.ksu.*` | SPSO | 0.555 |
| KREONET (South Korea) | `*.kreonet.*` | Korean grid | 0.459 |
| All other nodes | (no match) | CAMX (California default) | 0.198 |
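The three-step lookup (hostname glob, then namespace prefix, then the CAMX default) can be sketched as follows. This is an abbreviated illustration, not the actual intensity.go implementation, and only a few table rows are included:

```python
from fnmatch import fnmatch

# Abbreviated lookup table; the full table lives in internal/carbon/intensity.go.
HOST_PATTERNS = [
    ("*.sdsc.*", 0.198), ("*.nyu.*", 0.174), ("*.unl.*", 0.531),
    ("*.utexas.*", 0.393), ("*.hawaii.*", 0.702), ("*.kreonet.*", 0.459),
]
NAMESPACE_PREFIXES = {"sdsc-llm": 0.198}  # secondary lookup
DEFAULT_INTENSITY = 0.198  # CAMX: most Nautilus nodes are hosted at SDSC

def grid_intensity(hostname: str, namespace: str) -> float:
    """kg CO₂/kWh for a node: hostname glob -> namespace prefix -> default."""
    for pattern, kg_per_kwh in HOST_PATTERNS:
        if fnmatch(hostname, pattern):
            return kg_per_kwh
    for prefix, kg_per_kwh in NAMESPACE_PREFIXES.items():
        if namespace.startswith(prefix):
            return kg_per_kwh
    return DEFAULT_INTENSITY
```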

Step 4 — Carbon Calculations

Three derived metrics are computed:

CO₂ grams per hour

CO₂_g/hr = P_watts × intensity_kg/kWh
         = P_watts × intensity   # because W × kg/kWh = W × kg/(1000W·h) × 1000 = g/h

CO₂ milligrams per token (total tokens)

total_tokens_per_sec = prompt_tokens_per_sec + generation_tokens_per_sec

CO₂_mg/token = P_watts × intensity_kg/kWh × (1e6 mg/kg) / (3.6e6 J/kWh) / total_tokens_per_sec
             = P_watts × intensity × 0.2778 / total_tokens_per_sec

The denominator is total tokens processed (prompt + generation), not output tokens alone. This is the correct normalization: the GPU expends energy on every token it touches — both the input context during prefill and each generated token during decode.
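The two formulas above can be written as a runnable sketch (function names are illustrative):

```python
def co2_g_per_hr(power_watts: float, intensity: float) -> float:
    # W × kg/kWh = (P/1000) kW × kg/kWh × 1000 g/kg = g/h,
    # so the two unit conversions cancel.
    return power_watts * intensity

def co2_mg_per_token(power_watts: float, intensity: float,
                     prompt_tps: float, gen_tps: float) -> float:
    total_tps = prompt_tps + gen_tps  # prompt + generation, not output alone
    # J/token → kWh (÷3.6e6) → kg (×intensity) → mg (×1e6)
    return power_watts * intensity * (1e6 / 3.6e6) / total_tps
```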

Commercial frontier comparison — per-model physics estimate

Rather than comparing against a fixed hypothetical workload, the dashboard computes a per-model, apples-to-apples power comparison: for each NRP model's actual observed token throughput (prompt and generation rates), what total power would a commercial frontier model draw to serve the same tokens at the same rate?

Key principle: The comparison holds everything fixed except model size — same tokens, same traffic level, same grid. The frontier model's CO₂/token is computed using the same grid carbon intensity as the NRP model it's compared against. This isolates the model size/energy tradeoff: a larger frontier model may produce higher-quality output, but requires substantially more hardware — and therefore more energy — to host and run. Because grid intensity is the same on both sides, the CO₂ ratio equals the watts ratio.

Commercial frontier model specification

We model a commercial frontier LLM as a ~1.5 T parameter Mixture-of-Experts (MoE) with ~300 B active parameters per token, served in FP8 on H100-80GB GPUs:

| Parameter | Value | Derivation |
|---|---|---|
| Total parameters | ~1.5 T | Frontier MoE estimate (2026) |
| Active parameters/token | ~300 B | Typical MoE activation ratio |
| Weight memory (FP8) | ~1.5 TB | 1 byte/param × 1.5 T params |
| Minimum GPUs | 24× H100-80GB | 1.92 TB total HBM for weights + KV cache + activations |
| Idle power | 225 W/GPU × 24 = 5,400 W | H100 SXM measured idle |

Marginal token energy (above idle)

| Phase | Marginal energy | Derivation |
|---|---|---|
| Prefill (input tokens) | 0.5 J/token | Compute-bound: 600 GFLOPs/token; 24× H100 at 60% util → 23,700 tok/s; additional power (700 − 225) × 24 = 11,400 W → 0.48 J/tok |
| Decode (output tokens) | 6 J/token | Memory-bandwidth-bound at low batch (B ≈ 1–4, matching NRP traffic): must load 300 GB of active weights per step; 80.4 TB/s aggregate bandwidth → 270 tok/s; ~3,000 W additional → ~11 J at B=1, ~4 J at B=4 |

Per-model comparison formula

frontier_watts = 24 × 225                                   # idle floor: 5,400 W
               + prompt_tokens_per_sec × 0.5                 # marginal prefill
               + generation_tokens_per_sec × 6.0             # marginal decode

ratio = frontier_watts / nrp_measured_watts

This formula is applied to each NRP model using its actual observed prompt and generation token rates. The comparison is shown as a watts bar on each model card and as a CO₂/token bar chart (colored by energy ratio) across all models.
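A sketch of the comparison formula, using the idle floor and marginal energies from the tables above (names are illustrative; the worked example below is reproduced up to display rounding):

```python
IDLE_W_PER_GPU = 225.0     # H100 SXM measured idle
N_GPUS = 24                # minimum GPUs for the modeled frontier MoE
PREFILL_J_PER_TOK = 0.5    # marginal prefill energy
DECODE_J_PER_TOK = 6.0     # marginal decode energy

def frontier_watts(prompt_tps: float, gen_tps: float) -> float:
    """Estimated power for the modeled frontier MoE serving the same tokens."""
    return (N_GPUS * IDLE_W_PER_GPU          # idle floor: 5,400 W
            + prompt_tps * PREFILL_J_PER_TOK  # J/tok × tok/s = W
            + gen_tps * DECODE_J_PER_TOK)

def energy_ratio(prompt_tps: float, gen_tps: float, nrp_watts: float) -> float:
    """Frontier-to-NRP watts ratio; equals the CO₂ ratio on the same grid."""
    return frontier_watts(prompt_tps, gen_tps) / nrp_watts
```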

Example: Qwen3.5-397B at low traffic

NRP measured:    783 W  (8× A100, real DCGM reading)
Prompt rate:       3 tok/s
Generation rate:  15 tok/s

Commercial frontier equivalent:
  Idle floor:    24 × 225 = 5,400 W
  Prefill:        3 × 0.5 =     2 W
  Decode:        15 × 6.0 =    90 W
  Total:                    5,492 W

Ratio: 5,492 W / 783 W = 7.0. The NRP deployment uses about one-seventh the energy for the same tokens.

Why this is fair: Both sides include hosting costs. The NRP model draws ~783 W to keep 8× A100 GPUs powered; the commercial frontier would draw ~5,400 W just to keep 24× H100 GPUs powered. At low traffic, both clusters are underutilized — the comparison reflects the real energy cost of maintaining each capability, not an idealized marginal-only estimate for one side.

What this doesn't measure: NRP figures are GPU power only (no CPU, DRAM, PUE). Applying a typical 1.5–2× system-overhead correction to both sides would preserve the ratio. Commercial cloud PUE (~1.1–1.3) vs. research facility PUE may differ slightly but does not change the order-of-magnitude comparison.

Each card also displays the observed prompt token percentage — what fraction of total tokens are input (prefill) vs output (decode), for transparency about the workload driving the comparison.

Dashboard visualizations

The dashboard presents the frontier comparison in two ways:

Energy efficiency across models (J/token view)

The bar chart can be toggled to show J/token (joules per token) — a hardware- and grid-neutral measure of how efficiently each model converts GPU power into tokens:

J_per_token = power_watts / total_tokens_per_sec

Unlike CO₂/token, this metric is independent of grid carbon intensity, making it directly comparable across models regardless of where they are hosted. A model with a high J/token is underutilized relative to its GPU allocation: the GPUs draw near-idle power while producing few tokens.

Note: J/token only appears for models with >5 tok/s total throughput. Below this threshold the metric is dominated by idle power and not meaningful as an efficiency measure. Bar color ranges from green (most efficient) to red (least efficient) relative to the other active NRP models. Tooltips show the underlying watts, tok/s, and GPU hardware for context.
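The J/token metric and its activity threshold, as a sketch (illustrative names):

```python
def joules_per_token(power_watts: float, prompt_tps: float, gen_tps: float,
                     min_total_tps: float = 5.0):
    """Grid-neutral efficiency: J/token, or None below the activity threshold."""
    total = prompt_tps + gen_tps
    if total < min_total_tps:
        return None  # dominated by idle power; not a meaningful efficiency figure
    return power_watts / total  # W ÷ (tok/s) = J/tok
```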

Cumulative CO₂ and cluster CO₂/token (time-series view)

For the time-series charts, we query the Prometheus range API at adaptive resolution (5-minute steps for 24 h, hourly for 7 d, 6-hourly for 30 d), apply the same per-node intensity to each sample, and integrate:

CO₂_kg_cumulative = Σ (P_watts_i × intensity_i × Δt_hrs / 1000)

# Total token rate used as denominator for cluster-wide CO₂/token:
sum by (namespace, container) (
  rate(vllm:generation_tokens_total[5m]) + rate(vllm:prompt_tokens_total[5m])
)
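The discrete integration can be sketched as follows; the sample format here is an assumption about the internal representation, not the dashboard's actual data model:

```python
def cumulative_co2_kg(samples) -> float:
    """Integrate CO₂ over range-query samples.

    samples: iterable of (power_watts, intensity_kg_per_kwh, dt_hours),
    one tuple per Prometheus range-API step.
    """
    # g/h × h = g; ÷1000 converts to kg
    return sum(p * i * dt / 1000.0 for p, i, dt in samples)
```

For example, 24 hourly samples at a steady 1,000 W on the CAMX grid (0.198 kg CO₂/kWh) integrate to 4.752 kg.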

Limitations and Future Work

For a more complete lifecycle analysis including training, hardware manufacturing, and cooling overhead, see the references below.

References

| Source | Key contribution |
|---|---|
| Luccioni, Jernite & Strubell (2024). Power Hungry Processing. ACM FAccT | Per-query energy & CO₂ measurements for open LLMs (BLOOMz, Flan-T5, etc.) |
| Delavande, Pierrard & Luccioni (2025). Small Talk, Big Impact. arXiv | Per-token energy decomposition: prefill vs. decode scaling; basis for J/token envelope estimates |
| Patterson et al. (2021). Carbon Emissions and Large Neural Network Training. arXiv | GPT-3 training energy (1,287 MWh); location & hardware efficiency analysis |
| Li et al. (2023). Making AI Less Thirsty. arXiv | GPT-3 inference water and energy footprint estimates |
| Shehabi et al. (2024). Powering Intelligence. Lawrence Berkeley National Laboratory | US AI data center energy projections; efficiency trajectories |
| IEA (2024). Electricity 2024 | Global data center energy (460 TWh in 2022, >1,000 TWh by 2026); ChatGPT query ≈ 10× a Google search |
| US EPA eGRID (2022) | Regional grid carbon intensity values used in this dashboard |
| Lottick et al. CodeCarbon project | Open-source framework for tracking ML carbon emissions; inspiration for this work |

Source Code

This dashboard is open source: github.com/boettiger-lab/nrp-carbon-api. The carbon intensity lookup table is in internal/carbon/intensity.go.