
Carbon Mapper

The Carbon Mapper reader is a typed client for Carbon Mapper's two API surfaces (REST catalog + STAC) that hides three real-world inconveniences:

  1. Two protocols, two bbox conventions – the REST catalog wants repeated ?bbox=W&bbox=S&... keys, STAC wants ?bbox=W,S,E,N. Mix them up and the server 422s with no useful error.
  2. Three resource types with hand-rolled joins – a plume is a single detection, a tile (a.k.a. scene) is the L2B raster it was detected in, and a source is the DBSCAN cluster of plumes at one physical site. The API exposes them via different endpoints with no FK-style links; this layer does the joins.
  3. Inconsistent error shapes – 404s look different per resource. We translate them into a small typed exception hierarchy so callers can except CMPlumeNotFound rather than string-match requests.HTTPError.

The catalog ships methane and CO₂ retrievals from Tanager-1, EMIT, AVIRIS-3, AVIRIS-NG, and GAO. Plume detection is operational on all platforms; published L2B scenes lag plume publication by weeks-to-months for Tanager (see Publication lag).

> CH4 only in this notebook. The reader is gas-agnostic internally, but query helpers are typed Literal["CH4"] for now; CO2 lands in a follow-up.

This notebook walks the layers bottom-up:

| Layer | Module | Use when |
|---|---|---|
| Raw HTTP | download.py | You need a field this layer doesn't expose, or you're prototyping a new endpoint wrapper. |
| Typed query | api_queries.py | Default. Returns CMRawPlume / CMTileItem / CMSource, never raw dicts. |
| Cross-resolution | api_queries.get_*_for_* / get_plume_context | One call → (plume, tile, source) – the typical ingestion shape. |
| Per-plume image | image.py / CMPlumeImage | Per-plume product bundle (mask, concentrations, IME, RGB, outline). Handles v3a (STAC) + v3c (CDN-only) via URL-pattern derivation. |
| L2B scene raster | rasters.py / CMImageRaster | Scene-level CMF retrieval and RGB sibling. |

Companion: products_explore.ipynb covers the raster wrappers in depth.

Install

The Carbon Mapper reader is gated behind the [carbonmapper] extra to keep georeader-spaceml's base install minimal. Install with:

pip install 'georeader-spaceml[carbonmapper]'

This pulls in pydantic (for CMRawPlume) and requests (for the HTTP client). No Azure or other cloud-vendor SDKs are required.

Authentication

Every cell below hits the live API and needs a Bearer token. CarbonMapperConfig.load() resolves credentials in this priority order:

  1. CARBONMAPPER_TOKEN environment variable – one-shot, no refresh.
  2. CARBONMAPPER_EMAIL + CARBONMAPPER_PASSWORD environment variables – refreshable via obtain_token.
  3. Config file at the canonical location ~/.georeader/auth_carbonmapper.json (matches sibling readers like emit.py / S2_SAFE_reader.py).

Legacy fallbacks (still honoured if present):

  • ./config/carbonmapper_token.json
  • ~/.config/carbonmapper/config.json
  • ~/.carbonmapper.json
  • ./.carbonmapper.json

On first run, if no config file is found and no env vars are set, a stub ~/.georeader/auth_carbonmapper.json is created with placeholder values for you to edit.
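The resolution order above can be pictured as one small standalone function. This is an illustrative sketch, not georeader API: resolve_credentials and CANONICAL are hypothetical names, and CarbonMapperConfig.load() remains the real entry point (it additionally handles token refresh via obtain_token).

```python
import json
import os
from pathlib import Path

# Hypothetical stand-in for CarbonMapperConfig.load()'s lookup order.
CANONICAL = Path.home() / ".georeader" / "auth_carbonmapper.json"

def resolve_credentials() -> dict:
    # 1. one-shot token from the environment
    if tok := os.environ.get("CARBONMAPPER_TOKEN"):
        return {"token": tok}
    # 2. refreshable email + password pair
    email = os.environ.get("CARBONMAPPER_EMAIL")
    pwd = os.environ.get("CARBONMAPPER_PASSWORD")
    if email and pwd:
        return {"email": email, "password": pwd}
    # 3. canonical config file
    if CANONICAL.exists():
        return json.loads(CANONICAL.read_text())
    raise FileNotFoundError("no Carbon Mapper credentials found")

os.environ["CARBONMAPPER_TOKEN"] = "demo-token"  # simulate priority 1
print(resolve_credentials())                     # {'token': 'demo-token'}
```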

Sign up for a developer account at api.carbonmapper.org – the free tier covers all the calls in this notebook.

Publication lag

Carbon Mapper's plume catalog and STAC catalog publish on different cadences. As of late 2025:

| Asset | Latency from acquisition |
|---|---|
| Plume (L4A, in /catalog/plumes/...) | hours to days |
| Tile / scene (L2B, in /stac/collections/l2b-ch4-mfa-v3a/...) | weeks to months (Tanager) |

Practical consequences:

  • api_queries.list_plumes(...) returns plumes whose parent L2B scene is not yet in STAC. Don't expect every plume to round-trip through get_tile_for_plume – it returns None when the parent is unpublished, and get_tile() raises CMSceneNotPublished.
  • For ingestion pipelines, treat CMSceneNotPublished as defer-and-retry, not an error.
  • The plume's geometry, emission_auto, and wind are authoritative without the L2B raster – you only need the L2B for visualisation / re-quantification / model retraining.
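The defer-and-retry pattern can be sketched in a few lines. The exception class and fetcher below are stand-ins so the sketch runs standalone; in real code you'd catch georeader's CMSceneNotPublished around api_queries.get_tile.

```python
from collections import deque

class CMSceneNotPublished(Exception):
    """Stand-in for georeader's exception, so this sketch runs standalone."""

def drain(queue: deque, fetch, deferred: list) -> list:
    """Process scene ids; unpublished scenes go to `deferred` for a later pass."""
    done = []
    while queue:
        scene_id = queue.popleft()
        try:
            done.append(fetch(scene_id))
        except CMSceneNotPublished:
            deferred.append(scene_id)  # defer-and-retry, not a failure
    return done

def fake_fetch(scene_id):
    # Pretend scenes ending in "new" haven't hit STAC yet.
    if scene_id.endswith("new"):
        raise CMSceneNotPublished(scene_id)
    return scene_id

deferred: list = []
done = drain(deque(["a", "b-new", "c"]), fake_fetch, deferred)
print(done, deferred)  # ['a', 'c'] ['b-new']
```

Re-running the deferred list on a later schedule (hours or days, per the lag table above) picks up scenes as they publish.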

Setup

from datetime import datetime, timezone

from georeader.readers.carbonmapper import (
    CMAPIError,
    CMPlumeNotFound,
    CMSceneNotPublished,
    CMSource,
    CMSourceNotFound,
    CMTileItem,
    CarbonMapperConfig,
    api_queries,
    download,
)

# --- 429-resilient HTTP -----------------------------------------------
# CarbonMapper rate-limits per account. §§ 5–6 below fire dozens of
# catalog probes back-to-back and can trip the per-minute cap. Mount
# a retry-aware adapter on a shared Session and re-bind the `requests`
# module shortcuts to route through it – so every HTTP call in this
# kernel (including the bare `requests.get(...)` calls in §§ 5–6 and
# the ones inside georeader's `.download` helpers) gets automatic 429
# backoff honouring `Retry-After`, plus exponential backoff for 5xx.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

_cm_session = requests.Session()
_cm_session.mount("https://", HTTPAdapter(max_retries=Retry(
    total=8,
    backoff_factor=2.0,           # 2, 4, 8, 16, 32, 64 s
    status_forcelist=(429, 500, 502, 503, 504),
    respect_retry_after_header=True,
    allowed_methods=frozenset(["GET", "POST"]),
)))
requests.get = _cm_session.get
requests.post = _cm_session.post
requests.request = _cm_session.request

# Resolve a Bearer token from env / config file (see "Authentication").
TOKEN = CarbonMapperConfig.load().refresh_access_token()

# Protagonist plume – Tanager-1 over the Permian basin, 2025-12-12.
PLUME_ID = "tan20251212t185057c20s4001-E"
SCENE_ID = PLUME_ID.rsplit("-", 1)[0]
PERMIAN_BBOX = (-104.5, 32.0, -103.5, 32.8)  # (W, S, E, N)
print(f"plume = {PLUME_ID}")
print(f"scene = {SCENE_ID}")
plume = tan20251212t185057c20s4001-E
scene = tan20251212t185057c20s4001

Domain model

Three resource types, with these relationships:

Relationships

| Parent | Cardinality | Child | Meaning |
|---|---|---|---|
| SOURCE | 1 – N | PLUME | DBSCAN clusters detections at one physical site |
| TILE | 1 – N | PLUME | L2B scene contains the detected plumes |

Entity properties

| Entity | Field | Type | Notes |
|---|---|---|---|
| PLUME | plume_id | string | tan20251212t185057...-E |
| | emission_auto | float | kg/h |
| | geometry | polygon | |
| | wind_u_v | float | from CM forecast |
| SOURCE | source_name | string | {gas}_{sector}_{m}m_{lon}_{lat} |
| | plume_count | int | across all scenes |
| | emission_auto | float | site-aggregate (kg/h) |
| TILE | scene_id | string | plume_id.rsplit('-',1)[0] |
| | platform | string | tan / emi / ang / av3 / gao |
| | acquired | datetime | L2B GeoTIFF, may lag publication |

  • Plume – one detection. Carries emission_auto (kg/h), wind, geometry, and a plume_id that encodes the source-instrument prefix and acquisition timestamp (e.g. tan20251212t185057c20s4001-E).
  • Tile (or scene) – the L2B GeoTIFF the plume was extracted from. One tile contains 0..N plumes. scene_id is plume_id.rsplit('-', 1)[0].
  • Source – DBSCAN cluster of plumes at the same physical site. One source contains 1..N plumes across many scenes / dates. Identified by the deterministic key {gas}_{sector}_{footprint_m}m_{lon}_{lat}.

A plume always has a parent scene (encoded in the id), but the parent L2B item may not be published in STAC yet (see Publication lag). A plume may not yet be clustered into a source if it's the first detection at that site.
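Because the parent scene is encoded in the id, it's recoverable offline with no API call – the same rsplit the table above documents:

```python
def scene_id_of(plume_id: str) -> str:
    # Drop the trailing per-scene plume letter ("-E") to get the parent scene id.
    return plume_id.rsplit("-", 1)[0]

print(scene_id_of("tan20251212t185057c20s4001-E"))
# tan20251212t185057c20s4001
```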

1 · Typed models

Two frozen dataclasses you'll see flowing through every other call. Worth a minute up front so you know what you're getting back.

1.1 CMSource – DBSCAN-clustered point source

Carbon Mapper aggregates plumes detected at the same physical location into a source – a deterministic point-source record addressed by {gas}_{sector}_{footprint_m}m_{lon}_{lat}. The /catalog/sources.geojson endpoint returns features whose source_name carries a stray ?plume_gas=... query suffix that must be stripped before using the value as a key into other endpoints. (The suffix is an accidental bleed from the geojson endpoint's filtering query string; Carbon Mapper plans to fix it upstream, but the strip is defensive in the meantime.) CMSource.from_geojson_feature does the strip unconditionally, so downstream code can treat source_name as canonical.

feature = {
    "properties": {
        "source_name": "CH4_1B2_100m_-104.17525_32.49125?plume_gas=CH4&bbox=...",
        "sector": "1B2",
        "gas": "CH4",
        "plume_count": 12,
        "persistence": 0.42,
        "emission_auto": 250.0,
        "emission_uncertainty_auto": 35.0,
    },
    "geometry": {"type": "Point", "coordinates": [-104.17525, 32.49125]},
}
src = CMSource.from_geojson_feature(feature)
print(src.source_name)                        # suffix stripped
print((src.point.x, src.point.y))             # (-104.17525, 32.49125)
print(f"{src.plume_count} plumes · sector {src.sector} · gas {src.gas}")
CH4_1B2_100m_-104.17525_32.49125
(-104.17525, 32.49125)
12 plumes · sector 1B2 · gas CH4
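The strip itself is just dropping everything from the first '?'; a library-independent sketch (strip_source_suffix is an illustrative helper, not georeader API):

```python
def strip_source_suffix(source_name: str) -> str:
    # Everything from the first '?' onward is the accidental query-string bleed.
    return source_name.split("?", 1)[0]

raw = "CH4_1B2_100m_-104.17525_32.49125?plume_gas=CH4&bbox=..."
print(strip_source_suffix(raw))
# CH4_1B2_100m_-104.17525_32.49125
```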

1.2 CMTileItem – typed wrapper over a STAC item

Frozen dataclass exposing the fields we use in practice (scene_id, collection, datetime, platform, bbox, geometry, asset_urls). The full properties dict and the raw STAC item stay attached for one-off field access.

stac_item = download.stac_get_item("l2b-ch4-mfa-v3a", SCENE_ID, token=TOKEN)
tile = CMTileItem.from_stac_item(stac_item)
print(f"{tile.scene_id} · {tile.platform} · {tile.datetime}")
print(f"bbox  : {tile.bbox}")
print(f"assets: {sorted(tile.asset_urls)[:5]}")
tan20251212t185057c20s4001 · tan · 2025-12-12 18:50:57+00:00
bbox  : (-104.5861937, 31.6662665, -103.9359726, 33.0604707)
assets: ['artifact-mask.tif', 'cmf-unortho.tif', 'cmf.tif', 'uas.txt', 'uncertainty-unortho.tif']

2 · download.py – raw HTTP wrappers

You usually shouldn't reach here – api_queries.py is the supported surface. We expose it because (a) the bbox encoding is non-obvious and worth understanding, and (b) the REST endpoints carry fields the typed layer doesn't yet model (e.g. raw CSV exports). These are thin endpoint wrappers – same return shape as the upstream JSON, but with bbox encoding, retries, and Bearer auth handled.

2.1 bbox encoding – REST vs STAC

Carbon Mapper's two API surfaces disagree on bbox shape:

  • REST Catalog (/catalog/...) wants repeated keys: ?bbox=W&bbox=S&bbox=E&bbox=N. Comma-joined returns 422.
  • STAC (/stac/...) wants the comma-joined form: ?bbox=W,S,E,N.

_rest_bbox_params returns a list-valued dict (requests serialises lists as repeated keys); _stac_bbox_param returns the comma-joined string.

from georeader.readers.carbonmapper.download import (
    _rest_bbox_params, _stac_bbox_param,
)

print(_rest_bbox_params(PERMIAN_BBOX))
# {'bbox': ['-104.5', '32.0', '-103.5', '32.8']}

print(_stac_bbox_param(PERMIAN_BBOX))
# {'bbox': '-104.5,32.0,-103.5,32.8'}
{'bbox': ['-104.5', '32.0', '-103.5', '32.8']}
{'bbox': '-104.5,32.0,-103.5,32.8'}

2.2 Endpoint wrappers

# stac_get_item – one STAC item by collection + scene_id
item = download.stac_get_item("l2b-ch4-mfa-v3a", SCENE_ID, token=TOKEN)
print(f"{item['id']} {item['properties']['datetime']}")
tan20251212t185057c20s4001 2025-12-12T18:50:57Z

# get_source_for_plume_name – find the source for our protagonist plume
src_dict = download.get_source_for_plume_name(PLUME_ID, token=TOKEN)
SOURCE_NAME = src_dict["source_name"]
print(SOURCE_NAME)
CH4_1B2_100m_-104.17525_32.49125

# get_source_by_name – REST source record (flat dict with properties)
src_dict = download.get_source_by_name(SOURCE_NAME, token=TOKEN)
print(f"plumes: {src_dict.get('plume_count')}  emission: {src_dict.get('emission_auto')} kg/h")
plumes: None  emission: None kg/h

# get_source_plumes_csv – every plume attributed to one source as CSV text
csv_text = download.get_source_plumes_csv(SOURCE_NAME, token=TOKEN)
print(csv_text[:200])
plume_id,plume_latitude,plume_longitude,datetime,country,state_province,ipcc_sector,gas,emission_cmf_type,plume_bounds,instrument,mission_phase,published_at,modified,emission_version,processing_softwa
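The CSV parses fine with the stdlib csv module; a standalone sketch using a synthetic two-row sample built from a subset of the real header columns shown above (the id values here are illustrative):

```python
import csv
import io

# Synthetic sample: a subset of the plumes.csv header columns shown above.
csv_text = (
    "plume_id,datetime,gas,instrument\n"
    "tan20251212t185057c20s4001-E,2025-12-12T18:50:57Z,CH4,tan\n"
    "tan20251210t183649c71s4001-A,2025-12-10T18:36:49Z,CH4,tan\n"
)
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(len(rows), rows[0]["plume_id"])
# 2 tan20251212t185057c20s4001-E
```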

# stac_search – accepts ids= for direct STAC item lookup
fc = download.stac_search(
    collections=["l2b-ch4-mfa-v3a"],
    ids=[SCENE_ID],
    limit=5,
    token=TOKEN,
)
print(f"{len(fc['features'])} feature(s) returned")
1 feature(s) returned

3 · api_queries.py – typed query layer

The default surface for downstream code. Three families:

  1. Single-resource fetchers – get_plume, get_tile, get_source. Translate 404s to typed exceptions.
  2. List helpers – list_plumes, list_tiles, list_sources. Take a bbox + datetime range + filters.
  3. Cross-resolution – given a plume, get its tile / source / full context. Given a source, get all its plumes / tiles. The join logic is hidden so callers don't reinvent scene_id derivation, dedup, etc.

3.1 Single-resource fetchers

plume = api_queries.get_plume(TOKEN, PLUME_ID)
print(f"{plume.plume_id}  gas={plume.gas}  emission_auto={plume.emission_auto} kg/h")
print(f"scene_id={plume.scene_id}")
tan20251212t185057c20s4001-E  gas=CH4  emission_auto=1007.6564374669618 kg/h
scene_id=tan20251212t185057c20s4001

tile = api_queries.get_tile(TOKEN, SCENE_ID)  # default collection l2b-ch4-mfa-v3a
print(f"{tile.scene_id} · {tile.platform}")
print(f"bbox: {tile.bbox}")
tan20251212t185057c20s4001 · tan
bbox: (-104.5861937, 31.6662665, -103.9359726, 33.0604707)

source = api_queries.get_source(TOKEN, SOURCE_NAME)
print(f"{source.source_name}")
print(f"plumes: {source.plume_count}  sector: {source.sector}  emission: {source.emission_auto} kg/h")
CH4_1B2_100m_-104.17525_32.49125
plumes: 0  sector:   emission: None kg/h

3.2 Typed exceptions

404s are translated to typed CMAPIError subclasses – easy to catch one resource type without swallowing real failures. All three inherit from CMAPIError so a single except CMAPIError catches every documented failure mode; anything else is an HTTP / network error and should be allowed to propagate.

try:
    # Well-formed but non-existent plume id (the API 422s on malformed input).
    api_queries.get_plume(TOKEN, "tan29991231t000000c00s4001-Z")
except CMPlumeNotFound as exc:
    print(f"caught CMPlumeNotFound: {exc}")

try:
    api_queries.get_source(TOKEN, "CH4_1B2_100m_0_0")
except CMSourceNotFound as exc:
    print(f"caught CMSourceNotFound: {exc}")

try:
    # A scene whose L2B item is not yet published in STAC.
    api_queries.get_tile(TOKEN, "tan29991231t000000c00s4001")
except CMSceneNotPublished as exc:
    print(f"caught CMSceneNotPublished: {exc}")

# All three inherit from CMAPIError if you want a single catch.
print(f"isinstance(CMPlumeNotFound(...), CMAPIError) -> "
      f"{isinstance(CMPlumeNotFound('x'), CMAPIError)}")
caught CMPlumeNotFound: Plume not found: tan29991231t000000c00s4001-Z

caught CMSourceNotFound: Source not found: CH4_1B2_100m_0_0
caught CMSceneNotPublished: L2B scene not published: tan29991231t000000c00s4001
isinstance(CMPlumeNotFound(...), CMAPIError) -> True

3.3 List helpers

dt_min = datetime(2025, 12, 1, tzinfo=timezone.utc)
dt_max = datetime(2025, 12, 31, tzinfo=timezone.utc)

plumes = api_queries.list_plumes(
    TOKEN,
    bbox=PERMIAN_BBOX,
    datetime_min=dt_min,
    datetime_max=dt_max,
    gas="CH4",
)
print(f"{len(plumes)} plumes")
for p in plumes[:3]:
    print(f"  {p.plume_id}  emission_auto={p.emission_auto}")
22 plumes
  tan20251212t185057c20s4001-C  emission_auto=876.1038259991867
  tan20251212t185057c20s4001-D  emission_auto=432.5738771282001
  tan20251212t185057c20s4001-E  emission_auto=1007.6564374669618

# NB: STAC search caps page size at 100 – pass an explicit limit until
# pagination lands.
tiles = api_queries.list_tiles(
    TOKEN,
    bbox=PERMIAN_BBOX,
    datetime_min=dt_min,
    datetime_max=dt_max,
    limit=50,
)
print(f"{len(tiles)} tiles")
for t in tiles[:3]:
    print(f"  {t.scene_id} {t.platform} {t.datetime}")
4 tiles
  tan20251212t185057c20s4001 tan 2025-12-12 18:50:57+00:00
  tan20251210t183649c71s4001 tan 2025-12-10 18:36:49+00:00
  tan20251210t183749c38s4001 tan 2025-12-10 18:37:49+00:00

sources = api_queries.list_sources(TOKEN, bbox=PERMIAN_BBOX, gas="CH4")
print(f"{len(sources)} sources")
for s in sources[:3]:
    print(f"  {s.source_name}  plumes={s.plume_count}  emission={s.emission_auto}")
10593 sources
  CH4_6A_500m_-117.26768_34.59375  plumes=3  emission=23.0636206375622
  CH4_6A_500m_-118.51707_34.32769  plumes=485  emission=1070.9700550816046
  CH4_6A_500m_-119.38080_36.39176  plumes=34  emission=174.91747754329324

3.4 Cross-resolution helpers

The headline value-add – these do the join work the upstream API leaves to the caller (scene-id derivation, source lookup-by-plume, dedup of scene_ids per source, etc.).

# plume → tile (returns None when the L2B scene isn't published yet)
tile = api_queries.get_tile_for_plume(TOKEN, PLUME_ID)
print(f"tile  : {tile and tile.scene_id}")
tile  : tan20251212t185057c20s4001

# plume → source (returns None when CM hasn't clustered this plume yet)
source = api_queries.get_source_for_plume(TOKEN, PLUME_ID)
print(f"source: {source and source.source_name}")
source: CH4_1B2_100m_-104.17525_32.49125

# One call → (plume, tile|None, source|None) – the typical ingestion shape.
plume, tile, source = api_queries.get_plume_context(TOKEN, PLUME_ID)
print(f"plume  : {plume.plume_id}")
print(f"  tile  : {tile and tile.scene_id}")
print(f"  source: {source and source.source_name}")
plume  : tan20251212t185057c20s4001-E
  tile  : tan20251212t185057c20s4001
  source: CH4_1B2_100m_-104.17525_32.49125

# tile → all plumes in that scene
plumes = api_queries.list_plumes_for_tile(TOKEN, SCENE_ID)
print(f"{len(plumes)} plumes in scene {SCENE_ID}")
0 plumes in scene tan20251212t185057c20s4001

# source → every plume attributed to it (parsed from CSV)
plumes = api_queries.list_plumes_for_source(TOKEN, SOURCE_NAME)
print(f"{len(plumes)} plumes for source {SOURCE_NAME}")
1 plumes for source CH4_1B2_100m_-104.17525_32.49125

# source → distinct parent tiles (dedups scene_ids before STAC ids= search)
tiles = api_queries.list_tiles_for_source(TOKEN, SOURCE_NAME)
print(f"{len(tiles)} unique parent tiles")
1 unique parent tiles

4 · End-to-end mini-workflow

Tie it together: take a bbox + date range, list plumes, expand each into full context, count how many have a published L2B parent and how many are clustered into a source.

plumes = api_queries.list_plumes(
    TOKEN, bbox=PERMIAN_BBOX, datetime_min=dt_min, datetime_max=dt_max, gas="CH4",
)

n_with_tile = n_with_source = 0
for p in plumes[:25]:  # cap to keep the request count modest
    _, tile, source = api_queries.get_plume_context(TOKEN, p.plume_id)
    n_with_tile += tile is not None
    n_with_source += source is not None

print(f"checked       : {min(len(plumes), 25)}")
print(f"L2B published : {n_with_tile}")
print(f"in a source   : {n_with_source}")
checked       : 22
L2B published : 22
in a source   : 22

The 22 / 22 / 22 saturation reads as: in this Permian month, every plume detection has both (a) a parent L2B published in STAC and (b) a clustered source. That's typical for archives older than ~3 months – the publication lag has caught up. Run the same query over the most recent 30 days and you'll see N / 0 / 0 (plumes exist, scenes pending) – which is what your ingestion pipeline needs to handle gracefully via CMSceneNotPublished.

5 · Plume catalog stats – what's in the live catalog right now

Cells below hit the live API at notebook-execution time, so numbers will drift between runs. Each cell calls one or more /catalog/plumes/annotated?limit=1 requests and reads the total_count field – no large data transfer. Total runtime ≈ 25 s.

If the API is rate-limiting at the moment you re-run, drop the section: the prose narrative stands on its own.

> CH4-only – plume_gas="CH4" is implicit on every call below.

5.1 Headline counts

Total CH4 plumes in the catalog, split by instrument. The instrument filter is case-sensitive upstream – tan / emi / ang / av3 are lowercase, GAO is uppercase. Unrecognised filter names are silently ignored: use plume_gas, not gas, and instrument, not platform.

import pandas as pd
import requests

BASE = "https://api.carbonmapper.org/api/v1"
H = {"Authorization": f"Bearer {TOKEN}"}


def plume_count(**filters) -> int | None:
    """`total_count` for the plume catalog under the given filters."""
    r = requests.get(
        f"{BASE}/catalog/plumes/annotated",
        headers=H, params={"limit": 1, "plume_gas": "CH4", **filters},
        timeout=30,
    )
    return r.json().get("total_count")


total = plume_count()
# Instrument codes are case-sensitive upstream – `gao` returns None,
# `GAO` works. Other codes are lowercase. Worth a comment because it
# bites everyone once.
by_inst = {code: plume_count(instrument=code) for code in
           ("tan", "emi", "ang", "av3", "GAO")}

print(f"TOTAL CH4: {total:,} plumes\n")
print("By instrument")
for k, v in by_inst.items():
    print(f"  {k:5s} {v:>8,}")
TOTAL CH4: 32,642 plumes

By instrument
  tan     11,970
  emi      4,297
  ang      4,693
  av3      2,343
  GAO      9,339

5.2 IPCC sector distribution

Carbon Mapper attributes most plumes to an IPCC sector code. Every sector other than 1B2 (oil & gas) is dwarfed by it – Tanager's operational targeting bias toward upstream O&G shows up clearly.

sector_codes = ["1A1", "1B1a", "1B2", "3A", "4B", "6A", "6B"]
by_sector = {s: plume_count(sectors=s) for s in sector_codes}

df_sector = pd.DataFrame({
    "sector": list(by_sector),
    "name": ["Energy generation", "Coal mining", "Oil & gas",
             "Enteric fermentation", "Livestock", "Solid waste",
             "Waste water"],
    "plumes": list(by_sector.values()),
}).sort_values("plumes", ascending=False, na_position="last")
df_sector["share"] = df_sector["plumes"] / df_sector["plumes"].sum()
df_sector
  sector                  name   plumes     share
2    1B2             Oil & gas  18657.0  0.580564
5     6A           Solid waste   8552.0  0.266119
1   1B1a           Coal mining   3574.0  0.111215
4     4B             Livestock   1051.0  0.032705
0    1A1     Energy generation    252.0  0.007842
6     6B           Waste water     50.0  0.001556
3     3A  Enteric fermentation      NaN       NaN

5.3 Monthly activity – last 12 months

Plumes-by-month using ISO datetime intervals. Tanager-1 went operational mid-2024, so the early months are sparse; recent months reflect both the fleet ramp and the publication lag (newer detections are still flowing into the catalog).

from datetime import datetime, timedelta, timezone


def month_count(year: int, month: int) -> int | None:
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    end = (datetime(year + (month == 12), (month % 12) + 1, 1,
                    tzinfo=timezone.utc)
           - timedelta(seconds=1))
    return plume_count(datetime=f"{start.isoformat()}/{end.isoformat()}")


now = datetime.now(timezone.utc)
months = []
y, m = now.year, now.month
for _ in range(12):
    months.append((y, m, month_count(y, m)))
    m -= 1
    if m == 0:
        m, y = 12, y - 1

df_months = pd.DataFrame(months, columns=["year", "month", "plumes"])
df_months["label"] = df_months.apply(
    lambda r: f"{r.year}-{r.month:02d}", axis=1,
)
df_months[["label", "plumes"]].iloc[::-1].reset_index(drop=True)
      label  plumes
0   2025-06     934
1   2025-07    1073
2   2025-08     932
3   2025-09     935
4   2025-10     892
5   2025-11    1011
6   2025-12     961
7   2026-01     669
8   2026-02     775
9   2026-03     949
10  2026-04     303
11  2026-05       0

5.4 Emission rate distribution

The headline metric per plume is emission_auto in kg/h. Pull one page (1,000 plumes) and summarise – the long tail is dramatic: the median CH4 plume sits around 7 kt/yr, while the p99 is in the tens-of-kt/yr super-emitter regime.

r = requests.get(
    f"{BASE}/catalog/plumes/annotated",
    headers=H, params={"limit": 1000, "plume_gas": "CH4"}, timeout=60,
)
emissions_kgh = pd.Series(
    [item.get("emission_auto") for item in r.json().get("items", [])],
    name="emission_kg_per_h",
).dropna()

# Convert to kt/yr (×24 h ×365.25 d ÷ 1e6 kg/kt) for context.
emissions_kt_yr = emissions_kgh * 24 * 365.25 / 1_000_000

print(f"Sample size: {len(emissions_kgh):,} CH4 plumes\n")
print("kg/h:")
print(emissions_kgh.describe(percentiles=[0.5, 0.75, 0.9, 0.99]).round(1))
print("\nkt/yr equivalent:")
print(emissions_kt_yr.describe(percentiles=[0.5, 0.75, 0.9, 0.99]).round(2))
Sample size: 754 CH4 plumes

kg/h:
count      754.0
mean      1228.9
std       1567.6
min         79.4
50%        802.1
75%       1464.6
90%       2540.7
99%       7265.3
max      20900.0
Name: emission_kg_per_h, dtype: float64

kt/yr equivalent:
count    754.00
mean      10.77
std       13.74
min        0.70
50%        7.03
75%       12.84
90%       22.27
99%       63.69
max      183.21
Name: emission_kg_per_h, dtype: float64

6 · STAC inventory – what's downloadable

The plume catalog (§ 5) is the detection index – what was spotted, where, when, how strong. The STAC catalogue is the download index – the actual GeoTIFFs and per-plume products you can pull bytes for. There are 86 STAC collections total, but most are superseded versions; only ~9 are actively published as of late 2025.

6.1 Collection counts by level

The l<n> prefix tells you what kind of product:

  • L2B – orthorectified scene-level retrievals (cmf, RGB, uncertainty, artifact-mask).
  • L2C – per-scene CH4/CO2 composites (less common downstream).
  • L3A – per-plume products (the small plume_tif clip + ime retrieval crop).
  • L4A – retrieval cubes; flat per-platform listings of L4 outputs.

v3a is the current canonical version family. Older versions (v1, v3, j001, jpl legacy) still exist for archival reads.

import re
from collections import defaultdict

r = requests.get(f"{BASE}/stac/collections", headers=H, timeout=30)
all_collections = r.json()["collections"]

groups: dict[str, list[str]] = defaultdict(list)
for c in all_collections:
    m = re.match(r"(l\d[a-z]?)-", c["id"])
    if m:
        groups[m.group(1)].append(c["id"])

df_levels = pd.DataFrame({
    "level": sorted(groups),
    "collections": [len(groups[k]) for k in sorted(groups)],
})
df_levels.loc[len(df_levels)] = ["TOTAL", len(all_collections)]
df_levels
   level  collections
0     l2            1
1    l2b           31
2    l2c            3
3    l3a           32
4    l3c            1
5    l4a           18
6  TOTAL           86

6.2 Active v3a collections – item counts

For each *-v3a collection, fetch numberMatched via a 1-result STAC search. Empty placeholder collections (e.g. l2b-ch4-mfma-v3a) show 0 – the algorithm variant isn't currently published. The pairs l2b-ch4-mfa-v3a / l2b-co2-mfa-v3a have identical item counts because they're the same Tanager scenes processed twice for different gases.

v3a = [c["id"] for c in all_collections
       if (c["id"].endswith("-v3a") or c["id"].endswith("-quick-v3a"))
       and ("ch4" in c["id"] or "rgb" in c["id"])]

rows = []
for cid in v3a:
    info = next(c for c in all_collections if c["id"] == cid)
    extent = info.get("extent", {}).get("temporal", {}) \
                  .get("interval", [[None, None]])[0]
    r = requests.get(
        f"{BASE}/stac/search",
        headers=H, params={"collections": cid, "limit": 1}, timeout=30,
    )
    matched = r.json().get("numberMatched")
    rows.append({
        "collection": cid,
        "items": matched,
        "start": (extent[0] or "")[:10],
        "end": (extent[1] or "")[:10],
    })

df_v3a = pd.DataFrame(rows).sort_values(
    by=["items", "collection"], ascending=[False, True],
).reset_index(drop=True)
df_v3a
                   collection  items       start         end
0             l2b-ch4-mfa-v3a   1675  2025-07-11  2025-12-16
1                 l2b-rgb-v3a   1672  2025-07-11  2025-12-16
2         l3a-vis-ch4-mfa-v3a   1451  2023-10-25  2025-12-16
3         l3a-ime-ch4-mfa-v3a   1450  2023-10-25  2025-12-16
4             l4a-ch4-mfa-v3a   1450  2023-10-25  2025-12-16
5            l2b-ch4-mfma-v3a      0  2025-07-11  2025-12-16
6  l4a-combined-ch4-quick-v3a      0  2025-11-06  2025-12-15
7        l4a-combined-ch4-v3a      0  2024-11-21  2025-12-15

> v3c is the live processing version, but isn't in STAC. The latest
> Tanager CH4 plumes (post-2025-12-16) live in l3a-vis-ch4-mfa-v3c /
> l3a-ime-ch4-mfa-v3c (and -v3d for the very newest) – but those
> collections are not exposed via /stac/collections or any item lookup.
> They're reachable only via direct asset URLs derived from
> /catalog/plume/{id}. Both wrappers handle this transparently via
> URL-pattern derivation: CMPlumeImage for the per-plume L3A bundle
> (see § 8 below), and CMImageRaster (via
> api_queries.get_image_raster_for_plume) for the parent L2B scene –
> which tries STAC first and falls back to the same URL-pattern trick
> when the scene isn't registered.

6.3 Asset shapes – what's actually inside an item

Sample one item per active CH4 collection and list the asset keys. This is the canonical map between collection (what you search for) and asset key (what RasterioReader actually opens).

| Collection | Headline asset | Use it for |
|---|---|---|
| l2b-ch4-mfa-v3a | cmf.tif | CH4 column-density retrieval |
| l2b-rgb-v3a | rgb.tif | True-colour overlay |
| l3a-ime-ch4-mfa-v3a | ime-cmf-concentrations.tif | Per-plume IME retrieval crop |
| l3a-vis-ch4-mfa-v3a | plume.tif (band-4 alpha) + plume-outline.geojson | Per-plume mask / polygon |

sample_collections = [
    "l2b-ch4-mfa-v3a", "l2b-rgb-v3a",
    "l3a-ime-ch4-mfa-v3a", "l3a-vis-ch4-mfa-v3a",
]

asset_rows = []
for cid in sample_collections:
    r = requests.get(
        f"{BASE}/stac/search",
        headers=H, params={"collections": cid, "limit": 1}, timeout=30,
    )
    feats = r.json().get("features", [])
    if not feats:
        asset_rows.append({"collection": cid, "assets": "(empty)"})
        continue
    keys = sorted((feats[0].get("assets") or {}).keys())
    asset_rows.append({"collection": cid, "assets": ", ".join(keys)})

pd.DataFrame(asset_rows)
            collection                                             assets
0      l2b-ch4-mfa-v3a  artifact-mask.tif, cmf-unortho.tif, cmf.tif, u...
1          l2b-rgb-v3a                                            rgb.tif
2  l3a-ime-ch4-mfa-v3a  ime-cmf-concentrations.png, ime-cmf-concentrat...
3  l3a-vis-ch4-mfa-v3a  plume-concentrations.tif, plume-outline.geojso...

7 · Reachable products reference

Static reference tables – the canonical map of what's reachable from the API, by resource type. Numbers in §§ 5–6 are live; the tables here are documentation and don't drift with each notebook run.

7.1 Plume-level products

Every detection ships with a small bundle of per-plume products keyed off plume_id. Source paths: most assets live on the /catalog/plume/{id} REST response (URLs); a handful additionally appear as STAC item assets under l3a-*-ch4-mfa-v3a collections.

| Asset key | Format | What it is | Where to find it |
|---|---|---|---|
| plume_tif | RGBA GeoTIFF | Per-plume binary mask – band 4 is the alpha channel | /catalog/plume/{id}.plume_tif and l3a-vis-ch4-mfa-v3a STAC item assets |
| plume_png | PNG | Plume mask viz | /catalog/plume/{id}.plume_png |
| plume_rgb_png | PNG | Plume mask overlaid on RGB | /catalog/plume/{id}.plume_rgb_png |
| con_tif | GeoTIFF | Per-plume CH4 retrieval crop (column density) | l3a-ime-ch4-mfa-v3a STAC item assets, asset key ime-cmf-concentrations.tif |
| rgb_png / rgb.png | PNG | Per-plume RGB context tile | /catalog/plume/{id}.rgb_png and l3a-vis-ch4-mfa-v3a |
| ime_outline_geojson / plume-outline.geojson | GeoJSON | Plume polygon – preferred over band-4 mask extraction | l3a-vis-ch4-mfa-v3a STAC item assets |
| plumes.csv | CSV | All plumes attributed to one source | /catalog/source/{source_name}/plumes.csv |

The georeader wrapper CMPlumeImage exposes the GeoTIFFs (plume_tif, plume-concentrations.tif, ime-cmf-concentrations.tif, rgb.tif) and the canonical outline GeoJSON. PNG-only assets aren't wrapped (no native georeferencing).
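The band-4-as-alpha convention is easy to exploit once the raster is in memory. A numpy-only sketch on a synthetic RGBA array (reading a real plume_tif would go through CMPlumeImage, or e.g. rasterio if you bypass the wrapper):

```python
import numpy as np

# Synthetic 4x4 RGBA plume clip: band 4 (alpha) carries the binary plume mask.
rgba = np.zeros((4, 4, 4), dtype=np.uint8)
rgba[1:3, 1:3, 3] = 255      # a 2x2 plume footprint

mask = rgba[..., 3] > 0      # boolean plume mask from the alpha band
print(int(mask.sum()))       # 4 pixels inside the plume
```

For production use, prefer the outline GeoJSON over mask extraction, as the table above notes.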

7.2 STAC collections – current CH4 (v3a)

Carbon Mapper's "active" Tanager-1 CH4 product family. Everything older / superseded still resolves under /stac/collections (86 total), but new ingestion should target the v3a family below.

| Collection | Level | Items | Temporal | Description |
|---|---|---|---|---|
| l2b-ch4-mfa-v3a | L2B | 1,675 | 2025-07-11 → 2025-12-16 | CH4 retrieval scene – assets: cmf.tif, cmf-unortho.tif, uncertainty.tif, uncertainty-unortho.tif, artifact-mask.tif, uas.txt |
| l2b-rgb-v3a | L2B | 1,672 | same | True-colour sibling – rgb.tif. 3 short of the cmf collection (still being published) |
| l3a-ime-ch4-mfa-v3a | L3A | 1,450 | 2023-10-25 → 2025-12-16 | Per-plume CH4 IME retrieval crop – ime-cmf-concentrations.tif, ime-cmf-mask.tif, ime-cmf-outline.geojson |
| l3a-vis-ch4-mfa-v3a | L3A | 1,451 | same | Per-plume CH4 visualisation – plume.tif (band-4 alpha mask), plume-outline.geojson, plume-rgb.png, plume-concentrations.tif |
| l4a-ch4-mfa-v3a | L4A | 1,450 | same | CH4 retrieval cube – collection-level metadata, no per-item assets |

Empty `*-v3a` placeholders (`l2b-ch4-mfma-v3a`, `l4a-combined-ch4-{quick-,}v3a`) exist in the catalog but are not currently published. CO2 collections are trimmed from this table — this PR is CH4-only.
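When querying these collections directly, remember the bbox split from the intro: the REST catalog wants repeated `bbox` keys while STAC wants a single comma-joined value, and mixing them up yields an unhelpful 422. A small sketch of the two encodings (helper names are illustrative):

```python
from urllib.parse import urlencode

# The two API surfaces disagree on bbox encoding: the REST catalog
# wants repeated ?bbox=W&bbox=S&... keys, STAC wants ?bbox=W,S,E,N.
# These helpers are illustrative, not part of api_queries.py.


def rest_bbox_qs(w: float, s: float, e: float, n: float) -> str:
    # ?bbox=W&bbox=S&bbox=E&bbox=N  (one key per value)
    return urlencode([("bbox", v) for v in (w, s, e, n)])


def stac_bbox_qs(w: float, s: float, e: float, n: float) -> str:
    # ?bbox=W,S,E,N  (one key, comma-joined; comma is percent-encoded)
    return urlencode({"bbox": ",".join(str(v) for v in (w, s, e, n))})


print(rest_bbox_qs(-105, 32, -104, 33))  # bbox=-105&bbox=32&bbox=-104&bbox=33
print(stac_bbox_qs(-105, 32, -104, 33))  # bbox=-105%2C32%2C-104%2C33
```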

7.3 Source-level products

Sources are DBSCAN clusters of plumes at the same physical site. The full source list is small enough (~12 K rows) to fetch in one shot — the endpoint returns the entire FeatureCollection in ~1.5 s. CH4 sources: 10,569; CO2 sources: 2,140; total: 12,709 (as of probe time).

Endpoints

| Endpoint | Returns | Notes |
|---|---|---|
| `/catalog/sources.geojson` | FeatureCollection of `CMSource` | Strip the `?plume_gas=...` suffix from `source_name` before keying — see § 1.1 |
| `/catalog/source/{source_name}` | Single source dict | Flat REST shape |
| `/catalog/source/{source_name}/plumes.csv` | CSV of every plume attributed to the source | One row per plume, full metadata |
| `/catalog/source/by-plume/{plume_id}` | Single source dict | Resolve plume → source without scanning the GeoJSON |
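The `sources.geojson` caveat (strip the `?plume_gas=...` suffix before keying) is worth isolating in a tiny helper. A sketch — the helper name and the example `source_name` value are illustrative:

```python
# sources.geojson suffixes each source_name with ?plume_gas=...;
# strip that before using the name as a dict key or in
# /catalog/source/{source_name} paths. Helper name is illustrative.


def source_key(source_name: str) -> str:
    return source_name.split("?", 1)[0]


print(source_key("CH4_1B2_500m_-104.17_32.48?plume_gas=CH4"))
# CH4_1B2_500m_-104.17_32.48
```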

Properties on each source feature

| Field | Type | Description |
|---|---|---|
| `source_name` | str | Deterministic key `{gas}_{sector}_{footprint_m}m_{lon}_{lat}` |
| `gas` | str | `CH4` or `CO2` |
| `sector` | str | IPCC sector code (e.g. 1B2, 6A) |
| `plume_count` | int | Plumes in the cluster |
| `plume_ids` | list[str] | All `plume_id`s attributed to this source |
| `observation_scenes_names` | list[str] | Scenes that contributed |
| `persistence` | float | Cluster temporal stability (0–1) |
| `emission_auto` | float | Site-aggregate emission (kg/h) |
| `emission_uncertainty_auto` | float | |
| `published_at_min` / `_max` | datetime | First/last publication of any constituent plume |
| `timestamp_min` / `_max` | datetime | First/last acquisition time |
| `detection_date_count` / `observation_date_count` / `date_count` | int | Distinct-day counts (detection vs. all observations) |

The georeader wrapper CMSource exposes the headline fields as a frozen dataclass; the full properties dict is stashed on CMSource.raw for one-off access.
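For orientation, the shape described above can be sketched as a frozen dataclass. This is an illustrative stand-in (fields abridged), not georeader's actual `CMSource` definition:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the shape CMSource exposes: headline fields
# as frozen attributes, the full properties dict reachable via .raw.
# Not the library's actual definition.


@dataclass(frozen=True)
class SourceSketch:
    source_name: str
    gas: str
    sector: str
    plume_count: int
    emission_auto: float  # kg/h, site aggregate
    raw: dict = field(default_factory=dict, repr=False)


src = SourceSketch("CH4_1B2_500m_-104.17_32.48", "CH4", "1B2", 7, 1250.0,
                   raw={"persistence": 0.8})
print(src.gas, src.raw["persistence"])
try:
    src.gas = "CO2"  # frozen=True -> FrozenInstanceError
except Exception as exc:
    print(type(exc).__name__)
```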

8 · CMPlumeImage — per-plume product bundle

The headline of this PR. One `CMPlumeImage` is the cropped raster suite for one CH4 plume — binary mask, full column-density crop, IME-clipped retrieval, RGB context, plus the canonical outline polygon.

Highlights:

  • Five lazy properties β€” mask, concentrations, ime_concentrations, rgb, outline. Each opens its asset on first access, cached after.
  • Three constructors β€” from_plume_id (one HTTP, handles v3a and v3c), from_cmrawplume (zero HTTP if you have the typed plume), from_stac_item (driving STAC search directly; v3a only).
  • Outline canonical β€” fetches plume-outline.geojson via the derived URL; falls back to band-4 alpha vectorize on fetch failure (with a warning).
  • v3a + v3c handled transparently β€” URL-pattern derivation rewrites the host to the Bearer-aware api gateway, then builds every asset URL from a single seed (plume_tif).
```python
from georeader.readers.carbonmapper import CMPlumeImage

# 1. Build from a plume_id — one HTTP round-trip
img = CMPlumeImage.from_plume_id(PLUME_ID, token=TOKEN)
print(img)
```

```
CMPlumeImage
  plume_id:       tan20251212t185057c20s4001-E
  assets present: ['plume.tif', 'plume-concentrations.tif', 'plume-outline.geojson', 'rgb.tif', 'ime-cmf-concentrations.tif', 'ime-cmf-mask.tif', 'ime-cmf-outline.geojson']
  overview_level: full
```

8.1 Lazy properties

Each property opens its asset on first access. No I/O happens during from_plume_id beyond the catalog metadata fetch.

```python
from georeader.rasterio_reader import RasterioReader


def describe(name, reader):
    if reader is None:
        return f"{name:22s} (absent)"
    return f"{name:22s} {type(reader).__name__}  shape={reader.shape}"


print(describe("mask:",               img.mask))
print(describe("concentrations:",     img.concentrations))
print(describe("ime_concentrations:", img.ime_concentrations))
print(describe("rgb:",                img.rgb))
```

```
mask:                  RasterioReader  shape=(4, 101, 100)
concentrations:        RasterioReader  shape=(1, 85, 56)
ime_concentrations:    RasterioReader  shape=(1, 28, 28)
rgb:                   RasterioReader  shape=(3, 101, 100)
```
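The open-once-then-cache behaviour of these properties can be approximated with `functools.cached_property`. A minimal sketch with illustrative names — this is not georeader's internal implementation:

```python
from functools import cached_property

# Sketch of the lazy-open pattern: the asset is opened on first
# property access and cached on the instance thereafter.
# AssetSketch and _open are illustrative stand-ins.


class AssetSketch:
    def __init__(self) -> None:
        self.opens = 0

    def _open(self, name: str) -> str:
        self.opens += 1  # stands in for an actual rasterio open
        return f"reader<{name}>"

    @cached_property
    def mask(self) -> str:
        return self._open("plume.tif")


a = AssetSketch()
a.mask
a.mask  # second access hits the cache, no second open
print(a.opens)  # 1
```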

8.2 Outline (canonical GeoJSON)

outline returns a shapely geometry in EPSG:4326. The canonical source is plume-outline.geojson (fetched from the v3a STAC asset, or the URL-pattern equivalent for v3c). If that fetch fails, the property falls back to vectorizing the band-4 alpha of mask and logs a warning.

```python
outline = img.outline
print(f"type:   {type(outline).__name__}")
print(f"area:   {outline.area:.6f}  (degrees², EPSG:4326)")
print(f"bounds: {tuple(round(b, 4) for b in outline.bounds)}")
```

```
type:   Polygon
area:   0.000063  (degrees², EPSG:4326)
bounds: (-104.177, 32.4778, -104.1687, 32.4927)
```
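The band-4 fallback can be approximated by unioning one unit square per nonzero alpha pixel. This is a dependency-light sketch in pixel coordinates only; the real fallback presumably runs a proper raster-to-polygon pass and applies the geotransform:

```python
import numpy as np
from shapely.geometry import box
from shapely.ops import unary_union

# Naive stand-in for vectorizing a band-4 alpha mask: one unit
# square per nonzero pixel, dissolved into a single geometry.
# Works in pixel coordinates; no geotransform applied.


def vectorize_alpha(alpha: np.ndarray):
    rows, cols = np.nonzero(alpha)
    return unary_union([box(c, r, c + 1, r + 1) for r, c in zip(rows, cols)])


alpha = np.zeros((4, 4), dtype=np.uint8)
alpha[1:3, 1:3] = 255  # a 2x2 plume blob
geom = vectorize_alpha(alpha)
print(geom.geom_type, geom.area)  # Polygon 4.0
```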

See also