Getting Started#

A template repository for fine-tuning Surya, the first foundation model for heliophysics, on your own downstream solar science tasks.


The Surya Foundation Model#

Surya is a 366-million-parameter spatiotemporal transformer pre-trained on full-resolution data from NASA’s Solar Dynamics Observatory (SDO). It was developed as a NASA-IMPACT / IBM AI4Science collaboration and is described in:

*Surya: A Foundation Model for Heliophysics*, arXiv:2508.14112

The model ingests 13-channel SDO image stacks (8 AIA wavelengths + 4 HMI magnetic components + HMI Doppler velocity) at native 4096×4096 resolution and has demonstrated strong performance across a range of solar physics tasks:

| Task | Improvement over prior state-of-the-art |
|---|---|
| Solar flare forecasting (TSS) | +22% |
| Solar wind speed prediction (RMSE) | +19% |
| Active region segmentation | |
| EUV spectra modeling (1,343 bands) | |

Architecture#

Surya uses two novel transformer block types that make it efficient on full-resolution solar imagery:

  • Spectral Gating — transforms patches to the frequency domain via FFT, applies learnable complex weights, then returns via iFFT. Captures global structure efficiently.

  • Long-Short Attention — combines local windowed attention (window_size=2) with global attention via dynamic projection (dp_rank=4). Handles the 4096×4096 spatial extent without quadratic cost.

The full backbone is 2 spectral gating blocks followed by 8 long-short attention blocks, with patch size 16 and embedding dimension 1280.
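The spectral-gating idea can be sketched in one dimension with a plain DFT. This is a toy stand-in for the model's 2D FFT over patch grids, and the `gate` vector plays the role of the learnable complex weights; shapes and names here are illustrative, not Surya's actual implementation:

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (stands in for the block's FFT)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT (stands in for the block's iFFT)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def spectral_gate(x, gate):
    # To frequency domain, elementwise multiply by complex weights, back to
    # the spatial domain. In the real block the gate is a learned parameter.
    X = dft(x)
    return [v.real for v in idft([xk * gk for xk, gk in zip(X, gate)])]

signal = [1.0, 2.0, 3.0, 4.0]
identity_gate = [1.0 + 0j] * 4
out = spectral_gate(signal, identity_gate)  # identity gate recovers the input
```

Because the gating is a single elementwise product per frequency, the block mixes information globally at O(N log N) cost (with a true FFT) instead of the O(N²) cost of full attention.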


Purpose of This Repository#

Surya is a powerful foundation, but foundation models only create scientific value when researchers can adapt them to their own questions. That adaptation step — loading pre-trained weights, defining a task-specific head, wiring up data and metrics, and running a reproducible training loop — is where most researchers need support the first time through.

This repository addresses that need with a clean, heavily documented template, designed as a starting point that a scientist can read, understand, and modify in an afternoon:

  • Every component has a clear home and a documented interface.

  • A single YAML file controls all hyperparameters.

  • Numbered notebooks walk through each stage of the workflow interactively before the production script ties them together.

  • The template task (solar flare intensity regression) is realistic enough to illustrate the full pattern, but simple enough that it doesn’t obscure what you need to change.


Repository Structure#

surya_workshop/
│
├── data/
│   └── indices/                        # Pre-built CSV index files for the SDO dataset
│       ├── surya_aws_s3_full_index.csv # Complete index of all available SDO timesteps on S3
│       ├── surya_aws_s3_train.csv      # Training split
│       ├── surya_aws_s3_val.csv        # Validation split
│       └── surya_aws_s3_test.csv       # Test split
│
├── workshop_infrastructure/            # Shared utilities used by all downstream apps
│   ├── configs.py                      # Typed dataclasses: ModelConfig, LoraAdapterConfig, TimeEmbeddingConfig
│   ├── utils.py                        # build_scalers(), apply_peft_lora(), load_pretrained_weights(),
│   │                                   # UploadBestCheckpointToS3, create_logger
│   ├── benchmark_s3.py                 # Benchmark S3 download throughput to tune transfer settings
│   ├── datasets/
│   │   ├── helio.py                    # HelioNetCDFDataset — base dataset (local + S3, signum-log normalization)
│   │   └── transformations.py          # Additional data transformations
│   ├── models/
│   │   ├── finetune_models.py          # HelioSpectformer1D / HelioSpectformer2D fine-tuning wrappers
│   │   ├── helio_spectformer.py        # Full backbone (HelioSpectFormer)
│   │   ├── spectformer.py              # Spectral gating blocks
│   │   ├── transformer_ls.py           # Long-short attention blocks
│   │   ├── embedding.py                # Temporal embedding modules
│   │   └── flow.py                     # Learned flow utilities
│   └── data/
│       ├── create_csv_index.py         # Build a timestep CSV index from a collection of NetCDF files
│       └── split_csv_index.py          # Split an index into train / val / test sets
│
└── downstream_apps/
    └── template/                       # Solar flare intensity regression (the working template task)
        │
        ├── ADAPTING.md                 # Step-by-step guide for creating your own downstream app
        │
        ├── configs/
        │   └── config_script.yaml      # Single source of truth for all hyperparameters
        ├── configs.py                  # DataConfig, TrainingConfig, OutputConfig + load_config()
        │
        ├── assets/
        │   ├── scalers.yaml            # Per-channel normalization statistics (downloaded on first run)
        │   └── surya.366m.v1.pt        # Pre-trained Surya weights (downloaded on first run)
        ├── data/
        │   └── hek_flare_catalog.csv   # HEK flare event catalog used as regression targets
        │
        ├── datasets/
        │   └── template_dataset.py     # FlareDSDataset — extends HelioNetCDFDataset with flare labels
        ├── lightning_modules/
        │   └── pl_simple_baseline.py   # FlareLightningModule — Lightning training + validation loop
        ├── metrics/
        │   └── template_metrics.py     # FlareMetrics — MSE loss + RRSE evaluation metrics
        ├── models/
        │   └── simple_baseline.py      # RegressionFlareModel — linear baseline (no backbone)
        │
        ├── download_scalers_and_weights.sh   # Download assets/ from HuggingFace (run once)
        │
        ├── 0_dataset_dataloader_template.ipynb   # Step 1: explore the dataset and DataLoader
        ├── 1_baseline_template.ipynb             # Step 2: train a linear baseline
        ├── 2_finetune_template_1D.ipynb          # Step 3: fine-tune Surya interactively
        └── 3_finetune_template_1D.py             # Step 4: production training script

Two layers to understand:

  • workshop_infrastructure/ — reusable components that any downstream app can import. You should rarely need to change anything here.

  • downstream_apps/template/ — everything specific to one task. When you build your own app, you copy this folder and modify it.


How to Use This Repository#

1. Environment setup#

# Clone with submodule
git clone --recurse-submodules https://github.com/your-org/surya_workshop.git
cd surya_workshop

# Create and activate the conda environment
conda env create -f environment.yml
conda activate surya_ws

Python 3.12+ is required. Key dependencies: PyTorch, PyTorch Lightning, PEFT, WandB, SunPy, xarray, Dask, fsspec.

2. Work through the template notebooks in order#

Each notebook is self-contained and builds directly on the previous one. They are designed to be run interactively so you can inspect data, check tensor shapes, and verify each component before committing to a full training run.

| Notebook | What it teaches |
|---|---|
| 0_dataset_dataloader_template.ipynb | How SDO data is indexed, loaded, and normalized; what a sample dict looks like |
| 1_baseline_template.ipynb | Training a simple linear model end-to-end; defines the metric and evaluation baseline |
| 2_finetune_template_1D.ipynb | Loading Surya weights, applying LoRA, and fine-tuning interactively |

3. Run the production training script#

Once you’re satisfied with the notebook workflow, 3_finetune_template_1D.py runs the same logic as notebook 2 but with multi-GPU DDP support, checkpoint saving, and WandB logging:

# Single GPU
CUDA_VISIBLE_DEVICES=0 python -m downstream_apps.template.3_finetune_template_1D \
    --config downstream_apps/template/configs/config_script.yaml

# Multi-GPU (DDP)
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m downstream_apps.template.3_finetune_template_1D \
    --config downstream_apps/template/configs/config_script.yaml

# Quick sanity check (cap epochs without editing the YAML)
CUDA_VISIBLE_DEVICES=0 python -m downstream_apps.template.3_finetune_template_1D \
    --config downstream_apps/template/configs/config_script.yaml \
    --max-epochs 2 --no-wandb

All hyperparameters live in config_script.yaml — batch size, learning rate, LoRA settings, S3 paths, and more. The script reads the YAML via load_config() and returns a fully typed TrainingConfig, so IDE autocompletion works and typos are caught at startup rather than mid-training.
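A minimal sketch of that pattern, assuming hypothetical field names (the real `TrainingConfig` lives in the template's `configs.py`, and the real script parses the YAML with a YAML library before this step):

```python
from dataclasses import dataclass, fields

@dataclass
class TrainingConfig:
    # Illustrative fields only; see the template's configs.py for the real ones.
    batch_size: int = 8
    learning_rate: float = 1e-4
    max_epochs: int = 10

def load_config(raw: dict) -> TrainingConfig:
    """Map a parsed-YAML dict onto the dataclass, rejecting unknown keys."""
    valid = {f.name for f in fields(TrainingConfig)}
    unknown = set(raw) - valid
    if unknown:
        # A typo in the YAML fails here, at startup, not mid-training.
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    return TrainingConfig(**raw)

cfg = load_config({"batch_size": 4, "learning_rate": 3e-4})
```

Because `cfg` is a dataclass instance rather than a dict, downstream code gets attribute access (`cfg.batch_size`), IDE autocompletion, and defaults for any field the YAML omits.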

Set max_samples: 10 in the YAML during development to cap the dataset size for fast iteration.
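For example (the surrounding section name is illustrative; check `config_script.yaml` for the actual layout):

```yaml
data:
  max_samples: 10   # cap dataset size for fast iteration; remove for full runs
```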

4. (Optional) Tune S3 download performance#

If your training data is read from S3, workshop_infrastructure/benchmark_s3.py measures download throughput across combinations of thread concurrency and part size and recommends the best settings for your connection:

python -m workshop_infrastructure.benchmark_s3 s3://bucket/path/to/file.nc --anon --quick

Any file in the Surya index works (e.g. s3://nasa-surya-bench/2011/01/20110131_0000.nc). The --quick flag runs a 9-cell grid and finishes in about 2–3 minutes. Copy the recommended values into the data: section of config_script.yaml:

s3_boto3_max_concurrency: 8   # suggested by benchmark
s3_boto3_part_size_mb: 32     # suggested by benchmark

On EC2 in the same AWS region as the bucket, expect 500–1000+ MB/s. Over a regular internet connection, 20–150 MB/s is typical — in either case the benchmark will find the fastest achievable settings.

EC2 users: to ensure S3 traffic routes over the AWS internal backbone and never touches an internet or NAT gateway, confirm that a VPC S3 Gateway Endpoint is attached to your VPC (AWS Console → VPC → Endpoints → filter by “S3 Gateway”). It is free and takes two minutes to create. Without it, even same-region traffic passes through a gateway, reducing throughput and incurring data-transfer costs. The benchmark script will remind you of this automatically when it detects it is running on EC2.

5. Adapt the template for your own task#

Refer to ADAPTING.md in the template folder for a step-by-step guide. The short version:

  1. cp -r downstream_apps/template downstream_apps/your_task

  2. Edit datasets/template_dataset.py to load your labels alongside the SDO image stack.

  3. Edit metrics/template_metrics.py to define your loss and evaluation metrics.

  4. Edit configs/config_script.yaml to point at your data and set your hyperparameters.

  5. Run notebook 2 → verify the forward pass → run the training script.

You typically do not need to touch workshop_infrastructure/ at all.

Key Design Decisions#

YAML as single source of truth. All parameters are declared once in config_script.yaml and nowhere else. The CLI exposes only four arguments: --config (required), --no-wandb (dev toggle), --train_baseline (mode switch), and --max-epochs (sweep override). This keeps experiment management simple and reproducible.

Typed configuration. load_config() parses the YAML into a TrainingConfig dataclass. Downstream code receives a typed object instead of a raw dict — adding a new field requires editing the dataclass and YAML only, not load_config() itself.

Notebooks and script are parallel, not redundant. The notebooks are the learning path — they expose internals and make it easy to inspect intermediate results. The script is the production path — it adds DDP, robust checkpointing, and WandB integration. Both read the same YAML.

LoRA for efficient fine-tuning. By default, PEFT LoRA adapters are added to all attention and feed-forward layers (rank 8, alpha 8, dropout 0.1). This allows the full Surya backbone to remain effectively frozen while adapting it to a new task with a small number of trainable parameters. LoRA can be disabled in the YAML if you prefer full fine-tuning or backbone freezing.
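The LoRA update rule behind those adapters can be sketched with plain Python (hypothetical 2×2 shapes; in the repository, PEFT wraps Surya's attention and feed-forward layers and handles this internally):

```python
def matmul(a, b):
    """Plain nested-list matrix multiply, enough for this toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(W, A, B, alpha, r):
    """Effective weight W + (alpha / r) * B @ A.

    W stays frozen; only the low-rank factors A (r x in) and B (out x r)
    are trained, so the number of trainable parameters scales with r,
    not with the size of W.
    """
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]

# Rank-1 update on a 2x2 identity weight; with alpha == r the scale is 1.0,
# matching the template's default of rank 8, alpha 8.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]            # r x in_features
B = [[0.5], [0.5]]          # out_features x r
W_eff = lora_weight(W, A, B, alpha=8, r=8)
```

At inference time the low-rank update can be merged into `W_eff` once, so LoRA adds no per-forward-pass cost after merging.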