Getting Started#
A template repository for fine-tuning Surya, the first foundation model for heliophysics, on your own downstream solar science tasks.
The Surya Foundation Model#
Surya is a 366-million-parameter spatiotemporal transformer pre-trained on full-resolution data from NASA’s Solar Dynamics Observatory (SDO). It was developed as a NASA-IMPACT / IBM AI4Science collaboration and is described in:
Surya: A Foundation Model for Heliophysics — arXiv:2508.14112
The model ingests 13-channel SDO image stacks (8 AIA wavelengths, 4 HMI magnetic-field components, and HMI Doppler velocity) at native 4096×4096 resolution and has demonstrated strong performance across a range of solar physics tasks:
| Task | Improvement over prior state-of-the-art |
|---|---|
| Solar flare forecasting (TSS) | +22% |
| Solar wind speed prediction (RMSE) | +19% |
| Active region segmentation | — |
| EUV spectra modeling (1,343 bands) | — |
Resources:

- Model weights: `nasa-ibm-ai4science/Surya-1.0` on HuggingFace
- Pre-training dataset: `nasa-ibm-ai4science/core-sdo` on HuggingFace
- Source code: NASA-IMPACT/Surya
- License: Apache 2.0
Architecture#
Surya uses two novel transformer block types that make it efficient on full-resolution solar imagery:
Spectral Gating — transforms patches to the frequency domain via FFT, applies learnable complex weights, then returns via iFFT. Captures global structure efficiently.
Long-Short Attention — combines local windowed attention (`window_size=2`) with global attention via dynamic projection (`dp_rank=4`). Handles the 4096×4096 spatial extent without quadratic cost.
The full backbone is 2 spectral gating blocks followed by 8 long-short attention blocks, with patch size 16 and embedding dimension 1280.
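To make the spectral gating idea concrete, here is a minimal sketch of such a mixer in PyTorch. This is illustrative only, not Surya's actual implementation: the class name, the 1-D token layout, and the initialization scale are all assumptions.

```python
import torch
import torch.nn as nn

class SpectralGatingBlock(nn.Module):
    """Sketch of a spectral-gating token mixer: FFT -> learnable
    complex weights -> inverse FFT. Illustrative, not Surya's code."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        freq_bins = num_tokens // 2 + 1  # length of the rfft output
        # Learnable complex weights stored as (real, imag) pairs
        self.weight = nn.Parameter(torch.randn(freq_bins, dim, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        x_freq = torch.fft.rfft(x, dim=1)                      # to frequency domain
        x_freq = x_freq * torch.view_as_complex(self.weight)   # gate each frequency bin
        return torch.fft.irfft(x_freq, n=x.size(1), dim=1)     # back to token space
```

Because the gating is a single elementwise multiply in the frequency domain, every token influences every other token at O(n log n) cost, which is what makes the block cheap on full-resolution inputs.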
Purpose of This Repository#
Surya is a powerful foundation, but foundation models only create scientific value when researchers can adapt them to their own questions. That adaptation step (loading pre-trained weights, defining a task-specific head, wiring up data and metrics, and running a reproducible training loop) is rarely straightforward the first time through.
This repository addresses that need with a clean, heavily documented template: a well-explained starting point that a scientist can read, understand, and modify in an afternoon.

- Every component has a clear home and a documented interface.
- A single YAML file controls all hyperparameters.
- Numbered notebooks walk through each stage of the workflow interactively before the production script ties them together.
- The template task (solar flare intensity regression) is realistic enough to illustrate the full pattern, but simple enough that it doesn't obscure what you need to change.
Repository Structure#
```text
surya_workshop/
│
├── data/
│   └── indices/                        # Pre-built CSV index files for the SDO dataset
│       ├── surya_aws_s3_full_index.csv # Complete index of all available SDO timesteps on S3
│       ├── surya_aws_s3_train.csv      # Training split
│       ├── surya_aws_s3_val.csv        # Validation split
│       └── surya_aws_s3_test.csv       # Test split
│
├── workshop_infrastructure/            # Shared utilities used by all downstream apps
│   ├── configs.py                      # Typed dataclasses: ModelConfig, LoraAdapterConfig, TimeEmbeddingConfig
│   ├── utils.py                        # build_scalers(), apply_peft_lora(), load_pretrained_weights(),
│   │                                   #   UploadBestCheckpointToS3, create_logger
│   ├── benchmark_s3.py                 # Benchmark S3 download throughput to tune transfer settings
│   ├── datasets/
│   │   ├── helio.py                    # HelioNetCDFDataset — base dataset (local + S3, signum-log normalization)
│   │   └── transformations.py          # Additional data transformations
│   ├── models/
│   │   ├── finetune_models.py          # HelioSpectformer1D / HelioSpectformer2D fine-tuning wrappers
│   │   ├── helio_spectformer.py        # Full backbone (HelioSpectFormer)
│   │   ├── spectformer.py              # Spectral gating blocks
│   │   ├── transformer_ls.py           # Long-short attention blocks
│   │   ├── embedding.py                # Temporal embedding modules
│   │   └── flow.py                     # Learned flow utilities
│   └── data/
│       ├── create_csv_index.py         # Build a timestep CSV index from a collection of NetCDF files
│       └── split_csv_index.py          # Split an index into train / val / test sets
│
└── downstream_apps/
    └── template/                       # Solar flare intensity regression (the working template task)
        │
        ├── ADAPTING.md                 # Step-by-step guide for creating your own downstream app
        │
        ├── configs/
        │   └── config_script.yaml      # Single source of truth for all hyperparameters
        ├── configs.py                  # DataConfig, TrainingConfig, OutputConfig + load_config()
        │
        ├── assets/
        │   ├── scalers.yaml            # Per-channel normalization statistics (downloaded on first run)
        │   └── surya.366m.v1.pt        # Pre-trained Surya weights (downloaded on first run)
        ├── data/
        │   └── hek_flare_catalog.csv   # HEK flare event catalog used as regression targets
        │
        ├── datasets/
        │   └── template_dataset.py     # FlareDSDataset — extends HelioNetCDFDataset with flare labels
        ├── lightning_modules/
        │   └── pl_simple_baseline.py   # FlareLightningModule — Lightning training + validation loop
        ├── metrics/
        │   └── template_metrics.py     # FlareMetrics — MSE loss + RRSE evaluation metrics
        ├── models/
        │   └── simple_baseline.py      # RegressionFlareModel — linear baseline (no backbone)
        │
        ├── download_scalers_and_weights.sh      # Download assets/ from HuggingFace (run once)
        │
        ├── 0_dataset_dataloader_template.ipynb  # Step 1: explore the dataset and DataLoader
        ├── 1_baseline_template.ipynb            # Step 2: train a linear baseline
        ├── 2_finetune_template_1D.ipynb         # Step 3: fine-tune Surya interactively
        └── 3_finetune_template_1D.py            # Step 4: production training script
```
Two layers to understand:

- `workshop_infrastructure/` — reusable components that any downstream app can import. You should rarely need to change anything here.
- `downstream_apps/template/` — everything specific to one task. When you build your own app, you copy this folder and modify it.
How to Use This Repository#
1. Environment setup#
```bash
# Clone with submodules
git clone --recurse-submodules https://github.com/your-org/surya_workshop.git
cd surya_workshop

# Create and activate the conda environment
conda env create -f environment.yml
conda activate surya_ws
```
Python 3.12+ is required. Key dependencies: PyTorch, PyTorch Lightning, PEFT, WandB, SunPy, xarray, Dask, fsspec.
2. Work through the template notebooks in order#
Each notebook is self-contained and builds directly on the previous one. They are designed to be run interactively so you can inspect data, check tensor shapes, and verify each component before committing to a full training run.
| Notebook | What it teaches |
|---|---|
| `0_dataset_dataloader_template.ipynb` | How SDO data is indexed, loaded, and normalized; what a sample dict looks like |
| `1_baseline_template.ipynb` | Training a simple linear model end-to-end; defines the metric and evaluation baseline |
| `2_finetune_template_1D.ipynb` | Loading Surya weights, applying LoRA, and fine-tuning interactively |
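When inspecting data in the notebooks, a small generic helper can print the keys and tensor shapes of one batch from any dict-returning PyTorch dataset. This helper is not part of the repo, and `ToyDictDataset` below is a stand-in with made-up shapes; the real datasets return larger, task-specific dicts:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDictDataset(Dataset):
    """Stand-in for the repo's dataset classes: returns dict samples.
    Shapes here are illustrative, not the real SDO dimensions."""
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        return {"image": torch.zeros(13, 32, 32), "target": torch.tensor(0.5)}

def inspect_batch(dataset, batch_size=2):
    """Pull one batch and report the shape of every entry in the sample dict."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    batch = next(iter(loader))
    for key, value in batch.items():
        shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
        print(f"{key}: {shape}")
    return batch

batch = inspect_batch(ToyDictDataset())
```

Running this before a full training run is a cheap way to catch shape mismatches between the dataset and the model head.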
3. Run the production training script#
Once you’re satisfied with the notebook workflow, 3_finetune_template_1D.py runs the same logic as notebook 2 but with multi-GPU DDP support, checkpoint saving, and WandB logging:
```bash
# Single GPU
CUDA_VISIBLE_DEVICES=0 python -m downstream_apps.template.3_finetune_template_1D \
    --config downstream_apps/template/configs/config_script.yaml

# Multi-GPU (DDP)
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m downstream_apps.template.3_finetune_template_1D \
    --config downstream_apps/template/configs/config_script.yaml

# Quick sanity check (cap epochs without editing the YAML)
CUDA_VISIBLE_DEVICES=0 python -m downstream_apps.template.3_finetune_template_1D \
    --config downstream_apps/template/configs/config_script.yaml \
    --max-epochs 2 --no-wandb
```
All hyperparameters live in config_script.yaml — batch size, learning rate, LoRA settings, S3 paths, and more. The script reads the YAML via load_config() and returns a fully typed TrainingConfig, so IDE autocompletion works and typos are caught at startup rather than mid-training.
Set max_samples: 10 in the YAML during development to cap the dataset size for fast iteration.
4. (Optional) Tune S3 download performance#
If your training data is read from S3, workshop_infrastructure/benchmark_s3.py measures download throughput across combinations of thread concurrency and part size and recommends the best settings for your connection:
```bash
python -m workshop_infrastructure.benchmark_s3 s3://bucket/path/to/file.nc --anon --quick
```
One of the files in the surya index works fine (e.g. s3://nasa-surya-bench/2011/01/20110131_0000.nc). The --quick flag runs a 9-cell grid and finishes in about 2–3 minutes. Copy the recommended values into the data: section of config_script.yaml:
```yaml
s3_boto3_max_concurrency: 8   # suggested by benchmark
s3_boto3_part_size_mb: 32     # suggested by benchmark
```
On EC2 in the same AWS region as the bucket, expect 500–1000+ MB/s. Over a regular internet connection, 20–150 MB/s is typical — in either case the benchmark will find the fastest achievable settings.
EC2 users: to ensure S3 traffic routes over the AWS internal backbone and never touches an internet or NAT gateway, confirm that a VPC S3 Gateway Endpoint is attached to your VPC (AWS Console → VPC → Endpoints → filter by “S3 Gateway”). It is free and takes two minutes to create. Without it, even same-region traffic passes through an internet or NAT gateway, reducing throughput and incurring data-transfer costs. The benchmark script will remind you of this automatically when it detects it is running on EC2.
5. Adapt the template for your own task#
See ADAPTING.md in the template folder for a step-by-step guide. The short version:
1. `cp -r downstream_apps/template downstream_apps/your_task`
2. Edit `datasets/template_dataset.py` to load your labels alongside the SDO image stack.
3. Edit `metrics/template_metrics.py` to define your loss and evaluation metrics.
4. Edit `configs/config_script.yaml` to point at your data and set your hyperparameters.
5. Run notebook 2 → verify the forward pass → run the training script.
You typically do not need to touch workshop_infrastructure/ at all.
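The key change in step 2 is joining your own labels to the image samples. The template does this by subclassing HelioNetCDFDataset; the generic wrapper below illustrates the same join pattern without assuming anything about that base class. `LabeledWrapper` and its field names are hypothetical, not repo code:

```python
import torch
from torch.utils.data import Dataset

class LabeledWrapper(Dataset):
    """Attach per-sample regression targets to an existing dict-returning
    dataset. Illustrative only; the template subclasses its base dataset
    instead of wrapping it."""

    def __init__(self, base_dataset, labels):
        assert len(base_dataset) == len(labels), "need one label per sample"
        self.base = base_dataset
        self.labels = labels

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        sample = dict(self.base[idx])  # shallow copy; don't mutate the base sample
        sample["target"] = torch.tensor(self.labels[idx], dtype=torch.float32)
        return sample
```

Whichever route you take, the invariant is the same: every sample dict gains a `"target"`-style entry that your metrics module knows how to consume.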
Key Design Decisions#
YAML as single source of truth. All parameters are declared once in config_script.yaml and nowhere else. The CLI exposes only four arguments: --config (required), --no-wandb (dev toggle), --train_baseline (mode switch), and --max-epochs (sweep override). This keeps experiment management simple and reproducible.
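For reference, the four-flag CLI described above could be wired with `argparse` roughly as follows. This is a sketch reconstructed from the flag names in this README, not the script's actual code:

```python
import argparse

def parse_args(argv=None):
    """Minimal CLI with the four flags this repo exposes (sketch)."""
    parser = argparse.ArgumentParser(description="Fine-tune Surya on a downstream task")
    parser.add_argument("--config", required=True,
                        help="Path to the YAML config (single source of truth)")
    parser.add_argument("--no-wandb", action="store_true",
                        help="Disable WandB logging (dev toggle)")
    parser.add_argument("--train_baseline", action="store_true",
                        help="Train the linear baseline instead of fine-tuning")
    parser.add_argument("--max-epochs", type=int, default=None,
                        help="Override max_epochs from the YAML (sweep override)")
    return parser.parse_args(argv)
```

Keeping the surface this small means every experiment is fully described by its YAML file plus at most one override.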
Typed configuration. load_config() parses the YAML into a TrainingConfig dataclass. Downstream code receives a typed object instead of a raw dict — adding a new field requires editing the dataclass and YAML only, not load_config() itself.
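The YAML-to-dataclass pattern looks roughly like the sketch below. The field names here are invented examples; the real `TrainingConfig` in the repo has its own fields, but the fail-loudly-on-typos behavior is the point:

```python
from dataclasses import dataclass, fields
import yaml  # PyYAML

@dataclass
class TrainingConfig:
    """Illustrative subset only; not the repo's actual field list."""
    learning_rate: float = 1e-4
    batch_size: int = 8
    max_epochs: int = 10
    lora_rank: int = 8

def load_config(path: str) -> TrainingConfig:
    with open(path) as f:
        raw = yaml.safe_load(f) or {}
    known = {f.name for f in fields(TrainingConfig)}
    unknown = set(raw) - known
    if unknown:
        # A typo in the YAML fails here, at startup, not mid-training
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    return TrainingConfig(**raw)
```

Because the return value is a dataclass, `cfg.learning_rate` autocompletes in an IDE, and a misspelled key raises immediately instead of silently vanishing into a dict.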
Notebooks and script are parallel, not redundant. The notebooks are the learning path — they expose internals and make it easy to inspect intermediate results. The script is the production path — it adds DDP, robust checkpointing, and WandB integration. Both read the same YAML.
LoRA for efficient fine-tuning. By default, PEFT LoRA adapters are added to all attention and feed-forward layers (rank 8, alpha 8, dropout 0.1). This allows the full Surya backbone to remain effectively frozen while adapting it to a new task with a small number of trainable parameters. LoRA can be disabled in the YAML if you prefer full fine-tuning or backbone freezing.
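The repo applies LoRA through the PEFT library (`apply_peft_lora()` in `workshop_infrastructure/utils.py`); the hand-rolled layer below just illustrates what a rank-8, alpha-8 adapter does to a single frozen linear layer. It is a sketch of the technique, not the repo's implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter around a frozen linear layer (illustrative;
    the repo uses PEFT rather than a hand-written layer)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 8, dropout: float = 0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # backbone weights stay frozen
        # Low-rank update W + (alpha/rank) * B @ A
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no-op at start
        self.scaling = alpha / rank
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        update = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
        return self.base(x) + update * self.scaling
```

Because `lora_B` is zero-initialized, the adapted layer starts out numerically identical to the frozen base, and only the small `A`/`B` matrices accumulate gradients during fine-tuning.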