MSG Pipeline
In this tutorial, we will walk through how to download and prepare MSG data for further machine learning work. We will:
- download the data
- harmonize the data
- create patches that are ready for ML consumption.
import autoroot
import os
from dotenv import load_dotenv
import numpy as np
import xarray as xr
import rasterio
import matplotlib.pyplot as plt
import cartopy
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER
xr.set_options(
keep_attrs=True,
display_expand_data=False,
display_expand_coords=False,
display_expand_data_vars=False,
display_expand_indexes=False
)
np.set_printoptions(threshold=10, edgeitems=2)
import seaborn as sns
sns.reset_defaults()
sns.set_context(context="talk", font_scale=1.0)
%matplotlib inline
Save Directory
This is arguably the most important part. We need to define where we want to save the data.
We use the autoroot package to handle the project root directory.
root_dir = autoroot.root
Note: the data is very heavy, so make sure you have adequate disk space.
# load environment variables from the local .env file and read the save directory
load_dotenv()
save_dir = os.getenv("ITI_DATA_SAVEDIR")
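Since the downloads can be large, it can be worth checking the free space on the target drive before starting. Below is a minimal sketch using only the standard library; it assumes save_dir points to an existing directory.
import shutil

# report free space (in GB) on the drive that holds the save directory
usage = shutil.disk_usage(save_dir)
print(f"Free space: {usage.free / 1e9:.1f} GB")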
Account
We use the EUMETSAT data registry, which hosts the MSG datasets, together with the EUMETSAT Data Access API, which enables us to easily download data using Python.
Warning: the user must have an account for the EUMETSAT API.
Please follow the link to register for an account.
There are different ways to authenticate your account. We recommend logging in once and storing your credentials in your local ~/.netrc file, or alternatively setting the .env variables EARTHDATA_USERNAME and EARTH_PASSWORD.
See these instructions for more information.
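As a quick illustration, the credentials can live in a .env file at the project root, which python-dotenv then exposes as environment variables. This is only a sketch using the variable names mentioned above; adapt it to however you manage your credentials.
import os
from dotenv import load_dotenv

# load the .env file (python-dotenv searches the working directory and its parents)
load_dotenv()

# credential variable names as referenced above
username = os.getenv("EARTHDATA_USERNAME")
password = os.getenv("EARTH_PASSWORD")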
Config
We have a configuration file which features some of the options available for downloading data. One can take a peek using the command below.
!cat $root_dir/config/example/download.yaml
# PERIOD
period:
  start_date: '2020-10-01'
  start_time: '00:00:00'
  end_date: '2020-10-31'
  end_time: '23:59:00'
# CLOUD MASK
cloud_mask: True
# PATH FOR SAVING DATA
save_dir: data
defaults:
  - _self_
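For a programmatic view, the YAML can also be loaded directly in the notebook. The sketch below assumes the configs are plain OmegaConf/Hydra-style files, as the ${...} interpolations and the defaults list suggest.
from omegaconf import OmegaConf

# load the example download config and print it back as YAML
cfg = OmegaConf.load(f"{root_dir}/config/example/download.yaml")
print(OmegaConf.to_yaml(cfg))

# individual values can also be overridden in code before use
cfg.period.start_date = "2020-10-01"
cfg.period.end_date = "2020-10-02"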
There are also some satellite-specific options that we can change. We can see them using the command below.
# !cat $autoroot.root/config/example/satellite/msg.yaml
download:
  _target_: rs_tools._src.data.msg.downloader_msg.download
  save_dir: ${save_dir}/msg/raw
  start_date: ${period.start_date}
  start_time: ${period.start_time}
  end_date: ${period.end_date}
  end_time: ${period.end_time}
  daily_window_t0: "9:00:00"
  daily_window_t1: "17:00:00"
  time_step: "1:00:00"
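To get a feel for what the daily window and time step imply, the sketch below builds the timestamps that would be requested for a single day, under the (assumed) interpretation that one scene is fetched every time_step between daily_window_t0 and daily_window_t1.
import pandas as pd

# hourly timestamps between 09:00 and 17:00 for one example day
timestamps = pd.date_range(
    start="2020-10-01 09:00:00",
    end="2020-10-01 17:00:00",
    freq="1h",
)
print(timestamps)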
For this tutorial, we will change the save directory and the start/end dates and times. Notice that some of these settings live in the download.yaml file (the period and save directory) while others live in the satellite-specific file, in this case msg.yaml, and both can be overridden from the command line.
python rs_tools \
satellite=msg \
stage=download \
save_dir="/path/to/savedir" \
period.start_date="2020-10-01" \
period.end_date="2020-10-02" \
period.start_time="09:00:00" \
period.end_time="21:00:00"
GeoProcessing
We apply a series of geoprocessing steps to harmonize the raw data onto a common grid. We can peek into the satellite configuration file to see some of the options we have to modify this stage; for example, the Aqua configuration contains the following geoprocess block.
# !cat $autoroot.root/config/example/satellite/aqua.yaml
geoprocess:
  _target_: rs_tools._src.geoprocessing.modis.geoprocessor_modis.geoprocess
  read_path: ${read_path}/aqua/raw
  save_path: ${save_path}/aqua/geoprocessed
  satellite: aqua
In particular, we will focus on the geoprocess step within the configuration. The most important options are the resolution and the region. The resolution is a float or integer that is measured in km.
Below is an example of the command we can run to geoprocess the downloaded data.
python rs_tools \
satellite=msg \
stage=geoprocess \
read_path="/path/to/savedir/" \
save_path="/path/to/savedir/" \
satellite.geoprocess.resolution=5000
# !ls $save_dir/msg/geoprocessed
We can see that the saved data are clean and follow a timestamped naming convention:
/path/to/savedir/msg/geoprocessed/20201001091243_msg.nc
/path/to/savedir/msg/geoprocessed/20201001151243_msg.nc
/path/to/savedir/msg/geoprocessed/20201002091243_msg.nc
ds = xr.open_dataset(f"{save_dir}/msg/geoprocessed/20201001091243_msg.nc", engine="netcdf4")
ds
<xarray.Dataset> Size: 40MB
Dimensions:  (x: 1457, y: 400, band: 11, time: 1, band_wavelength: 11)
Coordinates: (8)
Data variables: (1)
Attributes:
    calibration:         radiance
    standard_name:       toa_outgoing_radiance_per_unit_wavenumber
    platform_name:       Meteosat-11
    sensor:              seviri
    units:               mW m-2 sr-1 (cm-1)-1
    orbital_parameters:  {"projection_longitude": 0.0, "projection_latitude":...
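A few quick sanity checks are useful at this point, for instance the bands, their wavelengths, and the grid spacing. This is only a sketch and assumes the band, band_wavelength, x, and y coordinates shown in the repr above are populated.
# list the SEVIRI bands and their central wavelengths
print(ds.band.values)
print(ds.band_wavelength.values)

# check the grid spacing of the harmonized coordinates
print(np.diff(ds.x.values)[:3])
print(np.diff(ds.y.values)[:3])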
# Plot the first band of the geoprocessed scene on a lat/lon map
fig = plt.figure(figsize=(8, 8))
ax = plt.axes(projection=ccrs.PlateCarree())
cbar_kwargs = {
    "fraction": 0.06,
    "pad": 0.1,
    "orientation": "horizontal",
}
ds.isel(band=0).Rad.plot.pcolormesh(
    ax=ax, x="longitude", y="latitude", transform=ccrs.PlateCarree(),
    cbar_kwargs=cbar_kwargs
)
ax.coastlines()
# Plot lat/lon grid
gl = ax.gridlines(crs=ccrs.PlateCarree(), draw_labels=True,
                  linewidth=0.1, color='k', alpha=1,
                  linestyle='--')
gl.top_labels = False
gl.right_labels = False
gl.xformatter = LONGITUDE_FORMATTER
gl.yformatter = LATITUDE_FORMATTER
gl.xlabel_style = {'size': 12}
gl.ylabel_style = {'size': 12}
plt.tight_layout()
plt.show()
Patching
The final stage cuts the geoprocessed scenes into smaller patches that are ready for ML consumption. We can inspect the patching options in the patch.yaml configuration file below.
!cat $root_dir/config/example/patch.yaml
# PATH WITH GEOPROCESSED DATA
read_path: data
# PATH FOR SAVING PATCHES
save_path: data
# PATCH PARAMETERS
patch_size: 256
stride_size: 256
# NAN CUTOFF
nan_cutoff: 0.5
# FILETYPE TO SAVE [nc = netcdf, np = numpy]
save_filetype: nc
defaults:
  - _self_
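To make the patch_size/stride_size semantics concrete, here is a small sketch of the sliding-window arithmetic. This is only an illustration of the idea, not the library's implementation; note that the patches loaded later in this tutorial were produced with a smaller patch size.
def num_patches(dim_size: int, patch_size: int, stride: int) -> int:
    """Number of full patches a sliding window extracts along one dimension."""
    if dim_size < patch_size:
        return 0
    return (dim_size - patch_size) // stride + 1

# example: the geoprocessed scene above is 1457 x 400 pixels
nx = num_patches(1457, patch_size=256, stride=256)
ny = num_patches(400, patch_size=256, stride=256)
print(f"{nx} x {ny} = {nx * ny} non-overlapping 256x256 patches per scene")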
# !ls $save_dir/msg/analysis/
ds = xr.open_dataset(f"{save_dir}/msg/analysis/20201001091243_patch_0.nc", engine="netcdf4")
ds
<xarray.Dataset> Size: 71kB
Dimensions:  (x: 32, y: 32, band: 11, time: 1, band_wavelength: 11)
Coordinates: (8)
Data variables: (1)
# Plot the first band of a single patch
fig = plt.figure()
ax = plt.axes(projection=ccrs.PlateCarree())
ds.isel(band=0).Rad.plot.pcolormesh(
    ax=ax, x="longitude", y="latitude", transform=ccrs.PlateCarree()
)
ax.coastlines()
# Plot lat/lon grid
gl = ax.gridlines(crs=ccrs.PlateCarree(), draw_labels=True,
                  linewidth=0.1, color='k', alpha=1,
                  linestyle='--')
gl.top_labels = False
gl.right_labels = False
gl.xformatter = LONGITUDE_FORMATTER
gl.yformatter = LATITUDE_FORMATTER
gl.xlabel_style = {'size': 8}
gl.ylabel_style = {'size': 8}
plt.tight_layout()
plt.show()
DataLoading
We can start using any dataloader framework right away. In this example, we will use PyTorch.
from rs_tools._src.utils.io import get_list_filenames
from rs_tools._src.datamodule.utils import load_nc_file
from rs_tools._src.datamodule.editor import StackDictEditor, CoordNormEditor
from toolz import compose_left
We will create a very simple demo dataset and dataloader.
from torch.utils.data import Dataset, DataLoader
from typing import Optional, Callable
class NCDataReader(Dataset):
    """A minimal map-style dataset that reads one NetCDF patch per item."""
    def __init__(self, data_dir: str, ext: str=".nc", transforms: Optional[Callable]=None):
        self.data_dir = data_dir
        # gather all files in the directory with the given extension
        self.data_filenames = get_list_filenames(data_dir, ext)
        self.transforms = transforms
    def __getitem__(self, ind):
        # load a single patch (a dict of arrays) and optionally transform it
        nc_path = self.data_filenames[ind]
        x = load_nc_file(nc_path)
        if self.transforms is not None:
            x = self.transforms(x)
        return x
    def __len__(self):
        return len(self.data_filenames)
ds = NCDataReader(f"{save_dir}/msg/analysis")
dl = DataLoader(ds, batch_size=8)
out = next(iter(dl))
list(out.keys())
['data', 'wavelengths', 'coords', 'cloud_mask']
out["data"].shape, out["coords"].shape
(torch.Size([8, 11, 32, 32]), torch.Size([8, 2, 32, 32]))
Transforms/Editors
We can also use custom transformations (editors) within the dataset, just like in standard PyTorch, to preprocess our samples.
transforms = compose_left(
CoordNormEditor(),
StackDictEditor(),
)
# initialize dataset with transforms
ds = NCDataReader(f"{save_dir}/msg/analysis", transforms=transforms)
# initialize dataloader
dl = DataLoader(ds, batch_size=1)
# do one iteration
out = next(iter(dl))
# inspect a batch
out.shape
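Writing your own editor is straightforward if we assume, as the usage above suggests, that an editor is simply a callable that takes the sample (here a dict of arrays) and returns a transformed version. The class below is a hypothetical example for illustration, not part of rs_tools, and clipping radiances to a fixed range is only a placeholder operation.
class ClipEditor:
    """Hypothetical editor that clips the image data to a fixed value range."""
    def __init__(self, vmin: float = 0.0, vmax: float = 1.0):
        self.vmin = vmin
        self.vmax = vmax
    def __call__(self, sample: dict) -> dict:
        # clip only the image data, leave the other entries untouched
        sample["data"] = np.clip(sample["data"], self.vmin, self.vmax)
        return sample

# custom editors compose left-to-right alongside the built-in ones;
# apply the clip before the dict is stacked
transforms = compose_left(CoordNormEditor(), ClipEditor(), StackDictEditor())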