MODIS Pipeline¶
In this tutorial, we will walk through how to download MODIS data and prepare it for further machine learning work. We will:
- download the data
- harmonize the data
- create patches that are ready for ML consumption.
import autoroot
import os
from dotenv import load_dotenv
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import rasterio
import cartopy
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER
xr.set_options(
keep_attrs=True,
display_expand_data=False,
display_expand_coords=False,
display_expand_data_vars=False,
display_expand_indexes=False
)
np.set_printoptions(threshold=10, edgeitems=2)
import seaborn as sns
sns.reset_defaults()
sns.set_context(context="talk", font_scale=1.0)
%matplotlib inline
Save Directory¶
This is arguably the most important part. We need to define where we want to save the data.
We use the autoroot package to automatically resolve the project root directory.
root_dir = autoroot.root
Note: the data is quite large, so make sure you have adequate disk space.
load_dotenv()  # load environment variables from the local .env file
save_dir = os.getenv("ITI_DATA_SAVEDIR")
Account¶
The data are hosted on the NASA Earthdata registry. We use the earthaccess API, which enables us to easily download the data from Python.
Warning: the user must have an account for the NASA EarthData API.
Please follow the link to register for an account.
There are different ways to authenticate your account.
We recommend you log in once and store your credentials in your local ~/.netrc file, or alternatively set the EARTHDATA_USERNAME and EARTHDATA_PASSWORD variables in your .env file.
See these instructions for more information.
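If you want to confirm that your credentials are picked up, a minimal check with the earthaccess package looks like this (assuming your credentials are in ~/.netrc or exported as EARTHDATA_USERNAME / EARTHDATA_PASSWORD):
import earthaccess
# log in using credentials from ~/.netrc or the EARTHDATA_* environment variables
auth = earthaccess.login()
auth.authenticated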
Config¶
We have a configuration file which features some of the options available for downloading data. One can take a peek using the command below.
!cat $autoroot.root/config/example/download.yaml
# PERIOD
period:
  start_date: '2020-10-01'
  start_time: '00:00:00'
  end_date: '2020-10-31'
  end_time: '23:59:00'
# CLOUD MASK
cloud_mask: True
# PATH FOR SAVING DATA
save_dir: data
defaults:
  - _self_
There are also some satellite-specific options we can change.
We can see them using the command below.
!cat $autoroot.root/config/example/satellite/aqua.yaml
download:
_target_: rs_tools._src.data.modis.downloader_aqua.download
save_dir: ${save_dir}/aqua/
start_date: ${period.start_date}
start_time: ${period.start_time}
end_date: ${period.end_date}
end_time: ${period.end_time}
region: "-130 -15 -90 5" # "lon_min lat_min lon_max lat_max"
For this tutorial, we will change the save directory and the start/end dates and times.
Notice that some of these options live in the download.yaml file while others live in the satellite-specific file, in this case aqua.yaml.
python rs_tools \
satellite=aqua \
stage=download \
save_dir="/path/to/savedir" \
period.start_date="2020-10-01" \
period.end_date="2020-10-02" \
period.start_time="09:00:00" \
period.end_time="21:00:00"
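Once the command finishes, it is worth sanity-checking that files actually landed on disk. The path below is an assumption based on the layout used later by the geoprocessing config (raw granules under <save_dir>/aqua/raw); adjust it to your own save_dir.
# hypothetical check: list a few of the downloaded granules
!ls $save_dir/aqua/raw | head -n 5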
GeoProcessing¶
We have an extensive set of geoprocessing steps for harmonizing the raw granules into analysis-ready files.
We can peek into the configuration files under rs_tools/config/example/ to see some of the options we can modify.
!cat $autoroot.root/config/example/satellite/aqua.yaml
geoprocess:
_target_: rs_tools._src.geoprocessing.modis.geoprocessor_modis.geoprocess
read_path: ${read_path}/aqua/raw
save_path: ${save_path}/aqua/geoprocessed
satellite: aqua
In particular, we will focus on the geoprocess step within the configuration.
The most important options are the resolution and the region.
The resolution is a float or integer that is measured in km.
Below is an example of the command we can run.
python rs_tools \
satellite=aqua \
stage=geoprocess \
read_path=$save_dir/data/iti \
save_path=$save_dir/data/iti
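If you also want to override the resolution or the region from the command line, Hydra-style overrides like the ones below should work. Note that the key names geoprocess.resolution and geoprocess.region are assumptions (they do not appear in the config snippet above), so check your aqua.yaml before relying on them.
# hypothetical overrides; verify the key names in your own config first
python rs_tools \
satellite=aqua \
stage=geoprocess \
read_path=$save_dir/data/iti \
save_path=$save_dir/data/iti \
geoprocess.resolution=1.0 \
geoprocess.region="-130 -15 -90 5"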
We can see that the saved files are cleanly named by acquisition timestamp and satellite, e.g.:
/path/to/savedir/aqua/geoprocessed/20201001195500_aqua.nc
# !ls $save_dir/aqua/geoprocessed
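The filenames encode the acquisition timestamp followed by the satellite name. As a small illustration (the snippet below is hypothetical, not part of rs_tools), we can recover a datetime object from a filename:
from datetime import datetime
# hypothetical snippet: parse the acquisition time out of a geoprocessed filename
fname = "20201001195500_aqua.nc"
timestamp = datetime.strptime(fname.split("_")[0], "%Y%m%d%H%M%S")
timestamp  # datetime.datetime(2020, 10, 1, 19, 55)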
ds = xr.open_dataset(f"{save_dir}/aqua/geoprocessed/20201001195500_aqua.nc", engine="netcdf4")
ds
<xarray.Dataset> Size: 453MB
Dimensions:  (y: 2040, x: 1354, band: 38, time: 1, band_wavelength: 38)
Coordinates: (6)
Dimensions without coordinates: y, x
Data variables: (1)
Attributes:
    calibration:    radiance
    standard_name:  toa_outgoing_radiance_per_unit_wavelength
    platform_name:  EOS-Aqua
    sensor:         modis
    units:          Watts/m^2/micrometer/steradian
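Before plotting, a couple of quick sanity checks can be helpful; the dimension and attribute names below are taken directly from the repr above.
# quick sanity checks on the geoprocessed scene
print(dict(ds.sizes))
print(ds.attrs["platform_name"], ds.attrs["units"])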
# plot the first band of the geoprocessed radiances on a map
fig = plt.figure(figsize=(8, 8))
ax = plt.axes(projection=ccrs.PlateCarree())
cbar_kwargs = {
    "fraction": 0.06,
    "pad": 0.1,
    "orientation": "horizontal",
}
ds.isel(band=0).Rad.plot.pcolormesh(
    x="longitude", y="latitude", transform=ccrs.PlateCarree(),
    cbar_kwargs=cbar_kwargs
)
ax.set(xlim=[-140, -70], ylim=[-40, 10])
ax.coastlines()
# Plot lat/lon grid
gl = ax.gridlines(crs=ccrs.PlateCarree(), draw_labels=True,
linewidth=0.1, color='k', alpha=1,
linestyle='--')
gl.top_labels = False
gl.right_labels = False
gl.xformatter = LONGITUDE_FORMATTER
gl.yformatter = LATITUDE_FORMATTER
gl.xlabel_style = {'size': 12}
gl.ylabel_style = {'size': 12}
plt.tight_layout()
plt.show()
Patching¶
We can take a peek at the patching configuration using the command below.
!cat $autoroot.root/config/example/patch.yaml
The most important arguments are the patch_size and stride_size arguments.
The patch_size dictates how big the patches are and the stride_size dictates how far apart consecutive patches start.
For maximum overlap, the stride size should be 1 (consecutive patches then share patch_size - 1 pixels).
For no overlap, the stride size should be equal to the patch_size.
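To make the relationship concrete, here is a small sketch (plain Python arithmetic, not an rs_tools function) that counts how many patches the 2040 x 1354 geoprocessed scene above yields for a given patch_size and stride_size:
def num_patches(dim_size: int, patch_size: int, stride_size: int) -> int:
    # number of full patches that fit along one dimension
    return (dim_size - patch_size) // stride_size + 1

# the geoprocessed scene above is 2040 x 1354 pixels
for stride in [16, 8, 1]:
    ny = num_patches(2040, patch_size=16, stride_size=stride)
    nx = num_patches(1354, patch_size=16, stride_size=stride)
    print(f"stride={stride:2d} -> {ny} x {nx} = {ny * nx} patches")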
python rs_tools \
satellite=aqua \
stage=patch \
read_path=$save_dir \
save_path=$save_dir \
nan_cutoff=0.5 \
patch_size=16 \
stride_size=16
Demo Visualization¶
ds = xr.open_dataset(f"{save_dir}/aqua/analysis/20201001195500_patch_0.nc", engine="netcdf4")
ds
<xarray.Dataset> Size: 169kB
Dimensions:  (y: 32, x: 32, band: 38, time: 1, band_wavelength: 38)
Coordinates: (6)
Dimensions without coordinates: y, x
Data variables: (1)
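Each patch keeps its 2D longitude/latitude coordinates, so we can check the geographic footprint of this particular patch before plotting it:
# approximate geographic bounding box of this patch
print(float(ds.longitude.min()), "to", float(ds.longitude.max()), "degrees longitude")
print(float(ds.latitude.min()), "to", float(ds.latitude.max()), "degrees latitude")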
# plot the first band of a single patch on a map
fig = plt.figure()
ax = plt.axes(projection=ccrs.PlateCarree())
ds.isel(band=0).Rad.plot.pcolormesh(x="longitude", y="latitude", transform=ccrs.PlateCarree())
ax.coastlines()
# Plot lat/lon grid
gl = ax.gridlines(crs=ccrs.PlateCarree(), draw_labels=True,
linewidth=0.1, color='k', alpha=1,
linestyle='--')
gl.top_labels = False
gl.right_labels = False
gl.xformatter = LONGITUDE_FORMATTER
gl.yformatter = LATITUDE_FORMATTER
gl.xlabel_style = {'size': 8}
gl.ylabel_style = {'size': 8}
plt.tight_layout()
plt.show()
DataLoading¶
We can start using any dataloader framework right away. In this example, we will use PyTorch.
from rs_tools._src.utils.io import get_list_filenames
from rs_tools._src.datamodule.utils import load_nc_file
from rs_tools._src.datamodule.editor import StackDictEditor, CoordNormEditor
from toolz import compose_left
We will create a very simple demo dataloader.
from torch.utils.data import Dataset, DataLoader
from typing import Optional, Callable
class NCDataReader(Dataset):
    """Minimal PyTorch Dataset that reads patch files (.nc) from a directory."""

    def __init__(self, data_dir: str, ext: str = ".nc", transforms: Optional[Callable] = None):
        self.data_dir = data_dir
        # gather all files in the directory with the given extension
        self.data_filenames = get_list_filenames(data_dir, ext)
        self.transforms = transforms

    def __getitem__(self, ind) -> np.ndarray:
        nc_path = self.data_filenames[ind]
        # load a single netCDF patch as a dictionary of arrays
        x = load_nc_file(nc_path)
        if self.transforms is not None:
            x = self.transforms(x)
        return x

    def __len__(self):
        return len(self.data_filenames)
ds = NCDataReader(f"{save_dir}/aqua/analysis")
dl = DataLoader(ds, batch_size=8)
out = next(iter(dl))
list(out.keys())
['data', 'wavelengths', 'coords', 'cloud_mask']
out["data"].shape, out["coords"].shape
(torch.Size([8, 38, 32, 32]), torch.Size([8, 2, 32, 32]))
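The batch is already in channels-first layout, so it can be fed straight into a convolutional model. Below is a minimal sanity check with a toy torch.nn.Conv2d layer (purely illustrative, not part of rs_tools); the .float() cast is there in case the loader returns float64 tensors.
import torch.nn as nn
# toy convolution over the 38 spectral bands, just to confirm the tensors are model-ready
conv = nn.Conv2d(in_channels=38, out_channels=16, kernel_size=3, padding=1)
features = conv(out["data"].float())
features.shape  # expected: torch.Size([8, 16, 32, 32])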
Transforms/Editors¶
We can also use custom transformations within the dataset (just like in standard PyTorch) to preprocess each sample.
transforms = compose_left(
CoordNormEditor(),
StackDictEditor(),
)
# initialize dataset with transforms
ds = NCDataReader(f"{save_dir}/aqua/analysis", transforms=transforms)
# initialize dataloader
dl = DataLoader(ds, batch_size=8)
# do one iteration
out = next(iter(dl))
# inspect a batch
out.shape