There are a variety of sample scripts out there for using the HRRR Zarr data. They don't all load the data the same way, and how best to load it can be confusing. This guide provides our current (as of writing) recommendations for the "best" ways to access the data, weighing performance (speed), code length, and a preference for high-level APIs. In the examples, as much as possible is written as functions that can be copied into your own code.
The data is split into separate Zarr arrays based on the following: the model run hour, the level type ("sfc" or "prs"), the variable level, the variable name, and the model type ("anl" for analysis or "fcst" for forecast).
For your use case, many of these values may always be the same. For instance, perhaps there are two variables and other than that only the model run changes. But the functions provided here need to be usable for all cases, so let's create a data class in Python to keep track of these.
import dataclasses
import datetime

@dataclasses.dataclass
class ZarrId:
    run_hour: datetime.datetime
    level_type: str
    var_level: str
    var_name: str
    model_type: str

    def format_chunk_id(self, chunk_id):
        if self.model_type == "fcst":
            # Extra id part since forecasts have an additional (time) dimension
            return "0." + str(chunk_id)
        else:
            return chunk_id
Each Zarr array is chunked such that every grid point is consistently in the same spot in the same chunk. This allows us to efficiently analyze data at a single grid point or small region without having to load the whole zarr array. This picture shows how the grid is chunked:
As you can see, the chunk ids have the format <grid y chunk index>.<grid x chunk index>
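Given that format, and the 150x150 chunk size used by these arrays (visible in the decompression code later in this guide), a hypothetical helper can compute which chunk a grid point falls in from its integer grid indices:

```python
# Hypothetical helper: map a grid point's (y, x) integer grid indices to its
# chunk id, assuming the 150x150 chunk size used by the HRRR Zarr arrays.
def chunk_id_for_point(grid_y, grid_x, chunk_size=150):
    return f"{grid_y // chunk_size}.{grid_x // chunk_size}"

chunk_id_for_point(620, 480)  # → "4.3"
```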
level_type = "sfc"
model_type = "fcst"
run_hour = datetime.datetime(2021, 1, 1, 7)

def create_hrrr_zarr_explorer_url(level_type, model_type, run_hour):
    url = "https://hrrrzarr.s3.amazonaws.com/index.html"
    url += run_hour.strftime(
        f"#{level_type}/%Y%m%d/%Y%m%d_%Hz_{model_type}.zarr/")
    return url

print(create_hrrr_zarr_explorer_url(level_type, model_type, run_hour))
print(create_hrrr_zarr_explorer_url("prs", "anl", run_hour))
https://hrrrzarr.s3.amazonaws.com/index.html#sfc/20210101/20210101_07z_fcst.zarr/
https://hrrrzarr.s3.amazonaws.com/index.html#prs/20210101/20210101_07z_anl.zarr/
URL to directly download a chunk:
zarr_id = ZarrId(
    run_hour=datetime.datetime(2020, 8, 1, 0),  # Aug 1, 0Z
    level_type="sfc",
    var_level="1000mb",
    var_name="TMP",
    model_type="anl"
)

chunk_id = "4.3"

def create_https_chunk_url(zarr_id, chunk_id):
    url = "https://hrrrzarr.s3.amazonaws.com"
    url += zarr_id.run_hour.strftime(
        f"/{zarr_id.level_type}/%Y%m%d/%Y%m%d_%Hz_{zarr_id.model_type}.zarr/")
    url += f"{zarr_id.var_level}/{zarr_id.var_name}/{zarr_id.var_level}/{zarr_id.var_name}"
    url += f"/{zarr_id.format_chunk_id(chunk_id)}"
    return url

create_https_chunk_url(zarr_id, chunk_id)
'https://hrrrzarr.s3.amazonaws.com/sfc/20200801/20200801_00z_anl.zarr/1000mb/TMP/1000mb/TMP/4.3'
For automated purposes, we improve performance by downloading via the S3 protocol instead of HTTPS. There are three S3 URLs we care about, corresponding to levels of the Zarr hierarchy:
def create_s3_group_url(zarr_id, prefix=True):
    url = "s3://hrrrzarr/" if prefix else ""  # Skip when using boto3
    url += zarr_id.run_hour.strftime(
        f"{zarr_id.level_type}/%Y%m%d/%Y%m%d_%Hz_{zarr_id.model_type}.zarr/")
    url += f"{zarr_id.var_level}/{zarr_id.var_name}"
    return url

create_s3_group_url(zarr_id)
's3://hrrrzarr/sfc/20200801/20200801_00z_anl.zarr/1000mb/TMP'
def create_s3_subgroup_url(zarr_id, prefix=True):
    url = create_s3_group_url(zarr_id, prefix)
    url += f"/{zarr_id.var_level}"
    return url

create_s3_subgroup_url(zarr_id)
's3://hrrrzarr/sfc/20200801/20200801_00z_anl.zarr/1000mb/TMP/1000mb'
def create_s3_chunk_url(zarr_id, chunk_id, prefix=False):
    url = create_s3_subgroup_url(zarr_id, prefix)
    url += f"/{zarr_id.var_name}/{zarr_id.format_chunk_id(chunk_id)}"
    return url

create_s3_chunk_url(zarr_id, chunk_id)
'sfc/20200801/20200801_00z_anl.zarr/1000mb/TMP/1000mb/TMP/4.3'
The projection description for the spatial grid, which is stored in the HRRR grib2 files, is not available directly in the zarr arrays. Additionally, the zarr data has the x and y coordinates in the native Lambert Conformal Conic projection, but not the latitude and longitude data. The various use cases will have more specific examples of how to handle this information, but here we'll note that:
{'a': 6371229,
'b': 6371229,
'proj': 'lcc',
'lon_0': 262.5,
'lat_0': 38.5,
'lat_1': 38.5,
'lat_2': 38.5}
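If you work with pyproj rather than cartopy, the same parameter dictionary can be passed to it directly. A minimal sketch (assumes pyproj is installed; the variable names are ours):

```python
import pyproj

# Build a CRS from the HRRR projection parameters shown above
hrrr_params = {'a': 6371229, 'b': 6371229, 'proj': 'lcc',
               'lon_0': 262.5, 'lat_0': 38.5, 'lat_1': 38.5, 'lat_2': 38.5}
crs = pyproj.CRS.from_dict(hrrr_params)
```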
We also offer a "chunk index" that has the latitude and longitude coordinates for the grid, more on that under Use Case 3.
The following code sets up the correct CRS in cartopy:
import cartopy.crs as ccrs

projection = ccrs.LambertConformal(central_longitude=262.5,
                                   central_latitude=38.5,
                                   standard_parallels=(38.5, 38.5),
                                   globe=ccrs.Globe(semimajor_axis=6371229,
                                                    semiminor_axis=6371229))
In this case, we recommend using xarray.open_mfdataset. It allows you to load the group with its metadata and the subgroup that contains the actual data variable at the same time.
Note: If you get an error like "CRSError: Invalid datum string", there's a fix for it at the bottom of this page.
import s3fs
import xarray as xr
import metpy

def load_dataset(urls):
    fs = s3fs.S3FileSystem(anon=True)
    ds = xr.open_mfdataset([s3fs.S3Map(url, s3=fs) for url in urls], engine='zarr')

    # add the projection data
    ds = ds.rename(projection_x_coordinate="x", projection_y_coordinate="y")
    ds = ds.metpy.assign_crs(projection.to_cf())
    ds = ds.metpy.assign_latitude_longitude()
    return ds

load_dataset([create_s3_group_url(zarr_id), create_s3_subgroup_url(zarr_id)])
<xarray.Dataset>
Dimensions:                  (x: 1799, y: 1059)
Coordinates:
  * x                        (x) float64 -2.698e+06 -2.695e+06 ... 2.696e+06
  * y                        (y) float64 -1.587e+06 -1.584e+06 ... 1.587e+06
    metpy_crs                object Projection: lambert_conformal_conic
    latitude                 (y, x) float64 21.14 21.15 21.15 ... 47.85 47.84
    longitude                (y, x) float64 -122.7 -122.7 ... -60.95 -60.92
Data variables:
    TMP                      (y, x) float16 dask.array<chunksize=(150, 150), meta=np.ndarray>
    forecast_period          timedelta64[ns] ...
    forecast_reference_time  datetime64[ns] ...
    height                   float64 ...
    pressure                 float64 ...
    time                     datetime64[ns] ...
In this case, it's more performant to use xarray.open_dataset. Call it once on the group URL to get the grid data (remember, the geospatial grid is always the same, so you can use any model run hour for that part). Then call open_dataset again for every run hour you're interested in.
fs = s3fs.S3FileSystem(anon=True)

def load(url, run_hour=None, new_time_dimension=None):
    # Download the data from S3. May be lazy.
    ds = xr.open_dataset(s3fs.S3Map(url, s3=fs), engine="zarr")

    # Add the model run hour in as a dimension
    if run_hour is not None:
        ds[new_time_dimension] = run_hour
        ds = ds.set_coords(new_time_dimension)

    # Later on we use metpy functions that expect the grid variables to be x and y
    ds = ds.rename(projection_x_coordinate="x", projection_y_coordinate="y")
    return ds
def load_combined_dataset(zarr_ids):
    # Get the grid data (at a long forecast hour in case the whole time dim is needed)
    grid_zarr_id = dataclasses.replace(zarr_ids[0])  # dataclasses.replace is just a fancy copy function
    grid_zarr_id.run_hour = grid_zarr_id.run_hour.replace(hour=0)
    grid = load(create_s3_group_url(grid_zarr_id))

    is_forecast = zarr_ids[0].model_type == "fcst"
    new_time_dimension = "reference_time" if is_forecast else "time"
    datasets = [load(create_s3_subgroup_url(zarr_id), zarr_id.run_hour, new_time_dimension)
                for zarr_id in zarr_ids]

    if is_forecast:  # Align the time axes of each dataset (b/c forecasts have different lengths)
        for dataset in datasets:
            dataset["time"] = grid["time"][:len(dataset["time"])]
        datasets = xr.align(*datasets, join="outer")

    ds = xr.concat(datasets, dim=new_time_dimension, combine_attrs="override")

    # Add the geospatial data to the combined dataset
    ds["x"] = grid["x"]
    ds["y"] = grid["y"]
    ds = ds.metpy.assign_crs(projection.to_cf())
    ds = ds.metpy.assign_latitude_longitude()
    return ds
zarr_ids = [ZarrId(
                run_hour=datetime.datetime(2021, 7, 16, 6) + datetime.timedelta(hours=time_delta),
                level_type="sfc",
                var_level="entire_atmosphere_single_layer",
                var_name="PWAT",
                model_type="fcst"
            )
            for time_delta in range(3)]  # get 3 hours starting at the given time

load_combined_dataset(zarr_ids)
<xarray.Dataset>
Dimensions:         (y: 1059, x: 1799, time: 48, reference_time: 3)
Coordinates:
  * y               (y) float64 -1.587e+06 -1.584e+06 ... 1.584e+06 1.587e+06
  * x               (x) float64 -2.698e+06 -2.695e+06 ... 2.693e+06 2.696e+06
  * time            (time) datetime64[ns] 2021-07-16T01:00:00 ... 2021-07-18
  * reference_time  (reference_time) datetime64[ns] 2021-07-16T06:00:00 ... 2...
    metpy_crs       object Projection: lambert_conformal_conic
    latitude        (y, x) float64 21.14 21.15 21.15 21.16 ... 47.86 47.85 47.84
    longitude       (y, x) float64 -122.7 -122.7 -122.7 ... -60.99 -60.95 -60.92
Data variables:
    PWAT            (reference_time, time, y, x) float16 40.81 40.81 ... nan nan
If you're downloading many hours of data and don't need the whole grid, the xarray method will be slow, largely because it downloads various forms of metadata, including the coordinates, for every hour and variable. Instead, because the data is chunked by geospatial location, you can download only the necessary chunks of the Zarr array. Doing this is lower-level and won't decompress the data or read metadata automatically.
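To put rough numbers on this, using the 1799x1059 CONUS grid dimensions and 150x150 chunk size that appear elsewhere in this guide, a single chunk is only about 1% of the grid:

```python
import math

# Grid and chunk dimensions from this guide (1799x1059 grid, 150x150 chunks)
nx, ny, chunk = 1799, 1059, 150
n_chunks = math.ceil(nx / chunk) * math.ceil(ny / chunk)
fraction_per_chunk = (chunk * chunk) / (nx * ny)
print(n_chunks, round(fraction_per_chunk, 4))  # 96 chunks, ~1.2% of the grid each
```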
First, we get the chunk index file:
chunk_index = xr.open_zarr(s3fs.S3Map("s3://hrrrzarr/grid/HRRR_chunk_index.zarr", s3=fs))
chunk_index
<xarray.Dataset>
Dimensions:     (x: 1799, y: 1059)
Coordinates:
  * x           (x) float64 -2.698e+06 -2.695e+06 ... 2.693e+06 2.696e+06
  * y           (y) float64 -1.587e+06 -1.584e+06 ... 1.584e+06 1.587e+06
Data variables:
    chunk_id    (x, y) object dask.array<chunksize=(225, 265), meta=np.ndarray>
    chunk_x     (x) int32 dask.array<chunksize=(1799,), meta=np.ndarray>
    chunk_y     (y) int32 dask.array<chunksize=(1059,), meta=np.ndarray>
    in_chunk_x  (x) int32 dask.array<chunksize=(1799,), meta=np.ndarray>
    in_chunk_y  (y) int32 dask.array<chunksize=(1059,), meta=np.ndarray>
    index_x     (x) int32 dask.array<chunksize=(1799,), meta=np.ndarray>
    index_y     (y) int32 dask.array<chunksize=(1059,), meta=np.ndarray>
    latitude    (y, x) float64 dask.array<chunksize=(133, 450), meta=np.ndarray>
    longitude   (y, x) float64 dask.array<chunksize=(133, 450), meta=np.ndarray>
Next, we need to get the nearest point:
def get_nearest_point(projection, chunk_index, longitude, latitude):
    x, y = projection.transform_point(longitude, latitude, ccrs.PlateCarree())
    return chunk_index.sel(x=x, y=y, method="nearest")

nearest_point = get_nearest_point(projection, chunk_index, -111.8910, 40.7608)
chunk_id = nearest_point.chunk_id.values
print(chunk_id)
4.3
Then, we retrieve the chunk. We can't use xarray, since it doesn't understand how to load a single chunk of a Zarr array. We could simply retrieve the data from the HTTPS URL, but if we're going to do this request more than once (e.g. getting the same chunk for every hour of the day), it's faster to use boto3.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Don't recreate this resource in a loop! That caused a 3-4x slowdown for me.
s3 = boto3.resource(service_name='s3', region_name='us-west-1',
                    config=Config(signature_version=UNSIGNED))

def retrieve_object(s3, s3_url):
    obj = s3.Object('hrrrzarr', s3_url)
    return obj.get()['Body'].read()

zarr_id = ZarrId(
    run_hour=datetime.datetime(2019, 1, 1, 12),
    level_type="sfc",
    var_level="2m_above_ground",
    var_name="TMP",
    model_type="anl"
)

compressed_data = retrieve_object(s3, create_s3_chunk_url(zarr_id, chunk_id))
Now we have to decompress:
import numcodecs as ncd
import numpy as np

def decompress_chunk(zarr_id, compressed_data):
    buffer = ncd.blosc.decompress(compressed_data)

    dtype = "<f2"
    if zarr_id.var_level == "surface" and zarr_id.var_name == "PRES":
        dtype = "<f4"

    chunk = np.frombuffer(buffer, dtype=dtype)

    if zarr_id.model_type == "anl":
        data_array = np.reshape(chunk, (150, 150))
    else:
        entry_size = 22500
        data_array = np.reshape(chunk, (len(chunk)//entry_size, 150, 150))

    return data_array

chunk_data = decompress_chunk(zarr_id, compressed_data)
Once we have the decompressed data for the whole chunk, we need to index into the chunk to get the value at our point:
chunk_data[nearest_point.in_chunk_y.values, nearest_point.in_chunk_x.values]
262.0
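The TMP values are in Kelvin, so a quick stdlib-only conversion is handy for display (the helper names here are ours):

```python
# Convert the Kelvin temperature above to Celsius and Fahrenheit
def kelvin_to_celsius(k):
    return k - 273.15

def kelvin_to_fahrenheit(k):
    return kelvin_to_celsius(k) * 9 / 5 + 32

round(kelvin_to_celsius(262.0), 2)     # → -11.15
round(kelvin_to_fahrenheit(262.0), 2)  # → 11.93
```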
Putting this together to get a time series:
def get_value(zarr_id, chunk_id, nearest_point):
    compressed_data = retrieve_object(s3, create_s3_chunk_url(zarr_id, chunk_id))
    chunk_data = decompress_chunk(zarr_id, compressed_data)
    if zarr_id.model_type == "fcst":
        return chunk_data[:, nearest_point.in_chunk_y.values, nearest_point.in_chunk_x.values]
    else:
        return chunk_data[nearest_point.in_chunk_y.values, nearest_point.in_chunk_x.values]

zarr_ids = [dataclasses.replace(zarr_id, run_hour=zarr_id.run_hour + datetime.timedelta(hours=time_delta))
            for time_delta in range(12)]

[get_value(zid, chunk_id, nearest_point) for zid in zarr_ids]
[262.0, 261.2, 261.2, 261.5, 262.8, 264.5, 265.0, 265.8, 266.2, 267.5, 267.5, 267.5]
# forecast example
fcst_zarr_id = ZarrId(
    run_hour=datetime.datetime(2019, 1, 1, 12),
    level_type="sfc",
    var_level="surface",
    var_name="PRES",
    model_type="fcst"
)

zarr_ids = [dataclasses.replace(fcst_zarr_id, run_hour=fcst_zarr_id.run_hour + datetime.timedelta(hours=time_delta))
            for time_delta in range(3)]

[get_value(zid, chunk_id, nearest_point) for zid in zarr_ids]
[array([87240., 87380., 87420., 87460., 87430., 87460., 87500., 87450.,
        87470., 87500., 87490., 87560., 87640., 87660., 87690., 87700.,
        87710., 87780., 87760., 87770., 87800., 87800., 87780., 87790.,
        87810., 87840., 87850., 87880., 87900., 87900., 87850., 87790.,
        87750., 87740., 87750., 87760.], dtype=float32),
 array([87370., 87410., 87540., 87460., 87420., 87450., 87470., 87440.,
        87490., 87560., 87610., 87650., 87710., 87680., 87760., 87780.,
        87760., 87790.], dtype=float32),
 array([87500., 87500., 87550., 87460., 87530., 87460., 87460., 87470.,
        87490., 87530., 87620., 87680., 87740., 87700., 87780., 87800.,
        87730., 87750.], dtype=float32)]
Here we get a subportion of the grid based on latitude/longitude bounds.
lat_top = 39
lat_bottom = 34
lon_top = -107
lon_bottom = -110  # Four Corners region

def check_boundaries(data):
    return (lat_bottom < data.latitude) & (data.latitude < lat_top) & (
        lon_bottom < data.longitude) & (data.longitude < lon_top)

area = chunk_index.where(check_boundaries, drop=True)
area
<xarray.Dataset>
Dimensions:     (x: 110, y: 194)
Coordinates:
  * x           (x) float64 -1.15e+06 -1.147e+06 ... -8.255e+05 -8.225e+05
  * y           (y) float64 -4.533e+05 -4.503e+05 ... 1.227e+05 1.257e+05
Data variables:
    chunk_id    (x, y) object dask.array<chunksize=(110, 21), meta=np.ndarray>
    chunk_x     (x, y) float64 dask.array<chunksize=(110, 21), meta=np.ndarray>
    chunk_y     (y, x) float64 dask.array<chunksize=(21, 110), meta=np.ndarray>
    in_chunk_x  (x, y) float64 dask.array<chunksize=(110, 21), meta=np.ndarray>
    in_chunk_y  (y, x) float64 dask.array<chunksize=(21, 110), meta=np.ndarray>
    index_x     (x, y) float64 dask.array<chunksize=(110, 21), meta=np.ndarray>
    index_y     (y, x) float64 dask.array<chunksize=(21, 110), meta=np.ndarray>
    latitude    (y, x) float64 dask.array<chunksize=(21, 110), meta=np.ndarray>
    longitude   (y, x) float64 dask.array<chunksize=(21, 110), meta=np.ndarray>
Note that while we've dropped rows and columns with no datapoints in the box, we still have lots of null datapoints where either the projection x or the projection y coordinate is sometimes within the box, but the combination is not.
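A toy numpy analogue of what's happening: a row or column is only dropped if it's entirely outside the mask, so cells where just one coordinate is in range survive as NaN (the values below are made up):

```python
import numpy as np

# 4x4 toy grid with a diagonal mask: every row and column contains at least
# one True, so nothing can be dropped, yet most cells become NaN
vals = np.arange(16.0).reshape(4, 4)
mask = np.eye(4, dtype=bool)
filtered = np.where(mask, vals, np.nan)
int(np.isnan(filtered).sum())  # → 12 of the 16 cells are NaN
```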
Let's get the unique chunk ids from this dataset.
def get_unique(data):
    # We have to implement our own "unique" logic since missing values are NaN (a float) and the rest are strings
    data = data.fillna(None).values.flatten()
    data = data[data != None]
    return np.unique(data)

chunk_ids = get_unique(area.chunk_id)
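To see why a custom "unique" is needed, here's a standalone toy: np.unique sorts its input, and comparing a float NaN with a string raises a TypeError, so the NaNs have to be filtered out first (the values are made up):

```python
import numpy as np

mixed = np.array(["4.3", np.nan, "4.4", "4.3", np.nan], dtype=object)

# np.unique sorts, and comparing float NaN with str raises TypeError
try:
    np.unique(mixed)
    failed = False
except TypeError:
    failed = True

# Filtering out the NaNs first makes it work
strings = np.array([v for v in mixed if isinstance(v, str)])
unique_ids = np.unique(strings)  # → array(['4.3', '4.4'], ...)
```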
Now we download the chunks from AWS, decompress, and join the chunk data with the grid data.
def get_chunk(zarr_id, chunk_id):
    # retrieve data as before
    compressed_data = retrieve_object(s3, create_s3_chunk_url(zarr_id, chunk_id))
    chunk_data = decompress_chunk(zarr_id, compressed_data)

    # combine retrieved data with the chunk grid
    chunk_xarray = chunk_index.where(lambda x: x.chunk_id == chunk_id, drop=True)
    dimensions = ("y", "x") if zarr_id.model_type == "anl" else ("time", "y", "x")
    chunk_xarray[zarr_id.var_name] = (dimensions, chunk_data)
    return chunk_xarray

def get_chunks_combined(zarr_id, chunk_ids):
    chunks = [get_chunk(zarr_id, chunk_id) for chunk_id in chunk_ids]
    return xr.merge(chunks)

data = get_chunks_combined(zarr_id, chunk_ids)
data
<xarray.Dataset>
Dimensions:     (x: 300, y: 300)
Coordinates:
  * x           (x) float64 -1.348e+06 -1.345e+06 ... -4.535e+05 -4.505e+05
  * y           (y) float64 -6.873e+05 -6.843e+05 ... 2.067e+05 2.097e+05
Data variables:
    chunk_id    (x, y) object dask.array<chunksize=(225, 230), meta=np.ndarray>
    chunk_x     (x, y) float64 dask.array<chunksize=(225, 230), meta=np.ndarray>
    chunk_y     (y, x) float64 dask.array<chunksize=(230, 225), meta=np.ndarray>
    in_chunk_x  (x, y) float64 dask.array<chunksize=(225, 230), meta=np.ndarray>
    in_chunk_y  (y, x) float64 dask.array<chunksize=(230, 225), meta=np.ndarray>
    index_x     (x, y) float64 dask.array<chunksize=(225, 230), meta=np.ndarray>
    index_y     (y, x) float64 dask.array<chunksize=(230, 225), meta=np.ndarray>
    latitude    (y, x) float64 dask.array<chunksize=(99, 225), meta=np.ndarray>
    longitude   (y, x) float64 dask.array<chunksize=(99, 225), meta=np.ndarray>
    TMP         (y, x) float16 276.8 276.8 277.0 277.0 ... 258.2 258.5 258.5
This has all the data for the four chunks. If we truly only want to look at the area of interest, we need to apply the filter again.
data.where(check_boundaries, drop=True)
<xarray.Dataset>
Dimensions:     (x: 110, y: 194)
Coordinates:
  * x           (x) float64 -1.15e+06 -1.147e+06 ... -8.255e+05 -8.225e+05
  * y           (y) float64 -4.533e+05 -4.503e+05 ... 1.227e+05 1.257e+05
Data variables:
    chunk_id    (x, y) object dask.array<chunksize=(110, 21), meta=np.ndarray>
    chunk_x     (x, y) float64 dask.array<chunksize=(110, 21), meta=np.ndarray>
    chunk_y     (y, x) float64 dask.array<chunksize=(21, 110), meta=np.ndarray>
    in_chunk_x  (x, y) float64 dask.array<chunksize=(110, 21), meta=np.ndarray>
    in_chunk_y  (y, x) float64 dask.array<chunksize=(21, 110), meta=np.ndarray>
    index_x     (x, y) float64 dask.array<chunksize=(110, 21), meta=np.ndarray>
    index_y     (y, x) float64 dask.array<chunksize=(21, 110), meta=np.ndarray>
    latitude    (y, x) float64 dask.array<chunksize=(21, 110), meta=np.ndarray>
    longitude   (y, x) float64 dask.array<chunksize=(21, 110), meta=np.ndarray>
    TMP         (y, x) float16 dask.array<chunksize=(21, 110), meta=np.ndarray>
Despite all the NaNs, there's still data!
data.TMP.plot()
<matplotlib.collections.QuadMesh at 0x16930b700>
Finishing it off to get data at different times:
start = datetime.datetime(2018, 1, 1, 0)
times = [start + datetime.timedelta(weeks=week_delta) for week_delta in range(2)]
zarr_ids = [dataclasses.replace(zarr_id, run_hour=time) for time in times]

def get_data(zarr_ids, chunk_ids, is_forecast):
    new_time_dimension = "run_time" if is_forecast else "time"
    datasets = []
    for zarr_id in zarr_ids:
        data = get_chunks_combined(zarr_id, chunk_ids)
        data[new_time_dimension] = zarr_id.run_hour
        datasets.append(data)
    return xr.concat(datasets, dim=new_time_dimension, combine_attrs="override")

get_data(zarr_ids, chunk_ids, False)
<xarray.Dataset>
Dimensions:     (x: 300, y: 300, time: 2)
Coordinates:
  * x           (x) float64 -1.348e+06 -1.345e+06 ... -4.535e+05 -4.505e+05
  * y           (y) float64 -6.873e+05 -6.843e+05 ... 2.067e+05 2.097e+05
  * time        (time) datetime64[ns] 2018-01-01 2018-01-08
Data variables:
    chunk_id    (time, x, y) object dask.array<chunksize=(1, 225, 230), meta=np.ndarray>
    chunk_x     (time, x, y) float64 dask.array<chunksize=(1, 225, 230), meta=np.ndarray>
    chunk_y     (time, y, x) float64 dask.array<chunksize=(1, 230, 225), meta=np.ndarray>
    in_chunk_x  (time, x, y) float64 dask.array<chunksize=(1, 225, 230), meta=np.ndarray>
    in_chunk_y  (time, y, x) float64 dask.array<chunksize=(1, 230, 225), meta=np.ndarray>
    index_x     (time, x, y) float64 dask.array<chunksize=(1, 225, 230), meta=np.ndarray>
    index_y     (time, y, x) float64 dask.array<chunksize=(1, 230, 225), meta=np.ndarray>
    latitude    (time, y, x) float64 dask.array<chunksize=(1, 99, 225), meta=np.ndarray>
    longitude   (time, y, x) float64 dask.array<chunksize=(1, 99, 225), meta=np.ndarray>
    TMP         (time, y, x) float16 295.2 295.2 295.2 ... 277.2 277.2 277.2
In my Jupyter notebook setup, an older version of the pyproj package gives the following error when it tries to look up the projection info:
CRSError: Invalid datum string: urn:ogc:def:datum:EPSG::6326: (Internal Proj Error: proj_create: SQLite error on SELECT name, ellipsoid_auth_name, ellipsoid_code, prime_meridian_auth_name, prime_meridian_code, publication_date, frame_reference_epoch, deprecated FROM geodetic_datum WHERE auth_name = ? AND code = ?: no such column: publication_date)
I believe this is because I run Jupyter from a conda environment that's different from the kernel Jupyter is using. In any case, there's an easy fix:
import pyproj
pyproj.datadir.set_data_dir("/Users/<me>/.conda/envs/<this notebook's kernel env>/share/proj")