Chapter 2: Earth Observation Datasets

In this chapter, I describe the database of remote sensing-derived data, or Earth Observations (EO), assembled for analysis of the global water cycle. The first type of data are for the water cycle (WC) components. These data describe fluxes, the movement of water, over the earth’s land surfaces, or changes in total water storage. Since the focus of this research is terrestrial hydrology, oceans are not considered. The second type of data describe environmental conditions, and include variables such as elevation, slope, and vegetative cover. Some of these datasets are derived entirely from satellite remote sensing data, while others are blended with in situ data or information from models.

This chapter begins with a discussion of the format and conventions for the datasets. To compare or merge EO datasets, they must have the same spatial and temporal resolution. Next, I describe the EO data sources for the four main WC components. Table 2.1 summarizes the EO datasets used in this study. Following this, I introduce a variety of environmental variables that serve to characterize local conditions and are used as input variables for neural network modeling described in Chapter 4.

The last section of this chapter presents a preliminary analysis of the EO datasets. I evaluated the quality and completeness of each dataset using the tools of exploratory data analysis – maps, time series plots, summary statistics. This helped to find and discard anomalous observations in some datasets and to verify that data had been imported and processed correctly.

Earth Observations of Water Cycle Components

Earth observation refers to the gathering of information about Earth’s physical, chemical and biological systems via remote sensing. Broadly, remote sensing involves monitoring and detecting the physical attributes of a region through the measurement of its reflected and emitted radiation, typically done from satellites or aircraft. In this thesis, I primarily use data collected from earth-orbiting satellites, although some of the datasets I drew upon also include information from in situ observations, models, and other sources.

Here, we are concerned with what EO can tell us about the movement of water above, on, and below the earth’s surface. The movement of water is expressed as a flow, or a flux. In the sciences, a flux is the flow of a substance into or out of a system. As described in Section 1.1.1, the water budget for any land area can be described with four main WC components: (1) Precipitation, P, (2) Evapotranspiration, E, (3) Total Water Storage Change (TWSC), or ΔS, and (4) Runoff, R. With measurements of these four variables, one can quantify the mass or volume of water flowing into and out of any arbitrary geographic region, such as a river basin or a grid cell.

Units

In this thesis, the four water cycle components are expressed as an area-normalized flux density, in units of depth of water per time. Units are in millimeters per month, or mm/mo. GRACE TWSC is not a flux per se, but can be expressed in the same units.

In principle, a variety of units can be used. According to the US Geological Survey, “water-budget equations can be written in terms of volumes (for a fixed time interval), fluxes (volume per time, such as cubic meters per day or acre-feet per year), or flux densities (volume per unit area of land surface per time, such as millimeters per day)” (Healy et al. 2007).

Using a flux density, in units of depth/time, is both a practical and mathematical convenience. It not only simplifies calculations, but it also allows us to directly compare fluxes over regions with different surface areas. The concept of depth per time is perhaps most intuitive with variables such as rainfall and evaporation, as we are accustomed to thinking of the vertical movement of water. Rain gages, or pluviometers, are simple devices that measure the depth of water captured in a given time period. An example is shown in Figure 2.1(a). Similarly, evaporation is most often measured as the change in the depth of water in an evaporation pan, shown in Figure 2.1(b).

In the case of discharge or river flow, it is somewhat less intuitive to think of this as a flux density, in units of depth per time. In this case, the discharge, Q measured at a given location in m³/s is converted to runoff, R a flux density in mm/month by dividing by its basin area (and converting the units):

\[R \left( \frac{\text{length}}{\text{time}}\right) = Q \left( \frac{\text{length}^3}{\text{time}}\right) \times \frac{1} {A \left( \text{length}^2 \right)} \label{eq:runoff}\]

The units for runoff, R are depth per time, just like P or E. We can think of the depth as a thin layer of water that is uniformly distributed over the basin. Unless care is taken to convert Q and A to compatible units, the results of Equation \ref{eq:runoff} will not be easy to interpret. Runoff in mm/month can be calculated from a conventional volumetric flow rate in m³/s by dividing by the land surface area in km², multiplying by the number of days in the month, and an appropriate conversion factor, as follows:

\[ \begin{aligned} R \left( \frac{\text{mm}}{\text{month}} \right) &= Q \frac{\text{m}^3}{\mathrm{s}} \times \frac{1}{A {\text{, km}^2}} \times \frac{d \text{, days}}{\text { month }} \times \frac{\text{km}^2}{10^6 \text{ m}^2} \times \frac{1,000 \text{ mm}}{\text{m}} \ldots \notag \\ & \hspace{0.5cm} \times \frac{60 \text{ s}}{\text{minutes} } \times \frac{60 \text{ minutes}}{\text{hour}} \times \frac{24 \text{ hours}}{\text { day }} = 86.4 \frac{Q \cdot d}{A} \end{aligned} \]

where d is the number of days in the month, a whole number between 28 and 31. Rearranging this equation allows us to calculate basin discharge, Q in m³/s as a function of the runoff, R, in mm/month:

\[\label{eq:discharge} Q \left( \frac{\text{m}^3}{\text{s}}\right)= \frac{R \cdot A}{86.4 \; d}\]

where R is in units of mm/month and A is the basin area in km², and d is the number of days in the month. These are simple operations, but it helps to have them clearly documented, as all of the following calculations depend on having accurate input data.

Time Period and Temporal Resolution

All analyses described in this thesis were done with monthly data. Many remote sensing data products are available at higher temporal resolution (for example, hourly, daily, weekly). And in theory, water budget calculations can be performed at any time scale. However, there were practical and scientific reasons why I chose a monthly time scale. In practice, our ability to produce more frequent updates is limited by the GRACE data for water storage change, which are calculated monthly with a significant time lag (Rodell et al. 2018).

I chose to perform most analyses for the 20-year time period from January 2000 to December 2019, for a total of 240 months. The GRACE satellites were launched in 2002, and the first observations are available for April 2002. Nevertheless, I chose to begin the analysis in January 2000, as it is simple and memorable. Consequently, there are missing records at the beginning of time series in several of the datasets, but this has no effect on the analysis.

When I began my doctoral research in 2021, I would have liked to include more recent data, but I chose to end the analysis in 2019. There is a time lag associated with the publication of many of the datasets. In particular, there was little runoff data available for 2020 and thereafter, as we will see in Section 2.5.1.

For experiments in record extension (predictions of past conditions), I created a 40 year long database of observations from January 1980 to December 2019. Because there were fewer satellites in orbit in the 1980s and 1990s, the data for the first two decades is more sparse.

Spatial Resolution and Format

I used a common spatial projection and resolution for all gridded geospatial data. It is based a common equirectangular projection, or plain geographic latitude and longitude projection (sometimes called plate carrée). This projection is commonly used in the Earth sciences due to its simplicity, although it has certain tradeoffs, i.e., distortion of areas near the poles. One set of experiments was done with data at a spatial resolution of 0.25 degrees. The global grid consists of 720 rows and 1440 columns. Another set of experiments was done with data at a resolution of 0.5 degrees. This format is similar to the “climate modeling grid” (CMG) used by many climate researchers (MODIS Land Team 2021), although it is at a coarser resolution and thus less detailed. In this thesis, I refer to grid cells and pixels interchangeably, and these should be understood as referring to the same concept: a set of regular rectangular areas on the earth’s surface.

The size of grid cells varies with latitude. Near the equator, each 0.25° grid cell is about 27 km on a side and has an area of about 770 km². As one moves north or south away from the equator and toward the poles, grid cells get smaller. At Rome’s latitude of 42° N, grid cells are about 21 km wide (the height is unchanged) and have a surface area of about 576 km². The northernmost basins considered in our study (in Canada, Alaska, Norway, Finland, and Russia), extend to latitudes above 70° N, where a grid cell is less than 10 km wide and has an area of less than 270 km², only 35% of the size of a grid cell near the equator.

Some of the datasets I used were published at a higher spatial resolution, for example 0.05°. In such cases, I upscaled the data to decrease its spatial resolution and to make it consistent with the other datasets. Water storage data from the GRACE satellites are the limiting factor, with data products published at 0.25°. Methods are available to “downscale” these data to a higher spatial resolution, but scientifically, there is little to be gained from doing so, and any additional accuracy would be illusory.

Data Formats

Strictly speaking, the project database is not a single electronic file, but a collection of electronic data stored in an organized file structure with appropriate metadata, or descriptive details. Data were stored as floating-point numbers with double precision in Matlab .mat files. This is not necessarily the most efficient data format in terms of file size or storage space on disk, however, it eliminates the overhead of converting data or changing units. Matlab binary data files (.mat) can be read by Python, R, or other scientific software, often with the use of a library or plugin.

Note that I do not have the rights to republish all the data that I have collected. For example, with regards to river discharge data, the user agreement from the GRDC does not allow one to redistribute the data. However, my understanding is that one may share and distribute derivative works. This includes versions of the data that have been transformed or processed, such as monthly averages. I have created a public repository containing the database used in the modeling described herein. This should be useful to other researchers who wish to verify or replicate these results or conduct related research on the global water cycle. It is freely available at:

Table 2.1: Datasets compiled for the four major components of the water cycle.

Data Set	Begin	End	Temporal resolution	Spatial resolution	Citation	Website
Evapotranspiration
GLEAM v3.5A	1980	present	daily	0.25º	Martens et al. 2017; Miralles et al. 2011.	https://www.gleam.eu/
GLEAM v3.5B	2003	present	daily	0.25º	idem.	https://www.gleam.eu/
ERA5	1950	present	3-hour, daily, monthly	0.25º	Hersbach, et al. 2018.	https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5
Precipitation
GPCP v2.3	1979	present	daily, monthly	2.5°	Adler et al. 2018.	https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.ncdc:C00979
GPM-IMERG	2000-Jun-01	present	daily	0.10º	Huffman et al. 2019	https://disc.gsfc.nasa.gov/datasets/GPM_3IMERGDF_06/summary
MSWEP	1979	present	daily, monthly	0.10°	Beck et al. 2019	http://www.gloh2o.org/mswep/
Total Water Storage Anomaly
GRACE-CSR	2002-Apr-05	present	quasi-monthly*	0.25°	Save, Bettadpur, and Tapley, 2016	http://www2.csr.utexas.edu/grace/RL06_mascons.html
GRACE-JPL	2002-Apr-05	present	quasi-monthly*	0.5°	Landerer 2021	https://grace.jpl.nasa.gov/
GRACE-GSFC	2002-Apr-05	present	quasi-monthly*	0.5º	Loomis, Luthcke, and Sabaka, 2019.	https://earth.gsfc.nasa.gov/geo/data/grace-mascons
Runoff - Synthetic
GRUN	1902	2019	monthly	0.5°	Ghiggi et al. 2021	doi.org/10.6084/m9.figshare.12794075
River Discharge - Observed
GRDC	varies	varies	daily, monthly		BfG 2020.	https://www.bafg.de/GRDC
Australia BOM	1970	2020	daily, monthly		Australia BOM 2020.	http://www.bom.gov.au/water/hrs/index.shtml
GSIM	varies	2016	monthly		Do et al. 2018; Gudmundsson et al. 2018.	https://doi.pangaea.de/10.1594/PANGAEA.887470

Excluded Datasets

The Appendix contains a list and brief descriptions of dozens of EO datasets. There are many datasets in the literature that I chose not to use in my analysis. This includes well-known remote sensing datasets, including some that are considered state of the art, and which are widely used and highly cited within the earth science community. In some cases, I excluded a dataset because it had inadequate temporal coverage, i.e., it was not available over the period from 2000 to 2019. For example, the SM2RAIN precipitation dataset (Massari et al. 2020) begins in 2004, while my goal was to maximize the overlap with GRACE observations, which began in April 2002. I excluded some datasets for insufficient geographic coverage. For example, CMORPH precipitation (P. Xie et al. 2019) covers latitudes 60°S to 60°N, yet many of our study’s river basins are above 60°N and therefore outside of its coverage. In other cases, I excluded a dataset for insufficient temporal coverage.

It is worth briefly discussing what can be gained from adding more data to the analysis. Oftentimes, more data is better. This attitude is especially prevalent in the machine learning community, where large models are fed with massive amounts of data. However, the research described in this thesis focuses on method development. Indeed, it is possible to use the methods described here to create optimized water cycle components using the largest possible amount of available data, perhaps even customizing the inputs by region based on data availability.

Precipitation

Below, I describe the precipitation datasets I included in my analysis of the water cycle. A description of datasets that I considered but did not include in the analysis can be found in the Appendix.

GPCP

The Global Precipitation Climatology Project (GPCP) is a global gridded dataset produced by an international consortium of researchers. It is based on multiple satellite observations that have been merged to estimate precipitation at the global scale (Adler et al. 2018). It has a long record, beginning in 1979. However, it also has a coarse spatial resolution, at 2.5°. GPCP is a well-known dataset that has been used for many studies and referenced in thousands of journal articles (Adler et al. 2018).

GPCP combines both passive microwave and infrared satellite data, as well as in situ data. Each class of data brings certain advantages. For instance, passive microwave data can penetrate clouds and provide precipitation estimates even in cloudy regions, while infrared data offers high spatial resolution and frequent updates due to geostationary positioning, albeit at a lower resolution. The authors also corrected for systematic wind-induced undercatch and wetting losses from rain gauges, as well as for orographic effects. I used version 2.3 of this dataset, which was updated in 2018.

GPM-IMERG

GPM-IMERG is the current multi-satellite precipitation product from NASA (Huffman et al. 2019). It stands for “Integrated Multi-satellitE Retrievals from the Global Precipitation Monitoring (GPM) satellite. GPM-IMERG offers several advancements and improvements over its predecessor TMPA (see Appendix Section A.1.9). GPM-IMERG combines measurements in passive microwave and infrared from over a dozen different satellites, using morphing techniques and a Kalman filter, to provide accurate satellite-based precipitation estimates, supplemented by precipitation gauge analyses. It ingests data from the GPM core observatory satellite, which contains a dual-frequency precipitation radar and the GPM microwave imager. The dual-band precipitation radar provides better estimates of precipitation particle sizes and covers a wider range of precipitation rates compared to the single-band radar on the TRMM satellite (Y. Wang and Wu 2022).

MSWEP

The Multi-Source Weighted-Ensemble Precipitation (MSWEP) dataset is not a pure remote sensing product but an “optimal merging” of gage observations, satellite observations, and reanalysis model output (Beck et al. 2019). This dataset offers high spatial resolution of 0.1° and a maximum temporal resolution of 3 hours, which allows for a detailed analysis of precipitation patterns. MSWEP has been shown by the authors to exhibit more realistic spatial patterns in mean, magnitude, and frequency compared to other precipitation datasets (Beck et al., 2019). They also state that it provides more accurate precipitation estimates in mountainous regions, where other datasets tend to underestimate precipitation amounts. Because of its good spatial and temporal coverage, I chose MSWEP as an input in the analysis.

CPC Global Precipitation

The CPC Global Unified Gauge-Based Analysis of Daily Precipitation is a dataset of rain gage measurements (not remote sensing observations) published by NOAA’s Climate Prediction Center (CPC). I include a brief description here, as it is a gridded dataset that is widely used in large-sample hydrology. This dataset covers 1979 to present at 0.5° resolution, with global coverage of land surfaces (not oceans). The goal of the project was to create a suite of unified precipitation products with consistent quantity and improved quality. This was achieved by combining all information sources available at CPC and using an optimal interpolation technique. The authors state that the dataset was more accurate than existing gage-based datasets available at the time, but that key uncertainties remain. In particular, there may be inconsistencies over regions where station networks changed, and less accuracy in places with high spatial variability in precipitation, such as in mountainous regions.

I used the CPC dataset as an an additional source of data for evaluating the results of this reasearch, to be described in Section 5.4.5.

GHCN

This section describes a dataset of station observations which I used as an additional source of data to evaluate my results. Unlike the other datasets described here, it is not in grid format, nor is it based on remote sensing.

The Global Historical Climatology Network (GHCN) is a dataset reporting weather and climate variables at over 120,000 stations worldwide (Menne et al. 2012). Many stations only report precipitation, but many others report other variables such as temperature, pressure, humidity, cloud cover, etc. The dataset is maintained by the United States National Oceanic and Atmospheric Administration, National Centers for Environmental Information (NOAA NCEI).

The publisher provides a data inventory, containing the start and end dates for stations. I used this to select candidate stations with data from 2002 to 2019. This still left over 21,000 stations. I used Python scripts to download and process these data, calculate monthly averages, and to create a more detailed inventory of stations with precipitation data over the period of my analysis. The data provider has performed extensive and well-documented quality control on these data (Durre, Menne, and Vose 2008; Durre et al. 2010). Nevertheless, there are several subtleties and complications to dealing with precipitation observations. Examples include how to handle quality flags, what to do with precipitation logged as “trace,”, and how to deal with missing daily data.

I used the following rules when compiling monthly precipitation for stations in the GHCN catalog. First, I only included stations where there are at least 60 months (5 years) of data. I only calculated the monthly average P where there are at least 25 days of data. I chose not to throw out months simply because there were a few missing daily observations. I handled missing daily data by assuming the rainfall on that day was the same as the average of other days in the same month. For example, if station data for January had 25 days of valid data, and 6 missing daily observations, the monthly precipitation is: \(P_{\text{month}} = \sum{P_{\text{daily}}} \times \frac{31}{25}\).

There are more complex and sophisticated ways of filling in missing data (also referred to as imputation), but these were not considered here. The GHCN station data will be used for a secondary analysis, to verify that our modeling methods have not degraded the signal in EO datasets too much related to observations. The check against GHCN station data is just one of several evaluations of this kind. Therefore, it was not worth spending a great deal of time performing sophisticated analyses with these data.

The final set of GHCN station data for precipitation encompasses 21,880 stations. The distribution is highly uneven (Figure 2.2(a)). In Africa, South America, there are large blank spaces on the map. In Section 5.4.5 of this thesis, I compare calibrated P at the pixel scale to GHCN observed P. However, the geographic coverage of observations is extremely uneven. As shown in the Figure 2.2(b), there are many 0.5° grid cells where there are no stations at all, and others there are many pixels where there are a dozen or more stations, particularly in Germany, the Netherlands and other northern European countries and the United States (not shown).

Evapotranspiration

As shown in Table 2.1, I used 3 EO datasets of evapotranspiration for the data integration, and observations from flux towers for comparison and evaluation of my results. Each data source is listed in Table 2.1 and described in more detail below.

ERA5 Evapotranspiration

I also included a dataset that is not a purely remote-sensing based product, but based on the assimilation model ERA5, from the European Centre for Medium-Range Weather Forecasts (ECMWF). The model combines historical estimates (from both remote sensing and in situ observations) using an advanced modeling and assimilation system. ERA5 produces many variables describing the atmosphere, land, and ocean, at a resolution of up to a 30 km grid (Guillory 2022). ERA5 estimates of E have been used in many recent hydroclimatic studies (see e.g., Tarek, Brissette, and Arsenault 2020; Singer et al. 2021; Lu et al. 2021). ERA5 hydrological outputs are in units of “m of water per day” and so I multiplied by 1000 to convert to \(\text{kg} \cdot \text{m}^{-2} \cdot \text{day}^{-1}\) or mm/day, and then multiplied by the number of days in each month to obtain units of mm/month.

GLEAM

The Global Land Evaporation Amsterdam Model (GLEAM) is a set of algorithms that estimates the various components that contribute to total evapotranspiration (Martens et al. 2017; Miralles et al. 2011; Hersbach et al. 2018). The authors used an empirical relationship, the Priestley-Taylor equation, to calculate potential ET based on satellite observations of surface net radiation and near-surface air temperature. GLEAM version 3.5a used reanalysis rather than satellite observations and covers 1980 to present. The updated version 3.5b relies more on remote sensing data and has a more limited temporal coverage of 2003 to present. GLEAM has global coverage of land surfaces at 0.25° resolution and a daily time step. Note: the latest version, GLEAM v3.6 was published in September 2022, after I completed much of the analysis described here.

Observed Evapotranspiration at Flux Towers

I used in situ data for validating the results of certain analyses. Several methods have been developed to measure evapotranspiration over land surfaces. The two main methods for measuring evapotranspiration rates at specific locations are with devices called lysimeters or via micrometeorological techniques (Healy et al. 2007). Micrometeorological techniques include eddy correlation, Bowen ratio/energy budget, and aerodynamic profile methods, which involves measuring the vertical flux of water vapor from the land surface to the atmosphere. While such methods are accurate, they are also expensive and require frequent maintenance.

I obtained in situ measurements of ET from flux towers, which belong to a class of micrometeorological measurement devices. Flux towers are outfitted with sensors and instruments mounted at various heights for measuring the vertical flux of water vapor from the land surface to the atmosphere. I selected data for 117 towers from the FluxNet2015 dataset, which compiles data from 212 global towers (Pastorello et al. 2020). Of the 212 stations, 34 did not contain data for latent heat flux, required for inferring evapotranspiration. After quality checking these data, I eliminated another 61 stations, either because the time record did not overlap our period of interest (2002 - 2019), was too short (less than 2 years), or there were other quality issues. This left us with 117 stations, with an average time series length of 5 years. The majority of selected towers are in Europe (51 towers) or North America (46), with fewer in Africa (2), Asia (6), Australia (9), and South America (3). Figure 2.3 shows the location of the selected flux towers.

The FluxNet database reports latent heat flux, in units of W/m². This can be converted to evapotranspiration for inclusion in the water balance model. The latent heat of vaporization of water is the energy required for water molecules to transition from liquid to gas, and varies according to water temperature as given by Shuttleworth (1993):

\[\lambda = 2501 - 2.361 \: T_s \quad\left(\mathrm{~J}\cdot\mathrm{~kg}^{-1}\right)\]

As a simplification, a value of 2,450 J/g is commonly used, and has an error less than 2% over temperatures from 0° to 35°C. Therefore, the latent heat flux in W/m² can be converted to monthly average evapotranspiration as follows:

\[\begin{split} \frac{\mathrm{W}}{\mathrm{m^2}} = \frac{\mathrm{J}}{\mathrm{s} \cdot \mathrm{m^2}} \times \frac{\mathrm{g}}{2,450 \, \mathrm{J}} \times \frac{\mathrm{cm}^3}{1 \, \mathrm{g}} \times \left(\frac{1 \, \mathrm{m}}{100 \, \mathrm{cm}}\right)^3 \times \frac{1,000 \, \mathrm{mm}}{\mathrm{m}} \times \frac{86,400 \, \mathrm{s}}{\mathrm{day}} \\ = 0.0353 \frac{\mathrm{mm}}{\mathrm{day}} \end{split}\]

Care must be taken in comparing E observed at flux towers to gridded hydroclimatic data. The challenge is in resolving the difference in scale differences between in situ data and grid cells. Flux tower observations are point estimates, taken at a single geographic location. One may easily find the pixel that intersects this point on a map to compare gridded data to the tower observations. Yet, the value in a pixel represents an average over an area, over which conditions may vary widely. At the scale of our model grid, a single 0.5° pixel has an area of about 3,000 km² near the equator. The land cover, vegetation, and topography over a grid cell may be substantially different from that of the flux tower site, making it challenging to generalize, and reducing the accuracy of comparisons between modeled fluxes and observations from flux towers. Nevertheless, such comparisons are still useful, and it is encouraging when a model recreates the observed temporal dynamics, even when the magnitudes of observed and modeled fluxes are not comparable.

Total Water Storage Change

Information on total water storage (TWS) for this research comes from the GRACE satellites. The first pair of satellites were in operation from 2002 to 2017, and a follow-on mission began in 2018.

It is convenient to refer to the change in water storage as equivalent to a flux, and it can be expressed in the same units as any other flow. GRACE provides the monthly TWS anomaly, expressed as liquid water equivalent (LWE) thickness surface mass anomaly, in units of cm or mm. GRACE does not estimate the total volume or mass of water in a region, but rather its change with respect to a historical baseline average. Nevertheless, the observations encompass water in all its forms and “represent the full magnitude of land hydrology and land ice” (Landerer 2021).

Compared to precipitation and the other hydrologic variables, GRACE observations have a lower spatial and temporal resolution, limiting our analysis to a monthly time step.

Scientists at various institutions have developed different algorithms and sets of parameters for converting the measurements of the gravity field measured by GRACE into estimates of TWSC. I obtained four GRACE data products, described in Table 2.2, and ultimately selected three of them for the analysis (from CSR, GSFC, and JPL). Each of these are Level 3 solutions⁶ for LWE as measured by GRACE satellites. All the available datasets of TWS are nearly global (–89.5º to +89.5º), land surface only.

Data Set	GRACE-CSR	GRACE-GFZ	GRACE-GSFC	GRACE-JPL
Units	m	m	m	m
Publisher	Univ. of Texas at Austin, Center for Space Research	Geoforschungs-Zentrum, Potsdam, Germany	NASA Goddard Space Flight Center	NASA Jet Propulsion Laboratory
Begin	2002-Apr-05	2002-Apr-04	2002-Apr-05	2002-Apr-05
End	near present	2017-Jun-29	near present	near present
Latitudes covered	–89.5º to +89.5º	–89.5º to +89.5º	–89.5º to +89.5º	–89.5º to +89.5º
Temporal resolution	quasi-monthly	quasi-monthly	quasi-monthly	quasi-monthly
Spatial resolution	0.25°	1.0°	0.5°	0.5°
Notes	"The data are represented on a 1/4 degree lon-lat grid, but they represent the equal-area geodesic grid of size 1x1 degree at the equator, which is the current native resolution of CSR RL06 mascon solutions."	RETIRED. No longer in production after mid 2017. This solution was based on the the conventional spherical harmonic method only, while the newer algorithms use mass concentration grids (mascons).	"This product is comparable to the JPL and CSR mascon products."	Data distributed at resolution of 0.5°, but the 3° mascons on which this solution is bas are evident when plotting.
Website	http://www2.csr.utexas.edu/grace/RL06_mascons.html	https://podaac.jpl.nasa.gov/dataset/TELLUS_GRAC_L3_GFZ_RL06_LND_v04	https://earth.gsfc.nasa.gov/geo/data/grace-mascons	https://grace.jpl.nasa.gov/
Citation	Save et al. (2016)	Landerer and Swenson (2012)	Loomis, et al. (2019)	Watkins et al. (2015)

Among these three solutions shown in Table 2.2, there are differences in the algorithms and processing steps, which can produce slightly different results, particularly at regional scales. There are differences in the methods used to filter the data to remove noise, including atmospheric and oceanic variability. There are also differences in the methods and models used for post-processing. Such processing increases the accuracy of TWS anomalies by adjusting for changes in land surface due to seismic events or the isostatic rebound in regions formerly covered by glaciers that continue to uplift thousands of years after the glaciers have melted or receded. It is common for researchers to use multiple solutions to validate their findings and reduce the impact of any individual solution’s uncertainties. Indeed, our approach relies on a neural network to extract the best information from each dataset.

I explored using the GRACE solution produced by GFZ, the German Research Centre for Geosciences in Potsdam (GeoForschungsZentrum), described by Kusche et al. (2009). However, I determined that this solution was less suitable than the three newer solutions. Original GRACE solutions were based on spherical harmonics are mathematical functions that represent the variations in the Earth’s gravitational field. This method is useful for capturing large-scale features such as the overall distribution of mass and major geophysical phenomena. Conventional spherical harmonic solutions “typically suffered from poor observability of east-west gradients” (Watkins et al. 2015), resulting in pronounced north-south striping patterns in the data. These were usually removed by smoothing the data, which unfortunately causes some loss in signal. Further, the spherical harmonic method does not effectively account for localized variations in mass, especially in regions with strong mass anomalies.

The more recent GRACE solutions are based on calculations based on mass concentrations or mascons. Mascons are anomalies in the distribution of mass within a planet. Here, mascons refer to the regions on Earth where there are significant variations in the distribution of mass, which has a corresponding effect on the Earth’s gravitational field. Such variations complicate the analysis of GRACE data because they can distort the measurements of surface mass changes. The most recent techniques employ a gravity field basis function to separate the contributions in the signal from unequal distribution of the earth’s mass from other factors such as water storage variations.

Guidance published by the GRACE science team states that the mascon-based solutions (CSR, GSFC, JPL) are better for use in hydrology and related fields like glaciology and oceanography, as they are more accurate, especially at smaller scales. The mascon gridded data products are recommended as they “suffer less from leakage errors than harmonic solutions, and do not necessitate empirical filters to remove north-south stripes, lowering the dependence on using scale factors to gain accurate mass estimates” .

Figure 2.4 illustrates the increase in resolution that was obtained with the mascon-based solutions. Here, Save, Bettadpur, and Tapley (2016) used three different GRACE solutions to calculate the trend in total water storage. The newer mascon-based solution from CSR is shown in (a). In (b), the older GFZ solution (labeled TELLUS) is shown. Because of the post-filtering that was applied to eliminate striping, small-scale anomalies are smoothed and details are blurred. In (c), the JPL solution, one can see changes between neighboring 3° mascons.

While all three of the GRACE datasets I chose are based on the same satellite data, the solutions are based on different mascon grids. The details are complex, but interested readers are referred to detailed explanations in Loomis, Luthcke, and Sabaka (2019), Save, Bettadpur, and Tapley (2016), and Watkins et al. (2015). Official guidance from NASA has recommended that users average all three data center’s solutions (Landerer and Cooley 2021). This recommendation is backed by research that showed the ensemble mean (simple arithmetic mean of JPL, CSR, GFZ fields) led to a reduction in noise within the gravity field solutions, considering the scatter present in the available data (Sakumura, Bettadpur, and Bruinsma 2014).

And while the calculations for converting gravity field anomalies to changes in water storage have a solid basis in theory, researchers have not yet found an effective way to ground truth these observations (Kusche et al. 2009; Reager et al. 2015). GRACE observations have a relatively coarse spatial resolution compared to other EO datasets, such as precipitation. While the published datasets have a 0.25° resolution, the data have been spatially filtered to remove random errors and systematic errors (F. W. Landerer and Swenson 2012). Thus, small scale details are not likely to be meaningful, and these data are more accurate over larger regions (Tapley et al. 2004).

In the following sections, I show some exploratory data analysis of the GRACE observations.

Analysis of Missing Data

The GRACE Level 3 data of TWS anomalies are geographically complete, with no gaps or missing pixels that are often seen in other EO datasets. However, there are gaps in the time series, where there are months with no data. In this section, I describe the causes for missing data. In Section 3.2.2, I describe a time series interpolation method to fill in missing data.

Figure 2.5 shows an example of the GRACE observations of total water storage anomaly. Here, the time series data was extracted from a single randomly-chosen pixel over the Amazon basin in Brazil. We can see at a glance that the time series begins in 2002 and extends through the present, with a number of gaps in the observations.

To understand when and why there are gaps in the GRACE dataset, it helps to know about the mission’s history and the timeline of its operations. Figure 2.9 is an overview of available GRACE data. The GRACE satellites were launched in March 2002, and were originally designed for a 5-year lifetime. The first monthly data were published for April and May 2002. “Several months of instrument testing” resulted in no reported water storage data in June and July 2002, and two missing months of data in 2003 and 2004 (Herman et al. 2012). For the next 7 years, there is a continual record with no gaps.

In 2011, the batteries onboard GRACE began to fail. As a result, “the instruments have to be powered off during the maximum eclipse season, thus interrupting the nominal science data flow every 161 days for a period of approximately 3–4 weeks” (Flechtner et al. 2014). As a result, there are 1 to 2 month gaps every 6 months. The worsening condition of batteries resulted in longer data gaps beginning in September 2017 (Müller et al. 2019). GRACE was decommissioned in October 2017. With no power left to correct the orbits, the low-earth orbiting satellites were left to re-enter the atmosphere.

The follow-up mission, GRACE-FO, was launched in May 2018, and the first GRACE-FO data was reported for June 2018. From July 19 to October 16, 2018, an onboard instrument did not function properly, resulting in a gap in science data collection (Svarovsky 2019). Other than this 3-month gap, the record has been continuous from mid 2019 to present. The time period chosen for this thesis is 2000 to 2019, with 20 years or 240 months. In total, there are monthly GRACE data for 181 months during this time period, or 75% of months.

There is considerable missing data in the GRACE record, as shown in Figure 2.6. Not considering the time before GRACE observations begin in April 2002, 17% of months are missing. There are a dozen 1-month gaps, five 2-month gaps, one 3-month gap, and one 11-month gap between GRACE and GRACE-FO.

In Chapter 6, I examine the possibility of reconstructing GRACE-like TWS data for filling gaps and hindcasting for periods prior to 2002.

Temporal Resolution of GRACE

GRACE data are quasi-monthly. That is, published observations of TWS do not correspond precisely to calendar months, in terms of the effective beginning and end date of the observations. According to the data provider, NASA’s Jet Propulsion Laboratory (JPL), “the generation of high-quality monthly gravity field solutions and mass change grids requires accumulation of satellite-to-satellite tracking data for about 30 days. However, within some months, a few days of data may not be used due to instrument issues, calibration campaigns etc.” (NASA Jet Propulsion Laboratory 2023). According to JPL, data users should “carefully assess whether the underlying sub-monthly sampling needs to be matched between the data sets.” In other words, because the GRACE data are not truly monthly, this could lead to errors and uncertainties when trying to combine GRACE with other monthly datasets, as there is something of a time mismatch. Examples of the errors in GRACE observation dates is shown in Table 2.3.

Table 2.3: GRACE observations begin and end dates, showing the errors in the begin date and duration compared to calendar months.

GRACE observation number	Calendar month	GRACE observation begin date	GRACE observation end date	Observation duration (days)	Calendar month duration (days)	Error in duration (days)	Error in begin date (days)
1	2002-04	2002-04-04	2002-04-30	26	30	−4	3
2	2002-05	2002-05-02	2002-05-17	15	31	−16	1
3	2002-08	2002-07-31	2002-08-31	31	31	0	−1
4	2002-09	2002-08-31	2002-09-30	30	30	0	−1
5	2002-10	2002-09-30	2002-10-31	31	31	0	−1
6	2002-11	2002-10-31	2002-11-30	30	30	0	−1
7	2002-12	2002-11-30	2002-12-31	31	31	0	−1
8	2003-01	2002-12-31	2003-01-31	31	31	0	−1
9	2003-02	2003-01-31	2003-02-28	28	28	0	−1
10	2003-03	2003-02-28	2003-03-31	31	31	0	−1
11	2003-04	2003-03-31	2003-04-30	30	30	0	−1
12	2003-05	2003-04-30	2003-05-21	21	31	−10	−1
13	2003-07	2003-06-30	2003-07-31	31	31	0	−1
14	2003-08	2003-07-31	2003-08-31	31	31	0	−1
...	...

Ideally, observations would be aligned with calendar months, but as we have seen this is not always the case. One suggested workaround for the issue is to average other hydrologic data at the same time scale, i.e., calculating the average based on the same days used by GRACE. There are advantages and disadvantages to this approach. The advantage is that it reduces one source of uncertainty, and makes the data more directly comparable. The disadvantage is that the results will be more difficult to interpret and to explain to others. For example, one would need to state that a result is for January 6 to February 3, 2017 rather than simply reporting January 2017. Another disadvantage is that certain datasets are not available at daily time step, and therefore could not be used in the analysis. Notably, two of the important discharge datasets used in my research, GSIM and GRUN, are published at the monthly time step only.

I analyzed the errors in the duration and start date of the quasi-monthly GRACE observations, with results shown in Figure 2.7. The histogram on the left shows the errors in the duration of the GRACE observation compared to the calendar month. For example, the first GRACE observation in April 2022 began on April 4, and ended April 30. This observation has a duration of 27 days, while the calendar month has 30 days, for a duration error of −3 days. The mode of the duration error is +1 day, which occurs in 142 of the 180 observations between 2002 and 2019. The duration errors range from −17 days to +13 days, and have an average of 0.0 days.

Because GRACE observations are roughly aligned with calendar months, I assigned each observation to its closest calendar month, rather than performing some type of interpolation to estimate monthly values from the raw data. This would only further compound the already large uncertainties associated with these data. This same assumption is made by other researchers in the field of global hydrology that work with GRACE data (see e.g., Zaitchik, Rodell, and Reichle 2008; Landerer, Dickey, and Güntner 2010).

On the right in Figure 2.7 is a histogram of the errors in the start date for GRACE observations. The vertical axis of the plot has been truncated to show greater detail among the low-frequency errors. For example, the first GRACE observation begins on the 4th of the month, for an error in the begin date of +3 days. The mode for the begin date error is −1 day, which occurs in 152 of the 180 observations included in our analysis. The start date error has an average of −0.1 days, and ranges from −13 days to +20 days. I concluded, in consultation with my thesis advisor, to assume that GRACE is approximately monthly and to combine it with other true monthly data. As discussed above, the majority of observations are aligned to calendar months to within a day, in terms of both the begin date and duration. Most observations have a relatively small error in the start date or duration of the observation. Errors in timing will contribute some additional uncertainty to GRACE observations.

**Figure 2.7:** Errors in the begin date (left) and duration (right) of GRACE observations compared to calendar months

It was necessary to remove some GRACE observations, as they appeared to be largely redundant, overlapping other observations. Figure 2.8 illustrates the two observations I chose to delete. In these plots, the available GRACE observations are numbered beginning with 1 for the first quasi-monthly record in April 2022, using the same numbering as Table 2.3. The top plot in Figure 2.8 shows that observation number 110 covers the month of October 2011 quite well. The next observation, number 111, contributes little new information. It also does a poor job of representing November 2011, as the observation ends on November 15. Therefore, I decided to remove this observation from our input data. The bottom plot shows a similar situation for observation 145, which I also removed. The plots in Figure 2.8 also show that there are gaps in the GRACE dataset, with frequent missing months. The plots also show that the observations do not coincide perfectly with calendar months, as discussed above.

**Figure 2.8:** The time coverage of select GRACE observations, showing overlapping observations that were removed.

Modeled TWSC

In order to compare and evalute the results of my research, I also gathered predictions of ΔS from recent modeling studies that reconstruct GRACE-like TWSC. Yu Zhang et al. (2018) used a land surface model and data assimilation techniques, first estimating the errors in each water budget component by comparison to in situ observations, then using a constrained Kalman filter to merge the datasets based on their error information, with a goal of minimizing the imbalance. This study produced global gridded datasets at 0.5° resolution, with monthly P, E, R, and ΔS for 1984–2010.

Runoff and River Discharge

Data on runoff provides the fourth and final flux in our simplified water balance, \( P - E - \Delta S - R = 0 \). There is a long history of measuring the flow of water in rivers, also called discharge. Measurements of river levels were probably the first quantification of the water cycle. Sources describe the measurement of the water surface elevation by Egyptians in Pharaonic times as early as 1800 BC (National Research Council 1991, 20). Other sources (e.g.: Pfister 2018; Rosbjerg and Rodda 2019) cite the work and writings of Leonardo da Vinci in the 1500s, who was among the first to measure river velocity and flow, and to speculate about the source of river flows.

In fact, modern methods for measuring streamflow have similarities to ancient methods. Measurements are made of flow velocity (m/s), and the cross-sectional area of the stream (m²). The product of these two measurements is the volumetric flow rate, in m³/s. Because measuring flow velocity and cross-sectional area in flowing rivers is difficult and time-consuming, contemporary hydrographers use a shortcut method. They collect several measurements of v and A over a range of flow conditions, develop site-specific empirical relationships between water surface elevation, or “stage” and discharge. In the near future, the scientific community expects to estimate river discharge based on high-accuracy measurements of water surface elevation and slope via the SWOT satellite launched in December 2022.(Durand et al. 2016; Prigent et al. 2016).

For this research, I used two types of runoff data: (1) estimated runoff from a statistical model, and (2) observations from the terrestrial monitoring of rivers. In the literature, the terms runoff, river flow, and discharge are sometimes used interchangeably. Usage also varies by discipline, often in subtle and confusing ways. Therefore, some clarification is in order.

River flow or discharge is an in-situ measurement, measured at gages (sometimes spelled gauges). River gages are typically installed and managed by national or regional governments or water management agencies. Gages use a variety of technologies to measure the instantaneous volumetric flow rate, expressed in units of volume per time such as m³/s.

Runoff is all the water draining from a given land area. One component of runoff is overland flow – water from rainfall or snowmelt that does not infiltrate into the ground, and flows over the land surface. (Overland flow is also called Hortonian flow, after the pioneering 20th century American hydrologist Robert E. Horton.) Another component of runoff is subsurface flow, or the movement of groundwater out of a land area. Total runoff cannot be observed directly, but is sometimes monitored with tracers or estimated with mass balance approaches.

We follow Ghiggi et al. (2019) in assuming that, “at a monthly timescale the average catchment runoff can be assumed to equal the monthly streamflow [or river discharge] measured at the outlet divided by the catchment area.” This approximation is valid when there is not a significant change in water storage during the month (such as in lakes or reservoirs), and there are no significant losses (such as withdrawals for irrigation or inter-basin transfers).

Under similar conditions (minimal change in storage and losses), one can calculate river discharge (for example for ungaged basins) from the runoff in the upstream area. Ghiggi et al. (2021) refer to “first-order river discharge estimates” obtained by calculating the spatial mean of the gridded runoff and over a basin and multiplying by the drainage area. This calculation is an inexact estimator of river discharge over a given time period, as it fails to consider the differing travel time of river flows in the basin. For example, a rainstorm that produces runoff in the headwaters of a large river basin may take days or weeks to reach the outlet. However, I follow G. Ghiggi et al. (2021) in assuming that, at a monthly timescale, the effect of water routing may be considered negligible except in very large basins.

Astute readers may wonder why Table 2.1 includes 3 products for the first 3 components, but only 1 for R. Our original idea for this research was to create and optimize water budgets using inputs based on observations, i.e., satellite remote sensing products. However, this was not feasible, as some of the major data products available for the four water cycle components tend to be “blended” products including remote sensing data, models, and in situ observations. When it wasn’t practical to use gage observations, because they were too sparse, we chose to use in another kind of pseudo-observation, i.e. the GRUN dataset described below, which is a synthetic dataset based on machine learning but trained on observations. In the future, it may be interesting to use multiple products that estimate runoff. For example, modeled runoff is available from well-known land-surface models (LSMs) like ERA5 or LDAS. There are also other long-integration reanalysis-forced LSM global scale datasets at 0.5° resolution, such as the Global Soil Wetness Project (Dirmeyer et al. 2006).

Observed River Discharge

I sought to develop a large database of global gaged basins that would represent a range of geographic locations, environments, and basin sizes. The availability of river flow data is generally good across North America and western Europe, while observations are more scarce across much of Asia, Africa, and Latin America. Furthermore, in a troubling development for the field of large scale hydrology, river flow measurement has steadily decreased over the last few decades (Fekete et al. 2012).

I selected gages for our analysis based on data quality, geographic coverage, and location. I considered gages with a watershed area of 2,500 km² or greater. This corresponds to about 8 pixels in our 0.25° grid at mid-latitudes around 50°.⁷ This minimum threshold ensures that EO data are averaged across several grid cells, so that our estimates of basin-scale hydrologic fluxes are more robust. This is a particular concern for the water storage change datasets derived from the GRACE satellites which have a lower spatial resolution than many remote sensing datasets in the meteorological and hydrological sciences. I selected gages with a minimum temporal coverage of at least 6 observations during the period from 2002 to 2019.⁸ I also performed quality control of the runoff observations through a variety of plots and statistical summaries.

Figure 2.10 illustrates the location of the 2,056 river gages selected for the analysis.

**Figure 2.10:** Map showing the location and data source of the 2,056 river flow gaging stations used in this analysis.

The spatial coverage of our final 2,056 river gages (and their basins) is uneven across the globe (see Figure 2.10). North America is over-represented with 1,111 gages (more than half the total), as is Europe with 393 gages, while there are only 70 gages in Africa, 178 in Asia, and 195 in South America. China is among the most apparent data gaps. However, other countries are notable for their sparsity of data. For example, we have fewer observations in France (8 gages) than in smaller neighboring countries Switzerland (12 gages) and Belgium (12 gages).

I obtained runoff data from 3 sources. First, I selected 1,737 gages from the Global Runoff Data Center (GRDC) and supplemented it with information from two other sources to fill in white spaces on the map (notably Asia and Australia). The GRDC, operating under the auspices of the World Meteorological Organization (WMO), is housed and operated by the German Federal Institute of Hydrology (BfG) WMO (1989). Their runoff database contains “historical mean daily and monthly discharge data and currently comprises river discharge data of well over 10,000 stations from 159 countries” (BfG 2020). The GRDC database contains 10,361 stations, however the majority of these did not fit our criteria for spatial and temporal coverage. The GRDC acts as a clearinghouse for data but does not perform quality control. According to the GRDC website, “ownership of the data and responsibility for errors is with the data providers.” Among the problems I encountered were duplicate gages, records with few observations or implausible values, and abrupt shifts in the magnitude of flow.

Second, I obtained data for 272 gages from the Global Streamflow Indices and Metadata (GSIM) archive (Do et al. 2018; Gudmundsson et al. 2018). The creators of the GSIM database have expanded upon GRDC’s database by adding information from 11 other publicly available databases. This includes research databases from Europe, Russia, China, and Thailand. It also included information from national databases in the USA, Canada, Brazil, Japan, Spain, Australia, and India. For the full list of sources, see Table 1 on page 768 in Do et al. (2018).

Finally, I obtained runoff data for 47 gages in Australia from the country’s Bureau of Meteorology. I used gages that are a part of the bureau’s set of Hydrologic Reference Stations (Australia BOM 2020). This is a set of 467 “well-maintained river gauges of long, high quality streamflow records managed by Australian and State water agencies. The stations can be used to estimate trends in long-term and seasonal water availability from climate variability and change.” I selected gages that matched our conditions for watershed size and dates, then downloaded data for these sites from the bureau’s website and processed the data with a set of Python scripts.

The river flow data from the three sources were reported as daily averages. I calculated the monthly average by taking the ordinary arithmetic mean of the daily values. I only included months where there were at least 25 daily observations. For the final selection of gages, I only included those with at least 6 monthly average observations during the time period from 2002 to 2019. I performed quality control of observed runoff by making a variety of plots and statistical summaries. The GRDC acts as a clearinghouse for data but does not perform quality control.

Volumetric flow rates in \(\text{m}^3/\text{s}\) were converted to area-normalized fluxes in mm/mo by dividing by the land surface area in \(\text{km}^2\) and multiplying by \(86.4 \times d\), the number of days in the month – the inverse of Equation \ref{eq:discharge}.

The amount of river discharge data in our database is not consistent over time. There are 437 of our gages (out of 2,056) that have complete data coverage – each of these gages has 20 years of data over the 20-year period from 2000–2019. It is more common for our gages to be missing data from one or several years in this period. Figure 2.11 shows the distribution of gages by the number of years of data for the gage. The average length of the record for our gages is 15.6 years, with a median of 17 years. There are 4 gages which have only 1 year’s worth of data.

**Figure 2.11:** Distribution of data length in years for our 2,056 gages.

Our flow database is more complete during the first decade of our study’s time period, from 2000–2009. This is illustrated in Figure 2.12. For example, in the year 2000, we have data at 1,892 gages, or 92% of our total gages. In the second decade, 2010–2019, the number of gages with flow data drops off. The last two years are especially sparse, with data at only 464 gages, or 23% coverage in 2019.

In terms of missing values, none of our time series is 100% complete. Every gage is missing data in one or more of the 240 months from 2000–2019. Nevertheless, more than 800 of the basins are at least 90% complete. About 14% of the basins in our database have one or more records where the runoff is zero, at so-called ephemeral or intermittent rivers. This is not the same as missing data, which is stored in our database as NaN, or “not a number,” a computer code for missing or corrupted data.

**Figure 2.12:** Number of gages with observations, by year.

There are two main reasons for the amount of missing river discharge data. First, there is a lag in reporting of river flow data in many countries, and further there is a delay in sending these data to the GRDC. Second, there has been a decrease globally in the number of flow monitoring gages (Fekete et al. 2012). According to Fekete, “the number of monitoring stations peaked in the 1980s as a response to growing concerns about population growth and its impact on the environment, but as focus shifted toward climate change the commitment to continued operation of in situ monitoring networks diminished.” The hydrologic science community is alarmed and dismayed by this trend. In situ measurements of river flow will never be completely replaced by remote sensing. Even with new platforms like SWOT, the scientific community will always need in situ data to calibrate and validate remote sensing observations.

I also examined the distribution of runoff data and its statistical properties. Figure 2.13 is a set of normal probability plots of the runoff dataset. In this figure, the runoff that has been converted to units to mm/month. Thus, we can compare runoff in basins of different sizes, and plot all of the data on a single figure. We can see that there are many small flows. About 50% of the discharge observations are less than 10 mm/month. However, the distribution is highly skewed, with a long right tail. Very high flows are rare, but flows greather than 600 mm/month occur less than 0.1% of the time.

**Figure 2.13:** Normal probability plots of areal runoff observations in the observed river discharge dataset.

Synthetic Runoff

River discharge observations are limited, as they are only available at gaged locations. As an alternative, researchers have created gridded datasets that estimate runoff using statistical and machine learning methods. I used estimated runoff from GRUN Ensemble (Ghiggi et al. 2021). This dataset is a new version of the first GRUN (presumably for “Global RUNoff”) dataset published in 2019 (Ghiggi et al. 2019). The authors created a global gridded dataset of monthly runoff using a machine-learning approach (random forest model), and based on precipitation and evapotranspiration as predictor variables. For the 2021 GRUN Ensemble project, the authors used input data from 21 different sources, “including a set of atmospheric reanalysis, post-processed reanalysis and interpolated-stations data.”

Ibarra, David, and Tolentino (2021) performed an independent evaluation of GRUN over a set of river basins in the Philippines. The GRUN model did not include data from the Philippines in its calibration or validation datasets, thus this served as a good independent quality check of GRUN. It also allowed testing under a variety of hydrologic conditions, due to the diversity of climates present.⁹ Ibarra, David, and Tolentino (2021) compared GRUN predictions to observed discharge over 55 small tropical catchments with at least 10 years of data, extending back to 1946 in some cases. They found a significant but weak correlation (R = 0.37) and a “somewhat skillful prediction (volumetric efficiency = 0.36 and log(Nash–Sutcliffe efficiency) = 0.45).”¹⁰

I also performed an independent evaluation of GRUN against our 2,056 gages and found that GRUN is a relatively good fit to observed discharge. I first estimated the monthly discharge at the basin outlet by calculating the spatially averaged mean of gridded GRUN runoff. Then I calculated fit statistics comparing the observed and modeled flow time series. Figure 2.14 shows the distribution of two goodness-of-fit indicators comparing the time series of basin-averaged GRUN estimated runoff observed river discharge at 2,056 gaged basins. On the bottom of Figure 2.14, the maps show the geographic distribution of these same indicators.

I found that a median correlation R = 0.84 and median root mean square error, RMSE = 11.8 mm/mo, and 75% of gages had RMSE \(< 19\) mm/mo. I also calculated a common fit indicator for modeled discharge, the Kling-Gupta Efficiency (KGE, explained in more detail in Section 4.1.13). Median KGE is 0.53, and 81% of gages have KGE %gt; -0.41, the point at which a model’s predictions are better than the mean of observations (Knoben, Freer, and Woods 2019). One word of caution is in order about the strength of this comparison. Many of the gages that the authors used to develop the GRUN dataset are the same as the gages I am using to judge its quality. Therefore, this is not a truly independent assessment. My conclusion is that GRUN has unprecedented skill in predicting runoff at the global scale, but that its accuracy is still limited. It appears that predictions are of only fair quality over certain zones, such as southeast Asian tropical island environments, as indicated by the results shown by Ibarra, David, and Tolentino (2021).

**Figure 2.14:** Comparison between GRUN estimated runoff and observed monthly-average river discharge at 2,056 gaged basins. Discharge data covering 2000–2019 was obtained from Australia’s BOM, GRDC, and GSIM.

Modeled Runoff

I collected runoff data from two land-surface models in order to compare them to my results, to be described in Section 6.3. The first was the ERA5-Land model (Muñoz-Sabater et al. 2021). The model, from the ECMWF, consists of “global high resolution numerical integrations of the ECMWF land surface model driven by the downscaled meteorological forcing from the ERA5 climate reanalysis.” The authors state that, compared to previous versions of the model, it offers an improved description of the hydrological cycle, in particular better agreement with observed river discharges.

A second set of simulated runoff came from NASA’s Global Land Data Assimilation System (M. Rodell et al. 2004). GLDAS drives the NOAH Land Surface Model (LSM). This hydrologic model has been operational since 1996 and has undergone continuous improvements to enhance its performance and accuracy. I added surface and subsurface runoff to calculate total runoff. For both datasets, I calculated the spatial mean over the project river basins, and converted units from kg/m² to mm/month.

Commentary on River Discharge Data in Large-Sample Hydrology

Collections of quality-controlled river discharge observations are essential in the field of global hydrology. Historically, researchers have had to spend a great deal of effort compiling and checking runoff data. It appears that many researchers are duplicating each others’ work, which is both inefficient and a barrier to progress in hydrologic research. One cause for this is agencies that publish discharge data impose copyright or other restrictive data sharing agreements on data users. While this seems to be a legal and ethical gray area, most researchers appear to be conservative, and err on the side of not sharing data they have compiled. Furthermore, some national governments, such as India or China, have ceased publishing river discharge data, for reasons of security or national interest (Eyler 2022). I believe that efforts to freely share global discharge data would spur a great deal of interesting work in global hydrology, climatology, remote sensing, and meteorology. In fact, a recent review has called this a new sub-discipline: large-sample hydrology, that “… relies on data from large sets (tens to thousands) of catchments to go beyond individual case studies and derive robust conclusions on hydrological processes and models” (Addor et al. 2017).

It is encouraging that there are a number of efforts underway in this area to compile discharge data, and associate it with meaningful metadata about the gaged watersheds, often with the title CAMELS. In 2015, scientists at the United States Geological Survey (USGS), published a new dataset of flow and other meteorological data for 671 watersheds in the continental United States (Newman et al. 2015). A followup study (Addor et al. 2017) provided watershed attributes (related to topography, climate, land cover, soil, and geology) in a dataset called CAMELS, for “catchment attributes and meteorology for large-sample studies.” Since then, similar datasets have been released by researchers in several different countries (Table 2.4). Finally, a group of researchers compiled these recent datasets into an omnibus dataset they called Caravan, for “a series of CAMELS.” This dataset contains daily flow records for 6,830 gages in 12 countries between the years of 1980 and 2020.

Table 2.4: CAMELS runoff data sources that may be used in future studies.

Dataset Name	Region	Year	# Gages	Reference
CAMELS	Continental USA	2017	671	Addor et al. (2017)
CAMELS-CL	Chile	2018	516	Alvarez-Garreton et al. (2018)
CAMELS-BR	Brazil	2020	3,679	Chagas et al. (2020)
CABRA	Brazil	2021	735	Almagro et al. (2021)
CAMELS-GB	Great Britain	2020	671	Coxon et al. (2020)
CAMELS-Aus	Australia	2021	222	Fowler et al. (2021)
LamaH-CE	Central Europe	2021	859	Fowler et al. (2021)
CARAVAN	multiple	2023	6,830	Kratzert et al. (2023)

Easily-accessible datasets have led to a mini-boom in the use of machine learning in hydrology. Indeed, these data are ripe for exploration, and are ideally suited for practitioners of machine learning to explore. Artificial neural networks have been used to predict river discharge (as a form of black box rainfall-runoff model) since the early 1990s (Peel and McMahon 2020). Recent advances in Long short-term memory (LSTM) networks have been able to make remarkably accurate predictions (Kratzert et al. 2019).

Runoff Data Limitations

In the water budget equation (Eq. 1.1), the runoff term R is meant to quantify the total flux of water leaving the watershed. Here, we use observed or modeled river discharge as a proxy for runoff. This is an imperfect measure on a number of counts. First, it does not account for subsurface flow (groundwater flow) which is not captured by gages (Fekete, Vörösmarty, and Grabs 2002). Second, it does not account for man-made interbasin transfers or other human impacts.

Yet, it is well-understood and well-documented that human alterations have large impacts on the natural water cycle (Vorosmarty et al. 2000; Hanasaki, Kanae, and Oki 2006, see e.g.,). For example, interbasin transfers can have a confounding effect on our water budget calculations. These are engineering projects consisting of pipelines or canals that transfer water from one watershed to another. Some examples of large interbasin transfers include the Tajo-Segura interbasin water transfer in Spain, the California Water Project in the US, and China’s South–North Water Transfer Project. (Note that this is just a few examples and is not at all comprehensive. Many more interbasin transfers exist.)

Our simple water balance model with four variables does not account for man-made water exports from a basin. Where water is imported into a basin, if imported water is used primarily inside homes and discharged to the ocean, it may not have any impact on our analysis. On the other hand, when imported water use is for irrigation, or to fill reservoirs, this could be a significant source of error. Studies conducted at the scale of a region or a watershed should make an effort to account for man-made fluxes into or out of the watershed. However, for this global study, this was not practical. Furthermore, such data are not always readily available, or easy to interpret.

The national hydrologic agency in the United States, the USGS, provides special flags to alert data users when a river gage is affected by diversions or withdrawals. The agency has also published a dataset on a subset of 1,659 gages that are “relatively free of confounding anthropogenic influences,” for the purposes of studying long-term variations in hydrology (Slack and Landwehr 1992). Unfortunately, the GRDC does not provide a similar flag for its discharge database.

Environmental Indices

I also collected observations of ancillary environmental data as inputs to our NN model, listed in Table 2.5. The working hypothesis in my thesis research is that errors in EO data are the consequence of certain environmental conditions. For example, precipitation estimates are often biased in mountainous regions, or in relation to snow cover. If we provide the NN model with inputs that allow it to identify mountainous regions or snow covered regions, it should be able to make suitable corrections. In a previous study, Munier and Aires (2018) used ancillary environmental data to provide local correction factors for water budget components based on vegetation index (NDVI) and an aridity index (average \(P - E\)). Our hypothesis was a NN model fed with more environmental data could perform even better at correcting these errors and closing the water budget. The relationships are likely to be complex and non-linear, which a well-trained NN model should be capable of finding.

Each of the ancillary datasets in Table 2.5 consists of gridded geographic data. Certain of the environmental datasets are static; i.e., they do not vary over time (e.g., elevation, latitude). Other datasets are time-variable, such as solar radiation or vegetation indices. In all cases, I rescaled and reprojected environmental datasets as necessary to the standard project grid and calculated spatial means for river basins as described in Section 3.4. High-resolution data (e.g., 0.05°, 3600 x 7200 pixels), were upscaled to my standard 0.25° grid. While this was not strictly necessary, it lets me use the same data and workflows for computing basin means.

Table 2.5: Ancillary environmental data compiled at the river basin and pixel scale for use as inputs to an NN model.

#	Variable	Units	Source	Min	Median	Max
1	Aridity index	dimensionless	calculated	0.00032	0.57	2.9
2	Elevation, basin mean	meters	Amatulli et al. (2018)	15	520	5300
3	Latitude, basin centroid	decimal degrees	calculated	−50	27	75
4	Slope, basin median	dimensionless	Amatulli et al. (2018)	0	1.2	26
5	Vegetation Index, EVI	dimensionless	Didan (2015)	−0.17	0.21	0.7
6	Irrigated area (percent)	dimensionless	Siebert et al. (2015)	0	0.0006	0.76
7	Longitude, basin centroid	decimal degrees	calculated	−160	27	180
8	Burned area (percent)	dimensionless	Giglio, Schroeder, and Hall (2020)	0	0	0.55
9	Snow cover (percent)	dimensionless	Hall and Riggs (2021)	0	0	100
10	Solar radiation	J/m²	Hogan (2015)	0	18.1 × 10⁶	346 × 10⁶
11	Temperature	°C	Wan, Hook, and Hulley (2021)	−45	26	57
12	Vegetation growth/senescence	dimensionless	calculated	−0.15	0.22	0.66

Aridity Index

The aridity index (or sometimes dryness index) is a widely used measure of the long-term hydro-climatic conditions of a region. It is variously described in the literature as a “degree of dryness” or a “degree of water deficiency.” Definitions vary, but it generally describes the ratio between available water and water demand for evaporation and water use by plants. The aridity index has been used in a wide range of studies and has been shown to be correlated with the runoff in basins (Arora 2002). Methods of calculating the aridity index also vary. Among the most well-known and widely used definitions of the aridity index is the one proposed by Budyko (1974), \(\phi = (E_0 / P)\), where P is precipitation and \(E_0\) is the reference evapotranspiration.

I used a contemporary global dataset from the Consultative Group on International Agricultural Research (CGIAR), which reports both annual average and monthly average aridity at the pixel scale over global land surfaces (Trabucco and Zomer 2019). The annual average aridity index is shown in Figure 2.15. The CGIAR defines the aridity index as:

\[\text{AI} = \frac{\bar{P}}{ \bar{E_\text{o}} } \]

where \(\bar{P}\) is the mean annual precipitation, and \(\bar{E_\text{o}}\) is the mean annual potential evapotranspiration for a reference crop¹¹.

It is worth noting that some researchers call CGIAR’s AI a humidity index. The CGIAR’s definition is counterintuitive as it increases as climates get wetter. For dry environments AI \(\rightarrow 0\), while for wet, humid environments AI \(\rightarrow \infty\).

Topographic Data: Elevation and Slope

I calculated the mean basin elevation and median slope using an open-source dataset of gridded global physiographic data published by Amatulli et al. (2018). This dataset combines several terrain-related variables, intended for environmental and biodiversity modeling.

I calculated the basin latitude and longitude based on the watershed polygon shapefiles in latitude and longitude coordinates using the software QGIS. The latitude and longitude are reported at the centroid of each river basin.

Vegetation

Satellite data is commonly used to identify the location and extent of vegetation by combining information from visible and near infrared wavelengths. These measures take advantage of the fact that chlorophyll in plant leaves absorbs and emits radiation in particular wavelengths. The first such measurement, which is still widely used, is the Normalized Difference Vegetation Index (NDVI). The NDVI is broadly an index of “greenness” that ranges from \(-1.0\) to +1.0. While NDVI is not well correlated with photosynthesis rate or vegetation mass, it is used to “characterize the global range of vegetation states and processes” (NCAR 2023). In other words, one cannot readily compare the NDVI in different areas to determine whether one location has more vegetation or biomass. However, there are rules of thumb for interpreting NDVI. According to the USGS Remote Sensing Phenology Program (Brown 2018), here are typical NDVI values:

I obtained vegetation indices measured by the MODIS instrument onboard two different NASA satellites, Aqua and Terra. The Aqua satellite was launched in May 2002, and the first monthly data are available for July 2002. The Terra satellite was launched in December 1999, and the first monthly data are available for February 2000. Because of the better time coverage of the data from Terra, I chose to use this version for our analysis. Other remote sensing products related to vegetation are available. The most notable are the leaf area index (LAI) and fraction of absorbed photosynthetically active radiation (FAPAR).

I used the newer “enhanced” vegetation index EVI, rather than NDVI, as it “minimizes canopy-soil variations and improves sensitivity over dense vegetation conditions.” Both indices provide a measure of the “composite property of leaf area, chlorophyll and canopy structure” (NCAR 2023). The newer algorithm is thought by experts to be more accurate in areas of dense canopy . Note that I used version 6 data; a newer version 6.1 became available in 2022, but was not yet complete. The EVI takes on theoretical values from –0.2 to +1.0.

I hypothesized that not only the amount of vegetation is important to the water cycle, but also the rate of vegetation growth or senescence. Therefore, I created a new ancillary variable by calculating the monthly rate of change of EVI shown. I refer to this rate herein as “vegetation growth/senescence” and it is shown in the final row of 2.5.

NDVI vs. EVI

Because EVI is a relatively new measurement, I did some analysis to better understand how it is different from NDVI, including creating summary statistics and maps. Figure 2.16 is a map of the correlation of NDVI vs EVI at the global scale. The map compares the correlations between the two time series in every pixel of the map. The green color means that in many places, the time series are highly correlated. However, large disagreement (as shown by orange and red colors) is seen in the Amazon and in the island region of Southeast Asia (Malaysia, Indonesia, Brunei, and Papua New Guinea).

Figure 2.17 is a histogram of the two datasets comparing the average values in all pixels over the period 2000 – 2019. The distributions are dissimilar, and EVI has a lower median value than NDVI, and fewer values in the highest part of the range between 0.6 and 1.

NOAA AVHRR NDVI

For hindcasting applications, we wish to use an NN model to estimate TWSC prior to the launch of the first GRACE satellites in 2002 (this will be described in Section 4.4.2). I attempted to build a model with the most explanatory capability that includes a variety of environmental variables, as described here. Unfortunately, the datasets NDVI and EVI described above are from the MODIS instruments onboard the Aqua and Terra satellites, and these particular datasets are not available before 2000. Therefore, I considered using an older vegetation dataset based on the Advanced Very High Resolution Radiometer (AVHRR).

The AVHRR is an instrument onboard a series of polar-orbiting NOAA weather satellites (NOAA 7, 9, 11, 14, 16, 17, and 18). NOAA publishes data on surface reflectance and NDVI based on AVHRR data. The dataset is daily, has global coverage, and a resolution of 0.05°. The data are available from June 24, 1981 to present (Vermote 2018).

A related dataset is available from NOAA based on an instrument called the Visible Infrared Imaging Radiometer Suite (VIIRS). This instrument was meant to improve upon its predecessor AVHRR (Vermote 2022). It is among the five instruments onboard the Suomi National Polar-orbiting Partnership (SNPP) satellite launched on October 28, 2011.

USGS Landsat Vegetation Indices

Another vegetation dataset with a long record is available from the Landsat satellites, operated by the United States Geological Survey (USGS). These data are very high resolution compared to the datasets described above, with a horizontal resolution of 1 arcsecond, which is equivalent to 1/3600°, or about 30 m near the equator. Vegetation from Landsat is interesting, largely because it spans many years, but it would require a great deal of specialized data processing.

Table 2.6 summarizes available datasets of global vegetation derived from satellite observations. Among the available data, there is a wide variety in the spatial and temporal resolutions. It is not immediately clear how to obtain a long record of EO vegetation data. Creating a high-quality dataset would involve comparing the spatio-temporal patterns among datasets and carefully intercalibrating them so that they can be combined.

Name	Publisher	Begins	Ends	Time Res.	Spatial Res.	Source
NDVI
AVHRR	NOAA	1981	present	day	0.05°	doi: 10.7289/V5ZG6QH9
AVHRR composite	USGS	1989	2019	10 day	1/12°	doi: 10.5066/F7707ZKN
GIMMS NDVI3g	NASA	1981	2015	half-month	1/12°	NASA (2019)
VIIRS	NOAA	2014	present	day	0.05°	doi: 10.25921/GAKH-ST76
Landsat	USGS	1982	present	16 days	1/3600°	USGS (2022)
MODIS/Terra	USGS	Jan 2000	present	day, month	0.05°	doi: 10.5067/MODIS/MOD13C2.006
MODIS/Aqua	USGS	Jun 2002	present	day, month	0.05°	doi: 10.5067/MODIS/MYD13C2.006

EVI
Landsat	USGS	1982	present	16 days	1/3600°	USGS (2022)
MODIS/Terra	USGS	Jan 2000	present	day, month	0.05°	doi: 10.5067/MODIS/MOD13C2.006
MODIS/Aqua	USGS	Jun 2002	present	day, month	0.05°	doi: 10.5067/MODIS/MYD13C2.006

Chu et al. (2022) published an article where they describe how their team created an extended time series of NDVI by combining data from different sensors, including AVHRR. I unsuccessfully attempted to acquire these data. The article states that data will be made available upon request. I requested the dataset from the authors and my messages went either undelivered or ignored. This phenomenon is unfortunately common in scientific publishing. One group of scholars that has examined data sharing practices opined: “statements of data availability upon (reasonable) request are inefficient and should not be allowed by journals.” (Tedersoo et al. 2021).

Other sources of information on vegetation are available from models, but these were not considered here. Reanalysis models such as GLDAS or ERA5 include relatively simple vegetation dynamics based on assimilation of EO of leaf area index (LAI) and other observations. More detailed simulation is done in a class of models referred to as Dynamic Global Vegetation Models (DGVM). These models are designed to simulate the effects of changing climate on vegetation, and resultant changes to the hydrologic and biogeochemical cycles.

In conclusion, due to a lack of available vegetation data with a sufficiently long time record, I used EVI from MODIS/Terra as an input variable for the recent GRACE era (2002–2019), but dropped vegetation as an input variable in hindcasting experiments for 1980–1999.

Irrigated Area

Human activities can have a profound impact on the water cycle. Diversions and abstractions for irrigation may reduce runoff, take water from groundwater storage, and increase evapotranspiration through crop water use. For this reason, I hypothesized that irrigated area may be a useful explanatory variable. I estimated the percent of land area that is irrigated based on a global dataset by Siebert et al. (2015). The authors estimated irrigated area by “combining sub-national irrigation statistics with different data sets on the historical extent of cropland and pasture,” and validated estimates with observations over the western United States. One limitation is that the data are through the year 2005. I chose to extract and use the values for this year, rather than attempting to do some kind of extrapolation of the time series through 2019. Therefore, the data represents a snapshot of one point in time, limiting its accuracy. This dataset is shown in Figure 2.18.

Land Surface Temperature

One could argue that land surface temperature is already indirectly included in my database, as it is among the input variables for evapotranspiration data products. Still, I chose to include this explanatory variable as an input to the neural network model as it has a strong, direct link on the hydrologic cycle. I obtained MODIS/Terra Land-Surface Temperature (Wan, Hook, and Hulley 2021). As these data are available as monthly estimates on a 0.05° grid, including this variable was relatively straightforward.

Solar Radiation

Solar radiation has a strong effect on evaporation, vegetation growth, and the terrestrial energy balance. I obtained gridded monthly data for the variable “surface short-wave (solar) radiation downwards” from the ERA5 model (Muñoz-Sabater et al. 2021). Annual monthly solar radiation in Watts per square meter is shown in Figure 2.19.

Burned Area

Fire plays a role in altering the hydrologic cycle – burning of biomass releases water vapor to the atmosphere, and patterns of evapotranspiration are altered by changes to vegetation. Authors of a recent study concluded that “current and historical fires significantly affect terrestrial ecosystems, which can alter hydrologic fluxes” (Fang Li and Lawrence 2017). The authors attempted to quantify these changes, concluding that fire reduced global evapotranspiration by 0.6 × 10 km³. This averages 0.003 mm/month, a seemingly insignificant amount (see Figure 2.18 for typical fluxes). Yet, fires can have a large local impact at certain places and at certain times. Therefore, I included an ancillary variable of burned area as a proxy for fire activity. I used data from the Terra and Aqua satellites’ Moderate Resolution Imaging Spectroradiometer (MODIS) instrument. The presence of fire activity is determined through measurement of thermal anomalies measured from orbit. The creators of this dataset caution that burned area estimates have high uncertainty “due to nontrivial spatial and temporal sampling issues” (Giglio, Schroeder, and Hall 2020).

Snow Cover

The presence of snow and ice can negatively affect the accuracy of satellite observations of the water cycle (Kidd et al. 2012; Q. Cao et al. 2018). I hypothesized that including data on snow cover would allow the neural network model to make corrections in areas where fluxes are biased. I included a monthly variable on percent snow cover from the MODIS/Terra (Hall and Riggs 2021), spatially averaged and upscaled to our 0.25° project grid.

Preliminary Analysis

In this section, I show the results of some preliminary analysis of the EO datasets. I performed a variety of exploratory data analysis and visualization of the EO datasets by creating maps, time series plots, and statistical summaries. This is a critically important step in the process of ingesting datasets from various sources, in order to verify that the various geographic transformations and unit conversions have been done correctly. This helps to see where datasets are in agreement, and when and where there is more divergence in their estimates. Above, I listed and described many EO datasets describing different components of the hydrologic cycle. I selected datasets for my analyses based on their spatial and temporal coverage and their data quality.

Figure 2.20 shows a snapshot of one month (January 2005) of the EO datasets used as input in our analysis. Some of the source datasets included data over both land and oceans. In these cases, I masked data over the oceans to facilitate visual comparison. Further, I excluded Antarctica from the datasets. The maps in Figure 2.20 show that the different datasets share many similarities in terms of the overall patterns, but many small differences are apparent upon close inspection. For example, precipitation according to GPCP appears smoother, while the other two datasets, which have a higher spatial resolution, show finer-grained patterns of high and low rainfall, particularly evident over the Amazon and southern Africa. Similarly, one can see differences in the spatial patterns of E and ΔS. River discharge, measured at gages, has a much sparser coverage, and the distribution of flows is highly skewed, with measured discharges covering several orders of magnitude from 0 to nearly 1,000 mm/month.

Agreement and Disagreement among Datasets

Boxplots are useful for visualizing the distributions and central tendencies among the datasets. Figure 2.18 shows the distribution of values in the EO datasets used as input in our model. The boxes show the interquartile range, and the whiskers show the 10%-ile and the 90%-ile. Outliers are not shown, nor are the minimum or maximum values. The top boxplot in each set of observations is for all pixels over land within our analysis domain, which excludes Antarctica, Greenland, and areas above 73° North. The lower box in each set shows the distribution across our 2,056 basins. I calculated the basin mean fluxes from the gridded data using an area-weighted averaging method described in the Section 3.4. The statistics in Figure 2.18 were calculated over the 20-year period from 2000 to 2019. The marked differences among EO datasets is further evidence of the need for their calibration.

One can see in Figure 2.18 that, for most variables, the distribution of fluxes is greater over the pixels compared to the basins, with higher highs and lower lows. This is particularly the case for precipitation, but is also seen with evapotranspiration.

Calculating the mean flux over a basin tends to smooth out the extreme values and compress the distribution of observed values. (The gaged basins coverage, in terms of 0.25° pixels, is anywhere from 8 pixels to over 6,000 pixels, with a median of 19 and an average of 122 pixels in a basin.)

There are also differences in the distributions within each component of the water cycle. For example, GPM-IMERG contains higher observations of precipitation, with a higher 75- and 95-percentile than the other two datasets. The monthly water storage change, ΔS, is centered at about zero for each dataset. This is expected, as the storage in pixels and basins tends to fluctuate seasonally, and any long-term trend is small compared to the annual variations in storage. Runoff has the smallest magnitude of any of the hydrologic fluxes, with a low of 0 mm/month (no observed flow) to a 90%-ile of 68 mm/month, lower than the 90%-ile of P or E.

Trends in Total Water Storage

In this section, I analyze trends in total water storage, following methods used by Rodell et al. (2018). The purpose of this was two-fold: First, to verify that our handling of the data is correct by recreating the analysis in a well-known, peer-reviewed study based on GRACE data. The second purpose was to update the analysis by Rodell et al., which was based on observations through 2016. With five additional years of data, do the trends found by Rodell et al. continue or are they reversed? Furthermore, Rodell et al. (2018) used only one of the three available GRACE datasets, the one from JPL (Landerer and Cooley 2021). How does the trend in water storage vary, based on the different datasets? What is the effect on storage trends of averaging the three datasets?

For this analysis, I followed Rodell et al. (2018) in calculating the trend using ordinary least squares regression. I calculated the slope and intercept for the TWS anomaly. I repeated the analysis first for the time period from 2002 to 2016 (matching the analysis by Rodell et al.) and then for an updated time period 2002 to 2021. Maps of the trend in TWS are presented in Figure 2.22. The maps appear to be quite similar, despite some minor differences in the colors. Some differences in the maps could come from the different versions of GRACE data. My calculations and mapping were done with updated GRACE data, release 06, while Rodell et al. used a previous version, release 05. NASA continually updates the algorithms and coefficients used for processing GRACE data and estimating TWS anomalies. According to NASA, “GRACE is a first-of-a-kind mission, so not surprisingly, revisions to the data processing are more frequent than for more mature satellite measurements” (NASA Jet Propulsion Laboratory, 2018).

Figure 2.23 illustrates the trend in GRACE total water storage, as calculated by the three different datasets: CSR, GSFC, and JPL. Maps at the right show the P-value, or the probability that the trend is different from zero. The overall patterns appear similar, but careful comparison of the maps reveals several differences. For example, the trends appear to be somewhat different over southern Africa.

Zooming in on a single continent, such as South America in Figure 2.24, one sees distinct spatial patterns in each of the datasets. These patterns are artifacts of the data and processing used for each dataset. For example, with the CSR dataset, one can see the hexagonal grid of the mascons. The GSFC data has been smoothed with a Gaussian filter. The JPL dataset appears blocky, an artifact of the rectangular mascon 3° × 3° grid.

The levels refer to the amount of processing that has been done to the data. Level 1 is the raw satellite data. Level 2 has been processed to determine the gravity field. Level 3 is further processed to estimate changes in the amount of water in an area.↩︎
The choice of minimum watershed size was a compromise between two competing goals. First, we wanted to have large watersheds that covered many pixels on the 0.25° grid that we used for the first set of experiments. An area of 2,500 km² corresponds to about 8 pixels. This ensures that our estimates of TWSC are fairly robust, as it is averaging the values over several pixels. On the other hand, setting a larger threshold for the watershed area would mean a smaller sample of basins available to train our model.↩︎
There are many gages in the GRDC database with very little data, at least in our time period of interest. The threshold of 6 observations is arbitrary. We felt that basins could provide valuable training data even when there were only a small number of observations. This is particularly true when these gages are in environments or regions where few others are available.↩︎
The Philippines is an island archipelago with a variable climate, stretching across 1,850 kilometers from north to south. Annual rainfall ranges from around 1,000 millimeters in mountain valleys to 5,000 millimeters along the east coasts of the major islands (US Library of Congress 2005).↩︎
The Nash–Sutcliffe efficiency and Kling-Gupta efficiency are goodness-of-fit indicators commonly used in the hydrologic sciences. I describe them in more detail in Sections 4.1.12 and 4.1.13.↩︎
In the case of the CGIAR aridity index, \(E_o\) is based on the widely accepted Food and Agriculture Organization (FAO) method (Allen and Pavelsky 2018). This method uses the Penman-Montieth equation and defines potential evapotranspiration as the “evapotranspiration of a reference crop (E_o) under optimal conditions, having the characteristics of well-watered grass with an assumed height of 12 centimeters, a fixed surface resistance of 70 seconds per meter and an albedo of 0.23.” Inputs to the Penman-Montieth equation include solar radiation, humidity, and wind speeed.↩︎

Chapter 2 Earth Observation Datasets