Data Preparation and Feature Engineering#
Model data come from the following sources: the Natural Resources Conservation Service (NRCS) Snow Telemetry (SNOTEL) network, the California Data Exchange Center (CDEC), the Copernicus 90-m DEM, and the NASA Airborne Snow Observatory (ASO). Feature engineering creates new features to capture the influences of the time of year, latitude, longitude, and elevation on SWE dynamics. Below is a generalized workflow for retrieving the data and engineering the following features for input into the machine learning models.
| Feature id | Description |
| --- | --- |
| WY Week | Numerical ID of the week of the water year |
| Latitude | Center latitude of the training grid cell |
| Longitude | Center longitude of the training grid cell |
| Elevation | DEM elevation of the training grid cell |
| Northness | Calculated northness of the training grid cell |
| SNOTEL SWE | Current week’s observed SWE from SNOTEL |
| Prev SNOTEL SWE | Previous week’s observed SWE from SNOTEL |
| Delta SNOTEL SWE | Difference between the previous week’s and current week’s observed SWE from SNOTEL |
| Previous SWE | Observed SWE from the previous week |
Note that the Previous SWE feature is an observation for model training and testing but is predicted by the model when forecasting.
Data Processing/Retrieval Overview#
Data processing is often the most tedious part of the model development pipeline. Here, elevation, slope, and aspect are derived from DEM data and sampled by nearest-neighbor interpolation at the four latitude/longitude corners of each grid cell to produce values for all training, testing, and inference locations; a minimal sketch of the corner sampling follows.
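The following is a simplified, hypothetical sketch of the corner-sampling idea (not the full GeoStat_func() workflow shown later); it assumes dem is a Copernicus DEM tile opened with rioxarray in EPSG:4326 and corners holds a grid cell's four (longitude, latitude) corner points.
#Minimal sketch: nearest-neighbor sample a DEM at a grid cell's four corners and average to one 1-km value
#`dem` and `corners` are illustrative assumptions, not objects created elsewhere in this tutorial
import numpy as np
import rioxarray
def cell_mean_elevation(dem, corners):
    #nearest-neighbor sample the DEM at each (lon, lat) corner, then average the four values
    values = [float(dem.sel(x=lon, y=lat, method="nearest").values[0]) for lon, lat in corners]
    return np.mean(values)
#example usage with a hypothetical tile and cell:
#dem = rioxarray.open_rasterio("Copernicus_DSM_COG_30_N39_00_W121_00_DEM.tif")
#cell_mean_elevation(dem, [(-120.10, 39.00), (-120.10, 39.01), (-120.09, 39.01), (-120.09, 39.00)])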
Feature Engineering Overview#
Feature engineering consists of creating terrain (i.e., northness), temporal, and snow features.
The northness metric is calculated from the slope and aspect values of each grid cell; by embedding the slope and aspect information in a single value, it aids ML model training and reduces the model dimensionality.
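Concretely, with slope and aspect converted to radians, the northness of a grid cell is \( \text{northness} = \sin(\text{slope}) \times \cos(\text{aspect}) \), which ranges from -1 for steep, south-facing terrain to 1 for steep, north-facing terrain, with values near 0 for flat terrain.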
Temporal features support model training relative to seasonal snow accumulation and melt phases and consist of an integer week id counted from the beginning of the water year (WY) on October 1st; a minimal sketch of the calculation is shown below.
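The following is a simplified, single-date sketch of that week id (the full week_num() function applied to a DataFrame appears later in this notebook):
#Minimal sketch: week of the water year for a single date
import pandas as pd
def wy_week(date):
    date = pd.to_datetime(date)
    #dates in Oct, Nov, and Dec fall in the water year that starts October 1st of the same calendar year
    wy_year = date.year if date.month >= 10 else date.year - 1
    return round((date - pd.to_datetime(f"{wy_year}-10-01")).days / 7)
wy_week("2019-01-02")  #returns 13, i.e., roughly 13 weeks into the 2019 water year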
Snow observation features use in-situ station SWE observations, with the values observed in the current week (Snotel/CDEC SWE) and the previous week (Previous Snotel/CDEC SWE) as inputs. From the current and previous observations, Delta SWE ( \(\Delta\) SWE) captures the difference, and therefore the positive or negative trend in SWE dynamics (i.e., accumulation or melt), at each monitoring station.
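In equation form, for a station observation in week \(t\), \( \Delta \text{SWE}_t = \text{SWE}_t - \text{SWE}_{t-1} \), so positive values indicate accumulation and negative values indicate melt.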
Grid cell previous observation features capture the strong serial correlation of snow accumulation and melt: the model uses the SWE estimate from the previous week as a feature (Previous SWE). For model training and testing, the Previous SWE input comes from the NASA ASO datasets, while the hindcast uses the model's SWE estimate from the previous week.
Below are simple data access and processing examples; please see the National Snow Model GitHub for complete examples.#
Loading Provided SWE Observations from Snowcast Showdown#
The project supported participation in the Snowcast Showdown, and while we shared useful code for SNOTEL and CDEC data retrieval (needed for forecasting), the observations were provided by the United States Bureau of Reclamation. Here we provide the supporting modeling materials along with the respective data processing and feature engineering steps.
%pip install myst-nb pandas==1.4.3 h5py tqdm tables scikit-learn seaborn tensorflow progressbar contextily hydroeval geopandas==0.10.2 nbformat==5.7.0 pystac_client planetary_computer rioxarray matplotlib basemap numpy richdem
import pandas as pd #need to pip install 1.4.3
import json
from tqdm import tqdm #need to pip install
import warnings; warnings.filterwarnings("ignore")
datapath ='/home/jovyan/shared-public/snow-extrapolation-web/'
#Set up training DF with key metadata per site
#All coordinates of 1 km polygon used to develop ave elevation, ave slope, ave aspect
colnames = ['cell_id', 'Region', 'BR_Coord', 'UR_Coord', 'UL_Coord', 'BL_Coord']
SWEdata = pd.DataFrame(columns = colnames)
#Load training SWE data
TrainSWE = pd.read_csv(f"{datapath}/data/Prediction_Location_Observations.csv")
#drop na and put into modeling df format
TrainSWE = TrainSWE.melt(id_vars=["cell_id"]).dropna()
#Load SWE location data
with open(f"{datapath}/data/grid_cells.geojson") as f:
data = json.load(f)
#load ground truth values(SNOTEL): training
GM_Train = pd.read_csv(f"{datapath}/data/ground_measures_train_features.csv")
#drop na and put into modeling df format
GM_Train = GM_Train.melt(id_vars=["station_id"]).dropna()
#load ground truth values (SNOTEL): Testing
GM_Test = pd.read_csv(f"{datapath}/data/ground_measures_test_features.csv")
#drop na and put into modeling df format
GM_Test = GM_Test.melt(id_vars=["station_id"]).dropna()
#load ground truth meta
GM_Meta = pd.read_csv(f"{datapath}/data/ground_measures_metadata.csv")
#merge training ground truth location metadata with snotel data
GM_Train = GM_Meta.merge(GM_Train, how='inner', on='station_id')
GM_Train = GM_Train.set_index('station_id')
GM_Train.rename(columns={'name': 'location', 'latitude': 'Lat', 'longitude': 'Long', 'value': 'SWE'}, inplace=True)
#merge testing ground truth location metadata with snotel data
GM_Test = GM_Meta.merge(GM_Test, how='inner', on='station_id')
GM_Test = GM_Test.set_index('station_id')
GM_Test.rename(columns={'name': 'location', 'latitude': 'Lat', 'longitude': 'Long', 'value': 'SWE'}, inplace=True)
#Make a SWE Grid location DF
for i in tqdm(range(len(data["features"]))):
properties = data["features"][i]["properties"]
location = data["features"][i]["geometry"]
DFdata = [properties["cell_id"], properties["region"], location["coordinates"][0][0],
location["coordinates"][0][1], location["coordinates"][0][2], location["coordinates"][0][3]]
df_length = len(SWEdata)
SWEdata.loc[df_length] = DFdata
#Make SWE location and observation DF
#Training
#merge site location metadata with observations
TrainSWE = TrainSWE.merge(SWEdata, how='inner', on='cell_id')
TrainSWE = TrainSWE.set_index('cell_id')
TrainSWE.rename(columns={'variable': 'Date', 'value': 'SWE'}, inplace=True)
#Make sure Date is in datetime data type
TrainSWE['Date'] = pd.to_datetime(TrainSWE['Date'])
Selecting the Sierra Nevada Mountains as a demonstration region for the tutorial#
For the tutorial, we select a subset of the domain to reduce the computational burden on the end user and speed up model development.
#For the model tutorial, selecting the Sierra Nevada region ('sierras')
TrainSWE = TrainSWE[TrainSWE['Region'] =='sierras']
SWEdata = SWEdata[SWEdata['Region'] =='sierras']
GM_Train = GM_Train[GM_Train['state'] =='California']
display(TrainSWE.head(5))
print('There are: ', len(TrainSWE), ' training points in the Sierra dataset at ', len(TrainSWE.index.unique()), ' locations')
display(GM_Train.head(5))
print('There are: ', len(GM_Train.index.unique()), ' monitoring stations in the Sierra dataset providing ', len(GM_Train), ' observations')
#Get Lat Long information
#Bottom right coord
TrainSWE[['BR_Coord_Long','BR_Coord_Lat']] = pd.DataFrame(TrainSWE.BR_Coord.tolist(), index= TrainSWE.index)
#Upper right coord
TrainSWE[['UR_Coord_Long','UR_Coord_Lat']] = pd.DataFrame(TrainSWE.UR_Coord.tolist(), index= TrainSWE.index)
#Upper left coord
TrainSWE[['UL_Coord_Long','UL_Coord_Lat']] = pd.DataFrame(TrainSWE.UL_Coord.tolist(), index= TrainSWE.index)
#Bottom Left coord
TrainSWE[['BL_Coord_Long','BL_Coord_Lat']] = pd.DataFrame(TrainSWE.BL_Coord.tolist(), index= TrainSWE.index)
Getting Copernicus Geospatial Data - 90 m DEM#
We use the Copernicus 90-m DEM hosted on the Microsoft Planetary Computer to provide the geospatial information for each training and testing location. For each corner of each grid cell, we get the elevation, slope, and aspect, then calculate the average across the four corners to form the 1-km grid-cell geospatial information. We use the following code to access the DEM:
#Develop a DF to get each site's geospatial information
geocols = [ 'BR_Coord_Long', 'BR_Coord_Lat', 'UR_Coord_Long', 'UR_Coord_Lat',
'UL_Coord_Long', 'UL_Coord_Lat', 'BL_Coord_Long', 'BL_Coord_Lat']
Geospatial_df = TrainSWE.copy()
Geospatial_df['rowid'] = Geospatial_df.index
Geospatial_df = Geospatial_df.drop_duplicates(subset = 'rowid')
Geospatial_df = pd.DataFrame(Geospatial_df[geocols])
#for the tutorial, we will just use a few sites for demonstration purposes
Geospatial_df = Geospatial_df.head(5)
#Define the AOI around the cell locations from clockwise
area_of_interest = {
"type": "Polygon",
"coordinates": [
[
#lower left
[Geospatial_df['BL_Coord_Long'].min(), Geospatial_df['BL_Coord_Lat'].min()],
#upper left
[Geospatial_df['UL_Coord_Long'].min(), Geospatial_df['UL_Coord_Lat'].max()],
#upper right
[Geospatial_df['UR_Coord_Long'].max(), Geospatial_df['UR_Coord_Lat'].max()],
#lower right
[Geospatial_df['UR_Coord_Long'].max(), Geospatial_df['BR_Coord_Lat'].min()],
#lower left
[Geospatial_df['BL_Coord_Long'].min(), Geospatial_df['BL_Coord_Lat'].min()],
]
],
}
area_of_interest
Geospatial_df
#Make a connection to get 90m Copernicus Digital Elevation Model (DEM) data with the Planetary Computer STAC API
from pystac_client import Client #need to pip install
import planetary_computer #need to pip install
client = Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1",
ignore_conformance=True,
)
search = client.search(
collections=["cop-dem-glo-90"],
intersects=area_of_interest
)
tiles = list(search.get_items())
#Make a DF to connect locations with the larger data tile, and then extract elevations
regions = []
for i in tqdm(range(0, len(tiles))):
row = [i, tiles[i].id]
regions.append(row)
regions = pd.DataFrame(columns = ['sliceID', 'tileID'], data = regions)
regions = regions.set_index(regions['tileID'])
del regions['tileID']
regions
The following code blocks can take some time, as we are retrieving the geospatial attributes at each corner of each grid cell and then taking the average to get grid-cell geospatial attributes.
#added Long,Lat to get polygon points
def GeoStat_func(i, Geospatial_df, regions, elev_L, slope_L, aspect_L, Long, Lat, tile):
# convert coordinate to raster value
lon = Geospatial_df.iloc[i][Long]
lat = Geospatial_df.iloc[i][Lat]
#connect point location to geotile
tileid = 'Copernicus_DSM_COG_30_N' + str(math.floor(lat)) + '_00_W'+str(math.ceil(abs(lon))) +'_00_DEM'
indexid = regions.loc[tileid]['sliceID']
#Sign the asset to access the data on the Planetary Computer
signed_asset = planetary_computer.sign(tile[indexid].assets["data"])
#get elevation data in xarray object
elevation = rioxarray.open_rasterio(signed_asset.href)
#create copies to extract other geophysical information
#Create Duplicate DF's
slope = elevation.copy()
aspect = elevation.copy()
#transform projection
transformer = Transformer.from_crs("EPSG:4326", elevation.rio.crs, always_xy=True)
xx, yy = transformer.transform(lon, lat)
#extract elevation values into numpy array
tilearray = np.around(elevation.values[0]).astype(int)
#set tile geotransform to get slope and aspect, and convert to an rdarray
geo = (math.floor(float(lon)), 90, 0.0, math.ceil(float(lat)), 0.0, -90)
tilearray = rd.rdarray(tilearray, no_data = -9999)
tilearray.projection = 'EPSG:4326'
tilearray.geotransform = geo
#get slope and aspect values (note: the slope values appear too high and still need to be fixed)
slope_arr = rd.TerrainAttribute(tilearray, attrib='slope_degrees')
aspect_arr = rd.TerrainAttribute(tilearray, attrib='aspect')
#save slope and aspect information
slope.values[0] = slope_arr
aspect.values[0] = aspect_arr
elev = round(elevation.sel(x=xx, y=yy, method="nearest").values[0])
slop = round(slope.sel(x=xx, y=yy, method="nearest").values[0])
asp = round(aspect.sel(x=xx, y=yy, method="nearest").values[0])
#add point values to list
elev_L.append(elev)
slope_L.append(slop)
aspect_L.append(asp)
import math
import rioxarray #need to pip install
from pyproj import Transformer
import numpy as np
import richdem as rd #mamba install richdem -c conda-forge
BLelev_L = []
BLslope_L = []
BLaspect_L = []
#run the elevation function, added tqdm to show progress
[GeoStat_func(i, Geospatial_df, regions, BLelev_L, BLslope_L, BLaspect_L,
'BL_Coord_Long', 'BL_Coord_Lat', tiles) for i in tqdm(range(0, len(Geospatial_df)))]
#Save each points elevation in DF
Geospatial_df['BL_Elevation_m'] = BLelev_L
Geospatial_df['BL_slope_Deg'] = BLslope_L
Geospatial_df['BLaspect_L'] = BLaspect_L
ULelev_L = []
ULslope_L = []
ULaspect_L = []
#run the elevation function, added tqdm to show progress
[GeoStat_func(i, Geospatial_df, regions, ULelev_L, ULslope_L, ULaspect_L,
'UL_Coord_Long', 'UL_Coord_Lat', tiles) for i in tqdm(range(0,len(Geospatial_df)))]
#Save each points elevation in DF
Geospatial_df['UL_Elevation_m'] = ULelev_L
Geospatial_df['UL_slope_Deg'] = ULslope_L
Geospatial_df['ULaspect_L'] = ULaspect_L
URelev_L = []
URslope_L = []
URaspect_L = []
#run the elevation function, added tqdm to show progress
[GeoStat_func(i, Geospatial_df, regions, URelev_L, URslope_L, URaspect_L,
'UR_Coord_Long', 'UR_Coord_Lat', tiles) for i in tqdm(range(0,len(Geospatial_df)))]
#Save each points elevation in DF
Geospatial_df['UR_Elevation_m'] = URelev_L
Geospatial_df['UR_slope_Deg'] = URslope_L
Geospatial_df['URaspect_L'] = URaspect_L
BRelev_L = []
BRslope_L = []
BRaspect_L = []
#run the elevation function, added tqdm to show progress
[GeoStat_func(i, Geospatial_df, regions, BRelev_L, BRslope_L, BRaspect_L,
'BR_Coord_Long', 'BR_Coord_Lat', tiles) for i in tqdm(range(0,len(Geospatial_df)))]
#Save each points elevation in DF
Geospatial_df['BR_Elevation_m'] = BRelev_L
Geospatial_df['BR_slope_Deg'] = BRslope_L
Geospatial_df['BRaspect_L'] = BRaspect_L
Geospatial_df
Prepare Ground Measurements#
#Get all unique Snotel sites
Snotel = GM_Train.copy()
Snotel = Snotel.reset_index()
Snotel = Snotel.drop_duplicates(subset = ['station_id'])
Snotel = Snotel.reset_index(drop = True)
Snotel['Region'] = 'other'
Build DataFrame and Engineer Features#
In this section, we begin building the DataFrames, connect the geospatial information, and process it via feature engineering. Essentially, this section connects the geospatial information to the training and testing data.
#get mean Geospatial data
def mean_Geo(df, geo):
BL = 'BL'+geo
UL = 'UL'+geo
UR = 'UR'+geo
BR = 'BR'+geo
df[geo] = (df[BL] + df[UL]+ df[UR] + df[BR]) /4
#Get geospatial means
geospatialcols = ['_Coord_Long', '_Coord_Lat', '_Elevation_m', '_slope_Deg' , 'aspect_L']
#Training data
[mean_Geo(Geospatial_df, i) for i in geospatialcols]
#list of key geospatial component means
geocol = ['_Coord_Long','_Coord_Lat','_Elevation_m','_slope_Deg','aspect_L']
TrainGeo_df = Geospatial_df[geocol].copy()
#adjust column names to be consistent with snotel
TrainGeo_df = TrainGeo_df.rename( columns = {'_Coord_Long':'Long', '_Coord_Lat':'Lat', '_Elevation_m': 'elevation_m',
'_slope_Deg':'slope_deg' , 'aspect_L': 'aspect'})
TrainGeo_df
Divide Modeling Domain into Sub-domains#
The Snowcast Showdown modeling domain covered the entire western US. Because of differences in snowpack characteristics (maritime, coastal transitional, intermountain, and continental) and regional climate patterns, we divided the original domain into 23 sub-domains. The snowpack classification follows Haegeli, P. (2004), Scale analysis of avalanche activity on persistent snowpack weaknesses with respect to large-scale backcountry avalanche forecasting. For the tutorial, we still need to run the Region_id() function; however, we will focus on the Sierra Nevada region (the N_Sierras and S_Sierras sub-domains).
#make Region identifier. The data already includes Region, but too many 'other' labels
def Region_id(df):
for i in tqdm(range(0, len(df))):
# Sierras
# Northern Sierras
if -122.5 <= df['Long'][i] <= -119 and 39 <= df['Lat'][i] <= 42:
loc = 'N_Sierras'
df['Region'].iloc[i] = loc
# Southern Sierras
if -122.5 <= df['Long'][i] <= -117 and 35 <= df['Lat'][i] <= 39:
loc = 'S_Sierras'
df['Region'].iloc[i] = loc
#Attach a region id for each location
TrainGeo_df['Region'] = 'other'
Snotel['Region'] = 'other'
GM_Train['Region'] = 'other'
#Fix date variable
GM_Train = GM_Train.rename(columns={'variable':'Date'})
#Assign region to dataframes
Region_id(TrainGeo_df)
Region_id(Snotel)
Region_id(GM_Train)
TrainGeo_df
Snotel
GM_Train
Code for slicing regions into regional DataFrames#
While this step is not needed for the tutorial, it supports scaling to a larger domain and ensures the different regions are correctly classified.
#subset data by each region into dictionary
RegionTrain = {name: TrainGeo_df.loc[TrainGeo_df['Region'] == name] for name in TrainGeo_df.Region.unique()}
RegionSnotel = {name: Snotel.loc[Snotel['Region'] == name] for name in Snotel.Region.unique()}
RegionTrain['N_Sierras']
RegionTrain['S_Sierras']
RegionSnotel['N_Sierras'].head()
RegionSnotel['S_Sierras'].head()
#check to make sure no test locations classified as other
print('Training')
#look at region training sites
for i in RegionTrain.keys():
print('There are', len(RegionTrain[i]), ' training locations in ', i)
print(' ')
print('SNOTEL')
#look at region training sites
for i in RegionSnotel.keys():
print('There are', len(RegionSnotel[i]), ' Snotel locations in ', i)
Plot the NASA ASO and SNOTEL sites#
It is important for any modeling exercise to ensure your data are in the locations you expect. The GeoPlot() function plots the respective observations over a basemap of the modeling domain (the Sierra Nevada for this tutorial) to visualize where the sites fall.
#This plots the location of all df data points
import matplotlib.pyplot as plt #need to pip install
from mpl_toolkits.basemap import Basemap #pip install basemap
import random
def GeoPlot(df):
fig = plt.gcf()
fig.set_size_inches(8, 6)
#merc also works for projection # Cylindrical Equal Area. https://matplotlib.org/basemap/api/basemap_api.html#module-mpl_toolkits.basemap
m = Basemap(projection='cea', \
llcrnrlat=35, urcrnrlat=45, \
llcrnrlon=-125, urcrnrlon=-115, \
lat_ts=20, \
resolution='c')
m.bluemarble(scale=2) # full scale will be overkill
m.drawcoastlines(color='white', linewidth=0.2) # add coastlines
# draw coastlines, meridians and parallels.
m.drawcountries()
m.drawstates()
m.drawparallels(np.arange(20,60,10),labels=[1,1,0,0])
m.drawmeridians(np.arange(-120,-90,10),labels=[0,0,0,1])
#Make unique color for each regions
number_of_colors = len(df.keys())
color = ["#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])
for i in range(number_of_colors)]
Location = list(df.keys())
colordict = {k: v for k, v in zip(Location, color)}
for i in df.keys():
x, y = m(np.array(df[i]['Long']), np.array(df[i]['Lat']))
m.scatter(x, y, 10, marker='o', color=colordict[i], label = str(i))
plt.legend(bbox_to_anchor=(1.05, 1.0), loc='upper left')
plt.title('Locations')
plt.tight_layout()
plt.show()
GeoPlot(RegionTrain)
GeoPlot(RegionSnotel)
Merge Geospatial data to SWE observations#
#This function connects stationary geospatial information to observations
def Geo_to_Data(geodf, SWE, id):
dfcols = ['Long','Lat','elevation_m','slope_deg','aspect','Date','SWE','Region']
try:
SWE.pop('Region')
except:
print(' ')
geodf.reset_index(inplace = True)
SWE.reset_index(inplace = True)
datadf = SWE.merge(geodf, how='inner', on=id)
datadf.set_index(id, inplace = True)
datadf=datadf[dfcols]
return datadf
#Create a temporal attribute, week_num(), that reflects the week id of the water year, beginning October 1st
def week_num(df):
#week of water year
weeklist = []
for i in tqdm(range(0,len(df))):
#dates in Oct, Nov, and Dec fall in the water year that starts October 1st of the same calendar year
if df['Date'][i].month<10:
y = df['Date'][i].year-1
else:
y = df['Date'][i].year
WY_start = pd.to_datetime(str(y)+'-10-01')
deltaday = df['Date'][i]-WY_start
deltaweek = round(deltaday.days/7)
weeklist.append(deltaweek)
df['WYWeek'] = weeklist
#Merge Geospatial data to SWE observations
Snotel.set_index('station_id', inplace = True)
#Connect location geospatial attributes to observations
Training = Geo_to_Data(TrainGeo_df, TrainSWE, 'cell_id')
# get snotel station id, region, slope, and aspect to merge with obervations
GM_Snotel_train = GM_Train.copy()
#Make Date in datetime dtype
Training['Date'] = pd.to_datetime(Training['Date'])
GM_Snotel_train['Date'] = pd.to_datetime(GM_Snotel_train['Date'])
#add week number to observations
week_num(Training)
Training
GM_Snotel_train
#Connect observations to regional data
#subset data by each region into dictionary
RegionTrain = {name: Training.loc[Training['Region'] == name] for name in Training.Region.unique()}
RegionSnotel_Train = {name: GM_Snotel_train.loc[GM_Snotel_train['Region'] == name] for name in GM_Snotel_train.Region.unique()}
RegionSnotel_Train['N_Sierras']
Make the Northness Feature#
#This function defines northness: sine(slope) * cosine(aspect), which gives a northness range of -1 to 1.
#Note that slope and aspect must first be converted to radians.
#Some additional if/else statements handle sites with few observations
def northness(df):
if len(df) == 8: #This removes single value observations, need to go over and remove these locations from training too
#Determine northness for site
#convert to radians
df = pd.DataFrame(df).T
df['aspect_rad'] = df['aspect']*0.0174533
df['slope_rad'] = df['slope_deg']*0.0174533
df['northness'] = -9999
for i in range(0, len(df)):
df['northness'].iloc[i] = math.sin(df['slope_rad'].iloc[i])*math.cos(df['aspect_rad'].iloc[i])
#remove slope and aspects to clean df up
df = df.drop(columns = ['aspect', 'slope_deg', 'aspect_rad', 'slope_rad', 'Region'])
return df
else:
#convert to radians
df['aspect_rad'] = df['aspect']*0.0174533
df['slope_rad'] = df['slope_deg']*0.0174533
df['northness'] = -9999
for i in range(0, len(df)):
df['northness'].iloc[i] = math.sin(df['slope_rad'].iloc[i])*math.cos(df['aspect_rad'].iloc[i])
#remove slope and aspects to clean df up
df = df.drop(columns = ['aspect', 'slope_deg', 'aspect_rad', 'slope_rad', 'Region'])
return df
#make northness feature and delete regions, slope, aspect features for each training and testing cell
for i in tqdm(RegionTrain):
RegionTrain[i] = northness(RegionTrain[i])
#Make dictionary in Regions dict for each region's dictionary of Snotel sites
Regions = list(RegionTrain.keys()).copy()
#Make northness for all Snotel observations
for i in tqdm(Regions):
snotel = i+'_Snotel'
RegionTrain[snotel] = {site: RegionSnotel_Train[i].loc[site] for site in RegionSnotel_Train[i].index.unique()}
#get training and testing sites that are the same
train = RegionTrain[snotel].keys()
#format each Snotel site's observations (drop coordinates, index by date, rename columns with the site id)
for j in tqdm(train):
#remove items we do not need
RegionTrain[snotel][j] = RegionTrain[snotel][j].drop(columns = ['Long', 'Lat'])
#make date index
RegionTrain[snotel][j] = RegionTrain[snotel][j].set_index('Date')
#rename columns to represent site info
colnames = RegionTrain[snotel][j].columns
sitecolnames = [x +'_'+ j for x in colnames]
names = dict(zip(colnames, sitecolnames))
RegionTrain[snotel][j] = RegionTrain[snotel][j].rename(columns = names)
for i in tqdm(Regions):
snotel = i+'_Snotel'
RegionTrain[snotel] = {site: RegionSnotel_Train[i].loc[site] for site in RegionSnotel_Train[i].index.unique()}
#get training and testing sites that are the same
train = RegionTrain[snotel].keys()
#get obs
for j in tqdm(train):
#remove items we do not need
RegionTrain[snotel][j] = RegionTrain[snotel][j].drop(columns = ['Long', 'Lat', 'elevation_m', 'location', 'state' , 'Region'])
#make date index
RegionTrain[snotel][j] = RegionTrain[snotel][j].set_index('Date')
#rename columns to represent site info
colnames = RegionTrain[snotel][j].columns
sitecolnames = [x +'_'+ j for x in colnames]
names = dict(zip(colnames, sitecolnames))
RegionTrain[snotel][j] = RegionTrain[snotel][j].rename(columns = names)
#Remove Snotel sites that still contain extra (unused) columns
columns = list(RegionTrain[snotel].keys()).copy()
for col in columns:
if len(RegionTrain[snotel][col].columns) >4:
del RegionTrain[snotel][col]
#make a df for training each region,
for R in tqdm(Regions):
snotels = R+'_Snotel'
RegionTrain[R] = RegionTrain[R].reset_index()
RegionTrain[R] = RegionTrain[R].set_index('Date')
for S in RegionTrain[snotels]:
RegionTrain[R]= pd.concat([RegionTrain[R], RegionTrain[snotels][S].reindex(RegionTrain[R].index)], axis=1)
RegionTrain[R] = RegionTrain[R].fillna(-9999)
RegionTrain['N_Sierras'].head()
Feature Engineering: Previous Week’s SNOTEL SWE#
We use in-situ station SWE observations as features, using the values observed from the current (Snotel/CDEC SWE) and the previous week (Previous Snotel/CDEC SWE) as inputs.
def Prev_SWE_Snotel_Dict(DF, region):
print(region)
regionsnotel = region+'_Snotel'
sites = DF[regionsnotel].keys()
#week delta
weekdelta = pd.Timedelta(7, "d")
for i in tqdm(sites):
#print(i)
prevSWE = 'Prev_SWE_' + i
SWE = 'SWE_'+i
DF[regionsnotel][i][prevSWE] = -9999.99
#need to find the number of columns for ifelse
dfcols = len(DF[regionsnotel][i].columns)
#if only one observation need to fix
if len(DF[regionsnotel][i]) == 1:
DF[regionsnotel][i] = DF[regionsnotel][i].T
for cell in range(1,len(DF[regionsnotel][i])):
if DF[regionsnotel][i].index[cell] - DF[regionsnotel][i].index[cell-1] == weekdelta:
DF[regionsnotel][i][prevSWE][cell] = DF[regionsnotel][i][SWE][cell-1]
for i in Regions:
Prev_SWE_Snotel_Dict(RegionTrain, i)
RegionTrain['N_Sierras_Snotel']['CDEC:ADM'].head()
Feature Engineering: Delta SNOTEL SWE#
Using the observations from the current and previous week, we calculate the difference to capture the trend, either positive or negative, in SWE dynamics (i.e, melt or accumulation) with respect to each monitoring station ( \(\Delta\) SWE).
def Delta_SWE_Snotel_Dict(DF, region):
# print(region)
regionsnotel = region+'_Snotel'
sites = DF[regionsnotel].keys()
for i in tqdm(sites):
# print(i)
prevSWE = 'Prev_SWE_' + i
SWE = 'SWE_'+i
Delta_SWE = 'Delta_'+SWE
DF[regionsnotel][i][Delta_SWE] = DF[regionsnotel][i][SWE] - DF[regionsnotel][i][prevSWE]
DF[regionsnotel][i].loc[DF[regionsnotel][i][Delta_SWE]>150, Delta_SWE] =-9999.99
#Add Delta SWE feature
for i in Regions:
Delta_SWE_Snotel_Dict(RegionTrain, i)
RegionTrain['N_Sierras_Snotel']['CDEC:ADM'].head()
Connect Snotel Observations to NASA ASO#
Connect the DataFrame of NASA ASO observations with the SNOTEL observations.
#make a df for training each region,
for R in tqdm(Regions):
snotels = R+'_Snotel'
RegionTrain[R] = RegionTrain[R].reset_index()
RegionTrain[R] = RegionTrain[R].set_index('Date')
for S in RegionTrain[snotels]:
RegionTrain[R]= pd.concat([RegionTrain[R], RegionTrain[snotels][S].reindex(RegionTrain[R].index)], axis=1)
RegionTrain[R] = RegionTrain[R].fillna(-9999)
#Remove unnecessary features
for region in Regions:
RegionTrain[region] = RegionTrain[region].drop( RegionTrain[region].filter(regex='elevation_m_').columns, axis=1)
RegionTrain[region] = RegionTrain[region].drop( RegionTrain[region].filter(regex='northness_').columns, axis=1)
RegionTrain[region] = RegionTrain[region].T.drop_duplicates().T
Target Site’s Previous SWE#
Because of the serial correlation of snow accumulation and melt with the current timestep's SWE prediction, the model uses the SWE estimate from the previous week as a feature (Previous SWE). For model training and testing, the Previous SWE input comes from the NASA ASO datasets.
def Prev_SWE(df, region):
print(region)
df = df[region].reset_index()
df = df.set_index('cell_id')
#week delta
weekdelta = pd.Timedelta(7, "d")
#set up column for the previous week's SWE
df['prev_SWE'] = -9999.99
#need to find the number of columns for ifelse
dfcols = len(df.columns)
#Run through each unique site/cell id to calculate the previous week's SWE and add to a new dataframe
new_df = pd.DataFrame(columns = df.columns)
#find unique sites
sites = df.index.unique()
for i in tqdm(sites):
site = df.loc[i].copy()
#if only one observation need to fix
if site.shape == (dfcols,):# and len(site) < 162:
site = site.to_frame().T
for cell in range(1,len(site)):
if site['Date'][cell] - site['Date'][cell-1] == weekdelta:
site['prev_SWE'][cell] = site['SWE'][cell-1]
dflist = [new_df, site]
new_df = pd.concat(dflist)
new_df = new_df.fillna(-9999)
#Put Prev_SWE next to SWE to confirm operations
prev_SWE = new_df['prev_SWE'].copy()
del new_df['prev_SWE']
new_df.insert(loc = 5,
column = 'prev_SWE',
value = prev_SWE)
return new_df
#Run the previous SWE function and save the dataframes; you will then be ready to train the model.
training_df = {}
for region in Regions:
training_df[region] = Prev_SWE(RegionTrain, region)
training_df['N_Sierras'].head()
training_df['S_Sierras'].head()