Part III: Snapping Points to NHD Flowline Data and Calculating River Distance with riverdist
Though it’s a bit late in getting posted, this is part III in a spatial series that is becoming a quasi-workshop in the variety of ways we can work with spatial river & stream data in R. This post will continue using the sf
package, and introduce the riverdist
package, but we’ll still be playing with data on rivers and streams.
To recap, in the first few posts we made a leaflet map, scraped data from tables on a webpage, demonstrated how to calculate distance matrices between point data, and showed how to use functions written by folks at USGS (OWI) as well as the Hydrosheds data to download riverline data. That’s a fair amount to chew on I realize, but I’d like to demo a few additional things that might be useful for folks working with river data.
This post will (try to) show you:
- How to use the
riverdist
package to process a riverline network - How to snap spatial point data to the processed riverline data (which we downloaded in the previous post)
- Finally, how to calculate the river distance between sites (so along the river network)
- and a quick
ggmap
example
I think in a future post I’d like to demo how to implement a wavelet analysis of these flow data, as well as show an example of the very cool geoknife
package to assess patterns of river impairment. Onward!
So, hopefully these posts are useful and it doesn’t feel like this:
Load the Packages
The main packages I’m going to use in this post:
# load libraries
suppressMessages({
library(dplyr); # data munging and piping
library(purrr); # for list functions
library(ggplot2); # plotting
library(ggrepel) # for labeling
library(sf); # spatial simple features
library(USAboundaries); # state/county data
library(Imap); # nice mapping/color functions
#library(geoknife); # USGS tool set (Next post)
#library(dataRetrieval); # USGS tool set (next post)
#library(httr) # scraping webdata (to download flowlines)
})
# Note, to plot `sf` objects with `ggplot`, you need to have the most recent version (>2.2.1). A quick and easy check you can add to your code is an `if` statement: `if (utils::packageVersion("ggplot2") > "2.2.1"))`
Calculating River Distances
To use the riverdist
package, we need to load a shapefile with a river network of some type. This example uses NHD Streamline data, cropped to the North Fork American watershed. I downloaded the flowline data for the North Fork American watershed using the get_flowline
function from our previous post, at Stream Order 1, 2, 3, and 4. The function was from an excellent gist by Laura DeCicco (see here), and I modified it slightly to work with the sf
package.
To save time, I’ve put these data on github here for download. Let’s use HUC8 data from the previous post as well. Download it to a “data
” folder and the code below should work.
Load Flowline & HUC Data
Let’s take a look at the Streamline data for Stream Order=2
. Note the river network was downloaded using a bounding box or extent (max and min lat and lon values for a given shapefile), so it extends outside of the specific HUC8 watershed boundary we’re interested in.
# load the streamline data
load("data/nhd_rivs_ord_1_4.rda")
# load the hucs
huc8 <- read_sf(unzip("data/h8_AMR_BEA_YUB.zip"), quiet = F) %>%
st_transform(crs=4326) #%>%
## Reading layer `h8_AMR_BEA_YUB' from data source `/Users/ryanpeek/Documents/github/ryanpeek.github.io/h8_AMR_BEA_YUB.dbf' using driver `ESRI Shapefile'
## Simple feature collection with 4 features and 5 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -121.5976 ymin: 38.61455 xmax: -119.9829 ymax: 39.77696
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
# warning message here is ok!
# then remove raw files since file is added in memory
file.remove(list.files(pattern = "h8_AMR_BEA_YUB*",recursive = F))
## [1] TRUE TRUE TRUE TRUE
# filter to the NF American only
h8 <- huc8 %>% filter(HU_8_NAME=="North Fork American")
# plot
plot(st_geometry(h8), axes=F, lwd = 3)
plot(st_geometry(rivers2), col="blue", add=T)
title(main = "NF American Watershed & NHD Flowline (Stream order=2)", family="Roboto Condensed")
Crop and Save as Shapefile
In order to use the riverdist
package, we are going to want to trim our flowline data to a single watershed, and get rid of the streams outside of that boundary. Let’s crop the flowline data to the NF American Watershed using sf::st_intersection
. Then we can write the data as a shapefile.
# crop river data by huc8 data with sf function
rivs2 <- st_intersection(rivers2, h8)
# plot to check:
plot(st_geometry(h8), axes=F, lwd=3)
plot(st_geometry(rivs2), col="blue", add=T)
title(main = "NF American Watershed & NHD Flowline (Stream order=2)", family="Roboto Condensed")
# okay looks good lets save:
st_write(rivs2, "data/nfa_rivers2.shp", delete_dsn = T)
## Deleting source `/Users/ryanpeek/Documents/github/ryanpeek.github.io/data/nfa_rivers2.shp' using driver `ESRI Shapefile'
## Writing layer `nfa_rivers2' to data source `/Users/ryanpeek/Documents/github/ryanpeek.github.io/data/nfa_rivers2.shp' using driver `ESRI Shapefile'
## features: 654
## fields: 95
## geometry type: Line String
# "delete_dsn=T" will allow you to overwrite existing file so you don't get an error like this:
### Error in CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), :
### Dataset already exists.
You might see a warning message about attribute field names, that’s okay, it’s just making field names unique. The file should be where you saved it, and have 4 separate files (.dbf
, .prj
, .shp
, .shx
). There can be more associated with a shapefile, but these 4 are the core set that are required.
Clean River Network Topology
Now that we have a river network, we need to clean it and check for errors so that when riverdist
is routing things along the network, distances will be accurate.
The main steps are:
- Load data and transform into a projected coordinate system
- Plot and Check Topology
- Clean the network topology so we can snap points, save data
- Calculate distances along the river network between sites
Load Data and Project
First we need to load the package and project our data. riverdist
won’t work unless data are in a projected coordinate system. If you aren’t sure what the difference between projected (PCS) and geographic coordinate systems (GCS) are, here’s the quickest summary I can provide.
- GCS are based on a spherical earth model (3-dimensional), so use coordinates like latitude & longitude (a datum) in a 3-dimensional space to define locations based on a spheroid model (of which there are several). The datum is used to link the spheroid model to the earth’s surface.
- PCS project locations onto a 2-dimensional Cartesian coordinate plane. That’s why we refer to these as map projections. Different projections have different ways of stretching the 3-d earth’s surface onto a flat 2-d plane. But by putting things onto a flat 2-d surface, calculations (like calculating distances along a river network) are usually easier to do.
# load the package
library(riverdist)
# riverdist only works w/ projected coordinate systems...a few options:
## EPSG:102003 USA_Contiguous_Albers_Equal_Area_Conic
## +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
## EPSG:102004 USA_Contiguous_Lambert_Conformal_Conic
## EPSG:102005 USA_Contiguous_Equidistant_Conic
# read in the network and project with Albers Equal Area
rivs <- line2network(path="data/", layer="nfa_rivers2", reproject = "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs")
##
## Units: m
##
## Removed 36 segments with lengths shorter than the connectivity tolerance.
# should see message about "Removed 36 Segments w short lengths"
# plot of all the segments
plot(x=rivs)
Plot and Check Topology
Ok, so we should see a whole bunch of rainbow colored numbered segments. If we want to look at topology, we can plot the nodes (confluences). Green dots represent connections or confluences between segments, red dots are “headwaters” or isolated branches. Notice there are some red dots in some odd locations.
# check topology (nodes)
topologydots(rivers=rivs)
Cleanup: Insert Vertices and Identify River Mouth
The riverdist
tutorial/vignette is pretty good and will walk folks through the main steps. For larger river networks and the finer scale data you have, the more likely you can end up with weird and even persistent things you need to clean (stream braiding, unconnected segments, etc). Be aware this cleaning of topology is a stepwise process and it can take some time, but once you have everything cleaned up, you can save the data and re-use network in the future.
I’ll try to screen shot this process below. First let’s run the cleanup()
function.
# do a clean up:
rivs_fixed <- cleanup(rivers = rivs)
Initially the cleanup will identify all the segments and attempt to remove duplicate lines and dissolve segments that are split (uncessarily). When it asks whether to Dissolve
or not, select yes.
Dissolving...
Simplified from 615 to 243 segments.
Checking for segments with length less than the connectivity tolerance...
0 microsegments identified.
Checking if splitting segments is needed...
Checking between-vertex lengths...
Maximum distance between vertices is 772
Insert vertices to reduce distances between vertices and increase point snapping precision? (y/n) ?
The Insert vertices
question makes a more evenly distributed network so when we try snap points, there will be more vertices available in the network. The histogram provides a sense of the distribution of the segement sizes. I try to select something near the peak so most segments are a consistent size.
Select “y
”, and pick a number…I picked 100 meters for this example. The output will then say it’s inserting vertices, followed by:
Please identify segment number of river mouth:
Here I’ve selected segment 192, and then selected the mouth as 1. Once the river mouth has been identified, the program will ask if you want to remove any additional segments. I select n here, but many options.
Cleanup: Remove Additional Segments/Braiding
The cleanup function will then continue and ask if you want to remove additional segments, and check for braiding. Just walk through and determine what you want to keep or not. If there are braided segment, you’ll need to pick out the components you want to keep and which to delete. This can be a bit tedious (especially with very fine-scale datasets), but once you walk through, you’ll be able to save your updated network. Here I remove as many of the braided segments as I can (207, 213, 205, 204, 214, 138, 10, 11, 232, 230
).
Once done with removing braided segments, riverdist
will ask if you want to build segment routes. I select y here.
Save the Cleaned Data
According to riverdist
, we should save our cleaned topology file as a Rdata
file. Let’s save our file.
save(rivs_fixed, file = "data/nfa_rivs2_fixed.rda")
Plot Cleaned Topology
Let’s take a look at what we did, and why it matters, by plotting the original river network topology vs. the cleaned version.
load("data/nfa_rivs2_fixed.rda")
# now plot the old vs. the new
par(mfrow=c(1,2))
topologydots(rivers=rivs)
graphics::title("Raw River Topology", family="Roboto Condensed")
topologydots(rivs_fixed)
graphics::title("Clean River Topology", family="Roboto Condensed")
We can see that the Clean river topology has fewer nodes and is a cleaner network, which means it will be more accurate and faster for calculating network distances.
## null device
## 1
Snapping Points to Lines
Now that we have a river network, we can snap points to the nearest line, and use that to calculate distances between sites along the river network. First let’s grab some points! For this example let’s go back and grab all CDEC stations from the NF American watershed. To do that I searched for any/all stations in the American River basin using this interface. The resulting page I ended up with was this. Then I used the web-scraping technique I showed in the first post in this series) to pull that data in.
Get Point Data (for Snapping)
library(rvest)
url <- "https://cdec.water.ca.gov/cgi-progs/staSearch?sta=&sensor=211&dur=&active=&lon1=&lon2=&lat1=&lat2=&elev1=-5&elev2=99000&nearby=&basin_chk=on&basin=AMERICAN+R&hydro=CENTRAL+COAST&county=ALAMEDA&operator=%2B&display=sta"
df <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="main_content"]/div/div[1]/table') %>% # inspect and xpath with chrome
html_table()
df_locs <- df[[1]] # reads in as a list, so we are just eliminating the list and making this a dataframe
head(df_locs)
## ID Station Name River Basin County Longitude
## 1 ABN LAKE AUDRAIN AMERICAN R EL DORADO -120.0370
## 2 ACK ARCADE CK NR DEL PASO HTS AMERICAN R SACRAMENTO -121.3820
## 3 ADR AUBURN DAM RIDGE AMERICAN R PLACER -121.0450
## 4 AFD AMERICAN R BELOW FOLSOM DAM AMERICAN R SACRAMENTO -121.1667
## 5 AFO AMERICAN RIVER AT FAIR OAKS AMERICAN R SACRAMENTO -121.2277
## 6 AHZ AMERICAN R AT HAZEL AVE BRIDGE AMERICAN R SACRAMENTO -121.2240
## Latitude Elevation Feet Operator Map
## 1 38.82000 7300   Placerville Ranger District
## 2 38.63400 50   US Geological Survey
## 3 38.88200 1200   US Bureau of Reclamation
## 4 38.68830 220   US Geological Survey
## 5 38.63546 72   US Geological Survey
## 6 38.63600 100   US Geological Survey
dim(df_locs)
## [1] 139 9
# cleanup data
df_locs <- df_locs %>% rename(station="Station Name", river_basin="River Basin",
lon=Longitude, lat=Latitude, elev_ft=`Elevation Feet`) %>%
select(-Map)
# remove and filter out the weird UNIX code " "
df_locs$elev_ft <- as.numeric(gsub(pattern = " ", replacement = "", df_locs$elev_ft))
# make CDEC data sf object (spatial):
df_locs <- st_as_sf(df_locs,
coords = c("lon", "lat"), # for point data
remove = F, # don't remove these lat/lon cols from df
crs = 4326) # add projection (this is WGS84)
Sweet it works! We have 136 stations in the American watershed…let’s crop down to only the points that occur within our HUC8 watershed. First let’s check that there aren’t NA’s or weird locations in this set…so PLOT the data.
Plot Point Data
# PLOT data
plot(st_geometry(h8), lwd=2)
plot(st_geometry(rivs2), col="blue", add=T)
plot(st_geometry(df_locs), add=T, pch=16, col = "maroon")
graphics::title("All CDEC Stations in American Watershed", family="Roboto Condensed")
Ok, some outside of the watershed boundary. Let’s crop with st_intersection
again.
# let's crop points to only those within the watershed:
df_locs_NFA <- st_intersection(df_locs, h8)
# plot
plot(st_geometry(h8), lwd=2, axes=F)
plot(st_geometry(rivs2), col="blue", add=T)
plot(st_geometry(df_locs_NFA), add=T, pch=21, bg = "orange", cex = 1.5)
graphics::title("Selected CDEC Stations in NF American Watershed", family="Roboto Condensed")
Great, now we have 44 points within the watershed we can work with. Even better, you can see some do not appear to be on a flowline. The next steps will show how to snap these points to the flowline network.
Snap Points to Nearest FlowLine
Finally we can now snap our points to the river network! We need to project our data, so that it will line up with our cleaned river topology, add those projected X/Y data (in UTM), and then snap to our network.
df_locs_NFA <- df_locs_NFA %>%
st_transform(crs = 102003) # convert to Albers Equal Area
# add COORDS in projected form (UTM)
df_locs_NFA$X <- st_coordinates(df_locs_NFA)[,1]
df_locs_NFA$Y <- st_coordinates(df_locs_NFA)[,2]
# run this to snap points to line (and see distrib)
cdec_riv <- xy2segvert(x=df_locs_NFA$X, y=df_locs_NFA$Y, rivers=rivs_fixed)
head(cdec_riv)
## seg vert snapdist
## 1 208 5 699.23560
## 2 202 4 26.78865
## 3 1 95 138.29269
## 4 226 37 133.60740
## 5 51 182 3648.50085
## 6 94 1 789.67343
hist(cdec_riv$snapdist, breaks=50,main="Point Snapping distance (CDEC to Flowline)", col="skyblue3", xlab="Snapping Distance (m)", family="Roboto Condensed")
# add vertices
df_locs_NFA <- bind_cols(df_locs_NFA, cdec_riv)
# MAP IT
plot(x=rivs_fixed, color = FALSE)
points(df_locs_NFA$X, df_locs_NFA$Y, pch=16, col="red") # raw coords
riverpoints(seg=cdec_riv$seg, vert=cdec_riv$vert, rivers=rivs_fixed, pch=22, cex=1.5,
bg="forestgreen") # snapped coords
text(df_locs_NFA$X, df_locs_NFA$Y, labels=df_locs_NFA$vert, pos = 4, family="Roboto Condensed")
Great, now we have a set of points that represent CDEC station locations, snapped to the nearest NHD river segment.
Detect Routes
If we want to calculate the distance between two points and identify the route, we can use detectroute
. Just pick two segment numbers, and then it will return the steps required for the route.
# DETECT ROUTES
detectroute(start=189, end=6, rivers=rivs_fixed)
## [1] 189 187 185 190 200 202 210 209 197 208 201 207 206 205 18 17 6
# then use that information to pick the starting/end segments, and the vertices of interest.
riverdistance(startseg=189, startvert=185, endseg=6, endvert=1,
rivers=rivs_fixed, map=TRUE)
## [1] 41711.18
So the total distance between LOOMIS OBSERVATORY
(Station LMO
) and DEADMAN FLAT
(Station DNF
) is 41.7 km along this river network.
Create Distance Matrix
Once points are snapped to the line network, it is possible to extract distances between site pairs, and export. We create a matrix of distances, and then plot a heatmap below.
# CREATE MATRIX OF DISTS
dmat <- riverdistancemat(df_locs_NFA$seg, df_locs_NFA$vert, rivs_fixed, ID = df_locs_NFA$Locality ) %>% as.matrix
dmat <- dmat/1000 # convert to km
head(dmat)
## 1 2 3 4 5 6 7
## 1 0.000000 2.500407 8.177411 117.01273 74.46732 69.36178 74.46732
## 2 2.500407 0.000000 10.677818 119.51314 76.96773 71.86219 76.96773
## 3 8.177411 10.677818 0.000000 115.94033 66.28991 68.28938 66.28991
## 4 117.012730 119.513137 115.940329 0.00000 182.23024 99.81828 182.23024
## 5 74.467318 76.967725 66.289907 182.23024 0.00000 134.57929 0.00000
## 6 69.361785 71.862192 68.289384 99.81828 134.57929 0.00000 134.57929
## 8 9 10 11 12 13 14
## 1 79.778964 110.62785 97.99252 91.28091 80.41068 80.41068 67.90420
## 2 82.279371 113.12825 100.49292 93.78132 82.91108 82.91108 70.40460
## 3 71.601554 109.55545 96.92012 90.20851 79.33827 79.33827 66.83180
## 4 187.541883 23.27542 123.47621 116.76461 110.86717 110.86717 76.67260
## 5 5.311646 175.84535 163.21002 156.49842 145.62818 145.62818 133.12170
## 6 139.890937 93.43340 80.79807 74.08646 26.52852 26.52852 50.70975
## 15 16 17 18 19 20 21
## 1 80.68478 28.95369 72.70133 101.18455 33.40998 94.61948 96.02822
## 2 83.18518 31.45410 75.20174 103.68496 35.91039 97.11988 98.52863
## 3 79.61238 27.88129 71.62893 100.11215 32.33758 93.54708 94.95582
## 4 106.16847 106.11090 98.18503 54.15488 106.26351 22.39325 20.98451
## 5 145.90228 94.17120 137.91884 166.40206 98.62749 159.83698 161.24573
## 6 63.49033 58.45996 55.50688 83.99010 58.61256 77.42503 78.83377
## 22 23 24 25 26 27 28
## 1 89.96408 118.15806 12.75748 94.32181 101.13001 100.97388 129.77030
## 2 92.46449 120.65847 10.25708 96.82221 103.63041 103.47428 132.27071
## 3 81.78667 117.08566 20.93490 93.24941 100.05761 99.90147 128.69790
## 4 197.72700 12.13283 129.77021 47.29213 54.10033 53.94420 12.75757
## 5 44.42995 183.37557 87.22480 159.53931 166.34751 166.19138 194.98781
## 6 150.07605 100.96361 82.11927 77.12735 83.93555 83.77942 112.57585
## 29 30 31 32 33 34
## 1 79.45254 85.62167 104.28162 5.259094 8.10899320 103.61404
## 2 81.95295 88.12208 106.78203 2.758687 10.60940018 106.11444
## 3 71.27513 77.44426 103.20922 13.436505 0.06841759 95.43663
## 4 187.21546 193.38459 20.48887 122.271824 115.87191134 211.37696
## 5 4.98522 11.15435 169.49913 79.726413 66.35832501 58.07991
## 6 139.56451 145.73364 87.08717 74.620879 68.22096629 163.72601
## 35 36 37 38 39 40 41
## 1 41.50733 89.54933 122.909214 92.08730 56.43929 101.13001 52.39717
## 2 44.00773 92.04974 125.409621 94.58770 58.93969 103.63041 54.89758
## 3 40.43493 88.47693 121.836813 91.01489 48.26188 100.05761 44.21976
## 4 75.50540 42.51965 5.896484 45.05762 164.20220 54.10033 160.16009
## 5 106.72483 154.76684 188.126721 157.30480 73.09735 166.34751 69.05524
## 6 27.85446 72.35488 105.714763 74.89284 116.55126 83.93555 112.50914
## 42 43 44
## 1 85.46204 96.20213 94.07923
## 2 87.96245 98.70254 96.57964
## 3 77.28463 95.12973 85.90182
## 4 193.22496 121.68583 201.84215
## 5 10.99472 161.41964 48.54510
## 6 145.57401 79.00768 154.19121
# library for heatmap
library(fields)
image.plot(1:nrow(dmat), 1:ncol(dmat), dmat, axes = FALSE,
xlab="", ylab="", col = terrain.colors(50))
axis(1, 1:nrow(dmat), rownames(dmat), cex.axis = 0.7, las=3, family="Roboto Condensed")
axis(2, 1:nrow(dmat), colnames(dmat), cex.axis = 0.5, las=1, family="Roboto Condensed")
# this add exact value to each grid square:
#text(expand.grid(1:nrow(dmat), 1:ncol(dmat)), sprintf("%0.1f", dmat/1000), cex=0.6, family="Roboto Condensed")
graphics::title("River Distance Matrix (km) for CDEC Stations in NF American Watershed", cex.main=1, family="Roboto Condensed")
# write the data out
nfa_dists <- as.data.frame(dmat)
# write_tsv(nfa_dists, "data_output/mainstem_distance_matrix_NFA.txt")
Make a ggMap
Finally, we can map these data with a terrain or google map background. Just another alternative to some of the other options shown thus far.
# SIMPLE MAP --------------------------------------------------------------
library(ggmap) # doesn't plot with SF so need to convert
location<-c(mean(st_coordinates(h8)[,1]),
mean(st_coordinates(h8)[,2]))
map1 <- get_map(location=location,crop = F,
color="bw",
maptype="terrain",
source="google",
zoom=9)
terrain_bg <-ggmap(map1)
# convert to GCS (WGS84)
df_locs_NFA <-df_locs_NFA %>% st_transform(crs = 4326)
# convert river network:
rivs_nfa <- st_read("data/nfa_rivers2.shp", quiet = T) %>% st_transform(crs=4326) %>%
as("Spatial") %>% fortify
# convert h8_nfa
h8_nfa.sp <- h8 %>% as("Spatial") %>% fortify
terrain_bg +
scale_color_viridis_d() +
geom_path(data=rivs_nfa, aes(x=long, y=lat, group=group), col="blue")+
labs(x="Longitude (WGS84)", y="Latitude") + theme_bw() +
geom_polygon(data=h8_nfa.sp, aes(x=long, y=lat, group=group), fill=NA, color="maroon", lwd=1.2, alpha=0.5) +
geom_point(data=df_locs_NFA, aes(x=lon, y=lat), fill="yellow", pch=21, size=3) +
coord_fixed(ratio=1.3) +
ggtitle("CDEC Stations in the NF American Watershed, Snapped to NHD Flowline")
Summary
Thanks for making it to the end! If you are reading, sorry that was long, but hopefully it was interesting. There are lots of pieces that can be switched out, and you can imagine being able to snap survey sites, observations, etc. to a flowline network and calculating distances between these points. For some of my work, this is really useful for generating distance matrices that can be compared against population genomic data, when measuring metrics like FST (isolation).
All for now! Stay tuned for another post in a few weeks.