introduction

In this lab, we’ll explore the basics of spatial and geometry operations on vector data in R using the sf package. The following materials are modified from Chapter 4 and Chapter 5of Geocomputation with R by Rovin Lovelace.

prerequisites

rm(list = ls())
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(spData)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
##      (status 2 uses the sf package in place of rgdal)
library(tmap)
## 
## Attaching package: 'tmap'
## The following object is masked from 'package:datasets':
## 
##     rivers
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rmapshaper)
library(smoothr)
## 
## Attaching package: 'smoothr'
## 
## The following object is masked from 'package:stats':
## 
##     smooth

spatial subsetting

Spatial subsetting is the process of converting a spatial object into a new object containing only the features that relate in space to another object. This is analogous the attribute subsetting that we covered last week. There are many ways to spatially subset in R, so we will explore a few.

Let’s start by going back to the New Zealand datasets and find all the high points in the state of Canterbury.

The command

canterbury <- nz %>%
  filter(Name == "Canterbury")

# subsets nz_heights to just the features that intersect Canterbury
c_height <- nz_height[canterbury, ]

tm_shape(nz) +
  tm_polygons() +
  tm_shape(nz_height) +
  tm_dots(fill = "red")

tm_shape(nz) +
  tm_polygons() +
  tm_shape(canterbury) +
  tm_polygons(fill = "blue") +
  tm_shape(c_height) +
  tm_dots(fill = "red")

The default is to subset to features that intersect, but we can use other operations, including finding features that do not intersect.

outside_height <- nz_height[canterbury, , op = st_disjoint]

tm_shape(nz) +
  tm_polygons() +
  tm_shape(canterbury) +
  tm_polygons(fill = "blue") +
  tm_shape(outside_height) +
  tm_dots(fill = "red")

We can perform the same operations using topological operators. These operators return matrices testing the relationship between features.

sel_sgbp <- st_intersects(x = nz_height, y = canterbury)
sel_sgbp
## Sparse geometry binary predicate list of length 101, where the
## predicate was `intersects'
## first 10 elements:
##  1: (empty)
##  2: (empty)
##  3: (empty)
##  4: (empty)
##  5: 1
##  6: 1
##  7: 1
##  8: 1
##  9: 1
##  10: 1
sel_logical <- lengths(sel_sgbp) > 0

c_height2 <- nz_height[sel_logical, ]

tm_shape(nz) +
  tm_polygons() +
  tm_shape(canterbury) +
  tm_polygons(fill = "blue") +
  tm_shape(c_height2) +
  tm_dots(fill = "red")

We can also use the st_filter function in sf

c_height3 <- nz_height %>%
  st_filter(y = canterbury, .predicate = st_intersects)

tm_shape(nz) +
  tm_polygons() +
  tm_shape(canterbury) +
  tm_polygons(fill = "blue") +
  tm_shape(c_height3) +
  tm_dots(fill = "red")

We can change the predicate option to test subset to features that don’t intersect

outside_height2 <- nz_height %>%
  st_filter(y = canterbury, .predicate = st_disjoint)

tm_shape(nz) +
  tm_polygons() +
  tm_shape(canterbury) +
  tm_polygons(fill = "blue") +
  tm_shape(outside_height2) +
  tm_dots(fill = "red")

spatial joining

Where attribute joining depends on both data sets sharing a ‘key’ variable, spatial joining uses the same concept but depends on spatial relationships between data sets.

Let’s test this out by creating 50 points randomly distributed across the world and finding out what countries they call in.

set.seed(2018)
bb <- st_bbox(world)

random_df <- data.frame(
  x = runif(n = 10, min = bb[1], max = bb[3]),
  y = runif(n = 10, min = bb[2], max = bb[4])
)

random_points <- random_df %>%
  st_as_sf(coords = c("x", "y")) %>%
  st_set_crs("EPSG:4326")

tm_shape(world) +
  tm_fill() +
  tm_shape(random_points) +
  tm_dots(fill = "red")

Let’s first use spatial subsetting to find just the countries that contain random points.

world_random <- world[random_points, ]

tm_shape(world) +
  tm_fill() +
  tm_shape(world_random) +
  tm_fill(fill = "red")

Now let’s perform a spatial join to add the info from each country that a point falls into onto the point dataset.

random_joined  <- st_join(random_points, world)

tm_shape(world) +
  tm_fill() +
  tm_shape(random_joined) +
  tm_dots(fill = "name_long")

By default, st_join performs a left join. We change this and instead perform an inner join.

random_joined_inner <- st_join(random_points, world, left = FALSE)

non-overlapping joins

Sometimes we might want join geographic datasets that are strongly related, but do not have overlapping geometries. To demonstrate this, let’s look at data on cycle hire points in London.

tmap_mode("view")
## tmap mode set to 'view'
tm_shape(cycle_hire) +
  tm_dots(col = "blue", alpha = 0.5) +
  tm_shape(cycle_hire_osm) +
  tm_dots(col = "red", alpha = 0.5)
## Deprecated tmap v3 code detected. Code translated to v4
## Deprecated tmap v3 code detected. Code translated to v4

We can check if any of these points overlap.

any(st_touches(cycle_hire, cycle_hire_osm, sparse = FALSE))
## [1] FALSE

Let’s say we need to join the ‘capacity’ variable in ‘cycle_hire_osm’ onto the official ‘target’ data in ‘cycle_hire’. The simplest method is using the topological operator st_is_within_distance().

sel <- st_is_within_distance(cycle_hire, cycle_hire_osm, dist = 20)

summary(lengths(sel) > 0) #summarizes the number of points within 20 meters
##    Mode   FALSE    TRUE 
## logical     304     438

Now, we’d like to add the values from ‘cycle_hire_osm’ onto the ‘cycle_hire’ points.

z <- st_join(cycle_hire, cycle_hire_osm, st_is_within_distance, dist = 20)

nrow(cycle_hire)
## [1] 742
nrow(z)
## [1] 762

Note: the number of rows of the join is larger than the number of rows in the original dataset. Why? Because some points in ‘cycle_hire’ were within 20 meters of multiple points in ‘cycle_hire_osm’. If we wanted to aggregate so we have just one value per original point, we can use the aggregation methods from last week.

z <- z %>%
  group_by(id) %>%
  summarise(capacity = mean(capacity))

spatial aggregation

Similar to attribute data aggregation, spatial data aggregation condenses data (we end up with fewer rows than we started with).

Let’s say we wanted to find the average height of high point in each region of New Zealand. We could use the aggregate function in base R.

nz_agg = aggregate(x = nz_height, by = nz, FUN = mean)

The result of this is an object with the same geometries as the aggregating feature data set (in this case ‘nz’).

nz_agg
## Simple feature collection with 16 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 1090144 ymin: 4748537 xmax: 2089533 ymax: 6191874
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
## First 10 features:
##    t50_fid elevation                       geometry
## 1       NA        NA MULTIPOLYGON (((1745493 600...
## 2       NA        NA MULTIPOLYGON (((1803822 590...
## 3  2408405  2734.333 MULTIPOLYGON (((1860345 585...
## 4       NA        NA MULTIPOLYGON (((2049387 583...
## 5       NA        NA MULTIPOLYGON (((2024489 567...
## 6       NA        NA MULTIPOLYGON (((2024489 567...
## 7       NA        NA MULTIPOLYGON (((1740438 571...
## 8  2408394  2777.000 MULTIPOLYGON (((1866732 566...
## 9       NA        NA MULTIPOLYGON (((1881590 548...
## 10 2368390  2889.455 MULTIPOLYGON (((1557042 531...

We could also use a sf/dplyr approach.

nz_agg <- st_join(nz, nz_height) %>%
  group_by(Name) %>%
  summarise(elevation = mean(elevation, na.rm = TRUE))
nz_agg
## Simple feature collection with 16 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 1090144 ymin: 4748537 xmax: 2089533 ymax: 6191874
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
## # A tibble: 16 × 3
##    Name              elevation                                              geom
##    <chr>                 <dbl>                                <MULTIPOLYGON [m]>
##  1 Auckland               NaN  (((1803822 5900006, 1791443 5900571, 1790082 588…
##  2 Bay of Plenty          NaN  (((2049387 5832785, 2051016 5826423, 2040276 582…
##  3 Canterbury            2995. (((1686902 5353233, 1679996 5344809, 1673699 532…
##  4 Gisborne               NaN  (((2024489 5674920, 2019037 5677334, 2016277 568…
##  5 Hawke's Bay            NaN  (((2024489 5674920, 2024126 5663676, 2032576 565…
##  6 Manawatu-Wanganui     2777  (((1866732 5664323, 1868949 5654440, 1865829 564…
##  7 Marlborough           2720  (((1686902 5353233, 1679241 5359478, 1667754 535…
##  8 Nelson                 NaN  (((1624866 5417556, 1616643 5424521, 1618569 542…
##  9 Northland              NaN  (((1745493 6001802, 1740539 5995066, 1733165 598…
## 10 Otago                 2825  (((1335205 5126878, 1336956 5118634, 1325903 510…
## 11 Southland             2723  (((1229078 5062352, 1221427 5056736, 1217551 503…
## 12 Taranaki               NaN  (((1740438 5714538, 1743867 5711520, 1755759 571…
## 13 Tasman                 NaN  (((1616643 5424521, 1624866 5417556, 1620946 540…
## 14 Waikato               2734. (((1860345 5859665, 1857808 5853929, 1850511 584…
## 15 Wellington             NaN  (((1881590 5489434, 1875693 5479987, 1871588 546…
## 16 West Coast            2889. (((1557042 5319333, 1554239 5309440, 1546356 530…

joining incongruent layers

We might want to aggregate data to geometries that are not congruent (i.e. their boundaries don’t line up). This causes issues when we think about how to summarize associated values.

head(incongruent)
## Simple feature collection with 6 features and 2 fields
## Attribute-geometry relationships: aggregate (1), NA's (1)
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 418580 ymin: 443703 xmax: 422963 ymax: 446978
## Projected CRS: OSGB 1936 / British National Grid
##         level    value                       geometry
## 1 Incongruent 4.037919 MULTIPOLYGON (((420799.6 44...
## 2 Incongruent 5.014419 MULTIPOLYGON (((418664 4464...
## 3 Incongruent 4.933000 MULTIPOLYGON (((419964 4462...
## 4 Incongruent 5.120139 MULTIPOLYGON (((420368 4441...
## 5 Incongruent 6.548912 MULTIPOLYGON (((420419.8 44...
## 6 Incongruent 3.749791 MULTIPOLYGON (((421779 4451...
head(aggregating_zones)
## Simple feature collection with 2 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 417686.2 ymin: 443703.6 xmax: 422959.3 ymax: 447036.8
## Projected CRS: OSGB 1936 / British National Grid
##       geo_code geo_label geo_labelw                       geometry
## 5164 E02002332 Leeds 003       <NA> MULTIPOLYGON (((418731.9 44...
## 6631 E02002333 Leeds 004       <NA> MULTIPOLYGON (((419196.4 44...
tm_shape(incongruent) +
  tm_polygons() +
  tm_shape(aggregating_zones) +
  tm_borders(col = "red")

The simplest method for dealing with this is using area weighted spatial interpolation which transfers values from the ‘incongruent’ object to a new column in ‘aggregating_zones’ in proportion with the area of overlap.

iv <- incongruent["value"]
agg_aw <- st_interpolate_aw(iv, aggregating_zones, extensive = TRUE)
## Warning in st_interpolate_aw.sf(iv, aggregating_zones, extensive = TRUE):
## st_interpolate_aw assumes attributes are constant or uniform over areas of x
tm_shape(agg_aw) +
  tm_fill(fill = "value")

distance relationships

While topological relationships are binary (features either intersect or don’t), distance relationships are continuous.

We can use the following to find the distance between the highest point in NZ and the centroid of the Canterbury region.

nz_highest <- nz_height %>%
  slice_max(n = 1, order_by = elevation)

canterbury_centroid = st_centroid(canterbury)
## Warning: st_centroid assumes attributes are constant over geometries
st_distance(nz_highest, canterbury_centroid)
## Units: [m]
##        [,1]
## [1,] 115540

Note: this function returns distances with units (yay!) and as a matrix, meaning we could find the distance between many locations at once.

simplification

Simplification generalizes vector data (polygons and lines) to assist with plotting and reducing the amount of memory, disk space, and network bandwidth to handle a dataset.

Let’s try simplifying the US states using the Douglas-Peucker algorithm. GEOS assumes a projected CRS, so we first need to project the data, in this case into the US National Atlas Equal Area (epsg = 2163)

us_states2163 = st_transform(us_states, "EPSG:2163")
us_states2163 = us_states2163

us_states_simp1 = st_simplify(us_states2163, dTolerance = 100000)  # 100 km

tm_shape(us_states_simp1) +
  tm_polygons()

To preserve the states’ topology let’s use a simplify function from rmapshaper which uses Visalingam’s algorithm.

#install.packages('rmapshaper')
library(rmapshaper)

# proportion of points to retain (0-1; default 0.05)
us_states_simp2 = rmapshaper::ms_simplify(us_states2163, keep = 0.01,
                                          keep_shapes = TRUE)
## Warning in ms_de_unit(input): Coercing these 'units' columns to class numeric:
## AREA
tm_shape(us_states_simp2) +
  tm_polygons()

Instead of simplifying, we could try smoothing using Gaussian kernel regression.

us_states_simp3 = smoothr::smooth(us_states2163, method = 'ksmooth', smoothness = 6)

tm_shape(us_states_simp3) +
  tm_polygons()

centroids

Centroids identify the center of a spatial feature. Similar to taking an average, there are many ways to compute a centroid. The most common is the geographic centroid.

nz_centroid <- st_centroid(nz)
## Warning: st_centroid assumes attributes are constant over geometries
tm_shape(nz) +
  tm_fill() +
  tm_shape(nz_centroid) +
  tm_dots()

Sometimes centroids fall outside of the boundaries of the objects they were created from. In the case where we need them to fall inside of the feature, we can use point on surface methods.

nz_pos <- st_point_on_surface(nz)
## Warning: st_point_on_surface assumes attributes are constant over geometries
tm_shape(nz) +
  tm_fill() +
  tm_shape(nz_centroid) +
  tm_dots() +
  tm_shape(nz_pos) +
  tm_dots(fill = "red")

buffers

Buffers create polygons representing a set distance from a feature.

seine_buffer <- st_buffer(seine, dist = 5000)

tm_shape(seine_buffer) +
  tm_polygons()

unions

As we saw in the last lab, we can spatially aggregate without explicitly asking R to do so.

world %>%
  group_by(continent) %>%
  summarize(population = sum(pop, na.rm = TRUE))
## Simple feature collection with 8 features and 2 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS:  WGS 84
## # A tibble: 8 × 3
##   continent               population                                        geom
##   <chr>                        <dbl>                              <GEOMETRY [°]>
## 1 Africa                  1154946633 MULTIPOLYGON (((36.86623 22, 36.69069 22.2…
## 2 Antarctica                       0 MULTIPOLYGON (((-180 -89.9, 180 -89.9, 180…
## 3 Asia                    4311408059 MULTIPOLYGON (((36.14976 35.82153, 35.9050…
## 4 Europe                   669036256 MULTIPOLYGON (((26.29 35.29999, 25.74502 3…
## 5 North America            565028684 MULTIPOLYGON (((-82.26815 23.18861, -82.51…
## 6 Oceania                   37757833 MULTIPOLYGON (((166.7932 -15.66881, 167.00…
## 7 Seven seas (open ocean)          0 POLYGON ((68.935 -48.625, 68.8675 -48.83, …
## 8 South America            412060811 MULTIPOLYGON (((-66.95992 -54.89681, -66.4…

What is going on here? Behind the scenes, summarize() is using st_union() to dissolve the boundaries.

us_west <- us_states %>%
  filter(REGION == "West")

us_west_union <- st_union(us_west)

tm_shape(us_west_union) +
  tm_polygons()

st_union() can also take 2 geometries and unite them.

texas <-  us_states %>%
  filter(NAME == "Texas")

texas_union = st_union(us_west_union, texas)

tm_shape(texas_union) +
  tm_polygons()