introduction

In this lab, we’ll explore the basics of manipulating vector data in R using the sf package. The following materials are modified from Chapter 3 of Geocomputation with R by Rovin Lovelace

prerequisites

rm(list = ls())
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(spData)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
##      (status 2 uses the sf package in place of rgdal)
library(tmap)
## 
## Attaching package: 'tmap'
## The following object is masked from 'package:datasets':
## 
##     rivers
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

handling sf objects

Let’s start by looking at how we can construct a sf object
- first we create a geometry for London by supplying a point and CRS
- then we supply some non-geographic attributes

lnd_point = st_point(c(0.1, 51.5))
lnd_geom = st_sfc(lnd_point, crs = 4326)

lnd_attrib = data.frame(
  name = "London",
  temperature = 25,
  date = as.Date("2017-06-21")
)
lnd_sf = st_sf(lnd_attrib, geometry = lnd_geom)

lnd_sf
## Simple feature collection with 1 feature and 3 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 0.1 ymin: 51.5 xmax: 0.1 ymax: 51.5
## Geodetic CRS:  WGS 84
##     name temperature       date         geometry
## 1 London          25 2017-06-21 POINT (0.1 51.5)
class(lnd_sf)
## [1] "sf"         "data.frame"

We can also check out what the CRS looks like.

st_crs(lnd_sf)
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     ENSEMBLE["World Geodetic System 1984 ensemble",
##         MEMBER["World Geodetic System 1984 (Transit)"],
##         MEMBER["World Geodetic System 1984 (G730)"],
##         MEMBER["World Geodetic System 1984 (G873)"],
##         MEMBER["World Geodetic System 1984 (G1150)"],
##         MEMBER["World Geodetic System 1984 (G1674)"],
##         MEMBER["World Geodetic System 1984 (G1762)"],
##         MEMBER["World Geodetic System 1984 (G2139)"],
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ENSEMBLEACCURACY[2.0]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]
st_crs(lnd_sf)$IsGeographic
## [1] TRUE
st_crs(lnd_sf)$proj4string
## [1] "+proj=longlat +datum=WGS84 +no_defs"

Now let’s look at an existing sf object representing countries of the world

class(world)
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"
dim(world)
## [1] 177  11
names(world)
##  [1] "iso_a2"    "name_long" "continent" "region_un" "subregion" "type"     
##  [7] "area_km2"  "pop"       "lifeExp"   "gdpPercap" "geom"

We can see that this object contains both spatial data (“geom” column) and attributes about those geometries. We can perform operations on the attribute data, just like we would with a normal data frame.

summary(world$lifeExp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   50.62   64.96   72.87   70.85   76.78   83.59      10

The geometry column is “sticky”, meaning it will stick around unless we explicitly get rid of it. To convert this object into a data frame, we need to drop the geometry column.

world_df <- st_drop_geometry(world)
class(world_df)
## [1] "tbl_df"     "tbl"        "data.frame"
names(world_df)
##  [1] "iso_a2"    "name_long" "continent" "region_un" "subregion" "type"     
##  [7] "area_km2"  "pop"       "lifeExp"   "gdpPercap"
ncol(world)
## [1] 11
ncol(world_df)
## [1] 10

vector attribute subsetting

The especially great things about sf objects is that we can use tidyverse functions on them!

We can select columns…

world %>%
  select(name_long, pop)
## Simple feature collection with 177 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS:  WGS 84
## # A tibble: 177 × 3
##    name_long              pop                                               geom
##    <chr>                <dbl>                                 <MULTIPOLYGON [°]>
##  1 Fiji                885806 (((-180 -16.55522, -179.9174 -16.50178, -179.7933…
##  2 Tanzania          52234869 (((33.90371 -0.95, 31.86617 -1.02736, 30.76986 -1…
##  3 Western Sahara          NA (((-8.66559 27.65643, -8.817828 27.65643, -8.7948…
##  4 Canada            35535348 (((-132.71 54.04001, -133.18 54.16998, -133.2397 …
##  5 United States    318622525 (((-171.7317 63.78252, -171.7911 63.40585, -171.5…
##  6 Kazakhstan        17288285 (((87.35997 49.21498, 86.82936 49.82667, 85.54127…
##  7 Uzbekistan        30757700 (((55.96819 41.30864, 57.09639 41.32231, 56.93222…
##  8 Papua New Guinea   7755785 (((141.0002 -2.600151, 141.0171 -5.859022, 141.03…
##  9 Indonesia        255131116 (((104.37 -1.084843, 104.0108 -1.059212, 103.4376…
## 10 Argentina         42981515 (((-68.63401 -52.63637, -68.63335 -54.8695, -67.5…
## # ℹ 167 more rows

Or remove columns…

world %>%
  select(-subregion, -area_km2)
## Simple feature collection with 177 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS:  WGS 84
## # A tibble: 177 × 9
##    iso_a2 name_long        continent   region_un type      pop lifeExp gdpPercap
##    <chr>  <chr>            <chr>       <chr>     <chr>   <dbl>   <dbl>     <dbl>
##  1 FJ     Fiji             Oceania     Oceania   Sove…  8.86e5    70.0     8222.
##  2 TZ     Tanzania         Africa      Africa    Sove…  5.22e7    64.2     2402.
##  3 EH     Western Sahara   Africa      Africa    Inde… NA         NA         NA 
##  4 CA     Canada           North Amer… Americas  Sove…  3.55e7    82.0    43079.
##  5 US     United States    North Amer… Americas  Coun…  3.19e8    78.8    51922.
##  6 KZ     Kazakhstan       Asia        Asia      Sove…  1.73e7    71.6    23587.
##  7 UZ     Uzbekistan       Asia        Asia      Sove…  3.08e7    71.0     5371.
##  8 PG     Papua New Guinea Oceania     Oceania   Sove…  7.76e6    65.2     3709.
##  9 ID     Indonesia        Asia        Asia      Sove…  2.55e8    68.9    10003.
## 10 AR     Argentina        South Amer… Americas  Sove…  4.30e7    76.3    18798.
## # ℹ 167 more rows
## # ℹ 1 more variable: geom <MULTIPOLYGON [°]>

Or select AND rename columns

world %>%
  select(name = name_long, population = pop)
## Simple feature collection with 177 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS:  WGS 84
## # A tibble: 177 × 3
##    name             population                                              geom
##    <chr>                 <dbl>                                <MULTIPOLYGON [°]>
##  1 Fiji                 885806 (((-180 -16.55522, -179.9174 -16.50178, -179.793…
##  2 Tanzania           52234869 (((33.90371 -0.95, 31.86617 -1.02736, 30.76986 -…
##  3 Western Sahara           NA (((-8.66559 27.65643, -8.817828 27.65643, -8.794…
##  4 Canada             35535348 (((-132.71 54.04001, -133.18 54.16998, -133.2397…
##  5 United States     318622525 (((-171.7317 63.78252, -171.7911 63.40585, -171.…
##  6 Kazakhstan         17288285 (((87.35997 49.21498, 86.82936 49.82667, 85.5412…
##  7 Uzbekistan         30757700 (((55.96819 41.30864, 57.09639 41.32231, 56.9322…
##  8 Papua New Guinea    7755785 (((141.0002 -2.600151, 141.0171 -5.859022, 141.0…
##  9 Indonesia         255131116 (((104.37 -1.084843, 104.0108 -1.059212, 103.437…
## 10 Argentina          42981515 (((-68.63401 -52.63637, -68.63335 -54.8695, -67.…
## # ℹ 167 more rows

Or filter observations based on variables

world1 <- world %>%
  filter(area_km2 < 10000)
summary(world1$area_km2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2417    4412    6207    5986    7614    9225
world2 <- world %>%
  filter(lifeExp >= 80)
nrow(world2)
## [1] 24

chaining commands with pipes

Because we can use dplyr functions with sf objects, we can chain together commands using the pipe operator.

Let’s try to find the country in Asia with the highest life expectancy

world %>%
  filter(continent == "Asia") %>%
  select(name_long, continent, lifeExp) %>%
  slice_max(lifeExp)
## Simple feature collection with 1 feature and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 129.4085 ymin: 31.02958 xmax: 145.5431 ymax: 45.55148
## Geodetic CRS:  WGS 84
## # A tibble: 1 × 4
##   name_long continent lifeExp                                               geom
##   <chr>     <chr>       <dbl>                                 <MULTIPOLYGON [°]>
## 1 Japan     Asia         83.6 (((132.3712 33.46364, 132.3631 32.98938, 133.0149…

vector attribute aggregation

Aggregation is the process of summarizing data with one or more ‘grouping’ variables. For example, using the ‘world’ which provides information on countries of the world, we might want to aggregate to the level of continents. It is important to note that aggregating data attributes is a different process from aggregating geographic data, which we will cover later.

Let’s try to find the total population within each continent.

world %>%
  group_by(continent) %>%
  summarize(population = sum(pop, na.rm = TRUE))
## Simple feature collection with 8 features and 2 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS:  WGS 84
## # A tibble: 8 × 3
##   continent               population                                        geom
##   <chr>                        <dbl>                              <GEOMETRY [°]>
## 1 Africa                  1154946633 MULTIPOLYGON (((36.86623 22, 36.69069 22.2…
## 2 Antarctica                       0 MULTIPOLYGON (((-180 -89.9, 180 -89.9, 180…
## 3 Asia                    4311408059 MULTIPOLYGON (((36.14976 35.82153, 35.9050…
## 4 Europe                   669036256 MULTIPOLYGON (((26.29 35.29999, 25.74502 3…
## 5 North America            565028684 MULTIPOLYGON (((-82.26815 23.18861, -82.51…
## 6 Oceania                   37757833 MULTIPOLYGON (((166.7932 -15.66881, 167.00…
## 7 Seven seas (open ocean)          0 POLYGON ((68.935 -48.625, 68.8675 -48.83, …
## 8 South America            412060811 MULTIPOLYGON (((-66.95992 -54.89681, -66.4…

Let’s also find the total area and number of countries in each continent

world %>%
  group_by(continent) %>%
  summarize(population = sum(pop, na.rm = TRUE),
            area_km2 = sum(area_km2, na.rm = TRUE),
            n_countries = n())
## Simple feature collection with 8 features and 4 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS:  WGS 84
## # A tibble: 8 × 5
##   continent            population area_km2 n_countries                      geom
##   <chr>                     <dbl>    <dbl>       <int>            <GEOMETRY [°]>
## 1 Africa               1154946633   2.99e7          51 MULTIPOLYGON (((36.86623…
## 2 Antarctica                    0   1.23e7           1 MULTIPOLYGON (((-180 -89…
## 3 Asia                 4311408059   3.13e7          47 MULTIPOLYGON (((36.14976…
## 4 Europe                669036256   2.31e7          39 MULTIPOLYGON (((26.29 35…
## 5 North America         565028684   2.45e7          18 MULTIPOLYGON (((-82.2681…
## 6 Oceania                37757833   8.50e6           7 MULTIPOLYGON (((166.7932…
## 7 Seven seas (open oc…          0   1.16e4           1 POLYGON ((68.935 -48.625…
## 8 South America         412060811   1.78e7          13 MULTIPOLYGON (((-66.9599…

Building on this, let’s find the population density of each continent, find the continent’s with highest density and arrange by the number of countries. WE’ll drop the geometry column to speed things up.

world %>%
  st_drop_geometry() %>%
  group_by(continent) %>%
  summarize(population = sum(pop, na.rm = TRUE),
            area_km2 = sum(area_km2, na.rm = TRUE),
            n_countries = n()) %>%
  mutate(density = round(population/area_km2)) %>%
  slice_max(density, n = 3) %>%
  arrange(desc(n_countries))
## # A tibble: 3 × 5
##   continent population  area_km2 n_countries density
##   <chr>          <dbl>     <dbl>       <int>   <dbl>
## 1 Africa    1154946633 29946198.          51      39
## 2 Asia      4311408059 31252459.          47     138
## 3 Europe     669036256 23065219.          39      29

vector attribute joining

A critical part of many data science workflows is combining data sets based on common attributes. In R, we do this using multiple join functions, which follow SQL conventions.

Let’s start by looking a data set on national coffee production

head(coffee_data)
## # A tibble: 6 × 3
##   name_long                coffee_production_2016 coffee_production_2017
##   <chr>                                     <int>                  <int>
## 1 Angola                                       NA                     NA
## 2 Bolivia                                       3                      4
## 3 Brazil                                     3277                   2786
## 4 Burundi                                      37                     38
## 5 Cameroon                                      8                      6
## 6 Central African Republic                     NA                     NA
nrow(coffee_data)
## [1] 47
nrow(world)
## [1] 177

We can combine this with the world data set, but joining based on country’s names

world_coffee <- left_join(world, coffee_data,
                          by = "name_long")

names(world_coffee)
##  [1] "iso_a2"                 "name_long"              "continent"             
##  [4] "region_un"              "subregion"              "type"                  
##  [7] "area_km2"               "pop"                    "lifeExp"               
## [10] "gdpPercap"              "geom"                   "coffee_production_2016"
## [13] "coffee_production_2017"

And plot what this looks like…

tm_shape(world_coffee) +
  tm_polygons(fill = "coffee_production_2017")

If we just wanted to keep countries that do have coffee data, we could use an inner join

world_coffee_inner <- inner_join(world, coffee_data)
## Joining with `by = join_by(name_long)`
nrow(world_coffee_inner)
## [1] 45

It looks like we lost some countries with coffee data, so let’s figure out what’s going on. We can find rows that didn’t match using the setdiff function.

setdiff(coffee_data$name_long, world$name_long)
## [1] "Congo, Dem. Rep. of" "Others"

We see that one of the issues is that the two data sets use different naming conventions for the Democratic Republic of the Congo. We can use a string matching function to figure out what the DRC is called in the world data set.

drc = stringr::str_subset(world$name_long, "Dem*.+Congo")

Now we can update the coffee data set with the matching name for the DRC.

coffee_data$name_long[grepl("Congo,", coffee_data$name_long)] = drc

And we can try the inner join again and hopefully the DRC now matches.

world_coffee_inner <- inner_join(world, coffee_data)
## Joining with `by = join_by(name_long)`
nrow(world_coffee_inner)
## [1] 46

Let’s visualize what a the inner join did to our spatial object

tm_shape(world_coffee_inner) +
  tm_polygons(fill = "coffee_production_2017",
              title = "Coffee production (2017)") +
  tm_layout(legend.outside = TRUE)
## Deprecated tmap v3 code detected. Code translated to v4

And let’s test what would happen if we flipped the order of the data sets in the join

coffee_world = left_join(coffee_data, world, by = "name_long")
class(coffee_world)
## [1] "tbl_df"     "tbl"        "data.frame"
names(coffee_world)
##  [1] "name_long"              "coffee_production_2016" "coffee_production_2017"
##  [4] "iso_a2"                 "continent"              "region_un"             
##  [7] "subregion"              "type"                   "area_km2"              
## [10] "pop"                    "lifeExp"                "gdpPercap"             
## [13] "geom"