In this lab, we’ll explore the basics of manipulating vector data in R using the sf package. The following materials are modified from Chapter 3 of Geocomputation with R by Rovin Lovelace
rm(list = ls())
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(spData)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
## (status 2 uses the sf package in place of rgdal)
library(tmap)
##
## Attaching package: 'tmap'
## The following object is masked from 'package:datasets':
##
## rivers
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Let’s start by looking at how we can construct a sf
object
- first we create a geometry for London by supplying a point and
CRS
- then we supply some non-geographic attributes
lnd_point = st_point(c(0.1, 51.5))
lnd_geom = st_sfc(lnd_point, crs = 4326)
lnd_attrib = data.frame(
name = "London",
temperature = 25,
date = as.Date("2017-06-21")
)
lnd_sf = st_sf(lnd_attrib, geometry = lnd_geom)
lnd_sf
## Simple feature collection with 1 feature and 3 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 0.1 ymin: 51.5 xmax: 0.1 ymax: 51.5
## Geodetic CRS: WGS 84
## name temperature date geometry
## 1 London 25 2017-06-21 POINT (0.1 51.5)
class(lnd_sf)
## [1] "sf" "data.frame"
We can also check out what the CRS looks like.
st_crs(lnd_sf)
## Coordinate Reference System:
## User input: EPSG:4326
## wkt:
## GEOGCRS["WGS 84",
## ENSEMBLE["World Geodetic System 1984 ensemble",
## MEMBER["World Geodetic System 1984 (Transit)"],
## MEMBER["World Geodetic System 1984 (G730)"],
## MEMBER["World Geodetic System 1984 (G873)"],
## MEMBER["World Geodetic System 1984 (G1150)"],
## MEMBER["World Geodetic System 1984 (G1674)"],
## MEMBER["World Geodetic System 1984 (G1762)"],
## MEMBER["World Geodetic System 1984 (G2139)"],
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ENSEMBLEACCURACY[2.0]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["Horizontal component of 3D system."],
## AREA["World."],
## BBOX[-90,-180,90,180]],
## ID["EPSG",4326]]
st_crs(lnd_sf)$IsGeographic
## [1] TRUE
st_crs(lnd_sf)$proj4string
## [1] "+proj=longlat +datum=WGS84 +no_defs"
Now let’s look at an existing sf object representing countries of the world
class(world)
## [1] "sf" "tbl_df" "tbl" "data.frame"
dim(world)
## [1] 177 11
names(world)
## [1] "iso_a2" "name_long" "continent" "region_un" "subregion" "type"
## [7] "area_km2" "pop" "lifeExp" "gdpPercap" "geom"
We can see that this object contains both spatial data (“geom” column) and attributes about those geometries. We can perform operations on the attribute data, just like we would with a normal data frame.
summary(world$lifeExp)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 50.62 64.96 72.87 70.85 76.78 83.59 10
The geometry column is “sticky”, meaning it will stick around unless we explicitly get rid of it. To convert this object into a data frame, we need to drop the geometry column.
world_df <- st_drop_geometry(world)
class(world_df)
## [1] "tbl_df" "tbl" "data.frame"
names(world_df)
## [1] "iso_a2" "name_long" "continent" "region_un" "subregion" "type"
## [7] "area_km2" "pop" "lifeExp" "gdpPercap"
ncol(world)
## [1] 11
ncol(world_df)
## [1] 10
The especially great things about sf objects is that we can use tidyverse functions on them!
We can select columns…
world %>%
select(name_long, pop)
## Simple feature collection with 177 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 177 × 3
## name_long pop geom
## <chr> <dbl> <MULTIPOLYGON [°]>
## 1 Fiji 885806 (((-180 -16.55522, -179.9174 -16.50178, -179.7933…
## 2 Tanzania 52234869 (((33.90371 -0.95, 31.86617 -1.02736, 30.76986 -1…
## 3 Western Sahara NA (((-8.66559 27.65643, -8.817828 27.65643, -8.7948…
## 4 Canada 35535348 (((-132.71 54.04001, -133.18 54.16998, -133.2397 …
## 5 United States 318622525 (((-171.7317 63.78252, -171.7911 63.40585, -171.5…
## 6 Kazakhstan 17288285 (((87.35997 49.21498, 86.82936 49.82667, 85.54127…
## 7 Uzbekistan 30757700 (((55.96819 41.30864, 57.09639 41.32231, 56.93222…
## 8 Papua New Guinea 7755785 (((141.0002 -2.600151, 141.0171 -5.859022, 141.03…
## 9 Indonesia 255131116 (((104.37 -1.084843, 104.0108 -1.059212, 103.4376…
## 10 Argentina 42981515 (((-68.63401 -52.63637, -68.63335 -54.8695, -67.5…
## # ℹ 167 more rows
Or remove columns…
world %>%
select(-subregion, -area_km2)
## Simple feature collection with 177 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 177 × 9
## iso_a2 name_long continent region_un type pop lifeExp gdpPercap
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 FJ Fiji Oceania Oceania Sove… 8.86e5 70.0 8222.
## 2 TZ Tanzania Africa Africa Sove… 5.22e7 64.2 2402.
## 3 EH Western Sahara Africa Africa Inde… NA NA NA
## 4 CA Canada North Amer… Americas Sove… 3.55e7 82.0 43079.
## 5 US United States North Amer… Americas Coun… 3.19e8 78.8 51922.
## 6 KZ Kazakhstan Asia Asia Sove… 1.73e7 71.6 23587.
## 7 UZ Uzbekistan Asia Asia Sove… 3.08e7 71.0 5371.
## 8 PG Papua New Guinea Oceania Oceania Sove… 7.76e6 65.2 3709.
## 9 ID Indonesia Asia Asia Sove… 2.55e8 68.9 10003.
## 10 AR Argentina South Amer… Americas Sove… 4.30e7 76.3 18798.
## # ℹ 167 more rows
## # ℹ 1 more variable: geom <MULTIPOLYGON [°]>
Or select AND rename columns
world %>%
select(name = name_long, population = pop)
## Simple feature collection with 177 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 177 × 3
## name population geom
## <chr> <dbl> <MULTIPOLYGON [°]>
## 1 Fiji 885806 (((-180 -16.55522, -179.9174 -16.50178, -179.793…
## 2 Tanzania 52234869 (((33.90371 -0.95, 31.86617 -1.02736, 30.76986 -…
## 3 Western Sahara NA (((-8.66559 27.65643, -8.817828 27.65643, -8.794…
## 4 Canada 35535348 (((-132.71 54.04001, -133.18 54.16998, -133.2397…
## 5 United States 318622525 (((-171.7317 63.78252, -171.7911 63.40585, -171.…
## 6 Kazakhstan 17288285 (((87.35997 49.21498, 86.82936 49.82667, 85.5412…
## 7 Uzbekistan 30757700 (((55.96819 41.30864, 57.09639 41.32231, 56.9322…
## 8 Papua New Guinea 7755785 (((141.0002 -2.600151, 141.0171 -5.859022, 141.0…
## 9 Indonesia 255131116 (((104.37 -1.084843, 104.0108 -1.059212, 103.437…
## 10 Argentina 42981515 (((-68.63401 -52.63637, -68.63335 -54.8695, -67.…
## # ℹ 167 more rows
Or filter observations based on variables
world1 <- world %>%
filter(area_km2 < 10000)
summary(world1$area_km2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2417 4412 6207 5986 7614 9225
world2 <- world %>%
filter(lifeExp >= 80)
nrow(world2)
## [1] 24
Because we can use dplyr functions with sf objects, we can chain together commands using the pipe operator.
Let’s try to find the country in Asia with the highest life expectancy
world %>%
filter(continent == "Asia") %>%
select(name_long, continent, lifeExp) %>%
slice_max(lifeExp)
## Simple feature collection with 1 feature and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 129.4085 ymin: 31.02958 xmax: 145.5431 ymax: 45.55148
## Geodetic CRS: WGS 84
## # A tibble: 1 × 4
## name_long continent lifeExp geom
## <chr> <chr> <dbl> <MULTIPOLYGON [°]>
## 1 Japan Asia 83.6 (((132.3712 33.46364, 132.3631 32.98938, 133.0149…
Aggregation is the process of summarizing data with one or more ‘grouping’ variables. For example, using the ‘world’ which provides information on countries of the world, we might want to aggregate to the level of continents. It is important to note that aggregating data attributes is a different process from aggregating geographic data, which we will cover later.
Let’s try to find the total population within each continent.
world %>%
group_by(continent) %>%
summarize(population = sum(pop, na.rm = TRUE))
## Simple feature collection with 8 features and 2 fields
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 8 × 3
## continent population geom
## <chr> <dbl> <GEOMETRY [°]>
## 1 Africa 1154946633 MULTIPOLYGON (((36.86623 22, 36.69069 22.2…
## 2 Antarctica 0 MULTIPOLYGON (((-180 -89.9, 180 -89.9, 180…
## 3 Asia 4311408059 MULTIPOLYGON (((36.14976 35.82153, 35.9050…
## 4 Europe 669036256 MULTIPOLYGON (((26.29 35.29999, 25.74502 3…
## 5 North America 565028684 MULTIPOLYGON (((-82.26815 23.18861, -82.51…
## 6 Oceania 37757833 MULTIPOLYGON (((166.7932 -15.66881, 167.00…
## 7 Seven seas (open ocean) 0 POLYGON ((68.935 -48.625, 68.8675 -48.83, …
## 8 South America 412060811 MULTIPOLYGON (((-66.95992 -54.89681, -66.4…
Let’s also find the total area and number of countries in each continent
world %>%
group_by(continent) %>%
summarize(population = sum(pop, na.rm = TRUE),
area_km2 = sum(area_km2, na.rm = TRUE),
n_countries = n())
## Simple feature collection with 8 features and 4 fields
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 8 × 5
## continent population area_km2 n_countries geom
## <chr> <dbl> <dbl> <int> <GEOMETRY [°]>
## 1 Africa 1154946633 2.99e7 51 MULTIPOLYGON (((36.86623…
## 2 Antarctica 0 1.23e7 1 MULTIPOLYGON (((-180 -89…
## 3 Asia 4311408059 3.13e7 47 MULTIPOLYGON (((36.14976…
## 4 Europe 669036256 2.31e7 39 MULTIPOLYGON (((26.29 35…
## 5 North America 565028684 2.45e7 18 MULTIPOLYGON (((-82.2681…
## 6 Oceania 37757833 8.50e6 7 MULTIPOLYGON (((166.7932…
## 7 Seven seas (open oc… 0 1.16e4 1 POLYGON ((68.935 -48.625…
## 8 South America 412060811 1.78e7 13 MULTIPOLYGON (((-66.9599…
Building on this, let’s find the population density of each continent, find the continent’s with highest density and arrange by the number of countries. WE’ll drop the geometry column to speed things up.
world %>%
st_drop_geometry() %>%
group_by(continent) %>%
summarize(population = sum(pop, na.rm = TRUE),
area_km2 = sum(area_km2, na.rm = TRUE),
n_countries = n()) %>%
mutate(density = round(population/area_km2)) %>%
slice_max(density, n = 3) %>%
arrange(desc(n_countries))
## # A tibble: 3 × 5
## continent population area_km2 n_countries density
## <chr> <dbl> <dbl> <int> <dbl>
## 1 Africa 1154946633 29946198. 51 39
## 2 Asia 4311408059 31252459. 47 138
## 3 Europe 669036256 23065219. 39 29
A critical part of many data science workflows is combining data sets based on common attributes. In R, we do this using multiple join functions, which follow SQL conventions.
Let’s start by looking a data set on national coffee production
head(coffee_data)
## # A tibble: 6 × 3
## name_long coffee_production_2016 coffee_production_2017
## <chr> <int> <int>
## 1 Angola NA NA
## 2 Bolivia 3 4
## 3 Brazil 3277 2786
## 4 Burundi 37 38
## 5 Cameroon 8 6
## 6 Central African Republic NA NA
nrow(coffee_data)
## [1] 47
nrow(world)
## [1] 177
We can combine this with the world data set, but joining based on country’s names
world_coffee <- left_join(world, coffee_data,
by = "name_long")
names(world_coffee)
## [1] "iso_a2" "name_long" "continent"
## [4] "region_un" "subregion" "type"
## [7] "area_km2" "pop" "lifeExp"
## [10] "gdpPercap" "geom" "coffee_production_2016"
## [13] "coffee_production_2017"
And plot what this looks like…
tm_shape(world_coffee) +
tm_polygons(fill = "coffee_production_2017")
If we just wanted to keep countries that do have coffee data, we could use an inner join
world_coffee_inner <- inner_join(world, coffee_data)
## Joining with `by = join_by(name_long)`
nrow(world_coffee_inner)
## [1] 45
It looks like we lost some countries with coffee data, so let’s figure out what’s going on. We can find rows that didn’t match using the setdiff function.
setdiff(coffee_data$name_long, world$name_long)
## [1] "Congo, Dem. Rep. of" "Others"
We see that one of the issues is that the two data sets use different naming conventions for the Democratic Republic of the Congo. We can use a string matching function to figure out what the DRC is called in the world data set.
drc = stringr::str_subset(world$name_long, "Dem*.+Congo")
Now we can update the coffee data set with the matching name for the DRC.
coffee_data$name_long[grepl("Congo,", coffee_data$name_long)] = drc
And we can try the inner join again and hopefully the DRC now matches.
world_coffee_inner <- inner_join(world, coffee_data)
## Joining with `by = join_by(name_long)`
nrow(world_coffee_inner)
## [1] 46
Let’s visualize what a the inner join did to our spatial object
tm_shape(world_coffee_inner) +
tm_polygons(fill = "coffee_production_2017",
title = "Coffee production (2017)") +
tm_layout(legend.outside = TRUE)
## Deprecated tmap v3 code detected. Code translated to v4
And let’s test what would happen if we flipped the order of the data sets in the join
coffee_world = left_join(coffee_data, world, by = "name_long")
class(coffee_world)
## [1] "tbl_df" "tbl" "data.frame"
names(coffee_world)
## [1] "name_long" "coffee_production_2016" "coffee_production_2017"
## [4] "iso_a2" "continent" "region_un"
## [7] "subregion" "type" "area_km2"
## [10] "pop" "lifeExp" "gdpPercap"
## [13] "geom"