The public availability of species distribution data has increased substantially in the last 10 years: occurrence information representing hundreds of years of biological collection effort is now available for species across the tree of life, mostly in the form of geographic coordinate records. The Global Biodiversity Information Facility (GBIF, www.gbif.org) is one of the largest data providers, hosting more than one billion records (Sept 2018) from a large variety of sources.
After this exercise you will be able to retrieve species occurrence information from GBIF from within R. You will be equipped with example data from your group of interest for the upcoming exercises. See https://ropensci.org/tutorials/rgbif_tutorial.html for a more exhaustive tutorial on the rgbif package.
We will use the rgbif package to obtain occurrence records from GBIF. You can find the relevant functions for each task in the parentheses. You can get help on each function by typing ?FUNCTIONNAME.
1. Load the rgbif package and download the occurrence data for one of the species of your choice. (name_suggest, occ_search)
2. Explore the downloaded data. (head, plot)
3. Check how many records are available for your group of interest. (name_suggest, occ_count)
4. Download the records for your group; if there are too many, restrict the download with the limit argument. (occ_search)
5. Save the downloaded data to the working directory. (write_csv, write_delim)
In this exercise we will use the rgbif library for communication with GBIF and the tidyverse library for data management.
In the following tutorial, we will go through the questions one-by-one. The suggested answers are by no means the only correct ones.
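GBIF hosts a large number of records and downloading all records might take some time (also, the download limit using occ_search is 200,000 records), so it is worth checking first how many records are available. We do this using the return argument of the occ_search function, which controls what is returned: return = "meta" gives only the meta-data on the records, including the record count, whereas return = "data" gives the records themselves. (In rgbif 3.0 and later the return argument has been removed; the same pieces are available as the $meta and $data elements of the returned list.) Choose a species from your project taxon; for demonstration we will download records for the Malvaceae family. We’ll first download data for a single, widespread species, Ceiba pentandra: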
library(rgbif)
library(tidyverse)

# Search occurrence records
dat <- occ_search(scientificName = "Ceiba pentandra", return = "data",
                  limit = 1000)
nrow(dat) # Check the number of records
head(dat) # Check the data
plot(dat$decimalLatitude ~ dat$decimalLongitude) # Look at the georeferenced records
So luckily there is a good number of records available. And as the quick visualization shows, a lot of them have geographic coordinates. See exercise eight for more detailed plotting. In the next exercise we will see how to reduce the amount of information and quality-check the data. But let’s first download more relevant data for the project.
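The share of records with coordinates can also be quantified instead of being read off the plot. The following is a minimal sketch on a small made-up data frame (the values in toy are invented for illustration); it only assumes the column names decimalLatitude and decimalLongitude used in the plot call above.

```r
# Toy stand-in for a GBIF download; the values are invented for illustration
toy <- data.frame(
  species = rep("Ceiba pentandra", 5),
  decimalLatitude = c(-3.1, NA, 9.2, 5.5, NA),
  decimalLongitude = c(-60.0, NA, -79.9, NA, -52.3)
)

# A record counts as georeferenced only if both coordinates are present
has_coords <- !is.na(toy$decimalLatitude) & !is.na(toy$decimalLongitude)

n_georef <- sum(has_coords)      # number of georeferenced records
frac_georef <- mean(has_coords)  # share of georeferenced records

# Keep only the georeferenced records, e.g. for mapping
toy_georef <- toy[has_coords, ]
```

Applied to a real download, the same three lines would use dat in place of toy.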
For your project, we are interested not only in one species but in a larger taxonomic group. You can search for higher-rank taxa using GBIF’s taxonKey. The taxonKey is a unique identifier for each taxon; we can obtain it from the taxon name via the name_suggest function. Since higher taxa might have a lot of records and downloading might take a lot of time, we will first check how many records are available. Here we will look at the entire class Magnoliopsida.
# Use the name_suggest function to get the GBIF taxon key
tax_key <- name_suggest(q = "Magnoliopsida", rank = "Class")
# Sometimes groups have multiple taxon keys, in this case three, so we will
# check how many records are available for each of them
lapply(tax_key$key, function(k) occ_count(taxonKey = k))
# Here the first one is relevant, check for your group!
tax_key <- tax_key$key[1]
# Number of records for this taxon from Brazil
occ_count(taxonKey = tax_key, country = "BR")
There are more than five million records available from Brazil. This is too many for this exercise, and occ_search is also limited to 200,000 records. Hence we restrict the download to records with coordinates and cap their number with the limit argument. The download may take some minutes.
dat <- occ_search(taxonKey = tax_key, return = "data", country = "BR",
                  hasCoordinate = TRUE, limit = 1000)
That leaves us with at most 1,000 records, the value of the limit argument. If you are satisfied with the data for your group, you can go to the next step and save the data to the working directory. The limit for record searching using rgbif is 200,000 records; if your group has more records, you may limit the geographic area, for example to the north-east of Brazil. To do this you can use the well-known text (WKT) format to specify an area. Here we use a very simple rectangle; feel free to experiment.
# Rectangle in north-eastern Brazil as WKT (longitude latitude pairs)
study_a <- "POLYGON((-35 -4.5, -38.5 -4.5, -38.5 -7, -35 -7, -35 -4.5))"
dat_ne <- occ_search(taxonKey = tax_key, return = "data", hasCoordinate = TRUE,
                     geometry = study_a, limit = 1000)
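Writing the WKT string by hand is error-prone, so a rectangle can also be assembled from its bounding coordinates. The sketch below uses base R only; bbox_to_wkt is a hypothetical helper name, not an rgbif function. It lists the corners in counter-clockwise order, which, as far as we are aware, is the ordering the GBIF API expects for polygons.

```r
# Hypothetical helper: build a WKT rectangle from bounding coordinates
# (x = longitude, y = latitude, in decimal degrees)
bbox_to_wkt <- function(xmin, ymin, xmax, ymax) {
  stopifnot(xmin < xmax, ymin < ymax)
  # Corners in counter-clockwise order: SW, SE, NE, NW, and back to SW
  xs <- c(xmin, xmax, xmax, xmin, xmin)
  ys <- c(ymin, ymin, ymax, ymax, ymin)
  paste0("POLYGON((", paste(xs, ys, collapse = ", "), "))")
}

# The same north-eastern Brazil rectangle as above
study_b <- bbox_to_wkt(xmin = -38.5, ymin = -7, xmax = -35, ymax = -4.5)
```

The resulting string can be passed to the geometry argument of occ_search in the same way as study_a.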
If you have a .kml or .shp file for the area for which you want to download records, you can import it into R using the readOGR function of the rgdal library and convert it into WKT format using writeWKT from the rgeos package.
library(rgdal)

# Read the polygon from a .kml file
amz <- readOGR("inst/Amazonia.kml")
# or for shape files: amz <- readOGR('inst', layer = 'Amazonia')
# Convert the full outline to WKT
rgeos::writeWKT(amz)
# Or, best use the extent of the shape, since it is simple:
ex <- raster::extent(amz)
ex <- as(ex, "SpatialPolygons")
ex <- rgeos::writeWKT(ex)
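The rgdal and rgeos packages have since been retired from CRAN, so the code above may not run on a recent R installation. A possible alternative, sketched here under the assumption that the sf package is installed, uses st_read, st_bbox, st_as_sfc and st_as_text. To keep the sketch runnable without a file, a small polygon is built in memory; the st_read line shows where a real file would come in.

```r
library(sf)

# For a real file one would start from, e.g.:
# amz <- st_read("inst/Amazonia.kml")
# Here a small polygon is built in memory so the sketch runs without a file
coords <- rbind(c(-38.5, -7), c(-35, -7), c(-35, -4.5),
                c(-38.5, -4.5), c(-38.5, -7))
amz <- st_sfc(st_polygon(list(coords)), crs = 4326)

# Full outline as WKT
amz_wkt <- st_as_text(amz)

# Bounding box of the shape as a simple WKT rectangle
ex_wkt <- st_as_text(st_as_sfc(st_bbox(amz)))
```

As with the rgeos version, using the bounding box keeps the WKT string short, at the cost of also retrieving records just outside the actual shape.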
Alternatively, you can download data for a list of taxa.
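gen_list <- c("Ceiba", "Eriotheca")
# Look up the taxon key(s) for each genus
tax_key <- lapply(gen_list, function(k) {
  name_suggest(q = k, rank = "Genus")
})
tax_key <- unlist(lapply(tax_key, "[[", "key"))
# Number of records available per taxon key
unlist(lapply(tax_key, function(k) occ_count(taxonKey = k)))
# With several taxon keys, occ_search returns one element per key
dat_ne <- occ_search(taxonKey = tax_key, return = "data", hasCoordinate = TRUE,
                     limit = 1000, country = "BR")
# Convert each element to a data frame and stack them into one table
dat_ne <- lapply(dat_ne, "as.data.frame")
dat_ne <- bind_rows(dat_ne)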
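When stacking per-taxon results it can help to record which taxon key each record came from and to guard against taxa without any records. The sketch below illustrates this on invented toy data; the list names, the values and the use of NULL for an empty result are assumptions for illustration, not the documented return format of occ_search.

```r
library(dplyr)

# Toy stand-in for a per-taxon result list; names and values are invented,
# and NULL mimics a taxon for which no records were returned
res <- list(
  key_a = data.frame(species = c("sp1", "sp2"),
                     decimalLatitude = c(-5.1, -6.3)),
  key_b = NULL,
  key_c = data.frame(species = "sp3", decimalLatitude = -4.8,
                     basisOfRecord = "HUMAN_OBSERVATION")
)

# Drop empty elements, then stack the rest; columns missing from one
# table are filled with NA, and .id stores the list name of each source
res <- Filter(Negate(is.null), res)
combined <- bind_rows(res, .id = "taxon_key")
```

bind_rows is used rather than rbind because the individual tables do not necessarily share the same set of columns.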
Finally, save the data to disk. You can check the current working directory with getwd().
write_csv(dat_ne, "inst/gbif_occurrences.csv")
If you want to use records from GBIF in a publication, please make sure you cite them properly using a DOI. You can get a DOI by requesting the data with occ_download.
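A download request of this kind could look roughly like the sketch below. It assumes rgbif 3.0 or later and a registered GBIF account; request_gbif_download is a hypothetical wrapper name, and the chosen filters (taxon key, country, records with coordinates) simply mirror the occ_search calls used in this exercise. The function is only defined here, not called, because calling it needs valid credentials and may take a while.

```r
# Hypothetical wrapper: request a citable GBIF download for one taxon
# and country. Assumes rgbif >= 3.0 and a GBIF account.
request_gbif_download <- function(taxon_key, country, user, pwd, email) {
  req <- rgbif::occ_download(
    rgbif::pred("taxonKey", taxon_key),
    rgbif::pred("country", country),
    rgbif::pred("hasCoordinate", TRUE),
    format = "SIMPLE_CSV",
    user = user, pwd = pwd, email = email
  )
  rgbif::occ_download_wait(req)            # wait until GBIF has prepared the file
  zipfile <- rgbif::occ_download_get(req)  # fetch the zip archive
  list(
    data = rgbif::occ_download_import(zipfile),
    citation = rgbif::gbif_citation(rgbif::occ_download_meta(req))
  )
}

# Example call (not run; the credentials are placeholders):
# out <- request_gbif_download(tax_key, "BR", user = "my_user",
#                              pwd = "my_password", email = "me@example.org")
# out$citation
```

Rather than typing credentials into a script, they can be stored in the GBIF_USER, GBIF_PWD and GBIF_EMAIL environment variables, which rgbif reads by default.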