Background

The public availability of species distribution data has increased substantially in the last 10 years: occurrence information, mostly in the form of geographic coordinate records for species across the tree of life, representing hundreds of years of biological collection effort are now available. The global Biodiversity Information Facility (www.gbif.org) is one of the largest data providers, hosting more than one billion records (Sept 2018) from a large variety of sources.

Outcomes

After this exercise you will be able to retrieve species occurrence information from GBIF from within R. You will be equipt with example data from your group of interest for the follow upcoming exercises. See https://ropensci.org/tutorials/rgbif_tutorial.html for a more exhaustive tutorial on the rgbif package.

Exercise

We will use the rgbif package to obtain occurrence records from GBIF. You can find the relevant functions for each task in the parentheses. You can get help on each function by typing ?FUNCTIONNAME.

Familiarize yourself with the rgbif package and download the occurrence data for one of the species of your choice. (name_suggest, occ_Search)
Look at the data and a quick plot (head, plot).
Now find out how many occurrences for flowering plants are availble from Brazil. (name_suggest, occ_count)
Download all records for flowering plants from Rio Grande do Norte (more or less). Make sure to keep an eye on the limit argument. (occ_search)
Save the downloaded data as .txt or .csv to the working directory. (write_csv, write_delim)

Library setup

In this exercise we will use the rgbif library for communication with GBIF and the tidyverse library for data management.

Tutorial

In the following tutorial, we will go through the questions one-by-one. The suggested answers are by no means the only correct ones.

1. Download data for a single species we saw in the filedand save them in a data.frame

GBIF hosts a large number of records and downloading all records might take some time (also the download limit using occ_search is 250,000), so it is worth checking first how many records are available. We do this using the return argument of the occ_search function, which will only return meta-data on the record. Chose a species from your project taxon, for demonstration will download records for the Malvaceae family. We’ll first download data for a single, wide-spread species, Ceiba pentandra:

# Search occurrence records
dat <- occ_search(scientificName = "Wittmackia patentissima", return = "data", 
    limit = 1000)

2. Explore the downloaded data

nrow(dat)  # Check the number of records
head(dat)  # Check the data
plot(dat$decimalLatitude ~ dat$decimalLongitude)  # Look at the georeferenced records

So luckily there are a good number of records available. An as the quick visualization shows, a lot of the have geographic coordinates. See exercise eight for more detailed plotting. In the next exercise we will see how to reduce the amount of information and quality check the data. But let’s first download more relevant data for the project.

3. How many records are available for your group of interest?

For your project, we are interested not only in one species, but a larger taxonomic group. You can search for higher rank taxa using GBIF’s taxonKey. The taxonKey is a unique identifier for each taxon; we can obtain it from the taxon name via the name _suggest function. Since higher taxa might have a lot of records and downloading might take a lot of time, we will first check how many records are available. Here we will look at the entire genus Ceiba.

# Use the name_suggest function to get the gbif taxon key
tax_key <- name_suggest(q = "Magnoliopsida", rank = "Class")

# Sometimes groups have multiple taxon keys, in this case three, so we will
# check how many records are available for them
lapply(tax_key$key, "occ_count")

# Here the firsrt one is relevant, check for your group!
tax_key <- tax_key$key[1]

occ_count(tax_key, country = "DE")

There are more than five million records available from Brazil. This is too much for this exercise and also occ_Search is limited to 200000 records. Hence we will further limit the geographic extent. To do this you can use the Well-known-text format (WKT) to specify an area. Here we use a very simple rectangle, feel free to experiment. The download may take some minutes.

Download all data for your group from Brazil, or if the group is very large from part of that area. You can now download the records for Brazil. Since we want to work with coordinates, we will only download those records that do have coordinates.

dat <- occ_search(taxonKey = tax_key, return = "data", country = "BR", hasCoordinate = T, 
    limit = 1000)

That leaves us with records. If you are satisfied for your group you can go to the next step and save the data to the working directory. The limit for record searching using rgbif is 250,000 records, if your group has more records you may limit the geographic area to the north east of Brazil. To do this you can use the Well-known-text format (WKT) to specify an area. Here we use a very simple rectangle, feel free to experiment

study_a <- "POLYGON((-35 -4.5, -38.5 -4.5, -38.5 -7, -35 -7, -35 -4.5))"

dat_ne <- occ_search(taxonKey = tax_key, return = "data", hasCoordinate = T, 
    geometry = study_a, limit = 1000)

If you have a .kml or.shp file for which you want to download records you can import this into R using the readOGR function of the rgdal library and convert it into WKT format using writeWKT from the rgeos package.

amz <- readOGR("inst/Amazonia.kml")
# or for shape files: amz <- readOGR('inst', layer = 'Amazonia')
rgeos::writeWKT(amz)
# Or, best use the extent of the shape, since it is simple:
ex <- raster::extent(amz)
ex <- as(ex, "SpatialPolygons")
ex <- rgeos::writeWKT(ex)

Alternatively, you can download data for a list of taxa.

gen_list <- c("Ceiba", "Eriotheca")

tax_key <- lapply(gen_list, function(k) {
    name_suggest(q = k, rank = "Genus")
})
tax_key <- unlist(lapply(tax_key, "[[", "key"))

unlist(lapply(tax_key, "occ_count"))

dat_ne <- occ_search(taxonKey = tax_key, return = "data", hasCoordinate = T, 
    limit = 1000, country = "BR")
dat_ne <- lapply(dat_ne, "as.data.frame")

dat_ne <- bind_rows(dat_ne)

Save the downloaded data as .csv to the working directory. You can save the results as text file to the working directory. To identify your working directory use getwd().

write_csv(dat_ne, path = "inst/gbif_occurrences.csv")

If you want to use records from GBIF for publication, please make sure you cite them properly, using a DOI, you can get a DOI by using occ_download.

Downloading occurrences from GBIF