Join two data.frames using the Leipzig Catalogue of Vascular Plants (LCVP)

Join two data.frames based on the taxonomic resolution of plant taxon names listed in the "Leipzig Catalogue of Vascular Plants" (LCVP). Inspired by the dplyr join functions.

lcvp_join(
  x,
  y,
  sp_columns,
  max_distance = 0.2,
  genus_fuzzy = FALSE,
  grammar_check = FALSE,
  type = "full",
  solve_duplicated = FALSE,
  func_numeric = mean,
  func_character = .keep_all,
  func_logical = any
)

Arguments

x, y

data.frames to join.

sp_columns

A character vector indicating the column names in x and y with respective species names to join by.

For example, c("species", "Species_name").

max_distance

The maximum string distance allowed for a match when comparing the submitted name with the closest names in the LCVP. The distance used is a generalized Levenshtein distance: the total number of insertions, deletions, and substitutions needed to turn one name into the other. It can be expressed as an integer or as a fraction of the length of the binomial name. For example, for a name of length 10, max_distance = 0.1 allows only one change (insertion, deletion, or substitution), while max_distance = 2 allows two changes.
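As a rough illustration of how max_distance can be read, the sketch below measures the edit distance between a misspelled and a correctly spelled name with base R's utils::adist(), then compares it with an allowed number of edits. The helper allowed_edits() is hypothetical and not part of lcvplants; rounding a fractional max_distance up is an assumption borrowed from the convention documented for base::agrep().

```r
# Illustrative names only; no LCVP data is needed for this sketch.
submitted <- "Quercus robor"   # misspelled epithet
candidate <- "Quercus robur"

# Generalized Levenshtein distance (insertions, deletions, substitutions)
d <- utils::adist(submitted, candidate)[1, 1]

# Hypothetical helper: convert max_distance into an allowed number of edits.
# Values below 1 are treated as a fraction of the name length and rounded up
# (assumption, following the convention documented for base::agrep()).
allowed_edits <- function(name, max_distance) {
  if (max_distance < 1) {
    ceiling(max_distance * nchar(name))
  } else {
    max_distance
  }
}

edits_fraction <- allowed_edits("abcdefghij", 0.1)  # name of length 10
edits_integer  <- allowed_edits("abcdefghij", 2)    # integer form

# Would the misspelled name be accepted with max_distance = 0.2?
is_match <- d <= allowed_edits(submitted, 0.2)
```

Under these assumptions the fraction form scales with the length of the name, so longer names tolerate more edits, while the integer form is a fixed budget.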

genus_fuzzy

If TRUE, the fuzzy match algorithm based on max_distance will also be applied to the genus (note that this may considerably increase computational time). If FALSE, fuzzy match will only apply to the epithet.

grammar_check

If TRUE, the algorithm will try to fix common Latin grammar mistakes.

type

What type of join should be done: "full" (default), "left", "right" or "inner".

* "full" returns all rows and all columns from both x and y.
* "left" returns all rows from x.
* "right" returns all rows from y.
* "inner" returns all rows from x where there are matching species in y.

solve_duplicated

If TRUE, duplicated output names are summarized using one function per column class.

See lcvp_solve_dups for details.

func_numeric

A function to summarize numeric columns if solve_duplicated = TRUE. Default will return the mean.

func_character

A function to summarize character columns if solve_duplicated = TRUE. Default will keep all unique strings separated by comma.

func_logical

A function to summarize logical columns if solve_duplicated = TRUE. Default will return TRUE if any is TRUE.

Value

A data.frame with the columns in both tables. The rows will depend on the type selected. For "inner", a subset of x rows. For "left", all x rows. For "right", a subset of x rows, followed by unmatched y rows. For "full", all x rows, followed by unmatched y rows.
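The effect of type on the returned rows can be illustrated with base R's merge() on exact keys. This is only an analogy: lcvp_join matches on LCVP-standardized names rather than exact strings, and the toy names and values below are made up for the sketch.

```r
# Toy tables with exact-match keys (placeholder names, not LCVP data)
x <- data.frame(Species = c("A a", "B b", "C c"), Trait1 = c(1, 2, 3),
                stringsAsFactors = FALSE)
y <- data.frame(Species = c("B b", "C c", "D d"), Trait2 = c(10, 20, 30),
                stringsAsFactors = FALSE)

inner <- merge(x, y, by = "Species")                # keys present in both
left  <- merge(x, y, by = "Species", all.x = TRUE)  # every row of x
right <- merge(x, y, by = "Species", all.y = TRUE)  # every row of y
full  <- merge(x, y, by = "Species", all = TRUE)    # every row of x and y

# Number of rows returned by each join type
n_rows <- sapply(list(inner = inner, left = left, right = right, full = full),
                 nrow)
```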

Details

The function adds the columns from y to x based on the species names in both tables. It first standardizes the species names in both tables against the "Leipzig Catalogue of Vascular Plants" (LCVP) using the algorithm in lcvp_search. Note that lcvp_join can also deal with misspellings by fuzzy matching species names, given a max_distance choice. The standardized names of both tables are then matched using the algorithm in lcvp_match. The "full" join keeps all species and fills missing values with NA. No NA is added with the "inner", "left" and "right" options.
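The two-step idea (standardize names, then match) can be sketched in base R without the LCVP data. Everything here is a toy stand-in: the accepted lookup vector is hypothetical and plays the role of the standardization step, and match() plays the role of the matching step. It does not reproduce the actual lcvp_search or lcvp_match algorithms.

```r
# Hypothetical lookup: submitted name -> accepted name (toy values, not LCVP data)
accepted <- c("Genus alpha" = "Genus alpha",
              "Genus beta"  = "Genus alpha",   # toy synonym of "Genus alpha"
              "Genus gamma" = "Genus gamma")

x <- data.frame(Species = c("Genus beta", "Genus gamma"),
                Trait1 = c(0.1, 0.2), stringsAsFactors = FALSE)
y <- data.frame(Species_name = c("Genus alpha", "Genus gamma"),
                Trait2 = c(5, 6), stringsAsFactors = FALSE)

# Step 1: standardize the names in both tables
x$std <- unname(accepted[x$Species])
y$std <- unname(accepted[y$Species_name])

# Step 2: match the standardized names and add y's columns to x
idx <- match(x$std, y$std)
joined <- cbind(x, y[idx, c("Species_name", "Trait2")])
```

In this toy case "Genus beta" in x is joined to "Genus alpha" in y because both resolve to the same standardized name, even though the submitted strings differ.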

Duplicated taxonomic resolutions may occur if two input names are now synonyms. If solve_duplicated is TRUE, the lcvp_solve_dups function is applied to merge duplicated output names.
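A minimal base R sketch of summarizing duplicated names by column class follows. It mirrors the documented defaults (mean for numeric, all unique strings for character, any for logical) but is not the lcvp_solve_dups implementation; keep_all() and summarise_col() are hypothetical helpers, and the ", " separator is an assumption.

```r
# Toy table with one duplicated name (placeholder values)
dups <- data.frame(
  Species = c("A a", "A a", "B b"),
  Trait1  = c(1, 3, 5),
  Trait4  = c("a", "b", "a"),
  Trait5  = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the default character summary
keep_all <- function(v) paste(unique(v), collapse = ", ")

# Choose a summary function from the column class
summarise_col <- function(v) {
  if (is.numeric(v)) {
    mean(v)
  } else if (is.character(v)) {
    keep_all(v)
  } else if (is.logical(v)) {
    any(v)
  } else {
    v[1]
  }
}

# Summarize every non-key column within each group of identical names
groups <- split(dups[-1], dups$Species)
merged <- do.call(rbind, lapply(names(groups), function(sp) {
  data.frame(Species = sp, lapply(groups[[sp]], summarise_col),
             stringsAsFactors = FALSE)
}))
```

The two rows for "A a" collapse into one row, with each column reduced by the function chosen for its class.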

References

Freiberg, M., Winter, M., Gentile, A. et al. LCVP, The Leipzig catalogue of vascular plants, a new taxonomic reference list for all known vascular plants. Sci Data 7, 416 (2020). https://doi.org/10.1038/s41597-020-00702-z

Author

Bruno Vilela & Alexander Ziska

Examples

# Ensure that LCVP package is available before running the example.
# If it is not, see the `lcvplants` package vignette for details
# on installing the required data package.
if (requireNamespace("LCVP", quietly = TRUE)) { # Do not run this

# Create data.frame1
splist1 <- sample(apply(LCVP::tab_lcvp[2:10, 2:3], 1, paste, collapse = " "))
x <- data.frame("Species" = splist1, "Trait1" = runif(length(splist1)))

# Create data.frame2
splist2 <- sample(apply(LCVP::tab_lcvp[11:3, 2:3], 1, paste, collapse = " "))
y <- data.frame("Species" = splist2,
                "Trait2" = runif(length(splist2)),
                "Trait3" = runif(length(splist2)),
                "Trait4" = sample(c("a", "b"), length(splist2), replace = TRUE),
                "Trait5" = sample(c(TRUE, FALSE), length(splist2), replace = TRUE))

# Full join
lcvp_join(x, y, c("Species", "Species"), type = "full")

# Left join
lcvp_join(x, y, c("Species", "Species"), type = "left")

# Right join
lcvp_join(x, y, c("Species", "Species"), type = "right")

# Inner join and solve duplicates
lcvp_join(x, y, c("Species", "Species"),
          type = "inner", solve_duplicated = TRUE)

}