R/lcvp_join.R
lcvp_join.Rd
Join two data.frames based on the taxonomic resolution of plant taxa names listed in the "Leipzig Catalogue of Vascular Plants" (LCVP). Inspired by the dplyr::join function.
lcvp_join(
  x,
  y,
  sp_columns,
  max_distance = 0.2,
  genus_fuzzy = FALSE,
  grammar_check = FALSE,
  type = "full",
  solve_duplicated = FALSE,
  func_numeric = mean,
  func_character = .keep_all,
  func_logical = any
)
x, y: The data.frames to join.
sp_columns: A character vector with the names of the columns in x and y that hold the species names to join by. For example, c("species", "Species_name").
max_distance: The maximum string distance allowed for a match when comparing the submitted name with the closest name matches in the LCVP. The distance used is a generalized Levenshtein distance, which counts the insertions, deletions, and substitutions needed to turn one name into the other. It can be given as an integer or as a fraction of the length of the binomial name. For example, for a name of length 10, max_distance = 0.1 allows only one change (insertion, deletion, or substitution), while max_distance = 2 allows two changes.
genus_fuzzy: If TRUE, the fuzzy match algorithm based on max_distance is also applied to the genus (note that this may considerably increase computation time). If FALSE, fuzzy matching applies only to the epithet.
grammar_check: If TRUE, the algorithm will try to fix common Latin grammar mistakes.
type: The type of join to perform: "full" (default), "left", "right", or "inner".
* "full" returns all rows and all columns from both x and y.
* "left" returns all rows from x.
* "right" returns all rows from y.
* "inner" returns all rows from x where there are matching species in y.
solve_duplicated: If TRUE, duplicated output names are summarized using one function per column class. See lcvp_solve_dups for details.
func_numeric: A function to summarize numeric columns if solve_duplicated = TRUE. The default returns the mean.
func_character: A function to summarize character columns if solve_duplicated = TRUE. The default keeps all unique strings, separated by commas.
func_logical: A function to summarize logical columns if solve_duplicated = TRUE. The default returns TRUE if any value is TRUE.
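The arithmetic behind max_distance can be checked with base R. The sketch below uses utils::adist, which computes a generalized Levenshtein distance. The helper allowed_edits is hypothetical, and rounding fractions up with ceiling is an assumption rather than confirmed lcvplants behaviour.

```r
# Hypothetical helper (not part of lcvplants):
# turn max_distance into a number of allowed edits.
allowed_edits <- function(name, max_distance) {
  if (max_distance < 1) {
    # Assumption: a fraction is scaled by the name length and rounded up
    ceiling(max_distance * nchar(name))
  } else {
    # Values of 1 or more are taken as an absolute number of edits
    max_distance
  }
}

submitted <- "Abies alva"  # misspelled name, 10 characters long
candidate <- "Abies alba"  # name it should be compared against

budget <- allowed_edits(submitted, 0.1)             # edits allowed
edits <- utils::adist(submitted, candidate)[1, 1]   # edits needed
is_match <- edits <= budget
```

Here one substitution separates the two names, so the pair falls within the budget implied by max_distance = 0.1 for a name of length 10.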
A data.frame with the columns in both tables. The rows will depend on the type selected. For "inner", a subset of x rows. For "left", all x rows. For "right", a subset of x rows, followed by unmatched y rows. For "full", all x rows, followed by unmatched y rows.
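As an analogy for how the row sets differ between join types, the following sketch uses base R merge on two toy tables whose names are assumed to be already standardized. It is not the lcvp_join implementation; the tables and trait values are made up for illustration.

```r
# Toy tables with already standardized names (illustrative values only)
x <- data.frame(Species = c("Aus bus", "Cus dus", "Eus fus"),
                Trait1 = c(1, 2, 3),
                stringsAsFactors = FALSE)
y <- data.frame(Species = c("Cus dus", "Eus fus", "Gus hus"),
                Trait2 = c(10, 20, 30),
                stringsAsFactors = FALSE)

# Base R equivalents of the four join types
full_rows  <- merge(x, y, by = "Species", all = TRUE)    # all species
left_rows  <- merge(x, y, by = "Species", all.x = TRUE)  # all x species
right_rows <- merge(x, y, by = "Species", all.y = TRUE)  # all y species
inner_rows <- merge(x, y, by = "Species")                # shared species only

# Number of rows returned by each join type
sapply(list(full = full_rows, left = left_rows,
            right = right_rows, inner = inner_rows), nrow)
```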
The function adds the columns from y to x based on the species names listed in both tables. It first standardizes the species names in both tables against the "Leipzig Catalogue of Vascular Plants" (LCVP) using the algorithm in lcvp_search. Note that lcvp_join can also deal with misspellings by fuzzy matching species names, given a max_distance choice. The standardized names of both tables are then matched using the algorithm in lcvp_match. The "full" join type keeps all species and adds NAs for missing values. No NA is added in the "inner", "left" and "right" options.
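The standardize-then-match idea described above can be sketched with a toy lookup table. Everything here is hypothetical: the accepted vector stands in for the LCVP name resolution done by lcvp_search, the species names are placeholders rather than LCVP entries, and base R merge stands in for lcvp_match.

```r
# Hypothetical lookup: submitted name -> accepted name (placeholder names)
accepted <- c("Aus bus" = "Aus bus",
              "Aus cus" = "Aus bus",   # treated here as a synonym of "Aus bus"
              "Dus eus" = "Dus eus")

x <- data.frame(Species = c("Aus bus", "Dus eus"),
                Trait1 = c(1.5, 2.5),
                stringsAsFactors = FALSE)
y <- data.frame(Species_name = c("Aus cus", "Dus eus"),
                Trait2 = c(10, 20),
                stringsAsFactors = FALSE)

# Step 1: standardize the names in each table
x$std <- unname(accepted[x$Species])
y$std <- unname(accepted[y$Species_name])

# Step 2: join on the standardized names instead of the submitted ones
joined <- merge(x, y, by = "std", all = TRUE)
```

Joining on the raw columns would miss the "Aus bus" / "Aus cus" pair, while joining on the standardized column places them in the same row.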
Duplicated taxonomic resolution may occur if two inputs are now synonyms. If solve_duplicated is TRUE, the lcvp_solve_dups function is applied to merge duplicated output names.
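A minimal base R sketch of what merging duplicated output names could look like, with one summary rule per column class mirroring the documented defaults (mean, keep all unique strings, any). The helper summarise_column and the toy data are hypothetical, the ", " separator is an assumption, and this is not the lcvp_solve_dups implementation.

```r
# Toy join result with a duplicated output name (illustrative values only)
joined <- data.frame(
  Species = c("Aus bus", "Aus bus", "Dus eus"),
  Trait1 = c(1, 3, 5),
  Trait4 = c("a", "b", "a"),
  Trait5 = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick a summary according to the column class
summarise_column <- function(v) {
  if (is.numeric(v)) {
    mean(v)                               # numeric: mean
  } else if (is.logical(v)) {
    any(v)                                # logical: TRUE if any is TRUE
  } else {
    paste(unique(v), collapse = ", ")     # character: keep unique strings
  }
}

# Collapse each group of rows sharing an output name into a single row
groups <- split(joined, joined$Species)
solved <- do.call(rbind, lapply(groups, function(g) {
  as.data.frame(lapply(g, summarise_column), stringsAsFactors = FALSE)
}))
rownames(solved) <- NULL
```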
Freiberg, M., Winter, M., Gentile, A. et al. LCVP, The Leipzig catalogue of vascular plants, a new taxonomic reference list for all known vascular plants. Sci Data 7, 416 (2020). https://doi.org/10.1038/s41597-020-00702-z
# Ensure that LCVP package is available before running the example.
# If it is not, see the `lcvplants` package vignette for details
# on installing the required data package.
if (requireNamespace("LCVP", quietly = TRUE)) { # Do not run this
  # Create data.frame1
  splist1 <- sample(apply(LCVP::tab_lcvp[2:10, 2:3], 1, paste, collapse = " "))
  x <- data.frame("Species" = splist1, "Trait1" = runif(length(splist1)))

  # Create data.frame2
  splist2 <- sample(apply(LCVP::tab_lcvp[11:3, 2:3], 1, paste, collapse = " "))
  y <- data.frame("Species" = splist2,
                  "Trait2" = runif(length(splist2)),
                  "Trait3" = runif(length(splist2)),
                  "Trait4" = sample(c("a", "b"), length(splist2), replace = TRUE),
                  "Trait5" = sample(c(TRUE, FALSE), length(splist2), replace = TRUE))

  # Full join
  lcvp_join(x, y, c("Species", "Species"), type = "full")

  # Left join
  lcvp_join(x, y, c("Species", "Species"), type = "left")

  # Right join
  lcvp_join(x, y, c("Species", "Species"), type = "right")

  # Inner join and solve duplicates
  lcvp_join(x, y, c("Species", "Species"),
            type = "inner", solve_duplicated = TRUE)
}