When comparing or merging lists of plant species, for instance when compiling regional species lists or merging occurrence lists with trait data or phylogenies, the taxonomic names must be matched to avoid artificial inflation due to synonyms, data loss, and erroneous matches due to homonyms. The lcvplants package facilitates this process by automating large-scale taxonomic harmonization of plant names through fuzzy matching and synonymy resolution against the Leipzig Catalogue of Vascular Plants as the taxonomic backbone.
The Leipzig Catalogue of Vascular Plants (LCVP) is a novel global taxonomic backbone, updating The Plant List, comprising more than 1,300,000 names and 350,000 accepted taxon names. We described the LCVP in detail in the related scientific publication (Freiberg et al., 2020).
You can install lcvplants from GitHub using the devtools package. You may need to install devtools first, so if you do not have it installed, run the following code.
install.packages("devtools")
To use lcvplants you also need the data from the LCVP package, which you can install in the same way.
devtools::install_github("idiv-biodiversity/lcvplants")
devtools::install_github("idiv-biodiversity/LCVP")
The basic function in lcvplants is lcvp_search. It takes a list of species names and matches them to the names in the LCVP data. It returns a data.frame with the full matched name (including the authority, with possible orthographic errors corrected), the taxonomic status, the accepted name, the family and the order.
lcvp_search("Hibiscus vitifolius")
Search | global.Id | Input.Genus | Input.Epitheton | Rank | Input.Subspecies.Epitheton | Input.Authors | Status | globalId.of.Output.Taxon | Output.Taxon | Family | Order | Literature | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Hibiscus vitifolius | 604882 | Hibiscus | vitifolius | species | nil | L. | accepted | 604882 | Hibiscus vitifolius L. | Malvaceae | Malvales |
The lcvp_search algorithm first tries to match the binomial names provided by the user exactly. If no match is found, it looks for the closest name within the maximal transformation cost defined by the argument max_distance (see ?base::agrep for more details). When show_correct = TRUE, a column is added to the result indicating whether the binomial name was matched exactly (TRUE) or is misspelled (FALSE).
lcvp_search("Hibiscus vitifoliuse",
            max_distance = 0.1,
            show_correct = TRUE)
Search | global.Id | Input.Genus | Input.Epitheton | Rank | Input.Subspecies.Epitheton | Input.Authors | Status | globalId.of.Output.Taxon | Output.Taxon | Family | Order | Literature | Comments | Correct |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Hibiscus vitifoliuse | 604882 | Hibiscus | vitifolius | species | nil | L. | accepted | 604882 | Hibiscus vitifolius L. | Malvaceae | Malvales | | | FALSE |
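
The two-step lookup described above (exact match first, then the closest name within a distance budget) can be sketched in base R. This is only an illustration under stated assumptions, not the lcvplants implementation: fuzzy_lookup and the three-name backbone are hypothetical, and the budget is computed the way ?agrep describes a fractional max.distance (a fraction of the query length, rounded up).

```r
# Toy backbone using names that appear in this vignette (not the real LCVP table)
backbone <- c("Hibiscus vitifolius", "Hibiscus cannabinus", "Hibiscus furcatus")

# Hypothetical helper: exact match first, then closest name within a budget
fuzzy_lookup <- function(name, backbone, max_distance = 0.1) {
  if (name %in% backbone) {
    return(data.frame(Search = name, Match = name, Correct = TRUE,
                      stringsAsFactors = FALSE))
  }
  # Generalized Levenshtein distance from the query to every backbone name
  d <- drop(utils::adist(name, backbone))
  # Assumed budget: fraction of the query length, rounded up
  budget <- ceiling(max_distance * nchar(name))
  best <- which.min(d)
  matched <- if (d[best] <= budget) backbone[best] else NA_character_
  data.frame(Search = name, Match = matched, Correct = FALSE,
             stringsAsFactors = FALSE)
}

exact <- fuzzy_lookup("Hibiscus furcatus", backbone)
fuzzy <- fuzzy_lookup("Hibiscus vitifoliuse", backbone, max_distance = 0.1)
fuzzy
```

Raising max_distance widens the budget and therefore accepts more distant names, which trades fewer unmatched inputs for a higher risk of wrong matches.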
If more than one name is fuzzy matched (that is, more than one closest match is found), lcvp_search returns only the accepted name or, failing that, the first name. The function lcvp_fuzzy_search returns all the closest results when keep_closest = TRUE, or all names within the defined max_distance when keep_closest = FALSE.
lcvp_fuzzy_search("Hibiscus vitifoliuse",
                  max_distance = 0.1,
                  keep_closest = TRUE)
global.Id | Input.Genus | Input.Epitheton | Rank | Input.Subspecies.Epitheton | Input.Authors | Status | globalId.of.Output.Taxon | Output.Taxon | Family | Order | Literature | Comments | Name.Distance |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
604882 | Hibiscus | vitifolius | species | nil | L. | accepted | 604882 | Hibiscus vitifolius L. | Malvaceae | Malvales | | | 1 |
604883 | Hibiscus | vitifolius | species | nil | Mill. | synonym | 603660 | Hibiscus cannabinus L. | Malvaceae | Malvales | | | 1 |
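
The difference between the two keep_closest settings can be illustrated with a small base R sketch. It is not the package's code: fuzzy_candidates is a hypothetical helper, the backbone is a toy table, and the entry "Hibiscus vitifolia" is made up purely so that one name falls inside the distance budget without being the closest.

```r
# Toy backbone; the third entry is a made-up name for illustration only
backbone <- data.frame(
  name   = c("Hibiscus vitifolius", "Hibiscus vitifolius",
             "Hibiscus vitifolia", "Hibiscus furcatus"),
  author = c("L.", "Mill.", "(hypothetical)", "Roxb."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep either the closest names or all names in range
fuzzy_candidates <- function(name, backbone, max_distance = 0.2,
                             keep_closest = TRUE) {
  d <- drop(utils::adist(name, backbone$name))
  # Assumed budget: fraction of the query length, rounded up
  budget <- ceiling(max_distance * nchar(name))
  keep <- d <= budget
  if (keep_closest && any(keep)) keep <- keep & d == min(d[keep])
  out <- backbone[keep, , drop = FALSE]
  out$Name.Distance <- d[keep]
  out
}

closest  <- fuzzy_candidates("Hibiscus vitifoliuse", backbone, keep_closest = TRUE)
in_range <- fuzzy_candidates("Hibiscus vitifoliuse", backbone, keep_closest = FALSE)
```

Here closest holds only the two entries tied at the smallest distance, while in_range also includes the more distant made-up entry.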
Both lcvp_search and lcvp_fuzzy_search allow searching for multiple species at once. If you need to search for many species, consider setting the argument progress_bar = TRUE to track the execution progress.
splist <- c(
  "Hibiscus abelmoschus var. betulifolius Mast.",
  "Hibiscus abutiloides Willd.",
  "Hibiscus aculeatus",
  "Hibiscus acuminatus",
  "Hibiscus furcatuis"
)
x <- lcvp_search(splist, max_distance = 0.2, show_correct = TRUE)
x
Search | global.Id | Input.Genus | Input.Epitheton | Rank | Input.Subspecies.Epitheton | Input.Authors | Status | globalId.of.Output.Taxon | Output.Taxon | Family | Order | Literature | Comments | Correct |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Hibiscus abelmoschus var. betulifolius Mast. | 603424 | Hibiscus | abelmoschus | var. | betulifolius | Mast. | synonym | 518 | Abelmoschus moschatus Medik. | Malvaceae | Malvales | | | TRUE |
Hibiscus abutiloides Willd. | 603426 | Hibiscus | abutiloides | species | nil | Willd. | synonym | 1211289 | Talipariti tiliaceum (L.) Fryxell | Malvaceae | Malvales | | | TRUE |
Hibiscus aculeatus | 603439 | Hibiscus | aculeatus | species | nil | Walter | accepted | 603439 | Hibiscus aculeatus Walter | Malvaceae | Malvales | | | TRUE |
Hibiscus acuminatus | 603440 | Hibiscus | acuminatus | species | nil | Cav. | synonym | 683056 | Kosteletzkya acuminata (Cav.) Britten | Malvaceae | Malvales | | | TRUE |
Hibiscus furcatuis | 603937 | Hibiscus | furcatus | species | nil | Roxb. | accepted | 603937 | Hibiscus furcatus Roxb. | Malvaceae | Malvales | | | FALSE |
The function lcvp_summary reports on the results of a search made with lcvp_search. It indicates the number of species searched and how many were matched. Among the matched names, it indicates how many were matched exactly and how many were fuzzy matched. It then checks how many author and infracategory names were matched exactly. Note that if an author or infracategory is not provided, it is counted as a non-match.
lcvp_summary(x)
#> Species searched: 5
#> Species matched: 5 (100%)
#> Species exactly matched: 4 (80%)
#> Species fuzzy matched: 1 (20%)
#> Authors exactly matched: 2 (40%)
#> Infracategories exactly matched: 5 (100%)
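
The first four lines of this report are simple counts and can be reproduced by hand from the search table. The sketch below does so in base R on a toy copy of three columns from the table above; count_pct is a hypothetical helper and the logic is an assumption about what the report counts, not the lcvp_summary source.

```r
# Toy search result with values copied from the table above
res <- data.frame(
  Search = c("Hibiscus abelmoschus var. betulifolius Mast.",
             "Hibiscus abutiloides Willd.",
             "Hibiscus aculeatus",
             "Hibiscus acuminatus",
             "Hibiscus furcatuis"),
  global.Id = c(603424, 603426, 603439, 603440, 603937),
  Correct = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count and percentage formatted like the report
count_pct <- function(n, total) sprintf("%d (%d%%)", n, round(100 * n / total))

searched <- nrow(res)
matched  <- sum(!is.na(res$global.Id))      # rows that found an LCVP entry
exact    <- sum(res$Correct, na.rm = TRUE)  # binomial matched exactly
fuzzy    <- sum(!res$Correct, na.rm = TRUE) # binomial matched after correction

cat("Species searched:", searched, "\n")
cat("Species matched:", count_pct(matched, searched), "\n")
cat("Species exactly matched:", count_pct(exact, matched), "\n")
cat("Species fuzzy matched:", count_pct(fuzzy, matched), "\n")
```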
Notice that the warning message indicated that more than one name was matched for some species. You can easily access these species using the following code.
sps_mult <- attr(x, "matched_mult")
sps_mult
#> [1] "Hibiscus aculeatus" "Hibiscus furcatuis"
You can then quickly run lcvp_fuzzy_search on them.
lcvp_fuzzy_search(sps_mult)
global.Id | Input.Genus | Input.Epitheton | Rank | Input.Subspecies.Epitheton | Input.Authors | Status | globalId.of.Output.Taxon | Output.Taxon | Family | Order | Literature | Comments | Name.Distance |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
603436 | Hibiscus | aculeatus | species | nil | F.Dietr. | synonym | 604041 | Hibiscus heterophyllus Vent. | Malvaceae | Malvales | | | 0 |
603437 | Hibiscus | aculeatus | species | nil | G.Don | synonym | 604734 | Hibiscus surattensis L. | Malvaceae | Malvales | | | 0 |
603438 | Hibiscus | aculeatus | species | nil | Roxb. | synonym | 603937 | Hibiscus furcatus Roxb. | Malvaceae | Malvales | | | 0 |
603439 | Hibiscus | aculeatus | species | nil | Walter | accepted | 603439 | Hibiscus aculeatus Walter | Malvaceae | Malvales | | | 0 |
603934 | Hibiscus | furcatus | species | nil | Craib | unresolved | 604561 | Hibiscus rosa-sinensis L. | — | — | | | 1 |
603935 | Hibiscus | furcatus | species | nil | Harv. | synonym | 603462 | Hibiscus altissimus Hornby | Malvaceae | Malvales | | | 1 |
603936 | Hibiscus | furcatus | species | nil | Mullend. | synonym | 604338 | Hibiscus noldeae Baker.f. | Malvaceae | Malvales | | | 1 |
603937 | Hibiscus | furcatus | species | nil | Roxb. | accepted | 603937 | Hibiscus furcatus Roxb. | Malvaceae | Malvales | | | 1 |
603940 | Hibiscus | furcatus | species | nil | Wall. | synonym | 603937 | Hibiscus furcatus Roxb. | Malvaceae | Malvales | | | 1 |
603941 | Hibiscus | furcatus | species | nil | Willd. | synonym | 604573 | Hibiscus rostellatus Guill. & Perr. | Malvaceae | Malvales | | | 1 |
Users can search all plant taxon names listed in the “Leipzig Catalogue of Vascular Plants” (LCVP) by order, family, genus or author, using the lcvp_group_search function.
# Search by Genus
x <- lcvp_group_search(c("AA", "Adansonia"), search_by = "Genus")
head(x)
global.Id | Input.Genus | Input.Epitheton | Rank | Input.Subspecies.Epitheton | Input.Authors | Status | globalId.of.Output.Taxon | Output.Taxon | Family | Order | Literature | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Aa | argyrolepis | species | nil | (Rchb.f.) Rchb.f. | accepted | 1 | Aa argyrolepis (Rchb.f.) Rchb.f. | Orchidaceae | Asparagales | ||
2 | Aa | aurantiaca | species | nil | D.Trujillo | accepted | 2 | Aa aurantiaca D.Trujillo | Orchidaceae | Asparagales | Lankesteriana 2011.11.1 1-8; | |
3 | Aa | brevis | species | nil | Schltr. | synonym | 819078 | Myrosmodes breve (Schltr.) Garay | Orchidaceae | Asparagales | ||
4 | Aa | calceata | species | nil | (Rchb.f.) Schltr. | accepted | 4 | Aa calceata (Rchb.f.) Schltr. | Orchidaceae | Asparagales | ||
5 | Aa | chiogena | species | nil | Schltr. | synonym | 819080 | Myrosmodes chiogena (Schltr.) C.A.Vargas | Orchidaceae | Asparagales | ||
6 | Aa | colombiana | species | nil | Schltr. | accepted | 6 | Aa colombiana Schltr. | Orchidaceae | Asparagales | Ann Bot 2009.104.3 403-416; |
Users can choose to keep only certain taxonomic statuses: “accepted”, “synonym”, “unresolved”, and “external”.
# Search by Author and keep only accepted names
x <- lcvp_group_search("Schltr.", search_by = "Author", status = "accepted")
head(x)
Row | global.Id | Input.Genus | Input.Epitheton | Rank | Input.Subspecies.Epitheton | Input.Authors | Status | globalId.of.Output.Taxon | Output.Taxon | Family | Order | Literature | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 4 | Aa | calceata | species | nil | (Rchb.f.) Schltr. | accepted | 4 | Aa calceata (Rchb.f.) Schltr. | Orchidaceae | Asparagales | | |
4 | 6 | Aa | colombiana | species | nil | Schltr. | accepted | 6 | Aa colombiana Schltr. | Orchidaceae | Asparagales | Ann Bot 2009.104.3 403-416; | |
5 | 7 | Aa | denticulata | species | nil | Schltr. | accepted | 7 | Aa denticulata Schltr. | Orchidaceae | Asparagales | | |
6 | 8 | Aa | erosa | species | nil | (Rchb.f.) Schltr. | accepted | 8 | Aa erosa (Rchb.f.) Schltr. | Orchidaceae | Asparagales | | |
7 | 9 | Aa | fiebrigii | species | nil | (Schltr.) Schltr. | accepted | 9 | Aa fiebrigii (Schltr.) Schltr. | Orchidaceae | Asparagales | | |
9 | 14 | Aa | hieronymi | species | nil | (Cogn.) Schltr. | accepted | 14 | Aa hieronymi (Cogn.) Schltr. | Orchidaceae | Asparagales | | |
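
Conceptually, a group search is a filter on one column of the LCVP table followed by an optional filter on Status. The base R sketch below shows that idea on a toy excerpt; group_filter is a hypothetical helper, and the case-insensitive comparison is an assumption suggested by the "AA" query returning the genus Aa above.

```r
# Toy excerpt with values copied from the genus search output above
tab <- data.frame(
  Input.Genus = rep("Aa", 6),
  Input.Epitheton = c("argyrolepis", "aurantiaca", "brevis",
                      "calceata", "chiogena", "colombiana"),
  Status = c("accepted", "accepted", "synonym",
             "accepted", "synonym", "accepted"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: filter one column case-insensitively, then by status
group_filter <- function(tab, values, search_by, status = NULL) {
  keep <- toupper(tab[[search_by]]) %in% toupper(values)
  if (!is.null(status)) keep <- keep & tab$Status %in% status
  tab[keep, , drop = FALSE]
}

accepted <- group_filter(tab, "AA", search_by = "Input.Genus",
                         status = "accepted")
accepted
```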
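The output from lcvp_search, lcvp_fuzzy_search and lcvp_group_search is a data.frame (or a list of data.frames) with the columns shown in the tables above: global.Id, Input.Genus, Input.Epitheton, Rank, Input.Subspecies.Epitheton, Input.Authors, Status, globalId.of.Output.Taxon, Output.Taxon, Family, Order, Literature and Comments.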
If no match is found for a given species, the function returns NA in the LCVP columns for that species. If no match is found for any species, the function returns NULL with a warning message.
In many situations, researchers want to compare and match two lists of species names coming from different sources (e.g. phylogenies, spatial data, trait data, regional lists). The function lcvp_match can be used for that. It matches and compares two name lists based on the taxonomic resolution of plant taxon names listed in the LCVP.
# Generate two lists of species names
splist1 <- sample(apply(LCVP::tab_lcvp[2:10, 2:3], 1, paste, collapse = " "))
splist2 <- sample(apply(LCVP::tab_lcvp[11:3, 2:3], 1, paste, collapse = " "))
The output is ordered according to splist1, followed by the non-matched names in splist2.
# Match both lists
x <- lcvp_match(splist1, splist2)
head(x)
Row | Species.List.1 | Species.List.2 | global.Id | Input.Genus | Input.Epitheton | Rank | Input.Subspecies.Epitheton | Input.Authors | Status | globalId.of.Output.Taxon | Output.Taxon | Family | Order | Literature | Comments | Match.Position.2to1 | Duplicated.Output.Position |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | Aa figueroi | Aa figueroi | 10 | Aa | figueroi | species | nil | Szlach. & S.Nowak | accepted | 10 | Aa figueroi Szlach. & S.Nowak | Orchidaceae | Asparagales | | | 7 | NA |
8 | Aa erosa | Aa erosa | 8 | Aa | erosa | species | nil | (Rchb.f.) Schltr. | accepted | 8 | Aa erosa (Rchb.f.) Schltr. | Orchidaceae | Asparagales | | | 3 | NA |
5 | Aa chiogena | Aa chiogena | 5 | Aa | chiogena | species | nil | Schltr. | synonym | 819080 | Myrosmodes chiogena (Schltr.) C.A.Vargas | Orchidaceae | Asparagales | | | 8 | NA |
7 | Aa denticulata | Aa denticulata | 7 | Aa | denticulata | species | nil | Schltr. | accepted | 7 | Aa denticulata Schltr. | Orchidaceae | Asparagales | | | 5 | NA |
4 | Aa calceata | Aa calceata | 4 | Aa | calceata | species | nil | (Rchb.f.) Schltr. | accepted | 4 | Aa calceata (Rchb.f.) Schltr. | Orchidaceae | Asparagales | | | 6 | NA |
6 | Aa colombiana | Aa colombiana | 6 | Aa | colombiana | species | nil | Schltr. | accepted | 6 | Aa colombiana Schltr. | Orchidaceae | Asparagales | Ann Bot 2009.104.3 403-416; | | 2 | NA |
If include_all = FALSE, non-matched names in splist2 are not included. Users can then use the column Match.Position.2to1 to reorder splist2 to match splist1.
# Match both lists
matchLists <- lcvp_match(splist1, splist2, include_all = FALSE)
splist2_reordered <- splist2[matchLists$Match.Position.2to1]
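The reordering step relies on positional matching of resolved names. As a minimal sketch of that idea in base R, the example below resolves two short lists through a toy synonym lookup and then uses match() to find, for each entry of the first list, its position in the second. The lookup vector and the list1/list2 names are hypothetical; the synonym pair (Aa brevis, Myrosmodes breve) is taken from the tables in this vignette. This is not the lcvp_match implementation.

```r
# Toy synonym lookup: input name -> accepted name
accepted_name <- c(
  "Aa brevis"        = "Myrosmodes breve",
  "Myrosmodes breve" = "Myrosmodes breve",
  "Aa calceata"      = "Aa calceata",
  "Aa erosa"         = "Aa erosa"
)

list1 <- c("Aa calceata", "Aa brevis", "Aa erosa")
list2 <- c("Myrosmodes breve", "Aa erosa", "Aa calceata")

# Resolve both lists, then find where each list1 taxon sits in list2
resolved1 <- unname(accepted_name[list1])
resolved2 <- unname(accepted_name[list2])
pos_2to1 <- match(resolved1, resolved2)

# Reorder list2 so that element i refers to the same taxon as list1[i]
list2_reordered <- list2[pos_2to1]
data.frame(list1, list2_reordered, stringsAsFactors = FALSE)
```

Because the comparison is made on resolved names, "Aa brevis" in the first list lines up with "Myrosmodes breve" in the second even though the raw strings differ.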
The lcvp_join function provides a quicker way to join two tables based on the taxonomic resolution of plant taxon names listed in the LCVP. It is inspired by the dplyr::join functions. The function adds the columns from a first table to a second table based on the list of species names in both tables. It first standardizes the species names in both tables based on the “Leipzig Catalogue of Vascular Plants” (LCVP) using the algorithm in lcvp_search. The standardized names of both tables are then matched using the algorithm in lcvp_match. The argument type indicates the kind of join to perform: “full” keeps all species, “left” returns all rows from the first table and drops the non-matches from the second, “right” does the same for the second table, and “inner” keeps only matched species.
# Create data.frame1
splist1 <- sample(apply(LCVP::tab_lcvp[2:10, 2:3], 1, paste, collapse = " "))
tbl1 <-
  data.frame("Species" = splist1, "Trait1" = runif(length(splist1)))
# Create data.frame2
splist2 <- sample(apply(LCVP::tab_lcvp[11:3, 2:3], 1, paste, collapse = " "))
tbl2 <- data.frame(
  "Species" = splist2,
  "Trait2" = runif(length(splist2)),
  "Trait3" = runif(length(splist2)),
  "Trait4" = sample(c("a", "b"), length(splist2), replace = TRUE),
  "Trait5" = sample(c(TRUE, FALSE), length(splist2), replace = TRUE)
)
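# Join both tables on their species columns, keeping all species
x <- lcvp_join(tbl1, tbl2, c("Species", "Species"), type = "full")
head(x)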
Species.List.1 | Species.List.2 | global.Id | Input.Epitheton | Rank | Input.Subspecies.Epitheton | Input.Authors | Status | globalId.of.Output.Taxon | Output.Taxon | Family | Order | Literature | Comments | Trait1 | Trait2 | Trait3 | Trait4 | Trait5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Aa aurantiaca | NA | 2 | aurantiaca | species | nil | D.Trujillo | accepted | 2 | Aa aurantiaca D.Trujillo | Orchidaceae | Asparagales | Lankesteriana 2011.11.1 1-8; | | 0.5655017 | NA | NA | NA | NA |
Aa fiebrigii | Aa fiebrigii | 9 | fiebrigii | species | nil | (Schltr.) Schltr. | accepted | 9 | Aa fiebrigii (Schltr.) Schltr. | Orchidaceae | Asparagales | | | 0.5269464 | 0.5306969 | 0.3474194 | a | FALSE |
Aa figueroi | Aa figueroi | 10 | figueroi | species | nil | Szlach. & S.Nowak | accepted | 10 | Aa figueroi Szlach. & S.Nowak | Orchidaceae | Asparagales | | | 0.3552532 | 0.9947610 | 0.4730215 | b | FALSE |
Aa colombiana | Aa colombiana | 6 | colombiana | species | nil | Schltr. | accepted | 6 | Aa colombiana Schltr. | Orchidaceae | Asparagales | Ann Bot 2009.104.3 403-416; | | 0.8892958 | 0.9176120 | 0.6219212 | a | FALSE |
Aa brevis | Aa brevis | 3 | brevis | species | nil | Schltr. | synonym | 819078 | Myrosmodes breve (Schltr.) Garay | Orchidaceae | Asparagales | | | 0.5377264 | NA | NA | NA | NA |
Aa calceata | Aa calceata | 4 | calceata | species | nil | (Rchb.f.) Schltr. | accepted | 4 | Aa calceata (Rchb.f.) Schltr. | Orchidaceae | Asparagales | | | 0.2948948 | 0.6416213 | 0.7076298 | a | FALSE |
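
The four join types behave like the familiar relational joins. The sketch below maps them onto base::merge for two toy tables that are assumed to be already keyed by a standardized name; join_by_type, t1, t2 and the trait values are hypothetical and only illustrate which rows each type keeps, not how lcvp_join is implemented.

```r
# Toy tables keyed by an already standardized name (hypothetical values)
t1 <- data.frame(Taxon = c("Aa aurantiaca", "Aa calceata", "Aa erosa"),
                 Trait1 = c(0.5, 0.3, 0.8), stringsAsFactors = FALSE)
t2 <- data.frame(Taxon = c("Aa calceata", "Aa erosa", "Aa figueroi"),
                 Trait2 = c("a", "b", "a"), stringsAsFactors = FALSE)

# Hypothetical helper mapping the join type onto base::merge arguments
join_by_type <- function(x, y, by, type = c("full", "left", "right", "inner")) {
  type <- match.arg(type)
  merge(x, y, by = by,
        all.x = type %in% c("full", "left"),   # keep unmatched rows of x
        all.y = type %in% c("full", "right"))  # keep unmatched rows of y
}

# Number of rows kept by each join type
sapply(c("full", "left", "right", "inner"),
       function(tp) nrow(join_by_type(t1, t2, by = "Taxon", type = tp)))
```

With two shared taxa and one unique taxon per table, the full join keeps four rows, the left and right joins three each, and the inner join two.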
Because some names may turn out to be synonyms under the LCVP taxonomic resolution, users may opt to resolve duplicated species names by summarizing traits with functions provided for common classes of variables (numeric, character, and logical). This is done by setting solve_duplicated = TRUE; the algorithm then combines duplicated output names (Output.Taxon column) using user-defined functions to summarize the information.
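# Join both tables and combine rows that share the same output taxon
x <- lcvp_join(tbl1, tbl2, c("Species", "Species"), solve_duplicated = TRUE)
head(x)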
Species.List.1 | Species.List.2 | global.Id | Input.Epitheton | Rank | Input.Subspecies.Epitheton | Input.Authors | Status | globalId.of.Output.Taxon | Output.Taxon | Family | Order | Literature | Comments | Trait1 | Trait2 | Trait3 | Trait4 | Trait5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Aa aurantiaca | NA | 2 | aurantiaca | species | nil | D.Trujillo | accepted | 2 | Aa aurantiaca D.Trujillo | Orchidaceae | Asparagales | Lankesteriana 2011.11.1 1-8; | | 0.5655017 | NA | NA | NA | NA |
Aa fiebrigii | Aa fiebrigii | 9 | fiebrigii | species | nil | (Schltr.) Schltr. | accepted | 9 | Aa fiebrigii (Schltr.) Schltr. | Orchidaceae | Asparagales | | | 0.5269464 | 0.5306969 | 0.3474194 | a | FALSE |
Aa figueroi | Aa figueroi | 10 | figueroi | species | nil | Szlach. & S.Nowak | accepted | 10 | Aa figueroi Szlach. & S.Nowak | Orchidaceae | Asparagales | | | 0.3552532 | 0.9947610 | 0.4730215 | b | FALSE |
Aa colombiana | Aa colombiana | 6 | colombiana | species | nil | Schltr. | accepted | 6 | Aa colombiana Schltr. | Orchidaceae | Asparagales | Ann Bot 2009.104.3 403-416; | | 0.8892958 | 0.9176120 | 0.6219212 | a | FALSE |
Aa brevis | Aa brevis | 3 | brevis | species | nil | Schltr. | synonym | 819078 | Myrosmodes breve (Schltr.) Garay | Orchidaceae | Asparagales | | | 0.5377264 | NA | NA | NA | NA |
Aa calceata | Aa calceata | 4 | calceata | species | nil | (Rchb.f.) Schltr. | accepted | 4 | Aa calceata (Rchb.f.) Schltr. | Orchidaceae | Asparagales | | | 0.2948948 | 0.6416213 | 0.7076298 | a | FALSE |
Because some users may want to resolve duplicated names outside the lcvp_join function, the algorithm is also available as a standalone function named lcvp_solve_dups.
# Create a data.frame with duplicated names and different traits
splist <- sample(apply(LCVP::tab_lcvp[1:100, 2:3], 1, paste, collapse = " "))
search <- lcvp_search(splist)
tbl <- data.frame("Species" = search$Output.Taxon,
                  "Trait1" = runif(length(splist)),
                  "Trait2" = sample(c("a", "b"), length(splist), replace = TRUE),
                  "Trait3" = sample(c(TRUE, FALSE), length(splist), replace = TRUE))
# Solve with default parameters
x <- lcvp_solve_dups(tbl, 1)
head(x)
Species | Trait1 | Trait2 | Trait3 |
---|---|---|---|
Triplophyllum varians (T.Moore) Holttum | 0.9759653 | b | FALSE |
Abacopteris nudata (Roxb. ex Griff.) S.E.Fawc. & A.R.Sm. | 0.3822746 | a, b | TRUE |
Aa maderoi Schltr. | 0.3413011 | b | FALSE |
Breynia brevipes (Müll.Arg.) Chakrab. & N.P.Balakr. | 0.1605878 | b | FALSE |
Breynia androgyna (L.) Chakrab. & N.P.Balakr. | 0.2620146 | a, b | TRUE |
Abacopteris insularis K.Iwats. | 0.5997044 | a | FALSE |
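
Resolving duplicates amounts to grouping rows by species name and collapsing each column with a function suited to its class. The base R sketch below shows one way to do this. The helpers solve_dups and summarise_col are hypothetical, the trait values are made up, and the per-class rules (mean for numeric, pasted unique values for character, any for logical) are assumptions chosen to resemble the output shown above, not confirmed defaults of lcvp_solve_dups.

```r
# Toy table where two rows resolve to the same output taxon (made-up traits)
tbl_dup <- data.frame(
  Species = c("Myrosmodes breve", "Myrosmodes breve", "Aa calceata"),
  Trait1 = c(0.2, 0.4, 0.9),
  Trait2 = c("a", "b", "a"),
  Trait3 = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Assumed summary rule per column class
summarise_col <- function(v) {
  if (is.numeric(v)) return(mean(v))
  if (is.logical(v)) return(any(v))
  paste(sort(unique(v)), collapse = ", ")
}

# Hypothetical helper: one output row per distinct species name
solve_dups <- function(tbl, sp_col = 1) {
  groups <- split(tbl, tbl[[sp_col]])
  rows <- lapply(groups, function(g) {
    as.data.frame(lapply(g, summarise_col), stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

solved <- solve_dups(tbl_dup)
solved
```

The two "Myrosmodes breve" rows collapse into one, with the numeric trait averaged, the character trait listing both values, and the logical trait set to TRUE because at least one input row was TRUE.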
Some users may want to use lcvplants functions to perform taxonomic harmonization for many thousands of species and wish to use the entire computational capacity available to speed up the processing time. Check this article, where we show how to run lcvp_search or lcvp_fuzzy_search in parallel in an efficient way to reduce computational time.
Freiberg, M., Winter, M., Gentile, A. et al. LCVP, The Leipzig catalogue of vascular plants, a new taxonomic reference list for all known vascular plants. Sci Data 7, 416 (2020). https://doi.org/10.1038/s41597-020-00702-z