Background

When comparing or merging any list of plant species, for instance for compiling regional species lists, merging occurrence lists with trait data or phylogenies, the taxonomic names must be matched to avoid artificial inflation due to synonyms, data loss or erroneous matches due to homonyms. The lcvplants package facilitates this process by automatizing large-scale taxonomic harmonization of plant names by fuzzy matching and synonymy resolution against the Leipzig Catalogue of Vascular Plants as taxonomic backbone.

The Leipzig Catalogue of Vascular Plants (LCVP) is a novel global taxonomic backbone, updating The Plant List, comprising more than 1,300,000 names and 350,000 accepted taxa names. We described the LCVP in detail in the related scientific publication (Freiberg et al, 2020).

Running lcvplants

Installation

You can install lcvplants from github using the devtools package You may need to install devtools first, so if you do not have it installed, run the following code.

install.packages("devtools")

To use lcvplants you also need the data of the LCVP package, which you can install in the same way.

devtools::install_github("idiv-biodiversity/lcvplants")
devtools::install_github("idiv-biodiversity/LCVP")

Taxonomic resolution

The basic function in lcvplants is lcvp_search. It allows users to input a list of species name that will be matched to the names in the LCVP data. It returns a data.frame indicating the full name matched — including the authority name and removing possible orthographic errors —, the taxonomic status, the accepted name, family and order.

lcvp_search("Hibiscus vitifolius")
Search global.Id Input.Genus Input.Epitheton Rank Input.Subspecies.Epitheton Input.Authors Status globalId.of.Output.Taxon Output.Taxon Family Order Literature Comments
Hibiscus vitifolius 604882 Hibiscus vitifolius species nil L. accepted 604882 Hibiscus vitifolius L. Malvaceae Malvales

The lcvp_search algorithm will first try to exactly match the binomial names provided by the user. If no match is found, it will try to find the closest name given the maximal transformation cost defined in the argument max_distance (see ?base::agrep for more details). When the argument show_correct = TRUE a column is added to the final result indicating whether the binomial name was exactly matched (TRUE), or if it is misspelled (FALSE).

lcvp_search("Hibiscus vitifoliuse", 
            max_distance = 0.1, 
            show_correct = TRUE)
Search global.Id Input.Genus Input.Epitheton Rank Input.Subspecies.Epitheton Input.Authors Status globalId.of.Output.Taxon Output.Taxon Family Order Literature Comments Correct
Hibiscus vitifoliuse 604882 Hibiscus vitifolius species nil L. accepted 604882 Hibiscus vitifolius L. Malvaceae Malvales FALSE

If more than one name is fuzzy matched (more than one closest match found), only the accepted or the first name will be returned. The function lcvp_fuzzy_search can be used to return all the closest results when argument keep_closest = TRUE, or all names within the max_distance defined when keep_closest = FALSE.

lcvp_fuzzy_search("Hibiscus vitifoliuse", 
                  max_distance = 0.1, 
                  keep_closest = TRUE)
global.Id Input.Genus Input.Epitheton Rank Input.Subspecies.Epitheton Input.Authors Status globalId.of.Output.Taxon Output.Taxon Family Order Literature Comments Name.Distance
604882 Hibiscus vitifolius species nil L. accepted 604882 Hibiscus vitifolius L. Malvaceae Malvales 1
604883 Hibiscus vitifolius species nil Mill. synonym 603660 Hibiscus cannabinus L. Malvaceae Malvales 1

Both lcvp_search and lcvp_fuzzy_search allows multiple species search. If you need to run for many species consider setting the argument progress_bar = TRUE to track the execution progress.

splist <- c(
  "Hibiscus abelmoschus var. betulifolius Mast.",
  "Hibiscus abutiloides Willd.",
  "Hibiscus aculeatus",
  "Hibiscus acuminatus",
  "Hibiscus furcatuis" 
)
x <- lcvp_search(splist, max_distance = 0.2, show_correct = TRUE)
x
Search global.Id Input.Genus Input.Epitheton Rank Input.Subspecies.Epitheton Input.Authors Status globalId.of.Output.Taxon Output.Taxon Family Order Literature Comments Correct
Hibiscus abelmoschus var. betulifolius Mast. 603424 Hibiscus abelmoschus var. betulifolius Mast. synonym 518 Abelmoschus moschatus Medik. Malvaceae Malvales TRUE
Hibiscus abutiloides Willd. 603426 Hibiscus abutiloides species nil Willd. synonym 1211289 Talipariti tiliaceum (L.) Fryxell Malvaceae Malvales TRUE
Hibiscus aculeatus 603439 Hibiscus aculeatus species nil Walter accepted 603439 Hibiscus aculeatus Walter Malvaceae Malvales TRUE
Hibiscus acuminatus 603440 Hibiscus acuminatus species nil Cav. synonym 683056 Kosteletzkya acuminata (Cav.) Britten Malvaceae Malvales TRUE
Hibiscus furcatuis 603937 Hibiscus furcatus species nil Roxb. accepted 603937 Hibiscus furcatus Roxb. Malvaceae Malvales FALSE

The function lcvp_summary gives a report on the searching results using the lcvp_search function. Indicating the number of species searched, how many were matched. Among the matched names, it indicates how many were exactly or fuzzy matched. Then it checks how many author and infracategory names were exactly matched. Note that if authors or infracategory is not provided, it will be considered a no match.

lcvp_summary(x)
#> Species searched: 5
#> Species matched: 5 (100%)
#> Species exactly matched: 4 (80%)
#> Species fuzzy matched: 1 (20%)
#> Authors exactly matched: 2 (40%)
#> Infracategories exactly matched: 5 (100%)

Notice that the warning message indicated that more than one name was matched for some species. You can easily access these species using the following code.

sps_mult <- attr(x, "matched_mult")
sps_mult
#> [1] "Hibiscus aculeatus" "Hibiscus furcatuis"

And you can quickly use the lcvp_fuzzy_search on them.

global.Id Input.Genus Input.Epitheton Rank Input.Subspecies.Epitheton Input.Authors Status globalId.of.Output.Taxon Output.Taxon Family Order Literature Comments Name.Distance
603436 Hibiscus aculeatus species nil F.Dietr. synonym 604041 Hibiscus heterophyllus Vent. Malvaceae Malvales 0
603437 Hibiscus aculeatus species nil G.Don synonym 604734 Hibiscus surattensis L. Malvaceae Malvales 0
603438 Hibiscus aculeatus species nil Roxb. synonym 603937 Hibiscus furcatus Roxb. Malvaceae Malvales 0
603439 Hibiscus aculeatus species nil Walter accepted 603439 Hibiscus aculeatus Walter Malvaceae Malvales 0
603934 Hibiscus furcatus species nil Craib unresolved 604561 Hibiscus rosa-sinensis L. 1
603935 Hibiscus furcatus species nil Harv. synonym 603462 Hibiscus altissimus Hornby Malvaceae Malvales 1
603936 Hibiscus furcatus species nil Mullend. synonym 604338 Hibiscus noldeae Baker.f. Malvaceae Malvales 1
603937 Hibiscus furcatus species nil Roxb. accepted 603937 Hibiscus furcatus Roxb. Malvaceae Malvales 1
603940 Hibiscus furcatus species nil Wall. synonym 603937 Hibiscus furcatus Roxb. Malvaceae Malvales 1
603941 Hibiscus furcatus species nil Willd. synonym 604573 Hibiscus rostellatus Guill. & Perr. Malvaceae Malvales 1

Search all vascular plant names of a Genus, Family, Order or Author

Users can search all plant taxa names listed in the “Leipzig Catalogue of Vascular Plants” (LCVP) by order, family, genus or author, using the lcvp_group_search function.

# Search by Genus
x <- lcvp_group_search(c("AA", "Adansonia"), search_by = "Genus")
head(x)
global.Id Input.Genus Input.Epitheton Rank Input.Subspecies.Epitheton Input.Authors Status globalId.of.Output.Taxon Output.Taxon Family Order Literature Comments
1 Aa argyrolepis species nil (Rchb.f.) Rchb.f. accepted 1 Aa argyrolepis (Rchb.f.) Rchb.f. Orchidaceae Asparagales
2 Aa aurantiaca species nil D.Trujillo accepted 2 Aa aurantiaca D.Trujillo Orchidaceae Asparagales Lankesteriana 2011.11.1 1-8;
3 Aa brevis species nil Schltr. synonym 819078 Myrosmodes breve (Schltr.) Garay Orchidaceae Asparagales
4 Aa calceata species nil (Rchb.f.) Schltr. accepted 4 Aa calceata (Rchb.f.) Schltr. Orchidaceae Asparagales
5 Aa chiogena species nil Schltr. synonym 819080 Myrosmodes chiogena (Schltr.) C.A.Vargas Orchidaceae Asparagales
6 Aa colombiana species nil Schltr. accepted 6 Aa colombiana Schltr. Orchidaceae Asparagales Ann Bot 2009.104.3 403-416;

Users can choose to keep only certain taxonomic stauts, this includes “accepted”, “synonym”, “unresolved”, and “external”.

# Search by Author and keep only accepted names
x <- lcvp_group_search("Schltr.", search_by = "Author", status = "accepted")
head(x)
global.Id Input.Genus Input.Epitheton Rank Input.Subspecies.Epitheton Input.Authors Status globalId.of.Output.Taxon Output.Taxon Family Order Literature Comments
2 4 Aa calceata species nil (Rchb.f.) Schltr. accepted 4 Aa calceata (Rchb.f.) Schltr. Orchidaceae Asparagales
4 6 Aa colombiana species nil Schltr. accepted 6 Aa colombiana Schltr. Orchidaceae Asparagales Ann Bot 2009.104.3 403-416;
5 7 Aa denticulata species nil Schltr. accepted 7 Aa denticulata Schltr. Orchidaceae Asparagales
6 8 Aa erosa species nil (Rchb.f.) Schltr. accepted 8 Aa erosa (Rchb.f.) Schltr. Orchidaceae Asparagales
7 9 Aa fiebrigii species nil (Schltr.) Schltr. accepted 9 Aa fiebrigii (Schltr.) Schltr. Orchidaceae Asparagales
9 14 Aa hieronymi species nil (Cogn.) Schltr. accepted 14 Aa hieronymi (Cogn.) Schltr. Orchidaceae Asparagales

Basic output information

The output from lcvp_search, lcvp_fuzzy_search and lcvp_group_search is a data.frame (or list of data.frames) with the following columns:

  • global.Id: The fixed species id of the input taxon in the Leipzig Catalogue of Vascular Plants (LCVP).
  • Input.Genus: A character vector. The input genus of the corresponding vascular plant species name listed in LCVP.
  • Input.Epitheton: A character vector. The input epitheton of the corresponding vascular plant species name listed in LCVP.
  • Rank: A character vector. The taxonomic rank (“species”, subspecies: “subsp.”, variety: “var.”, subvariety: “subvar.”, “forma”, or subforma: “subf.”) of the corresponding vascular plant species name listed in LCVP.
  • Input.Subspecies.Epitheton: A character vector. If the indicated rank is below species, the subspecies epitheton input of the corresponding vascular plant species name listed in LCVP. If the rank is “species”, the input is “nil”.
  • Input.Authors: A character vector. The taxonomic authority input of the corresponding vascular plant species name listed in LCVP.
  • Status: A character vector. description if a taxon is classified as ‘valid’, ‘synonym’, ‘unresolved’, ‘external’ or ‘blanks’. The ‘unresolved’ rank means that the status of the plant name could be either valid or synonym, but the information available does not allow a definitive decision. ‘External’ in an extra rank which lists names outside the scope of this publication but useful to keep on this updated list. ‘Blanks’ means that the respective name exists in bibliography but it is neither clear where it came from valid, synonym or unresolved. (see the main text Freiberg et al. for more details)
  • globalId.of.Output.Taxon: The fixed species id of the output taxon in LCVP.
  • Output.Taxon: A character vector. The list of the accepted plant taxa names according to the LCVP.
  • Family: A character vector. The corresponding family name of the Input.Taxon, staying empty if the Status is unresolved.
  • Order: A character vector. The corresponding order name of the Input.Taxon, staying empty if the Status is unresolved.
  • Literature: A character vector. The bibliography used.
  • Comments: A character vector. Further taxonomic comments.

If no match is found for one species it will return NA for the columns in the LCVP table. But, if no match is found for all species the function will return NULL and a warning message.

Matching and joining two lists of vascular plant names

Match

In many situations, researchers want to compare and match two lists of species name coming from different sources (e.g. phylogenies, spatial data, trait data, regional lists). The function lcvp_match can be used for that. It matches and compares two name lists based on the taxonomic resolution of plant taxa names listed in the LCVP.

# Generate two lists of species name
splist1 <- sample(apply(LCVP::tab_lcvp[2:10, 2:3], 1, paste, collapse = " "))
splist2 <- sample(apply(LCVP::tab_lcvp[11:3, 2:3], 1, paste, collapse = " "))

Ordered is based on the splist1, and followed by non-matched names in splist2.

# Match both lists
x <- lcvp_match(splist1, splist2)
head(x)
Species.List.1 Species.List.2 global.Id Input.Genus Input.Epitheton Rank Input.Subspecies.Epitheton Input.Authors Status globalId.of.Output.Taxon Output.Taxon Family Order Literature Comments Match.Position.2to1 Duplicated.Output.Position
10 Aa figueroi Aa figueroi 10 Aa figueroi species nil Szlach. & S.Nowak accepted 10 Aa figueroi Szlach. & S.Nowak Orchidaceae Asparagales 7 NA
8 Aa erosa Aa erosa 8 Aa erosa species nil (Rchb.f.) Schltr. accepted 8 Aa erosa (Rchb.f.) Schltr. Orchidaceae Asparagales 3 NA
5 Aa chiogena Aa chiogena 5 Aa chiogena species nil Schltr. synonym 819080 Myrosmodes chiogena (Schltr.) C.A.Vargas Orchidaceae Asparagales 8 NA
7 Aa denticulata Aa denticulata 7 Aa denticulata species nil Schltr. accepted 7 Aa denticulata Schltr. Orchidaceae Asparagales 5 NA
4 Aa calceata Aa calceata 4 Aa calceata species nil (Rchb.f.) Schltr. accepted 4 Aa calceata (Rchb.f.) Schltr. Orchidaceae Asparagales 6 NA
6 Aa colombiana Aa colombiana 6 Aa colombiana species nil Schltr. accepted 6 Aa colombiana Schltr. Orchidaceae Asparagales Ann Bot 2009.104.3 403-416; 2 NA

If include_all = FALSE, non-matched names in splist2 are not included. And users can use the column Match.Position.2to1 to reorder splist2 to match splist1.

# Match both lists
matchLists <- lcvp_match(splist1, splist2, include_all = FALSE)
splist2_reordered <- splist2[matchLists$Match.Position.2to1]

Join

The lcvp_join function provides a quicker way to join two tables based on the taxonomic resolution of plant taxa names listed in the LCVP. It is inspired by the dplyr::join function. The function add the columns from a first table to a second table based on the list of species name in both tables. It first standardizes the species names in both tables based on the “Leipzig Catalogue of Vascular Plants” (LCVP) using the algorithm in lcvp_search. These standardized names of both tables are then matched using the algorithm in lcvp_match. The argument type indicates the kind of join to be done. The option “full” join will keep all species, “left” return all rows from the first table and drops the non-matches from the second, “right” do the same for the second table, and “inner” keep only matched species.

# Create data.frame1
splist1 <- sample(apply(LCVP::tab_lcvp[2:10, 2:3], 1, paste, collapse = " "))
tbl1 <-
  data.frame("Species" = splist1, "Trait1" = runif(length(splist1)))

# Create data.frame2
splist2 <- sample(apply(LCVP::tab_lcvp[11:3, 2:3], 1, paste, collapse = " "))
tbl2 <- data.frame(
  "Species" = splist2,
  "Trait2" = runif(length(splist2)),
  "Trait3" = runif(length(splist2)),
  "Trait4" = sample(c("a", "b"), length(splist2), replace = TRUE),
  "Trait5" = sample(c(TRUE, FALSE), length(splist2), replace = TRUE)
)
# Full join the two tables
x <- lcvp_join(tbl1, tbl2, c("Species", "Species"), type = "full")
head(x)
Species.List.1 Species.List.2 global.Id Input.Epitheton Rank Input.Subspecies.Epitheton Input.Authors Status globalId.of.Output.Taxon Output.Taxon Family Order Literature Comments Trait1 Trait2 Trait3 Trait4 Trait5
Aa aurantiaca NA 2 aurantiaca species nil D.Trujillo accepted 2 Aa aurantiaca D.Trujillo Orchidaceae Asparagales Lankesteriana 2011.11.1 1-8; 0.5655017 NA NA NA NA
Aa fiebrigii Aa fiebrigii 9 fiebrigii species nil (Schltr.) Schltr. accepted 9 Aa fiebrigii (Schltr.) Schltr. Orchidaceae Asparagales 0.5269464 0.5306969 0.3474194 a FALSE
Aa figueroi Aa figueroi 10 figueroi species nil Szlach. & S.Nowak accepted 10 Aa figueroi Szlach. & S.Nowak Orchidaceae Asparagales 0.3552532 0.9947610 0.4730215 b FALSE
Aa colombiana Aa colombiana 6 colombiana species nil Schltr. accepted 6 Aa colombiana Schltr. Orchidaceae Asparagales Ann Bot 2009.104.3 403-416; 0.8892958 0.9176120 0.6219212 a FALSE
Aa brevis Aa brevis 3 brevis species nil Schltr. synonym 819078 Myrosmodes breve (Schltr.) Garay Orchidaceae Asparagales 0.5377264 NA NA NA NA
Aa calceata Aa calceata 4 calceata species nil (Rchb.f.) Schltr. accepted 4 Aa calceata (Rchb.f.) Schltr. Orchidaceae Asparagales 0.2948948 0.6416213 0.7076298 a FALSE

Because some names may turn out to be synonyms based on the LCVP taxonomic resolution, users may opt to solve duplicated species names by summarizing traits given provided functions for common classes of variables (numeric, character, and logical). This can be done by turning the option solve_duplicated = TRUE, and the algorithm will combine duplicated output names (Output.Taxon column) based on users-defined functions to summarize the information.

x <- lcvp_join(tbl1, tbl2, c("Species", "Species"), type = "full",
               solve_duplicated = TRUE)
head(x)
Species.List.1 Species.List.2 global.Id Input.Epitheton Rank Input.Subspecies.Epitheton Input.Authors Status globalId.of.Output.Taxon Output.Taxon Family Order Literature Comments Trait1 Trait2 Trait3 Trait4 Trait5
Aa aurantiaca NA 2 aurantiaca species nil D.Trujillo accepted 2 Aa aurantiaca D.Trujillo Orchidaceae Asparagales Lankesteriana 2011.11.1 1-8; 0.5655017 NA NA NA NA
Aa fiebrigii Aa fiebrigii 9 fiebrigii species nil (Schltr.) Schltr. accepted 9 Aa fiebrigii (Schltr.) Schltr. Orchidaceae Asparagales 0.5269464 0.5306969 0.3474194 a FALSE
Aa figueroi Aa figueroi 10 figueroi species nil Szlach. & S.Nowak accepted 10 Aa figueroi Szlach. & S.Nowak Orchidaceae Asparagales 0.3552532 0.9947610 0.4730215 b FALSE
Aa colombiana Aa colombiana 6 colombiana species nil Schltr. accepted 6 Aa colombiana Schltr. Orchidaceae Asparagales Ann Bot 2009.104.3 403-416; 0.8892958 0.9176120 0.6219212 a FALSE
Aa brevis Aa brevis 3 brevis species nil Schltr. synonym 819078 Myrosmodes breve (Schltr.) Garay Orchidaceae Asparagales 0.5377264 NA NA NA NA
Aa calceata Aa calceata 4 calceata species nil (Rchb.f.) Schltr. accepted 4 Aa calceata (Rchb.f.) Schltr. Orchidaceae Asparagales 0.2948948 0.6416213 0.7076298 a FALSE

Because some users may want to solve duplicated names outside the lcvp_join function, the algorithm is available as an individual function names lcvp_solve_dups.

# Create a data.frame with duplicated names and different traits
splist <- sample(apply(LCVP::tab_lcvp[1:100, 2:3], 1, paste, collapse = " "))
search <- lcvp_search(splist)

tbl <- data.frame("Species" = search$Output.Taxon,
"Trait1" = runif(length(splist)),
"Trait2" = sample(c("a", "b"), length(splist), replace = TRUE),
"Trait3" = sample(c(TRUE, FALSE), length(splist), replace = TRUE))
# Solve with default parameters
x <- lcvp_solve_dups(tbl, 1)
head(x)
Species Trait1 Trait2 Trait3
Triplophyllum varians (T.Moore) Holttum 0.9759653 b FALSE
Abacopteris nudata (Roxb. ex Griff.) S.E.Fawc. & A.R.Sm. 0.3822746 a, b TRUE
Aa maderoi Schltr. 0.3413011 b FALSE
Breynia brevipes (Müll.Arg.) Chakrab. & N.P.Balakr. 0.1605878 b FALSE
Breynia androgyna (L.) Chakrab. & N.P.Balakr. 0.2620146 a, b TRUE
Abacopteris insularis K.Iwats. 0.5997044 a FALSE

Parallel programing with lcvplants

Some users may want to use lcvplants functions to perform taxonomic harmonization for many thousands of species and wish to use the entire computational capacity available to speed up the processing time. Check this article, where we show how to run lcvp_search or lcvp_fuzzy_search in parallel in an efficient way to reduce computational time.

References

Freiberg, M., Winter, M., Gentile, A. et al. LCVP, The Leipzig catalogue of vascular plants, a new taxonomic reference list for all known vascular plants. Sci Data 7, 416 (2020). https://doi.org/10.1038/s41597-020-00702-z