This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.
Load libraries:
library(tidyverse) # To do data science
library(magrittr) # To use %<>% pipes
library(here) # To find files
library(janitor) # To clean input data
library(readxl) # To read Excel files
library(digest) # To generate hashes
library(rgbif) # To use GBIF services
Set file paths (all paths should be relative to this script):
# Raw files:
raw_data_file = "../data/raw/checklist-of-rusts.tsv"
# Processed files:
dwc_taxon_file = "../data/processed/taxon.csv"
dwc_distribution_file = "../data/processed/distribution.csv"
dwc_relationship_file = "../data/processed/resourcerelationship.csv"
Create a data frame raw_data
from the source data:
# Read the source data:
raw_data <- read_delim(raw_data_file, delim = "\t")
Clean the data somewhat: remove empty rows if present
raw_data %<>% remove_empty_rows() # Remove empty rows
The checklist includes information about rust fungi (scientificName
) and their associated host plants (hostPlant
). Although the publication of belgian rust fungi is our central focus, it is also informative to include the the host taxa in the taxon core. For this, we need to extract the host plant information from hostPlant
in raw data
and we add them to the checklist under scientificName
.
Extract the scientific names from all species in raw_data
(including the scientific names of the rust fungi as an intermediate step. We need this information later for the mapping of the Resource Relationship extension, see paragraph 5):
host_plant <- raw_data %>% select(scientificName, hostPlant)
Split hostPlant
on scientific name, using “,” as a separator:
host_plant %<>% separate(
hostPlant,
sep = ", ",
into = c(paste("species", c(1:20), sep = "_")),
fill = "right")
Gather all host species in one row (omitting “NA”):
host_plant %<>% gather("interaction", "scientificName_host", starts_with("species"), na.rm = TRUE)
Remove leading whitespaces:
host_plant %<>% mutate(scientificName_host = str_trim(scientificName_host))
This dataframe gives a nice structured overview of all interactions between the host plant and the parasite. We clean it and we save the information in a new dataframe interactions
, as we will need this to construct the Resource Relationship extension (see paragraph 5):
interactions <- host_plant %<>% select(-interaction) %<>%
rename("scientificName_parasite" = "scientificName")
Select unique scientificNames:
host_plant %<>% distinct(scientificName_host) %<>%
arrange(scientificName_host)
Rename scientificName_host
to enable binding by rows:
host_plant %<>% rename(scientificName = scientificName_host)
Add a column resourceRelation
to raw_data
and host_plant
to indicate whether the species in scientificName
is a parasite or a host:
raw_data %<>% mutate(resourceRelation = "parasite")
host_plant %<>% mutate(resourceRelation = "host")
Bind raw_data
and host_plant
by rows:
raw_data %<>% bind_rows(host_plant)
To uniquely identify a taxon in the Taxon Core and reference taxa in the Extensions, we need a taxonID
. Since we need it in all generated files, we generate it here in the raw data frame. It is a combination of dataset-shortname:taxon:
and a hash based on the scientific name. As long as the scientific name doesn’t change, the ID will be stable:
# Vectorize the digest function (The digest() function isn't vectorized. So if you pass in a vector, you get one value for the whole vector rather than a digest for each element of the vector):
vdigest <- Vectorize(digest)
# Generate taxonID:
raw_data %<>% mutate(taxon_id = paste("uredinales-belgium-checklist", "taxon", vdigest(scientificName, algo="md5"), sep=":"))
Preview data:
head(raw_data)
This checklist is a compilation of three published volumes of the Catalogue des Uredinales de Belgique (The rust fungi of Belgium)
. For each rust fungus, a volume number (part
) and a page number (page
) is provided. This provides a concise description of the species, most of the information is about taxonomy, observations and the hosts. We need to integrate these citations in both the taxon core and the extensions. So, we add bibliographicCitation
to raw_data
.
First, we complement the volume number with the full reference (given in the metadata). For this, we create the data frame full_reference
:
To join both data frames, raw_part
in full_reference
should be an integer:
full_reference %<>% mutate(part = as.integer(part))
Join both data frames:
raw_data <- left_join(raw_data, full_reference, by = "part")
For each rust fungus taxon, the bibliographicCitation
will contain the full reference with page number. For all host species, we refer to “Vanderweyen, A., & Fraiture, A. (2009, 2012) Catalogue des Uredinales de Belgique” (excluding volumne and page number). This is because the complete reference (with volumne and page number) applies to the rust fungus, not the host plant.
raw_data %<>% mutate(bibliographicCitation = case_when(
resourceRelation == "parasite" ~ paste(full_reference, "Page:", page),
resourceRelation == "host" ~ "Vanderweyen A & Fraiture A (2009 & 2012) Catalogue des Uredinales de Belgique. Lejeunia, Revue de Botanique."))
Remove full_reference
:
raw_data %<>% select(-full_reference)
Add prefix raw_
to all column names in raw_data
to avoid name clashes with Darwin Core terms:
colnames(raw_data) <- paste0("raw_", colnames(raw_data))
Preview data:
raw_data %>% head()
taxon <- raw_data
raw_scientificName
includes information about the genus, specific epithet, infraspecific epithet, (bracket)authorship and taxonRank. The full scientific name will be mapped under scientificName
in the taxon core. For enhanced readability, we parse raw_scientificName
into its different components using the parsenames() function from rgbif.
parsed_names <- parsenames(taxon $ raw_scientificName)
Overview of the different compenents:
head(parsed_names)
Select the columns needed for further mapping in the taxon core, i.e. genusorabove
, specificepithet
, authorship
, bracketauthorship
, rankmarker
and infraspecificepithet
(these will be mapped to, respectively, genus
, specificEpithet
, scientificNameAuthorship
, scientificNameAuthorship
, taxonRank
and infraspecificEpithet
). We also select scientificname
as we need this to merge parsed_names
with scientificName
in the Taxon Core.
parsed_names %<>% select(scientificname, genusorabove, specificepithet, authorship, bracketauthorship, rankmarker, infraspecificepithet)
The parsenames() function from rgbif does not identify all taxonomic ranks. We need to manually add all taxonomic ranks for which the rankmarker was left empty:
parsed_names %>% select(scientificname, rankmarker) %>%
filter(is.na(rankmarker))
Almost all these taxa are genera, only one taxon (Salix fragilis x Salix pendandra
) is a hybrid. We manually add these taxonomic ranks to parsed_names
:
parsed_names %<>% mutate(rankmarker = case_when(
is.na(rankmarker) & scientificname != "Salix fragilis x Salix pentandra" ~ "genus",
scientificname == "Salix fragilis x Salix pentandra" ~ "hybrid",
TRUE ~ rankmarker))
Some columns contain NA’s. We replace these by blanks:
parsed_names %<>% replace_na(list(
specificepithet = "",
infraspecificepithet = "",
authorship = "",
bracketauthorship = ""))
Add the prefix pn_
to the column names of parsed_names
(to remove these columns after the mapping process):
colnames(parsed_names) <- paste0("pn_", colnames(parsed_names))
Amount of species in the Taxon Core before merge:
## [1] 755
Merge parsed_names
with the Taxon Core:
taxon <- inner_join(taxon, parsed_names, by = c("raw_scientificName" = "pn_scientificname"))
Is the amount of species in the Taxon Core the same as before the merge (should be TRUE)?
## [1] TRUE
Map the source data to Darwin Core Taxon:
taxon %<>% mutate(language = "en")
taxon %<>% mutate(license = "http://creativecommons.org/publicdomain/zero/1.0/")
taxon %<>% mutate(rightsHolder = "Botanical Garden Meise")
taxon %<>% mutate(bibliographicCitation = raw_bibliographicCitation)
taxon %<>% mutate(institutionID = "http://biocol.org/urn:lsid:biocol.org:col:15605")
taxon %<>% mutate(datasetID = "https://doi.org/10.15468/2dboyn")
taxon %<>% mutate(datasetName = "Catalogue of the Rust Fungi of Belgium")
taxon %<>% mutate(taxonID = raw_taxon_id)
taxon %<>% mutate(scientificName = raw_scientificName)
taxon %<>% mutate(kingdom = case_when(
raw_resourceRelation == "parasite" ~ "Fungi",
raw_resourceRelation == "host" ~ "Plantae"))
taxon %<>% mutate(phylum = case_when(
raw_resourceRelation == "parasite" ~ "Basidiomycota",
raw_resourceRelation == "host" ~ ""))
taxon %<>% mutate(order = case_when(
raw_resourceRelation == "parasite" ~ "Uredinales",
raw_resourceRelation == "host" ~ ""))
taxon %<>% mutate(family = raw_family) %<>% replace_na(replace = list(family = "")) #' Remove NA's for host species
taxon %<>% mutate(genus = pn_genusorabove) %<>% replace_na(replace = list(genus = "")) #' Remove NA's for host species
taxon %<>% mutate(specificEpithet = pn_specificepithet) %<>% replace_na(replace = list(specificEpithet = "")) #' Remove NA's for host species
taxon %<>% mutate(infraspecificEpithet = pn_infraspecificepithet) %<>% replace_na(replace = list(infraspecificEpithet = "")) #' Remove NA's for host species
Information for taxonRank
is contained in pn_rankmarker
:
taxon %>% distinct(pn_rankmarker)
Generate taxonRank
by recoding abbreviations:
taxon %<>% mutate(taxonRank = recode(pn_rankmarker,
"sp." = "species",
"var." = "variety",
"subsp." = "subspecies",
"fam." = "family",
"f." = "form"))
taxon %<>% mutate(nomenclaturalCode = "ICN")
Remove the original columns:
taxon %<>% select(-starts_with("raw_"), -starts_with("pn_"))
Preview data:
head(taxon)
Save to CSV:
write_csv(taxon, dwc_taxon_file, na = "")
Map the data to Species Distribution.
The distribution applies to rust fungi only. We remove the host plants from raw_data
and create distribution
:
distribution <- raw_data %>% filter(raw_resourceRelation == "parasite")
distribution %<>% mutate(taxonID = raw_taxon_id)
distribution %<>% mutate(locationID = "ISO_3166-2:BE")
distribution %<>% mutate(locality = "Belgium")
distribution %<>% mutate(countryCode = "BE")
Information for occurrenceStatus
is contained in raw_occurrenceStatus
:
distribution %>% distinct(raw_occurrenceStatus)
These values are conform the IUCN definitions.
distribution %<>% mutate(occurrenceStatus = raw_occurrenceStatus)
establishmentMeans information is contained in raw_establishmentMeans
:
distribution %>% distinct(raw_establishmentMeans)
alien
is not part of the establishmentMeans vocabulary. Here, we use introduced
:
distribution %<>% mutate(establishmentMeans = recode(raw_establishmentMeans,
"alien" = "introduced",
.missing = ""))
eventDate information is contained in raw_from
(date of first observation) and raw_to
(date of last observation). We will map this information as a ISO 8601 date range (start_date / end_date). For this, we need to clean and merge the information in raw_from
and raw_to
.
raw_from
contains the information for the first observation. At this point, the following date formats are given: dd.m.yyyy
, m.yyyy
, m/yyyy
and yyyy
, with the information for the days and years expressed as arabic numericals and for the months as arabic or roman numericals. We need to split raw_from
to convert the roman numericals.
distribution %<>% separate(raw_from,
into = c("day", "month", "year"),
remove = FALSE,
extra = "merge",
fill = "left")
Convert roman to arabic numericals in month
:
distribution %<>% mutate(month = recode(month,
"I" = "01",
"II" = "02",
"III" = "03",
"IV" = "04",
"V" = "05",
"VI" = "06",
"VII" = "07",
"VIII"= "08",
"IX" = "09",
"X" = "10",
"XI" = "11",
"XII" = "12"))
We remerge the year, month and day information for the date of first observation and express this in a ISO 8601 format. we need to add the leading zeros to months (using sprintf
), we combine year, month and date information with “-” (using paste
) and we strip the trailing “-NA” if day or month is “NA” (using gsub
):
distribution %<>% mutate(start_date =
gsub("-NA", "",
paste(year,
sprintf("%02d", as.integer(month)),
sprintf("%02d", as.integer(day)),
sep = "-")))
Compare the first records of start_date
with raw_from
:
distribution %>% select(raw_from, start_date) %>% head(n = 20)
remove day
, month
and year
:
distribution %<>% select(-day, -month, -year)
The information for the date of the last observation is contained in raw_to
. The same principles and steps can be repeated for the mapping of raw_to
into end_date
Separate raw_to
into day
, month
and year
:
distribution %<>% separate(raw_to,
into = c("day", "month", "year"),
remove = FALSE,
extra = "merge",
fill = "left")
Convert roman to arabic numericals in month
:
distribution %<>% mutate(month = recode(month,
"I" = "01",
"II" = "02",
"III" = "03",
"IV" = "04",
"V" = "05",
"VI" = "06",
"VII" = "07",
"VIII"= "08",
"IX" = "09",
"X" = "10",
"XI" = "11",
"XII" = "12"))
we need add the leading zeros to months (using sprintf
), we combine year, month and date information with “-” (using paste
) and we strip the trailing “-NA” if day or month is “NA” (using gsub
):
distribution %<>% mutate(end_date =
gsub("-NA", "",
paste(year,
sprintf("%02d", as.integer(month)),
sprintf("%02d", as.integer(day)),
sep = "-")))
Compare the first records of end_date
with raw_from
:
distribution %>% select(raw_to, end_date) %>% head(n = 20)
remove day
, month
and year
:
distribution %<>% select(-day, -month, -year)
Merge start_date
and end_date
in eventDate
(ISO 8601). When only one date is provided (start_date
OR end_date
), we copy the date for both start_date
and end_date
. This to emphasize that a taxon was detected exclusively at that particular time point:
distribution %<>% mutate(eventDate = case_when(
start_date == "NA" & end_date == "NA" ~ "",
start_date == "NA" & end_date != "NA" ~ paste(end_date, end_date, sep = "/"),
start_date != "NA" & end_date == "NA" ~ paste(start_date, start_date, sep = "/"),
TRUE ~ paste(start_date, end_date, sep = "/")))
Remove “/NA” or “NA/NA”:
distribution %<>% mutate(eventDate = gsub("([/][N][A]|([NA]{2}[/][NA]{2}))", "", eventDate))
distribution %<>% mutate(source = raw_bibliographicCitation)
Remove the original columns:
distribution %<>% select(-starts_with("raw_"), -start_date, -end_date)
Preview data:
head(distribution)
Save to CSV:
write_csv(distribution, dwc_distribution_file, na = "")
resource_relationship <- raw_data
Map the data to Resource Relationship.
In this extension, we map all interactions between the rust fungi and host plants, with one line representing a one-to-one taxon interaction. These interactions are saved in interactions
. We add this one-to-one taxon interactions to resource_relationship
:
resource_relationship %<>% right_join(interactions, by = c("raw_scientificName" = "scientificName_parasite"))
This Darwin Core term refers to the taxonID of the object of the relationship. This is the taxonID of the host plants. By joining resource_relationship
with interactions
on the the scientificName of the parasite, the taxonID’s of the host plants were removed from the resource_relationship
. We need to extract them from raw_data
:
taxonID_host <- raw_data %>% filter(raw_resourceRelation == "host") %>%
select(raw_scientificName, raw_taxon_id)
Join resource_relationship
with taxonID_host
:
resource_relationship %<>% left_join(taxonID_host, by = c("scientificName_host" = "raw_scientificName"))
Rename raw_taxon_id.y
:
resource_relationship %<>% rename("resourceID" = "raw_taxon_id.y")
The taxonID
refers to the resourceID
, which is the taxonID
of the fungi:
resource_relationship %<>% mutate(taxonID = relatedResourceID)
resource_relationship %<>% mutate(relationshipOfResource = "parasite of")
resource_relationship %<>% mutate(relationshipAccordingTo = raw_bibliographicCitation)
Remove the original columns:
resource_relationship %<>% select(-starts_with("raw_"), -scientificName_host)
Move taxonID
to the first column:
resource_relationship %<>% select(taxonID, everything())
Preview data:
head(resource_relationship)
Save to CSV:
write_csv(resource_relationship, dwc_relationship_file, na = "")