This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.
Load libraries:
library(tidyverse) # To do data science
library(magrittr) # To use %<>% pipes
library(here) # To find files
library(janitor) # To clean input data
library(readxl) # To read Excel files
library(digest) # To generate hashes
library(rgbif) # To use GBIF services
Set file paths (all paths should be relative to this script):
# Raw files:
raw_data_file = "../data/raw/transcribed_10530_2016_1278_MOESM2_ESM.xlsx"
references = "../data/raw/transcribed_references.xlsx"
# Processed files:
dwc_taxon_file = "../data/processed/taxon.csv"
dwc_literature_references_file = "../data/processed/references.csv"
dwc_distribution_file = "../data/processed/distribution.csv"
dwc_profile_file = "../data/processed/speciesprofile.csv"
dwc_description_file = "../data/processed/description.csv"
Create a data frame raw_data
from the source data:
raw_data <- read_excel(path = raw_data_file)
Clean the data somewhat: remove empty rows if present and clean names:
raw_data %<>%
remove_empty("rows") %<>%
clean_names()
To uniquely identify a taxon in the taxon core and reference taxa in the extensions, we need a taxonID
. Since we need it in all generated files, we generate it here in the raw data frame. It is a combination of dataset-shortname:taxon:
and a hash based on the scientific name. As long as the scientific name doesn’t change, the ID will be stable. Some of the names require some minor cleaning first though:
raw_data %<>% mutate(species_clean = case_when(
species == "Crassostrea rhizophoraeGuilding 1828" ~ "Crassostrea rhizophorae Guilding 1828",
species == "Petricola pholadiformisLamarck, 1818" ~ "Petricola pholadiformis Lamarck, 1818",
species == "Theba pisanaMüller, 1774)" ~ "Theba pisana (Müller, 1774)",
species == "Maylandia sp. M. K. Meyer & W. Förster, 1984" ~ "Maylandia M. K. Meyer & W. Förster, 1984",
TRUE ~ species))
Some scientific names don’t have spaces before the (
. We add them here:
raw_data %<>% mutate(species_clean = str_replace_all(species_clean, "\\(", " ("),
species_clean = str_replace_all(species_clean, " ", " "))
Generate taxonID
:
# Vectorize the digest function (the digest() function isn't vectorized. So if you pass in a vector, you get one value for the whole vector rather than a digest for each element of the vector):
vdigest <- Vectorize(digest)
# Generate taxonID:
raw_data %<>% mutate(taxonID = paste("rinse-pathways-checklist", "taxon", vdigest (species_clean, algo = "md5"), sep = ":"))
Add prefix raw_
to all column names to avoid name clashes with Darwin Core terms:
colnames(raw_data) <- paste0("raw_", colnames(raw_data))
Preview data:
raw_data %>% head()
taxon <- raw_data
Map the data to Darwin Core Taxon.
taxon %<>% mutate(language = "en")
taxon %<>% mutate(license = "http://creativecommons.org/publicdomain/zero/1.0/")
taxon %<>% mutate(rightsHolder = "University of Cambridge")
taxon %<>% mutate(accessRights = "http://www.inbo.be/en/norms-for-data-use")
taxon %<>% mutate(datasetID = "https://doi.org/10.15468/guejza")
taxon %<>% mutate(institutionCode = "University of Cambridge")
taxon %<>% mutate(datasetName = "RINSE - Pathways and vectors of biological invasions in Northwest Europe")
taxon %<>% mutate(taxonID = raw_taxonID)
taxon %<>% mutate(scientificName = raw_species_clean)
Information for the higher classification is contained in raw_higher_classification
:
taxon %>% distinct(raw_higher_classification)
These map to different taxonomical levels (e.g. Mammalia
= class while Osteichthyes
= order). We interpret this information in the mapping of kingdom
, phylum
, class
and order
.
taxon %<>% mutate(kingdom = case_when(
raw_higher_classification == "Angiospermae" ~ "Plantae" ,
TRUE ~ "Animalia"))
taxon %<>% mutate(phylum = case_when(
raw_higher_classification == "Angiospermae" ~ "Magnoliophyta" ,
raw_higher_classification == "Mollusca" ~ "Mollusca",
TRUE ~ "Chordata"))
Osteichthyes, which is a superclass, includes Actinopterygii and Sarcopterygii (https://en.wikipedia.org/wiki/Osteichthyes). All the taxa in this list are Actinopterygii which is a class.
taxon %<>% mutate(class = case_when(
raw_higher_classification == "Anseriformes" ~ "Aves",
raw_higher_classification == "Mammalia" ~ "Mammalia",
raw_higher_classification == "Osteichthyes" ~ "Actinopterygii",
TRUE ~ ""))
taxon %<>% mutate(order = case_when(
raw_higher_classification == "Anseriformes" ~ "Anseriformes",
TRUE ~ "" ))
The taxon rank can be provided by the nameparser()
function of rgbif. First we create a data frame with the parsed names:
taxon_ranks <- parsenames(taxon$raw_species_clean)
And then select rank information from rankmarker
. The field notho
contains valuable information for the generic hybrids, so this column is selected as well.
taxon_ranks %<>% select(scientificname, rankmarker, notho)
Merge taxon_ranks
with taxon
using scientific names:
taxon %<>% left_join(taxon_ranks, by = c("raw_species_clean" = "scientificname"))
Show unique values for rankmarker
and notho
:
taxon %>%
select(rankmarker, notho) %>%
group_by_all() %>%
summarise(records = n())
Show scientificName
s for which rankmarker
= NA
:
taxon %>%
select(scientificName, rankmarker) %>%
filter(is.na(rankmarker))
Maylandia M. K. Meyer & W. Förster, 1984
is a genus, the other two records are species.
Map taxonRank
:
taxon %<>% mutate(taxonRank = case_when(
!is.na(notho) ~ "hybrid",
rankmarker == "sp." & is.na(notho) ~ "species",
rankmarker == "infrasp." ~ "subspecies",
is.na(rankmarker) & scientificName != "Maylandia M. K. Meyer & W. Förster, 1984" ~ "species",
scientificName == "Maylandia M. K. Meyer & W. Förster, 1984" ~ "genus"))
Remove the original columns:
taxon %<>% select(-starts_with("raw_"), -rankmarker, - notho)
Sort on taxonID
:
taxon %<>% arrange(taxonID)
Preview data:
taxon %>% head()
Save to CSV:
write_csv(taxon, dwc_taxon_file, na = "")
literature_references <- raw_data
Information for the literature references is contained in raw_references
, which are (sequences of) numbers in this case. The link between these numbers and the full references is provided in the original Supplementary material file (page 28-30). We copy-pasted this information to a newly generated Excel file references, here saved as the dataframe transcribed_references
:
transcribed_references <- read_excel(path = references, col_types = "text")
The literature reference extension is a dataframe in which one row contains one single reference (a short identifier and the full reference) per taxon. We generate this file in several steps:
raw_references
, instead of using the format containing -
for a range of numbers between a delimited minimun and maximum (e.g. 3-7
is transformed to 3,4,5,6,7
). To accomplish this, we first create the function hyphen_to_sequence()
:hyphen_to_sequence <- function(input){
seq_locs <- gregexpr("\\d+-\\d+", input)
matched <- regmatches(input, seq_locs)[[1]]
split_support <- function(text) {
split_seq <- strsplit(text, "-")[[1]]
paste(as.character(seq(split_seq[1], split_seq[2])), collapse = ",")
}
regmatches(input, seq_locs, invert = FALSE) <- list(map_chr(matched, split_support))
input
}
Apply function:
literature_references %<>%
rowwise() %<>%
mutate(raw_references_clean = hyphen_to_sequence(raw_references))
literature_references
from a wide to a long dataset, with each row containing one reference number per taxon:literature_references %<>%
mutate(references_sep = raw_references_clean) %<>%
ungroup() %<>%
separate_rows(references_sep, sep = ",")
raw_references
with the full references provided in the dataframe transcribed_references
using the field number
as a link.literature_references %<>% left_join(transcribed_references, by = c("references_sep" = "number"))
This dataframe constitutes the basis for further mapping of the literature reference extension:
literature_references %>%
select(raw_species, identifier, full_reference) %>%
head()
For the mapping of the field source
in the other extensions, we need to translate the sequence of reference numbers given for each taxon in raw_references
into a sequence of all corresponding identifiers separated by |
. As we need to do this several times, we generate a new dataframe identifier_sequence
here, linking the sequence of numbers in raw_references
with the sequence of identifiers.
Generate dataframe identifier_sequence
with the required columns (add taxonID as this is necessary for spreading in the next step):
identifier_sequence <- literature_references %>% select(raw_taxonID, raw_references, references_sep, identifier)
To generate the required sequence of identifiers, we need to transform identifier_sequence
from a dataframe containing one identifier per taxon per row, to one containing a sequence of identifiers per taxon per row. We do this by using the spread()
function first, generating 33 identifier columns (corresponding to the 33 identifiers):
identifier_sequence %<>% mutate(references_sep = as.numeric(references_sep)) %<>%
spread(references_sep, identifier) %<>%
select(-`<NA>`)
In the following step, we unite these 33 columns into one field id_sequence
, using |
as a separator. To unite the identifier columns (1-33
) in a single step, we first rename the column names by adding the prefix ref_
:
old_names <- c("raw_taxonID", "raw_references", c(1:33))
new_names <- c("raw_taxonID", "raw_references", paste0("ref_", c(1:33)))
identifier_sequence %<>% rename_at(vars(old_names), ~ new_names)
Generate id_sequence
:
identifier_sequence %<>% unite("id_sequence", c(paste0("ref_", c(1:33))), sep = " | ")
Preview:
identifier_sequence %>% head()
Remove NA
’s from the generated sequences:
identifier_sequence %<>% mutate(id_sequence = str_replace_all(id_sequence, "(([NA]{2}\\s[|]) | [|]\\s([NA]{2}))", ""))
Remove raw_taxonID
:
identifier_sequence %<>% select(-raw_taxonID)
Remove duplicate rows (some taxa have the same reference sequence):
identifier_sequence %<>% distinct(raw_references, .keep_all = TRUE)
We will use this dataframe to link the sequence of references numbers (raw_references
) with the sequence of identifiers (id_sequence
) in the mapping of the following extensions.
Map the data to Literature References.
literature_references %<>% mutate(taxonID = raw_taxonID)
literature_references %<>% select(-identifier, everything())
literature_references %<>% mutate(bibliographicCitation = full_reference)
Remove the original columns:
literature_references %<>% select(-starts_with("raw_"), -references_sep, -full_reference)
Sort on taxonID
:
literature_references %<>% arrange(taxonID)
Preview data:
literature_references %>% head()
Save to CSV:
write_csv(literature_references, dwc_literature_references_file, na = "")
distribution <- raw_data
Information for the distributions and associated dates of first observation is contained in eight columns in total:
raw_established_gb
, raw_established_france
, raw_established_belgium
and raw_established_netherlands
)raw_first_record_gb
, raw_first_record_france
, raw_first_record_belgium
and raw_first_record_netherlands
).When a taxon is present in a country, this is indicated by a Y
and a date of first record is provided.
We need to reduce the information for eight to two columns: one column containing country presence information (country column
, which is the basis for locality
, locationID
and countryCode
) and column containing the date of first observance (date
column, which is the basis for eventDate
). We do this in three steps:
distribution %<>% mutate(record_gb = paste(raw_established_gb, raw_first_record_gb, sep = ",")) %<>%
mutate(record_fr = paste(raw_established_france, raw_first_record_france, sep = ",")) %<>%
mutate(record_be = paste(raw_established_belgium, raw_first_record_belgium, sep = ",")) %<>%
mutate(record_nl = paste(raw_established_netherlands, raw_first_record_netherlands, sep = ","))
country
and date
by gathering the new columns (from 4 to 2 columns):distribution %<>% gather("country", "date", record_gb, record_fr, record_be, record_nl)
date
by removing records with NA,NA
and removing Y
:distribution %<>% filter(date != "NA,NA") %<>% mutate(date = str_replace_all(date, "Y,", ""))
country
:distribution %<>% mutate(country = recode(country,
"record_gb" = "Great Britain",
"record_fr" = "France",
"record_be" = "Belgium",
"record_nl" = "The Netherlands")
)
Preview:
distribution %>%
select(raw_species, country, date) %>%
head()
Map the data to Species Distribution.
distribution %<>% mutate(taxonID = raw_taxonID)
distribution %<>% mutate(locationID = case_when(
country == "Great Britain" ~ " WGSRPD:GRB",
country == "France" ~ "ISO_3166-2:FR",
country == "Belgium" ~ "ISO_3166-2:BE",
country == "The Netherlands" ~ "ISO_3166-2:NL"))
distribution %<>% mutate(locality = country)
distribution %<>% mutate(countryCode = case_when(
country == "Great Britain" ~ "GB",
country == "France" ~ "FR",
country == "Belgium" ~ "BE",
country == "The Netherlands" ~ "NL"))
distribution %<>% mutate(occurrenceStatus = "present")
distribution %<>% mutate(establishmentMeans = "introduced")
Inspect dates:
distribution %>% distinct(date) %>% arrange(date)
Check which dates deviate from the required ISO 8601 format (here YYYY
)
unique(distribution[which(str_detect(distribution $ date,"^[0-9]{4}$") == FALSE), "date"])
Clean those dates:
distribution %<>% mutate(date = str_trim(date)) # remove whitespaces
distribution %<>% mutate(date = recode(date, # recode deviating formats
"n.d." = "",
"17634" = "1763",
"2004," = "2004",
"n..d" = "",
"199" = "1990",
"1985," = "1985",
"NA" = "",
"19204" = "1904"))
After this cleaning step, there should be no deviations from the ISO 8601 format:
unique(distribution[which(str_detect(distribution $ date,"^[0-9]{4}$") == FALSE), "date"])
We want to express eventDate
as the range between the date of first and last record. In this case, we do not have information about the last record. We will consider the publication year of Zieritz et al. (2017) as the date when the presence of the species was last verified:
distribution %<>% mutate(eventDate = case_when(
date == "" ~ "",
date != "" ~ paste(date, "2017", sep = "/")))
Information about the source is provided in raw_references
. We replace the sequence of reference numbers by the sequence of identifiers (not full references), provided in the dataframe identifier_sequence
generated earlier.
distribution %<>% left_join(identifier_sequence, by = ("raw_references"))
Rename id_sequence
:
distribution %<>% rename("source" = "id_sequence")
Remove the original columns:
distribution %<>% select(-starts_with("raw_"), -country, -date)
Sort on taxonID
:
distribution %<>% arrange(taxonID)
Preview data:
distribution %>% head()
Save to CSV:
write_csv(distribution, dwc_distribution_file, na = "")
In this extension we express broad habitat characteristics (e.g. isTerrestrial
) of the species.
species_profile <- raw_data
Habitat information can be found in raw_environment
, which describes whether a species is found in freshwater, terrestrial, marine or a combination of these habitats.
Show unique values:
species_profile %>%
select(raw_environment) %>%
group_by_all() %>%
summarize(records = n())
Only two taxa have no information on the environment:
Salvelinus fontinalis
(brook trout): a fish found in both freshwater and marine habitatsRattus norvegicus
(brown rat): a terrestrial mammalWe will add this information in the mapping. Environment information is mapped to isFreshwater
, isTerrestrial
, isMarine
or a combination of these terms in the species profile extension.
Map the data to Species Profile.
species_profile %<>% mutate(taxonID = raw_taxonID)
species_profile %<>% mutate(isMarine = case_when(
raw_environment == "M" |
raw_environment == "M+F" |
raw_environment == "T+M" |
raw_species == "Salvelinus fontinalis (Mitchill, 1814)" ~ "TRUE",
TRUE ~ "FALSE"
))
species_profile %<>% mutate(isFreshwater = case_when(
raw_environment == "F" |
raw_environment == "F,T" |
raw_environment == "F+T" |
raw_environment == "M+F" |
raw_species == "Salvelinus fontinalis (Mitchill, 1814)" ~ "TRUE",
TRUE ~ "FALSE"))
species_profile %<>% mutate(isTerrestrial = case_when(
raw_environment == "F,T" |
raw_environment == "F+T" |
raw_environment == "T" |
raw_environment == "T+M" |
raw_species == "Rattus norvegicus (Berkenhout 1769)" ~ "TRUE",
TRUE ~"FALSE"))
Show mapped values:
species_profile %>%
select(raw_environment, isMarine, isFreshwater, isTerrestrial) %>%
group_by_all() %>%
summarize(records = n())
Remove the original columns:
species_profile %<>% select(-starts_with("raw_"))
Sort on taxonID
:
species_profile %<>% arrange(taxonID)
Preview data:
species_profile %>% head()
Save to CSV:
write_csv(species_profile, dwc_profile_file, na = "")
In the description extension we want to include several important characteristics (hereafter refered to as descriptors) about the species:
A single taxon can have multiple descriptions of the same type (e.g. multiple native ranges), expressed as multiple rows in the description extension. For each descriptor, we create a separate dataframe to process the specific information. We always specify which descriptor we map (type
column) and its specific content (description
column). After the mapping of these Darwin Core terms type
and value
, we merge the dataframes to generate one single description extension. We then continue the mapping process by adding the other Darwin Core terms (which content is independent of the type of descriptor, such as language
).
Native range information (e.g. Australia
) can be found in raw_origin
.
Create separate dataframe:
native_range <- raw_data
Show unique values:
native_range %>%
distinct(raw_origin) %>%
arrange(raw_origin)
raw_origin
contains multiple values (currently not more than 4), so we separate it in 4 columns:
native_range %<>% separate(raw_origin, into = paste("origin", c(1:4), sep = "_"))
Gather in a key
and value
column:
native_range %<>% gather("key", "value", paste("origin", c(1:4), sep = "_"), na.rm = TRUE)
Inspect values:
native_range %>%
distinct(value) %>%
arrange(value)
Map values:
native_range %<>% mutate(mapped_value = recode(value,
"Af" = "Africa (WGSRPD:2)",
"Ar" = "Arctic",
"As" = "Asia",
"Au" = "Australia (WGSRPD:50)",
"Aus" = "Australia (WGSRPD:50)",
"d" = "",
"Eu" = "Europe (WGSRPD:1)",
"n" = "",
"na" = "",
"NAm" = "Northern America (WGSRPD:7)",
"SAm" = "Southern America (WGSRPD:8)"
))
Show mapped values:
native_range %>%
select(value, mapped_value) %>%
group_by(value, mapped_value) %>%
summarize(records = n())
Keep only non-empty descriptions:
native_range %<>% filter(!is.na(mapped_value) & mapped_value != "")
Drop the key
and value
columns and rename mapped_value
as description
:
native_range %<>%
select(-key, -value) %>%
rename(description = mapped_value)
Create a type
field to indicate the type of description:
native_range %<>% mutate(type = "native range")
Pathway information in this dataset is a combination of the columns starting with raw_pathway_
(4 columns: raw_pathway_accidental
, raw_pathway_disperal
, raw_pathway_import_escape
and raw_pathway_import_release
) and raw_vector_
(5 columns: raw_vector_biocontrol
, raw_vector_leisure
, raw_vector_industry
, raw_vector_ornamental
and raw_vector_research
):
raw_data %>% select(starts_with("raw_vector"), starts_with("raw_pathway")) %>%
group_by_all() %>%
summarize(records = n())
This information will be mapped to the six main categories of the CBD standard. How these RINSE pathways map to the CBD standard is described in the source paper Zieritz et al. (2017) (in Methods > Data analysis):
RINSE pathway | CBD category |
---|---|
accidental | stowaway, contaminant |
dispersal | corridor, undaided |
import_release | release |
import_escape | escape |
For certain taxa, we will complement this pathway information with the specific vector of introduction (contained in columns starting with raw_vector_
), using the subcategories of the CBD standard:
accidental
and disperal
, we will never provide vector information. This is because vectors of accidentally introduced species were not analysed in the present study, due to a lack of reliable data (see Zieritz et al. 2017).import_escape
and import_release
, we will map vector information to the CBD subcategories only when they map unambigiously to this standard (see table below and this issue) and when these vectors are the only vectors of introduction for these taxon.RINSE pathway | RINSE vector | CBD category | CBD subcategory |
---|---|---|---|
accidental | biocontrol | stowaway, contaminant | NA |
accidental | industry | stowaway, contaminant | NA |
accidental | leisure | stowaway, contaminant | NA |
accidental | ornamental | stowaway, contaminant | NA |
accidental | research | stowaway, contaminant | NA |
dispersal | biocontrol | corridor, unaided | NA |
dispersal | industry | corridor, unaided | NA |
dispersal | leisure | corridor, unaided | NA |
dispersal | ornamental | corridor, unaided | NA |
dispersal | research | corridor, unaided | NA |
import_escape | biocontrol | escape | NA |
import_escape | industry | escape | NA |
import_escape | leisure | escape | live food and bait |
import_escape | ornamental | escape | NA |
import_escape | research | escape | research and ex-situ breeding (in facilities) |
import_release | biocontrol | release | biological control |
import_release | industry | release | NA |
import_release | leisure | release | NA |
import_release | ornamental | release | NA |
import_release | research | release | NA |
We map this information in the new column description
, for which we use the TrIAS vocabulary.
Create separate dataframe:
pathway_desc <- raw_data
We gather pathway information in one new column pathway
:
pathway_desc %<>% gather(pathway, value, starts_with("raw_pathway_"), na.rm = TRUE, convert = TRUE)
Column “value” will only contain “Y”, so no need for this column
pathway_desc %<>% select(-value)
Map patwhay information to the CBD main category:
pathway_desc %<>% mutate(description = recode(pathway,
"raw_pathway_accidental" = "stowaway,contaminant",
"raw_pathway_dispersal" = "corridor,natural_dispersal",
"raw_pathway_import_escape" = "escape",
"raw_pathway_import_release" = "release"
))
Arrange by taxonID to see things more in context:
pathway_desc %<>% arrange(raw_taxonID)
Map vector information to CBD subcategories for some taxa:
pathway_desc %<>% mutate(description = case_when(
description == "escape" &
raw_vector_research == "Y" &
raw_vector_leisure == "Y" &
is.na(raw_vector_ornamental) &
is.na(raw_vector_biocontrol) &
is.na(raw_vector_industry) ~ "escape_research, escape_food_bait",
description == "escape" &
raw_vector_research == "Y" &
is.na(raw_vector_leisure ) &
is.na(raw_vector_ornamental) &
is.na(raw_vector_biocontrol) &
is.na(raw_vector_industry) ~ "escape_research",
description == "escape" &
is.na(raw_vector_research) &
raw_vector_leisure == "Y" &
is.na(raw_vector_ornamental) &
is.na(raw_vector_biocontrol) &
is.na(raw_vector_industry) ~ "escape_food_bait",
description == "release" &
is.na(raw_vector_research) &
is.na(raw_vector_leisure) &
is.na(raw_vector_ornamental) &
raw_vector_biocontrol == "Y" &
is.na(raw_vector_industry) ~ "release_biocontrol",
TRUE ~ description # Leave rest as is
))
Separate description
when more than two values are provided, e.g. in case of RINSE pathway = accidental
or RINSE pathway = dispersal
:
pathway_desc %<>% separate_rows(description, sep = ",")
Remove all empty description values:
pathway_desc %<>% filter(description != "")
Remove duplicate rows:
pathway_desc %<>% distinct(raw_taxonID, description, .keep_all = TRUE)
Remove pathway
:
pathway_desc %<>% select(-pathway)
Add the prefix cbd_2014_pathway:
to refer to this standard:
pathway_desc %<>% mutate(description = paste ("cbd_2014_pathway", description, sep = ":"))
Create a type
field to indicate the type of description:
pathway_desc %<>% mutate(type = "pathway")
Union native range and pathway of introduction:
description_ext <- bind_rows(native_range, pathway_desc)
Map the data to Taxon Description.
description_ext %<>% mutate(taxonID = raw_taxonID)
description_ext %<>% mutate(description = description)
description_ext %<>% mutate(type = type)
Information about the source is provided in raw_references
. We replace the sequence of reference numbers by the sequence of identifiers (not full references), provided in the dataframe identifier_sequence
generated earlier.
description_ext %<>% left_join(identifier_sequence, by = ("raw_references"))
Rename id_sequence
:
description_ext %<>% rename("source" = "id_sequence")
description_ext %<>% mutate(language = "en")
Remove the original columns:
description_ext %<>% select(-starts_with("raw_"))
Move taxonID
to the first position:
description_ext %<>% select(taxonID, everything())
Sort on taxonID
:
description_ext %<>% arrange(taxonID)
Preview data:
description_ext %>%
mutate(source = substr(source, 1, 20)) %>% # Shorten source to make it easier to display
head()
Save to CSV:
write_csv(description_ext, dwc_description_file, na = "")