Load libraries:
library(tidyverse) # To do data science
library(tidylog) # To provide feedback on dplyr functions
library(magrittr) # To use %<>% pipes
library(here) # To find files
library(janitor) # To clean input data
library(digest) # To generate hashes
library(rgbif) # To use GBIF services
library(googlesheets) # To import and read Google spreadsheets
Set column types:
col_types = cols(
.default = col_character(),
other_col = col_double())
The original spreadsheet can be found here. We need to retrieve this spreadsheet and select the specific worksheets first:
Retrieve the spreadsheet:
retrieve_spreadsheet <- gs_title("alien mollusca checklist")
Select the data in the worksheets:
input_taxa <- retrieve_spreadsheet %>% gs_read("taxa", col_types = col_types)
input_vernacular_names <- retrieve_spreadsheet %>% gs_read("vernacular_names", col_types = col_types)
input_synonyms <- retrieve_spreadsheet %>% gs_read("synonyms", col_types = col_types)
input_references <- retrieve_spreadsheet %>% gs_read("references", col_types = col_types)
We want to add a copy of the source data to the repository:
write_csv(input_taxa, here("data", "raw", "input_taxa.csv"), na = "")
write_csv(input_vernacular_names, here("data", "raw", "input_vernacular_names.csv"), na = "")
write_csv(input_synonyms, here("data", "raw", "input_synonyms.csv"), na = "")
write_csv(input_references, here("data", "raw", "input_references.csv"), na = "")
input_vernacular_names
, input_synonyms
and input_references
contain empty rows, we remove those here:
input_vernacular_names %<>% remove_empty("rows")
input_synonyms %<>% remove_empty("rows")
input_references %<>% remove_empty("rows")
Use the GBIF nameparser to retrieve nomenclatural information for the scientific names in taxa
:
parsed_names <- input_taxa %>%
distinct(scientific_name) %>%
pull() %>% # Create vector from dataframe
parsenames() # An rgbif function
All scientific names are of type = SCIENTIFIC
and need no further processing.
To link taxa with information in the extension(s), each taxon needs a unique and relatively stable taxonID
. Here we create one in the form of dataset_shortname:taxon:hash
, where hash
is unique code based on scientific name.
vdigest <- Vectorize(digest) # Vectorize digest function to work with vectors
input_taxa %<>%
mutate(taxon_id = paste("alien-mollusca-checklist",
"taxon",
vdigest(scientific_name, algo = "md5"),
sep = ":"))
Taxonomic information is provided in two different dataframes: - input_taxa
including all accepted scientific names - input_synonyms
including all scientific names of synonyms
For some of the taxa, the original name is provided, i.e. the scientific name as it originally appeared when the name was first established under the rules of the associated nomenclatural code. These original names can be extracted from input_synyms
: these are the scientific_name
for which remarks
= original name
.
For each scientific name in this checklist (accepted or synonym), the taxon Core will thus inlcude the following information
taxon_id
: taxon_id of the scientific nameaccepted_name_usage_id
: taxon_id of the accepted scientific namescientific_name
accepted_name_usage
: the accepted scientific nametaxonomic_status
: accepted
, homotypicSynonym
(for original names) and heterotypicSynoym
(for all other synonyms)class
, order
, family
and genus
We map these terms in three steps:
accepted_taxa
which contains all terms for the accepted scientific names (in input_taxa
).taxon_id
, scientific_name
and the higher classification is already providedtaxonomic_status
(= accepted
), accepted_name_usage
(= scientific_name
) and accepted_name_usage_id
(= taxon_id
)synonyms
which contains all terms for the synonyms (in input_synonyms
).scientific_name
is already providedclass
, order
, family
and genus
by joining with input_taxa
taxon_id
taxonomic_status
(= homotypicSynonym
or heterotypicSynoym
), accepted_name_usage
(= synonym_of
) and accepted_name_usage_id
(= taxon_id
from taxon
)synonyms
with accepted_taxa
using the dplyr funtion bind_rows()
Generate accepted_taxa
by extracting all taxonomic information from input_taxa
:
accepted_taxa <- input_taxa %>% select(taxon_id, scientific_name, class, order, family, genus, taxonomy)
Map taxonomic_status
which is accepted
:
accepted_taxa %<>% mutate(taxonomic_status = "accepted")
Map accepted_name_usage
and accepted_name_usage_id
:
accepted_taxa <-
accepted_taxa %>%
mutate(accepted_name_usage = scientific_name) %>%
mutate(accepted_name_usage_id = taxon_id)
Order terms to enable future join with synonyms
:
accepted_taxa %<>% select(taxon_id, accepted_name_usage_id, scientific_name, accepted_name_usage, class, order, family, genus, taxonomic_status, taxonomy)
Join synonyms
with accepted_taxa
to extract higher classification:
synonyms <-
input_synonyms %>%
left_join(
select(accepted_taxa, scientific_name, taxon_id, class, order, family, genus),
by = c("synonym_of" = "scientific_name"))
Rename taxon_id
from accepted_taxa
to avoid confusion with the future mapping of taxon_id
from synonyms
:
synonyms %<>% rename("taxon_id_accepted" = "taxon_id")
Map taxon_id
for synonyms:
synonyms %<>%
mutate(taxon_id = paste("alien-mollusca-checklist",
"taxon",
vdigest(scientific_name, algo = "md5"),
sep = ":"))
One scientific name is duplicated in synonyms
:
synonyms %>%
slice(which(duplicated(synonyms$scientific_name))) %>%
select(scientific_name)
This is because Helix balteata Pollonera, 1892
is both a synonym of Cernuella virgata (Da Costa, 1778)
and Cernuella cisalpina (Rossmässler, 1837)
. We need to manually adjust the taxonID
for these records:
synonyms %<>% mutate(taxon_id = case_when(
scientific_name == "Helix balteata Pollonera, 1892" & synonym_of == "Cernuella virgata (Da Costa, 1778)" ~ paste(taxon_id, "1", sep = ":"),
scientific_name == "Helix balteata Pollonera, 1892" & synonym_of == "Cernuella cisalpina (Rossmässler, 1837)" ~ paste(taxon_id, "2", sep = ":"),
TRUE ~ taxon_id))
Check whether taxonID’s are still duplicated:
synonyms %>%
slice(which(duplicated(synonyms$taxon_id))) %>%
select(scientific_name)
Map taxonomic_status
.
synonyms %<>% mutate(taxonomic_status = case_when(
scientific_name == "Helix balteata Pollonera, 1892" ~ "proParteSynonym",
remarks == "original name" ~ "homotypicSynonym",
TRUE ~ "heterotypicSynonym"))
Map accepted_name_usage
and accepted_name_usage_id
. This information is provided in synonym_of
and taxon_id_accepted
respectively. We thus only need to rename these columns:
synonyms <-
synonyms %>%
rename("accepted_name_usage" = "synonym_of") %>%
rename("accepted_name_usage_id" = "taxon_id_accepted")
Add taxonomy
which is empty for most synonyms
, except for Ferussacia folliculus Schröter, 1784
and Ferussacia follicula Schröter, 1784
:
synonyms %<>% mutate(taxonomy = case_when(
scientific_name == "Ferussacia folliculus Schröter, 1784" |
scientific_name == "Ferussacia follicula Schröter, 1784" ~ remarks,
TRUE ~ ""))
Rearrange columns to enable future join with accepted_taxa
:
synonyms <- synonyms %>%
select(taxon_id, accepted_name_usage_id, scientific_name, accepted_name_usage, class, order, family, genus, taxonomic_status, taxonomy)
Merge both datasets together in taxon
:
taxon <- accepted_taxa %<>% bind_rows(synonyms)
Map the data to Darwin Core Taxon.
Start with record-level terms which contain metadata about the dataset (which is generally the same for all records).
taxon %<>% mutate(dwc_language = "en")
taxon %<>% mutate(dwc_license = "http://creativecommons.org/publicdomain/zero/1.0/")
taxon %<>% mutate(dwc_rightsHolder = "Royal Belgian Institute of Natural Sciences")
taxon %<>% mutate(dwc_datasetID = "https://doi.org/10.15468/t13kwo")
taxon %<>% mutate(dwc_institutionCode = "RBINS")
taxon %<>% mutate(dwc_datasetName = "Registry of introduced terrestrial molluscs in Belgium")
The following terms contain information about the taxon:
taxon %<>% mutate(dwc_taxonID = taxon_id)
taxon %<>% mutate(dwc_acceptedNameUsageID = accepted_name_usage_id)
taxon %<>% mutate(dwc_scientificName = scientific_name)
taxon %<>% mutate(dwc_acceptedNameUsage = accepted_name_usage)
taxon %<>% mutate(dwc_kingdom = "Animalia")
taxon %<>% mutate(dwc_phylum = "Mollusca")
taxon %<>% mutate(dwc_class = class)
taxon %<>% mutate(dwc_order = order)
taxon %<>% mutate(dwc_family = family)
taxon %<>% mutate(dwc_genus = genus)
We can exxtract the taxon rank information from GBIF using the rgbif
package.
taxon_rank <- parsenames(taxon$dwc_scientificName)
Remove one 1 for the duplicated scientific name Helix balteata Pollonera, 1892
:
taxon_rank %<>% distinct(scientificname, rankmarker, .keep_all = FALSE)
Add information to taxon
:
taxon %<>%
left_join(
select(taxon_rank, scientificname, rankmarker),
by = c("dwc_scientificName" = "scientificname"))
Inspect values for rankmarker
:
taxon %>%
group_by(rankmarker) %>%
summarize(records = n())
Recode values for rankmarker
in dwc_taxonRank
taxon %<>% mutate(dwc_taxonRank = recode(rankmarker,
"f." = "forma",
"infrasp." = "subspecies",
"sp." = "species",
"var." = "variety"))
taxon %<>% mutate(dwc_nomenclaturalCode = "ICZN")
taxon %<>% mutate(dwc_taxonomicStatus = taxonomic_status)
taxon %<>% mutate(dwc_taxonRemarks = taxonomy)
Only keep the Darwin Core columns:
taxon %<>% select(starts_with("dwc_"))
Drop the dwc_
prefix:
colnames(taxon) <- str_remove(colnames(taxon), "dwc_")
Preview data:
taxon %>% head()
Save to CSV:
write_delim(taxon, here("data", "processed", "taxon.tsv"), delim = "\t", na = "")
Map the data to Vernacular Names.
Merge with taxon_id
:
input_vernacular_names %<>% left_join(
select(input_taxa, scientific_name, taxon_id),
by = "scientific_name")
Rename to dwc_taxonID
:
input_vernacular_names %<>% rename("dwc_taxonID" = "taxon_id")
input_vernacular_names %<>% mutate(dwc_vernacularName = vernacular_name)
input_vernacular_names %>%
group_by(language) %>%
summarize(records = n())
This column currently contains the original column name, which we will recode to the ISO 639-1 language code:
input_vernacular_names %<>% mutate(dwc_language = recode(language,
"dutch" = "nl",
"english" = "en",
"french" = "fr",
"german" = "de"))
Remove the original columns:
input_vernacular_names %<>% select(
starts_with("dwc"))
Drop the dwc_
prefix:
colnames(input_vernacular_names) <- str_remove(colnames(input_vernacular_names), "dwc_")
Preview data:
input_vernacular_names %>% head()
Save to CSV:
write_delim(input_vernacular_names, here("data", "processed", "vernacularnames.tsv"), delim = "\t", na = "")
Create separate dataframe:
species_profile <- input_taxa
Information for species profile is contained in the field realm
:
species_profile %>%
group_by(realm) %>%
summarize(records = n())
Conclusion: all species are terrestrial.
Map the data to Species Profile.
Rename taxon_id
to dwc_taxonID
:
species_profile %<>% rename("dwc_taxonID" = "taxon_id")
species_profile %<>% mutate("dwc_isMarine" = FALSE)
species_profile %<>% mutate("dwc_isFreshwater" = FALSE)
species_profile %<>% mutate("dwc_isTerrestrial" = TRUE)
Keep only terms with dwc_
prefix:
species_profile %<>% select(starts_with("dwc_"))
Remove prefix dwc_
:
colnames(species_profile) <- str_remove(colnames(species_profile), "dwc_")
Preview data:
species_profile %>% head()
Save to CSV:
write_delim(species_profile, here("data", "processed", "speciesprofile.tsv"), delim = "\t", na = "")
Create separate dataframe:
distribution <- input_taxa
Preview:
distribution %>% head()
distribution <- input_taxa
Map the data to Species Distribution.
Rename taxon_id
to dwc_taxonID
:
distribution %<>% rename("dwc_taxonID" = "taxon_id")
distribution %<>% mutate(dwc_locationID = "ISO_3166-2:BE")
distribution %<>% mutate(dwc_locality = "Belgium")
distribution %<>% mutate(dwc_countryCode = "BE")
occurrenceStatus
should be present
, doubtful
or absent
:
distribution %>%
group_by(occurrence_status) %>%
summarize(records = n())
distribution %<>% mutate(dwc_occurrenceStatus = occurrence_status)
distribution %<>% mutate(dwc_establishmentMeans = "introduced")
The dataset integrates two fields with date information: first_observation
, which is always populated, and last_observation
, which is often empty. From these columns, we create eventDate
(ISO 8601 format: yyyy
or yyyy
/yyyy
).
distribution %<>% mutate(dwc_eventDate = case_when(
is.na(last_observation) ~ first_observation,
TRUE ~ paste(first_observation, last_observation, sep = "/")))
Information for source can be found in source_distribution
.
distribution %<>% mutate(dwc_source = source_distribution)
distribution %<>% mutate(dwc_occurrenceRemarks = paste(
first_observation_text, distribution_belgium_text, sep = " "))
Only keep the Darwin Core columns:
distribution %<>% select(starts_with("dwc_"))
Drop the dwc_
prefix:
colnames(distribution) <- str_remove(colnames(distribution), "dwc_")
Preview data:
distribution %>% head()
Save to CSV:
write_delim(distribution, here("data", "processed", "distribution.tsv"), delim = "\t", na = "")
Merge with input_taxa
to retrieve taxon_id
:
input_references %<>% left_join(
select(input_taxa, scientific_name, taxon_id),
by = "scientific_name")
Rename to dwc_taxonID
:
input_references %<>% rename("dwc_taxonID" = "taxon_id")
input_references %<>% mutate(dwc_identifier = shortref)
input_references %<>% mutate(dwc_bibliographicCitation = reference)
Only keep the Darwin Core columns:
input_references %<>% select(starts_with("dwc_"))
Drop the dwc_
prefix:
colnames(input_references) <- str_remove(colnames(input_references), "dwc_")
Preview data:
input_references %>% head()
Save to CSV:
write_delim(input_references, here("data", "processed", "references.tsv"), delim = "\t", na = "")
In the description extension we want to include several important characteristics (hereafter refered to as descriptors) about the species:
A single taxon can have multiple descriptions of the same type (e.g. multiple native ranges), expressed as multiple rows in the description extension.
For each descriptor, we create a separate dataframe to process the specific information. We always specify which descriptor we map (type
column) and its specific content (description
column). After the mapping of these Darwin Core terms type
and value
, we merge the dataframes to generate one single description extension. We then continue the mapping process by adding the other Darwin Core terms (which content is independent of the type of descriptor, such as language
).
Create separate dataframe:
native_range <- input_taxa
Inspect data:
native_range %>%
group_by(native_range) %>%
summarize(records = n())
Separate by |
:
native_range %<>% separate(native_range,
into = paste("native_range",c(1:9),sep="_"),
remove = FALSE,
sep = "(\\s\\|\\s)")
Gather data:
native_range %<>% gather(key = "native_range",
value = "value",
paste("native_range",c(1:9),sep="_"),
na.rm = TRUE)
Inspect data:
native_range %>%
group_by(value) %>%
summarize(records = n())
Rename value
to description
:
native_range %<>% rename("description" = "value")
Create type
:
native_range %<>% mutate(type = "native range")
Remove native_range
:
native_range %<>% select(-native_range)
Create separate dataframe:
pathways <- input_taxa
Inspect data:
pathways %>%
group_by(introduction_pathway) %>%
summarize(records = n())
Separate by |
:
pathways %<>% separate(introduction_pathway,
into = paste("introduction_pathway",c(1:4),sep="_"),
sep = "(\\s\\|\\s)",
remove = FALSE)
Gather data:
pathways %<>% gather(key = "key",
value = "value",
paste("introduction_pathway",c(1:4),sep="_"),
na.rm = TRUE)
Inspect data:
pathways %>%
group_by(value) %>%
summarize(records = n())
Map to controlled vocabulary for pathways, based on the CBD standard and developed for TrIAS:
pathways %<>% mutate(description = recode(value,
"Contaminant: Contaminant on plants" = "contaminant_on_plants",
"Contaminant: Transportation of habitat material" = "contaminant_habitat_material",
"Escape : Pet" = "escape_pet",
"Escape: Research and ex-situ breeding" = "escape_research",
"Release: Other intentional use" = "release_other",
"Stowaway" = "stowaway",
"Stowaway: Container" = "stowaway_container",
"Stowaway: Hitchhikers on ship" = "stowaway_ship",
"Stowaway: People and their luggage" = "stowaway_people_luggage",
"Stowaway: Vehicles" = "stowaway_vehicles",
"Unaided" = "unaided"))
Add prefix cbd_2014_pathway
:
pathways %<>% mutate(description = paste ("cbd_2014_pathway", description, sep = ":"))
Inspect mapping:
pathways %>%
group_by(value, description) %>%
summarize(records = n())
Map type
:
pathways %<>% mutate(type = "pathway")
Remove key
and value
:
pathways %<>% select(-key, -value)
Create separate dataframe:
degree_of_establishment <- input_taxa
Inspect data:
degree_of_establishment %>%
group_by(degree_of_establishment) %>%
summarize(records = n())
We map the data to the Blackburn et al. 2011 proposed framework for biological invatsions. Each stage (A - E) represents a specific degree of establishment. When multiple stages are provided, we default to the most established of Blackburn’s categories, see issue 10.
degree_of_establishment %<>% mutate(description = recode(degree_of_establishment,
"C0 - C3" = "C3",
"C1 - C3" = "C3",
"C3, D1, D2" = "D2",
"D2, E" = "E"))
Add prefix blackburn_et_al_2011:
to refer to the framework:
degree_of_establishment %<>% mutate(description = paste ("blackburn_et_al_2011", description, sep = ":"))
Inspect mapping:
degree_of_establishment %>%
group_by(degree_of_establishment, description) %>%
summarize(records = n())
Map type
:
degree_of_establishment %<>% mutate(type = "degree of establishment")
Bind native_range
, pathways
and degree_of_establishment
by rows:
description <- bind_rows(native_range, pathways, degree_of_establishment)
description %<>% mutate(dwc_taxonID = taxon_id)
description %<>% mutate(dwc_description = description)
description %<>% mutate(dwc_type = type)
description %<>% mutate(dwc_language = "en")
Keep columns with dwc_
prefix only:
description %<>% select(starts_with("dwc_"))
Remove prefix dwc_
:
colnames(description) <- str_remove(colnames(description), "dwc_")
Sort on taxonID
to group description information per taxon:
description %<>% arrange(taxonID)
Preview data:
description %>% head(10)
Save to CSV:
write_delim(description, here("data","processed","description.tsv"), delim = "\t")