1 Setup

Load libraries:

library(tidyverse)      # To do data science
library(tidylog)        # To provide feedback on dplyr functions
library(magrittr)       # To use %<>% pipes
library(here)           # To find files
library(janitor)        # To clean input data
library(digest)         # To generate hashes
library(rgbif)          # To use GBIF services
library(googlesheets)   # To import and read Google spreadsheets 

2 Read source data

Set column types:

col_types = cols(
   .default = col_character(),
    other_col = col_double())

The original spreadsheet can be found here. We need to retrieve this spreadsheet and select the specific worksheets first:

Retrieve the spreadsheet:

retrieve_spreadsheet <- gs_title("alien mollusca checklist")

Select the data in the worksheets:

input_taxa <- retrieve_spreadsheet %>% gs_read("taxa", col_types = col_types)
input_vernacular_names <- retrieve_spreadsheet %>% gs_read("vernacular_names", col_types = col_types)
input_synonyms <- retrieve_spreadsheet %>% gs_read("synonyms", col_types = col_types)
input_references <- retrieve_spreadsheet %>% gs_read("references", col_types = col_types)

We want to add a copy of the source data to the repository:

write_csv(input_taxa, here("data", "raw", "input_taxa.csv"), na = "")
write_csv(input_vernacular_names, here("data", "raw", "input_vernacular_names.csv"), na = "")
write_csv(input_synonyms, here("data", "raw", "input_synonyms.csv"), na = "")
write_csv(input_references, here("data", "raw", "input_references.csv"), na = "")

3 Pre-processing

3.1 Clean files:

input_vernacular_names, input_synonyms and input_references contain empty rows, we remove those here:

input_vernacular_names %<>% remove_empty("rows") 
input_synonyms %<>% remove_empty("rows") 
input_references %<>% remove_empty("rows") 

3.2 Scientific names

Use the GBIF nameparser to retrieve nomenclatural information for the scientific names in taxa:

parsed_names <- input_taxa %>%
  distinct(scientific_name) %>%
  pull() %>% # Create vector from dataframe
  parsenames() # An rgbif function

All scientific names are of type = SCIENTIFIC and need no further processing.

3.3 Taxon IDs

To link taxa with information in the extension(s), each taxon needs a unique and relatively stable taxonID. Here we create one in the form of dataset_shortname:taxon:hash, where hash is unique code based on scientific name.

vdigest <- Vectorize(digest) # Vectorize digest function to work with vectors
input_taxa %<>% 
  mutate(taxon_id = paste("alien-mollusca-checklist", 
                          "taxon",
                          vdigest(scientific_name, algo = "md5"), 
                          sep = ":"))

4 Taxon core

4.1 Pre-processing

Taxonomic information is provided in two different dataframes: - input_taxa including all accepted scientific names - input_synonyms including all scientific names of synonyms

For some of the taxa, the original name is provided, i.e. the scientific name as it originally appeared when the name was first established under the rules of the associated nomenclatural code. These original names can be extracted from input_synyms: these are the scientific_name for which remarks = original name.

For each scientific name in this checklist (accepted or synonym), the taxon Core will thus inlcude the following information

  • taxon_id: taxon_id of the scientific name
  • accepted_name_usage_id: taxon_id of the accepted scientific name
  • scientific_name
  • accepted_name_usage: the accepted scientific name
  • taxonomic_status: accepted, homotypicSynonym (for original names) and heterotypicSynoym (for all other synonyms)
  • higher classification information: class, order, family and genus

We map these terms in three steps:

  1. Generate accepted_taxa which contains all terms for the accepted scientific names (in input_taxa).
  • taxon_id, scientific_name and the higher classification is already provided
  • Map taxonomic_status (= accepted), accepted_name_usage (= scientific_name) and accepted_name_usage_id (= taxon_id)
  1. Generate synonyms which contains all terms for the synonyms (in input_synonyms).
  • scientific_name is already provided
  • Add higher classification class, order, family and genus by joining with input_taxa
  • Generate taxon_id
  • Map taxonomic_status (= homotypicSynonym or heterotypicSynoym), accepted_name_usage (= synonym_of) and accepted_name_usage_id (= taxon_id from taxon)
  1. Join synonyms with accepted_taxa using the dplyr funtion bind_rows()

4.1.1 Generate accepted_taxa

Generate accepted_taxa by extracting all taxonomic information from input_taxa:

accepted_taxa <- input_taxa %>% select(taxon_id, scientific_name, class, order, family, genus, taxonomy)

Map taxonomic_status which is accepted:

accepted_taxa %<>% mutate(taxonomic_status = "accepted")

Map accepted_name_usage and accepted_name_usage_id:

accepted_taxa <-
  accepted_taxa %>% 
    mutate(accepted_name_usage = scientific_name) %>% 
    mutate(accepted_name_usage_id = taxon_id)

Order terms to enable future join with synonyms:

accepted_taxa %<>% select(taxon_id, accepted_name_usage_id, scientific_name, accepted_name_usage, class, order, family, genus, taxonomic_status, taxonomy) 

4.1.2 Generate synonyms

Join synonyms with accepted_taxa to extract higher classification:

synonyms <- 
  input_synonyms %>% 
    left_join(
      select(accepted_taxa, scientific_name, taxon_id, class, order, family, genus),
      by = c("synonym_of" = "scientific_name"))

Rename taxon_id from accepted_taxa to avoid confusion with the future mapping of taxon_id from synonyms:

synonyms %<>% rename("taxon_id_accepted" = "taxon_id")

Map taxon_id for synonyms:

synonyms %<>% 
  mutate(taxon_id = paste("alien-mollusca-checklist", 
                          "taxon",
                          vdigest(scientific_name, algo = "md5"), 
                          sep = ":"))

One scientific name is duplicated in synonyms:

synonyms %>% 
  slice(which(duplicated(synonyms$scientific_name))) %>% 
  select(scientific_name)

This is because Helix balteata Pollonera, 1892 is both a synonym of Cernuella virgata (Da Costa, 1778) and Cernuella cisalpina (Rossmässler, 1837). We need to manually adjust the taxonID for these records:

synonyms %<>% mutate(taxon_id = case_when(
  scientific_name == "Helix balteata Pollonera, 1892" & synonym_of == "Cernuella virgata (Da Costa, 1778)" ~ paste(taxon_id, "1", sep = ":"),
  scientific_name == "Helix balteata Pollonera, 1892" & synonym_of == "Cernuella cisalpina (Rossmässler, 1837)" ~ paste(taxon_id, "2", sep = ":"),
  TRUE ~ taxon_id))

Check whether taxonID’s are still duplicated:

synonyms %>% 
  slice(which(duplicated(synonyms$taxon_id))) %>% 
  select(scientific_name)

Map taxonomic_status.

synonyms %<>% mutate(taxonomic_status = case_when(
  scientific_name == "Helix balteata Pollonera, 1892" ~ "proParteSynonym",
  remarks == "original name" ~ "homotypicSynonym",
  TRUE ~ "heterotypicSynonym"))

Map accepted_name_usage and accepted_name_usage_id. This information is provided in synonym_of and taxon_id_accepted respectively. We thus only need to rename these columns:

synonyms <- 
  synonyms %>% 
    rename("accepted_name_usage" = "synonym_of") %>% 
    rename("accepted_name_usage_id" = "taxon_id_accepted")

Add taxonomy which is empty for most synonyms, except for Ferussacia folliculus Schröter, 1784 and Ferussacia follicula Schröter, 1784:

synonyms %<>% mutate(taxonomy = case_when(
  scientific_name == "Ferussacia folliculus Schröter, 1784" | 
    scientific_name == "Ferussacia follicula Schröter, 1784" ~ remarks,
  TRUE ~ ""))

Rearrange columns to enable future join with accepted_taxa:

synonyms <- synonyms %>% 
  select(taxon_id, accepted_name_usage_id, scientific_name, accepted_name_usage, class, order, family, genus, taxonomic_status, taxonomy) 

Merge both datasets together in taxon:

taxon <- accepted_taxa %<>% bind_rows(synonyms)

4.2 Term mapping

Map the data to Darwin Core Taxon.

Start with record-level terms which contain metadata about the dataset (which is generally the same for all records).

4.2.1 language

taxon %<>% mutate(dwc_language = "en")

4.2.2 license

taxon %<>% mutate(dwc_license = "http://creativecommons.org/publicdomain/zero/1.0/")

4.2.3 rightsHolder

taxon %<>% mutate(dwc_rightsHolder = "Royal Belgian Institute of Natural Sciences")

4.2.4 datasetID

taxon %<>% mutate(dwc_datasetID = "https://doi.org/10.15468/t13kwo")

4.2.5 institutionCode

taxon %<>% mutate(dwc_institutionCode = "RBINS")

4.2.6 datasetName

taxon %<>% mutate(dwc_datasetName = "Registry of introduced terrestrial molluscs in Belgium")

The following terms contain information about the taxon:

4.2.7 taxonID

taxon %<>% mutate(dwc_taxonID = taxon_id) 

4.2.8 acceptedNameUsageID

taxon %<>% mutate(dwc_acceptedNameUsageID = accepted_name_usage_id) 

4.2.9 scientificName

taxon %<>% mutate(dwc_scientificName = scientific_name)

4.2.10 acceptedNameUsage

taxon %<>% mutate(dwc_acceptedNameUsage = accepted_name_usage) 

4.2.11 kingdom

taxon %<>% mutate(dwc_kingdom = "Animalia")

4.2.12 phylum

taxon %<>% mutate(dwc_phylum = "Mollusca") 

4.2.13 class

taxon %<>% mutate(dwc_class = class)

4.2.14 order

taxon %<>% mutate(dwc_order = order) 

4.2.15 family

taxon %<>% mutate(dwc_family = family) 

4.2.16 genus

taxon %<>% mutate(dwc_genus = genus)

4.2.17 taxonRank

We can exxtract the taxon rank information from GBIF using the rgbif package.

taxon_rank <- parsenames(taxon$dwc_scientificName)

Remove one 1 for the duplicated scientific name Helix balteata Pollonera, 1892:

taxon_rank %<>% distinct(scientificname, rankmarker, .keep_all = FALSE)

Add information to taxon:

taxon %<>% 
  left_join(
    select(taxon_rank, scientificname, rankmarker),
    by = c("dwc_scientificName" = "scientificname"))

Inspect values for rankmarker:

taxon %>% 
  group_by(rankmarker) %>% 
  summarize(records = n())

Recode values for rankmarker in dwc_taxonRank

taxon %<>% mutate(dwc_taxonRank = recode(rankmarker,
  "f." = "forma",
  "infrasp." = "subspecies",
  "sp." = "species",
  "var." = "variety"))

4.2.18 nomenclaturalCode

taxon %<>% mutate(dwc_nomenclaturalCode = "ICZN")

4.2.19 taxonomicStatus

taxon %<>% mutate(dwc_taxonomicStatus = taxonomic_status) 

4.2.20 taxonRemarks

taxon %<>% mutate(dwc_taxonRemarks = taxonomy) 

4.3 Post-processing

Only keep the Darwin Core columns:

taxon %<>% select(starts_with("dwc_"))

Drop the dwc_ prefix:

colnames(taxon) <- str_remove(colnames(taxon), "dwc_")

Preview data:

taxon %>% head()

Save to CSV:

write_delim(taxon, here("data", "processed", "taxon.tsv"), delim = "\t", na = "")

5 Vernacular names extension

5.1 Term mapping

Map the data to Vernacular Names.

5.1.1 taxonID

Merge with taxon_id:

input_vernacular_names %<>% left_join(
  select(input_taxa, scientific_name, taxon_id), 
  by = "scientific_name")

Rename to dwc_taxonID:

input_vernacular_names %<>% rename("dwc_taxonID" = "taxon_id") 

5.1.2 vernacularName

input_vernacular_names %<>% mutate(dwc_vernacularName = vernacular_name)

5.1.3 language

input_vernacular_names %>%
  group_by(language) %>% 
  summarize(records = n())

This column currently contains the original column name, which we will recode to the ISO 639-1 language code:

input_vernacular_names %<>% mutate(dwc_language = recode(language,
 "dutch" = "nl",
 "english" = "en",
 "french" = "fr",
 "german" = "de"))

5.2 Post-processing

Remove the original columns:

input_vernacular_names %<>% select(
  starts_with("dwc"))

Drop the dwc_ prefix:

colnames(input_vernacular_names) <- str_remove(colnames(input_vernacular_names), "dwc_")

Preview data:

input_vernacular_names %>% head()

Save to CSV:

write_delim(input_vernacular_names, here("data", "processed", "vernacularnames.tsv"), delim = "\t", na = "")

6 Species Profile extension:

Create separate dataframe:

species_profile <- input_taxa

Information for species profile is contained in the field realm:

species_profile %>% 
  group_by(realm) %>% 
  summarize(records = n())

Conclusion: all species are terrestrial.

Map the data to Species Profile.

6.0.1 taxonID

Rename taxon_id to dwc_taxonID:

species_profile %<>% rename("dwc_taxonID" = "taxon_id") 

6.0.2 isMarine

species_profile %<>% mutate("dwc_isMarine" = FALSE)

6.0.3 isFreshwater

species_profile %<>% mutate("dwc_isFreshwater" = FALSE)

6.0.4 isTerrestrial

species_profile %<>% mutate("dwc_isTerrestrial" = TRUE)

6.1 Post-processing

Keep only terms with dwc_ prefix:

species_profile %<>% select(starts_with("dwc_")) 

Remove prefix dwc_:

colnames(species_profile) <- str_remove(colnames(species_profile), "dwc_") 

Preview data:

species_profile %>% head()

Save to CSV:

write_delim(species_profile, here("data", "processed", "speciesprofile.tsv"), delim = "\t", na = "")

7 Distribution extension:

Create separate dataframe:

distribution <- input_taxa

Preview:

distribution %>% head()

7.1 Term mapping

distribution <- input_taxa

Map the data to Species Distribution.

7.1.1 taxonID

Rename taxon_id to dwc_taxonID:

distribution %<>% rename("dwc_taxonID" = "taxon_id") 

7.1.2 locationID

distribution %<>% mutate(dwc_locationID = "ISO_3166-2:BE") 

7.1.3 locality

distribution %<>% mutate(dwc_locality = "Belgium") 

7.1.4 countryCode

distribution %<>% mutate(dwc_countryCode = "BE") 

7.1.5 occurrenceStatus

occurrenceStatus should be present, doubtful or absent:

distribution %>% 
  group_by(occurrence_status) %>% 
  summarize(records = n())
distribution %<>% mutate(dwc_occurrenceStatus = occurrence_status) 

7.1.6 establishmentMeans

distribution %<>% mutate(dwc_establishmentMeans = "introduced") 

7.1.7 eventDate

The dataset integrates two fields with date information: first_observation, which is always populated, and last_observation, which is often empty. From these columns, we create eventDate (ISO 8601 format: yyyy or yyyy/yyyy).

distribution %<>% mutate(dwc_eventDate = case_when(
  is.na(last_observation) ~ first_observation,
  TRUE ~ paste(first_observation, last_observation, sep = "/")))

7.1.8 source

Information for source can be found in source_distribution.

distribution %<>% mutate(dwc_source = source_distribution) 

7.1.9 occurrenceRemarks

distribution %<>% mutate(dwc_occurrenceRemarks = paste(
  first_observation_text, distribution_belgium_text, sep = " ")) 

7.2 Post-processing

Only keep the Darwin Core columns:

distribution %<>% select(starts_with("dwc_"))

Drop the dwc_ prefix:

colnames(distribution) <- str_remove(colnames(distribution), "dwc_")

Preview data:

distribution %>% head()

Save to CSV:

write_delim(distribution, here("data", "processed", "distribution.tsv"), delim = "\t", na = "")

8 Literature reference extension

8.0.1 taxonID

Merge with input_taxa to retrieve taxon_id:

input_references %<>% left_join(
  select(input_taxa, scientific_name, taxon_id),
  by = "scientific_name")

Rename to dwc_taxonID:

input_references %<>% rename("dwc_taxonID" = "taxon_id") 

8.0.2 identifier

input_references %<>% mutate(dwc_identifier = shortref)

8.0.3 bibliographicCitation

input_references %<>% mutate(dwc_bibliographicCitation = reference) 

8.1 Post-processing

Only keep the Darwin Core columns:

input_references %<>% select(starts_with("dwc_"))

Drop the dwc_ prefix:

colnames(input_references) <- str_remove(colnames(input_references), "dwc_")

Preview data:

input_references %>% head()

Save to CSV:

write_delim(input_references, here("data", "processed", "references.tsv"), delim = "\t", na = "")

8.2 Description extension

In the description extension we want to include several important characteristics (hereafter refered to as descriptors) about the species:

  • Native range
  • Pathway of introduction
  • Degree of establishment

A single taxon can have multiple descriptions of the same type (e.g. multiple native ranges), expressed as multiple rows in the description extension.

For each descriptor, we create a separate dataframe to process the specific information. We always specify which descriptor we map (type column) and its specific content (description column). After the mapping of these Darwin Core terms type and value, we merge the dataframes to generate one single description extension. We then continue the mapping process by adding the other Darwin Core terms (which content is independent of the type of descriptor, such as language).

8.2.1 Native range

Create separate dataframe:

native_range <- input_taxa

Inspect data:

native_range %>% 
  group_by(native_range) %>% 
  summarize(records = n())

Separate by |:

native_range %<>% separate(native_range, 
                          into = paste("native_range",c(1:9),sep="_"),
                          remove = FALSE,
                          sep = "(\\s\\|\\s)")

Gather data:

native_range %<>% gather(key = "native_range",
                         value = "value",
                         paste("native_range",c(1:9),sep="_"),
                         na.rm = TRUE)

Inspect data:

native_range %>% 
  group_by(value) %>% 
  summarize(records = n())

Rename value to description:

native_range %<>% rename("description" = "value")

Create type:

native_range %<>% mutate(type = "native range") 

Remove native_range:

native_range %<>% select(-native_range) 

8.2.2 Pathways

Create separate dataframe:

pathways <- input_taxa

Inspect data:

pathways %>% 
  group_by(introduction_pathway) %>% 
  summarize(records = n())

Separate by |:

pathways %<>% separate(introduction_pathway, 
                          into = paste("introduction_pathway",c(1:4),sep="_"),
                          sep = "(\\s\\|\\s)",
                          remove = FALSE)

Gather data:

pathways %<>% gather(key = "key",
                     value = "value",
                     paste("introduction_pathway",c(1:4),sep="_"),
                     na.rm = TRUE)

Inspect data:

pathways %>% 
  group_by(value) %>% 
  summarize(records = n())

Map to controlled vocabulary for pathways, based on the CBD standard and developed for TrIAS:

pathways %<>% mutate(description = recode(value,
  "Contaminant: Contaminant on plants" = "contaminant_on_plants",
  "Contaminant: Transportation of habitat material" = "contaminant_habitat_material",
  "Escape : Pet" = "escape_pet",
  "Escape: Research and ex-situ breeding" = "escape_research",
  "Release: Other intentional use" = "release_other",
  "Stowaway" = "stowaway",
  "Stowaway: Container" = "stowaway_container",
  "Stowaway: Hitchhikers on ship" = "stowaway_ship",
  "Stowaway: People and their luggage" = "stowaway_people_luggage",
  "Stowaway: Vehicles" = "stowaway_vehicles",
  "Unaided" = "unaided")) 

Add prefix cbd_2014_pathway:

pathways %<>% mutate(description = paste ("cbd_2014_pathway", description, sep = ":"))

Inspect mapping:

pathways %>% 
  group_by(value, description) %>% 
  summarize(records = n())

Map type:

pathways %<>% mutate(type = "pathway")

Remove key and value:

pathways %<>% select(-key, -value) 

8.2.3 Degree of establishment

Create separate dataframe:

degree_of_establishment <- input_taxa

Inspect data:

degree_of_establishment %>% 
  group_by(degree_of_establishment) %>% 
  summarize(records = n()) 

We map the data to the Blackburn et al. 2011 proposed framework for biological invatsions. Each stage (A - E) represents a specific degree of establishment. When multiple stages are provided, we default to the most established of Blackburn’s categories, see issue 10.

degree_of_establishment %<>% mutate(description = recode(degree_of_establishment,
  "C0 - C3" = "C3",
  "C1 - C3" = "C3",
  "C3, D1, D2" = "D2",
  "D2, E" = "E")) 

Add prefix blackburn_et_al_2011: to refer to the framework:

degree_of_establishment %<>% mutate(description = paste ("blackburn_et_al_2011", description, sep = ":"))

Inspect mapping:

degree_of_establishment %>% 
  group_by(degree_of_establishment, description) %>% 
  summarize(records = n()) 

Map type:

degree_of_establishment %<>% mutate(type = "degree of establishment") 

8.3 Union description

Bind native_range, pathways and degree_of_establishment by rows:

description <- bind_rows(native_range, pathways, degree_of_establishment)

8.4 Map terms

8.4.1 taxonID

description %<>% mutate(dwc_taxonID = taxon_id)

8.4.2 description

description %<>% mutate(dwc_description = description) 

8.4.3 type

description %<>% mutate(dwc_type = type) 

8.4.4 language

description %<>% mutate(dwc_language = "en") 

8.5 Post-processing

Keep columns with dwc_ prefix only:

description %<>% select(starts_with("dwc_"))

Remove prefix dwc_:

colnames(description) <- str_remove(colnames(description), "dwc_")

Sort on taxonID to group description information per taxon:

description %<>% arrange(taxonID)

Preview data:

description %>% head(10)

Save to CSV:

write_delim(description, here("data","processed","description.tsv"), delim = "\t")