This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.

Load libraries:

library(tidyverse)      # To do data science
library(magrittr)       # To use %<>% pipes
library(here)           # To find files
library(janitor)        # To clean input data
library(digest)         # To generate hashes

1 Read source data

The data is maintained in this Google Spreadsheet.

Read the relevant worksheet (published as csv):

input_data <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTl8IEk2fProQorMu5xKQPdMXl3OQp-c0f6eBXitv0BiVFZ3JSJCde0PtbFXuETgguf6vK8b43FDX1C/pub?gid=0&single=true&output=csv")

Copy the source data to the repository to keep track of changes:

write_csv(input_data, here("data", "raw", "ad_hoc_checklist_dump.csv"), na = "")

Preview data:

input_data %>% head()

2 Preprocessing

2.1 Tidy data

Clean data somewhat:

input_data %<>%
  remove_empty("rows") %>%    # Remove empty rows
  clean_names()               # Have sensible (lowercase) column names

2.2 Scientific names

No cleaning required.

2.3 Taxon IDs

To link taxa with information in the extension(s), each taxon needs a unique and relatively stable taxonID. Here we create one in the form of dataset_shortname:taxon:hash, where hash is a unique code based on scientific name and kingdom (it remains the same as long as scientific name and kingdom remain the same):

vdigest <- Vectorize(digest) # Vectorize digest function to work with vectors
input_data %<>% mutate(taxon_id = paste(
  "ad-hoc-checklist",
  "taxon",
  vdigest(paste(scientific_name, kingdom), algo = "md5"),
  sep = ":"
))
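
As a rough illustration of why this identifier is stable, the sketch below wraps the same idea in a hypothetical helper make_taxon_id() (not part of the original script; it uses vapply() where the script uses Vectorize()) and applies it to made-up input. Identical name and kingdom pairs give identical IDs, while any change in the name string gives a different hash:

```r
library(digest)

# Hypothetical helper, for illustration only: builds IDs of the form
# ad-hoc-checklist:taxon:<md5 hash of "scientific_name kingdom">
make_taxon_id <- function(scientific_name, kingdom) {
  hashes <- vapply(
    paste(scientific_name, kingdom),
    digest,
    character(1),
    algo = "md5",
    USE.NAMES = FALSE
  )
  paste("ad-hoc-checklist", "taxon", hashes, sep = ":")
}

# Made-up input: rows 1 and 2 are identical, row 3 lacks the authorship
ids <- make_taxon_id(
  c(
    "Mephitis mephitis (Schreber, 1776)",
    "Mephitis mephitis (Schreber, 1776)",
    "Mephitis mephitis"
  ),
  c("Animalia", "Animalia", "Animalia")
)

ids[1] == ids[2] # same name and kingdom: same ID
ids[1] == ids[3] # different name string: different ID
```

A consequence of this design is that correcting a scientific name in the source spreadsheet also changes its taxonID.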

2.4 Preview data

Show the number of taxa and distributions per kingdom and rank:

input_data %>%
  group_by(kingdom, taxon_rank) %>%
  summarize(
    `# taxa` = n_distinct(taxon_id),
    `# distributions` = n()
  ) %>%
  adorn_totals("row")

Preview data:

input_data %>% head()

3 Darwin Core mapping

3.1 Taxon core

Create a dataframe with unique taxa only (ignoring multiple distribution rows):

taxon <- input_data %>% distinct(taxon_id, .keep_all = TRUE)

Map the data to Darwin Core Taxon.

3.1.1 language

taxon %<>% mutate(dwc_language = "en")

3.1.2 license

taxon %<>% mutate(dwc_license = "http://creativecommons.org/publicdomain/zero/1.0/")

3.1.3 rightsHolder

taxon %<>% mutate(dwc_rightsHolder = "INBO") 

3.1.4 accessRights

taxon %<>% mutate(dwc_accessRights = "https://www.inbo.be/en/norms-data-use") 

3.1.5 datasetID

taxon %<>% mutate(dwc_datasetID = "https://doi.org/10.15468/3pmlxs") 

3.1.6 institutionCode

taxon %<>% mutate(dwc_institutionCode = "INBO") 

3.1.7 datasetName

taxon %<>% mutate(dwc_datasetName = "Ad hoc checklist of alien species in Belgium") 

3.1.8 taxonID

taxon %<>% mutate(dwc_taxonID = taxon_id)

3.1.9 scientificName

taxon %<>% mutate(dwc_scientificName = scientific_name) 

3.1.10 kingdom

Inspect values:

taxon %>%
  group_by(kingdom) %>%
  count()

Map values:

taxon %<>% mutate(dwc_kingdom = kingdom)

3.1.11 phylum

Inspect values:

taxon %>%
  group_by(phylum) %>%
  count()

Map values:

taxon %<>% mutate(dwc_phylum = phylum)

3.1.12 class

Inspect values:

taxon %>%
  group_by(class) %>%
  count()

Map values:

taxon %<>% mutate(dwc_class = class)

3.1.13 order

Inspect values:

taxon %>%
  group_by(order) %>%
  count()

Map values:

taxon %<>% mutate(dwc_order = order)

3.1.14 family

Inspect values:

taxon %>%
  group_by(family) %>%
  count()

Map values:

taxon %<>% mutate(dwc_family = family)

3.1.15 genus

Inspect values:

taxon %>%
  group_by(genus) %>%
  count()

Map values:

taxon %<>% mutate(dwc_genus = genus)

3.1.16 taxonRank

Inspect values:

taxon %>%
  group_by(taxon_rank) %>%
  count()

Map values:

taxon %<>% mutate(dwc_taxonRank = taxon_rank)

3.1.17 nomenclaturalCode

taxon %<>% mutate(dwc_nomenclaturalCode = nomenclatural_code)

3.2 Literature references extension

In this extension we will express references from source, separated and gathered.

Create a dataframe with all data (including multiple distributions), to capture potentially different sources for different distributions of the same taxon:

literature_references <- input_data

Separate values on | in a maximum of 3 columns:

literature_references %<>% separate(
  source,
  into = paste0("reference_", c(1:3)),
  sep = " \\| ",
  extra = "drop"
)

Gather and trim values:

literature_references %<>% gather(key, value, starts_with("reference_"), na.rm = TRUE) %>%
  mutate(value = str_trim(value))

Map the data to Literature References.

3.2.1 taxonID

literature_references %<>% mutate(dwc_taxonID = taxon_id)

3.2.2 identifier

Extract the URL from reference using regex:

literature_references %<>% mutate(dwc_identifier = str_extract(value, "http\\S+"))
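
The separate, gather and extract steps of this extension can be tried on toy data. The sketch below uses made-up references and a placeholder URL; fill = "right" is added here only to silence the warning that separate() gives when a row has fewer than three references, and is not part of the original script:

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up input: taxon "a" has two references, taxon "b" has one
toy <- tibble(
  taxon_id = c("a", "b"),
  source = c(
    "Ref A (2001) | Ref B (2010) http://example.org/b",
    "Ref C (2015)"
  )
)

refs <- toy %>%
  separate(
    source,
    into = paste0("reference_", 1:3),
    sep = " \\| ",
    extra = "drop",
    fill = "right" # assumption: pad missing references with NA silently
  ) %>%
  gather(key, value, starts_with("reference_"), na.rm = TRUE) %>%
  mutate(value = str_trim(value)) %>%
  mutate(identifier = str_extract(value, "http\\S+"))

refs
```

The result has one row per taxon per reference. The identifier is NA for references without a URL.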

3.2.3 bibliographicCitation

literature_references %<>% mutate(dwc_bibliographicCitation = value) 

3.3 Distribution extension

Create a dataframe with all data (including multiple distributions):

distribution <- input_data

Map the data to Species Distribution.

3.3.1 taxonID

distribution %<>% mutate(dwc_taxonID = taxon_id) 

3.3.2 locationID

Currently map for Belgian regions only:

distribution %<>% mutate(dwc_locationID = case_when(
  is.na(location) & country_code == "BE" ~ "ISO_3166-2:BE",
  location == "Flanders" ~ "ISO_3166-2:BE-VLG",
  location == "Wallonia" ~ "ISO_3166-2:BE-WAL",
  location == "Brussels" ~ "ISO_3166-2:BE-BRU"
))
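
Because this case_when() has no fallback branch, any row that matches none of the conditions gets NA as locationID. A minimal sketch on made-up rows (the "Antwerp" and "NL" values are hypothetical and only serve to show the fall-through):

```r
library(dplyr)

# Made-up rows, for illustration only
toy <- tibble(
  location = c(NA, "Flanders", "Brussels", "Antwerp", NA),
  country_code = c("BE", "BE", "BE", "BE", "NL")
)

toy <- toy %>% mutate(location_id = case_when(
  is.na(location) & country_code == "BE" ~ "ISO_3166-2:BE",
  location == "Flanders" ~ "ISO_3166-2:BE-VLG",
  location == "Wallonia" ~ "ISO_3166-2:BE-WAL",
  location == "Brussels" ~ "ISO_3166-2:BE-BRU"
))

# Rows that did not match any condition
unmapped <- toy %>% filter(is.na(location_id))
unmapped
```

A filter like this could be used to spot locations that still need a mapping.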

3.3.3 locality

Currently map for Belgian regions only:

distribution %<>% mutate(dwc_locality = case_when(
  is.na(location) & country_code == "BE" ~ "Belgium",
  location == "Flanders" ~ "Flemish Region",
  location == "Wallonia" ~ "Walloon Region",
  location == "Brussels" ~ "Brussels-Capital Region"
))

3.3.4 countryCode

distribution %<>% mutate(dwc_countryCode = country_code)

3.3.5 occurrenceStatus

distribution %<>% mutate(dwc_occurrenceStatus = occurrence_status)

3.3.6 establishmentMeans

distribution %<>% mutate(dwc_establishmentMeans = "introduced")

3.3.7 eventDate

Information for eventDate is contained in date_first_observation and date_last_observation, which we will express here in an ISO 8601 date format yyyy/yyyy (start_date/end_date).

Inspect date_first_observation:

distribution %>%
  group_by(date_first_observation) %>%
  count()

Clean date_first_observation (remove >):

distribution %<>% mutate(date_first_observation = str_remove_all(date_first_observation, ">")) 

date_first_observation contains empty values. For those we’ll consider the publication year of the ad hoc checklist as the date when the presence of the species was last verified, except for Mephitis mephitis, which was last observed in 2014. For this species, we use 2014 as start date:

distribution %<>% mutate(start_date = case_when(
  scientific_name == "Mephitis mephitis (Schreber, 1776)" ~ "2014",
  is.na(date_first_observation) ~ "2018",
  TRUE ~ date_first_observation
)) 

Check that date_last_observation is not before 2018 for those specific records (otherwise the end date would precede the start date):

distribution %>% 
  filter(is.na(date_first_observation)) %>%
  group_by(date_first_observation, start_date, date_last_observation) %>% 
  count()

Inspect date_last_observation:

distribution %>%
  group_by(date_last_observation) %>%
  count()

In a similar way as for date_first_observation, we use the publication year of the ad hoc checklist when no end year is provided:

distribution %<>% mutate(end_date = case_when(
  is.na(date_last_observation) ~ "2018",
  TRUE  ~ date_last_observation
)) 

Create eventDate:

distribution %<>% mutate(dwc_eventDate = paste(start_date, end_date, sep = "/")) 
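
To make the date logic concrete, the sketch below runs the same steps on three made-up rows ("Taxon A" and "Taxon B" are placeholders, and the result column is called event_date here to keep the example separate from the real mapping):

```r
library(dplyr)
library(stringr)

# Made-up rows: the Mephitis exception, a ">" prefixed year, and empty dates
toy <- tibble(
  scientific_name = c("Mephitis mephitis (Schreber, 1776)", "Taxon A", "Taxon B"),
  date_first_observation = c(NA, ">2005", NA),
  date_last_observation = c(NA, "2016", NA)
)

toy <- toy %>%
  mutate(date_first_observation = str_remove_all(date_first_observation, ">")) %>%
  mutate(
    start_date = case_when(
      scientific_name == "Mephitis mephitis (Schreber, 1776)" ~ "2014",
      is.na(date_first_observation) ~ "2018",
      TRUE ~ date_first_observation
    ),
    end_date = case_when(
      is.na(date_last_observation) ~ "2018",
      TRUE ~ date_last_observation
    ),
    event_date = paste(start_date, end_date, sep = "/")
  )

toy$event_date
# "2014/2018" "2005/2016" "2018/2018"
```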

3.3.8 source

Use the source field as is. Its content is expected to be concatenated with | for more than one reference.

distribution %<>% mutate(dwc_source = source) 

3.4 Species profile extension

In this extension we will express broad habitat characteristics of the species (e.g. isTerrestrial) from realm.

Create a dataframe with unique taxa only (ignoring multiple distribution rows):

species_profile <- input_data %>% distinct(taxon_id, .keep_all = TRUE)

Only keep records for which realm is not empty:

species_profile %<>% filter(!is.na(realm))

Inspect values:

species_profile %>%
  group_by(realm) %>%
  count()

Map the data to Species Profile.

3.4.1 taxonID

species_profile %<>% mutate(dwc_taxonID = taxon_id)

3.4.2 isMarine

species_profile %<>% mutate(dwc_isMarine = case_when(
  realm == "freshwater | marine" ~ "TRUE",
  realm == "estuarine" ~ "TRUE",
  TRUE ~ "FALSE"
)) 

3.4.3 isFreshwater

species_profile %<>% mutate(dwc_isFreshwater = case_when(
  realm == "freshwater" ~ "TRUE",
  realm == "freshwater | marine" ~ "TRUE",
  realm == "terrestrial | freshwater" ~ "TRUE",
  realm == "estuarine" ~ "TRUE",
  TRUE ~ "FALSE"
)) 

3.4.4 isTerrestrial

species_profile %<>% mutate(dwc_isTerrestrial = case_when(
  realm == "terrestrial" ~ "TRUE",
  realm == "terrestrial | freshwater" ~ "TRUE",
  TRUE ~ "FALSE"
))

Show mapped values:

species_profile %>%
  select(realm, dwc_isMarine, dwc_isFreshwater, dwc_isTerrestrial) %>%
  group_by_all() %>%
  summarize(records = n())
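
Note that each of these mappings ends with a TRUE ~ "FALSE" branch, so any realm value that is not listed explicitly is mapped to "FALSE". The sketch below applies the isMarine mapping to a few values; "marine" on its own is a hypothetical value (we do not know whether it occurs in the source data) that shows this fall-through:

```r
library(dplyr)

# Illustrative realm values; "marine" alone is hypothetical
toy <- tibble(
  realm = c("freshwater", "freshwater | marine", "estuarine", "terrestrial", "marine")
)

toy <- toy %>% mutate(is_marine = case_when(
  realm == "freshwater | marine" ~ "TRUE",
  realm == "estuarine" ~ "TRUE",
  TRUE ~ "FALSE"
))

toy
```

If new realm values appear in the spreadsheet, the table of mapped values above is the place to check whether they end up in the intended category.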

3.5 Description extension

In the description extension we want to include several important characteristics (hereafter referred to as descriptors) about the species:

  • Native range
  • Pathway of introduction
  • Degree of establishment (invasion stage)

The structure of the description extension is slightly different from the other core/extension files: information for a specific taxon (linked to taxonID) is provided in multiple lines within the csv file, one line per taxon per descriptor. In this way, we are able to include multiple descriptors for each species.

For each descriptor, we create a separate dataframe to process the specific information. We always specify which descriptor we map (type column) and its specific content (description column). After mapping these Darwin Core terms (type and description), we merge the dataframes to generate one single description extension. We then continue the mapping process by adding the other Darwin Core terms, whose content is independent of the type of descriptor (such as language).

3.5.1 Native range

We will express native range information from native_range, separated and gathered.

Create a dataframe with unique taxa only (ignoring multiple distribution rows):

native_range <- input_data %>% distinct(taxon_id, .keep_all = TRUE)

Separate values on | in a maximum of 3 columns:

native_range %<>% separate(
  native_range,
  into = paste0("range_", c(1:3)),
  sep = " \\| ",
  extra = "drop"
)

Gather and trim values:

native_range %<>% gather(key, value, starts_with("range_"), na.rm = TRUE) %>%
  mutate(value = str_trim(value))

Inspect values:

native_range %>%
  group_by(value) %>%
  count()

Clean native range information in value somewhat:

native_range %<>% 
  mutate(value = str_remove_all(value, "\\?")) %>%  # Remove question marks
  mutate(value = str_to_title(value))               # Convert to title case

Map values:

native_range %<>% mutate(mapped_value = recode(value,
  "Africa"                 = "Africa (WGSRPD:2)",
  "Australa"               = "Australia (WGSRPD:50)",
  "Canary Islands"         = "Canary Islands (WGSRPD:21_CNY)",
  "Central America"        = "Central America (WGSRPD:80)",
  "China"                  = "China (WGSRPD:36)",
  "Costa Rica"             = "Costa Rica (WGSRPD:80_COS)",
  "Cyprus"                 = "Cyprus (WGSRPD:34_CYP)",
  "East Africa"            = "Eastern Africa",
  "East Asia"              = "Eastern Asia (WGSRPD:38)",
  "Europe"                 = "Europe (WGSRPD:1)",
  "Hawaï"                  = "Hawaii (WGSRPD:63_HAW)",
  "Japan"                  = "Japan (WGSRPD:38_JAP)",
  "Mexico"                 = "Mexico (WGSRPD:79)",
  "New Zealand"            = "New Zealand (WGSRPD:51)",
  "North Amercia"          = "Northern America (WGSRPD:7)",
  "South America"          = "Southern America (WGSRPD:8)",
  "Southeastern Europe"    = "Southeastern Europe (WGSRPD:13)",
  "Southern Africa"        = "Southern Africa (WGSRPD:27)",  # No trailing space: values were trimmed above
  "Tasmania"               = "Tasmania (WGSRPD:50_TAS)",
  "Vietnam"                = "Vietnam (WGSRPD:41_VIE)",
# .default                 = "",
  .missing                 = ""
))
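
A small sketch of how recode() behaves here, using made-up values ("Atlantis" is a placeholder for a value without a mapping). Since .default is commented out, values without a matching key keep their original text, and only missing values become empty strings:

```r
library(dplyr)

# Made-up values: one with a mapping, one without, one missing
values <- c("Africa", "Atlantis", NA)

mapped <- recode(values,
  "Africa" = "Africa (WGSRPD:2)",
  .missing = ""
)

mapped
# "Africa (WGSRPD:2)" "Atlantis" ""
```

This is why the inspection step that follows is useful: a value whose spelling does not match a key exactly passes through unchanged, without a WGSRPD code.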

Inspect mapped values:

native_range %>%
  group_by(value, mapped_value) %>%
  count()

Drop key and value column and rename mapped value:

native_range %<>% 
  select(-key, -value) %>% 
  rename(description = mapped_value)

Keep only non-empty descriptions:

native_range %<>% filter(!is.na(description) & description != "")

Create a type field to indicate the type of description:

native_range %<>% mutate(type = "native range")

3.5.2 Pathway of introduction

We will express pathway information (e.g. aquaculture) from introduction_pathway, separated and gathered.

Create a dataframe with unique taxa only (ignoring multiple distribution rows):

pathway <- input_data %>% distinct(taxon_id, .keep_all = TRUE)

Separate values on | in a maximum of 2 columns:

pathway %<>% separate(
  introduction_pathway,
  into = paste0("pathway_", c(1:2)),
  sep = " \\| ",
  extra = "drop"
)

Gather and trim values:

pathway %<>% gather(key, value, starts_with("pathway_"), na.rm = TRUE) %>%
  mutate(value = str_trim(value))

Inspect values:

pathway %>%
  distinct(value) %>%
  arrange(value) 

We use the CBD 2014 pathway vocabulary to standardize this information. The vocabulary has these values.

The values in this checklist should already match the CBD standard, but as a check we’ll do a regex match for strings that consist of lowercase letters and underscores only, and add the prefix cbd_2014_pathway to those. All other values are mapped to an empty string:

pathway %<>% mutate(mapped_value = case_when(
  str_detect(value, "^[a-z_]+$") ~ paste("cbd_2014_pathway", value, sep = ":"),
  is.na(value) ~ "",
  TRUE ~ ""
))
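
The effect of the regex check can be seen on a few made-up values (they are illustrative only and not taken from the source data):

```r
library(dplyr)
library(stringr)

# Made-up values: one in the expected format, one not, one missing
toy <- tibble(value = c("contaminant_nursery", "Escape: pet", NA))

toy <- toy %>% mutate(mapped_value = case_when(
  str_detect(value, "^[a-z_]+$") ~ paste("cbd_2014_pathway", value, sep = ":"),
  is.na(value) ~ "",
  TRUE ~ ""
))

toy$mapped_value
# "cbd_2014_pathway:contaminant_nursery" "" ""
```

The check only validates the format of a value. It does not verify that the value is an actual term in the CBD vocabulary.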

Inspect mapped values:

pathway %>%
  group_by(value, mapped_value) %>%
  count()

Drop key and value column:

pathway %<>% select(-key, -value)

Change column name mapped_value to description:

pathway %<>% rename(description = mapped_value)

Create a type field to indicate the type of description:

pathway %<>% mutate(type = "pathway")

Keep only non-empty descriptions:

pathway %<>% filter(!is.na(description) & description != "")

3.5.3 Degree of establishment

Create a dataframe with unique taxa only (ignoring multiple distribution rows):

degree_of_establishment <- input_data %>% distinct(taxon_id, .keep_all = TRUE)

Inspect values:

degree_of_establishment %>%
  group_by(degree_of_establishment) %>%
  count()

Our vocabulary for invasion stage is based on the invasion stage vocabulary from Blackburn et al. (2011). We decided not to use the terms naturalized (because often, there’s no sensible criterion to distinguish between casual/naturalized or naturalized/established) and invasive (which is a term that can only be applied after a risk assessment).

Map data to Blackburn et al. (2011) vocabulary:

degree_of_establishment %<>% mutate(description = case_when(
  degree_of_establishment == "captive" | 
  degree_of_establishment == "casual" |
  degree_of_establishment == "cultivated" |
  degree_of_establishment == "reproducing" |
  degree_of_establishment == "transported" ~ "introduced",
  degree_of_establishment == "colonizing" |
  degree_of_establishment == "established" |
  degree_of_establishment == "invasive" ~ "established"
))
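
The same mapping can also be written more compactly with %in%. This is only an alternative sketch on made-up rows, not the code used above; the vectors introduced_terms and established_terms are names introduced here for illustration:

```r
library(dplyr)

# Source terms grouped by the term they are mapped to
# (helper vectors introduced for this example)
introduced_terms <- c("captive", "casual", "cultivated", "reproducing", "transported")
established_terms <- c("colonizing", "established", "invasive")

# Made-up rows; "unknown" is a placeholder for an unmapped value
toy <- tibble(degree_of_establishment = c("casual", "invasive", "unknown", NA))

toy <- toy %>% mutate(description = case_when(
  degree_of_establishment %in% introduced_terms ~ "introduced",
  degree_of_establishment %in% established_terms ~ "established"
))

toy$description
# "introduced" "established" NA NA
```

As in the original mapping, values outside both groups and missing values get NA, which the next step removes.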

Remove empty values:

degree_of_establishment %<>% filter(!is.na(description))

Show mapped values:

degree_of_establishment %>%
  group_by(degree_of_establishment, description) %>%
  count()

Create a type field to indicate the type of description:

degree_of_establishment %<>% mutate(type = "degree of establishment")

Combine native range, pathway of introduction and degree of establishment into a single description extension:

description <- bind_rows(native_range, pathway, degree_of_establishment)
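
The resulting long format (one row per taxon per descriptor) can be sketched with three small made-up dataframes. The taxon IDs and the pathway descriptions below are placeholders:

```r
library(dplyr)

# Made-up descriptor dataframes, each with the same three columns
native_range_toy <- tibble(
  taxon_id = "t1",
  description = "Europe (WGSRPD:1)",
  type = "native range"
)
pathway_toy <- tibble(
  taxon_id = c("t1", "t2"),
  description = c("cbd_2014_pathway:example_a", "cbd_2014_pathway:example_b"),
  type = "pathway"
)
degree_toy <- tibble(
  taxon_id = "t2",
  description = "established",
  type = "degree of establishment"
)

# Stack the dataframes: columns are matched by name
description_toy <- bind_rows(native_range_toy, pathway_toy, degree_toy)

# Number of descriptor rows per taxon
description_toy %>% count(taxon_id)
```

A taxon therefore appears once for every descriptor value it has, and taxa without a given descriptor simply have no row of that type.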

Map the data to Taxon Description.

3.5.4 taxonID

description %<>% mutate(dwc_taxonID = taxon_id)

3.5.5 description

description %<>% mutate(dwc_description = description)

3.5.6 type

description %<>% mutate(dwc_type = type)

3.5.7 language

description %<>% mutate(dwc_language = "en")

4 Post-processing

Only keep the Darwin Core columns:

taxon %<>% select(starts_with("dwc_"))
literature_references %<>% select(starts_with("dwc_"))
distribution %<>% select(starts_with("dwc_"))
species_profile %<>% select(starts_with("dwc_"))
description %<>% select(starts_with("dwc_"))

Drop the dwc_ prefix:

colnames(taxon) <- str_remove(colnames(taxon), "dwc_")
colnames(literature_references) <- str_remove(colnames(literature_references), "dwc_")
colnames(distribution) <- str_remove(colnames(distribution), "dwc_")
colnames(species_profile) <- str_remove(colnames(species_profile), "dwc_")
colnames(description) <- str_remove(colnames(description), "dwc_")
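
The two steps above can be checked on a toy dataframe with made-up values. Source columns and helper columns are dropped, and the remaining names become the Darwin Core terms:

```r
library(dplyr)
library(stringr)

# Made-up dataframe with source columns and mapped dwc_ columns
toy <- tibble(
  scientific_name = "Taxon A",
  taxon_id = "x:taxon:1",
  dwc_taxonID = "x:taxon:1",
  dwc_scientificName = "Taxon A"
)

# Keep the Darwin Core columns and drop the prefix
toy <- toy %>% select(starts_with("dwc_"))
colnames(toy) <- str_remove(colnames(toy), "dwc_")

colnames(toy)
# "taxonID" "scientificName"
```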

Remove duplicates (same reference for same taxon) in the literature references extension:

literature_references %<>% distinct()

Sort on taxonID (to maintain some consistency between updates of the dataset):

taxon %<>% arrange(taxonID)
literature_references %<>% arrange(taxonID)
distribution %<>% arrange(taxonID)
species_profile %<>% arrange(taxonID)
description %<>% arrange(taxonID)

Preview taxon core:

taxon %>% head()

Preview literature references extension:

literature_references %>% head()

Preview distribution extension:

distribution %>% head()

Preview species profile extension:

species_profile %>% head()

Preview description extension:

description %>% head(10)

Save to CSV:

write_csv(taxon, here("data", "processed", "taxon.csv"), na = "")
write_csv(literature_references, here("data", "processed", "references.csv"), na = "")
write_csv(distribution, here("data", "processed", "distribution.csv"), na = "")
write_csv(species_profile, here("data", "processed", "speciesprofile.csv"), na = "")
write_csv(description, here("data", "processed", "description.csv"), na = "")