This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.

Load libraries:

library(tidyverse)      # To do data science
library(magrittr)       # To use %<>% pipes
library(here)           # To find files
library(janitor)        # To clean input data
library(readxl)         # To read Excel files
library(digest)         # To generate hashes

Set file paths (all paths should be relative to this script):

# Raw files:
raw_data_file = "../data/raw/ExotischeVissenVlaanderen2016.xlsx"

# Processed files:
dwc_taxon_file = "../data/processed/taxon.csv"
dwc_vernacular_file = "../data/processed/vernacularname.csv"
dwc_distribution_file = "../data/processed/distribution.csv"
dwc_description_file = "../data/processed/description.csv"
dwc_profile_file = "../data/processed/speciesprofile.csv"

1 Read and pre-process raw data

Create a data frame raw_data from the source data:

# Read the source data:
raw_data <- read_excel(raw_data_file, sheet = "Checklist", na = "NA")

Clean the data somewhat: remove empty rows if present

raw_data %<>%
  remove_empty("rows") %>%    # Remove empty rows
  clean_names()               # Have sensible (lowercase) column names

1.1 Generate taxonID

To uniquely identify a taxon in the taxon core and reference taxa in the extensions, we need a taxonID. Since we need it in all generated files, we generate it here in the raw data frame. It is a combination of dataset-shortname:taxon: and a hash based on the scientific name. As long as the scientific name doesn’t change, the ID will be stable:

# Vectorize the digest function (The digest() function isn't vectorized. So if you pass in a vector, you get one value for the whole vector rather than a digest for each element of the vector):
vdigest <- Vectorize(digest)

# Generate taxonID:
raw_data %<>% mutate(taxon_id = paste("alien-fishes-checklist", "taxon", vdigest(latin_name, algo="md5"), sep = ":"))

1.2 Further pre-processing:

Add prefix raw_ to all column names to avoid name clashes with Darwin Core terms:

colnames(raw_data) <- paste0("raw_", colnames(raw_data))

Preview data:

raw_data %>% head()

2 Create taxon core

taxon <- raw_data

2.1 Term mapping

Map the data to Darwin Core Taxon.

2.1.1 language

taxon %<>% mutate(language = "en")

2.1.2 license

taxon %<>% mutate(license = "http://creativecommons.org/publicdomain/zero/1.0/")

2.1.3 rightsHolder

taxon %<>% mutate(rightsHolder = "INBO")

2.1.4 accessRights

taxon %<>% mutate(accessRights = "http://www.inbo.be/en/norms-for-data-use")

2.1.5 datasetID

taxon %<>% mutate(datasetID = "https://doi.org/10.15468/xvuzfh")

2.1.6 datasetName

taxon %<>% mutate(datasetName = "Checklist of non-native freshwater fishes in Flanders, Belgium")

2.1.7 taxonID

taxon %<>% mutate(taxonID = raw_taxon_id)

2.1.8 scientificName

taxon %<>% mutate(scientificName = raw_latin_name)

Verify that scientificName contains unique values:

any(duplicated(taxon$scientificName)) # Should be FALSE
## [1] FALSE

2.1.9 kingdom

taxon %<>% mutate(kingdom = "Animalia")

2.1.10 taxonRank

All taxa are species:

taxon %<>% mutate(taxonRank = "species")

2.1.11 nomenclaturalCode

taxon %<>% mutate(nomenclaturalCode = "ICZN")

2.2 Post-processing

Remove the original columns:

taxon %<>% select(-starts_with("raw_"))

Preview data:

taxon %>% head()

Save to CSV:

write_csv(taxon, dwc_taxon_file, na = "")

3 Create vernacular names extension

vernacular_names <- raw_data

3.1 Term mapping

Map the data to Vernacular Names.

3.1.1 taxonID

vernacular_names %<>% mutate(taxonID = raw_taxon_id)

3.1.2 vernacularName

Vernacular names are available in two languages: English (raw_common_name) and Dutch (raw_nederlandse_naam). We will gather these columns to generate a single column containing the vernacular name (vernacularName) and an additional column with the language (language):

vernacular_names %<>%
  gather(key = language, value = vernacularName, raw_common_name, raw_nederlandse_naam, na.rm = TRUE, convert = TRUE) %>%
  select(-language, everything()) # Move language column to the end
vernacular_names %<>% arrange(taxonID)

3.1.3 language

This column currently contains the original column name, which we will recode to the ISO 639-1 language code:

vernacular_names %<>% mutate(language = recode(language,
  "raw_common_name" = "en",
  "raw_nederlandse_naam" = "nl"
))

3.2 Post-processing

Remove the original columns:

vernacular_names %<>% select(
  -starts_with("raw"))

Preview data:

vernacular_names %>% head()

Save to CSV:

write_csv(vernacular_names, dwc_vernacular_file, na = "")

4 Create distribution extension

distribution <- raw_data

4.1 Term mapping

Map the data to Species Distribution.

4.1.1 taxonID

distribution %<>% mutate(taxonID = raw_taxon_id)

4.1.2 locationID

distribution %<>% mutate(locationID = "ISO_3166-2:BE-VLG")

4.1.3 locality

distribution %<>% mutate(locality = "Flemish Region")

4.1.4 countryCode

distribution %<>% mutate(countryCode = "BE")

4.1.5 occurrenceStatus

distribution %<>% mutate(occurrenceStatus = "present")

4.1.6 establishmentMeans

We use the GBIF controlled vocabulary for this term. For this dataset, all species are introduced (= alien):

distribution %<>% mutate(establishmentMeans = "introduced")

4.1.7 eventDate

The distribution information applies to a certain date range, which we will express here as an ISO 8601 date yyyy/yyyy (start_year/end_year).

The start_year is the year of first observation:

distribution %<>% mutate(start_year = raw_introduction)

Some start_year values are quite broad. We change these values to the first year for which an observation is plausible:

distribution %<>% mutate(start_year = recode(start_year,
  "20xx" = "2000",
  "17th c." = "1601",
  "1980s" = "1980",
  "13th c." = "1201"
))

As there is no end_year information, we’ll consider the publication year of Verreycken et al. (2018) as the date when the presence of the species was last verified (even though all species are considered established and will probably be present after this year as well):

distribution %<>% mutate(end_year = "2018")

Create eventDate by combining start_year/end_year:

distribution %<>% mutate(eventDate = paste(start_year, end_year, sep = "/"))

Compare formatted dates with original dates in raw_introduction:

distribution %>%
  distinct(raw_introduction, eventDate) %>%
  arrange(raw_introduction)

4.2 Post-processing

Remove the original columns:

distribution %<>% select(-starts_with("raw_"), -start_year, -end_year)

Preview data:

distribution %>% head()

Save to CSV:

write_csv(distribution, dwc_distribution_file, na = "")

5 Create description extension

In the description extension we want to include several important characteristics (hereafter refered to as descriptors) about the species:

  • Native range
  • Pathway of introduction
  • Invasion stage

A single taxon can have multiple descriptions of the same type (e.g. multiple native ranges), expressed as multiple rows in the description extension.

For each descriptor, we create a separate dataframe to process the specific information. We always specify which descriptor we map (type column) and its specific content (description column). After the mapping of these Darwin Core terms type and value, we merge the dataframes to generate one single description extension. We then continue the mapping process by adding the other Darwin Core terms (which content is independent of the type of descriptor, such as language).

5.1 Native range

Native range information (e.g. AS for Asia) can be found in raw_origin.

Create separate dataframe:

native_range <- raw_data

Show unique values:

native_range %>%
  distinct(raw_origin) %>%
  arrange(raw_origin)

raw_origin contains multiple values (currently not more than 2), so we separate it on " or " in 2 columns:

native_range %<>% separate(raw_origin,
  into = c("native_range_1", "native_range_2"),
  sep = " or ",
  remove = FALSE,
  convert = FALSE,
  extra = "merge",
  fill = "right"
)

Gather in a key and value column:

native_range %<>% gather(
  key, value,
  native_range_1, native_range_2,
  na.rm = TRUE, # Also removes records for which there is no native_range_1
  convert = FALSE
)

Map values:

native_range %<>% mutate(mapped_value = recode(value,
  "AS" = "Asia",
  "EE" = "Eastern Europe",
  "AFR" = "Africa"))

Show mapped values:

native_range %>%
  select(value, mapped_value) %>%
  group_by_all() %>%
  summarize(records = n()) %>%
  arrange(value) 

Drop the key and value columns and rename mapped_value as description:

native_range %<>%
  select(-key, -value) %>%
  rename(description = mapped_value)

Create a type field to indicate the type of description:

native_range %<>% mutate(type = "native range")

5.2 Pathway of introduction

Pathway information can be found in raw_pathway_s.

Create separate dataframe:

pathway <- raw_data

Show unique values:

native_range %>%
  distinct(raw_pathway_s) %>%
  arrange(raw_pathway_s)

raw_pathway_s contains multiple values (currently not more than 2), so we separate it in 2 columns:

pathway %<>% separate(
  raw_pathway_s,
  into = c("pathway_1", "pathway_2"),
  sep = ", ",
  remove = FALSE,
  convert = FALSE,
  extra = "merge",
  fill = "right"
)

Gather in a key and value column:

pathway %<>% gather(
  key, value,
  pathway_1, pathway_2,
  na.rm = TRUE, # Also removes records for which there is no pathway_1
  convert = FALSE
)

We use the CBD 2014 pathway vocabulary to standardize this information. The vocubulary has these values.

This is a manually created overview of the abbreviations used in this dataset, their interpretation and the mapping to the CBD vocabulary:

Map values:

pathway %<>% mutate(mapped_value = recode(value,
  "AM" = "corridor_water",
  "AN" = "escape_food_bait",
  "AQ" = "escape_aquaculture",
  "BC" = "release_biological_control",
  "BW" = "stowaway_ballast_water",
  "OR" = "escape_ornamental",
  "UN" = "unintentional" # Not part of CBD
))

Add the prefix cbd_2014_pathway: to refer to this standard, except for unintentional, which has no alternative in the standard:

pathway %<>% mutate(mapped_value = case_when (
  mapped_value != "unintentional" ~ paste("cbd_2014_pathway", mapped_value, sep = ":"),
  mapped_value == "unintentional" ~ mapped_value))

Show mapped values:

pathway %>%
  select(value, mapped_value) %>%
  group_by_all() %>%
  summarize(records = n()) %>%
  arrange(value) 

Drop the key and value columns and rename mapped_value as description:

pathway %<>%
  select(-key, -value) %>%
  rename(description = mapped_value)

Create a type field to indicate the type of description:

pathway %<>% mutate(type = "pathway")

5.3 Invasion stage

Invasion stage information (e.g. casual) can be found in raw_status.

Create separate dataframe:

invasion_stage <- raw_data

Show unique values:

native_range %>%
  distinct(raw_status) %>%
  arrange(raw_status)

We use the invasion stage vocabulary from Blackburn et al. (2011) to standardize this information.

  • A and A* in the raw data file stand for acclimatized and refers to the status where the alien species has overcome the geographical barrier, the captivity/cultivation barrier and the survival barier, but where it fails to establish because individuals in the population fail to reproduce. This includes species (like Nile tilapia) in secluded warmer water of power plants. In Blackburn et al. (2011), this corresponds to the term casual.
  • N in the raw data file stands for naturalized. We decided not to use this term, because often, there’s no sensible criterium to distinguish between casual/naturalised and naturalised/established. Thus, here, we consider all naturalized species to be established.
invasion_stage %<>% mutate(description = case_when(
  raw_status == "A" ~ "casual",
  raw_status == "A*" ~ "casual",
  raw_status == "N" ~ "established"
))

Create a type field to indicate the type of description:

invasion_stage %<>% mutate(type = "invasion stage")

5.4 Union descriptions

Union native range, pathway of introduction and invasion stage:

description_ext <- bind_rows(native_range, pathway, invasion_stage)

5.5 Term mapping

Map the data to Taxon Description.

5.5.1 taxonID

description_ext %<>% mutate(taxonID = raw_taxon_id)

5.5.2 description

description_ext %<>% mutate(description = description)

5.5.3 type

description_ext %<>% mutate(type = type)

5.5.4 language

description_ext %<>% mutate(language = "en")

5.6 Post-processing

Remove the original columns:

description_ext %<>% select(-starts_with("raw_"))

Move taxonID to the first position:

description_ext %<>% select(taxonID, everything())

Sort on taxonID to group description information per taxon:

description_ext %<>% arrange(taxonID)

Preview data:

description_ext %>% head(10)

Save to CSV:

write_csv(description_ext, dwc_description_file, na = "")

6 Create species profile extension

In this extension will express broad habitat characteristics of the species.

species_profile <- raw_data

Habitat information can be found in raw_habitat. All taxa are considered to be freshwater fishes.

6.1 Term mapping

Map the data to Species Profile.

6.1.1 taxonID

species_profile %<>% mutate(taxonID = raw_taxon_id)

6.1.2 isMarine

species_profile %<>% mutate(isMarine = FALSE)

6.1.3 isFreshwater

species_profile %<>% mutate(isFreshwater = TRUE)

6.1.4 isTerrestrial

species_profile %<>% mutate(isTerrestrial = "FALSE")

6.2 Post-processing

Remove the original columns:

species_profile %<>% select(-starts_with("raw_"))

Sort on taxonID:

species_profile %<>% arrange(taxonID)

Preview data:

species_profile %>% head()

Save to CSV:

write_csv(species_profile, dwc_profile_file, na = "")