This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.
Load libraries:
library(tidyverse) # To do data science
library(magrittr) # To use %<>% pipes
library(here) # To find files
library(janitor) # To clean input data
library(readxl) # To read Excel files
library(digest) # To generate hashes
Set file paths (all paths should be relative to this script):
# Raw files:
raw_data_file = "../data/raw/ExotischeVissenVlaanderen2016.xlsx"
# Processed files:
dwc_taxon_file = "../data/processed/taxon.csv"
dwc_vernacular_file = "../data/processed/vernacularname.csv"
dwc_distribution_file = "../data/processed/distribution.csv"
dwc_description_file = "../data/processed/description.csv"
dwc_profile_file = "../data/processed/speciesprofile.csv"
Create a data frame raw_data from the source data:
# Read the source data:
raw_data <- read_excel(raw_data_file, sheet = "Checklist", na = "NA")
Clean the data somewhat: remove empty rows (if present) and standardize the column names:
raw_data %<>%
remove_empty("rows") %>% # Remove empty rows
clean_names() # Have sensible (lowercase) column names
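To see what these two cleaning steps do, here is a minimal sketch on a made-up data frame. The column names and values are hypothetical and are not taken from the source file; the example only assumes the janitor package is installed.

```r
library(janitor)

# Hypothetical input: messy column names and one completely empty row
toy <- data.frame(
  "Latin name" = c("Cyprinus carpio", NA),
  "Common Name" = c("common carp", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Drop rows in which every value is NA
toy_no_empty <- remove_empty(toy, which = "rows")

# Convert column names to lowercase snake_case
cleaned <- clean_names(toy_no_empty)

names(cleaned) # "latin_name" "common_name"
nrow(cleaned)  # 1
```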
To uniquely identify a taxon in the taxon core and reference taxa in the extensions, we need a taxonID. Since we need it in all generated files, we generate it here in the raw data frame. It is a combination of dataset-shortname:taxon: and a hash based on the scientific name. As long as the scientific name doesn’t change, the ID will be stable:
# Vectorize the digest function: digest() isn't vectorized, so passing a vector returns one hash for the whole vector rather than one hash per element
vdigest <- Vectorize(digest)
# Generate taxonID:
raw_data %<>% mutate(taxon_id = paste("alien-fishes-checklist", "taxon", vdigest(latin_name, algo="md5"), sep = ":"))
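The difference between the plain and the vectorized hash function can be checked with a small sketch. The names below are illustrative inputs only, and the variable names are made up for this example; it assumes the digest package is installed.

```r
library(digest)

# Illustrative scientific names (not read from the source file)
latin_names <- c("Cyprinus carpio", "Carassius auratus")

# digest() hashes the whole vector as one object: a single hash comes back
whole_vector_hash <- digest(latin_names, algo = "md5")

# The vectorized wrapper hashes each element separately
vdigest <- Vectorize(digest)
per_name_hash <- vdigest(latin_names, algo = "md5")

length(whole_vector_hash) # 1
length(per_name_hash)     # 2

# Build IDs in the same pattern as above; unname() drops the names
# that Vectorize() attaches to its result
taxon_ids <- paste("alien-fishes-checklist", "taxon", unname(per_name_hash), sep = ":")

# Hashing the same names again gives the same IDs, so they are stable
taxon_ids_again <- paste("alien-fishes-checklist", "taxon",
                         unname(vdigest(latin_names, algo = "md5")), sep = ":")
identical(taxon_ids, taxon_ids_again) # TRUE
```

Note that digest() serializes the R object before hashing by default, so the result is not necessarily the MD5 of the bare text string as other tools would compute it.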
Add prefix raw_ to all column names to avoid name clashes with Darwin Core terms:
colnames(raw_data) <- paste0("raw_", colnames(raw_data))
Preview data:
raw_data %>% head()
taxon <- raw_data
Map the data to Darwin Core Taxon.
taxon %<>% mutate(language = "en")
taxon %<>% mutate(license = "http://creativecommons.org/publicdomain/zero/1.0/")
taxon %<>% mutate(rightsHolder = "INBO")
taxon %<>% mutate(accessRights = "http://www.inbo.be/en/norms-for-data-use")
taxon %<>% mutate(datasetID = "https://doi.org/10.15468/xvuzfh")
taxon %<>% mutate(datasetName = "Checklist of non-native freshwater fishes in Flanders, Belgium")
taxon %<>% mutate(taxonID = raw_taxon_id)
taxon %<>% mutate(scientificName = raw_latin_name)
Verify that scientificName contains unique values:
any(duplicated(taxon$scientificName)) # Should be FALSE
## [1] FALSE
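If this check ever returns TRUE, it helps to see which names are repeated. A minimal sketch in base R, using a made-up vector in place of taxon$scientificName:

```r
# Hypothetical names with one deliberate duplicate
scientific_names <- c("Cyprinus carpio", "Carassius auratus", "Cyprinus carpio")

# TRUE as soon as any value occurs more than once
has_duplicates <- any(duplicated(scientific_names))

# The distinct values that occur more than once
duplicated_names <- unique(scientific_names[duplicated(scientific_names)])

has_duplicates   # TRUE
duplicated_names # "Cyprinus carpio"
```

Duplicate scientific names would also produce duplicate taxonID values, since the ID is derived from the name.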
taxon %<>% mutate(kingdom = "Animalia")
All taxa are species:
taxon %<>% mutate(taxonRank = "species")
taxon %<>% mutate(nomenclaturalCode = "ICZN")
Remove the original columns:
taxon %<>% select(-starts_with("raw_"))
Preview data:
taxon %>% head()
Save to CSV:
write_csv(taxon, dwc_taxon_file, na = "")
vernacular_names <- raw_data
Map the data to Vernacular Names.
vernacular_names %<>% mutate(taxonID = raw_taxon_id)
Vernacular names are available in two languages: English (raw_common_name) and Dutch (raw_nederlandse_naam). We will gather these columns to generate a single column containing the vernacular name (vernacularName) and an additional column with the language (language):
vernacular_names %<>%
gather(key = language, value = vernacularName, raw_common_name, raw_nederlandse_naam, na.rm = TRUE, convert = TRUE) %>%
select(-language, everything()) # Move language column to the end
vernacular_names %<>% arrange(taxonID)
The language column currently contains the original column name, which we will recode to the ISO 639-1 language code:
vernacular_names %<>% mutate(language = recode(language,
"raw_common_name" = "en",
"raw_nederlandse_naam" = "nl"
))
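The gather-and-recode pattern can be tried on a toy table. The sketch below uses invented taxon IDs and names (not the source data) and assumes dplyr and tidyr are installed; it mirrors the calls above rather than replacing them.

```r
library(dplyr)
library(tidyr)

# Invented example: the second taxon has no English name
toy <- tibble(
  taxonID = c("t1", "t2"),
  raw_common_name = c("common carp", NA),
  raw_nederlandse_naam = c("karper", "giebel")
)

long <- toy %>%
  # One row per taxon and language; na.rm = TRUE drops missing names
  gather(key = language, value = vernacularName,
         raw_common_name, raw_nederlandse_naam, na.rm = TRUE) %>%
  # Replace the original column names with ISO 639-1 codes
  mutate(language = recode(language,
    "raw_common_name" = "en",
    "raw_nederlandse_naam" = "nl"
  )) %>%
  arrange(taxonID)

nrow(long)    # 3: two names for t1, one for t2
long$language # "en" "nl" "nl"
```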
Remove the original columns:
vernacular_names %<>% select(-starts_with("raw_"))
Preview data:
vernacular_names %>% head()
Save to CSV:
write_csv(vernacular_names, dwc_vernacular_file, na = "")
distribution <- raw_data
Map the data to Species Distribution.
distribution %<>% mutate(taxonID = raw_taxon_id)
distribution %<>% mutate(locationID = "ISO_3166-2:BE-VLG")
distribution %<>% mutate(locality = "Flemish Region")
distribution %<>% mutate(countryCode = "BE")
distribution %<>% mutate(occurrenceStatus = "present")
We use the GBIF controlled vocabulary for establishmentMeans. For this dataset, all species are introduced (= alien):
distribution %<>% mutate(establishmentMeans = "introduced")
The distribution information applies to a certain date range, which we will express here as an ISO 8601 date yyyy/yyyy (start_year/end_year).
The start_year is the year of first observation:
distribution %<>% mutate(start_year = raw_introduction)
Some start_year values are quite broad. We change these values to the first year for which an observation is plausible:
distribution %<>% mutate(start_year = recode(start_year,
"20xx" = "2000",
"17th c." = "1601",
"1980s" = "1980",
"13th c." = "1201"
))
As there is no end_year information, we’ll consider the publication year of Verreycken et al. (2018) as the date when the presence of the species was last verified (even though all species are considered established and will probably be present after this year as well):
distribution %<>% mutate(end_year = "2018")
Create eventDate by combining start_year/end_year:
distribution %<>% mutate(eventDate = paste(start_year, end_year, sep = "/"))
Compare formatted dates with original dates in raw_introduction:
distribution %>%
distinct(raw_introduction, eventDate) %>%
arrange(raw_introduction)
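Besides comparing the values by eye, the yyyy/yyyy format can be checked with a regular expression. This is a sketch on invented input values, assuming dplyr is installed; recode() leaves values it has no mapping for unchanged, so a new broad value in the source would show up as a malformed date.

```r
library(dplyr)

# Invented introduction dates, including one broad value without a mapping
toy <- tibble(raw_introduction = c("1872", "20xx", "17th c.", "19th c."))

toy <- toy %>%
  mutate(
    start_year = recode(raw_introduction,
      "20xx" = "2000",
      "17th c." = "1601"
    ),
    end_year = "2018",
    eventDate = paste(start_year, end_year, sep = "/")
  )

# TRUE where eventDate is four digits, a slash, four digits
is_valid <- grepl("^[0-9]{4}/[0-9]{4}$", toy$eventDate)

toy$eventDate[is_valid]  # "1872/2018" "2000/2018" "1601/2018"
toy$eventDate[!is_valid] # "19th c./2018"
```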
Remove the original columns:
distribution %<>% select(-starts_with("raw_"), -start_year, -end_year)
Preview data:
distribution %>% head()
Save to CSV:
write_csv(distribution, dwc_distribution_file, na = "")
In the description extension we want to include several important characteristics (hereafter referred to as descriptors) about the species: native range, pathway of introduction and invasion stage.
A single taxon can have multiple descriptions of the same type (e.g. multiple native ranges), expressed as multiple rows in the description extension.
For each descriptor, we create a separate dataframe to process the specific information. We always specify which descriptor we map (type column) and its specific content (description column). After mapping these Darwin Core terms (type and description), we merge the dataframes to generate one single description extension. We then continue the mapping process by adding the other Darwin Core terms, whose content is independent of the type of descriptor (such as language).
Native range information (e.g. AS for Asia) can be found in raw_origin.
Create separate dataframe:
native_range <- raw_data
Show unique values:
native_range %>%
distinct(raw_origin) %>%
arrange(raw_origin)
raw_origin contains multiple values (currently not more than 2), so we separate it on " or " in 2 columns:
native_range %<>% separate(raw_origin,
into = c("native_range_1", "native_range_2"),
sep = " or ",
remove = FALSE,
convert = FALSE,
extra = "merge",
fill = "right"
)
Gather in a key and value column:
native_range %<>% gather(
key, value,
native_range_1, native_range_2,
na.rm = TRUE, # Also removes records for which there is no native_range_1
convert = FALSE
)
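The combined effect of separate() and gather() is easier to follow on a toy table. The sketch below uses invented taxon IDs and origin codes and assumes dplyr and tidyr are installed; it shows that a taxon with two origins ends up with two rows, while a taxon without origin information drops out.

```r
library(dplyr)
library(tidyr)

# Invented example: one, two and zero native ranges
toy <- tibble(
  taxonID = c("t1", "t2", "t3"),
  raw_origin = c("AS", "EE or AS", NA)
)

long <- toy %>%
  # Split on " or "; fill = "right" pads single values with NA
  separate(raw_origin,
    into = c("native_range_1", "native_range_2"),
    sep = " or ",
    remove = FALSE,
    extra = "merge",
    fill = "right"
  ) %>%
  # One row per taxon and native range; rows with NA values are dropped
  gather(key, value, native_range_1, native_range_2, na.rm = TRUE) %>%
  arrange(taxonID)

long$taxonID # "t1" "t2" "t2"
long$value   # "AS" "EE" "AS"
```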
Map values:
native_range %<>% mutate(mapped_value = recode(value,
"AS" = "Asia",
"EE" = "Eastern Europe",
"AFR" = "Africa"))
Show mapped values:
native_range %>%
select(value, mapped_value) %>%
group_by_all() %>%
summarize(records = n()) %>%
arrange(value)
Drop the key and value columns and rename mapped_value as description:
native_range %<>%
select(-key, -value) %>%
rename(description = mapped_value)
Create a type field to indicate the type of description:
native_range %<>% mutate(type = "native range")
Pathway information can be found in raw_pathway_s.
Create separate dataframe:
pathway <- raw_data
Show unique values:
pathway %>%
distinct(raw_pathway_s) %>%
arrange(raw_pathway_s)
raw_pathway_s contains multiple values (currently not more than 2), so we separate it in 2 columns:
pathway %<>% separate(
raw_pathway_s,
into = c("pathway_1", "pathway_2"),
sep = ", ",
remove = FALSE,
convert = FALSE,
extra = "merge",
fill = "right"
)
Gather in a key and value column:
pathway %<>% gather(
key, value,
pathway_1, pathway_2,
na.rm = TRUE, # Also removes records for which there is no pathway_1
convert = FALSE
)
We use the CBD 2014 pathway vocabulary to standardize this information. The vocabulary has these values.
This is a manually created overview of the abbreviations used in this dataset, their interpretation and the mapping to the CBD vocabulary:
Map values:
pathway %<>% mutate(mapped_value = recode(value,
"AM" = "corridor_water",
"AN" = "escape_food_bait",
"AQ" = "escape_aquaculture",
"BC" = "release_biological_control",
"BW" = "stowaway_ballast_water",
"OR" = "escape_ornamental",
"UN" = "unintentional" # Not part of CBD
))
Add the prefix cbd_2014_pathway: to refer to this standard, except for unintentional, which has no alternative in the standard:
pathway %<>% mutate(mapped_value = case_when (
mapped_value != "unintentional" ~ paste("cbd_2014_pathway", mapped_value, sep = ":"),
mapped_value == "unintentional" ~ mapped_value))
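Because recode() passes through any value it has no mapping for, an unknown abbreviation would silently receive the cbd_2014_pathway: prefix. The sketch below illustrates this on invented input and adds a check for unmapped codes; the known_codes vector and the "XX" abbreviation are assumptions made for this example, and dplyr is assumed to be installed.

```r
library(dplyr)

# Subset of the abbreviations mapped above, for illustration
known_codes <- c("AQ", "OR", "UN")

# Invented input with one abbreviation ("XX") that has no mapping
toy <- tibble(value = c("AQ", "UN", "XX"))

# Abbreviations that would slip through the recode unchanged
unmapped <- setdiff(toy$value, known_codes)

toy <- toy %>%
  mutate(mapped_value = recode(value,
    "AQ" = "escape_aquaculture",
    "OR" = "escape_ornamental",
    "UN" = "unintentional"
  )) %>%
  mutate(mapped_value = case_when(
    mapped_value != "unintentional" ~ paste("cbd_2014_pathway", mapped_value, sep = ":"),
    mapped_value == "unintentional" ~ mapped_value
  ))

unmapped         # "XX"
toy$mapped_value # "cbd_2014_pathway:escape_aquaculture" "unintentional" "cbd_2014_pathway:XX"
```

Inspecting unmapped before writing the file is one way to catch new abbreviations in the source data.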
Show mapped values:
pathway %>%
select(value, mapped_value) %>%
group_by_all() %>%
summarize(records = n()) %>%
arrange(value)
Drop the key and value columns and rename mapped_value as description:
pathway %<>%
select(-key, -value) %>%
rename(description = mapped_value)
Create a type field to indicate the type of description:
pathway %<>% mutate(type = "pathway")
Invasion stage information (e.g. casual) can be found in raw_status.
Create separate dataframe:
invasion_stage <- raw_data
Show unique values:
invasion_stage %>%
distinct(raw_status) %>%
arrange(raw_status)
We use the invasion stage vocabulary from Blackburn et al. (2011) to standardize this information.
A and A* in the raw data file stand for acclimatized and refer to the status where the alien species has overcome the geographical barrier, the captivity/cultivation barrier and the survival barrier, but where it fails to establish because individuals in the population fail to reproduce. This includes species (like Nile tilapia) in secluded warmer water of power plants. In Blackburn et al. (2011), this corresponds to the term casual.
N in the raw data file stands for naturalized. We decided not to use this term, because there is often no sensible criterion to distinguish between casual/naturalised and naturalised/established. Thus, here, we consider all naturalized species to be established.
invasion_stage %<>% mutate(description = case_when(
raw_status == "A" ~ "casual",
raw_status == "A*" ~ "casual",
raw_status == "N" ~ "established"
))
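Unlike recode(), case_when() returns NA for any value that matches none of its conditions. A minimal sketch on invented status codes (the "?" code is hypothetical), assuming dplyr is installed:

```r
library(dplyr)

# Invented status codes, including one ("?") without a mapping
toy <- tibble(raw_status = c("A", "A*", "N", "?"))

toy <- toy %>%
  mutate(description = case_when(
    raw_status %in% c("A", "A*") ~ "casual",
    raw_status == "N" ~ "established"
  ))

toy$description # "casual" "casual" "established" NA

# Status codes that received no invasion stage
missing_stage <- toy$raw_status[is.na(toy$description)]
missing_stage # "?"
```

Since the files are written with na = "", such a taxon would get an invasion stage row with an empty description.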
Create a type field to indicate the type of description:
invasion_stage %<>% mutate(type = "invasion stage")
Union native range, pathway of introduction and invasion stage:
description_ext <- bind_rows(native_range, pathway, invasion_stage)
Map the data to Taxon Description.
description_ext %<>% mutate(taxonID = raw_taxon_id)
description_ext %<>% mutate(description = description)
description_ext %<>% mutate(type = type)
description_ext %<>% mutate(language = "en")
Remove the original columns:
description_ext %<>% select(-starts_with("raw_"))
Move taxonID to the first position:
description_ext %<>% select(taxonID, everything())
Sort on taxonID to group description information per taxon:
description_ext %<>% arrange(taxonID)
Preview data:
description_ext %>% head(10)
Save to CSV:
write_csv(description_ext, dwc_description_file, na = "")
In this extension we will express broad habitat characteristics of the species.
species_profile <- raw_data
Habitat information can be found in raw_habitat. All taxa are considered to be freshwater fishes.
Map the data to Species Profile.
species_profile %<>% mutate(taxonID = raw_taxon_id)
species_profile %<>% mutate(isMarine = FALSE)
species_profile %<>% mutate(isFreshwater = TRUE)
species_profile %<>% mutate(isTerrestrial = FALSE)
Remove the original columns:
species_profile %<>% select(-starts_with("raw_"))
Sort on taxonID:
species_profile %<>% arrange(taxonID)
Preview data:
species_profile %>% head()
Save to CSV:
write_csv(species_profile, dwc_profile_file, na = "")