This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.
Load libraries:
library(tidyverse) # To do data science
library(magrittr) # To use %<>% pipes
library(here) # To find files
library(janitor) # To clean input data
library(readxl) # To read Excel files
library(digest) # To generate hashes
Set file paths (all paths should be relative to this script):
# Raw files:
raw_data_file = "../data/raw/AI_2016_Boets_etal_Supplement.xls"
sources_file = "../data/raw/sources.tsv"
# Processed files:
dwc_taxon_file = "../data/processed/dwc_checklist/taxon.csv"
dwc_distribution_file = "../data/processed/dwc_checklist/distribution.csv"
dwc_profile_file = "../data/processed/dwc_checklist/speciesprofile.csv"
dwc_description_file = "../data/processed/dwc_checklist/description.csv"
Create a data frame raw_data
from the source data:
# Read the source data:
raw_data <- read_excel(raw_data_file, sheet = "checklist", na = "NA")
sources <- read.table(sources_file, sep = "\t", quote = "", colClasses = "character", fileEncoding = "UTF8", header = T)
Clean the data somewhat: remove empty rows if present:
raw_data %<>%
remove_empty("rows") %>% # Remove empty rows
clean_names() # Have sensible (lowercase) column names
To uniquely identify a taxon in the taxon core and reference taxa in the extensions, we need a taxonID
. Since we need it in all generated files, we generate it here in the raw data frame. It is a combination of dataset-shortname:taxon:
and a hash based on the scientific name. As long as the scientific name doesn’t change, the ID will be stable:
# Vectorize the digest function (The digest() function isn't vectorized. So if you pass in a vector, you get one value for the whole vector rather than a digest for each element of the vector):
vdigest <- Vectorize(digest)
# Generate taxonID:
raw_data %<>% mutate(taxon_id = paste("alien-macroinvertebrates-checklist", "taxon", vdigest(species, algo="md5"), sep = ":"))
The raw data contains (abbreviated) citations in raw_reference
(e.g. van Haaren and Soors 2009
). We want to use full reference (e.g. van Haaren T, Soors J (2009) Sinelobus stanfordi (Richardson, 1901): a new crustacean invader in Europe. Aquatic Invasions 4: 703–711, https://doi.org/10.3391/ai.2009.4.4.20
), which are stored in a separate sources file. We will join the two here, so we can use this information in the core and/or extensions.
Clean raw_reference
somewhat:
raw_data %<>% mutate(reference = recode(reference,
"Adam and Leloup 1934" = "Adam and Leloup 1934", # Remove whitespace
"Van Haaren and Soors 2009" = "van Haaren and Soors 2009", # Remove whitespace and lowercase "van"
"This study" = "Boets et al. 2016",
"Nyst 1835; Adam 1947" = "Nyst 1835 | Adam 1947" )
)
Left join raw_data
with sources
on references
= citation
:
raw_data %<>%
left_join(sources, by = c("reference" = "citation"))
Show result (note: Boets. et al. unpub data
and Collection RBINS
don’t have a full reference):
raw_data %>%
mutate(full_reference = substr(full_reference, 1, 50)) %>% # Shorten full_reference to make it easier to display
select(reference, full_reference) %>%
group_by_all %>%
summarize(records = n()) %>%
arrange(reference)
Add prefix raw_
to all column names to avoid name clashes with Darwin Core terms:
colnames(raw_data) <- paste0("raw_", colnames(raw_data))
Preview data:
head(raw_data)
taxon <- raw_data
Map the data to Darwin Core Taxon.
taxon %<>% mutate(language = "en")
taxon %<>% mutate(license = "http://creativecommons.org/publicdomain/zero/1.0/")
taxon %<>% mutate(rightsHolder = "Ghent University Aquatic Ecology")
taxon %<>% mutate(datasetID = "https://doi.org/10.15468/yxcq07")
taxon %<>% mutate(datasetName = "Inventory of alien macroinvertebrates in Flanders, Belgium")
taxon %<>% mutate(taxonID = raw_taxon_id)
taxon %<>% mutate(scientificName = raw_species) %>% mutate(scientificName = str_trim(scientificName))
Verify that scientificName
contains unique values:
any(duplicated(taxon$scientificName)) # Should be FALSE
## [1] FALSE
taxon %<>% mutate(kingdom = "Animalia")
taxon %<>% mutate(phylum = recode(raw_phylum,
"Crustacea" = "Arthropoda" # Crustacea is a subphylum, the correct phylum is "Arthropoda"
))
taxon %<>%
mutate(order = recode(raw_order,
"Veneroidea" = "Venerida" # Correct typo
)) %>%
mutate(order = str_trim(order)) # Oddly, not all whitespace got detected and trimmed by read_excel
taxon %<>% mutate(family = raw_family)
taxon %<>% mutate(taxonRank = case_when(
raw_species == "Dreissena rostriformis bugensis" ~ "subspecies",
raw_species != "Dreissena rostriformis bugensis" ~ "species")
)
taxon %<>% mutate(nomenclaturalCode = "ICZN")
Remove the original columns:
taxon %<>% select(-starts_with("raw_"))
Sort on taxonID
:
taxon %<>% arrange(taxonID)
Preview data:
head(taxon)
Save to CSV:
write_csv(taxon, dwc_taxon_file, na = "")
distribution <- raw_data
Map the data to Species Distribution.
distribution %<>% mutate(taxonID = raw_taxon_id)
distribution %<>% mutate(locationID = "ISO_3166-2:BE-VLG")
distribution %<>% mutate(locality = "Flemish Region")
distribution %<>% mutate(countryCode = "BE")
distribution %<>% mutate(occurrenceStatus = "present")
We use the GBIF controlled vocabulary for this term. For this dataset, all species are introduced
(= alien):
distribution %<>% mutate(establishmentMeans = "introduced")
The distribution information applies to a certain date range, which we will express here as an ISO 8601 date yyyy/yyyy
(start_year/end_year
).
The date information can be found in raw_first_occurrence_in_flanders
, but it is currently expressed in different formats: yyyy
, < yyyy
, <yyyy
, before yyyy
and yyyy-yyyy
:
distribution %>%
distinct(raw_first_occurrence_in_flanders) %>%
arrange(raw_first_occurrence_in_flanders)
We remove the special characters from raw_first_occurrence_in_flanders
and save as year
:
distribution %<>% mutate(year = str_replace_all(raw_first_occurrence_in_flanders, "(< |before |<)", ""))
Then, we split the information on -
to get a start_year
and end_year
. Most records won’t have an end_year
. For those we’ll consider the publication year of Boets et al. (2016) as the date when the presence of the species was last verified:
distribution %<>%
separate(year, into = c("start_year", "end_year"), sep = "-") %>% # Separate year
mutate(end_year = case_when(
is.na(end_year) ~ "2016",
TRUE ~ end_year)
)
Create eventDate
by combining start_year/end_year
:
distribution %<>% mutate(eventDate = paste(start_year, end_year, sep = "/"))
Compare formatted dates with original dates in raw_first_occurrence_in_flanders
:
distribution %>%
distinct(raw_first_occurrence_in_flanders, eventDate) %>%
arrange(raw_first_occurrence_in_flanders)
distribution %<>% mutate(source = raw_full_reference)
Remove the original columns:
distribution %<>% select(-starts_with("raw_"), -start_year, -end_year)
Sort on taxonID
:
distribution %<>% arrange(taxonID)
Preview data:
distribution %>%
mutate(source = substr(source, 1, 10)) %>% # Shorten source to make it easier to display
head()
Save to CSV:
write_csv(distribution, dwc_distribution_file, na = "")
In this extension will express broad habitat characteristics (e.g. isMarine
) of the species.
species_profile <- raw_data
Habitat information can be found in raw_salinity_zone
, which describes whether a species is found in brackish (B
), freshwater (F
), marine (M
) or combined (B/M
or F/B
) salinity zones.
Show unique values:
species_profile %>%
distinct(raw_salinity_zone) %>%
arrange(raw_salinity_zone)
This is a manually created overview of how we will map the raw_salinity_zone
to isMarine
, isFreshwater
and isTerrestrial
. Note: the latter term is added for completeness, but will be FALSE
for all species in this dataset:
Map the data to Species Profile.
species_profile %<>% mutate(taxonID = raw_taxon_id)
species_profile %<>% mutate(isMarine = case_when(
raw_salinity_zone == "M" | raw_salinity_zone == "B/M" | raw_salinity_zone == "B" ~ "TRUE",
TRUE ~ "FALSE"
))
species_profile %<>% mutate(isFreshwater = case_when(
raw_salinity_zone == "F" | raw_salinity_zone == "F/B" | raw_salinity_zone == "B" ~ "TRUE",
TRUE ~"FALSE"
))
species_profile %<>% mutate(isTerrestrial = "FALSE")
Show mapped values:
species_profile %>%
select(raw_salinity_zone, isMarine, isFreshwater, isTerrestrial) %>%
group_by_all() %>%
summarize(records = n())
Remove the original columns:
species_profile %<>% select(-starts_with("raw_"))
Sort on taxonID
:
species_profile %<>% arrange(taxonID)
Preview data:
species_profile %>% head()
Save to CSV:
write_csv(species_profile, dwc_profile_file, na = "")
In the description extension we want to include several important characteristics (hereafter refered to as descriptors) about the species:
The structure of the description extension is slightly different from the other core/extension files: information for a specific taxon (linked to taxonID) is provided in multipele lines within the csv file: one line per taxon per descriptor. In this way, we are able to include multipele descriptors for each species.
For each descriptor, we create a separate dataframe to process the specific information. We always specify which descriptor we map (type
column) and its specific content (description
column). After the mapping of these Darwin Core terms type
and value
, we merge the dataframes to generate one single description extension. We then continue the mapping process by adding the other Darwin Core terms (which content is independent of the type of descriptor, such as language
).
Native range information (e.g. South-America
) can be found in raw_origin
.
Create separate dataframe:
native_range <- raw_data
Show unique values:
native_range %>%
distinct(raw_origin) %>%
arrange(raw_origin)
raw_origin
contains multiple values (currently not more than 3), so we separate it in 3 columns:
native_range %<>% separate(raw_origin,
into = c("native_range_1", "native_range_2", "native_range_3"),
sep = ", ",
remove = TRUE,
convert = FALSE,
extra = "merge",
fill = "right"
)
Gather in a key
and value
column:
native_range %<>% gather(
key, value,
native_range_1, native_range_2, native_range_3,
na.rm = TRUE, # Also removes records for which there is no native_range_1
convert = FALSE
)
Map values:
native_range %<>% mutate(mapped_value = recode(value,
"East-Asia" = "East Asia",
"East-Europe" = "Eastern Europe",
"Indio-Pacific" = "Indo-Pacific",
"North-Africa" = "North Africa",
"North-America" = "Northern America",
"Northeast-Asia" = "Northeast Asia",
"South-America" = "South America",
"South-Europe" = "Southern Europe",
"Southeast-Asia" = "Southeast Asia",
"Southern hemisphere" = "Southern Hemisphere",
"USA" = "United States",
"West-Africa" = "West Africa",
"West-Atlantic" = "Western Atlantic"
))
Show mapped values:
native_range %>%
select(value, mapped_value) %>%
group_by(value, mapped_value) %>%
summarize(records = n()) %>%
arrange(value)
Drop the key
and value
columns and rename mapped_value
as description
:
native_range %<>%
select(-key, -value) %>%
rename(description = mapped_value)
Create a type
field to indicate the type of description:
native_range %<>% mutate(type = "native range")
Pathway information (e.g. aquaculture
) can be found in raw_pathway_of_introduction
. This original information was interpreted by @timadriaens to a more standardized value in raw_pathway_mapping
(with remarks about this transformation in raw_pathway_mapping_remarks
).
Create separate dataframe:
pathway <- raw_data
Show unique values:
native_range %>%
distinct(raw_pathway_mapping) %>%
arrange(raw_pathway_mapping)
raw_pathway_mapping
contains multiple values (currently not more than 3), so we separate it in 3 columns:
pathway %<>% separate(raw_pathway_mapping,
into = c("pathway_1", "pathway_2", "pathway_3"),
sep = " \\| ",
remove = TRUE,
convert = FALSE,
extra = "merge",
fill = "right"
)
Gather in a key
and value
column:
pathway %<>% gather(
key, value,
pathway_1, pathway_2, pathway_3,
na.rm = TRUE, # Also removes records for which there is no pathway_1
convert = FALSE
)
We use the CBD 2014 pathway vocabulary to standardize this information. The vocubulary has these values.
Map values:
pathway %<>% mutate (mapped_value = recode(value,
"Aquaculture" = "escape_aquaculture",
"Aquaculture / mariculture" = "escape_aquaculture",
"Contaminant on animals (except parasites, species transported by host/vector)" = "contaminant_on_animals",
"Interconnected waterways/basins/seas" = "corridor_water",
"Mariculture" = "escape_aquaculture",
"Other means of transport" = "stowaway_other",
"Pet/aquarium/terrarium species (including live food for such species )" = "escape_pet",
"Ship/boat ballast water" = "stowaway_ballast_water",
"Ship/boat hull fouling" = "stowaway_hull_fouling"))
Add the prefix cbd_2014_pathway:
to refer to this standard:
pathway %<>% mutate(mapped_value = paste ("cbd_2014_pathway", mapped_value, sep = ":"))
Show mapped values:
pathway %>%
select(value, mapped_value) %>%
group_by(value, mapped_value) %>%
summarize(records = n()) %>%
arrange(value)
Drop the key
and value
columns and rename mapped_value
as description
:
pathway %<>%
select(-key, -value) %>%
rename(description = mapped_value)
Create a type
field to indicate the type of description:
pathway %<>% mutate(type = "pathway")
There is no invasion stage information (e.g. casual
) in the raw data, but we want to add it none the less.
Create separate dataframe:
invasion_stage <- raw_data
We use the invasion stage vocabulary from Blackburn et al. (2011) to standardize this information.
Here, we consider all species to be established
as they come from live samples in running waters. We decided not to use the terms naturalized
(because often, there’s no sensible criterium to distinguish between casual/naturalized of naturalized/established) and invasive
(which is a term that can only be applied after a risk assessment).
invasion_stage %<>% mutate(description = "established")
Create a type
field to indicate the type of description:
invasion_stage %<>% mutate(type = "invasion stage")
Union native range, pathway of introduction and invasion stage:
description_ext <- bind_rows(native_range, pathway, invasion_stage)
Map the data to Taxon Description.
description_ext %<>% mutate(taxonID = raw_taxon_id)
description_ext %<>% mutate(description = description)
description_ext %<>% mutate(type = type)
description_ext %<>% mutate(source = raw_full_reference)
description_ext %<>% mutate(language = "en")
Remove the original columns:
description_ext %<>% select(-starts_with("raw_"))
Move taxonID
to the first position:
description_ext %<>% select(taxonID, everything())
Sort on taxonID
:
description_ext %<>% arrange(taxonID)
Preview data:
description_ext %>%
mutate(source = substr(source, 1, 10)) %>% # Shorten source to make it easier to display
head()
Save to CSV:
write_csv(description_ext, dwc_description_file, na = "")