This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.

Load libraries:

library(tidyverse)      # To do data science
library(magrittr)       # To use %<>% pipes
library(here)           # To find files
library(janitor)        # To clean input data
library(readxl)         # To read Excel files
library(digest)         # To generate hashes

1 Read source data

Create a data frame input_data from the source data:

input_data <- read_excel(path = here("data", "raw", "checklist.xlsx")) 

Preview data:

input_data %>% head(n = 5)

2 Preprocessing

2.1 Tidy data

input_data %<>%
  remove_empty("rows") %>%       # Remove empty rows
  clean_names()                  # Have sensible (lowercase) column names

2.2 Taxon IDs

To uniquely identify a taxon in the taxon core and reference taxa in the extensions, we need a taxonID. Since we need it in all generated files, we generate it here in the raw data frame. It is a combination of dataset-shortname:taxon: and a hash based on the scientific name. As long as the scientific name doesn’t change, the ID will be stable:

# Vectorize the digest function (The digest() function isn't vectorized. So if you pass in a vector, you get one value for the whole vector rather than a digest for each element of the vector):
vdigest <- Vectorize(digest)

# Generate taxonID:
input_data %<>% mutate(taxon_id = paste("alien-plants-belgium", "taxon", vdigest (taxon, algo="md5"), sep=":"))

Preview data:

input_data %>% head()

3 Darwin Core mapping

3.1 Taxon core

Create a dataframe with unique taxa only (ignoring multiple distribution rows):

taxon <- input_data %>% distinct(taxon_id, .keep_all = TRUE)

Map the data to Darwin Core Taxon:

3.1.1 language

taxon %<>% mutate(dwc_language = "en")

3.1.2 license

taxon %<>% mutate(dwc_license = "http://creativecommons.org/publicdomain/zero/1.0/")

3.1.3 rightsHolder

taxon %<>% mutate(dwc_rightsHolder = "Botanic Garden Meise")

3.1.4 datasetID

taxon %<>% mutate(dwc_datasetID = "https://doi.org/10.15468/wtda1m")

3.1.5 datasetName

taxon %<>% mutate(dwc_datasetName = "Manual of the Alien Plants of Belgium")

3.1.6 taxonID

taxon %<>% mutate(dwc_taxonID = taxon_id)

3.1.7 scientificNameID

taxon %<>% mutate(dwc_scientificNameID = scientific_name_id)

3.1.8 scientificName

taxon %<>% mutate(dwc_scientificName = taxon)

3.1.9 kingdom

taxon %<>% mutate(dwc_kingdom = "Plantae")

3.1.10 family

taxon %<>% mutate(dwc_family = family)

3.1.11 taxonRank

taxon %<>% mutate(dwc_taxonRank = taxon_rank)

3.1.12 nomenclaturalCode

taxon %<>% mutate(dwc_nomenclaturalCode = "ICN")

3.2 Distribution extension

distribution <- input_data

Before we start mapping the distribution extensions, we focus on two terms: occurrenceStatus and eventDate:

This is because:

  1. Information on the occurrences is given for the regions, while date information is given for Belgium as a whole. Some transformations and clarifications are needed.

  2. Some species have two values for occurrenceStatus and eventDate, i.e. species with the degree of naturalisation (d_n) of extinct (Ext.) or extinct/casual (Ext./Cas.).

Extinct: introduced taxa that once were naturalized (usually rather locally) but that have not been confirmed in recent times in their known localities. Only taxa that are certainly extinct are indicated as such.
Extinct/casual: Some of these extinct taxa are no longer considered as naturalized but still occur as casuals; such taxa are indicated as “Ext./Cas.” (for instance Tragopogon porrifolius).

For these species, we include the occurrenceStatus within the specified time frame (eventDate = first - most recent observation) and after the last observation (eventDate = most recent observation - current date).

The easiest way to do use a stepwize approach: 1. Clean presence information and date information in distribution 2. Create a separate dataframe occurrenceStatus_ALO (ALO = after last observation) for extinct and extinct/casual species 3. Map occurrenceStatus and eventDate from cleaned presence and date information in distribution (for eventDate = first - most recent observation) 4. Map occurrenceStatus and eventDate from cleaned presence and date information in occurrenceStatus_ALO (for eventDate = most recent observation - current date) 5. Bind both dataframes by row. 6. Map the other Darwin Core terms in the distribution extension

The checklist contains minimal presence information (X,? or NA) for the three regions in Belgium: Flanders, Wallonia and the Brussels-Capital Region, contained in presence_fl, presence_wa and presence_br respectively. Information regarding the first/last recorded observation applies to the distribution in Belgium as a whole. Both national and regional information is required in the checklist. In the distribution.csv, we will first provide occurrenceStatus and `eventDate`` on a national level, followed by specific information for the regions.

For this, we use the following principles:

  1. When a species is present in only one region, we can assume eventDate relates to that specific region. In this case, we can keep lines for Belgium and for the specific region populated with these variables (see #45).

  2. When a species is present in more than one region, it is impossible to extrapolate the date information for the regions. In this case, we decided to provide occurrenceStatus for the regional information, and specify dates only for Belgium.

Thus, we need to specify when a species is present in only one of the regions.

We generate 4 new columns: Flanders, Brussels,Wallonia and Belgium. The content of these columns refers to the specific presence status of a species on a regional or national level. S if present in a single region or in Belgium, ? if presence uncertain, NA if absent and M if present in multiple regions.

This should look like this:

We translate this to the distribution extension:

distribution %<>% 
  mutate(Flanders = case_when(
    presence_fl == "X" & (is.na(presence_br) | presence_br == "?") & (is.na(presence_wa) | presence_wa == "?") ~ "S",
    presence_fl == "?" ~ "?",
    is.na(presence_fl) ~ "NA",
    TRUE ~ "M")) %>%
  mutate(Brussels = case_when(
    (is.na(presence_fl) | presence_fl == "?") & presence_br == "X" & (is.na(presence_wa) | presence_wa == "?") ~ "S",
    presence_br == "?" ~ "?",
    is.na(presence_br) ~ "NA",
    TRUE ~ "M")) %>%
  mutate(Wallonia = case_when(
    (is.na(presence_fl) | presence_fl == "?") & (is.na(presence_br) | presence_br == "?") & presence_wa == "X" ~ "S",
    presence_wa == "?" ~ "?",
    is.na(presence_wa) ~ "NA",
    TRUE ~ "M")) %>%
  mutate(Belgium = case_when(
    presence_fl == "X" | presence_br == "X" | presence_wa == "X" ~ "S", # One is "X"
    presence_fl == "?" | presence_br == "?" | presence_wa == "?" ~ "?" # One is "?"
  ))

Summary of the previous action:

distribution %>% select(presence_fl, presence_br, presence_wa, Flanders, Wallonia, Brussels, Belgium) %>%
  group_by_all() %>%
  summarize(records = n()) %>%
  arrange(Flanders, Wallonia, Brussels)

One line should represent the presence information of a species in one region or Belgium. We need to transform distribution from a wide to a long table (i.e. create a key and value column):

distribution %<>% gather(
  key, value,
  Flanders, Wallonia, Brussels, Belgium,
  convert = FALSE
) 

Rename key and value:

distribution %<>% rename("location" = "key", "presence" = "value")

Remove species for which we lack presence information (i.e. presence = `NA``):

distribution %<>% filter(!presence == "NA")

Now, we will clean date information

Create start_year from fr:

distribution %<>% mutate(start_year = fr)

Clean values:

distribution %<>% mutate(start_year = 
  str_replace_all(start_year, "(\\?|ca. |<|>)", "") # Strip ?, ca., < and >
)

Create end_year from mrr (most recent record):

distribution %<>% mutate(end_year = mrr)

Clean values:

distribution %<>% mutate(end_year = 
  str_replace_all(end_year, "(\\?|ca. |<|>)", "") # Strip ?, ca., < and >
)

If end_year is Ann. or N use current year:

current_year = format(Sys.Date(), "%Y")
distribution %<>% mutate(end_year = recode(end_year,
  "Ann." = current_year,
  "N" = current_year)
)

Show reformatted values for both fr and mrr:

distribution %>%
  select(fr, start_year) %>%
  rename(year = fr, formatted_year = start_year) %>%
  union( # Union with mrr. Will also remove duplicates
    distribution %>%
      select(mrr, end_year) %>%
      rename(year = mrr, formatted_year = end_year)
  ) %>%
  filter(nchar(year) != 4) %>% # Don't show raw values that were already YYYY
  arrange(year)

Check if any start_year fall after end_year (expected to be none):

distribution %>%
  select(start_year, end_year) %>%
  mutate(start_year = as.numeric(start_year)) %>%
  mutate(end_year = as.numeric(end_year)) %>%
  group_by(start_year, end_year) %>%
  summarize(records = n()) %>%
  filter(start_year > end_year) 

Combine start_year and end_year in an ranged Date (ISO 8601 format). If any those two dates is empty or the same, we use a single year, as a statement when it was seen once (either as a first record or a most recent record):

distribution %<>% mutate(Date = case_when(
  start_year == "" & end_year == "" ~ "",
  start_year == ""                  ~ end_year,
  end_year == ""                    ~ start_year,
  start_year == end_year            ~ start_year,
  TRUE                              ~ paste(start_year, end_year, sep = "/")
))

In a next step, we will generate occurrenceStatus_ALO

occurrenceStatus_ALO <- distribution %>% filter(d_n == "Ext." | d_n == "Ext./Cas.")

Then, we map occurrenceStatus and eventDate for distribution:

Map occurrenceStaus using IUCN definitions:

distribution %<>% mutate(occurrenceStatus = recode(presence,
  "S" = "present",
  "M" = "present",
  "?" = "presence uncertain",
  .default = ""
))

Overview of occurrenceStatus for each location x presence combination:

distribution %>% select (location, presence, occurrenceStatus) %>%
  group_by_all() %>%
  summarize(records = n()) 

Populate eventDate only when presence = S:

distribution %<>% mutate(eventDate = case_when(
  presence == "S" ~ Date,
  TRUE ~ ""
))

Map occurrenceStatus and eventDate for occurrenceStatus_ALO:

occurrenceStatus_ALO %<>% mutate(occurrenceStatus = case_when(
  d_n == "Ext." ~ "absent",
  d_n == "Ext./Cas." ~ "present"
))

occurrenceStatus_ALO %<>% mutate(eventDate = case_when(
  presence == "S" ~ paste(end_year, current_year, sep = "/")
))

Bind occurrenceStatus_ALO and distribution by rows:

distribution %<>% bind_rows(occurrenceStatus_ALO)

Map the data to Species Distribution:

3.2.1 taxonID

distribution %<>% mutate(dwc_taxonID = taxon_id)

3.2.2 locationID

distribution %<>% mutate(dwc_locationID = case_when(
  location == "Belgium" ~ "ISO_3166-2:BE",
  location == "Flanders" ~ "ISO_3166-2:BE-VLG",
  location == "Wallonia" ~ "ISO_3166-2:BE-WAL",
  location == "Brussels" ~ "ISO_3166-2:BE-BRU"
))

3.2.3 locality

distribution %<>% mutate(dwc_locality = case_when(
  location == "Belgium" ~ "Belgium",
  location == "Flanders" ~ "Flemish Region",
  location == "Wallonia" ~ "Walloon Region",
  location == "Brussels" ~ "Brussels-Capital Region"
))

3.2.4 countryCode

distribution %<>% mutate(dwc_countryCode = "BE")

3.2.5 occurrenceStatus

distribution %<>% mutate(dwc_occurrenceStatus = occurrenceStatus) 

3.2.6 establishmentMeans

distribution %<>% mutate(dwc_establishmentMeans = "introduced")

3.2.7 eventDate

distribution %<>% mutate(dwc_eventDate = eventDate) 

3.3 Species profile extension

In this extension will express broad habitat characteristics (e.g. isTerrestrial) of the species.

species_profile <- input_data

Habitat information can be found in habitat, which describes whether a species is found in freshwater, terrestrial or both (terrestrial/freshwater) habitats.

Show unique values:

species_profile %>%
  distinct(habitat) %>%
  arrange(habitat)

Clean content somewhat, i.e. change uppercase to lowercase

species_profile %<>% mutate(habitat = str_to_lower(habitat)) 

We map this information respectively to isFreshwater, isTerrestrial or both terms in the species profile extension.

Map the data to Species Profile:

3.3.1 taxonID

species_profile %<>% mutate(dwc_taxonID = taxon_id)

3.3.2 isMarine

species_profile %<>% mutate(dwc_isMarine = "FALSE")

3.3.3 isFreshwater

species_profile %<>% mutate(dwc_isFreshwater = case_when(
  habitat == "fresh water" | habitat == "terrestrial/fresh water" ~ "TRUE",
  TRUE ~"FALSE"
))

3.3.4 isTerrestrial

species_profile %<>% mutate(dwc_isTerrestrial = case_when(
  habitat == "terrestrial" | habitat == "terrestrial/fresh water" ~ "TRUE",
  TRUE ~"FALSE"
))

Show mapped values:

species_profile %>%
  select(habitat, dwc_isMarine, dwc_isFreshwater, dwc_isTerrestrial) %>%
  group_by_all() %>%
  summarize(records = n())

3.4 Description extension

In the description extension we want to include several important characteristics (hereafter refered to as descriptors) about the species:

  • Native range
  • Pathway of introduction
  • Invasion stage

A single taxon can have multiple descriptions of the same type (e.g. multiple native ranges), expressed as multiple rows in the description extension.

For each descriptor, we create a separate dataframe to process the specific information. We always specify which descriptor we map (type column) and its specific content (description column). After the mapping of these Darwin Core terms type and value, we merge the dataframes to generate one single description extension. We then continue the mapping process by adding the other Darwin Core terms (which content is independent of the type of descriptor, such as language).

3.4.1 native range

origin contains native range information (e.g. E AS-Te NAM). We’ll separate, clean, map and combine these values.

Create separate dataframe:

native_range <- input_data

Create description from origin:

native_range %<>% mutate(description = origin)

Separate description on space in 4 columns: In case there are more than 4 values, these will be merged in native_range_4. The dataset currently contains no more than 4 values per record.

native_range %<>% separate(
  description,
  into = c("native_range_1", "native_range_2", "native_range_3", "native_range_4"),
  sep = " ",
  remove = TRUE,
  convert = FALSE,
  extra = "merge",
  fill = "right"
)

Gather native ranges in a key and value column:

native_range %<>% gather(
  key, value,
  native_range_1, native_range_2, native_range_3, native_range_4,
  na.rm = TRUE, # Also removes records for which there is no native_range_1
  convert = FALSE
)

Sort on ID to see pathways in context for each record:

native_range %<>% arrange(id)

Clean values:

native_range %<>% mutate(
  value = str_replace_all(value, "\\?", ""), # Strip ?
  value = str_trim(value) # Clean whitespace
)

Map values:

native_range %<>% mutate(mapped_value = recode(value,
  "AF" = "Africa (WGSRPD:2)",
  "AM" = "pan-American",
  "AS" = "Asia",
  "AS-Te" = "temperate Asia (WGSRPD:3)",
  "AS-Tr" = "tropical Asia (WGSRPD:4)",
  "AUS" = "Australasia (WGSRPD:5)",
  "Cult." = "cultivated origin",
  "E" = "Europe (WGSRPD:1)",
  "Hybr." = "hybrid origin",
  "NAM" = "Northern America (WGSRPD:7)",
  "SAM" = "Southern America (WGSRPD:8)",
  "Trop." = "Pantropical",
  .default = "",
  .missing = "" # As result of stripping, records with no native range already removed by gather()
))

Show mapped values:

native_range %>%
  select(value, mapped_value) %>%
  group_by(value, mapped_value) %>%
  summarize(records = n()) %>%
  arrange(value)

Drop key and value column and rename mapped value:

native_range %<>% select(-key, -value)
native_range %<>% rename(description = mapped_value)

Keep only non-empty descriptions:

native_range %<>% filter(!is.na(description) & description != "")

Create a type field to indicate the type of description:

native_range %<>% mutate(type = "native range")

3.4.2 pathway of introduction

pathway (pathway description) information is based on v_i, which contains a list of introduction pathways (e.g. Agric., wool). We’ll separate, clean, map and combine these values.

Create separate dataframe:

pathway <- input_data

Create pathway from v_i:

pathway %<>% mutate(pathway = v_i)

Separate pathway on , in 4 columns: In case there are more than 4 values, these will be merged in pathway_4. The dataset currently contains no more than 3 values per record.

pathway %<>% separate(
  pathway,
  into = c("pathway_1", "pathway_2", "pathway_3", "pathway_4"),
  sep = ",",
  remove = TRUE,
  convert = FALSE,
  extra = "merge",
  fill = "right"
)

Gather pathways in a key and value column:

pathway %<>% gather(
  key, value,
  pathway_1, pathway_2, pathway_3, pathway_4,
  na.rm = TRUE, # Also removes records for which there is no pathway_1
  convert = FALSE
)

Sort on taxon_id to see pathways in context for each record:

pathway %<>% arrange(taxon_id)

Show unique values:

pathway %>%
  distinct(value) %>%
  arrange(value) 

Clean values:

pathway %<>% mutate(
  value = str_replace_all(value, "\\?|…|\\.{3}", ""), # Strip ?, …, ...
  value = str_to_lower(value), # Convert to lowercase
  value = str_trim(value) # Clean whitespace
)

Map values to the CBD standard::

pathway %<>% mutate(cbd_stand = recode(value, 
  "agric." = "escape_agriculture",
  "bird seed" = "contaminant_seed",
  "birdseed" = "contaminant_seed",
  "bulbs" = "",
  "coconut mats" = "contaminant_seed",
  "fish" = "",
  "food refuse" = "escape_food_bait",
  "grain" = "contaminant_seed",
  "grain (rice)" = "contaminant_seed",
  "grass seed" = "contaminant_seed",
  "hay" = "",
  "hort" = "escape_horticulture",
  "hort." = "escape_horticulture",
  "hybridization" = "",
  "military troops" = "",
  "nurseries" = "contaminant_nursery",
  "ore" = "contaminant_habitat_material",
  "pines" = "contaminant_on_plants",
  "rice" = "",
  "salt" = "",
  "seeds" = "contaminant_seed",
  "timber" = "contaminant_timber",
  "tourists" = "stowaway_people_luggage",
  "traffic" = "",
  "unknown" = "unknown",
  "urban weed" = "stowaway",
  "waterfowl" = "contaminant_on_animals",
  "wool" = "contaminant_on_animals",
  "wool alien" = "contaminant_on_animals",
  .default = "",
  .missing = "" # As result of stripping, records with no pathway already removed by gather()
))

Add prefix cbd_2014_pathway in case there is a match with the CBD standard:

pathway %<>% mutate(mapped_value = case_when(
  cbd_stand != "" ~ paste ("cbd_2014_pathway", cbd_stand, sep = ":"),
  TRUE ~ ""
))

Show mapped values:

pathway %>%
  select(value, mapped_value) %>%
  group_by(value, mapped_value) %>%
  summarize(records = n()) %>%
  arrange(value) 

Drop key,value and cbd_stand column:

pathway %<>% select(-key, -value, -cbd_stand)

Change column name mapped_value to description:

pathway %<>%  rename(description = mapped_value)

Create a type field to indicate the type of description:

pathway %<>% mutate (type = "pathway")

Show pathway descriptions:

pathway %>% 
  select(description) %>% 
  group_by(description) %>% 
  summarize(records = n())

Keep only non-empty descriptions:

pathway %<>% filter(!is.na(description) & description != "")

3.4.3 Degree of establishment

Create separate dataframe:

degree_of_establishment <- input_data

The information for degree of establishment is contained in d_n:

degree_of_establishment %>%
  select(d_n) %>%
  group_by_all() %>%
  summarize(records = n()) 

Clean the data:

degree_of_establishment %<>% mutate(description = recode(d_n,
  "Ext.?" = "Ext.",
  "Cas.?" = "Cas.",
  "Nat.?" = "Nat.",
  .missing = ""
))

We decided to use the unified framework for biological invasions of Blackburn et al. 2011 for invasion stage. casual, naturalized and invasive are terms included in this framework. However, we decided to discard the terms naturalized and invasive listed in Blackburn et al. (see trias-project/alien-fishes-checklist#6 (comment)). So, naturalized and invasive are replaced by established. For extinct (introduced taxa that once were naturalized but that have not been confirmed in recent times) and extinct/casual species (taxa are no longer considered as naturalized but still occur as casuals), we map the most recent invasion stage (i.e. extinct and casual respectively):

degree_of_establishment %<>% mutate(description = recode(description,
  "Cas." = "casual",
  "Inv." = "established",
  "Nat." = "established",
  "Ext." = "extinct",
  "Ext./Cas." = "casual"
))

Show mapped values:

degree_of_establishment %>%
  select(d_n, description) %>%
  group_by_all() %>%
  summarize(records = n())

Create a type field to indicate the type of description:

degree_of_establishment %<>% mutate(type = "degree of establishment")

Union native range, pathway of introduction and degree of establishment:

description <- bind_rows(native_range, pathway, degree_of_establishment)

Map the data to Taxon Description:

3.4.4 taxonID

description %<>% mutate(dwc_taxonID = taxon_id)

3.4.5 description

description %<>% mutate(dwc_description = description)

3.4.6 type

description %<>% mutate(dwc_type = type)

3.4.7 language

description %<>% mutate(dwc_language = "en")

4 Post-processing

Remove the original columns for each of the generated files:

taxon %<>% select(starts_with("dwc_")) 
distribution %<>% select(starts_with("dwc_")) 
species_profile %<>% select(starts_with("dwc_")) 
description %<>% select(starts_with("dwc_"))

Drop the dwc_ prefix:

colnames(taxon) <- str_remove(colnames(taxon), "dwc_")
colnames(distribution) <- str_remove(colnames(distribution), "dwc_")
colnames(species_profile) <- str_remove(colnames(species_profile), "dwc_")
colnames(description) <- str_remove(colnames(description), "dwc_")

Sort on taxonID to group description information per taxon:

taxon %<>% arrange(taxonID)
distribution %<>% arrange(taxonID)
species_profile %<>% arrange(taxonID)
description %<>% arrange(taxonID)

Preview taxon core:

taxon %>% head(10)

Preview distribution extension:

distribution %>% head(10)

Preview species profile extension:

species_profile %>% head(10)

Preview description extension:

description %>% head(10)

Save to CSV:

write_csv(taxon, here("data", "processed", "taxon.csv"), na = "")
write_csv(distribution, here("data", "processed", "distribution.csv"), na = "")
write_csv(species_profile, here("data", "processed", "speciesprofile.csv"), na = "")
write_csv(description, here("data", "processed", "description.csv"), na = "")