This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.
Load libraries:
library(tidyverse) # To do data science
library(magrittr) # To use %<>% pipes
library(here) # To find files
library(janitor) # To clean input data
library(readxl) # To read Excel files
library(digest) # To generate hashes
Create a data frame input_data
from the source data:
input_data <- read_excel(path = here("data", "raw", "checklist.xlsx"))
Preview data:
input_data %>% head(n = 5)
input_data %<>%
remove_empty("rows") %>% # Remove empty rows
clean_names() # Have sensible (lowercase) column names
To uniquely identify a taxon in the taxon core and reference taxa in the extensions, we need a taxonID
. Since we need it in all generated files, we generate it here in the raw data frame. It is a combination of dataset-shortname:taxon:
and a hash based on the scientific name. As long as the scientific name doesn’t change, the ID will be stable:
# Vectorize the digest function (The digest() function isn't vectorized. So if you pass in a vector, you get one value for the whole vector rather than a digest for each element of the vector):
vdigest <- Vectorize(digest)
# Generate taxonID:
input_data %<>% mutate(taxon_id = paste("alien-plants-belgium", "taxon", vdigest (taxon, algo="md5"), sep=":"))
Preview data:
input_data %>% head()
Create a dataframe with unique taxa only (ignoring multiple distribution rows):
taxon <- input_data %>% distinct(taxon_id, .keep_all = TRUE)
Map the data to Darwin Core Taxon:
taxon %<>% mutate(dwc_language = "en")
taxon %<>% mutate(dwc_license = "http://creativecommons.org/publicdomain/zero/1.0/")
taxon %<>% mutate(dwc_rightsHolder = "Botanic Garden Meise")
taxon %<>% mutate(dwc_datasetID = "https://doi.org/10.15468/wtda1m")
taxon %<>% mutate(dwc_datasetName = "Manual of the Alien Plants of Belgium")
taxon %<>% mutate(dwc_taxonID = taxon_id)
taxon %<>% mutate(dwc_scientificNameID = scientific_name_id)
taxon %<>% mutate(dwc_scientificName = taxon)
taxon %<>% mutate(dwc_kingdom = "Plantae")
taxon %<>% mutate(dwc_family = family)
taxon %<>% mutate(dwc_taxonRank = taxon_rank)
taxon %<>% mutate(dwc_nomenclaturalCode = "ICN")
distribution <- input_data
Before we start mapping the distribution extensions, we focus on two terms: occurrenceStatus
and eventDate
:
This is because:
Information on the occurrences is given for the regions, while date information is given for Belgium as a whole. Some transformations and clarifications are needed.
Some species have two values for occurrenceStatus
and eventDate
, i.e. species with the degree of naturalisation (d_n
) of extinct (Ext.
) or extinct/casual (Ext./Cas.
).
Extinct: introduced taxa that once were naturalized (usually rather locally) but that have not been confirmed in recent times in their known localities. Only taxa that are certainly extinct are indicated as such.
Extinct/casual: Some of these extinct taxa are no longer considered as naturalized but still occur as casuals; such taxa are indicated as “Ext./Cas.” (for instance Tragopogon porrifolius).
For these species, we include the occurrenceStatus within the specified time frame (eventDate
= first - most recent observation) and after the last observation (eventDate
= most recent observation - current date).
The easiest way to do use a stepwize approach: 1. Clean presence information and date information in distribution
2. Create a separate dataframe occurrenceStatus_ALO
(ALO = after last observation) for extinct and extinct/casual species 3. Map occurrenceStatus
and eventDate
from cleaned presence and date information in distribution
(for eventDate
= first - most recent observation) 4. Map occurrenceStatus
and eventDate
from cleaned presence and date information in occurrenceStatus_ALO
(for eventDate
= most recent observation - current date) 5. Bind both dataframes by row. 6. Map the other Darwin Core terms in the distribution extension
The checklist contains minimal presence information (X
,?
or NA
) for the three regions in Belgium: Flanders, Wallonia and the Brussels-Capital Region, contained in presence_fl
, presence_wa
and presence_br
respectively. Information regarding the first/last recorded observation applies to the distribution in Belgium as a whole. Both national and regional information is required in the checklist. In the distribution.csv
, we will first provide occurrenceStatus
and `eventDate`` on a national level, followed by specific information for the regions.
For this, we use the following principles:
When a species is present in only one region, we can assume eventDate
relates to that specific region. In this case, we can keep lines for Belgium and for the specific region populated with these variables (see #45).
When a species is present in more than one region, it is impossible to extrapolate the date information for the regions. In this case, we decided to provide occurrenceStatus
for the regional information, and specify dates only for Belgium.
Thus, we need to specify when a species is present in only one of the regions.
We generate 4 new columns: Flanders
, Brussels
,Wallonia
and Belgium
. The content of these columns refers to the specific presence status of a species on a regional or national level. S
if present in a single region or in Belgium, ?
if presence uncertain, NA
if absent and M
if present in multiple regions.
This should look like this:
We translate this to the distribution extension:
distribution %<>%
mutate(Flanders = case_when(
presence_fl == "X" & (is.na(presence_br) | presence_br == "?") & (is.na(presence_wa) | presence_wa == "?") ~ "S",
presence_fl == "?" ~ "?",
is.na(presence_fl) ~ "NA",
TRUE ~ "M")) %>%
mutate(Brussels = case_when(
(is.na(presence_fl) | presence_fl == "?") & presence_br == "X" & (is.na(presence_wa) | presence_wa == "?") ~ "S",
presence_br == "?" ~ "?",
is.na(presence_br) ~ "NA",
TRUE ~ "M")) %>%
mutate(Wallonia = case_when(
(is.na(presence_fl) | presence_fl == "?") & (is.na(presence_br) | presence_br == "?") & presence_wa == "X" ~ "S",
presence_wa == "?" ~ "?",
is.na(presence_wa) ~ "NA",
TRUE ~ "M")) %>%
mutate(Belgium = case_when(
presence_fl == "X" | presence_br == "X" | presence_wa == "X" ~ "S", # One is "X"
presence_fl == "?" | presence_br == "?" | presence_wa == "?" ~ "?" # One is "?"
))
Summary of the previous action:
distribution %>% select(presence_fl, presence_br, presence_wa, Flanders, Wallonia, Brussels, Belgium) %>%
group_by_all() %>%
summarize(records = n()) %>%
arrange(Flanders, Wallonia, Brussels)
One line should represent the presence information of a species in one region or Belgium. We need to transform distribution
from a wide to a long table (i.e. create a key
and value
column):
distribution %<>% gather(
key, value,
Flanders, Wallonia, Brussels, Belgium,
convert = FALSE
)
Rename key
and value
:
distribution %<>% rename("location" = "key", "presence" = "value")
Remove species for which we lack presence information (i.e. presence
= `NA``):
distribution %<>% filter(!presence == "NA")
Now, we will clean date information
Create start_year
from fr
:
distribution %<>% mutate(start_year = fr)
Clean values:
distribution %<>% mutate(start_year =
str_replace_all(start_year, "(\\?|ca. |<|>)", "") # Strip ?, ca., < and >
)
Create end_year
from mrr
(most recent record):
distribution %<>% mutate(end_year = mrr)
Clean values:
distribution %<>% mutate(end_year =
str_replace_all(end_year, "(\\?|ca. |<|>)", "") # Strip ?, ca., < and >
)
If end_year
is Ann.
or N
use current year:
current_year = format(Sys.Date(), "%Y")
distribution %<>% mutate(end_year = recode(end_year,
"Ann." = current_year,
"N" = current_year)
)
Show reformatted values for both fr
and mrr
:
distribution %>%
select(fr, start_year) %>%
rename(year = fr, formatted_year = start_year) %>%
union( # Union with mrr. Will also remove duplicates
distribution %>%
select(mrr, end_year) %>%
rename(year = mrr, formatted_year = end_year)
) %>%
filter(nchar(year) != 4) %>% # Don't show raw values that were already YYYY
arrange(year)
Check if any start_year
fall after end_year
(expected to be none):
distribution %>%
select(start_year, end_year) %>%
mutate(start_year = as.numeric(start_year)) %>%
mutate(end_year = as.numeric(end_year)) %>%
group_by(start_year, end_year) %>%
summarize(records = n()) %>%
filter(start_year > end_year)
Combine start_year
and end_year
in an ranged Date
(ISO 8601 format). If any those two dates is empty or the same, we use a single year, as a statement when it was seen once (either as a first record or a most recent record):
distribution %<>% mutate(Date = case_when(
start_year == "" & end_year == "" ~ "",
start_year == "" ~ end_year,
end_year == "" ~ start_year,
start_year == end_year ~ start_year,
TRUE ~ paste(start_year, end_year, sep = "/")
))
In a next step, we will generate occurrenceStatus_ALO
occurrenceStatus_ALO <- distribution %>% filter(d_n == "Ext." | d_n == "Ext./Cas.")
Then, we map occurrenceStatus and eventDate for distribution
:
Map occurrenceStaus
using IUCN definitions:
distribution %<>% mutate(occurrenceStatus = recode(presence,
"S" = "present",
"M" = "present",
"?" = "presence uncertain",
.default = ""
))
Overview of occurrenceStatus
for each location x presence combination:
distribution %>% select (location, presence, occurrenceStatus) %>%
group_by_all() %>%
summarize(records = n())
Populate eventDate
only when presence
= S
:
distribution %<>% mutate(eventDate = case_when(
presence == "S" ~ Date,
TRUE ~ ""
))
Map occurrenceStatus
and eventDate
for occurrenceStatus_ALO
:
occurrenceStatus_ALO %<>% mutate(occurrenceStatus = case_when(
d_n == "Ext." ~ "absent",
d_n == "Ext./Cas." ~ "present"
))
occurrenceStatus_ALO %<>% mutate(eventDate = case_when(
presence == "S" ~ paste(end_year, current_year, sep = "/")
))
Bind occurrenceStatus_ALO
and distribution
by rows:
distribution %<>% bind_rows(occurrenceStatus_ALO)
Map the data to Species Distribution:
distribution %<>% mutate(dwc_taxonID = taxon_id)
distribution %<>% mutate(dwc_locationID = case_when(
location == "Belgium" ~ "ISO_3166-2:BE",
location == "Flanders" ~ "ISO_3166-2:BE-VLG",
location == "Wallonia" ~ "ISO_3166-2:BE-WAL",
location == "Brussels" ~ "ISO_3166-2:BE-BRU"
))
distribution %<>% mutate(dwc_locality = case_when(
location == "Belgium" ~ "Belgium",
location == "Flanders" ~ "Flemish Region",
location == "Wallonia" ~ "Walloon Region",
location == "Brussels" ~ "Brussels-Capital Region"
))
distribution %<>% mutate(dwc_countryCode = "BE")
distribution %<>% mutate(dwc_occurrenceStatus = occurrenceStatus)
distribution %<>% mutate(dwc_establishmentMeans = "introduced")
distribution %<>% mutate(dwc_eventDate = eventDate)
In this extension will express broad habitat characteristics (e.g. isTerrestrial
) of the species.
species_profile <- input_data
Habitat information can be found in habitat
, which describes whether a species is found in freshwater, terrestrial or both (terrestrial/freshwater) habitats.
Show unique values:
species_profile %>%
distinct(habitat) %>%
arrange(habitat)
Clean content somewhat, i.e. change uppercase to lowercase
species_profile %<>% mutate(habitat = str_to_lower(habitat))
We map this information respectively to isFreshwater
, isTerrestrial
or both terms in the species profile extension.
Map the data to Species Profile:
species_profile %<>% mutate(dwc_taxonID = taxon_id)
species_profile %<>% mutate(dwc_isMarine = "FALSE")
species_profile %<>% mutate(dwc_isFreshwater = case_when(
habitat == "fresh water" | habitat == "terrestrial/fresh water" ~ "TRUE",
TRUE ~"FALSE"
))
species_profile %<>% mutate(dwc_isTerrestrial = case_when(
habitat == "terrestrial" | habitat == "terrestrial/fresh water" ~ "TRUE",
TRUE ~"FALSE"
))
Show mapped values:
species_profile %>%
select(habitat, dwc_isMarine, dwc_isFreshwater, dwc_isTerrestrial) %>%
group_by_all() %>%
summarize(records = n())
In the description extension we want to include several important characteristics (hereafter refered to as descriptors) about the species:
A single taxon can have multiple descriptions of the same type (e.g. multiple native ranges), expressed as multiple rows in the description extension.
For each descriptor, we create a separate dataframe to process the specific information. We always specify which descriptor we map (type
column) and its specific content (description
column). After the mapping of these Darwin Core terms type
and value
, we merge the dataframes to generate one single description extension. We then continue the mapping process by adding the other Darwin Core terms (which content is independent of the type of descriptor, such as language
).
origin
contains native range information (e.g. E AS-Te NAM
). We’ll separate, clean, map and combine these values.
Create separate dataframe:
native_range <- input_data
Create description
from origin
:
native_range %<>% mutate(description = origin)
Separate description
on space in 4 columns: In case there are more than 4 values, these will be merged in native_range_4. The dataset currently contains no more than 4 values per record.
native_range %<>% separate(
description,
into = c("native_range_1", "native_range_2", "native_range_3", "native_range_4"),
sep = " ",
remove = TRUE,
convert = FALSE,
extra = "merge",
fill = "right"
)
Gather native ranges in a key and value column:
native_range %<>% gather(
key, value,
native_range_1, native_range_2, native_range_3, native_range_4,
na.rm = TRUE, # Also removes records for which there is no native_range_1
convert = FALSE
)
Sort on ID to see pathways in context for each record:
native_range %<>% arrange(id)
Clean values:
native_range %<>% mutate(
value = str_replace_all(value, "\\?", ""), # Strip ?
value = str_trim(value) # Clean whitespace
)
Map values:
native_range %<>% mutate(mapped_value = recode(value,
"AF" = "Africa (WGSRPD:2)",
"AM" = "pan-American",
"AS" = "Asia",
"AS-Te" = "temperate Asia (WGSRPD:3)",
"AS-Tr" = "tropical Asia (WGSRPD:4)",
"AUS" = "Australasia (WGSRPD:5)",
"Cult." = "cultivated origin",
"E" = "Europe (WGSRPD:1)",
"Hybr." = "hybrid origin",
"NAM" = "Northern America (WGSRPD:7)",
"SAM" = "Southern America (WGSRPD:8)",
"Trop." = "Pantropical",
.default = "",
.missing = "" # As result of stripping, records with no native range already removed by gather()
))
Show mapped values:
native_range %>%
select(value, mapped_value) %>%
group_by(value, mapped_value) %>%
summarize(records = n()) %>%
arrange(value)
Drop key
and value
column and rename mapped value
:
native_range %<>% select(-key, -value)
native_range %<>% rename(description = mapped_value)
Keep only non-empty descriptions:
native_range %<>% filter(!is.na(description) & description != "")
Create a type
field to indicate the type of description:
native_range %<>% mutate(type = "native range")
pathway
(pathway description) information is based on v_i
, which contains a list of introduction pathways (e.g. Agric., wool
). We’ll separate, clean, map and combine these values.
Create separate dataframe:
pathway <- input_data
Create pathway
from v_i
:
pathway %<>% mutate(pathway = v_i)
Separate pathway
on ,
in 4 columns: In case there are more than 4 values, these will be merged in pathway_4. The dataset currently contains no more than 3 values per record.
pathway %<>% separate(
pathway,
into = c("pathway_1", "pathway_2", "pathway_3", "pathway_4"),
sep = ",",
remove = TRUE,
convert = FALSE,
extra = "merge",
fill = "right"
)
Gather pathways in a key and value column:
pathway %<>% gather(
key, value,
pathway_1, pathway_2, pathway_3, pathway_4,
na.rm = TRUE, # Also removes records for which there is no pathway_1
convert = FALSE
)
Sort on taxon_id
to see pathways in context for each record:
pathway %<>% arrange(taxon_id)
Show unique values:
pathway %>%
distinct(value) %>%
arrange(value)
Clean values:
pathway %<>% mutate(
value = str_replace_all(value, "\\?|…|\\.{3}", ""), # Strip ?, …, ...
value = str_to_lower(value), # Convert to lowercase
value = str_trim(value) # Clean whitespace
)
Map values to the CBD standard::
pathway %<>% mutate(cbd_stand = recode(value,
"agric." = "escape_agriculture",
"bird seed" = "contaminant_seed",
"birdseed" = "contaminant_seed",
"bulbs" = "",
"coconut mats" = "contaminant_seed",
"fish" = "",
"food refuse" = "escape_food_bait",
"grain" = "contaminant_seed",
"grain (rice)" = "contaminant_seed",
"grass seed" = "contaminant_seed",
"hay" = "",
"hort" = "escape_horticulture",
"hort." = "escape_horticulture",
"hybridization" = "",
"military troops" = "",
"nurseries" = "contaminant_nursery",
"ore" = "contaminant_habitat_material",
"pines" = "contaminant_on_plants",
"rice" = "",
"salt" = "",
"seeds" = "contaminant_seed",
"timber" = "contaminant_timber",
"tourists" = "stowaway_people_luggage",
"traffic" = "",
"unknown" = "unknown",
"urban weed" = "stowaway",
"waterfowl" = "contaminant_on_animals",
"wool" = "contaminant_on_animals",
"wool alien" = "contaminant_on_animals",
.default = "",
.missing = "" # As result of stripping, records with no pathway already removed by gather()
))
Add prefix cbd_2014_pathway
in case there is a match with the CBD standard:
pathway %<>% mutate(mapped_value = case_when(
cbd_stand != "" ~ paste ("cbd_2014_pathway", cbd_stand, sep = ":"),
TRUE ~ ""
))
Show mapped values:
pathway %>%
select(value, mapped_value) %>%
group_by(value, mapped_value) %>%
summarize(records = n()) %>%
arrange(value)
Drop key
,value
and cbd_stand
column:
pathway %<>% select(-key, -value, -cbd_stand)
Change column name mapped_value
to description
:
pathway %<>% rename(description = mapped_value)
Create a type
field to indicate the type of description:
pathway %<>% mutate (type = "pathway")
Show pathway descriptions:
pathway %>%
select(description) %>%
group_by(description) %>%
summarize(records = n())
Keep only non-empty descriptions:
pathway %<>% filter(!is.na(description) & description != "")
Create separate dataframe:
degree_of_establishment <- input_data
The information for degree of establishment is contained in d_n
:
degree_of_establishment %>%
select(d_n) %>%
group_by_all() %>%
summarize(records = n())
Clean the data:
degree_of_establishment %<>% mutate(description = recode(d_n,
"Ext.?" = "Ext.",
"Cas.?" = "Cas.",
"Nat.?" = "Nat.",
.missing = ""
))
We decided to use the unified framework for biological invasions of Blackburn et al. 2011 for invasion stage
. casual
, naturalized
and invasive
are terms included in this framework. However, we decided to discard the terms naturalized
and invasive
listed in Blackburn et al. (see trias-project/alien-fishes-checklist#6 (comment)). So, naturalized
and invasive
are replaced by established
. For extinct
(introduced taxa that once were naturalized but that have not been confirmed in recent times) and extinct/casual
species (taxa are no longer considered as naturalized but still occur as casuals), we map the most recent invasion stage (i.e. extinct
and casual
respectively):
degree_of_establishment %<>% mutate(description = recode(description,
"Cas." = "casual",
"Inv." = "established",
"Nat." = "established",
"Ext." = "extinct",
"Ext./Cas." = "casual"
))
Show mapped values:
degree_of_establishment %>%
select(d_n, description) %>%
group_by_all() %>%
summarize(records = n())
Create a type
field to indicate the type of description:
degree_of_establishment %<>% mutate(type = "degree of establishment")
Union native range, pathway of introduction and degree of establishment:
description <- bind_rows(native_range, pathway, degree_of_establishment)
Map the data to Taxon Description:
description %<>% mutate(dwc_taxonID = taxon_id)
description %<>% mutate(dwc_description = description)
description %<>% mutate(dwc_type = type)
description %<>% mutate(dwc_language = "en")
Remove the original columns for each of the generated files:
taxon %<>% select(starts_with("dwc_"))
distribution %<>% select(starts_with("dwc_"))
species_profile %<>% select(starts_with("dwc_"))
description %<>% select(starts_with("dwc_"))
Drop the dwc_
prefix:
colnames(taxon) <- str_remove(colnames(taxon), "dwc_")
colnames(distribution) <- str_remove(colnames(distribution), "dwc_")
colnames(species_profile) <- str_remove(colnames(species_profile), "dwc_")
colnames(description) <- str_remove(colnames(description), "dwc_")
Sort on taxonID
to group description information per taxon:
taxon %<>% arrange(taxonID)
distribution %<>% arrange(taxonID)
species_profile %<>% arrange(taxonID)
description %<>% arrange(taxonID)
Preview taxon core:
taxon %>% head(10)
Preview distribution extension:
distribution %>% head(10)
Preview species profile extension:
species_profile %>% head(10)
Preview description extension:
description %>% head(10)
Save to CSV:
write_csv(taxon, here("data", "processed", "taxon.csv"), na = "")
write_csv(distribution, here("data", "processed", "distribution.csv"), na = "")
write_csv(species_profile, here("data", "processed", "speciesprofile.csv"), na = "")
write_csv(description, here("data", "processed", "description.csv"), na = "")