This document describes how we map the checklist data to Darwin Core.
Load libraries:
library(tidyverse)
library(here)
library(janitor)
library(readxl)
library(digest)
The source data is maintained as an Excel file elsewhere. It is manually copied to this repository.
Read source data:
raw_data <- read_excel(path = here("data", "raw", "checklist.xlsx"))
Preview raw data:
glimpse(raw_data)
## Rows: 2,617
## Columns: 18
## $ Id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
## $ Taxon <chr> "Acanthus hungaricus (Borbás) Baen.", "…
## $ `Hybrid formula` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Synonym <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Family <chr> "Acanthaceae", "Acanthaceae", "Acanthac…
## $ `M/I` <chr> "D", "D", "D", "D", "D", "D", "D", "D",…
## $ FR <chr> "2016", "1998", "2016", "2013", "1680",…
## $ MRR <chr> "2018", "2016", "2018", "2018", "N", "A…
## $ Origin <chr> "E", "E AF", "E AF AS-Te", "AF", "AS-Te…
## $ Presence_Fl <chr> "X", "X", "X", "X", "X", "X", "X", "X",…
## $ Presence_Br <chr> "X", NA, NA, NA, "X", NA, NA, NA, NA, "…
## $ Presence_Wa <chr> NA, "X", NA, NA, "X", "X", NA, NA, NA, …
## $ `D/N` <chr> "Cas.", "Cas.", "Cas.", "Cas.", "Nat.",…
## $ `V/I` <chr> "Hort.", "Hort.", "Hort.", "Hort.", "Ho…
## $ taxonRank <chr> "species", "species", "species", "speci…
## $ scientificNameID <chr> NA, "http://ipni.org/urn:lsid:ipni.org:…
## $ Habitat <chr> "Terrestrial", "Terrestrial", "Terrestr…
## $ `Name (excl. author citation)` <chr> "Acanthus hungaricus", "Acanthus mollis…
To link taxa with information in the extension(s), each taxon needs a
unique and relatively stable taxonID. Here we create one in
the form of dataset_shortname:taxon:hash, where
hash is unique code based on scientific name (not
kingdom).
# Vectorize the digest function (The digest() function isn't vectorized. So if you pass in a vector, you get one value for the whole vector rather than a digest for each element of the vector):
vdigest <- Vectorize(digest)
input_data <-
raw_data |>
remove_empty("rows") |>
clean_names() |>
mutate(
taxon_id = paste(
"alien-plants-belgium",
"taxon",
vdigest(taxon, algo = "md5"),
sep = ":"
)
)
Information on the occurrences is given for the regions, while date information is given for Belgium as a whole. Some transformations and clarifications are needed.
Some species have two values for occurrenceStatus
and eventDate, i.e. species with the degree of
naturalisation (d_n) of extinct (Ext.) or
extinct/casual (Ext./Cas.).
Extinct: introduced taxa that once were naturalized (usually rather locally) but that have not been confirmed in recent times in their known localities. Only taxa that are certainly extinct are indicated as such. Extinct/casual: Some of these extinct taxa are no longer considered as naturalized but still occur as casuals; such taxa are indicated as “Ext./Cas.” (for instance Tragopogon porrifolius).
For these species, we include the occurrenceStatus
within the specified time frame (eventDate
= first - most recent observation) and after the last
observation (eventDate = most recent observation - current
date).
The easiest way to do use a stepwise approach:
distributionoccurrenceStatus_ALO (ALO =
after last observation) for extinct and extinct/casual speciesoccurrenceStatus and eventDate from
cleaned presence and date information in distribution (for
eventDate = first - most recent observation)occurrenceStatus and eventDate from
cleaned presence and date information in
occurrence_status_ALO (for eventDate = most
recent observation - current date)The checklist contains minimal presence information
(X,? or NA) for the three regions
in Belgium: Flanders, Wallonia and the Brussels-Capital Region,
contained in presence_fl, presence_wa and
presence_br respectively. Information regarding the
first/last recorded observation applies to the distribution in Belgium
as a whole. Both national and regional information is required in the
checklist. In the distribution.csv, we will first provide
occurrenceStatus and `eventDate`` on a national
level, followed by specific information for the
regions.
For this, we use the following principles:
When a species is present in only one region, we can
assume eventDate relates to that specific region. In this
case, we can keep lines for Belgium and for the specific region
populated with these variables (see #45).
When a species is present in more than one region, it is
impossible to extrapolate the date information for the regions. In this
case, we decided to provide occurrenceStatus for the
regional information, and specify dates only for Belgium.
Thus, we need to specify when a species is present in only one of the regions.
We generate 4 new columns: flanders,
brussels,wallonia and belgium.
The content of these columns refers to the specific presence status of a
species on a regional or national level. S if present in a
single region or in Belgium, ? if presence uncertain,
NA if absent and M if present in multiple
regions.
This should look like this:
Add location columns:
input_data <-
input_data |>
mutate(
flanders = case_when(
presence_fl == "X" &
(is.na(presence_br) | presence_br == "?") &
(is.na(presence_wa) | presence_wa == "?")
~ "S",
presence_fl == "?" ~ "?",
is.na(presence_fl) ~ NA_character_,
TRUE ~ "M"
),
brussels = case_when(
(is.na(presence_fl) | presence_fl == "?") &
presence_br == "X" &
(is.na(presence_wa) | presence_wa == "?")
~ "S",
presence_br == "?" ~ "?",
is.na(presence_br) ~ NA_character_,
TRUE ~ "M"
),
wallonia = case_when(
(is.na(presence_fl) | presence_fl == "?") &
(is.na(presence_br) | presence_br == "?") &
presence_wa == "X"
~ "S",
presence_wa == "?" ~ "?",
is.na(presence_wa) ~ NA_character_,
TRUE ~ "M"
),
belgium = case_when(
presence_fl == "X" | presence_br == "X" | presence_wa == "X" ~ "S", # One is "X"
presence_fl == "?" | presence_br == "?" | presence_wa == "?" ~ "?" # One is "?"
)
)
Summary of the previous action:
input_data |>
select(
presence_fl,
presence_br,
presence_wa,
flanders,
wallonia,
brussels,
belgium) |>
group_by_all() |>
summarize(records = n()) |>
arrange(flanders, wallonia, brussels)
Transform from wide to long (one line per distribution):
input_data <-
input_data |>
pivot_longer(
cols = c(flanders, wallonia, brussels, belgium),
names_to = "location",
values_to = "presence",
values_drop_na = FALSE
) |>
# Remove empty presence, but keep at least Belgium (for records without any distribution)
filter(!is.na(presence) | location == "belgium")
Now we’ll create and clean a start and end year:
current_year = format(Sys.Date(), "%Y")
input_data <-
input_data |>
mutate(
# start_year = first record (fr), end_year = most recent record (mrr)
# Cleaning strips out ?, ca., < and >
start_year = str_replace_all(fr, "(\\?|ca. |<|>)", ""),
end_year = str_replace_all(mrr, "(\\?|ca. |<|>)", "")
) |>
# If end_year = "Ann." or "N", use current year
mutate(
end_year = case_match(
end_year,
"Ann." ~ current_year,
"N" ~ current_year,
.default = end_year
)
)
Check if any start_year fall after end_year
(expected to be none):
input_data |>
select(start_year, end_year) |>
mutate(start_year = as.numeric(start_year)) |>
mutate(end_year = as.numeric(end_year)) |>
group_by(start_year, end_year) |>
summarize(records = n()) |>
filter(start_year > end_year)
Combine start_year and end_year in an
ranged date (ISO 8601 format). If any those two dates is
empty or the same, we use a single year, as a statement when it was seen
once (either as a first record or a most recent record):
input_data <-
input_data |>
mutate(
date = case_when(
start_year == "" & end_year == "" ~ "",
start_year == "" ~ end_year,
end_year == "" ~ start_year,
start_year == end_year ~ start_year,
TRUE ~ paste(start_year, end_year, sep = "/")
)
)
Then, we map occurrence_status using IUCN
definitions and set event_date only when
presence = S:
input_data <-
input_data |>
mutate(
occurrence_status = case_match(
presence,
"S" ~ "present",
"M" ~ "present",
"?" ~ "presence uncertain",
.default = NA_character_
),
event_date = if_else(presence == "S", date, NA_character_)
)
Overview of occurrence_status for each location x
presence combination:
input_data |>
select (location, presence, occurrence_status) |>
group_by_all() |>
summarize(records = n())
Now we create occurrence_status_ALO:
occurrence_status_ALO <-
input_data |>
filter(d_n == "Ext." | d_n == "Ext./Cas.")
Map occurrence_status and event_date for
occurrence_status_ALO:
occurrence_status_ALO <-
occurrence_status_ALO |>
mutate(
occurrence_status = case_when(
d_n == "Ext." ~ "absent",
d_n == "Ext./Cas." ~ "present"
),
event_date = if_else(
presence == "S",
paste(end_year, current_year, sep = "/"),
NA_character_
)
)
Bind input_data with occurrence_status_ALO
by rows:
input_data <- bind_rows(input_data, occurrence_status_ALO)
Pathway information is based on v_i, which contains a
list of introduction pathways (e.g. Agric., wool). We’ll
separate, clean, map and combine these values.
Split v_i on space in 4 columns. In case there are more
than 4 values, these will be merged in native_range_4. The dataset
currently contains no more than 4 values per record.
input_data <-
input_data |>
separate(
v_i,
into = c("pathway_1", "pathway_2", "pathway_3", "pathway_4"),
sep = ",",
remove = FALSE,
convert = FALSE,
extra = "merge",
fill = "right"
)
Transform from wide to long (one line per distribution + pathway):
input_data <-
input_data |>
pivot_longer(
cols = c(pathway_1, pathway_2, pathway_3, pathway_4),
names_to = "pathway_key",
values_to = "pathway",
values_drop_na = FALSE
) |>
# Remove empty pathways, but keep at least first (for records without any pathway)
filter(!is.na(pathway) | pathway_key == "pathway_1")
Map values to http://rs.tdwg.org/dwc/doc/pw/:
input_data <-
input_data |>
arrange(id) |> # To see descriptions in order
mutate(
pathway = str_replace_all(pathway, "\\?|…|\\.{3}", ""), # Strip ?, …, ...
pathway = str_to_lower(pathway), # Convert to lowercase
pathway = str_trim(pathway) # Clean whitespace
) |>
mutate(
type = "pathway",
mapped_pathway = case_match(
pathway,
"agric." ~ "agriculture",
"bird seed" ~ "seedContaminant",
"birdseed" ~ "seedContaminant",
"bulbs" ~ NA_character_, # horticulture?
"coconut mats" ~ "seedContaminant",
"fish" ~ NA_character_,
"food refuse" ~ "liveFoodLiveBait",
"grain" ~ "seedContaminant",
"grain (rice)" ~ "seedContaminant",
"grass seed" ~ "seedContaminant",
"hay" ~ NA_character_,
"hort" ~ "horticulture",
"hort." ~ "horticulture",
"hybridization" ~ NA_character_,
"military troops" ~ NA_character_,
"nurseries" ~ "contaminantNursery",
"ore" ~ "transportationHabitatMaterial",
"pines" ~ "contaminantOnPlants", # timberTrade?
"rice" ~ NA_character_,
"salt" ~ NA_character_,
"seeds" ~ "seedContaminant",
"timber" ~ "timberTrade",
"tourists" ~ "containerBulk",
"traffic" ~ NA_character_,
"unknown" ~ NA_character_,
"urban weed" ~ "stowaway", # NA_character_ ?
"waterfowl" ~ "contaminantOnAnimals",
"wool" ~ "contaminantOnAnimals",
"wool alien" ~ "contaminantOnAnimals",
.default = NA_character_
)
)
Show mapped values:
input_data |>
select(pathway, mapped_pathway) |>
group_by(pathway, mapped_pathway) |>
summarize(records = n()) |>
arrange(pathway)
The information for degree of establishment is contained in
d_n.
Map values to http://rs.tdwg.org/dwc/doc/doe/:
input_data <-
input_data |>
arrange(id) |> # To see descriptions in order
mutate(
degree_of_establishment = case_match(
d_n,
"Ext." ~ NA_character_,
"Ext.?" ~ NA_character_,
"Ext./Cas." ~ "casual",
"Cas." ~ "casual",
"Cas.?" ~ "casual",
"Inv." ~ "invasive",
"Nat." ~ "established",
"Nat.?" ~ "established",
.default = NA_character_
)
)
Show mapped values:
input_data |>
select(d_n, degree_of_establishment) |>
group_by(d_n, degree_of_establishment) |>
summarize(records = n()) |>
arrange(d_n)
Map the data to Darwin Core Taxon:
taxon <-
input_data |>
distinct(taxon_id, .keep_all = TRUE) |> # Only keep one row per taxon
mutate(
.keep = "none",
language = "en",
license = "http://creativecommons.org/publicdomain/zero/1.0/",
rightsHolder = "Meise Botanic Garden",
datasetID = "https://doi.org/10.15468/wtda1m",
datasetName = "Manual of the Alien Plants of Belgium",
taxonID = taxon_id,
scientificNameID = scientific_name_id,
scientificName = taxon,
kingdom = "Plantae",
family = family,
taxonRank = taxon_rank,
nomenclaturalCode = "ICN"
) |>
relocate(family, .after = kingdom) |>
arrange(taxonID)
Map the data to Species Distribution:
distribution <-
input_data |>
filter(!is.na(presence)) |> # Only keep taxa with a distribution
mutate(
.keep = "none",
taxonID = taxon_id,
locationID = case_match(
location,
"belgium" ~ "ISO_3166-2:BE",
"flanders" ~ "ISO_3166-2:BE-VLG",
"wallonia" ~ "ISO_3166-2:BE-WAL",
"brussels" ~ "ISO_3166-2:BE-BRU"
),
locality = case_match(
location,
"belgium" ~ "Belgium",
"flanders" ~ "Flemish Region",
"wallonia" ~ "Walloon Region",
"brussels" ~ "Brussels-Capital Region"
),
countryCode = "BE",
occurrenceStatus = occurrence_status,
establishmentMeans = "introduced",
degreeOfEstablishment = degree_of_establishment,
pathway = mapped_pathway,
eventDate = event_date
) |>
relocate(pathway, .after = degreeOfEstablishment) |>
arrange(taxonID)
In this extension will express broad habitat characteristics
(e.g. isTerrestrial) of the species. Habitat information
can be found in habitat, which describes whether a species
is found in freshwater, terrestrial or both (terrestrial/freshwater)
habitats.
Map the data to Species Profile:
species_profile <-
input_data |>
distinct(taxon_id, .keep_all = TRUE) |> # Only keep one row per taxon
mutate(habitat = str_to_lower(habitat)) |>
mutate(
.keep = "none",
taxonID = taxon_id,
isMarine = "FALSE",
isFreshwater = case_when(
habitat == "fresh water" | habitat == "terrestrial/fresh water" ~ "TRUE",
TRUE ~ "FALSE"
),
isTerrestrial = case_when(
habitat == "terrestrial" | habitat == "terrestrial/fresh water" ~ "TRUE",
TRUE ~ "FALSE"
)
) |>
arrange(taxonID)
In the description extension we include the native range information.
This information is included in origin
(e.g. E AS-Te NAM). We’ll separate, clean, map and combine
these values.
Split origin on space in 4 columns. In case there are
more than 4 values, these will be merged in native_range_4. The dataset
currently contains no more than 4 values per record.
native_range <-
input_data |>
distinct(taxon_id, .keep_all = TRUE) |> # Only keep one row per taxon
separate(
origin,
into = c("native_range_1", "native_range_2", "native_range_3", "native_range_4"),
sep = " ",
remove = FALSE,
convert = FALSE,
extra = "merge",
fill = "right"
)
Transform from wide to long (one line per native range):
native_range <-
native_range |>
pivot_longer(
cols = c(native_range_1, native_range_2, native_range_3, native_range_4),
names_to = "native_range_key",
values_to = "native_range",
values_drop_na = TRUE # Also removes records for which there is no native_range_1
)
Map values:
native_range <-
native_range |>
arrange(id) |> # To see descriptions in order
mutate(
native_range = str_replace_all(native_range, "\\?", ""), # Strip ?
native_range = str_trim(native_range) # Clean whitespace
) |>
mutate(
mapped_native_range = case_match(
native_range,
"AF" ~ "Africa (WGSRPD:2)",
"AM" ~ "pan-American",
"AS" ~ "Asia",
"AS-Te" ~ "temperate Asia (WGSRPD:3)",
"AS-Tr" ~ "tropical Asia (WGSRPD:4)",
"AUS" ~ "Australasia (WGSRPD:5)",
"Cult." ~ "cultivated origin",
"E" ~ "Europe (WGSRPD:1)",
"Hybr." ~ "hybrid origin",
"NAM" ~ "Northern America (WGSRPD:7)",
"SAM" ~ "Southern America (WGSRPD:8)",
"Trop." ~ "Pantropical",
.default = NA_character_
)
) |>
filter(!is.na(mapped_native_range))
Show mapped values:
native_range |>
select(native_range, mapped_native_range) |>
group_by(native_range, mapped_native_range) |>
summarize(records = n()) |>
arrange(native_range)
Map the data to Taxon Description:
description <-
native_range |>
mutate(
.keep = "none",
taxonID = taxon_id,
description = mapped_native_range,
type = "native range",
language = "en"
) |>
relocate(type, .after = description) |>
arrange(taxonID)
write_csv(taxon, here("data", "processed", "taxon.csv"), na = "")
write_csv(distribution, here("data", "processed", "distribution.csv"), na = "")
write_csv(species_profile, here("data", "processed", "speciesprofile.csv"), na = "")
write_csv(description, here("data", "processed", "description.csv"), na = "")