This document describes how we map the checklist data to Darwin Core.

Load libraries:

library(tidyverse)
library(here)
library(janitor)
library(readxl)
library(digest)

1 Read source data

The source data is maintained as an Excel file elsewhere. It is manually copied to this repository.

Read source data:

raw_data <- read_excel(path = here("data", "raw", "checklist.xlsx")) 

Preview raw data:

glimpse(raw_data)
## Rows: 2,617
## Columns: 18
## $ Id                             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
## $ Taxon                          <chr> "Acanthus hungaricus (Borbás) Baen.", "…
## $ `Hybrid formula`               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Synonym                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Family                         <chr> "Acanthaceae", "Acanthaceae", "Acanthac…
## $ `M/I`                          <chr> "D", "D", "D", "D", "D", "D", "D", "D",…
## $ FR                             <chr> "2016", "1998", "2016", "2013", "1680",…
## $ MRR                            <chr> "2018", "2016", "2018", "2018", "N", "A…
## $ Origin                         <chr> "E", "E AF", "E AF AS-Te", "AF", "AS-Te…
## $ Presence_Fl                    <chr> "X", "X", "X", "X", "X", "X", "X", "X",…
## $ Presence_Br                    <chr> "X", NA, NA, NA, "X", NA, NA, NA, NA, "…
## $ Presence_Wa                    <chr> NA, "X", NA, NA, "X", "X", NA, NA, NA, …
## $ `D/N`                          <chr> "Cas.", "Cas.", "Cas.", "Cas.", "Nat.",…
## $ `V/I`                          <chr> "Hort.", "Hort.", "Hort.", "Hort.", "Ho…
## $ taxonRank                      <chr> "species", "species", "species", "speci…
## $ scientificNameID               <chr> NA, "http://ipni.org/urn:lsid:ipni.org:…
## $ Habitat                        <chr> "Terrestrial", "Terrestrial", "Terrestr…
## $ `Name (excl. author citation)` <chr> "Acanthus hungaricus", "Acanthus mollis…

2 Preprocessing

2.1 Add Taxon IDs

To link taxa with information in the extension(s), each taxon needs a unique and relatively stable taxonID. Here we create one in the form of dataset_shortname:taxon:hash, where hash is unique code based on scientific name (not kingdom).

# Vectorize the digest function (The digest() function isn't vectorized. So if you pass in a vector, you get one value for the whole vector rather than a digest for each element of the vector):
vdigest <- Vectorize(digest)

input_data <-
  raw_data |>
  remove_empty("rows") |>
  clean_names() |>
  mutate(
    taxon_id = paste(
      "alien-plants-belgium",
      "taxon",
      vdigest(taxon, algo = "md5"),
      sep = ":"
    )
  )

2.2 Process distribution information (adds rows)

  1. Information on the occurrences is given for the regions, while date information is given for Belgium as a whole. Some transformations and clarifications are needed.

  2. Some species have two values for occurrenceStatus and eventDate, i.e. species with the degree of naturalisation (d_n) of extinct (Ext.) or extinct/casual (Ext./Cas.).

Extinct: introduced taxa that once were naturalized (usually rather locally) but that have not been confirmed in recent times in their known localities. Only taxa that are certainly extinct are indicated as such. Extinct/casual: Some of these extinct taxa are no longer considered as naturalized but still occur as casuals; such taxa are indicated as “Ext./Cas.” (for instance Tragopogon porrifolius).

For these species, we include the occurrenceStatus within the specified time frame (eventDate = first - most recent observation) and after the last observation (eventDate = most recent observation - current date).

The easiest way to do use a stepwise approach:

  1. Clean presence information and date information in distribution
  2. Create a separate dataframe occurrenceStatus_ALO (ALO = after last observation) for extinct and extinct/casual species
  3. Map occurrenceStatus and eventDate from cleaned presence and date information in distribution (for eventDate = first - most recent observation)
  4. Map occurrenceStatus and eventDate from cleaned presence and date information in occurrence_status_ALO (for eventDate = most recent observation - current date)
  5. Bind both dataframes by row.
  6. Map the other Darwin Core terms in the distribution extension

The checklist contains minimal presence information (X,? or NA) for the three regions in Belgium: Flanders, Wallonia and the Brussels-Capital Region, contained in presence_fl, presence_wa and presence_br respectively. Information regarding the first/last recorded observation applies to the distribution in Belgium as a whole. Both national and regional information is required in the checklist. In the distribution.csv, we will first provide occurrenceStatus and `eventDate`` on a national level, followed by specific information for the regions.

For this, we use the following principles:

  1. When a species is present in only one region, we can assume eventDate relates to that specific region. In this case, we can keep lines for Belgium and for the specific region populated with these variables (see #45).

  2. When a species is present in more than one region, it is impossible to extrapolate the date information for the regions. In this case, we decided to provide occurrenceStatus for the regional information, and specify dates only for Belgium.

Thus, we need to specify when a species is present in only one of the regions.

We generate 4 new columns: flanders, brussels,wallonia and belgium. The content of these columns refers to the specific presence status of a species on a regional or national level. S if present in a single region or in Belgium, ? if presence uncertain, NA if absent and M if present in multiple regions.

This should look like this:

Add location columns:

input_data <-
  input_data |>
  mutate(
    flanders = case_when(
      presence_fl == "X" &
      (is.na(presence_br) | presence_br == "?") &
      (is.na(presence_wa) | presence_wa == "?")
        ~ "S",
      presence_fl == "?" ~ "?",
      is.na(presence_fl) ~ NA_character_,
      TRUE ~ "M"
    ),
    brussels = case_when(
      (is.na(presence_fl) | presence_fl == "?") &
      presence_br == "X" &
      (is.na(presence_wa) | presence_wa == "?")
        ~ "S",
      presence_br == "?" ~ "?",
      is.na(presence_br) ~ NA_character_,
      TRUE ~ "M"
    ),
    wallonia = case_when(
      (is.na(presence_fl) | presence_fl == "?") &
      (is.na(presence_br) | presence_br == "?") &
      presence_wa == "X"
        ~ "S",
      presence_wa == "?" ~ "?",
      is.na(presence_wa) ~ NA_character_,
      TRUE ~ "M"
    ),
    belgium = case_when(
      presence_fl == "X" | presence_br == "X" | presence_wa == "X" ~ "S", # One is "X"
      presence_fl == "?" | presence_br == "?" | presence_wa == "?" ~ "?" # One is "?"
    )
  )

Summary of the previous action:

input_data |>
  select(
    presence_fl,
    presence_br,
    presence_wa,
    flanders,
    wallonia,
    brussels,
    belgium) |>
  group_by_all() |>
  summarize(records = n()) |>
  arrange(flanders, wallonia, brussels)

Transform from wide to long (one line per distribution):

input_data <-
  input_data |>
  pivot_longer(
    cols = c(flanders, wallonia, brussels, belgium),
    names_to = "location",
    values_to = "presence",
    values_drop_na = FALSE
  ) |>
  # Remove empty presence, but keep at least Belgium (for records without any distribution)
  filter(!is.na(presence) | location == "belgium")

Now we’ll create and clean a start and end year:

current_year = format(Sys.Date(), "%Y")

input_data <-
  input_data |>
  mutate(
    # start_year = first record (fr), end_year = most recent record (mrr)
    # Cleaning strips out ?, ca., < and >
    start_year = str_replace_all(fr, "(\\?|ca. |<|>)", ""),
    end_year = str_replace_all(mrr, "(\\?|ca. |<|>)", "")
  ) |>
  # If end_year = "Ann." or "N", use current year
  mutate(
    end_year = case_match(
      end_year,
      "Ann." ~ current_year,
      "N" ~ current_year,
      .default = end_year
    )
  )

Check if any start_year fall after end_year (expected to be none):

input_data |>
  select(start_year, end_year) |>
  mutate(start_year = as.numeric(start_year)) |>
  mutate(end_year = as.numeric(end_year)) |>
  group_by(start_year, end_year) |>
  summarize(records = n()) |>
  filter(start_year > end_year) 

Combine start_year and end_year in an ranged date (ISO 8601 format). If any those two dates is empty or the same, we use a single year, as a statement when it was seen once (either as a first record or a most recent record):

input_data <-
  input_data |>
  mutate(
    date = case_when(
      start_year == "" & end_year == "" ~ "",
      start_year == "" ~ end_year,
      end_year == "" ~ start_year,
      start_year == end_year ~ start_year,
      TRUE ~ paste(start_year, end_year, sep = "/")
    )
  )

Then, we map occurrence_status using IUCN definitions and set event_date only when presence = S:

input_data <-
  input_data |>
  mutate(
    occurrence_status = case_match(
      presence,
      "S" ~ "present",
      "M" ~ "present",
      "?" ~ "presence uncertain",
      .default = NA_character_
    ),
    event_date = if_else(presence == "S", date, NA_character_)
  )

Overview of occurrence_status for each location x presence combination:

input_data |>
  select (location, presence, occurrence_status) |>
  group_by_all() |>
  summarize(records = n()) 

Now we create occurrence_status_ALO:

occurrence_status_ALO <-
  input_data |>
  filter(d_n == "Ext." | d_n == "Ext./Cas.")

Map occurrence_status and event_date for occurrence_status_ALO:

occurrence_status_ALO <-
  occurrence_status_ALO |>
  mutate(
    occurrence_status = case_when(
      d_n == "Ext." ~ "absent",
      d_n == "Ext./Cas." ~ "present"
    ),
    event_date = if_else(
      presence == "S",
      paste(end_year, current_year, sep = "/"),
      NA_character_
    )
  )

Bind input_data with occurrence_status_ALO by rows:

input_data <- bind_rows(input_data, occurrence_status_ALO)

2.3 Process pathways (adds rows)

Pathway information is based on v_i, which contains a list of introduction pathways (e.g. Agric., wool). We’ll separate, clean, map and combine these values.

Split v_i on space in 4 columns. In case there are more than 4 values, these will be merged in native_range_4. The dataset currently contains no more than 4 values per record.

input_data <-
  input_data |>
  separate(
    v_i,
    into = c("pathway_1", "pathway_2", "pathway_3", "pathway_4"),
    sep = ",",
    remove = FALSE,
    convert = FALSE,
    extra = "merge",
    fill = "right"
  )

Transform from wide to long (one line per distribution + pathway):

input_data <-
  input_data |>
  pivot_longer(
    cols = c(pathway_1, pathway_2, pathway_3, pathway_4),
    names_to = "pathway_key",
    values_to = "pathway",
    values_drop_na = FALSE
  ) |>
  # Remove empty pathways, but keep at least first (for records without any pathway)
  filter(!is.na(pathway) | pathway_key == "pathway_1")

Map values to http://rs.tdwg.org/dwc/doc/pw/:

input_data <-
  input_data |>
  arrange(id) |> # To see descriptions in order
  mutate(
    pathway = str_replace_all(pathway, "\\?|…|\\.{3}", ""), # Strip ?, …, ...
    pathway = str_to_lower(pathway), # Convert to lowercase
    pathway = str_trim(pathway) # Clean whitespace
  ) |>
  mutate(
    type = "pathway",
    mapped_pathway = case_match(
      pathway,
      "agric." ~ "agriculture",
      "bird seed" ~ "seedContaminant",
      "birdseed" ~ "seedContaminant",
      "bulbs" ~ NA_character_, # horticulture?
      "coconut mats" ~ "seedContaminant",
      "fish" ~ NA_character_,
      "food refuse" ~ "liveFoodLiveBait",
      "grain" ~ "seedContaminant",
      "grain (rice)" ~ "seedContaminant",
      "grass seed" ~ "seedContaminant",
      "hay" ~ NA_character_,
      "hort" ~ "horticulture",
      "hort." ~ "horticulture",
      "hybridization" ~ NA_character_,
      "military troops" ~ NA_character_,
      "nurseries" ~ "contaminantNursery",
      "ore" ~ "transportationHabitatMaterial",
      "pines" ~ "contaminantOnPlants", # timberTrade?
      "rice" ~ NA_character_,
      "salt" ~ NA_character_,
      "seeds" ~ "seedContaminant",
      "timber" ~ "timberTrade",
      "tourists" ~ "containerBulk",
      "traffic" ~ NA_character_,
      "unknown" ~ NA_character_,
      "urban weed" ~ "stowaway", # NA_character_ ?
      "waterfowl" ~ "contaminantOnAnimals",
      "wool" ~ "contaminantOnAnimals",
      "wool alien" ~ "contaminantOnAnimals",
      .default = NA_character_
    )
  )

Show mapped values:

input_data |>
  select(pathway, mapped_pathway) |>
  group_by(pathway, mapped_pathway) |>
  summarize(records = n()) |>
  arrange(pathway)

2.4 Process degree of establishment

The information for degree of establishment is contained in d_n.

Map values to http://rs.tdwg.org/dwc/doc/doe/:

input_data <-
  input_data |>
  arrange(id) |> # To see descriptions in order
  mutate(
    degree_of_establishment = case_match(
      d_n,
      "Ext." ~ NA_character_,
      "Ext.?" ~ NA_character_,
      "Ext./Cas." ~ "casual",
      "Cas." ~ "casual",
      "Cas.?" ~ "casual",
      "Inv." ~ "invasive",
      "Nat." ~ "established",
      "Nat.?" ~ "established",
      .default = NA_character_
    )
  )

Show mapped values:

input_data |>
  select(d_n, degree_of_establishment) |>
  group_by(d_n, degree_of_establishment) |>
  summarize(records = n()) |>
  arrange(d_n)

3 Darwin Core mapping

3.1 Taxon core

Map the data to Darwin Core Taxon:

taxon <-
  input_data |>
  distinct(taxon_id, .keep_all = TRUE) |> # Only keep one row per taxon
  mutate(
    .keep = "none",
    language = "en",
    license = "http://creativecommons.org/publicdomain/zero/1.0/",
    rightsHolder = "Meise Botanic Garden",
    datasetID = "https://doi.org/10.15468/wtda1m",
    datasetName = "Manual of the Alien Plants of Belgium",
    taxonID = taxon_id,
    scientificNameID = scientific_name_id,
    scientificName = taxon,
    kingdom = "Plantae",
    family = family,
    taxonRank = taxon_rank,
    nomenclaturalCode = "ICN"
  ) |>
  relocate(family, .after = kingdom) |>
  arrange(taxonID)

3.2 Distribution extension

Map the data to Species Distribution:

distribution <-
  input_data |>
  filter(!is.na(presence)) |> # Only keep taxa with a distribution
  mutate(
    .keep = "none",
    taxonID = taxon_id,
    locationID = case_match(
      location,
      "belgium" ~ "ISO_3166-2:BE",
      "flanders" ~ "ISO_3166-2:BE-VLG",
      "wallonia" ~ "ISO_3166-2:BE-WAL",
      "brussels" ~ "ISO_3166-2:BE-BRU"
    ),
    locality = case_match(
      location,
      "belgium" ~ "Belgium",
      "flanders" ~ "Flemish Region",
      "wallonia" ~ "Walloon Region",
      "brussels" ~ "Brussels-Capital Region"
    ),
    countryCode = "BE",
    occurrenceStatus = occurrence_status,
    establishmentMeans = "introduced",
    degreeOfEstablishment = degree_of_establishment,
    pathway = mapped_pathway,
    eventDate = event_date
  ) |>
  relocate(pathway, .after = degreeOfEstablishment) |>
  arrange(taxonID)

3.3 Species profile extension

In this extension will express broad habitat characteristics (e.g. isTerrestrial) of the species. Habitat information can be found in habitat, which describes whether a species is found in freshwater, terrestrial or both (terrestrial/freshwater) habitats.

Map the data to Species Profile:

species_profile <-
  input_data |>
  distinct(taxon_id, .keep_all = TRUE) |> # Only keep one row per taxon
  mutate(habitat = str_to_lower(habitat)) |>
  mutate(
    .keep = "none",
    taxonID = taxon_id,
    isMarine = "FALSE",
    isFreshwater = case_when(
      habitat == "fresh water" | habitat == "terrestrial/fresh water" ~ "TRUE",
      TRUE ~ "FALSE"
    ),
    isTerrestrial = case_when(
      habitat == "terrestrial" | habitat == "terrestrial/fresh water" ~ "TRUE",
      TRUE ~ "FALSE"
    )
  ) |>
  arrange(taxonID)

3.4 Description extension

In the description extension we include the native range information. This information is included in origin (e.g. E AS-Te NAM). We’ll separate, clean, map and combine these values.

Split origin on space in 4 columns. In case there are more than 4 values, these will be merged in native_range_4. The dataset currently contains no more than 4 values per record.

native_range <-
  input_data |>
  distinct(taxon_id, .keep_all = TRUE) |> # Only keep one row per taxon
  separate(
    origin,
    into = c("native_range_1", "native_range_2", "native_range_3", "native_range_4"),
    sep = " ",
    remove = FALSE,
    convert = FALSE,
    extra = "merge",
    fill = "right"
  )

Transform from wide to long (one line per native range):

native_range <-
  native_range |>
  pivot_longer(
    cols = c(native_range_1, native_range_2, native_range_3, native_range_4),
    names_to = "native_range_key",
    values_to = "native_range",
    values_drop_na = TRUE # Also removes records for which there is no native_range_1
  )

Map values:

native_range <-
  native_range |>
  arrange(id) |> # To see descriptions in order
  mutate(
    native_range = str_replace_all(native_range, "\\?", ""), # Strip ?
    native_range = str_trim(native_range) # Clean whitespace
  ) |>
  mutate(
    mapped_native_range = case_match(
      native_range,
      "AF" ~ "Africa (WGSRPD:2)",
      "AM" ~ "pan-American",
      "AS" ~ "Asia",
      "AS-Te" ~ "temperate Asia (WGSRPD:3)",
      "AS-Tr" ~ "tropical Asia (WGSRPD:4)",
      "AUS" ~ "Australasia (WGSRPD:5)",
      "Cult." ~ "cultivated origin",
      "E" ~ "Europe (WGSRPD:1)",
      "Hybr." ~ "hybrid origin",
      "NAM" ~ "Northern America (WGSRPD:7)",
      "SAM" ~ "Southern America (WGSRPD:8)",
      "Trop." ~ "Pantropical",
      .default = NA_character_
    )
  ) |>
  filter(!is.na(mapped_native_range))

Show mapped values:

native_range |>
  select(native_range, mapped_native_range) |>
  group_by(native_range, mapped_native_range) |>
  summarize(records = n()) |>
  arrange(native_range)

Map the data to Taxon Description:

description <-
  native_range |>
  mutate(
    .keep = "none",
    taxonID = taxon_id,
    description = mapped_native_range,
    type = "native range",
    language = "en"
  ) |>
  relocate(type, .after = description) |>
  arrange(taxonID)

4 Save to CSV

write_csv(taxon, here("data", "processed", "taxon.csv"), na = "")
write_csv(distribution, here("data", "processed", "distribution.csv"), na = "")
write_csv(species_profile, here("data", "processed", "speciesprofile.csv"), na = "")
write_csv(description, here("data", "processed", "description.csv"), na = "")