This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.

Load libraries:

library(tidyverse)      # To do data science
library(magrittr)       # To use %<>% pipes
library(here)           # To find files
library(janitor)        # To clean input data
library(readxl)         # To read Excel files
library(digest)         # To generate hashes
library(rgbif)          # To use GBIF services

Set file paths (all paths should be relative to this script):

# Raw files:
raw_data_file = "../data/raw/transcribed_10530_2016_1278_MOESM2_ESM.xlsx"
references = "../data/raw/transcribed_references.xlsx"

# Processed files:
dwc_taxon_file = "../data/processed/taxon.csv"
dwc_literature_references_file = "../data/processed/references.csv"
dwc_distribution_file = "../data/processed/distribution.csv"
dwc_profile_file = "../data/processed/speciesprofile.csv"
dwc_description_file = "../data/processed/description.csv"

1 Read and pre-process raw data

Create a data frame raw_data from the source data:

raw_data <- read_excel(path = raw_data_file) 

Clean the data somewhat: remove empty rows if present and clean names:

raw_data %<>% 
  remove_empty("rows") %<>%
  clean_names()

1.1 Generate taxonID

To uniquely identify a taxon in the taxon core and reference taxa in the extensions, we need a taxonID. Since we need it in all generated files, we generate it here in the raw data frame. It is a combination of dataset-shortname:taxon: and a hash based on the scientific name. As long as the scientific name doesn’t change, the ID will be stable. Some of the names require some minor cleaning first though:

raw_data %<>% mutate(species_clean = case_when(
  species == "Crassostrea rhizophoraeGuilding 1828" ~ "Crassostrea rhizophorae Guilding 1828",
  species == "Petricola pholadiformisLamarck, 1818" ~ "Petricola pholadiformis Lamarck, 1818",
  species == "Theba pisanaMüller, 1774)" ~ "Theba pisana (Müller, 1774)",
  species == "Maylandia sp. M. K. Meyer & W. Förster, 1984" ~ "Maylandia M. K. Meyer & W. Förster, 1984",
  TRUE ~ species)) 

Some scientific names don’t have spaces before the (. We add them here:

raw_data %<>% mutate(species_clean = str_replace_all(species_clean, "\\(", " ("), 
                     species_clean = str_replace_all(species_clean, "  ", " "))

Generate taxonID:

# Vectorize the digest function (the digest() function isn't vectorized. So if you pass in a vector, you get one value for the whole vector rather than a digest for each element of the vector):
vdigest <- Vectorize(digest)

# Generate taxonID:
raw_data %<>% mutate(taxonID = paste("rinse-pathways-checklist", "taxon", vdigest (species_clean, algo = "md5"), sep = ":"))

1.2 Further pre-processing

Add prefix raw_ to all column names to avoid name clashes with Darwin Core terms:

colnames(raw_data) <- paste0("raw_", colnames(raw_data))

Preview data:

raw_data %>% head()

2 Create taxon core

taxon <- raw_data

2.1 Term mapping

Map the data to Darwin Core Taxon.

2.1.1 language

taxon %<>% mutate(language = "en")

2.1.2 license

taxon %<>% mutate(license = "http://creativecommons.org/publicdomain/zero/1.0/") 

2.1.3 rightsHolder

taxon %<>% mutate(rightsHolder = "University of Cambridge")

2.1.4 accessRights

taxon %<>% mutate(accessRights = "http://www.inbo.be/en/norms-for-data-use") 

2.1.5 datasetID

taxon %<>% mutate(datasetID = "https://doi.org/10.15468/guejza")

2.1.6 institutionCode

taxon %<>% mutate(institutionCode = "University of Cambridge") 

2.1.7 datasetName

taxon %<>% mutate(datasetName = "RINSE - Pathways and vectors of biological invasions in Northwest Europe")

2.1.8 taxonID

taxon %<>% mutate(taxonID = raw_taxonID)

2.1.9 scientificName

taxon %<>% mutate(scientificName = raw_species_clean)

2.1.10 kingdom

Information for the higher classification is contained in raw_higher_classification:

taxon %>% distinct(raw_higher_classification)

These map to different taxonomical levels (e.g. Mammalia = class while Osteichthyes = order). We interpret this information in the mapping of kingdom, phylum, class and order.

taxon %<>% mutate(kingdom = case_when(
  raw_higher_classification == "Angiospermae" ~ "Plantae" ,
  TRUE ~ "Animalia")) 

2.1.11 phylum

taxon %<>% mutate(phylum = case_when(
  raw_higher_classification == "Angiospermae" ~ "Magnoliophyta" ,
  raw_higher_classification == "Mollusca" ~ "Mollusca",
  TRUE ~ "Chordata")) 

2.1.12 class

Osteichthyes, which is a superclass, includes Actinopterygii and Sarcopterygii (https://en.wikipedia.org/wiki/Osteichthyes). All the taxa in this list are Actinopterygii which is a class.

taxon %<>% mutate(class = case_when(
  raw_higher_classification == "Anseriformes" ~ "Aves",
  raw_higher_classification == "Mammalia" ~ "Mammalia",
  raw_higher_classification == "Osteichthyes" ~ "Actinopterygii", 
  TRUE ~ "")) 

2.1.13 order

taxon %<>% mutate(order = case_when(
  raw_higher_classification == "Anseriformes" ~ "Anseriformes",
  TRUE ~ "" )) 

2.1.14 taxonRank

The taxon rank can be provided by the nameparser() function of rgbif. First we create a data frame with the parsed names:

taxon_ranks <- parsenames(taxon$raw_species_clean)

And then select rank information from rankmarker. The field notho contains valuable information for the generic hybrids, so this column is selected as well.

taxon_ranks %<>% select(scientificname, rankmarker, notho) 

Merge taxon_ranks with taxon using scientific names:

taxon %<>% left_join(taxon_ranks, by = c("raw_species_clean" = "scientificname")) 

Show unique values for rankmarker and notho:

taxon %>% 
  select(rankmarker, notho) %>% 
  group_by_all() %>% 
  summarise(records = n())

Show scientificNames for which rankmarker = NA:

taxon %>% 
  select(scientificName, rankmarker) %>% 
  filter(is.na(rankmarker))

Maylandia M. K. Meyer & W. Förster, 1984 is a genus, the other two records are species.

Map taxonRank:

taxon %<>% mutate(taxonRank = case_when(
  !is.na(notho) ~ "hybrid",
  rankmarker == "sp." & is.na(notho) ~ "species",
  rankmarker == "infrasp." ~ "subspecies",
  is.na(rankmarker) & scientificName != "Maylandia M. K. Meyer & W. Förster, 1984" ~ "species",
  scientificName == "Maylandia M. K. Meyer & W. Förster, 1984" ~ "genus")) 

2.2 Post-processing

Remove the original columns:

taxon %<>% select(-starts_with("raw_"), -rankmarker, - notho)

Sort on taxonID:

taxon %<>% arrange(taxonID)

Preview data:

taxon %>% head()

Save to CSV:

write_csv(taxon, dwc_taxon_file, na = "")

3 Create literature reference extension

literature_references <- raw_data

3.1 Pre-processing

Information for the literature references is contained in raw_references, which are (sequences of) numbers in this case. The link between these numbers and the full references is provided in the original Supplementary material file (page 28-30). We copy-pasted this information to a newly generated Excel file references, here saved as the dataframe transcribed_references:

transcribed_references <- read_excel(path = references, col_types = "text")

3.1.1 Full references (on per row)

The literature reference extension is a dataframe in which one row contains one single reference (a short identifier and the full reference) per taxon. We generate this file in several steps:

  1. Provide all reference numbers in raw_references, instead of using the format containing - for a range of numbers between a delimited minimun and maximum (e.g. 3-7 is transformed to 3,4,5,6,7). To accomplish this, we first create the function hyphen_to_sequence():
hyphen_to_sequence <- function(input){
  seq_locs <- gregexpr("\\d+-\\d+", input)
  matched <- regmatches(input, seq_locs)[[1]]
  
  split_support <- function(text) {
    split_seq <- strsplit(text, "-")[[1]]
    paste(as.character(seq(split_seq[1], split_seq[2])), collapse = ",")
  }
  
  regmatches(input, seq_locs, invert = FALSE) <- list(map_chr(matched, split_support))
  input
}

Apply function:

literature_references %<>% 
  rowwise() %<>%
  mutate(raw_references_clean = hyphen_to_sequence(raw_references)) 
  1. Transform literature_references from a wide to a long dataset, with each row containing one reference number per taxon:
literature_references %<>%
  mutate(references_sep = raw_references_clean) %<>% 
  ungroup() %<>%
  separate_rows(references_sep, sep = ",")
  1. Replace the reference numbers provided in raw_references with the full references provided in the dataframe transcribed_references using the field number as a link.
literature_references %<>% left_join(transcribed_references, by = c("references_sep" = "number"))

This dataframe constitutes the basis for further mapping of the literature reference extension:

literature_references %>%
  select(raw_species, identifier, full_reference) %>%
  head()

3.1.2 Source references (sequence of identifiers)

For the mapping of the field source in the other extensions, we need to translate the sequence of reference numbers given for each taxon in raw_references into a sequence of all corresponding identifiers separated by |. As we need to do this several times, we generate a new dataframe identifier_sequence here, linking the sequence of numbers in raw_references with the sequence of identifiers.

Generate dataframe identifier_sequence with the required columns (add taxonID as this is necessary for spreading in the next step):

identifier_sequence <- literature_references %>% select(raw_taxonID, raw_references, references_sep, identifier)

To generate the required sequence of identifiers, we need to transform identifier_sequence from a dataframe containing one identifier per taxon per row, to one containing a sequence of identifiers per taxon per row. We do this by using the spread() function first, generating 33 identifier columns (corresponding to the 33 identifiers):

identifier_sequence %<>% mutate(references_sep = as.numeric(references_sep)) %<>% 
  spread(references_sep, identifier) %<>%
  select(-`<NA>`)

In the following step, we unite these 33 columns into one field id_sequence, using | as a separator. To unite the identifier columns (1-33) in a single step, we first rename the column names by adding the prefix ref_:

old_names <- c("raw_taxonID", "raw_references", c(1:33))
new_names <- c("raw_taxonID", "raw_references", paste0("ref_", c(1:33)))
identifier_sequence %<>% rename_at(vars(old_names), ~ new_names)

Generate id_sequence:

identifier_sequence %<>% unite("id_sequence", c(paste0("ref_", c(1:33))), sep = " | ")

Preview:

identifier_sequence %>% head()

Remove NA’s from the generated sequences:

identifier_sequence %<>% mutate(id_sequence = str_replace_all(id_sequence, "(([NA]{2}\\s[|]) | [|]\\s([NA]{2}))", ""))

Remove raw_taxonID:

identifier_sequence %<>% select(-raw_taxonID) 

Remove duplicate rows (some taxa have the same reference sequence):

identifier_sequence %<>% distinct(raw_references, .keep_all = TRUE)

We will use this dataframe to link the sequence of references numbers (raw_references) with the sequence of identifiers (id_sequence) in the mapping of the following extensions.

3.2 Term mapping

Map the data to Literature References.

3.2.1 taxonID

literature_references %<>% mutate(taxonID = raw_taxonID)

3.2.2 identifier

literature_references %<>% select(-identifier, everything()) 

3.2.3 bibliographicCitation

literature_references %<>% mutate(bibliographicCitation = full_reference) 

3.3 Post-processing

Remove the original columns:

literature_references %<>% select(-starts_with("raw_"), -references_sep, -full_reference)

Sort on taxonID:

literature_references %<>% arrange(taxonID) 

Preview data:

literature_references %>% head()

Save to CSV:

write_csv(literature_references, dwc_literature_references_file, na = "")

4 Create distribution extension

distribution <- raw_data

4.1 Pre-processing

Information for the distributions and associated dates of first observation is contained in eight columns in total:

  • 4 columns for the distribution information (Great Britain, France, Belgium and the Netherlands in resp. raw_established_gb, raw_established_france, raw_established_belgium and raw_established_netherlands)
  • 4 columns for the first record of the taxon in these countries (resp. raw_first_record_gb, raw_first_record_france, raw_first_record_belgium and raw_first_record_netherlands).

When a taxon is present in a country, this is indicated by a Y and a date of first record is provided.

We need to reduce the information for eight to two columns: one column containing country presence information (country column, which is the basis for locality, locationID and countryCode) and column containing the date of first observance (date column, which is the basis for eventDate). We do this in three steps:

  1. Paste date and establishment information for each country into one column (from 8 to 4 columns):
distribution %<>% mutate(record_gb = paste(raw_established_gb, raw_first_record_gb, sep = ","))  %<>% 
  mutate(record_fr = paste(raw_established_france, raw_first_record_france, sep = ",")) %<>% 
  mutate(record_be = paste(raw_established_belgium, raw_first_record_belgium, sep = ",")) %<>%
  mutate(record_nl = paste(raw_established_netherlands, raw_first_record_netherlands, sep = ",")) 
  1. Generate country and date by gathering the new columns (from 4 to 2 columns):
distribution %<>% gather("country", "date", record_gb, record_fr, record_be, record_nl) 
  1. Clean date by removing records with NA,NA and removing Y:
distribution %<>% filter(date != "NA,NA") %<>% mutate(date = str_replace_all(date, "Y,", "")) 
  1. Rename country:
distribution %<>% mutate(country = recode(country,
  "record_gb" = "Great Britain",
  "record_fr" = "France",
  "record_be" = "Belgium",
  "record_nl" = "The Netherlands")
)

Preview:

distribution %>%
  select(raw_species, country, date) %>%
  head()

4.2 Term mapping

Map the data to Species Distribution.

4.2.1 taxonID

distribution %<>% mutate(taxonID = raw_taxonID)

4.2.2 locationID

distribution %<>% mutate(locationID = case_when(
   country == "Great Britain" ~ " WGSRPD:GRB",
   country == "France" ~ "ISO_3166-2:FR",
   country == "Belgium" ~ "ISO_3166-2:BE",
   country == "The Netherlands" ~ "ISO_3166-2:NL"))

4.2.3 locality

distribution %<>% mutate(locality = country)

4.2.4 countryCode

distribution %<>% mutate(countryCode = case_when(
   country == "Great Britain" ~ "GB",
   country == "France" ~ "FR",
   country == "Belgium" ~ "BE",
   country == "The Netherlands" ~ "NL"))

4.2.5 occurrenceStatus

distribution %<>% mutate(occurrenceStatus = "present") 

4.2.6 establishmentMeans

distribution %<>% mutate(establishmentMeans = "introduced") 

4.2.7 eventDate

Inspect dates:

distribution %>% distinct(date) %>% arrange(date)

Check which dates deviate from the required ISO 8601 format (here YYYY)

unique(distribution[which(str_detect(distribution $ date,"^[0-9]{4}$") == FALSE), "date"])

Clean those dates:

distribution %<>% mutate(date = str_trim(date)) # remove whitespaces

distribution %<>% mutate(date = recode(date,    # recode deviating formats
   "n.d."  = "",
   "17634" = "1763",
   "2004," = "2004",
   "n..d"  = "",
   "199"   = "1990",
   "1985," = "1985",
   "NA"    = "",
   "19204" = "1904")) 

After this cleaning step, there should be no deviations from the ISO 8601 format:

unique(distribution[which(str_detect(distribution $ date,"^[0-9]{4}$") == FALSE), "date"])

We want to express eventDate as the range between the date of first and last record. In this case, we do not have information about the last record. We will consider the publication year of Zieritz et al. (2017) as the date when the presence of the species was last verified:

distribution %<>% mutate(eventDate = case_when(
  date == "" ~ "",
  date != "" ~ paste(date, "2017", sep = "/"))) 

4.2.8 source

Information about the source is provided in raw_references. We replace the sequence of reference numbers by the sequence of identifiers (not full references), provided in the dataframe identifier_sequence generated earlier.

distribution %<>% left_join(identifier_sequence, by = ("raw_references"))

Rename id_sequence:

distribution %<>%  rename("source" = "id_sequence")

4.3 Post-processing

Remove the original columns:

distribution %<>% select(-starts_with("raw_"), -country, -date)

Sort on taxonID:

distribution %<>% arrange(taxonID) 

Preview data:

distribution %>% head()

Save to CSV:

write_csv(distribution, dwc_distribution_file, na = "")

5 Create species profile extension

In this extension we express broad habitat characteristics (e.g. isTerrestrial) of the species.

species_profile <- raw_data

Habitat information can be found in raw_environment, which describes whether a species is found in freshwater, terrestrial, marine or a combination of these habitats.

Show unique values:

species_profile %>%
  select(raw_environment) %>%
  group_by_all() %>% 
  summarize(records = n())

Only two taxa have no information on the environment:

  • Salvelinus fontinalis (brook trout): a fish found in both freshwater and marine habitats
  • Rattus norvegicus(brown rat): a terrestrial mammal

We will add this information in the mapping. Environment information is mapped to isFreshwater, isTerrestrial, isMarine or a combination of these terms in the species profile extension.

5.1 Term mapping

Map the data to Species Profile.

5.1.1 taxonID

species_profile %<>% mutate(taxonID = raw_taxonID)

5.1.2 isMarine

species_profile %<>% mutate(isMarine = case_when(
  raw_environment == "M" |
  raw_environment == "M+F" |
  raw_environment == "T+M" | 
  raw_species == "Salvelinus fontinalis (Mitchill, 1814)" ~ "TRUE",
  TRUE ~ "FALSE"
))

5.1.3 isFreshwater

species_profile %<>% mutate(isFreshwater = case_when(
  raw_environment == "F" |
  raw_environment == "F,T" |
  raw_environment == "F+T" |
  raw_environment == "M+F" |
  raw_species == "Salvelinus fontinalis (Mitchill, 1814)" ~ "TRUE",
  TRUE ~ "FALSE"))

5.1.4 isTerrestrial

species_profile %<>% mutate(isTerrestrial = case_when(
  raw_environment == "F,T" |
  raw_environment == "F+T" |
  raw_environment == "T" |
  raw_environment == "T+M" |
  raw_species == "Rattus norvegicus (Berkenhout 1769)" ~ "TRUE",
  TRUE ~"FALSE"))

Show mapped values:

species_profile %>%
  select(raw_environment, isMarine, isFreshwater, isTerrestrial) %>%
  group_by_all() %>%
  summarize(records = n())

5.2 Post-processing

Remove the original columns:

species_profile %<>% select(-starts_with("raw_"))

Sort on taxonID:

species_profile %<>% arrange(taxonID)

Preview data:

species_profile %>% head()

Save to CSV:

write_csv(species_profile, dwc_profile_file, na = "")

6 Create description extension

In the description extension we want to include several important characteristics (hereafter refered to as descriptors) about the species:

  • Native range
  • Pathway of introduction

A single taxon can have multiple descriptions of the same type (e.g. multiple native ranges), expressed as multiple rows in the description extension. For each descriptor, we create a separate dataframe to process the specific information. We always specify which descriptor we map (type column) and its specific content (description column). After the mapping of these Darwin Core terms type and value, we merge the dataframes to generate one single description extension. We then continue the mapping process by adding the other Darwin Core terms (which content is independent of the type of descriptor, such as language).

6.1 Native range

Native range information (e.g. Australia) can be found in raw_origin.

Create separate dataframe:

native_range <- raw_data

Show unique values:

native_range %>% 
  distinct(raw_origin) %>% 
  arrange(raw_origin)

raw_origin contains multiple values (currently not more than 4), so we separate it in 4 columns:

native_range %<>% separate(raw_origin, into = paste("origin", c(1:4), sep = "_"))

Gather in a key and value column:

native_range %<>% gather("key", "value", paste("origin", c(1:4), sep = "_"), na.rm = TRUE) 

Inspect values:

native_range %>% 
  distinct(value) %>% 
  arrange(value)

Map values:

native_range %<>% mutate(mapped_value = recode(value,
  "Af"  = "Africa (WGSRPD:2)",
  "Ar"  = "Arctic",
  "As"  = "Asia",
  "Au"  = "Australia (WGSRPD:50)",
  "Aus" = "Australia (WGSRPD:50)",
  "d"   = "",
  "Eu"  = "Europe (WGSRPD:1)",
  "n"   = "",
  "na"  = "",
  "NAm" = "Northern America (WGSRPD:7)",
  "SAm" = "Southern America (WGSRPD:8)"
)) 

Show mapped values:

native_range %>%
  select(value, mapped_value) %>%
  group_by(value, mapped_value) %>%
  summarize(records = n()) 

Keep only non-empty descriptions:

native_range %<>% filter(!is.na(mapped_value) & mapped_value != "")

Drop the key and value columns and rename mapped_value as description:

native_range %<>%
  select(-key, -value) %>%
  rename(description = mapped_value)

Create a type field to indicate the type of description:

native_range %<>% mutate(type = "native range")

6.2 Pathway of introduction

Pathway information in this dataset is a combination of the columns starting with raw_pathway_ (4 columns: raw_pathway_accidental, raw_pathway_disperal, raw_pathway_import_escape and raw_pathway_import_release) and raw_vector_ (5 columns: raw_vector_biocontrol, raw_vector_leisure, raw_vector_industry, raw_vector_ornamental and raw_vector_research):

raw_data %>% select(starts_with("raw_vector"), starts_with("raw_pathway")) %>% 
  group_by_all() %>% 
  summarize(records = n()) 

This information will be mapped to the six main categories of the CBD standard. How these RINSE pathways map to the CBD standard is described in the source paper Zieritz et al. (2017) (in Methods > Data analysis):

RINSE pathway CBD category
accidental stowaway, contaminant
dispersal corridor, undaided
import_release release
import_escape escape

For certain taxa, we will complement this pathway information with the specific vector of introduction (contained in columns starting with raw_vector_), using the subcategories of the CBD standard:

  • For pathways accidental and disperal, we will never provide vector information. This is because vectors of accidentally introduced species were not analysed in the present study, due to a lack of reliable data (see Zieritz et al. 2017).
  • For pathways import_escape and import_release, we will map vector information to the CBD subcategories only when they map unambigiously to this standard (see table below and this issue) and when these vectors are the only vectors of introduction for these taxon.
RINSE pathway RINSE vector CBD category CBD subcategory
accidental biocontrol stowaway, contaminant NA
accidental industry stowaway, contaminant NA
accidental leisure stowaway, contaminant NA
accidental ornamental stowaway, contaminant NA
accidental research stowaway, contaminant NA
dispersal biocontrol corridor, unaided NA
dispersal industry corridor, unaided NA
dispersal leisure corridor, unaided NA
dispersal ornamental corridor, unaided NA
dispersal research corridor, unaided NA
import_escape biocontrol escape NA
import_escape industry escape NA
import_escape leisure escape live food and bait
import_escape ornamental escape NA
import_escape research escape research and ex-situ breeding (in facilities)
import_release biocontrol release biological control
import_release industry release NA
import_release leisure release NA
import_release ornamental release NA
import_release research release NA

We map this information in the new column description, for which we use the TrIAS vocabulary.

Create separate dataframe:

pathway_desc <- raw_data

We gather pathway information in one new column pathway:

pathway_desc %<>% gather(pathway, value, starts_with("raw_pathway_"), na.rm = TRUE, convert = TRUE)

Column “value” will only contain “Y”, so no need for this column

pathway_desc %<>% select(-value)

Map patwhay information to the CBD main category:

pathway_desc %<>% mutate(description = recode(pathway,
  "raw_pathway_accidental" = "stowaway,contaminant",
  "raw_pathway_dispersal" = "corridor,natural_dispersal",
  "raw_pathway_import_escape" = "escape",
  "raw_pathway_import_release" = "release"
))

Arrange by taxonID to see things more in context:

pathway_desc %<>% arrange(raw_taxonID)

Map vector information to CBD subcategories for some taxa:

pathway_desc %<>% mutate(description = case_when(
  description == "escape" &
    raw_vector_research == "Y" & 
    raw_vector_leisure == "Y" &
    is.na(raw_vector_ornamental) &
    is.na(raw_vector_biocontrol) & 
    is.na(raw_vector_industry) ~ "escape_research, escape_food_bait",
  description == "escape" &
    raw_vector_research == "Y" & 
    is.na(raw_vector_leisure ) &
    is.na(raw_vector_ornamental) &
    is.na(raw_vector_biocontrol) & 
    is.na(raw_vector_industry) ~ "escape_research",
  description == "escape" &
    is.na(raw_vector_research) & 
    raw_vector_leisure == "Y" &
    is.na(raw_vector_ornamental) &
    is.na(raw_vector_biocontrol) & 
    is.na(raw_vector_industry) ~ "escape_food_bait",
  description == "release" &
    is.na(raw_vector_research) & 
    is.na(raw_vector_leisure) &
    is.na(raw_vector_ornamental) &
    raw_vector_biocontrol == "Y" & 
    is.na(raw_vector_industry) ~ "release_biocontrol",
TRUE ~ description # Leave rest as is
))

Separate description when more than two values are provided, e.g. in case of RINSE pathway = accidental or RINSE pathway = dispersal:

pathway_desc %<>% separate_rows(description, sep = ",")

Remove all empty description values:

pathway_desc %<>% filter(description != "") 

Remove duplicate rows:

pathway_desc %<>% distinct(raw_taxonID, description, .keep_all = TRUE)

Remove pathway:

pathway_desc %<>% select(-pathway)

Add the prefix cbd_2014_pathway: to refer to this standard:

pathway_desc %<>% mutate(description = paste ("cbd_2014_pathway", description, sep = ":"))

Create a type field to indicate the type of description:

pathway_desc %<>% mutate(type = "pathway")

6.3 Union descriptions

Union native range and pathway of introduction:

description_ext <- bind_rows(native_range, pathway_desc)

6.4 Term mapping

Map the data to Taxon Description.

6.4.1 taxonID

description_ext %<>% mutate(taxonID = raw_taxonID)

6.4.2 description

description_ext %<>% mutate(description = description)

6.4.3 type

description_ext %<>% mutate(type = type)

6.4.4 source

Information about the source is provided in raw_references. We replace the sequence of reference numbers by the sequence of identifiers (not full references), provided in the dataframe identifier_sequence generated earlier.

description_ext %<>% left_join(identifier_sequence, by = ("raw_references"))

Rename id_sequence:

description_ext %<>%  rename("source" = "id_sequence")

6.4.5 language

description_ext %<>% mutate(language = "en")

6.5 Post-processing

Remove the original columns:

description_ext %<>% select(-starts_with("raw_"))

Move taxonID to the first position:

description_ext %<>% select(taxonID, everything())

Sort on taxonID:

description_ext %<>% arrange(taxonID)

Preview data:

description_ext %>% 
  mutate(source = substr(source, 1, 20)) %>% # Shorten source to make it easier to display
  head()

Save to CSV:

write_csv(description_ext, dwc_description_file, na = "")