6 Darwin Core mapping

In this chapter we standardize the unified information to a Darwin Core checklist that can be harvested by GBIF.

6.1 Read unified data

checklists <- read_csv(here("data", "raw", "checklists.csv"))
input_taxa <- read_csv(here("data", "interim", "taxa_unified.csv"))
input_distributions <- read_csv(
  here("data", "interim", "distributions_unified.csv"),
  na = "",
  col_types = cols(
    .default = col_character(),
    verificationKey = col_double(),
    startYear = col_double(),
    endYear = col_double()
  )
)
input_speciesprofiles <- read_csv(
  here("data", "interim", "speciesprofiles_unified.csv")
)
input_descriptions <- read_csv(here("data", "interim", "descriptions_unified.csv"))

6.2 Preview data

  1. Number of rows per file and corresponding mapping section in this chapter:

| File | Number of rows | Mapping section |
| --- | --- | --- |
| taxa | 3902 | 6.4 |
| distributions | 10398 | 6.5 |
| speciesprofiles | 3306 | 6.6 |
| descriptions | 13446 | 6.7 |

  2. Number of taxa per checklist:
  3. Number of taxa per kingdom:
  4. Number of taxa per rank:
  5. Number of taxa and descriptions per type:

6.3 How we cite our sources

Each row of information in the Taxon core and the extensions is based on a specific source:

| File | Source | Field for citation | Mapping section |
| --- | --- | --- | --- |
| Taxon core | a taxon in the GBIF Backbone Taxonomy | bibliographicCitation | 6.4 |
| Distribution extension | one or more taxa in a source checklist | source | 6.5 |
| Species profile extension | a taxon in a source checklist | source | 6.6 |
| Description extension | a taxon in a source checklist | source | 6.7 |

To reference this source, we use the GBIF citation format for species pages, prefixed with the URL of that page. For example, for the distribution of Nymphaea marliacea Marliac this would be:

https://www.gbif.org/species/141264581: Nymphaea marliacea Marliac in Verloove F, Groom Q, Brosens D, Desmet P, Reyserhove L (2018). Manual of the Alien Plants of Belgium. Version 1.7. Botanic Garden Meise. Checklist dataset https://doi.org/10.15468/wtda1m.

This information is a combination of:

  • taxonKey: e.g. 141264581 (contained in distributions.csv),
  • scientificName: e.g. Nymphaea marliacea Marliac (contained in distributions.csv),
  • citation: e.g. Verloove F, Groom Q, Brosens D, Desmet P, Reyserhove L (2018). Manual of the Alien Plants of Belgium. Version 1.7. Botanic Garden Meise. Checklist dataset https://doi.org/10.15468/wtda1m. (contained in checklists.csv)

To generate this full citation, we create a helper function add_source_citation(df, new_column, dataset_info).
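The definition of the helper is not shown in this chapter. A minimal sketch of how such a function could work is given below, assuming that df has the columns datasetKey, taxonKey and scientificName and that dataset_info has the columns datasetKey and citation. The function name add_source_citation_sketch and the dataset key in the example are hypothetical, and the actual helper may differ (for instance, it is later called with an extra taxon_key argument).

```r
library(dplyr)

# Hypothetical sketch, not the project's actual helper.
# Assumed columns: df has datasetKey, taxonKey, scientificName;
# dataset_info has datasetKey, citation.
add_source_citation_sketch <- function(df, new_column, dataset_info) {
  df %>%
    # Look up the citation of the source checklist
    left_join(
      select(dataset_info, datasetKey, citation),
      by = "datasetKey"
    ) %>%
    # Build "<species page URL>: <scientificName> in <citation>"
    mutate(
      !!new_column := paste0(
        "https://www.gbif.org/species/", taxonKey, ": ",
        scientificName, " in ", citation
      )
    ) %>%
    # Drop the temporary citation column again
    select(-citation)
}

# Example with a made-up dataset key
checklists_example <- tibble(
  datasetKey = "example-dataset-key",
  citation = paste(
    "Verloove F, Groom Q, Brosens D, Desmet P, Reyserhove L (2018).",
    "Manual of the Alien Plants of Belgium. Version 1.7.",
    "Botanic Garden Meise. Checklist dataset https://doi.org/10.15468/wtda1m."
  )
)
distributions_example <- tibble(
  datasetKey = "example-dataset-key",
  taxonKey = "141264581",
  scientificName = "Nymphaea marliacea Marliac"
)

cited <- add_source_citation_sketch(
  distributions_example,
  new_column = "dwc_source",
  dataset_info = checklists_example
)
cited$dwc_source
```

Passing the column name as a string keeps the call style consistent with the way the helper is used in the mapping sections.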

6.4 Taxon core

6.4.1 Pre-processing

  1. Create a dataframe taxon from the unified taxa.

  2. Separate canonicalName into canonicalName_genus, canonicalName_species and canonicalName_infraspecific (on whitespace).
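The second step could look roughly like the sketch below, which uses tidyr::separate() on a small illustrative data frame. The example names and the choice of fill and extra settings are assumptions; the chapter does not show how names with fewer or more than three parts are handled.

```r
library(dplyr)
library(tidyr)

# Illustrative canonical names with one, two and three parts
taxon_example <- tibble(
  canonicalName = c("Azolla", "Nymphaea marliacea", "Acer negundo californicum")
)

taxon_example <- taxon_example %>%
  separate(
    canonicalName,
    into = c(
      "canonicalName_genus",
      "canonicalName_species",
      "canonicalName_infraspecific"
    ),
    sep = "\\s+",     # split on whitespace
    remove = FALSE,   # keep the original canonicalName column
    fill = "right",   # missing parts become NA
    extra = "merge"   # any further parts stay in the last column
  )

taxon_example
```

With these settings a genus-only name yields NA for both epithets, which is then carried over to specificEpithet and infraspecificEpithet.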

6.4.2 Term mapping

Map the data to Darwin Core Taxon.

6.4.2.1 language

taxon %<>% mutate(dwc_language = "en")

6.4.2.2 license

The license under which (each record of) the unified checklist will be published should be the most restrictive license of the source checklists. The potential licenses are the three Creative Commons licenses supported by GBIF, ordered from least to most restrictive: CC0 1.0, CC BY 4.0 and CC BY-NC 4.0.

The actual licenses of the source checklists and the GBIF Backbone Taxonomy (which we use for the taxon core) are:

Based on the above ranking, the most restrictive license is:

## [1] "http://creativecommons.org/licenses/by/4.0/legalcode"

We use this as our license:

taxon %<>% mutate(dwc_license = most_restrictive_license)
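The selection of the most restrictive license can be sketched as a lookup in a ranked vector. The snippet below is an illustration only: the vector of source licenses is made up, and the variable names are hypothetical, so it does not reproduce how most_restrictive_license is actually derived.

```r
# License URLs in the form used by GBIF, ordered from least to most restrictive
license_ranking <- c(
  "http://creativecommons.org/publicdomain/zero/1.0/legalcode", # CC0 1.0
  "http://creativecommons.org/licenses/by/4.0/legalcode",       # CC BY 4.0
  "http://creativecommons.org/licenses/by-nc/4.0/legalcode"     # CC BY-NC 4.0
)

# Made-up example: licenses of the sources (mostly CC0, one CC BY)
source_licenses_example <- c(
  "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
  "http://creativecommons.org/licenses/by/4.0/legalcode",
  "http://creativecommons.org/publicdomain/zero/1.0/legalcode"
)

# Position of each license in the ranking; the highest position wins
positions <- match(source_licenses_example, license_ranking)
most_restrictive_example <- license_ranking[max(positions)]
most_restrictive_example
```

A license that is not in the ranking would give NA from match(), so in practice one would check for that before taking the maximum.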

6.4.2.3 rightsHolder

We do not set rightsHolder, as the taxon and its related information are based on different source checklists (which in turn are based on other sources), published by different organizations, and mostly released under CC0. Instead, we make an effort to cite the sources (see 6.3).

taxon %<>% mutate(dwc_rightsHolder = NA)

6.4.2.4 bibliographicCitation

See 6.3:

# Add temporary field with datasetKey of GBIF Backbone Taxonomy
# taxon %<>% mutate(datasetKey = "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c")

taxon %<>% add_source_citation(
  new_column = "dwc_bibliographicCitation",
  dataset_info = checklists,
  taxon_key = "verificationKey"
)

6.4.2.5 datasetID

taxon %<>% mutate(dwc_datasetID = "https://doi.org/10.15468/xoidmd")

6.4.2.6 institutionCode

taxon %<>% mutate(dwc_institutionCode = "ISSG") # Invasive Species Specialist Group (ISSG)

6.4.2.7 datasetName

taxon %<>% mutate(dwc_datasetName = "Global Register of Introduced and Invasive Species - Belgium")

6.4.2.8 references

URL of the GBIF Backbone Taxonomy taxon on gbif.org:

taxon %<>% mutate(dwc_references = paste0("https://www.gbif.org/species/", verificationKey))

6.4.2.9 taxonID

URL of the GBIF Backbone Taxonomy taxon on gbif.org:

taxon %<>% mutate(dwc_taxonID = paste0("https://www.gbif.org/species/", verificationKey))

6.4.2.10 scientificName

taxon %<>% mutate(dwc_scientificName = scientificName)

6.4.2.11 kingdom

taxon %<>% mutate(dwc_kingdom = kingdom)

6.4.2.12 phylum

taxon %<>% mutate(dwc_phylum = phylum)

6.4.2.13 class

taxon %<>% mutate(dwc_class = class)

6.4.2.14 order

taxon %<>% mutate(dwc_order = order)

6.4.2.15 family

taxon %<>% mutate(dwc_family = family)

6.4.2.16 genus

genus is part of the higher classification, which is provided by the GBIF Backbone Taxonomy. We will use that, but note that for some synonyms it might differ from the genus in the scientific name:

taxon %<>% mutate(dwc_genus = genus)

6.4.2.17 specificEpithet

taxon %<>% mutate(dwc_specificEpithet = canonicalName_species)

6.4.2.18 infraspecificEpithet

taxon %<>% mutate(dwc_infraspecificEpithet = canonicalName_infraspecific)

6.4.2.19 taxonRank

Inspect values:

Map values as is. They are in UPPERCASE, which makes it clearer that this information comes from the GBIF Backbone Taxonomy:

taxon %<>% mutate(dwc_taxonRank = rank)

6.4.2.20 scientificNameAuthorship

taxon %<>% mutate(dwc_scientificNameAuthorship = authorship)

6.4.2.21 taxonRemarks

Here we list the checklists that were considered for unifying information about a taxon, i.e. the checklists we selected (see 1.1) in which the taxon appeared and got through verification (see 3.2). In the case of multiple considered checklists, it is possible that not all of them are selected as a source (see 6.3) when unifying information.

The datasetKeys of the considered checklists are listed in datasetKeys. Since we want the DOI instead, we will separate this information into columns, gather to rows and join with the DOI, and spread and combine again into a single column.
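The lookup described above can be sketched as follows. This is an assumption-laden illustration: it presumes that datasetKeys is a comma-separated string and that the checklist information has a doi column, and it uses tidyr::separate_rows() as a shortcut instead of the separate, gather and spread sequence. All keys and DOIs in the example are placeholders.

```r
library(dplyr)
library(tidyr)

# Placeholder data: two taxa, the first considered in two checklists
taxon_example <- tibble(
  verificationKey = c(1, 2),
  datasetKeys = c("key-a,key-b", "key-b")
)
checklists_example <- tibble(
  datasetKey = c("key-a", "key-b"),
  doi = c(
    "https://doi.org/10.0000/example-a",
    "https://doi.org/10.0000/example-b"
  )
)

dois <- taxon_example %>%
  select(verificationKey, datasetKeys) %>%
  # One row per taxon and considered checklist
  separate_rows(datasetKeys, sep = ",") %>%
  rename(datasetKey = datasetKeys) %>%
  # Replace each datasetKey by its DOI
  left_join(checklists_example, by = "datasetKey") %>%
  # Collapse back to one row per taxon
  group_by(verificationKey) %>%
  summarize(datasetDOIs = paste(doi, collapse = ", "), .groups = "drop")

taxon_example <- taxon_example %>%
  left_join(dois, by = "verificationKey")

taxon_example$datasetDOIs
```

The resulting datasetDOIs column has one comma-separated string per taxon, which is the shape the taxonRemarks mapping expects.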

Map to taxonRemarks:

taxon %<>% mutate(dwc_taxonRemarks = paste(
  "Sources considered for this taxon:", datasetDOIs
))

6.4.3 Post-processing

  1. Only keep the Darwin Core columns.

  2. Drop the dwc_ prefix.

  3. Sort on taxonID.

  4. Preview data:

  5. Save to CSV.
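The post-processing steps above could be written roughly as in the sketch below. The example data and the output location (a temporary directory) are assumptions; the chapter does not show the actual file path.

```r
library(dplyr)
library(stringr)
library(readr)

# Small stand-in for the mapped data frame
taxon_example <- tibble(
  verificationKey = c(2, 1),
  dwc_taxonID = c(
    "https://www.gbif.org/species/2",
    "https://www.gbif.org/species/1"
  ),
  dwc_language = "en"
)

taxon_example <- taxon_example %>%
  select(starts_with("dwc_")) %>%               # 1. only keep Darwin Core columns
  rename_with(~ str_remove(.x, "^dwc_")) %>%    # 2. drop the dwc_ prefix
  arrange(taxonID)                              # 3. sort on taxonID

head(taxon_example)                             # 4. preview data

# 5. save to CSV (hypothetical path), writing NA as empty string
output_file <- file.path(tempdir(), "taxon.csv")
write_csv(taxon_example, output_file, na = "")
```

The same sequence applies to the extensions, with only the data frame and file name changing.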

6.5 Distribution extension

6.5.1 Pre-processing

Create a dataframe distribution from the unified distributions.

6.5.2 Term mapping

Map the data to Species Distribution. Because of the scope (see 1.3) of the dataset, we can set all distributions to occurrenceStatus:present and establishmentMeans:introduced.

6.5.2.1 taxonID

distribution %<>% mutate(dwc_taxonID = paste0("https://www.gbif.org/species/", verificationKey))

6.5.2.2 locationID

distribution %<>% mutate(dwc_locationID = locationId)

6.5.2.3 locality

distribution %<>% mutate(dwc_locality = locality)

6.5.2.4 countryCode

distribution %<>% mutate(dwc_countryCode = "BE")

6.5.2.5 occurrenceStatus

distribution %<>% mutate(dwc_occurrenceStatus = "present")

6.5.2.6 establishmentMeans

distribution %<>% mutate(dwc_establishmentMeans = "introduced")

6.5.2.7 eventDate

The distribution information applies to a certain date range, which we express here as an ISO 8601 interval yyyy/yyyy (startYear/endYear). How startYear and endYear are extracted from the source checklists is described in 5.3. As a result, each record in input_distributions has either both a startYear and an endYear, or no eventDate information at all.

distribution %<>% mutate(dwc_eventDate = case_when(
  is.na(startYear) & is.na(endYear) ~ "",
  TRUE ~ paste(startYear, endYear, sep = "/")
))
  • Minimum year: 1201
  • Maximum year: 2023
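To illustrate the eventDate mapping, the sketch below applies the same case_when() logic to two made-up records, one with both years and one with none. The data frame name and values are illustrative assumptions.

```r
library(dplyr)

# Made-up records: one with a date range, one without date information
distribution_example <- tibble(
  startYear = c(1950, NA),
  endYear = c(2018, NA)
)

distribution_example <- distribution_example %>%
  mutate(dwc_eventDate = case_when(
    is.na(startYear) & is.na(endYear) ~ "",
    TRUE ~ paste(startYear, endYear, sep = "/")
  ))

distribution_example$dwc_eventDate

# Year range over all records, ignoring missing values
min_year <- min(distribution_example$startYear, na.rm = TRUE)
max_year <- max(distribution_example$endYear, na.rm = TRUE)
```

Note that a record with only one of the two years missing would produce a string containing "NA", so the mapping relies on the both-or-none property described above.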

6.5.2.8 source

A distribution can have multiple source taxa (e.g. two verified synonyms from the same checklist). We therefore separate datasetKeys and scientificNames into a maximum of three columns and build a source (see 6.3) for each.

Combine three source columns and remove NA:

distribution %<>% mutate(
  dwc_source = paste(source_1, source_2, source_3, sep = " | ") %>% str_remove_all(" \\| NA")
)
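The effect of this combination can be checked on a small made-up example. The source strings below are placeholders for the full citations; the code is the same paste() and str_remove_all() pattern as above.

```r
library(dplyr)
library(stringr)

# Placeholder sources: first record has two sources, second has one
distribution_example <- tibble(
  source_1 = c("source A", "source A"),
  source_2 = c("source B", NA),
  source_3 = c(NA, NA)
)

distribution_example <- distribution_example %>%
  mutate(
    # paste() turns NA into the string "NA", which is then stripped
    dwc_source = paste(source_1, source_2, source_3, sep = " | ") %>%
      str_remove_all(" \\| NA")
  )

distribution_example$dwc_source
```

This works because missing sources only occur after the first one, so every "NA" is preceded by the separator.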

6.5.3 Post-processing

  1. Only keep the Darwin Core columns.

  2. Drop the dwc_ prefix.

  3. Sort on taxonID.

  4. Preview data:

  5. Save to CSV.

6.6 Species profile extension

Create a dataframe species_profile from the unified species profiles.

6.6.1 Term mapping

Map the data to Species Profile.

6.6.1.1 taxonID

species_profile %<>% mutate(dwc_taxonID = paste0("https://www.gbif.org/species/", verificationKey))

6.6.1.2 isMarine

species_profile %<>% mutate(dwc_isMarine = marine)

6.6.1.3 isFreshwater

species_profile %<>% mutate(dwc_isFreshwater = freshwater)

6.6.1.4 isTerrestrial

species_profile %<>% mutate(dwc_isTerrestrial = terrestrial)

6.6.1.5 isInvasive

The source checklists currently do not include information on the invasive nature of the taxa. We plan to add that information in an update of the dataset.

species_profile %<>% mutate(dwc_isInvasive = "")

6.6.1.6 habitat

ISSG also used the field habitat, in which we will summarize the information from isMarine, isFreshwater and isTerrestrial.

Map habitat:

species_profile %<>% mutate(dwc_habitat = case_when(
  marine == "FALSE" & freshwater == "FALSE" & terrestrial == "TRUE" ~ "terrestrial",
  marine == "FALSE" & freshwater == "TRUE" & terrestrial == "FALSE" ~ "freshwater",
  marine == "FALSE" & freshwater == "TRUE" & terrestrial == "TRUE" ~ "freshwater|terrestrial",
  marine == "TRUE" & freshwater == "FALSE" & terrestrial == "FALSE" ~ "marine",
  marine == "TRUE" & freshwater == "FALSE" & terrestrial == "TRUE" ~ "marine|terrestrial",
  marine == "TRUE" & freshwater == "TRUE" & terrestrial == "FALSE" ~ "marine|freshwater",
  marine == "TRUE" & freshwater == "TRUE" & terrestrial == "TRUE" ~ "marine|freshwater|terrestrial"
))

Show mapping:
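One way to show such a mapping is to count the distinct combinations of input and output values. The sketch below does this on made-up data; the data frame name and the records column are illustrative assumptions.

```r
library(dplyr)

# Made-up species profiles with their mapped habitat
species_profile_example <- tibble(
  marine = c("FALSE", "TRUE", "FALSE"),
  freshwater = c("TRUE", "FALSE", "TRUE"),
  terrestrial = c("TRUE", "FALSE", "TRUE"),
  dwc_habitat = c("freshwater|terrestrial", "marine", "freshwater|terrestrial")
)

# One row per combination, with the number of records it applies to
habitat_mapping <- species_profile_example %>%
  count(marine, freshwater, terrestrial, dwc_habitat, name = "records") %>%
  arrange(desc(records))

habitat_mapping
```

Such an overview also makes unmapped combinations visible: records where all three flags are FALSE match none of the case_when() branches and get NA as habitat.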

6.6.1.7 source

See 6.3:

species_profile %<>% add_source_citation(
  new_column = "dwc_source",
  dataset_info = checklists
)

6.6.2 Post-processing

  1. Only keep the Darwin Core columns.

  2. Drop the dwc_ prefix.

  3. Sort on taxonID.

  4. Preview data:

  5. Save to CSV.

6.7 Description extension

6.7.1 Pre-processing

Create a dataframe description from the unified descriptions.

6.7.2 Term mapping

Map the data to Taxon Description.

6.7.2.1 taxonID

description %<>% mutate(dwc_taxonID = paste0("https://www.gbif.org/species/", verificationKey))

6.7.2.2 description

description %<>% mutate(dwc_description = description)

6.7.2.3 type

description %<>% mutate(dwc_type = type)

6.7.2.4 language

description %<>% mutate(dwc_language = "en")

6.7.2.5 source

See 6.3:

description %<>% add_source_citation(
  new_column = "dwc_source",
  dataset_info = checklists
)

6.7.3 Post-processing

  1. Only keep the Darwin Core columns.

  2. Drop the dwc_ prefix.

  3. Sort on taxonID.

  4. Preview data:

  5. Save to CSV.