6 Darwin Core mapping
In this chapter we standardize the unified information to a Darwin Core checklist that can be harvested by GBIF.
6.1 Read unified data
checklists <- read_csv(here("data", "raw", "checklists.csv"))
input_taxa <- read_csv(here("data", "interim", "taxa_unified.csv"))
input_distributions <- read_csv(
  here("data", "interim", "distributions_unified.csv"),
  na = "",
  col_types = cols(
    .default = col_character(),
    verificationKey = col_double(),
    startYear = col_double(),
    endYear = col_double()
  )
)
input_speciesprofiles <- read_csv(
  here("data", "interim", "speciesprofiles_unified.csv")
)
input_descriptions <- read_csv(
  here("data", "interim", "descriptions_unified.csv")
)
6.2 Preview data
- Number of rows per file and corresponding mapping section in this chapter:
File | Number of rows | Mapping section |
---|---|---|
taxa | 3902 | 6.4 |
distributions | 10398 | 6.5 |
speciesprofiles | 3306 | 6.6 |
descriptions | 13446 | 6.7 |
- Number of taxa per checklist:
- Number of taxa per kingdom:
- Number of taxa per rank:
- Number of taxa and descriptions per type:
6.3 How we cite our sources
Each row of information in the Taxon core and the extensions is based on a specific source:
File | Source | Field for citation | Mapping section |
---|---|---|---|
Taxon core | a taxon in the GBIF Backbone Taxonomy | bibliographicCitation | 6.4 |
Distribution extension | one or more taxa in a source checklist | source | 6.5 |
Species profile extension | a taxon in a source checklist | source | 6.6 |
Description extension | a taxon in a source checklist | source | 6.7 |
To reference this source, we use the GBIF citation format for species pages, prefixed with the URL of that page. For example, for the distribution of Nymphaea marliacea Marliac this would be:
https://www.gbif.org/species/141264581: Nymphaea marliacea Marliac in Verloove F, Groom Q, Brosens D, Desmet P, Reyserhove L (2018). Manual of the Alien Plants of Belgium. Version 1.7. Botanic Garden Meise. Checklist dataset https://doi.org/10.15468/wtda1m.
This information is a combination of:
- taxonKey: e.g. 141264581 (contained in distributions.csv),
- scientificName: e.g. Nymphaea marliacea Marliac (contained in distributions.csv),
- citation: e.g. Verloove F, Groom Q, Brosens D, Desmet P, Reyserhove L (2018). Manual of the Alien Plants of Belgium. Version 1.7. Botanic Garden Meise. Checklist dataset https://doi.org/10.15468/wtda1m. (contained in checklists.csv)
To generate this full citation, we create a helper function add_source_citation(df, new_column, dataset_info).
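The body of add_source_citation() is not shown in this chapter. As a minimal sketch of the string it has to produce, the hypothetical function format_source_citation() below (not the chapter's helper) pastes the three ingredients listed above into the citation format. It assumes the three inputs are character vectors of equal length.

```r
# Hypothetical helper, for illustration only: builds one citation string per
# element from a taxon key, a scientific name and a dataset citation.
format_source_citation <- function(taxon_key, scientific_name, citation) {
  paste0(
    "https://www.gbif.org/species/", taxon_key, ": ",
    scientific_name, " in ", citation
  )
}

# Values taken from the example in this section.
example_source <- format_source_citation(
  taxon_key = "141264581",
  scientific_name = "Nymphaea marliacea Marliac",
  citation = paste(
    "Verloove F, Groom Q, Brosens D, Desmet P, Reyserhove L (2018).",
    "Manual of the Alien Plants of Belgium. Version 1.7.",
    "Botanic Garden Meise.",
    "Checklist dataset https://doi.org/10.15468/wtda1m."
  )
)
```

Because paste0() is vectorised, the same function could be applied to whole columns inside a mutate() call.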
6.4 Taxon core
6.4.1 Pre-processing
- Create a dataframe taxon from the unified taxa.
- Separate canonicalName into canonicalName_genus, canonicalName_species and canonicalName_infraspecific (on whitespace).
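The separation step could look roughly like the sketch below, which uses tidyr::separate() on a few toy rows. The choices for extra and fill are assumptions about how names with more or fewer than three parts should be handled, and taxon_example is a stand-in for the real taxon dataframe.

```r
library(dplyr)
library(tidyr)

# Toy rows; the real data come from taxa_unified.csv.
taxon_example <- tibble(
  canonicalName = c(
    "Nymphaea marliacea",
    "Acer",
    "Prunus domestica insititia"
  )
)

taxon_example <- separate(
  taxon_example,
  canonicalName,
  into = c(
    "canonicalName_genus",
    "canonicalName_species",
    "canonicalName_infraspecific"
  ),
  sep = "\\s+",     # split on whitespace
  remove = FALSE,   # keep the original canonicalName column
  extra = "merge",  # any further words stay in the last column
  fill = "right"    # missing parts become NA
)
```

With fill = "right", a genus-level name such as "Acer" gets NA for the species and infraspecific parts instead of triggering a warning.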
6.4.2 Term mapping
Map the data to Darwin Core Taxon.
6.4.2.2 license
The license under which (each record of) the unified checklist will be published should be the most restrictive license of the source checklists. The potential licenses are the three Creative Commons licenses supported by GBIF (ordered from least to most restrictive):
The actual licenses of the source checklists and the GBIF Backbone Taxonomy (which we use for the taxon core) are:
Based on the above ranking, the most restrictive license is:
## [1] "http://creativecommons.org/licenses/by/4.0/legalcode"
Which we use for our license:
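As an illustration of the selection logic, the sketch below ranks the three GBIF-supported licenses (CC0 1.0, CC BY 4.0, CC BY-NC 4.0) and picks the highest-ranked one among a set of input licenses. The vector source_licenses is a hypothetical input, not the actual list of licenses in the source checklists.

```r
# GBIF-supported licenses, ordered from least to most restrictive.
license_ranking <- c(
  "http://creativecommons.org/publicdomain/zero/1.0/legalcode", # CC0 1.0
  "http://creativecommons.org/licenses/by/4.0/legalcode",       # CC BY 4.0
  "http://creativecommons.org/licenses/by-nc/4.0/legalcode"     # CC BY-NC 4.0
)

# Hypothetical input: licenses found across the sources.
source_licenses <- c(
  license_ranking[1],
  license_ranking[1],
  license_ranking[2]
)

# Position of each license in the ranking; the highest position wins.
positions <- match(source_licenses, license_ranking)
stopifnot(!anyNA(positions)) # fail loudly on a license that is not ranked
most_restrictive <- license_ranking[max(positions)]
```

Matching against an explicit ranking vector keeps the ordering visible in one place and makes an unexpected license an error rather than a silent NA.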
6.4.2.3 rightsHolder
We do not set rightsHolder, as the taxon and its related information are based on different source checklists (which in turn are based on other sources), published by different organizations, and mostly released under CC0. Instead, we make an effort to cite the sources (see 6.3).
6.4.2.4 bibliographicCitation
See 6.3:
6.4.2.16 genus
genus is part of the higher classification, which is provided by the GBIF Backbone Taxonomy. We will use that, but note that for some synonyms it might differ from the genus in the scientific name:
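One way to list such cases is to compare the backbone genus with the genus part of the canonical name (canonicalName_genus from the pre-processing step). The sketch below does this on toy rows with placeholder names; the column names genus and scientificName are assumed to be present in the taxon dataframe.

```r
library(dplyr)

# Toy rows with placeholder names, not real taxa from the checklist.
taxon_example <- tibble(
  scientificName = c("Aus bus L.", "Xus yus Mill."),
  genus = c("Aus", "Zus"),              # genus per the backbone classification
  canonicalName_genus = c("Aus", "Xus") # genus as written in the name
)

# Rows where the backbone genus differs from the genus in the name.
genus_mismatch <- taxon_example %>%
  filter(genus != canonicalName_genus) %>%
  select(scientificName, genus, canonicalName_genus)
```

Note that filter() drops rows where either value is NA, so taxa without a genus in the backbone would not show up in this comparison.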
6.4.2.19 taxonRank
Inspect values:
Map values as is, in UPPERCASE, so it is clearer this information comes from the GBIF Backbone Taxonomy:
6.4.2.21 taxonRemarks
Here we list the checklists that were considered for unifying information about a taxon, i.e. the checklists we selected (see 1.1) in which the taxon appeared and got through verification (see 3.2). In the case of multiple considered checklists, it is possible that not all of them are selected as a source (see 6.3) when unifying information.
The datasetKey values of the considered checklists are listed in datasetKeys. Since we want the DOI instead, we will separate this information into columns, gather to rows and join with the DOI, and spread and combine again into a single column.
Map to taxonRemarks:
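The key-to-DOI replacement can be sketched as follows. This version swaps in tidyr::separate_rows() and summarize() as a shorter alternative to the separate, gather and spread sequence described above. The comma delimiter, the column names datasetKey and doi, the " | " separator and the placeholder keys and second DOI are all assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical lookup table: one row per checklist.
checklists_example <- tibble(
  datasetKey = c("key-a", "key-b"),
  doi = c(
    "https://doi.org/10.15468/wtda1m",
    "https://doi.org/10.0000/example" # placeholder DOI
  )
)

# Hypothetical taxa, with considered checklists as a delimited string.
taxon_example <- tibble(
  taxonKey = c(1, 2),
  datasetKeys = c("key-a,key-b", "key-b")
)

taxon_remarks <- taxon_example %>%
  separate_rows(datasetKeys, sep = ",") %>%                       # one row per key
  left_join(checklists_example, by = c("datasetKeys" = "datasetKey")) %>%
  group_by(taxonKey) %>%
  summarize(dois = paste(doi, collapse = " | "), .groups = "drop") # recombine
```

The resulting dois column can then be joined back to the taxon dataframe on taxonKey and used to build taxonRemarks.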
6.4.3 Post-processing
- Only keep the Darwin Core columns.
- Drop the dwc_ prefix.
- Sort on taxonID.
- Preview data:
- Save to CSV.
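The same post-processing steps recur for the core and each extension. A minimal sketch on toy data is shown below; the taxonID values and the output location (a temporary file) are placeholders, and it assumes all mapped columns carry the dwc_ prefix.

```r
library(dplyr)
library(readr)

# Toy dataframe mixing an input column with mapped dwc_ columns.
taxon_example <- tibble(
  taxonKey = c(2, 1),
  dwc_taxonID = c("taxon-b", "taxon-a"),
  dwc_taxonRank = c("SPECIES", "GENUS")
)

taxon_dwc <- taxon_example %>%
  select(starts_with("dwc_")) %>%         # only keep the Darwin Core columns
  rename_with(~ sub("^dwc_", "", .x)) %>% # drop the dwc_ prefix
  arrange(taxonID)                        # sort on taxonID

# Save to CSV, writing NA as empty strings.
output_file <- tempfile(fileext = ".csv")
write_csv(taxon_dwc, output_file, na = "")
```

Anchoring the pattern with ^ ensures that only a leading dwc_ is removed from the column names.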
6.5 Distribution extension
6.5.2 Term mapping
Map the data to Species Distribution. Because of the scope (see 1.3) of the dataset, we can set all distributions to occurrenceStatus:present and establishmentMeans:introduced.
6.5.2.7 eventDate
The distribution information applies to a certain date range, which we express here as an ISO 8601 date range yyyy/yyyy (startYear/endYear). How startYear and endYear are extracted from the source checklists is described in 5.3. As a result, each taxon in input_distributions has either both a startYear and an endYear, or no eventDate information at all.
distribution %<>% mutate(dwc_eventDate = case_when(
  is.na(startYear) & is.na(endYear) ~ "",
  TRUE ~ paste(startYear, endYear, sep = "/")
))
- Minimum year: 1201
- Maximum year: 2023
6.5.2.8 source
A distribution can have multiple source taxa (e.g. two verified synonyms from the same checklist). We therefore separate datasetKeys and scientificNames into a maximum of three columns and build a source (see 6.3) for each.
Combine the three source columns and remove NA:
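Combining the columns while skipping missing values can be done with tidyr::unite() and na.rm = TRUE, as in the sketch below. The column names source_1 to source_3, the " | " separator and the citation strings are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data: up to three source citations per distribution.
distribution_example <- tibble(
  taxonKey = c(1, 2),
  source_1 = c("citation A", "citation C"),
  source_2 = c("citation B", NA),
  source_3 = c(NA_character_, NA_character_)
)

distribution_example <- distribution_example %>%
  unite(
    col = "dwc_source",
    source_1, source_2, source_3,
    sep = " | ",
    na.rm = TRUE # drop NA values instead of pasting the text "NA"
  )
```

Without na.rm = TRUE, unite() would paste the literal string "NA" into the combined value for distributions with fewer than three sources.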
6.5.3 Post-processing
- Only keep the Darwin Core columns.
- Drop the dwc_ prefix.
- Sort on taxonID.
- Preview data:
- Save to CSV.
6.6 Species profile extension
Create a dataframe species_profile from the unified species profiles.
6.6.1 Term mapping
Map the data to Species Profile.
6.6.1.5 isInvasive
The source checklists currently do not include information on the invasive nature of the taxa. We plan to add that information in an update of the dataset.
6.6.1.6 habitat
ISSG also used the field habitat, in which we will summarize the information from isMarine, isFreshwater and isTerrestrial.
Map habitat:
species_profile %<>% mutate(dwc_habitat = case_when(
  marine == "FALSE" & freshwater == "FALSE" & terrestrial == "TRUE" ~ "terrestrial",
  marine == "FALSE" & freshwater == "TRUE" & terrestrial == "FALSE" ~ "freshwater",
  marine == "FALSE" & freshwater == "TRUE" & terrestrial == "TRUE" ~ "freshwater|terrestrial",
  marine == "TRUE" & freshwater == "FALSE" & terrestrial == "FALSE" ~ "marine",
  marine == "TRUE" & freshwater == "FALSE" & terrestrial == "TRUE" ~ "marine|terrestrial",
  marine == "TRUE" & freshwater == "TRUE" & terrestrial == "FALSE" ~ "marine|freshwater",
  marine == "TRUE" & freshwater == "TRUE" & terrestrial == "TRUE" ~ "marine|freshwater|terrestrial"
))
Show mapping:
6.6.1.7 source
See 6.3:
6.6.2 Post-processing
- Only keep the Darwin Core columns.
- Drop the dwc_ prefix.
- Sort on taxonID.
- Preview data:
- Save to CSV.
6.7 Description extension
6.7.3 Post-processing
- Only keep the Darwin Core columns.
- Drop the dwc_ prefix.
- Sort on taxonID.
- Preview data:
- Save to CSV.