This file describes the steps required to map the data to Darwin Core Taxon.

1 Setup

Load libraries:

2 Read data

  1. Define data types

  2. Read taxon data

  3. Read literature references
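The reading steps above can be sketched in base R as follows. This is a minimal illustration, not the script's actual code: the inline CSV, the column selection and the object name `raw_taxon` are assumptions made for the example.

```r
# Hypothetical inline CSV standing in for the taxon file (not the real data).
csv_text <- "idspecies,sp_genus,sp_species,sourceid
1,Acer,negundo,12
2,Rosa,rugosa,NA"

# 1. Define data types: read every column as character so that identifiers
#    are not silently converted to numbers.
col_types <- c(idspecies = "character", sp_genus = "character",
               sp_species = "character", sourceid = "character")

# 2. Read the data, treating the string "NA" and empty cells as missing.
raw_taxon <- read.csv(text = csv_text, colClasses = col_types,
                      na.strings = c("NA", ""), stringsAsFactors = FALSE)

str(raw_taxon)
```

The literature references could be read the same way, with their own set of column types.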

3 Map taxon core

Preview of the data:

Start with the record-level terms, which contain metadata about the dataset (generally the same for all records).
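Because these terms are constant across records, each one can be added as a single-valued column. The sketch below assumes a data frame called `taxon` and uses placeholder strings; the real values are set in the sections that follow and are not reproduced here.

```r
# Hypothetical two-row taxon table.
taxon <- data.frame(idspecies = c("1", "2"), stringsAsFactors = FALSE)

# Record-level terms covered in sections 3.1 to 3.6.
record_level <- c("language", "license", "rightsHolder",
                  "datasetID", "institutionCode", "datasetName")

# Add one dwc_ column per term; a single value is recycled over all rows.
# The values are placeholders, not the dataset's actual metadata.
for (term in record_level) {
  taxon[[paste0("dwc_", term)]] <- paste0("<", term, " placeholder>")
}

names(taxon)
```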

3.1 language

3.2 license

3.3 rightsHolder

3.4 datasetID

3.5 institutionCode

3.6 datasetName

The following terms contain information about the taxon:

3.7 taxonID

3.8 scientificName

The information in scientificName will be a compilation of several fields: sp_genus, sp_species, sp_authority, sp_subtaxon and sp_subtaxon_authority. We paste this information together to generate the field dwc_scientificName. Before we concatenate this information, we clean the authorship information a bit:

  1. Clean authorships:

  2. Paste information together

  3. Remove all NA

  4. We use the GBIF nameparser to retrieve nomenclatural information for the scientific names in the checklist.

  5. Show scientific names with nomenclatural issues, i.e. names that are not of type = SCIENTIFIC or that could not be fully parsed (a fully parsed name has parsed = TRUE and parsedpartially = FALSE). Note: these are not necessarily incorrect:

  6. Total number of scientific names with nomenclatural issues:
## [1] 401
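The steps of this section can be sketched in base R as follows. The rows are hypothetical, "cleaning" the authorship is assumed to mean trimming whitespace, and `parsed_names` is written by hand here instead of being requested from the GBIF name parser, so its values are illustrative only.

```r
# Hypothetical input rows using the field names mentioned above.
taxon <- data.frame(
  sp_genus = c("Acer", "Rosa", "Abies"),
  sp_species = c("negundo", "rugosa", "sp."),
  sp_authority = c(" L. ", "Thunb.", NA),
  sp_subtaxon = NA_character_,
  sp_subtaxon_authority = NA_character_,
  stringsAsFactors = FALSE
)

# 1. Clean authorships (assumption: trimming surrounding whitespace).
taxon$sp_authority <- trimws(taxon$sp_authority)

# 2. Paste the fields together; paste() writes missing values as "NA".
fields <- c("sp_genus", "sp_species", "sp_authority",
            "sp_subtaxon", "sp_subtaxon_authority")
pasted <- do.call(paste, unname(as.list(taxon[fields])))

# 3. Remove all "NA" tokens and collapse the leftover whitespace.
pasted <- gsub("\\bNA\\b", "", pasted, perl = TRUE)
taxon$dwc_scientificName <- trimws(gsub("\\s+", " ", pasted, perl = TRUE))

# 4./5. Hand-written stand-in for the parser result (illustrative values).
parsed_names <- data.frame(
  scientificname = taxon$dwc_scientificName,
  type = c("SCIENTIFIC", "SCIENTIFIC", "INFORMAL"),
  parsed = c(TRUE, TRUE, TRUE),
  parsedpartially = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Names with issues: not SCIENTIFIC, or not fully parsed.
fully_parsed <- parsed_names$parsed & !parsed_names$parsedpartially
issues <- parsed_names[parsed_names$type != "SCIENTIFIC" | !fully_parsed, ]

nrow(issues)
```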

Cleaning of taxa with nomenclatural issues is not within the scope of this mapping. However, we can perform some rough cleaning to eliminate the INFORMAL taxa, by removing sp.:
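A rough cleaning of this kind could look like the sketch below. It assumes that "sp." only occurs as the last token of a name; the example names are hypothetical.

```r
# Hypothetical names, two of them informal ("sp." as the final token).
scientific_names <- c("Abies sp.", "Acer negundo L.", "Aspergillus sp. ")

# Drop a trailing "sp." together with the whitespace around it.
cleaned_names <- trimws(sub("\\s+sp\\.\\s*$", "", scientific_names))

cleaned_names
```

Names in which "sp." appears elsewhere would need a different pattern.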

Some other taxa need special inspection, especially the doubtful ones (probably caused by UTF-8 encoding issues).

All taxa should be unique, so we scan for duplicated taxa and handle them as follows:

  1. Specify the scientific names and associated values of idspecies to be removed from the taxon core

  2. Link those with the replacement values for idspecies

  3. Remove duplicated taxa from taxon core:

  4. Save remove_taxa to scan other extension files for the presence of duplicated taxa
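The duplicate handling can be sketched as follows. In this illustration the records to remove are derived automatically by keeping the first occurrence of each name, which is an assumption: the text above only says they are specified. The data, the column `replacement_idspecies` and the output path are hypothetical.

```r
# Hypothetical taxon core with one duplicated scientific name.
taxon <- data.frame(
  idspecies = c("1", "2", "3", "4"),
  dwc_scientificName = c("Acer negundo L.", "Rosa rugosa Thunb.",
                         "Acer negundo L.", "Abies alba Mill."),
  stringsAsFactors = FALSE
)

# 1. Records to remove: every occurrence of a name after the first one.
is_dup <- duplicated(taxon$dwc_scientificName)
remove_taxa <- taxon[is_dup, c("dwc_scientificName", "idspecies")]

# 2. Link each removed idspecies to the idspecies that is kept.
kept <- taxon[!is_dup, ]
remove_taxa$replacement_idspecies <- kept$idspecies[
  match(remove_taxa$dwc_scientificName, kept$dwc_scientificName)
]

# 3. Remove the duplicated taxa from the taxon core.
taxon <- taxon[!taxon$idspecies %in% remove_taxa$idspecies, ]

# 4. Save remove_taxa so the extensions can be checked against it
#    (written to a temporary directory in this sketch).
write.csv(remove_taxa, file.path(tempdir(), "remove_taxa.csv"),
          row.names = FALSE)
```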

3.9 kingdom

No kingdom information is provided. This field is not obligatory but strongly recommended. It can easily be derived from the information in phylum:

However, 389 taxa have no phylum, so kingdom cannot be derived for them directly. For these records, we try to derive phylum and kingdom information from class:

  1. Complete the phylum information

Some of these classes are not correct, e.g. Nematoda is a phylum, not a class. Cleaning this information is not within the scope of this mapping.

Not all phylum information is correct, e.g. Bacteria is a kingdom, not a phylum.

  1. Trim whitespace in phylum_complete

  2. Phylum Labyrinthista does not exist. This should be phylum Bigyra (for Labyrinthula zosterae)

  3. Based on this information, map kingdom:
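The derivation of kingdom can be sketched with two lookup vectors. The rows and both lookups are small illustrative assumptions, not the mapping used for the full checklist; only the Labyrinthista to Bigyra correction is taken from the text above.

```r
# Hypothetical rows: one without phylum, one with stray whitespace,
# one with the non-existent phylum, one with a kingdom in the phylum field.
taxon <- data.frame(
  phylum = c("Chordata", NA, " Ascomycota ", "Labyrinthista", "Bacteria"),
  class = c("Aves", "Insecta", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Illustrative lookups (assumed for this sketch).
class_to_phylum <- c(Insecta = "Arthropoda", Aves = "Chordata")
phylum_to_kingdom <- c(Chordata = "Animalia", Arthropoda = "Animalia",
                       Ascomycota = "Fungi", Bigyra = "Chromista",
                       Bacteria = "Bacteria")

# Complete phylum from class where phylum is missing.
taxon$phylum_complete <- ifelse(is.na(taxon$phylum),
                                unname(class_to_phylum[taxon$class]),
                                taxon$phylum)

# 1. Trim whitespace.
taxon$phylum_complete <- trimws(taxon$phylum_complete)

# 2. Replace the non-existent phylum Labyrinthista with Bigyra.
fix <- which(taxon$phylum_complete == "Labyrinthista")
taxon$phylum_complete[fix] <- "Bigyra"

# 3. Map kingdom from the completed phylum.
taxon$dwc_kingdom <- unname(phylum_to_kingdom[taxon$phylum_complete])

taxon[, c("phylum_complete", "dwc_kingdom")]
```

Any phylum missing from the lookup yields NA for kingdom, which makes unmapped values easy to spot.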

3.10 phylum

3.11 class

3.12 order

3.13 family

3.14 genus

3.15 specificEpithet

3.16 infraspecificEpithet

3.17 taxonRank

Information for taxonRank is provided in the field subtaxon_rank, but is only given for varieties, aggregates, hybrids, subspecies or forms. Taxon rank information can also be retrieved by the GBIF nameparser function. This information was retrieved earlier in this script, in the dataframe parsed_names. We add the information to taxon.

  1. Inspect rankmarker values generated by the GBIF nameparser and compare them with the subtaxon_rank information from the DAISIE checklist:

We decided to use the information contained in rankmarker because GBIF rankmarker will provide cleaner information than subtaxon_rank, even if there might be some loss of information. However, rankmarker also contains NA. We inspect dwc_scientificName and subtaxon_rank for these values:

Concrete actions to undertake:

  - scientific names without subtaxon_rank:
    - Acaena anserinifolia x inermis: species
    - Dahlia coccinea x pinnata: species
    - Geoplana (=Australoplana) sanguinea: species
    - Hyalomma Scupense "Delpy, 1946": species
    - Rosa Hollandica': species
    - Rest: genera
  - scientific names with subtaxon_rank = agg.: genus
  - scientific names with subtaxon_rank = hyb: species
  - scientific name = Oidium Pseudoidium: wrong scientific name, refers to genus Oidium or Pseudoidium

  1. Define taxa without subtaxon_rank which are in fact species

  2. Map taxonRank

  3. Summarize mapping:
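The rank mapping described above can be sketched as follows. The rankmarker values ("sp.", "subsp.", "var.", "f.") and their translations are assumptions about the parser output, the rows are hypothetical (the last name is made up), and only two of the listed species names are included.

```r
# Hypothetical rows; rankmarker is NA where the parser returned no rank.
taxon <- data.frame(
  dwc_scientificName = c("Acer negundo L.", "Beta vulgaris subsp. maritima",
                         "Acaena anserinifolia x inermis", "Abies",
                         "Madeupia hybrida"),
  rankmarker = c("sp.", "subsp.", NA, NA, NA),
  subtaxon_rank = c(NA, "subsp.", NA, NA, "hyb"),
  stringsAsFactors = FALSE
)

# 1. Taxa without subtaxon_rank that are in fact species (subset of the list).
species_without_rank <- c("Acaena anserinifolia x inermis",
                          "Dahlia coccinea x pinnata")

# 2. Map taxonRank from rankmarker (assumed marker values).
rank_lookup <- c("sp." = "species", "subsp." = "subspecies",
                 "var." = "variety", "f." = "form")
taxon$dwc_taxonRank <- unname(rank_lookup[taxon$rankmarker])

# Fill the records without rankmarker: hybrids and the listed names become
# species, everything else (including aggregates) becomes genus.
missing <- is.na(taxon$rankmarker)
is_hyb <- !is.na(taxon$subtaxon_rank) & taxon$subtaxon_rank == "hyb"
is_species <- is_hyb | taxon$dwc_scientificName %in% species_without_rank
taxon$dwc_taxonRank[missing] <- ifelse(is_species[missing], "species", "genus")

# 3. Summarize the mapping.
table(taxon$dwc_taxonRank, useNA = "ifany")
```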

3.18 taxonRemarks

taxon includes a reference to the consulted source via sourceid. We map the sources under taxonRemarks.

Rename source to taxonRemarks:
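A minimal sketch of this lookup, assuming the literature references are held in a data frame with the columns sourceid and source. Both tables and the reference strings are hypothetical.

```r
# Hypothetical taxon records and literature references.
taxon <- data.frame(idspecies = c("1", "2", "3"),
                    sourceid = c("10", "11", NA),
                    stringsAsFactors = FALSE)
literature_references <- data.frame(
  sourceid = c("10", "11"),
  source = c("Hypothetical reference A", "Hypothetical reference B"),
  stringsAsFactors = FALSE
)

# Look up the source for every taxon (equivalent to a left join on sourceid);
# taxa without a matching sourceid get NA.
taxon$source <- literature_references$source[
  match(taxon$sourceid, literature_references$sourceid)
]

# Rename source to the Darwin Core term.
names(taxon)[names(taxon) == "source"] <- "dwc_taxonRemarks"

taxon
```

Using match() keeps the row order of taxon unchanged, which a merge() would not guarantee.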

4 Post-processing

  1. Only keep the Darwin Core columns

  2. Drop the dwc_ prefix

  3. Sort on taxonID

  4. Export all taxonIDs (required for filtering the records in the extensions):

  5. Export core_taxa:

  6. Preview data

  7. Save to CSV:
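The post-processing steps can be sketched in base R as follows. The input rows, the object name `taxon_ids` and the output file name are assumptions for this example; the file is written to a temporary directory.

```r
# Hypothetical mapped data: one raw column and two dwc_ columns.
taxon <- data.frame(
  idspecies = c("2", "1"),
  dwc_taxonID = c("t2", "t1"),
  dwc_scientificName = c("Rosa rugosa Thunb.", "Acer negundo L."),
  stringsAsFactors = FALSE
)

# 1. Only keep the Darwin Core columns.
core_taxa <- taxon[, grepl("^dwc_", names(taxon)), drop = FALSE]

# 2. Drop the dwc_ prefix.
names(core_taxa) <- sub("^dwc_", "", names(core_taxa))

# 3. Sort on taxonID.
core_taxa <- core_taxa[order(core_taxa$taxonID), , drop = FALSE]

# 4. Keep the taxonIDs for filtering the extensions.
taxon_ids <- core_taxa$taxonID

# 6. Preview the data.
head(core_taxa)

# 7. Save to CSV, writing missing values as empty cells.
out_file <- file.path(tempdir(), "taxon.csv")
write.csv(core_taxa, out_file, row.names = FALSE, na = "")
```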