This file describes the steps required to map the data to Darwin Core Taxon.
Load libraries:
Define data types
Read taxon data
Read literature references
Preview of the data:
Start with record-level terms which contain metadata about the dataset (which is generally the same for all records).
The following terms contain information about the taxon:
The information in scientificName will be a compilation of several fields: sp_genus, sp_species, sp_authority, sp_subtaxon and sp_subtaxon_authority. We paste this information together to generate the field dwc_scientificName. Before we concatenate this information, we clean the authorship information a bit:
Clean authorships:
Paste information together
Remove all NA
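The paste-and-clean step above can be sketched roughly as follows. The mapping's own code is not shown in this document, so this is a minimal Python illustration: the function name build_scientific_name, the use of None for NA, and the example record are assumptions, and only the five field names come from the text above.

```python
# Order in which the name parts are pasted together (field names from the text above).
NAME_FIELDS = [
    "sp_genus",
    "sp_species",
    "sp_authority",
    "sp_subtaxon",
    "sp_subtaxon_authority",
]


def build_scientific_name(record):
    """Join the name parts of one record, skipping missing values.

    Missing values are assumed to be None, an empty string or the literal "NA".
    """
    parts = []
    for field in NAME_FIELDS:
        value = record.get(field)
        if value is None:
            continue
        value = " ".join(str(value).split())  # collapse stray whitespace
        if value and value != "NA":
            parts.append(value)
    return " ".join(parts)


# Hypothetical example record (not taken from the checklist).
example = {
    "sp_genus": "Acer",
    "sp_species": "negundo",
    "sp_authority": " L. ",
    "sp_subtaxon": None,
    "sp_subtaxon_authority": "NA",
}
name = build_scientific_name(example)
```

Skipping missing parts before joining avoids having to strip "NA" fragments out of the pasted string afterwards.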
We use the GBIF nameparser to retrieve nomenclatural information for the scientific names in the checklist.
Show scientific names with nomenclatural issues, i.e. names that are not of type = SCIENTIFIC or that could not be fully parsed (a fully parsed name has parsed = TRUE and parsedpartially = FALSE). Note: these are not necessarily incorrect:
## [1] 401
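The filter behind this count can be sketched as below. This is a Python illustration under assumptions: the field names type, parsed and parsedpartially follow the text above, the field scientificname and the function name has_nomenclatural_issue are assumed, and the three rows are hypothetical, not actual parser output.

```python
def has_nomenclatural_issue(parsed):
    """Return True if a parsed name is not SCIENTIFIC or was not fully parsed."""
    fully_parsed = (
        parsed.get("parsed") is True and parsed.get("parsedpartially") is False
    )
    return parsed.get("type") != "SCIENTIFIC" or not fully_parsed


# Hypothetical rows shaped like name parser results.
parsed_names = [
    {"scientificname": "Acer negundo L.", "type": "SCIENTIFIC",
     "parsed": True, "parsedpartially": False},
    {"scientificname": "Acer sp.", "type": "INFORMAL",
     "parsed": True, "parsedpartially": False},
    {"scientificname": "Oidium Pseudoidium", "type": "SCIENTIFIC",
     "parsed": True, "parsedpartially": True},
]

issues = [p["scientificname"] for p in parsed_names if has_nomenclatural_issue(p)]
```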
Cleaning of taxa with nomenclatural issues is not within the scope of this mapping. However, we can perform some rough cleaning to eliminate the INFORMAL taxa, by removing sp.:
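A minimal Python sketch of this rough cleaning, assuming that only a trailing "sp." is removed (the exact rule used in the mapping is not shown here, and the function name strip_sp is hypothetical):

```python
import re


def strip_sp(name):
    """Remove a trailing ' sp.' so that an informal name reduces to its genus."""
    return re.sub(r"\s+sp\.\s*$", "", name.strip())


cleaned = strip_sp("Acer sp.")
```

Anchoring the pattern at the end of the string leaves epithets that merely start with "sp" untouched.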
Some other taxa need special inspection, especially the doubtful ones (probably due to UTF-8 issues)
Specify the scientific names and associated values of idspecies to be removed from the taxon core:
Link those with the replacement values for idspecies
Remove duplicated taxa from taxon core:
Save remove_taxa to scan other extension files for the presence of duplicated taxa
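The remove-and-replace logic of the last few steps can be sketched as follows. This is a Python illustration only: the idspecies values and rows are hypothetical, remove_taxa is modelled as a simple mapping from removed idspecies to replacement idspecies, and the helper resolve_idspecies is an assumed name.

```python
# Hypothetical taxon rows: 102 duplicates 101.
taxon = [
    {"idspecies": 101, "dwc_scientificName": "Acer negundo L."},
    {"idspecies": 102, "dwc_scientificName": "Acer negundo L."},
    {"idspecies": 103, "dwc_scientificName": "Acer"},
]

# Removed idspecies -> replacement idspecies (hypothetical values).
remove_taxa = {102: 101}

# Drop the duplicated taxa from the taxon core.
taxon_core = [row for row in taxon if row["idspecies"] not in remove_taxa]


def resolve_idspecies(idspecies):
    """Point a record in an extension file at the taxon that was kept."""
    return remove_taxa.get(idspecies, idspecies)
```

Keeping the mapping, rather than just the list of removed identifiers, lets the extension files redirect their records instead of losing them.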
No kingdom information is provided. This is not an obligatory field but strongly recommended. It can easily be derived from information in phylum:
However, 389 taxa have no phylum information. For these records, we try to derive phylum and kingdom information from class:
Some of these classes are not correct, e.g. Nematoda is a phylum, not a class. Cleaning this information is not within the scope of this mapping.
Not all phylum information is correct, e.g. Bacteria is a kingdom, not a phylum.
Trim whitespace in phylum_complete:
Phylum Labyrinthista does not exist. This should be phylum Bigyra (for Labyrinthula zosterae):
Based on this information, map kingdom:
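A lookup-based sketch of the kingdom mapping, in Python. The lookup table is a small illustrative subset, not the mapping actually used; the names PHYLUM_CORRECTIONS, KINGDOM_BY_PHYLUM and map_kingdom are assumptions, and placing Bigyra in Chromista is assumed here rather than stated in the text above.

```python
# Correction described above: Labyrinthista should be Bigyra.
PHYLUM_CORRECTIONS = {"Labyrinthista": "Bigyra"}

# Illustrative subset only; a real mapping would list every phylum in the data.
KINGDOM_BY_PHYLUM = {
    "Chordata": "Animalia",
    "Arthropoda": "Animalia",
    "Mollusca": "Animalia",
    "Ascomycota": "Fungi",
    "Basidiomycota": "Fungi",
    "Magnoliophyta": "Plantae",
    "Bigyra": "Chromista",   # assumption, see lead-in
    "Bacteria": "Bacteria",  # recorded as a phylum but actually a kingdom
}


def map_kingdom(phylum):
    """Return the kingdom for a phylum, or None when it cannot be derived."""
    if phylum is None:
        return None
    phylum = phylum.strip()  # trim whitespace
    phylum = PHYLUM_CORRECTIONS.get(phylum, phylum)
    return KINGDOM_BY_PHYLUM.get(phylum)
```

Returning None for unknown phyla keeps unmapped records visible, so they can be reviewed rather than silently assigned.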
Information for taxonRank is provided in the field subtaxon_rank, but is only given for varieties, aggregates, hybrids, subspecies or forms. Taxon rank information can also be retrieved by the GBIF nameparser function. This information was retrieved earlier in this script, in the dataframe parsed_names. We add the information to taxon.
Inspect the rankmarker values generated by the GBIF nameparser and compare them with the subtaxon_rank information from the DAISIE checklist:
We decided to use the information contained in rankmarker because the GBIF rankmarker provides cleaner information than subtaxon_rank, even if there might be some loss of information. However, rankmarker also contains NA. We inspect dwc_scientificName and subtaxon_rank for these values:
Concrete actions to undertake:
- scientific names without subtaxon_rank:
  - Acaena anserinifolia x inermis: species
  - Dahlia coccinea x pinnata: species
  - Geoplana (=Australoplana) sanguinea: species
  - Hyalomma Scupense "Delpy, 1946": species
  - Rosa Hollandica': species
  - Rest: genera
- scientific names with subtaxon_rank = agg.: genus
- scientific names with subtaxon_rank = hyb: species
- scientific name = Oidium Pseudoidium: wrong scientific name, refers to genus Oidium or Pseudoidium
Define taxa without subtaxon_rank which are in fact species
Map taxonRank
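The rank rules listed above can be sketched as one function. This is a Python illustration under assumptions: RANK_BY_MARKER (the translation of rank markers into rank names) is not given in this document, missing values are modelled as None, and matching the species exceptions on the exact name string is a simplification.

```python
# Names without subtaxon_rank that are in fact species (from the list above).
SPECIES_WITHOUT_RANK = {
    "Acaena anserinifolia x inermis",
    "Dahlia coccinea x pinnata",
    "Geoplana (=Australoplana) sanguinea",
    'Hyalomma Scupense "Delpy, 1946"',
    "Rosa Hollandica'",
}

# Assumed translation of rank markers into rank names.
RANK_BY_MARKER = {
    "sp.": "species",
    "subsp.": "subspecies",
    "var.": "variety",
    "f.": "form",
}


def map_taxon_rank(scientific_name, rankmarker, subtaxon_rank):
    """Prefer the parser's rankmarker; fall back on the rules for missing values."""
    if rankmarker is not None:
        return RANK_BY_MARKER.get(rankmarker, rankmarker)
    if subtaxon_rank == "agg.":
        return "genus"
    if subtaxon_rank == "hyb":
        return "species"
    if subtaxon_rank is None:
        if scientific_name in SPECIES_WITHOUT_RANK:
            return "species"
        return "genus"
    return None  # anything else is left for manual inspection


rank = map_taxon_rank("Dahlia coccinea x pinnata", None, None)
```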
Summarize mapping:
taxon includes a reference to the consulted source via sourceid. We map the sources under taxonRemarks.
Rename source to taxonRemarks:
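A minimal Python sketch of this lookup, assuming the literature references carry the columns sourceid and source (as the text above suggests). The function name add_taxon_remarks and the example rows are hypothetical.

```python
def add_taxon_remarks(taxon, literature):
    """Attach the source text of each record's sourceid as dwc_taxonRemarks."""
    source_by_id = {ref["sourceid"]: ref["source"] for ref in literature}
    return [
        dict(row, dwc_taxonRemarks=source_by_id.get(row["sourceid"]))
        for row in taxon
    ]


# Hypothetical rows.
literature = [{"sourceid": 1, "source": "Hypothetical reference A"}]
taxon = [
    {"idspecies": 101, "sourceid": 1},
    {"idspecies": 103, "sourceid": 2},  # no matching reference
]
taxon_with_remarks = add_taxon_remarks(taxon, literature)
```

Records whose sourceid has no matching reference keep an empty taxonRemarks, which mirrors a left join.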
Only keep the Darwin Core columns
Drop the dwc_ prefix
Sort on taxonID
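The three post-processing steps above (keep the Darwin Core columns, drop the prefix, sort) can be sketched in Python as follows. The function name finalize_core and the example rows are hypothetical.

```python
PREFIX = "dwc_"


def finalize_core(rows):
    """Keep only dwc_ columns, drop the prefix and sort on taxonID."""
    core = [
        {key[len(PREFIX):]: value
         for key, value in row.items()
         if key.startswith(PREFIX)}
        for row in rows
    ]
    return sorted(core, key=lambda row: row["taxonID"])


# Hypothetical rows mixing raw and mapped columns.
rows = [
    {"idspecies": 103, "dwc_taxonID": 103, "dwc_scientificName": "Acer"},
    {"idspecies": 101, "dwc_taxonID": 101, "dwc_scientificName": "Acer negundo L."},
]
core_taxa = finalize_core(rows)
```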
Export all taxonIDs (required for filtering the records in the extensions):
Export core_taxa:
Preview data
Save to CSV:
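A short Python sketch of the export, writing missing values as empty fields. The function name to_csv_text, the column selection and the example rows are hypothetical, and the text is written to an in-memory buffer here rather than to the actual output file.

```python
import csv
import io


def to_csv_text(rows, fieldnames):
    """Serialize rows to CSV text; None is written as an empty field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical core rows.
core_taxa = [
    {"taxonID": 101, "scientificName": "Acer negundo L.", "kingdom": "Plantae"},
    {"taxonID": 103, "scientificName": "Acer", "kingdom": None},
]
csv_text = to_csv_text(core_taxa, ["taxonID", "scientificName", "kingdom"])
```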