This file describes the steps required to map the data to Darwin Core Taxon.
Load libraries:
Define data types
Read taxon data
Read literature references
Preview of the data:
Start with record-level terms which contain metadata about the dataset (which is generally the same for all records).
The following terms contain information about the taxon:
The information in `scientificName` will be a compilation of several fields: `sp_genus`, `sp_species`, `sp_authority`, `sp_subtaxon` and `sp_subtaxon_authority`. We paste this information together to generate the field `dwc_scientificName`. Before we concatenate this information, we clean the authorship information a bit:
Clean authorships:
Paste information together
Remove all NA
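The paste-and-clean step can be sketched as follows. This is a minimal illustration in base R, not necessarily the code used in the script: the example rows are hypothetical, and the sketch drops missing parts before joining instead of pasting first and removing the literal `NA` strings afterwards.

```r
# Hypothetical example rows; column names follow the fields named in the text.
taxon <- data.frame(
  sp_genus = c("Acer", "Rosa"),
  sp_species = c("negundo", "rugosa"),
  sp_authority = c("L.", "Thunb."),
  sp_subtaxon = c(NA_character_, "alba"),
  sp_subtaxon_authority = c(NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

parts <- taxon[, c("sp_genus", "sp_species", "sp_authority",
                   "sp_subtaxon", "sp_subtaxon_authority")]

# paste() would turn NA into the literal string "NA", so missing parts are
# dropped per row before the remaining parts are joined with single spaces.
taxon$dwc_scientificName <- vapply(seq_len(nrow(parts)), function(i) {
  x <- unlist(parts[i, ], use.names = FALSE)
  paste(x[!is.na(x)], collapse = " ")
}, character(1))

print(taxon$dwc_scientificName)
```

Dropping the missing parts first avoids accidentally stripping a genuine "NA" substring from a name.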
We use the GBIF nameparser to retrieve nomenclatural information for the scientific names in the checklist.
Show scientific names with nomenclatural issues, i.e. names that are not of `type = SCIENTIFIC` or that could not be fully parsed (fully parsed names have `parsed = TRUE` and `parsedpartially = FALSE`). Note: these are not necessarily incorrect:
## [1] 401
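The filter described above might look like the sketch below. The data frame is a hypothetical stand-in for the nameparser output and contains only the three columns named in the text; the example names and types are made up for illustration.

```r
# Hypothetical stand-in for the nameparser output (parsed_names).
parsed_names <- data.frame(
  scientificname = c("Acer negundo L.", "Oidium sp.", "Aus b?s"),
  type = c("SCIENTIFIC", "INFORMAL", "DOUBTFUL"),
  parsed = c(TRUE, TRUE, TRUE),
  parsedpartially = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# A name counts as fully parsed when parsed is TRUE and parsedpartially is FALSE.
fully_parsed <- parsed_names$parsed & !parsed_names$parsedpartially

# Keep names that are not SCIENTIFIC or that were not fully parsed.
issues <- parsed_names[parsed_names$type != "SCIENTIFIC" | !fully_parsed, ]

print(nrow(issues))
```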
Cleaning of taxa with nomenclatural issues is not within the scope of this mapping. However, we can perform some rough cleaning to eliminate the `INFORMAL` taxa, by removing `sp.`:
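As a rough illustration, a trailing `sp.` can be stripped with a regular expression. The names below are hypothetical, and the sketch assumes `sp.` only occurs at the end of a name; the original script may handle other positions.

```r
# Hypothetical example names.
names_raw <- c("Oidium sp.", "Acer negundo L.", "Candida sp.")

# Remove a trailing " sp."; the dot is escaped so that it matches literally.
names_clean <- sub("\\s+sp\\.$", "", names_raw)

print(names_clean)
```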
Some other taxa need special inspection, especially the doubtful ones (probably due to UTF-8 issues)
Specify the scientific names and associated values of `idspecies` to be removed from the taxon core:
Link those with the replacement values for `idspecies`:
Remove duplicated taxa from taxon core:
Save `remove_taxa` to scan other extension files for the presence of duplicated taxa:
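The removal and the later reuse of `remove_taxa` could be sketched like this. All identifiers and names are hypothetical, as are the column name `replacement_idspecies` and the `extension` data frame; the sketch only shows the general pattern of dropping duplicates from the core and repointing extension records.

```r
# Hypothetical taxon core in which record 2 duplicates record 1.
taxon <- data.frame(
  idspecies = c(1L, 2L, 3L),
  dwc_scientificName = c("Aus bus", "Aus b?s", "Cus dus"),
  stringsAsFactors = FALSE
)

# Lookup of taxa to remove and the idspecies that replaces each of them
# (replacement_idspecies is an assumed column name).
remove_taxa <- data.frame(
  idspecies = 2L,
  replacement_idspecies = 1L
)

# Remove the duplicated taxa from the taxon core.
taxon <- taxon[!taxon$idspecies %in% remove_taxa$idspecies, ]

# A hypothetical extension file can reuse remove_taxa to repoint its records.
extension <- data.frame(
  idspecies = c(2L, 3L),
  value = c("a", "b"),
  stringsAsFactors = FALSE
)
hit <- match(extension$idspecies, remove_taxa$idspecies)
extension$idspecies <- ifelse(
  is.na(hit),
  extension$idspecies,
  remove_taxa$replacement_idspecies[hit]
)
```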
No kingdom information is provided. This is not an obligatory field but strongly recommended. It can easily be derived from information in `phylum`:
However, 389 taxa have no phylum information. For these records, we try to derive phylum and kingdom information from `class`:
Some of these classes are not correct, e.g. `Nematoda` is a phylum, not a class. Cleaning this information is not within the scope of this mapping.
Not all phylum information is correct, e.g. `Bacteria` is a kingdom, not a phylum.
Trim whitespace in `phylum_complete`:
Phylum `Labyrinthista` does not exist. This should be phylum `Bigyra` (for Labyrinthula zosterae).
Based on this information, map `kingdom`:
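The derivation chain (class to phylum, trim, phylum to kingdom) can be sketched with named lookup vectors. The example rows and both lookup tables are illustrative assumptions covering only a few well-known groups; the real mapping needs an entry for every phylum and class in the checklist.

```r
# Hypothetical example rows.
taxon <- data.frame(
  phylum = c(" Arthropoda", NA, "Ascomycota", NA),
  class = c("Insecta", "Aves", NA, NA),
  stringsAsFactors = FALSE
)

# Illustrative lookups (assumed, deliberately incomplete).
class_to_phylum <- c(Insecta = "Arthropoda", Aves = "Chordata")
phylum_to_kingdom <- c(
  Arthropoda = "Animalia",
  Chordata = "Animalia",
  Ascomycota = "Fungi"
)

# Use phylum where present, otherwise derive it from class.
taxon$phylum_complete <- ifelse(
  is.na(taxon$phylum),
  unname(class_to_phylum[taxon$class]),
  taxon$phylum
)

# Trim leading and trailing whitespace before the lookup.
taxon$phylum_complete <- trimws(taxon$phylum_complete)

# Map kingdom; taxa without a usable phylum stay NA.
taxon$dwc_kingdom <- unname(phylum_to_kingdom[taxon$phylum_complete])
```

Records with neither phylum nor class keep `NA` for kingdom, which makes the remaining gaps easy to count.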
Information for `taxonRank` is provided in the field `subtaxon_rank`, but is only given for varieties, aggregates, hybrids, subspecies or forms. Taxon rank information can also be retrieved by the GBIF nameparser function. This information was retrieved earlier in this script, in the dataframe `parsed_names`. We add the information to `taxon`.
Inspect the `rankmarker` values generated by the GBIF nameparser and compare them with the `subtaxon_rank` information from the DAISIE checklist:

We decided to use the information contained in `rankmarker` because the GBIF `rankmarker` will provide cleaner information than `subtaxon_rank`, even if there might be some loss of information. However, `rankmarker` also contains `NA`. We inspect `dwc_scientificName` and `subtaxon_rank` for these values:
Concrete actions to undertake:

- scientific names without `subtaxon_rank`:
  - `Acaena anserinifolia x inermis`: species
  - `Dahlia coccinea x pinnata`: species
  - `Geoplana (=Australoplana) sanguinea`: species
  - `Hyalomma Scupense "Delpy, 1946"`: species
  - `Rosa Hollandica'`: species
  - Rest: genera
- scientific names with `subtaxon_rank` = `agg.`: genus
- scientific names with `subtaxon_rank` = `hyb`: species
- scientific name = `Oidium Pseudoidium`: wrong scientific name, refers to genus `Oidium` or `Pseudoidium`
Define taxa without `subtaxon_rank` which are in fact species:
Map taxonRank
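The rules listed above can be sketched as a small function. The example rows, the helper name `map_rank` and the `rank_names` lookup from rank markers to rank names are assumptions for illustration; `species_names` holds only two of the names from the list, and the `Oidium Pseudoidium` case is not handled here.

```r
# Hypothetical example rows.
taxon <- data.frame(
  dwc_scientificName = c("Acer negundo L.", "Dahlia coccinea x pinnata",
                         "Oidium", "Aus bus", "Rosa rugosa var. alba"),
  rankmarker = c("sp.", NA, NA, NA, "var."),
  subtaxon_rank = c(NA, NA, NA, "agg.", "var."),
  stringsAsFactors = FALSE
)

# Names without rank information that are in fact species (subset of the list).
species_names <- c("Acaena anserinifolia x inermis", "Dahlia coccinea x pinnata")

# Assumed translation of rank markers to rank names.
rank_names <- c("sp." = "species", "subsp." = "subspecies",
                "var." = "variety", "f." = "form")

# Hypothetical helper: prefer rankmarker, fall back on the rules in the text.
map_rank <- function(name, rankmarker, subtaxon_rank) {
  if (!is.na(rankmarker)) return(unname(rank_names[rankmarker]))
  if (is.na(subtaxon_rank)) {
    return(if (name %in% species_names) "species" else "genus")
  }
  if (subtaxon_rank == "agg.") return("genus")
  if (subtaxon_rank == "hyb") return("species")
  NA_character_
}

taxon$dwc_taxonRank <- mapply(
  map_rank,
  taxon$dwc_scientificName, taxon$rankmarker, taxon$subtaxon_rank,
  USE.NAMES = FALSE
)

# Summarize the mapping.
print(table(taxon$dwc_taxonRank, useNA = "ifany"))
```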
Summarize mapping:
`taxon` includes a reference to the consulted source via `sourceid`. We map the sources under `taxonRemarks`.
Rename `source` to `taxonRemarks`:
Only keep the Darwin Core columns
Drop the `dwc_` prefix
Sort on taxonID
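These three post-processing steps can be sketched in base R as follows. The example data frame is hypothetical; the only assumption carried over from the text is that mapped Darwin Core columns carry the `dwc_` prefix.

```r
# Hypothetical example with one raw column and two mapped columns.
taxon <- data.frame(
  idspecies = c(2L, 1L),
  dwc_taxonID = c("b", "a"),
  dwc_scientificName = c("Cus dus", "Aus bus"),
  stringsAsFactors = FALSE
)

# Only keep the Darwin Core columns (those starting with "dwc_").
taxon <- taxon[, grepl("^dwc_", names(taxon)), drop = FALSE]

# Drop the dwc_ prefix from the column names.
names(taxon) <- sub("^dwc_", "", names(taxon))

# Sort on taxonID.
taxon <- taxon[order(taxon$taxonID), , drop = FALSE]
```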
Export all `taxonID` values (required for filtering the records in the extensions):
Export `core_taxa`:
Preview data
Save to CSV:
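A minimal sketch of the export step with base R's `write.csv()`. The data frame, the file name and the use of a temporary directory are all hypothetical. Writing `NA` as an empty string is one common choice for Darwin Core files and is an assumption here.

```r
# Hypothetical mapped data.
taxon <- data.frame(
  taxonID = c("a", "b"),
  kingdom = c("Animalia", NA),
  stringsAsFactors = FALSE
)

# Hypothetical output path; the real script writes to its own location.
out_file <- file.path(tempdir(), "taxon.csv")

# Write missing values as empty strings and omit R's row names.
write.csv(taxon, file = out_file, na = "", row.names = FALSE,
          fileEncoding = "UTF-8")
```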