This file describes the steps required to reorganize the input datasets. The goal of this reorganization is to improve the readability of the mapping scripts later.
Load libraries:
Define data types
Read literature references
Read taxon data
Read distribution data:
The file `literature_references` includes all sources used to compile the DAISIE checklist.
`literature_references`:

Each source can be identified by its `sourceid` and is composed of:

- a citation (`shortref`) and/or
- a full reference (`longref`) and/or
- a URL (`url`)
Most input files use the field `sourceid` to link with the sources included in `literature_references`. In the Darwin Core mapping process, we want to replace the `sourceid` with the full reference to each source. To accomplish this, we need to compile the full reference from:

- `shortref` and/or
- `longref` and/or
- `url`
Cleaning the content of `longref` is outside the scope of this mapping. However, several `longref` values start with a number (e.g. `1`, `1, 28`). These can easily be removed.
Remove numbers from longref
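A minimal sketch of this step in base R. The sample rows and the exact pattern are assumptions made for illustration: the pattern strips a leading run of digits, optionally followed by more comma-separated numbers, which covers the `1` and `1, 28` cases mentioned above. A reference that legitimately starts with a number would also be trimmed by this pattern.

```r
# Hypothetical sample rows; the real values come from literature_references
literature_references <- data.frame(
  sourceid = 1:3,
  longref = c(
    "1 Smith J (2001) Alien plants.",
    "1, 28 Jones A (1999) Flora.",
    "Brown K (2005) Fauna."
  ),
  stringsAsFactors = FALSE
)

# Assumed pattern: leading digits, optional ", digits" repeats, trailing spaces
leading_numbers <- "^[0-9]+(,[[:space:]]*[0-9]+)*[[:space:]]*"
literature_references$longref <- sub(
  leading_numbers, "", literature_references$longref
)

# Check that no longref value starts with a digit anymore
numbers_removed <- !any(grepl("^[0-9]", literature_references$longref))
numbers_removed
```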
Verify if numbers are removed (should be `TRUE`):
## [1] TRUE
shortref
Cleaning of this field is outside the scope of this mapping.
url
This information is, in most cases, wrong: `url` often refers to scientific names rather than to a real URL. We remove the scientific names from `url` and keep real URL values only.
Remove non-valid url values
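One way this could look in base R. The rule for what counts as a real URL is an assumption here (values starting with `http://`, `https://` or `www.`), and the sample values are hypothetical. Everything else is set to `NA`.

```r
# Hypothetical sample rows; the real values come from literature_references
literature_references <- data.frame(
  sourceid = 1:4,
  url = c("http://example.org/checklist", "Acer negundo", NA,
          "www.example.org/page"),
  stringsAsFactors = FALSE
)

# Assumed rule: a real URL starts with http://, https:// or www.
is_real_url <- grepl("^(https?://|www\\.)", literature_references$url)

# Keep real URLs only; scientific names (and anything else) become NA
literature_references$url[!is_real_url] <- NA_character_
literature_references$url
```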
Verify cleaning step
`shortref`, `longref` and `url`
For most records, one or more of the above-mentioned fields are empty. We need to know which (and how many) meaningful combinations of these fields occur in `literature_references`:
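The overview of combinations could be produced along these lines. This is a sketch with hypothetical sample rows, and it treats both `NA` and the empty string as "empty", which is an assumption about how missing values are encoded.

```r
# Hypothetical sample rows; the real values come from literature_references
literature_references <- data.frame(
  shortref = c("Smith 2001", NA, "Jones 1999", NA),
  longref  = c("Smith J (2001) Alien plants.", "Brown K (2005) Fauna.", NA, NA),
  url      = c(NA, NA, "http://example.org", ""),
  stringsAsFactors = FALSE
)

# A field counts as filled when it is neither NA nor an empty string
is_filled <- function(x) !is.na(x) & x != ""

# Cross-tabulate which fields are filled, one row per combination
combinations <- as.data.frame(table(
  shortref = is_filled(literature_references$shortref),
  longref  = is_filled(literature_references$longref),
  url      = is_filled(literature_references$url)
))

# Keep only the combinations that actually occur
observed <- combinations[combinations$Freq > 0, ]
observed
```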
We use the following rules to generate `source`:
We can now use `literature_references` to add the full reference (`source`) to the generated Darwin Core files (except the distribution extension, see further), using `sourceid` as the link.
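As an illustration of linking on `sourceid`, a left join in base R could look like the sketch below. The `taxon` data frame and its `taxon_id` column are hypothetical stand-ins for a generated Darwin Core file. Records without a matching source keep `NA` in `source`.

```r
# Hypothetical lookup table with the compiled full reference per sourceid
literature_references <- data.frame(
  sourceid = c(1, 2),
  source = c("Smith J (2001) Alien plants.", "Brown K (2005) Fauna."),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for a generated Darwin Core file
taxon <- data.frame(
  taxon_id = c("a", "b", "c"),
  sourceid = c(2, 1, NA),
  stringsAsFactors = FALSE
)

# Left join: keep every taxon record, add source where sourceid matches
taxon_with_source <- merge(
  taxon, literature_references,
  by = "sourceid", all.x = TRUE, sort = FALSE
)
taxon_with_source
```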
The distribution extension has no `sourceid` field to link with the full reference in `literature_references`. In the case of the distribution extension, a specific reference can only be linked to a record using a combination of:

- `id_sp_region`
- `field_name`
We save this information in a separate data frame `distribution_sources` here and export it, so it can be used later to generate the full reference for the distribution extension.
Generate `distribution_sources`:
Export as distribution_sources
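A sketch of both steps, assuming that `literature_references` carries `id_sp_region` and `field_name` next to the compiled `source`. The sample values, the CSV format and the output path (a temporary directory) are all assumptions for illustration.

```r
# Hypothetical sample rows; column layout is an assumption
literature_references <- data.frame(
  sourceid     = c(1, 1, 2, 2),
  id_sp_region = c(10, 10, 11, 11),
  field_name   = c("field_a", "field_b", "field_a", "field_a"),
  source       = c("Smith J (2001) Alien plants.",
                   "Smith J (2001) Alien plants.",
                   "Brown K (2005) Fauna.",
                   "Brown K (2005) Fauna."),
  stringsAsFactors = FALSE
)

# One row per unique combination of id_sp_region, field_name and source
distribution_sources <- unique(
  literature_references[, c("id_sp_region", "field_name", "source")]
)

# Export; the file name and location are placeholders
out_file <- file.path(tempdir(), "distribution_sources.csv")
write.csv(distribution_sources, out_file, row.names = FALSE, na = "")
```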
Remove duplicated sources in `literature_references` (duplicates are only needed for the mapping of the distribution extension) and export the result as an interim file.
Remove duplicates:
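A minimal sketch, assuming a duplicate means a repeated `sourceid` and that only `sourceid` and `source` need to be kept in the interim file. Both points are assumptions, as are the sample rows.

```r
# Hypothetical sample rows with one repeated sourceid
literature_references <- data.frame(
  sourceid     = c(1, 1, 2),
  id_sp_region = c(10, 12, 11),
  source       = c("Smith J (2001) Alien plants.",
                   "Smith J (2001) Alien plants.",
                   "Brown K (2005) Fauna."),
  stringsAsFactors = FALSE
)

# Keep the first row for each sourceid and only the linking columns
keep <- !duplicated(literature_references$sourceid)
interim_literature_references <-
  literature_references[keep, c("sourceid", "source")]
interim_literature_references
```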
Export as `interim_literature_references`:
`input_taxon` has one field that can’t be used for the mapping of the Taxon Core: `ecofunct_group`. We will integrate this information in the description extension.
Save information on ecofunctional groups (create `ecofunct_group`):
Remove records for which `ecofunct_group` is empty:
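The two steps above can be sketched in base R as follows. The identifier column `taxon_key` is a hypothetical name, the sample values are made up, and "empty" is assumed to mean either `NA` or an empty string.

```r
# Hypothetical sample rows; taxon_key is an assumed identifier column
input_taxon <- data.frame(
  taxon_key      = c(1, 2, 3),
  ecofunct_group = c("group_a", NA, ""),
  stringsAsFactors = FALSE
)

# Save the ecofunctional group information in its own data frame
ecofunct_group <- input_taxon[, c("taxon_key", "ecofunct_group")]

# Drop records where ecofunct_group is NA or an empty string
has_group <- !is.na(ecofunct_group$ecofunct_group) &
  ecofunct_group$ecofunct_group != ""
ecofunct_group <- ecofunct_group[has_group, ]
ecofunct_group
```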
Export as `ecofunctional_group`: