This file describes the steps required to reorganize the input datasets. The goal of this reorganization is to improve the readability of the mapping scripts later.

1 Setup

Load libraries:

2 Read data

  1. Define data types

  2. Read literature references

  3. Read taxon data

  4. Read distribution data:

3 Compile reference data

The file literature_references includes all sources used to compile the DAISIE checklist.

  1. Inspect literature_references:

Each source is identified by its sourceid and consists of one or more of:

  • a citation (shortref)
  • a full reference (longref)
  • a URL (url)

Most input files use the field sourceid to link to the sources in literature_references. In the Darwin Core mapping process, we want to replace the sourceid with the full reference to each source. To do this, we compile the full reference from shortref, longref and/or url.

  1. Inspect longref:

Cleaning the content of this field is outside the scope of this mapping. However, several longref values start with a number (e.g. 1, 1, 28). These can easily be removed.

  1. Remove numbers from longref

  2. Verify if numbers are removed (should be TRUE)

## [1] TRUE
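
The removal and check above could look roughly like the following. This is a minimal base R sketch, not the original chunk: the sample longref values are invented, and it assumes the leading number is always followed by whitespace.

```r
# Hypothetical sample values; the real longref column is not shown in this file.
longref <- c("1 Smith J. (2001) Alien plants of Europe.",
             "28 Jones A. (1999) Invasive fishes.",
             "Brown K. (2005) Checklist of beetles.",
             NA)

# Strip a leading run of digits together with the whitespace that follows it.
# Assumption: no reference legitimately starts with "<number><space>".
longref_clean <- sub("^[0-9]+[[:space:]]+", "", longref)

# Verify that no non-missing value still starts with a digit.
numbers_removed <- all(!grepl("^[0-9]", longref_clean[!is.na(longref_clean)]))
numbers_removed
```

Missing values pass through sub() unchanged, so they are excluded from the check rather than counted as failures.
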
  1. Inspect shortref

Cleaning this field is outside the scope of this mapping.

  1. Inspect url

In most cases this information is wrong: url often contains a scientific name rather than a real URL. We remove the scientific names from url and keep only the real URL values.

  1. Remove non-valid url values

  2. Verify cleaning step
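
One way to sketch the url cleaning and its verification in base R is shown below. The sample values and the rule for what counts as a real URL (a value starting with http://, https:// or www.) are assumptions made for illustration; the original chunk may use a different criterion.

```r
# Hypothetical sample values mixing real URLs and scientific names.
url <- c("http://www.example.org/species/123",
         "Acer negundo",
         "www.example.org/checklist",
         "",
         NA)

# Assumption: a real URL starts with http://, https:// or www.
url_pattern <- "^(https?://|www\\.)"
is_url <- grepl(url_pattern, url, ignore.case = TRUE)

# Keep matching values, set everything else to missing.
url_clean <- ifelse(is_url, url, NA_character_)

# Verify: every remaining non-missing value matches the URL pattern.
urls_valid <- all(grepl(url_pattern, url_clean[!is.na(url_clean)],
                        ignore.case = TRUE))
urls_valid
```
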

  1. Show mapping of full reference from shortref, longref and url

For most records, one or more of the above-mentioned fields are empty. We need to know which (and how many) meaningful combinations of these fields occur in literature_references:
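
As an illustration of how such combinations can be counted, the base R sketch below cross-tabulates which of the three fields are filled. The data frame contents are invented, and it assumes that empty fields are stored as NA.

```r
# Hypothetical stand-in for literature_references.
literature_references <- data.frame(
  sourceid = 1:4,
  shortref = c("Smith 2001", "Jones 1999", NA, NA),
  longref  = c("Smith J. (2001) Alien plants of Europe.", NA, NA,
               "Brown K. (2005) Checklist of beetles."),
  url      = c(NA, "http://www.example.org", "http://www.example.org/a", NA),
  stringsAsFactors = FALSE
)

# Cross-tabulate the presence of each field (assumes empty means NA).
combos <- with(literature_references, table(
  has_shortref = !is.na(shortref),
  has_longref  = !is.na(longref),
  has_url      = !is.na(url)
))

# Long format, keeping only the combinations that actually occur.
combos_df <- as.data.frame(combos, responseName = "records")
combos_df <- combos_df[combos_df$records > 0, ]
combos_df
```
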

We use the following rules to generate source:

  • If a longref is provided, never use the shortref. The url is integrated where provided.
  • If no longref is provided, use the shortref and/or the url.
  1. Generate full reference (source)
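
The rules above can be expressed as a small helper. This is a sketch under stated assumptions, not the original code: build_source is a hypothetical function name, missing values are assumed to be NA, and joining the reference text and the url with a single space is an arbitrary choice.

```r
# Hypothetical helper implementing the two rules for generating source.
build_source <- function(shortref, longref, url) {
  # Rule 1: prefer longref; fall back to shortref only when longref is missing.
  ref <- ifelse(!is.na(longref), longref, shortref)
  # Integrate the url where provided; use the url alone if there is no text.
  # Assumption: reference text and url are separated by a single space.
  ifelse(is.na(ref), url,
         ifelse(is.na(url), ref, paste(ref, url)))
}

# Hypothetical example records covering the main combinations.
refs <- data.frame(
  shortref = c("Smith 2001", "Jones 1999", NA, NA),
  longref  = c("Smith J. (2001) Alien plants of Europe.", NA, NA, NA),
  url      = c("http://www.example.org", NA, "http://www.example.org/a", NA),
  stringsAsFactors = FALSE
)
refs$source <- build_source(refs$shortref, refs$longref, refs$url)
refs$source
```

A record with none of the three fields keeps a missing source, which makes such records easy to spot afterwards.
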

We can now use literature_references to add the full reference to the generated Darwin Core files (except the distribution extension, see further), using sourceid as the link.

4 Generate reference dataset for distribution extension

The distribution extension has no sourceid field to link to the full reference in literature_references. For the distribution extension, a specific reference can only be linked to a record through the combination of:

  • id_sp_region
  • field_name

We save this information in a separate dataframe distribution_sources here and export it to use it later to generate the full reference for the distribution extension.

  1. Generate distribution_sources

  2. Export as distribution_sources
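
A rough sketch of these two steps follows. It assumes that literature_references carries id_sp_region, field_name and the generated source as columns, which this file does not show; the field_name values, the CSV format and the temporary output path are also invented for illustration.

```r
# Hypothetical stand-in for literature_references (column layout is assumed).
literature_references <- data.frame(
  sourceid     = c(1, 2, 1),
  id_sp_region = c(101, 101, 102),
  field_name   = c("status", "year_introduction", "status"),
  source       = c("Smith J. (2001) Alien plants of Europe.",
                   "Jones A. (1999) Invasive fishes.",
                   "Smith J. (2001) Alien plants of Europe."),
  stringsAsFactors = FALSE
)

# Keep only the composite key and the full reference.
distribution_sources <- unique(
  literature_references[, c("id_sp_region", "field_name", "source")]
)

# Export (assumption: CSV, written to a temporary directory in this sketch).
out_file <- file.path(tempdir(), "distribution_sources.csv")
write.csv(distribution_sources, out_file, row.names = FALSE, na = "")

# Later lookup: both key columns are needed to find the reference.
key <- data.frame(id_sp_region = 101, field_name = "status",
                  stringsAsFactors = FALSE)
matched <- merge(key, distribution_sources,
                 by = c("id_sp_region", "field_name"))
matched$source
```
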

4.1 Generate interim_literature_references

Remove duplicated sources from literature_references (duplicates are only needed for mapping the distribution extension) and export the result as an interim file.

  1. Remove duplicates:

  2. Export as interim_literature_references:
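
The deduplication and export might be sketched as below. It assumes a duplicate is a repeated sourceid and that only sourceid and source need to be kept; the sample rows, the CSV format and the output path are illustrative only.

```r
# Hypothetical stand-in: the same source appears once per distribution record.
literature_references <- data.frame(
  sourceid     = c(1, 1, 2),
  id_sp_region = c(101, 102, 103),
  source       = c("Smith J. (2001) Alien plants of Europe.",
                   "Smith J. (2001) Alien plants of Europe.",
                   "Jones A. (1999) Invasive fishes."),
  stringsAsFactors = FALSE
)

# Keep the first row for each sourceid (assumption: sourceid defines a duplicate).
keep <- !duplicated(literature_references$sourceid)
interim_literature_references <-
  literature_references[keep, c("sourceid", "source")]

# Export as an interim file (temporary path used in this sketch).
out_file <- file.path(tempdir(), "interim_literature_references.csv")
write.csv(interim_literature_references, out_file, row.names = FALSE, na = "")
```
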

5 Generate additional files for description extension

input_taxon has one field that can’t be used for mapping the Taxon Core: ecofunct_group.

We will integrate this information in the description extension.

  1. Save information on ecofunctional groups (create ecofunct_group):

  2. Remove records for which ecofunct_group is empty:

  3. Export as ecofunctional_group
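
The three steps above could be sketched in base R as follows. The taxon_id column is a hypothetical identifier (the real key column of input_taxon is not named in this file), and the sample values, CSV format and output path are assumptions.

```r
# Hypothetical stand-in for input_taxon; taxon_id is an assumed key column.
input_taxon <- data.frame(
  taxon_id       = 1:4,
  ecofunct_group = c("herbivore", "", NA, "predator"),
  stringsAsFactors = FALSE
)

# 1. Save the ecofunctional group together with the taxon identifier.
ecofunct_group <- input_taxon[, c("taxon_id", "ecofunct_group")]

# 2. Remove records where ecofunct_group is missing or blank.
filled <- !is.na(ecofunct_group$ecofunct_group) &
  trimws(ecofunct_group$ecofunct_group) != ""
ecofunct_group <- ecofunct_group[filled, ]

# 3. Export (temporary path used in this sketch).
out_file <- file.path(tempdir(), "ecofunctional_group.csv")
write.csv(ecofunct_group, out_file, row.names = FALSE)
```
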