This file describes the steps required to reorganize the input datasets. The goal of this reorganization is to improve the readability of the mapping scripts later.

1 Setup

Load libraries:

2 Read data

Define data types
Read literature references
Read taxon data
Read distribution data:

3 Compile reference data

The file literature_references includes all sources used to compile the DAISIE checklist.

Inspect literature_references:

Each source can be identified by its sourceid and is composed of:
- a citation (shortref) and/or
- a full reference (longref) and/or
- a url (url)

Most input files use the field sourceid to link with the sources included in literature_references. In the Darwin Core mapping process, we want to replace the sourceid with the full reference to each source. To accomplish this, we need to compile the full reference from:
- shortref and/or
- longref and/or
- url

Inspect longref:

Cleaning the content of this field is outside the scope of this mapping. However, several longref values start with a number (e.g. 1, 1, 28). These can easily be removed.

Remove numbers from longref
Verify if numbers are removed (should be TRUE)

## [1] TRUE
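
The removal and the check above can be sketched in base R as follows. This is a minimal sketch, not the original chunk: the sample values are invented, and the pattern (a leading run of digits followed by optional whitespace) is an assumption about what the unwanted prefix looks like.

```r
# Toy data standing in for literature_references (values are invented)
literature_references <- data.frame(
  sourceid = c(1, 2, 3),
  longref = c(
    "1 Smith J (2001) Alien plants.",
    "28 Jones A (1999) Fauna.",
    "Brown K (2005) Fishes."
  ),
  stringsAsFactors = FALSE
)

# Strip a leading run of digits plus any whitespace that follows it
literature_references$longref <- sub(
  "^[0-9]+[[:space:]]*", "", literature_references$longref
)

# Verify that no longref still starts with a digit (should be TRUE)
!any(grepl("^[0-9]", literature_references$longref))
```

A pattern this broad would also strip a reference that legitimately begins with a number, so it is worth inspecting the affected values before applying it.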

Inspect shortref

Cleaning this field is outside the scope of this mapping.

Inspect url

This information is, in most cases, wrong: url often refers to scientific names rather than to a real url. We remove the scientific names from url and keep real url values only.

Remove non-valid url values
Verify cleaning step
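
One way to separate real url values from scientific names is sketched below in base R. The rule used here (a value counts as a url only if it starts with http://, https:// or www.) is an assumption for illustration, and the sample values are invented.

```r
# Toy data standing in for literature_references (values are invented)
literature_references <- data.frame(
  sourceid = c(1, 2, 3),
  url = c("http://www.example.org/checklist", "Acer negundo", NA),
  stringsAsFactors = FALSE
)

# Treat a value as a real url only if it starts with http://, https:// or www.
# grepl() returns FALSE for NA, so missing values simply stay missing
is_url <- grepl("^(https?://|www\\.)", literature_references$url)

# Replace everything else (e.g. scientific names) with NA
literature_references$url[!is_url] <- NA_character_

# Verify cleaning step: list the remaining distinct url values
unique(literature_references$url)
```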

Show mapping of full reference from shortref, longref and url

For most records, one or more of the above-mentioned fields are empty. We need to know which (and how many) meaningful combinations of these fields occur in literature_references:
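
A possible way to tabulate these combinations in base R is shown below. It is a sketch on invented data; the helper is_present is hypothetical, and it treats both NA and empty strings as missing.

```r
# Toy data standing in for literature_references (values are invented)
literature_references <- data.frame(
  shortref = c("Smith 2001", NA, "Jones 1999", NA),
  longref = c("Smith J (2001) Alien plants.", "Brown K (2005) Fishes.", NA, NA),
  url = c(NA, "http://www.example.org/a", NA, "http://www.example.org/b"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a field is present when it is neither NA nor ""
is_present <- function(x) !is.na(x) & x != ""

# Cross-tabulate presence of the three fields and flatten to a data frame
combinations <- as.data.frame(table(
  shortref = is_present(literature_references$shortref),
  longref = is_present(literature_references$longref),
  url = is_present(literature_references$url)
))

# Keep only the combinations that actually occur
occurring <- combinations[combinations$Freq > 0, ]
occurring
```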

We use the following rules to generate source:

If a longref is provided, use it and never the shortref. The url is appended where provided.
If no longref is provided, use the shortref and/or the url.

Generate full reference (source)
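
The two rules can be sketched in base R as below. The data are invented, the helper join_parts is hypothetical, and joining the reference and the url with a single space is an assumption; the original script may format the result differently.

```r
# Toy data standing in for literature_references (values are invented)
literature_references <- data.frame(
  shortref = c("Smith 2001", "Jones 1999", NA, "Brown 2005"),
  longref = c("Smith J (2001) Alien plants.", NA, NA, NA),
  url = c(
    "http://www.example.org/a", NA,
    "http://www.example.org/c", "http://www.example.org/d"
  ),
  stringsAsFactors = FALSE
)

# Rule 1: prefer longref; fall back to shortref only when longref is missing
reference <- ifelse(
  is.na(literature_references$longref),
  literature_references$shortref,
  literature_references$longref
)

# Rule 2: append url where provided, skipping whichever part is missing
join_parts <- function(ref, url) {
  parts <- c(ref, url)
  parts <- parts[!is.na(parts)]
  if (length(parts) == 0) NA_character_ else paste(parts, collapse = " ")
}

literature_references$source <- mapply(
  join_parts, reference, literature_references$url,
  USE.NAMES = FALSE
)
```

The first row keeps the longref and drops the shortref, the second falls back to the shortref alone, the third consists of the url only, and the fourth combines shortref and url.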

We can now use literature_references to add the full reference to the generated Darwin Core files (except the distribution extension, see further), using sourceid as the link.
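
As an illustration of that link, the sketch below looks up the full reference for each record of a toy input table by its sourceid. The column scientific_name and all values are invented; only sourceid and source come from the text above.

```r
# Toy lookup table with one full reference per sourceid (values are invented)
literature_references <- data.frame(
  sourceid = c(1, 2),
  source = c("Smith J (2001) Alien plants.", "Jones 1999"),
  stringsAsFactors = FALSE
)

# Toy input file that refers to its sources by sourceid
input_taxon <- data.frame(
  scientific_name = c("Acer negundo", "Ailanthus altissima", "Elodea canadensis"),
  sourceid = c(2, 1, 2),
  stringsAsFactors = FALSE
)

# match() returns, for each record, the row of the lookup table with the same
# sourceid, so the original row order of input_taxon is preserved
input_taxon$source <- literature_references$source[
  match(input_taxon$sourceid, literature_references$sourceid)
]
```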

4 Generate reference dataset for distribution extension

The distribution extension has no sourceid field to link with the full reference in literature_references. For the distribution extension, a specific reference can only be linked to a record through the combination of:
- id_sp_region
- field_name

We save this information in a separate dataframe distribution_sources here and export it to use it later to generate the full reference for the distribution extension.

Generate distribution_sources
Export as distribution_sources
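
A rough sketch of this step in base R follows. It assumes that literature_references itself carries the id_sp_region and field_name columns, which the text implies but does not show; the values, the field_name entries and the output location (a temporary directory) are all invented for illustration.

```r
# Toy data standing in for literature_references (values are invented)
literature_references <- data.frame(
  sourceid = c(1, 2, 3),
  id_sp_region = c(5001, NA, 5002),
  field_name = c("start_year", NA, "population_status"),
  source = c("Smith J (2001) Alien plants.", "Jones 1999", "Brown 2005"),
  stringsAsFactors = FALSE
)

# Keep only the rows that carry both key fields, plus the full reference
has_key <- !is.na(literature_references$id_sp_region) &
  !is.na(literature_references$field_name)
distribution_sources <- literature_references[
  has_key, c("id_sp_region", "field_name", "source")
]

# Export; a temporary directory stands in for the real output folder
out_file <- file.path(tempdir(), "distribution_sources.csv")
write.csv(distribution_sources, out_file, row.names = FALSE, na = "")
```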

4.1 Generate interim_literature_references

Remove duplicated sources from literature_references (the duplicates are only needed for mapping the distribution extension) and export the result as an interim file.

Remove duplicates:
Export as interim_literature_references:
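
Deduplication and export could look like the sketch below. The data are invented, duplicates are defined here as fully identical rows (an assumption), and a temporary directory stands in for the real interim folder.

```r
# Toy data with one duplicated source (values are invented)
literature_references <- data.frame(
  sourceid = c(1, 1, 2),
  source = c(
    "Smith J (2001) Alien plants.",
    "Smith J (2001) Alien plants.",
    "Jones 1999"
  ),
  stringsAsFactors = FALSE
)

# Remove duplicates: keep the first occurrence of each identical row
interim_literature_references <- literature_references[
  !duplicated(literature_references),
]

# Export as interim file; tempdir() is a placeholder for the real location
out_file <- file.path(tempdir(), "interim_literature_references.csv")
write.csv(interim_literature_references, out_file, row.names = FALSE, na = "")
```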

5 Generate additional files for description extension

input_taxon has one field that can’t be used for mapping the Taxon Core: ecofunct_group

We will integrate this information in the description extension.

Save information on ecofunctional groups (create ecofunct_group):
Remove records for which ecofunct_group is empty:
Export as ecofunctional_group
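
These three steps can be sketched in base R as follows. The identifier column idspecies and all values are invented, "empty" is taken to mean NA or an empty string, and the output location is a placeholder.

```r
# Toy data standing in for input_taxon (idspecies and values are invented)
input_taxon <- data.frame(
  idspecies = c(101, 102, 103),
  ecofunct_group = c("Herbivore", NA, ""),
  stringsAsFactors = FALSE
)

# Save information on ecofunctional groups
ecofunct_group <- input_taxon[, c("idspecies", "ecofunct_group")]

# Remove records for which ecofunct_group is empty (NA or empty string)
is_empty <- is.na(ecofunct_group$ecofunct_group) |
  ecofunct_group$ecofunct_group == ""
ecofunct_group <- ecofunct_group[!is_empty, ]

# Export; a temporary directory stands in for the real output folder
out_file <- file.path(tempdir(), "ecofunctional_group.csv")
write.csv(ecofunct_group, out_file, row.names = FALSE)
```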

Pre-processing step for Darwin Core mapping

For: Inventory of alien species in Europe (DAISIE)

Lien Reyserhove

David Roy

2021-03-19