This file describes the steps required to map the data to the Species Distribution extension.
Load libraries:
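The exact set of packages is not shown here; a minimal sketch of what a tidyverse-based mapping like this typically loads (the selection is an assumption):

```r
library(tidyverse)  # data transformation: dplyr, tidyr, stringr, readr
library(magrittr)   # extra pipe operators
library(janitor)    # cleaning column names
library(here)       # file paths relative to the project root
```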
Define data types:
Read data:
Merge source data:
Preview of the data:
Remove all taxonIDs that are not included in the taxon core:
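A minimal sketch of this filter, assuming the taxon core lives in a data frame called `taxon`:

```r
library(dplyr)

# Keep only the distribution records whose taxonID is present in the taxon core
input_distributions <- input_distributions %>%
  filter(taxonID %in% taxon$taxonID)
```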
Scan for duplicated taxa (see this issue):
Six fields in input_distributions are used to map locationID and locality:
Two of these fields contain verbatim location information (mapped to locality):

- region_country
- region_coast

Two of these fields contain a location code (mapped to locationID):

- code_region (for locations linked to countries)
- code_coast (for locations linked to coastal areas)

Two of these fields indicate the standard used/developed for these codes (also mapped to locationID):

- system_country: TDWG or DAISIE consortium (for code_region)
- system_coast: IHO23_4 or DIHO23_4 (for code_coast)
Note that coastal information is not always provided (NA). For these records, we will exclude information related to coastal regions:
| region_country | region_coast | locationID | locality |
|---|---|---|---|
| country | NA | system_country: code_region | region_country |
| country | coast | system_country: code_region | system_coast: code_coast |
Convert system_country to uppercase:
Map locationID:
Map locality:
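A sketch of both mappings, following the mapping table above; the data frame name `distribution` and the implementation details are assumptions:

```r
library(dplyr)

distribution <- distribution %>%
  mutate(
    # locationID: standard plus country-level code
    dwc_locationID = paste(system_country, code_region, sep = ": "),
    # locality: verbatim country, or the coded coastal area when provided
    dwc_locality = if_else(
      is.na(region_coast),
      region_country,
      paste(system_coast, code_coast, sep = ": ")
    )
  )
```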
Some information in the description extension (e.g. for impact on use or impact on ecology) is a property of a particular taxon in a particular region, identified by id_sp_region, rather than a property of a species as a whole. To emphasize that this descriptor is linked to a taxon in a particular region, we will add the locationID to the descriptor:
| idspecies | sp_in_region | description | description_with_locationID |
|---|---|---|---|
| 1 | 1.1 | impact_on_use_1.1 | impact_on_use_1.1 (locationID) |
| 1 | 1.2 | impact_on_use_1.2 | impact_on_use_1.2 (locationID) |
Here we save the link between id_sp_region and locationID in a separate data frame, to use later in the mapping of the description extension:
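A sketch of that lookup table, assuming the freshly mapped dwc_locationID is available in the same data frame; the file path in the comment is only an example:

```r
library(dplyr)

sp_in_region_with_location <- distribution %>%
  select(id_sp_region, locationID = dwc_locationID) %>%
  distinct()

# readr::write_csv(sp_in_region_with_location, "path/to/sp_in_region_with_location.csv")
```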
Save sp_in_region_with_location as .csv:

Information to map occurrenceStatus is contained in abundance and population_status.
In most cases, we can translate abundance to the GBIF controlled vocabulary for occurrenceStatus:
| abundance | occurrenceStatus |
|---|---|
| Absent or extinct | absent (or extinct, see below) |
| Abundant | common |
| Common | common |
| Local | present |
| Rare | rare |
| Single record | present |
| Sporadic | irregular |
| Unknown | doubtful |
However, for 30 taxa, population_status contains the value Extinct, which is valuable information for occurrenceStatus:
| population_status | abundance | records |
|---|---|---|
| Extinct | Absent or extinct | 9 |
| Extinct | Local | 6 |
| Extinct | Rare | 11 |
| Extinct | Single record | 4 |
| Extinct | Unknown | 8 |
We decided to set occurrenceStatus to extinct, irrespective of the content of abundance (see this issue).
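A sketch of the combined mapping, applying the translation table above together with the Extinct override (implementation details are assumptions):

```r
library(dplyr)

distribution <- distribution %>%
  mutate(dwc_occurrenceStatus = case_when(
    population_status == "Extinct"   ~ "extinct",   # override, irrespective of abundance
    abundance == "Absent or extinct" ~ "absent",
    abundance == "Abundant"          ~ "common",
    abundance == "Common"            ~ "common",
    abundance == "Local"             ~ "present",
    abundance == "Rare"              ~ "rare",
    abundance == "Single record"     ~ "present",
    abundance == "Sporadic"          ~ "irregular",
    abundance == "Unknown"           ~ "doubtful"
  ))
```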
Information for the mapping of establishmentMeans is contained in species_status:
For mapping species_status to establishmentMeans, see issue 15 on GitHub:
Inspect content of start_year, which contains the information for eventDate:
Besides a lot of NA values, we have many YYYY-formatted years (good to go) and a smaller group of other formats:
The remaining values will require some manual cleanup.
Get an overview of the number of records with just a YYYY year format:
## [1] 23753
We also have a number of negative years to take into account. Let's only consider those with 3 or 4 digits:
## [1] 178
Let's clean the start_year information step by step:
Set everything that is NA or should be NA to NA:
Set all negative values to NA:
When a value uses a < or > sign, possibly with a ? or \n added, just take the year:
When a full date is available, parse it to ISO 8601 date format:
When the text contains one or more years, extract the first year:
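A sketch of this extraction with stringr; str_extract() returns the first match, so a regex for four consecutive digits picks up the first year in the text (the exact regex used in the original chunk is an assumption):

```r
library(stringr)

# A few example values as they might appear in start_year
years <- c("1995", "<1987?", "1950, 1960", "ca. 2003")
str_extract(years, "[0-9]{4}")
#> [1] "1995" "1987" "1950" "2003"
```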
Replace some special values still present:
First, have a look at the remaining special values that are not an integer year (one or more digits [0-9]) or a formatted date:
Then replace those values:
Recheck the cleanup:
Show content:
Inspect content of end_year, which contains the information for eventDate:
Besides a lot of NA values, we have many YYYY-formatted years (good to go) and a smaller group of other formats:
The remaining values will require some manual cleanup.
Get an overview of the number of records with just a YYYY year format:
## [1] 18222
We also have a number of negative years to take into account. Let's only consider those with 3 or 4 digits:
## [1] 129
Let's clean the end_year information step by step:
Set everything that is NA or should be NA to NA:
Set all negative values to NA:
When a value uses a < or > sign, possibly with a ? or \n added, just take the year:
When a full date is available, parse it to ISO 8601 date format:
When the text contains one or more years, extract the first year:
Replace some special values still present:
First, have a look at the remaining special values that are not an integer year (one or more digits [0-9]) or a formatted date:
Then replace those values:
Recheck the cleanup:
Show content:
Inspect all combinations of start_year and end_year:
Inspect records where end_year falls before start_year:
Create eventDate:
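A sketch of the eventDate mapping, combining the cleaned years into an ISO 8601 interval (start/end); the handling of missing values is an assumption:

```r
library(dplyr)

# Assuming start_year and end_year are character after cleaning
distribution <- distribution %>%
  mutate(dwc_eventDate = case_when(
    is.na(start_year) & is.na(end_year) ~ NA_character_,
    is.na(end_year)                     ~ start_year,
    is.na(start_year)                   ~ end_year,
    TRUE                                ~ paste(start_year, end_year, sep = "/")
  ))
```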
In this field we will gather all the extra information related to a distribution. The syntax we will use is: field: value | field: value. Even if a column does not contain information, it will be included as field: NA which will make it easier for end users to separate the occurrenceRemarks field back into multiple columns. Note that input_pathways and input_impact have a sourceid column that will be ignored in this mapping (it is very seldom populated).
First, we clean the region_of_first_record information by removing trailing semicolons:
Map dwc_occurrenceRemarks:
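A sketch of the field: value | field: value syntax described above, for a few of the source columns; the exact selection of columns included in occurrenceRemarks is an assumption:

```r
library(dplyr)

distribution <- distribution %>%
  mutate(dwc_occurrenceRemarks = paste(
    paste("population_status:", population_status),
    paste("abundance:", abundance),
    paste("region_of_first_record:", region_of_first_record),
    sep = " | "
  ))
# Missing values automatically end up as "field: NA", as described above
```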
Some records in occurrenceRemarks contain a carriage return. We remove these here:
There’s no sourceid to link the sources in literature_references with the distribution extension. For this, we need the file distribution_sources, generated earlier.
Thus, we need to match distribution_sources:
| id_sp_region | field_name | source |
|---|---|---|
| 1 | property_A | source_1 |
| 1 | property_B | source_2 |
| 2 | property_A | source_3 |
| 2 | property_B | source_4 |
| 3 | property_A | source_5 |
| 3 | property_B | source_6 |
with distribution:
| id_sp_region | property_A | property_B |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B2 |
To generate:

| id_sp_region | property_A | property_B | source |
|---|---|---|---|
| 1 | A1 | B1 | source_1 \| source_2 |
| 2 | A2 | B2 | source_3 \| source_4 |
| 3 | A3 | B2 | source_5 \| source_6 |
To link both datasets, we use id_sp_region.
Transform distribution_sources from a long to a wide dataset:
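A sketch of the reshaping with tidyr, assuming distribution_sources has the columns id_sp_region, field_name and source shown earlier:

```r
library(dplyr)
library(tidyr)

distribution_sources <- distribution_sources %>%
  pivot_wider(
    id_cols     = id_sp_region,
    names_from  = field_name,
    values_from = source,
    # collapse multiple sources for the same field into one cell
    values_fn   = function(x) paste(x, collapse = " | ")
  )
```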
Clean column names in distribution_sources:
Inspect column names:
## [1] "current_distrib" "current_distribution" "distribution"
## [4] "ecoimpact_id" "ecological_impact" "first_observation"
## [7] "general_references" "history_is_known" "impact_on_uses"
## [10] "introduction_dates" "introduction_history" "status"
## [13] "useimpact_id"
Some remarks:

- Some field names are linked to variables in distribution:
  - distribution
  - general_references
  - introduction_dates
  - introduction_history
- Some field names are linked to variables to be included in description:
  - current_distrib
  - current_distribution
  - ecoimpact_id
  - ecological_impact
  - first_observation
  - impact_on_uses
  - status
  - useimpact_id
Map source information for distribution using `|` as a separator:
Remove |NA:
Add source information to distribution using id_sp_region:
Rename source:
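Taken together, the four steps above could look roughly like this; the selection of distribution-related source columns follows the remarks above, the rest is an assumption:

```r
library(dplyr)
library(stringr)

distribution_sources <- distribution_sources %>%
  # Paste the distribution-related source fields together, separated by "|"
  mutate(source = paste(distribution, general_references,
                        introduction_dates, introduction_history,
                        sep = " | ")) %>%
  # Remove empty contributions ("| NA")
  mutate(source = str_remove_all(source, " \\| NA")) %>%
  select(id_sp_region, source)

# Add the source information to the distribution data frame and rename
distribution <- distribution %>%
  left_join(distribution_sources, by = "id_sp_region") %>%
  rename(dwc_source = source)
```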
Only keep the Darwin Core columns:
Drop the dwc_ prefix:
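A sketch of these two steps, assuming all Darwin Core terms were mapped into columns with a dwc_ prefix:

```r
library(dplyr)
library(stringr)

distribution <- distribution %>%
  select(starts_with("dwc_")) %>%            # keep only the Darwin Core columns
  rename_with(~ str_remove(.x, "^dwc_"))     # drop the dwc_ prefix
```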
Preview data: