This file describes the steps required to map the data to the Species Distribution extension.

1 Setup

Load libraries:
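The exact set of packages is not shown here; as a minimal sketch, assuming a tidyverse-based workflow (the package list is an assumption, not a confirmed one):

```r
library(tidyverse)  # dplyr, tidyr, readr, stringr, purrr, ...
library(magrittr)   # extra pipe operators such as %<>%
library(janitor)    # clean_names()
library(here)       # file paths relative to the project root
```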

2 Read data

  1. Define data types

  2. Read data

  3. Merge source data (see the sketch after this list):
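A minimal sketch of these three steps; the file names, paths and the id_species join key are assumptions, and the real column types may differ:

```r
library(readr)
library(dplyr)

# 1. Define data types: read every column as character to avoid type guessing
col_types_all_character <- cols(.default = col_character())

# 2. Read data (file names and paths are hypothetical)
input_distributions <- read_csv("data/raw/input_distributions.csv",
                                col_types = col_types_all_character)
input_taxa <- read_csv("data/raw/input_taxa.csv",
                       col_types = col_types_all_character)

# 3. Merge source data on a shared key (id_species is an assumed key)
input_distributions <- input_distributions %>%
  left_join(input_taxa, by = "id_species")
```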

3 Map distribution extension

Preview of the data:

3.1 taxonID

  1. Remove all taxonIDs that are not included in the taxon core

  2. Scan for duplicated taxa (see this issue):

  3. Map taxonID (see the sketch after this list)
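A minimal sketch of these three steps, assuming a taxon core data frame named taxon and a taxonID column in both data frames (both names are assumptions):

```r
library(dplyr)

# 1. Remove all taxonIDs that are not included in the taxon core
distribution <- input_distributions %>%
  filter(taxonID %in% taxon$taxonID)

# 2. Scan for duplicated taxa
duplicate_taxa <- distribution %>%
  count(taxonID) %>%
  filter(n > 1)

# 3. Map taxonID to the Darwin Core term
distribution <- distribution %>%
  mutate(dwc_taxonID = taxonID)
```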

3.2 locationID and locality

Six fields in input_distributions are used to map locationID and locality:

Two of these fields refer to verbatim location information (mapped to locality):

  • region_country
  • region_coast

Two of these fields contain a location code (mapped to locationID):

  • code_region (for locations linked to countries)
  • code_coast (for locations linked to coastal areas)

The remaining two fields indicate the standard used/developed for these codes (also mapped to locationID):

  • system_country: TDWG or DAISIE consortium (for code_region)
  • system_coast: IHO23_4 or DIHO23_4 (for code_coast)

Note that coastal information is not always provided (NA). For these records, we will exclude information related to coastal regions:

region_country  region_coast  locationID                                              locality
country         NA            system_country: code_region                             region_country
country         coast         system_country: code_region system_coast: code_coast    region_country region_coast
  1. Convert system_country to uppercase:

  2. Map locationID

  3. Map locality (see the sketch after this list)
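A minimal sketch of the mapping logic described above; the column names come from the text, but the exact formatting of locationID and locality (including the separator between the country and coast parts) is an assumption:

```r
library(dplyr)
library(stringr)

distribution <- distribution %>%
  # 1. Convert system_country to uppercase
  mutate(system_country = str_to_upper(system_country)) %>%
  # 2. Map locationID: country code, extended with the coast code when present
  #    (the exact separator between the two parts is an assumption)
  mutate(dwc_locationID = if_else(
    is.na(region_coast),
    paste0(system_country, ": ", code_region),
    paste0(system_country, ": ", code_region, " ",
           system_coast, ": ", code_coast)
  )) %>%
  # 3. Map locality: verbatim country, extended with the coast when present
  mutate(dwc_locality = if_else(
    is.na(region_coast),
    region_country,
    paste(region_country, region_coast)
  ))
```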

Some information in the description extension (e.g. for impact on use or impact on ecology) is a property of a particular taxon in a particular region, identified by id_sp_region, rather than a property of the species as a whole. To emphasize that such a descriptor is linked to a taxon in a particular region, we will add the locationID to the descriptor:

idspecies  sp_in_region  description        description_with_locationID
1          1.1           impact_on_use_1.1  impact_on_use_1.1 (locationID)
1          1.2           impact_on_use_1.2  impact_on_use_1.2 (locationID)

Here we save the link between id_sp_region and locationID in a separate dataframe, to be used later in the mapping of the description extension:

  1. Export sp_in_region_with_location as .csv
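A minimal sketch of building and exporting this link table; the data frame name sp_in_region_with_location comes from the text, while the dwc_locationID column name and the output path are assumptions:

```r
library(dplyr)
library(readr)

# Keep the link between id_sp_region and locationID for the description extension
sp_in_region_with_location <- distribution %>%
  select(id_sp_region, dwc_locationID) %>%
  distinct()

# 1. Export sp_in_region_with_location as .csv (the path is an assumption)
write_csv(sp_in_region_with_location,
          "data/interim/sp_in_region_with_location.csv", na = "")
```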

3.3 countryCode

Map countryCode to the ISO 3166-1 alpha-2 country code:
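A sketch of this translation using the countrycode package, assuming region_country holds country names; whether this package was actually used here is an assumption:

```r
library(dplyr)
library(countrycode)

# Translate country names to ISO 3166-1 alpha-2 codes, e.g. "Belgium" -> "BE"
distribution <- distribution %>%
  mutate(dwc_countryCode = countrycode(region_country,
                                       origin = "country.name",
                                       destination = "iso2c"))
```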

3.4 occurrenceStatus

Information to map occurrenceStatus is contained in abundance and population_status.

In most cases, we can translate abundance to the GBIF controlled vocabulary of occurrenceStatus:

abundance          occurrenceStatus
Absent or extinct  absent (or extinct, see below)
Abundant           common
Common             common
Local              present
Rare               rare
Single record      present
Sporadic           irregular
Unknown            doubtful

However, for 30 taxa, population_status contains the value Extinct, which is valuable information for occurrenceStatus:

population_status  abundance          records
Extinct            Absent or extinct  9
Extinct            Local              6
Extinct            Rare               11
Extinct            Single record      4
Extinct            Unknown            8

We decided to set occurrenceStatus to extinct for these records, irrespective of the content of abundance (see this issue).
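A minimal sketch of this translation with a single case_when(), combining the abundance vocabulary above with the Extinct override from population_status (column names come from the text):

```r
library(dplyr)

distribution <- distribution %>%
  mutate(dwc_occurrenceStatus = case_when(
    population_status == "Extinct"              ~ "extinct",   # overrides abundance
    abundance == "Absent or extinct"            ~ "absent",
    abundance %in% c("Abundant", "Common")      ~ "common",
    abundance %in% c("Local", "Single record")  ~ "present",
    abundance == "Rare"                         ~ "rare",
    abundance == "Sporadic"                     ~ "irregular",
    abundance == "Unknown"                      ~ "doubtful"
  ))
```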

3.5 establishmentMeans

Information for the mapping of establishmentMeans is contained in species_status:

For mapping species_status to establishmentMeans, see issue 15 on GitHub:

3.6 eventDate

3.6.1 Clean start year

Inspect content of start_year, which contains the information for eventDate:

Besides a lot of NA values, we have many YYYY formatted years (good to go) and a smaller group of others:

  • NA cases:
    • Unknown -> NA
    • unknown -> NA
    • . -> NA
    • ? -> NA
    • since long -> NA
  • negative years: also to NA
    • -5300, -2200, -750 -> NA
  • before/after cases, question marks,… remove the </>/? signs
    • <1925 -> 1925, i.e. year itself
    • year with question mark, e.g. 1921?, 1930 ? -> year itself is best guess, so extract year
    • 1999-> clean to 1999
  • full dates: 10.06.1995., 01/04/1993, 15/10/2005,…
  • multiple years:
    • range of years: 1889-1892 -> take first year
    • options: ‘2000, 2001’; 1992 or 2010, 2000 OR 2004 -> take first year occurrence
  • specials: 20. century, 1957*; 2008**, , 2004, earlier unconformed records, March,1993, 90`s
    • try to extract a 4-digit year (or use Damiano’s improved functionality)

The remaining values will probably require some manual cleanup.
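A minimal sketch of these cleaning rules as one helper function; the regular expressions are assumptions that cover the cases listed above, not the exact expressions used in the original chunks:

```r
library(dplyr)
library(stringr)
library(lubridate)

# Hypothetical helper implementing the rules above for a character vector of raw years
clean_year <- function(year) {
  case_when(
    # NA cases: Unknown, ".", "?", "since long", ...
    str_detect(year, regex("^(unknown|\\.|\\?|since long)$", ignore_case = TRUE)) ~ NA_character_,
    # Negative years -> NA
    str_detect(year, "^\\s*-\\d+") ~ NA_character_,
    # Full dates such as 10.06.1995. or 01/04/1993 -> ISO 8601 (YYYY-MM-DD)
    str_detect(year, "^\\d{1,2}[./]\\d{1,2}[./]\\d{4}\\.?$") ~
      as.character(dmy(str_remove(year, "\\.$"), quiet = TRUE)),
    # Everything else (<1925, 1921?, ranges, "2000, 2001", specials): first 4-digit year
    TRUE ~ str_extract(year, "\\d{4}")
  )
}

distribution <- distribution %>%
  mutate(start_year_clean = clean_year(start_year))
```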

Get an overview of the number of records with just a YYYY year format:

## [1] 23753

We also have a number of negative years to take into account. Let's only consider those with 3 or 4 digits:

## [1] 178

Let’s clean the start years information step by step:

  1. Everything that is NA or should be NA, make it NA:

  2. For all negative values, make it NA:

  3. When using a < or > sign, with a ? or \n added, just take the year:

  4. When a full date is available, parse it to ISO 8601 date format:

  5. When textwise containing a single or multiple years, extract the first year in the text:

  6. Replace some specials still present:

First, have a look at the remaining specials, i.e. values that are not an integer year (1 or more digits [0-9]) or a formatted date:

and replace those values:

Recheck the cleanup action:

Show content:

3.6.2 Clean end year

Inspect content of end_year, which contains the information for eventDate:

Besides a lot of NA values, we have many YYYY formatted years (good to go) and a smaller group of others:

  • NA cases:
    • unknown -> NA
    • -> NA
    • ? -> NA
  • negative years: also to NA
    • -2200, -750 -> NA
  • before/after cases, question marks,… remove the </>/? signs
    • year with question mark, 2004? -> year itself is best guess, so extract year
  • full dates: 15.06.2003.
  • specials: 20. century, 1950’s*
    • try to extract a 4-digit year (or use Damiano’s improved functionality)

The remaining values will probably require some manual cleanup.

Get an overview of the number of records with just a YYYY year format:

## [1] 18222

We also have a number of negative years to take into account. Let's only consider those with 3 or 4 digits:

## [1] 129

Let’s clean the end years information step by step:

  1. Everything that is NA or should be NA, make it NA:

  2. For all negative values, make it NA:

  3. When using a < or > sign, with a ? or \n added, just take the year:

  4. When a full date is available, parse it to ISO 8601 date format:

  5. When textwise containing a single or multiple years, extract the first year in the text:

  6. Replace some specials still present:

First, have a look at the remaining specials, i.e. values that are not an integer year (1 or more digits [0-9]) or a formatted date:

and replace those values:

Recheck the cleanup action:

Show content:

Inspect all combinations for start_year and end_year:

Inspect which end_year falls before start_year:

Create eventDate:
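A minimal sketch of building eventDate as an ISO 8601 value, using a start/end interval when both years are present; the cleaned column names are carried over from the sketch in 3.6.1, and the handling of a missing start year is an assumption:

```r
library(dplyr)

# end_year_clean is assumed to be produced from end_year in the same way as start_year_clean
distribution <- distribution %>%
  mutate(dwc_eventDate = case_when(
    is.na(start_year_clean) & is.na(end_year_clean) ~ NA_character_,
    is.na(end_year_clean)                           ~ start_year_clean,
    is.na(start_year_clean)                         ~ end_year_clean,
    TRUE ~ paste(start_year_clean, end_year_clean, sep = "/")
  ))
```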

3.7 occurrenceRemarks

In this field we will gather all the extra information related to a distribution. The syntax we will use is: field: value | field: value. Even if a column does not contain information, it will be included as field: NA which will make it easier for end users to separate the occurrenceRemarks field back into multiple columns. Note that input_pathways and input_impact have a sourceid column that will be ignored in this mapping (it is very seldom populated).
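To illustrate this syntax, a minimal sketch with two columns; region_of_first_record comes from the text below, while pathway is a hypothetical stand-in for the other columns involved:

```r
library(dplyr)

# "pathway" is a hypothetical stand-in; the real mapping concatenates many more columns.
# paste0() renders missing values as the literal string "NA", giving "field: NA" as described.
distribution <- distribution %>%
  mutate(dwc_occurrenceRemarks = paste(
    paste0("region_of_first_record: ", region_of_first_record),
    paste0("pathway: ", pathway),
    sep = " | "
  ))
```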

First, we clean the region_of_first_record information by removing trailing semicolons (;):

Map dwc_occurrenceRemarks:

Some records in occurrenceRemarks contain a carriage return. We remove these here:

3.8 source

There’s no sourceid to link the sources in literature_references with the distribution extension. For this, we need the file distribution_sources, generated earlier.

Thus, we need to match distribution_sources

id_sp_region  field_name  source
1             property_A  source_1
1             property_B  source_2
2             property_A  source_3
2             property_B  source_4
3             property_A  source_5
3             property_B  source_6

with distribution

id_sp_region  property_A  property_B
1             A1          B1
2             A2          B2
3             A3          B2

To generate:

id_sp_region  property_A  property_B  source
1             A1          B1          source_1 | source_2
2             A2          B2          source_3 | source_4
3             A3          B2          source_5 | source_6

To link both datasets, we use id_sp_region.

  1. Transform distribution_sources from a long to a wide dataset:

  2. Clean column names in distribution_sources:

  3. Inspect column names:

##  [1] "current_distrib"      "current_distribution" "distribution"        
##  [4] "ecoimpact_id"         "ecological_impact"    "first_observation"   
##  [7] "general_references"   "history_is_known"     "impact_on_uses"      
## [10] "introduction_dates"   "introduction_history" "status"              
## [13] "useimpact_id"

Some remarks:

  • Some field names are linked to variables in distribution:
    • distribution
    • general_references
    • introduction_dates
    • introduction_history
  • Some field names are linked to variables to be included in description:
    • current_distrib
    • current_distribution
    • ecoimpact_id
    • ecological_impact
    • first_observation
    • impact_on_uses
    • status
    • useimpact_id

  1. Map source information for distribution, using | as a separator (see the sketch after this list)

  2. Remove |NA:

  3. Add source information to distribution using id_sp_region:

  4. Rename source:
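A minimal sketch covering the pivot and steps 1–4 above; the " | " separator and the set of distribution-related columns follow the remarks above, and producing a single dwc_source column directly (rather than renaming afterwards as in step 4) is a simplification:

```r
library(dplyr)
library(tidyr)

# Transform distribution_sources from a long to a wide dataset (one column per field_name)
distribution_sources_wide <- distribution_sources %>%
  pivot_wider(id_cols = id_sp_region,
              names_from = field_name,
              values_from = source)

# Combine the distribution-related source columns into a single dwc_source field,
# using " | " as a separator and dropping NA values (so no "|NA" is left behind)
distribution_sources_wide <- distribution_sources_wide %>%
  unite("dwc_source",
        any_of(c("distribution", "general_references",
                 "introduction_dates", "introduction_history")),
        sep = " | ", na.rm = TRUE)

# Add source information to distribution using id_sp_region
distribution <- distribution %>%
  left_join(
    distribution_sources_wide %>% select(id_sp_region, dwc_source),
    by = "id_sp_region"
  )
```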

4 Post-processing

  1. Only keep the Darwin Core columns

  2. Drop the dwc_ prefix

  3. Preview data

  4. Save to CSV (see the sketch after this list)
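A minimal sketch of these four steps, assuming the Darwin Core columns were built with a dwc_ prefix as in the earlier sketches; the output path is an assumption:

```r
library(dplyr)
library(readr)
library(stringr)

distribution <- distribution %>%
  # 1. Only keep the Darwin Core columns (those built with a dwc_ prefix)
  select(starts_with("dwc_")) %>%
  # 2. Drop the dwc_ prefix
  rename_with(~ str_remove(.x, "^dwc_"))

# 3. Preview data
head(distribution)

# 4. Save to CSV (the path is an assumption)
write_csv(distribution, "data/processed/distribution.csv", na = "")
```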