Searching of iDigBio records — idig_search_records

Function to query the iDigBio API for specimen records

Usage

idig_search_records(
  rq,
  fields = FALSE,
  max_items = 1e+05,
  limit = 0,
  offset = 0,
  sort = FALSE,
  ...
)

Arguments

rq: iDigBio record query in nested list format
fields: vector of fields to include in the returned data.frame. A limited set is returned by default; use "all" to get all indexed fields
max_items: maximum number of results allowed to be retrieved (fail-safe)
limit: maximum number of results returned
offset: number of results to skip before returning results
sort: vector of fields to use for sorting. UUID is always appended to make paging safe
...: additional parameters

Value

A data frame with fields requested or the following default fields:

UUID: Unique identifier assigned by iDigBio.
occurrenceID
catalognumber
family - may be reassigned by iDigBio
genus - may be reassigned by iDigBio
scientificname - may be reassigned by iDigBio
country - may be modified by iDigBio
stateprovince
geopoint: Assigned by iDigBio.
data.dwc:eventDate
data.dwc:year
data.dwc:month
data.dwc:day
datecollected: May be reassigned by iDigBio.
collector: Assigned by iDigBio.
recordset: Assigned by iDigBio.

Details

Wraps idig_search to provide defaults specific to searching specimen records. Using this function instead of idig_search directly is recommended.

Queries need to be specified as a nested list structure that will serialize to an iDigBio query object's JSON as expected by the iDigBio API: https://github.com/iDigBio/idigbio-search-api/wiki/Query-Format

As an example, the first sample query in the API documentation looks like this in JSON:


{
  "scientificname": {
    "type": "exists"
  },
  "family": "asteraceae"
}

To rewrite this in R for use as the rq parameter to idig_search_records or idig_search_media, it would look like this:


rq <- list("scientificname"=list("type"="exists"),
           "family"="asteraceae"
           )

An example of a more complex JSON query with nested structures:


{
  "geopoint": {
    "type": "geo_bounding_box",
    "top_left": {
      "lat": 19.23,
      "lon": -130
    },
    "bottom_right": {
      "lat": -45.1119,
      "lon": 179.99999
    }
  }
}

To rewrite this in R for use as the rq parameter, use nested calls to the list() function:


rq <- list(geopoint=list(
                         type="geo_bounding_box",
                         top_left=list(lat=19.23, lon=-130),
                         bottom_right=list(lat=-45.1119, lon= 179.99999)
                        )
           )
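
To check that a nested list maps to the intended JSON before sending a query, one option is to serialize it locally. The sketch below assumes the jsonlite package is installed. It only illustrates the list-to-JSON mapping for the first sample query and is not necessarily the exact serialization call made inside idig_search_records.

```r
library(jsonlite)

# The first sample query from above, written as a nested list
rq <- list(scientificname = list(type = "exists"),
           family = "asteraceae")

# auto_unbox = TRUE writes length-one vectors as JSON scalars, not arrays
rq_json <- toJSON(rq, auto_unbox = TRUE)
print(rq_json)

# Round trip: parsing the JSON back recovers the nested structure
rq_back <- fromJSON(rq_json)
str(rq_back)
```

Each named list becomes a JSON object and each named element becomes a key, so the depth of the list() nesting mirrors the depth of the braces in the JSON.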

See the Examples section below for more samples of simpler and more complex queries. Please refer to the API documentation for the full functionality available in queries.

All matching results are returned up to the max_items cap (default 100,000). To retrieve more, pass a higher max_items. Records are loaded over HTTP 5,000 at a time, so performance with large result sets is poor: expect result sets over 50,000 records to take tens of minutes. You can use the idig_count_records or idig_count_media functions to find out how many records a query will return; these are fast.
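
As a rough illustration of the cost described above, the number of HTTP requests can be estimated from the record count and the 5,000-record page size. The helper requests_needed is hypothetical (it is not part of ridigbio), and the guarded block is a sketch that assumes ridigbio is installed and the API is reachable.

```r
# Hypothetical helper: HTTP requests needed to fetch n_records,
# assuming the 5,000-records-per-request page size described above
requests_needed <- function(n_records, page_size = 5000) {
  ceiling(n_records / page_size)
}

requests_needed(100000)  # 20 requests at the default max_items cap

if (FALSE) { # not run: requires the ridigbio package and network access
  library(ridigbio)
  rq <- list(genus = "acer")
  n <- idig_count_records(rq = rq)   # fast: returns only the count
  if (n <= 100000) {
    df <- idig_search_records(rq = rq)
  } else {
    message(n, " records match; narrow the query or raise max_items")
  }
}
```

Counting first makes it possible to decide whether to narrow the query before committing to a long download.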

The iDigBio API will only return 5,000 records at a time, but this function automatically pages through the results and returns them all. The limit and offset arguments are available if manual paging of results is needed, though the max_items cap still applies. The item count comes from the results header, not from the number of records actually returned in the limit/offset window.
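
If manual paging is wanted, the offsets for successive windows can be computed up front. This is a minimal sketch: page_offsets is a hypothetical helper, the total of 12,000 is an illustrative number, and the guarded block assumes ridigbio is installed and the API is reachable.

```r
# Hypothetical helper: starting offsets for fetching `total` records
# in windows of `page_size` records each
page_offsets <- function(total, page_size) {
  if (total <= 0) return(numeric(0))
  seq(0, total - 1, by = page_size)
}

offsets <- page_offsets(total = 12000, page_size = 5000)
print(offsets)

if (FALSE) { # not run: requires the ridigbio package and network access
  library(ridigbio)
  rq <- list(genus = "acer")
  # One call per window; each call returns at most `limit` records
  pages <- lapply(offsets, function(o) {
    idig_search_records(rq = rq, limit = 5000, offset = o)
  })
  df <- do.call(rbind, pages)
}
```

Because UUID is always appended to the sort fields, the ordering stays stable between calls, which is what keeps windows from overlapping or skipping records.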

The return value is a data.frame containing the requested fields (or the default fields). The columns in the data frame are untyped and no factors are pre-built. Attribution and other metadata are attached to the data.frame as attributes (i.e. attributes(df)).
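
Since the columns arrive untyped, a conversion step is often needed before analysis. The sketch below uses a hand-built stand-in data frame rather than a real API result, assumes that "untyped" means character columns, and uses a hypothetical attribute name (itemCount) purely to show how non-standard attributes can be listed.

```r
# Stand-in for a search result: every column is character
df <- data.frame(uuid = c("a1", "b2"),
                 geopoint.lat = c("45.1", "46.2"),
                 `data.dwc:year` = c("1998", "2004"),
                 stringsAsFactors = FALSE, check.names = FALSE)
attr(df, "itemCount") <- 2L  # hypothetical attribute name, for illustration

# Let base R guess column types; as.is = TRUE keeps text as character
# instead of converting it to factors
typed <- type.convert(df, as.is = TRUE)
str(typed)

# List attributes beyond the ones every data frame carries
extra <- setdiff(names(attributes(df)), c("names", "row.names", "class"))
print(extra)
```

On a real result, inspecting names(attributes(df)) in the same way shows which metadata entries are actually present.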

Author

Matthew Collins

Examples

if (FALSE) { # \dontrun{
# Simple example of retrieving records in a genus:
idig_search_records(rq=list(genus="acer"), limit=10)

# This complex query shows that booleans passed to the API are represented
# as strings in R, fields used in the query don't have to be returned, and
# the syntax for accessing raw data fields:
idig_search_records(rq=list("hasImage"="true", genus="acer"),
            fields=c("uuid", "data.dwc:verbatimLatitude"), limit=100)

# Searching inside a raw data field for a string. Note that raw data fields
# are searched as full text, while indexed fields are searched with exact matches:

idig_search_records(rq=list("data.dwc:dynamicProperties"="parasite"),
            fields=c("uuid", "data.dwc:dynamicProperties"), limit=100)

# Retrieving a data.frame for use with MaxEnt. Notice geopoint is expanded
# to two columns in the data.frame: geopoint.lat and geopoint.lon:
df <- idig_search_records(rq=list(genus="acer", geopoint=list(type="exists")),
          fields=c("uuid", "geopoint"), limit=10)
write.csv(df[c("uuid", "geopoint.lon", "geopoint.lat")],
          file="acer_occurrences.csv", row.names=FALSE)

} # }