dimensions-data-schemas

Documentation & data schemas for the externally available bulk data releases

Motivations

The Dimensions bulk export is designed for users who need to do analysis over significant portions of the Dimensions data. This document gives a quick overview of how to use the export and what the data looks like.

While Dimensions provides an API for easily accessing specific documents or groups of documents with various filters, some tasks require using a significant portion of the total data. There are two key difficulties:

  1. Accessing millions of individual documents through the API is slow.
  2. Keeping a local copy up to date requires identifying which documents have changed.

The Dimensions bulk export is aimed at tackling these two issues. It is not intended as a replacement for the API, but as a complement. Many tasks will be best served by the API, and we would highly encourage users to start with the API rather than rebuilding what we can offer. However, if you are doing analysis over a significant portion of the data (network analysis is a common use case), then the bulk export should make things easier.

General information on the data

Overall structure

The data is provided as folders on AWS S3. The data is stored as JSONL files, where each line is a single JSON document representing a publication, grant, patent, etc. Each file contains up to 10000 documents (typically 10000, but sometimes fewer). Do not rely on the files to contain any logical grouping (e.g. by date or publisher); the batch sizes are chosen to make downloading and processing simpler, and specific sizes should not be relied upon. More specific information on the regularity and the folder structure can be found in the dedicated document on the different document types.
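Because every line is a standalone JSON document, a file can be streamed without loading it into memory as a whole. The following is a minimal sketch of that idea; the function name `iter_documents` is illustrative and not part of any Dimensions tooling.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_documents(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON document per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. at the end of the file
                yield json.loads(line)


# Usage (the file name follows the pattern shown in this document):
# for doc in iter_documents(Path("records_0000001.jsonl")):
#     ...
```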

Delivery Overview

The following table shows the regularity and the amount of data to be expected as of November 2020. Updates are normally delivered daily (weekly for some sources, as listed in the table). If the update for a particular day is missing, no updates were available for that day.

| Source              | Delivery        | Amount of records | Size              | Regularity         |
|---------------------|-----------------|-------------------|-------------------|--------------------|
| Publications        | baseset         | 113 M             | > 3 TB            | 2-4 times a year   |
|                     | updates         | 8 - 80 K          | About 20 - 800 MB | daily              |
| Patents             | baseset         | 135 M             | About 360 GB      | 2-4 times a year   |
|                     | updates         | 50 K - 1 M        | About 4 - 30 GB   | weekly             |
| Grants              | always full set | 5.6 M             | About 14 GB       | 10-12 times a year |
| Clinical Trials     | baseset         | 650 K             | About 2.5 GB      | 2-4 times a year   |
|                     | updates         | 1 - 5 K           | About 5 - 35 MB   | daily              |
| Data sets           | baseset         | 8 M               | About 22 GB       | 1-2 times a year   |
|                     | updates         | 2 - 200 K         | 5 MB - 2 GB       | daily              |
| (Technical) Reports | baseset         | About 1 M         | About 15 GB       | 1-2 times a year   |
|                     | updates         | 1 - 10 K          | 5 - 25 MB         | daily              |
| Policy Documents    | baseset         | 715 K             | ~550 MB           | 1-3 times a year   |
|                     | updates         | ~1 - 10 K         | 10 MB             | weekly             |

Secure Data Access

To access the Dimensions data, you are given a specific folder in Amazon S3 with read-only access. The credentials will be passed along using secure communication.

Dimensions provides AWS credentials that allow access from a variety of tools.

Configure your tool with the following credentials:

S3 Bucket Path:	ai.dimensions.data/sourcename
Access Key ID:	XXXX
Secret Access Key:	XXXX

Make sure that you enter the whole path, including the source you want to access. You will find the specific path for each source in the "S3 bucket path" sections below. After setting up the credentials in the chosen tool, downloading a file is a simple drag and drop operation.

To download a single file with Amazon's command line tools:

aws s3 cp s3://ai.dimensions.data/sourcename/YYYYMMDD target

To download significant amounts or to pull all updates, the sync command is very helpful (it only downloads files that are new or were not downloaded successfully before):

aws s3 sync s3://ai.dimensions.data/sourcename/YYYYMMDD target

Note that for both of these, the target can be another location on S3.

If you store your data on S3, copying directly between S3 locations can be much faster than transferring via your own machines. This is especially true within the same region (US-EAST-1) if you set the number of transfer threads high.
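If the sync is driven from a script rather than typed by hand, the same command can be assembled programmatically. The sketch below only wraps the `aws s3 sync` call shown above; the helper names are hypothetical, and it assumes the AWS CLI is installed and already configured with the credentials you received.

```python
import subprocess
from typing import List

BUCKET = "ai.dimensions.data"  # bucket name taken from the paths in this document


def sync_command(source: str, release: str, target: str) -> List[str]:
    """Build the `aws s3 sync` invocation for one dated release folder."""
    return ["aws", "s3", "sync", f"s3://{BUCKET}/{source}/{release}", target]


def sync_release(source: str, release: str, target: str) -> None:
    """Run the sync; raises CalledProcessError if the AWS CLI exits non-zero."""
    subprocess.run(sync_command(source, release, target), check=True)


# Usage (target may be a local directory or another s3:// location):
# sync_release("publications", "20210121", "data/publications/20210121")
```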

Data delivery

Data will be provided as an initial, large set of files called a baseset. The initial folder will look something like this:

0000000001_0000000098/records_0000001.jsonl
0000000001_0000000098/records_0000002.jsonl	

Grants will be provided as a single, complete folder with each release. The other sources will receive regular updates within a release.

For new data, a new folder containing new and updated documents will be created on a regular basis. These documents will be complete: if anything changes, the whole document will appear again, so no complex diffing is required:

0000000099_0000000099/records_0000001.jsonl
0000000099_0000000099/records_0000002.jsonl	

The folders can be ordered lexicographically, but there is no correspondence between the filenames from one update to another. 0000000099_0000000099/records_0000001.jsonl does not contain the updates to 0000000001_0000000098/records_0000001.jsonl; it is simply the first batch of updated documents.

While these may seem like restrictions, the goal is to provide you with a simple process for getting up to date quickly rather than provide complex guarantees for random access.

  1. Download each folder in order (lexicographically).
  2. If an item has an ID you have seen before, replace the previous document with that ID.

When items are deleted (for example, when they are discovered to be a duplicate and merged) they will appear with their ID and an obsolete status (details in the schemas).
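The update process described above can be sketched as a simple replay over the folders. This is an illustration under stated assumptions, not the official procedure: the field names `id` and `status` and the value `obsolete` are guesses that should be checked against the JSON schemas, and the in-memory dictionary stands in for whatever store you actually use (a full baseset will not fit in memory on most machines).

```python
import json
from pathlib import Path
from typing import Dict

# Assumed field names and value; check the JSON schemas for the real ones.
ID_KEY = "id"
STATUS_KEY = "status"
OBSOLETE = "obsolete"


def load_release(release_dir: Path) -> Dict[str, dict]:
    """Replay the baseset and update folders of one release, oldest first."""
    documents: Dict[str, dict] = {}
    # Folder names such as 0000000001_0000000098 sort correctly as plain strings.
    for folder in sorted(p for p in release_dir.iterdir() if p.is_dir()):
        for path in sorted(folder.glob("*.jsonl")):
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    if not line.strip():
                        continue
                    doc = json.loads(line)
                    if doc.get(STATUS_KEY) == OBSOLETE:
                        # Deleted or merged item: drop any earlier copy.
                        documents.pop(doc[ID_KEY], None)
                    else:
                        # New or updated item: documents are complete, so overwrite.
                        documents[doc[ID_KEY]] = doc
    return documents
```

Because updated documents are always complete, the replay never has to merge fields; the last version seen for an ID wins.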

New basesets

Over time, significant updates will happen across the data. To avoid having to process years of daily updates and seeing the same documents repeatedly, a new baseset will be created from time to time. Both the current and the previous baseset will continue to receive updates, to smooth the switchover. This also allows time to prepare for any format changes.

Each provided group of baseset & updates will be placed in a folder with a date, and so you can pick the latest date for the latest data.
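Picking the latest release then amounts to taking the largest date-named folder. The sketch below assumes the folder names are eight-digit dates such as 20210121, as in the paths used elsewhere in this document; the function name is illustrative, and the listing itself (for example from `aws s3 ls`) is left to the caller.

```python
import re
from typing import Iterable, Optional

DATE_FOLDER = re.compile(r"^\d{8}$")  # assumed YYYYMMDD naming, e.g. 20210121


def latest_release(folder_names: Iterable[str]) -> Optional[str]:
    """Return the most recent YYYYMMDD folder name, or None if there is none."""
    names = [name.strip("/") for name in folder_names]
    dated = [name for name in names if DATE_FOLDER.match(name)]
    # Fixed-width dates compare correctly as strings.
    return max(dated, default=None)
```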

Data formats

Shared

There are several structures in the data that are shared between different content types. The latest formats for these are:

People for publication authors, grant investigators, etc.

Funding for links to grants and funders.

Categories for classification into FOR, RCDC, etc.

The categories contain IDs that can be looked up in the categories lookup releases in the following location:

s3://ai.dimensions.data/categories-lookup/

Categories

Over time, we update the machine learning models or training data behind the classifications. When there are significant changes, we update the version of the model. The version used for a classification is provided in the categories list under the key version.

To support migration from one model to another, new models will be added in preview before being released to production. Once a new model is released, the previous version will be deprecated.

Currently, all production models are version 1.

| Category type         | Version | Status     |
|-----------------------|---------|------------|
| broad_research_areas  | 1       | production |
| broad_research_areas  | 2020    | preview    |
| cancer_types          | 1       | production |
| cancer_types          | 2020    | preview    |
| cso                   | 1       | production |
| cso                   | 2020    | preview    |
| for                   | 1       | production |
| for                   | 2020    | preview    |
| ford                  | 2020    | preview    |
| health_research_areas | 1       | production |
| health_research_areas | 2020    | preview    |
| hrcs_hc               | 1       | production |
| hrcs_hc               | 2020    | preview    |
| hrcs_rac              | 1       | production |
| hrcs_rac              | 2020    | preview    |
| rcdc                  | 1       | production |
| rcdc                  | 2020    | preview    |
| sdg                   | 1       | production |
| uoa                   | 1       | production |
| uoa                   | 2020    | preview    |
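While a preview model sits alongside the production one, a consumer may want to keep only the classifications from a single version. The sketch below assumes that each entry in a categories list is an object carrying its own `version` key; only that key comes from the description above, every other field in the example is made up, and the real layout should be taken from the Categories schema.

```python
from typing import Iterable, List


def entries_for_version(entries: Iterable[dict], version: str) -> List[dict]:
    """Keep only the classification entries produced by the given model version."""
    # Compare as strings so that 1 and "1" are treated alike.
    return [entry for entry in entries if str(entry.get("version")) == version]


# Usage with made-up entries (the "id" values are placeholders, not real IDs):
# entries_for_version([{"id": "x", "version": 1}, {"id": "y", "version": 2020}], "1")
```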

Publications

S3 bucket path

The S3 bucket path to access publications:

s3://ai.dimensions.data/publications

Data format

The latest release in s3://ai.dimensions.data/publications/20210121/ contains documents that may be active in format version 7 (see JSON schema here) or obsolete (see JSON schema here).

Version 7 release notes

This format version is backwards compatible with versions 5 and 6; no fields have been removed.

There are several new fields to be aware of:

Patents

S3 bucket path

The S3 bucket path to access patents:

s3://ai.dimensions.data/patents

Data format

The latest release in s3://ai.dimensions.data/patents/20210309/ contains documents that may be active or obsolete in format version 3 (see JSON schema here). An active document is either an update or a new document, while obsolete indicates that the previous version of this document should be deleted.

Grants

S3 bucket path

The S3 bucket path:

s3://ai.dimensions.data/grants

Data format

With the release in s3://ai.dimensions.data/grants/20201020/ we started providing documents that comply with this format: see JSON schema here.

Clinical Trials

S3 bucket path

The S3 bucket path:

s3://ai.dimensions.data/clinicaltrials

Data format

With the release in s3://ai.dimensions.data/clinicaltrials/20210615/ we started providing documents that comply with this format: see JSON schema here.

(Technical) Reports

S3 bucket path

The S3 bucket path to access reports:

s3://ai.dimensions.data/reports

Data format

Documents provided validate against this JSON schema.

Policy Documents

S3 bucket path

The S3 bucket path to access policy documents:

s3://ai.dimensions.data/policydocuments

Data format

The latest release in s3://ai.dimensions.data/policydocuments/20210428 contains documents in this format: see JSON schema here.