Motivations

The Dimensions bulk export is designed for users who need to analyse significant portions of the Dimensions data. This document is a quick overview of how to use the export and what the data looks like.

While Dimensions provides an API for easily accessing specific documents or groups of documents with various filters, some tasks require using a significant portion of the total data. There are two key difficulties:

Accessing millions of individual documents through the API can take time
Keeping up to date, identifying which documents to update

The Dimensions bulk export is aimed at tackling these two issues. It is not intended as a replacement for the API, but as a complement. Many tasks will be best served by the API, and we would highly encourage users to start with the API rather than rebuilding what we can offer. However, if you are doing analysis over a significant portion of the data (network analysis is a common use case), then the bulk export should make things easier.

General information on the data

Overall structure

The data is provided as folders on AWS S3. It is delivered as JSONL files, where each line is a single JSON document representing a publication, grant, patent document, etc. For the larger datasets, each file contains up to 10,000 documents (typically 10,000 but sometimes fewer). Do not rely on these files to contain any logical grouping (e.g. by date or publisher): the batch sizes are chosen to make downloading and processing simpler, and specific sizes should not be relied upon. More specific information on the regularity and the folder structure can be found in the dedicated document on the different document types.
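As a minimal sketch of how such a file can be consumed, the Python snippet below parses a JSONL stream one line at a time, so a whole batch never has to be held in memory and no batch size is assumed. The function name iter_jsonl and the sample fields (id, title) are illustrative assumptions, not part of the Dimensions schemas.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON document per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines such as a trailing newline
            yield json.loads(line)


# Hypothetical two-document sample; real records follow the published schemas.
sample = io.StringIO(
    '{"id": "pub.1", "title": "A"}\n'
    '{"id": "pub.2", "title": "B"}\n'
)
records = list(iter_jsonl(sample))
print(len(records))
```

For a file on disk, the same function can be given the object returned by open(path, encoding="utf-8").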

Delivery Overview

The following table shows the regularity and the amount of data to be expected as of January 2022. Data is normally delivered as a baseset followed by consecutive updates. If a particular consecutive update is missing, it means there are no updates available for that day.

Source	Delivery	Number of records	Size	Regularity
Categories	always full set	~1000	<100KB	On demand - when we add new categories
Organisations	always full set	~398K	<130MB	3-4 times per year
Funder Groups	always full set	~20	About 20KB	10-12 times a year
Research Org Groups	always full set	~60	About 70KB	10-12 times a year
Publications	baseset	~161M	> 3TB	1-2 times a year
	updates	8 - 80K	About 20 - 800MB	daily
Patents	baseset	~173M	About 360GB	1-2 times a year
	updates	50K - 1M	About 4 - 30GB	weekly
Grants	always full set	7M	About 14 GB	10-12 times a year
Clinical Trials	baseset	951K	About 3.2GB	2-3 times a year
	updates	1 - 5K	5MB - 35 MB	daily
Data sets	baseset	49M	About 27GB	1-2 times a year
	updates	2 - 200K	5MB - 2GB	weekly
Policy Documents	baseset	~2.3M	~6.5GB	1-3 times a year
	updates	1 - 10K	10MB	weekly
(Technical) Reports	baseset	~2 M	About 15 GB	1-2 times a year
	updates	1 - 10K	5 - 25MB	daily

Secure Data Access

To access the Dimensions data, you are given a specific folder in Amazon S3 with read-only access. The credentials will be sent over a secure channel.

Dimensions provides AWS credentials to allow access for a variety of tools, e.g.:

CrossFTP (http://www.crossftp.com/)
S3Fox - a plugin for Firefox (http://www.s3fox.net/)
S3 Browser (Windows only) (http://s3browser.com/)
CyberDuck (Mac only) (http://cyberduck.io/)
Amazon’s excellent command line tools at http://aws.amazon.com/cli/.

Configuring your tool with credentials:

S3 Bucket Path:	ai.dimensions.data/sourcename
Access Key ID:      XXXX
Secret Access Key:  XXXX

Make sure that you enter the whole path, including the source you want to access. You will find the specific path in the documentation of each source below, in the “S3 bucket path” sections. After setting up credentials in the chosen tool, downloading a file is a simple drag-and-drop operation.

To automate access you can also use Amazon’s command line tools. Once you have access to a certain sourcename, you can use the aws s3 ls command to get a list of all the releases available:

aws s3 ls s3://ai.dimensions.data/sourcename/

A single release is always identified by an additional date YYYYMMDD as part of the path marking the release. With Amazon’s command line tools you can now download a single release (a release is a folder, so cp needs the --recursive flag):

aws s3 cp --recursive s3://ai.dimensions.data/sourcename/YYYYMMDD target

To download significant amounts of data or to pull all updates, the sync command is very helpful (it only downloads files that are new or were not downloaded successfully):

aws s3 sync s3://ai.dimensions.data/sourcename/YYYYMMDD target

Note that for both cp and sync the target can be another location on S3.

If you store your data on S3, copying directly between S3 locations can be much faster than routing the data through your own machines. This is especially true within the same region (US-EAST-1) and with a high number of concurrent transfers (in the AWS CLI, the max_concurrent_requests setting).
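To script the commands above, the following sketch builds the aws s3 sync invocation for one release as an argument list. The helper names release_uri and sync_command and the local target path are assumptions made for illustration; the bucket name and the example release date are taken from this document. The snippet only builds the command and does not run it.

```python
import shlex
from typing import List

BUCKET = "ai.dimensions.data"


def release_uri(source: str, release: str) -> str:
    """Return the S3 URI of one release folder of a source."""
    return f"s3://{BUCKET}/{source}/{release}"


def sync_command(source: str, release: str, target: str) -> List[str]:
    """Build the argument list for `aws s3 sync` of one release."""
    return ["aws", "s3", "sync", release_uri(source, release), target]


# Hypothetical local target directory.
cmd = sync_command("publications", "20240919", "./publications/20240919")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where the AWS CLI is installed and configured with the provided credentials.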

Data delivery

Most of the data will be provided as an initial, large set of files called a baseset. The initial folder will look something like this:

0000000001_0000000098/records_0000001.jsonl
0000000001_0000000098/records_0000002.jsonl

Grants will be provided as a single folder with each release. The other sources will receive regular updates within a release.

For new data, a new folder will be created with new and updated documents on a regular basis. These documents will be complete; that is, if anything changes, the whole document will appear again (so no complex diffing is required):

0000000099_0000000099/records_0000001.jsonl
0000000099_0000000099/records_0000002.jsonl

The folders will be orderable lexicographically, and there is no correspondence between the filenames in one update and those in another. 0000000099_0000000099/records_0000001.jsonl does not contain the updates to 0000000001_0000000098/records_0000001.jsonl; it is simply the first batch of updated documents.

While these may seem like restrictions, the goal is to provide you with a simple process for getting up to date quickly rather than provide complex guarantees for random access.

Download each folder in lexicographic order. If an item has an ID you have seen before, replace the previous document with that ID.

When items are deleted (for example, when they are discovered to be a duplicate and merged) they will appear with their ID and an obsolete status (details in the schemas).
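The update procedure described above can be sketched as follows: folders are processed in lexicographic order, a document with a known ID replaces the earlier one, and obsolete documents are dropped. The field names id and status and the value "obsolete" are assumptions for illustration; check them against the schema of the source you are loading. An in-memory dict is used only to keep the sketch short.

```python
from typing import Dict, Iterable


def apply_folders(folders: Dict[str, Iterable[dict]]) -> Dict[str, dict]:
    """Merge a baseset and its updates into the latest state, keyed by ID.

    `folders` maps a folder name to the documents found in that folder.
    """
    state: Dict[str, dict] = {}
    for name in sorted(folders):  # lexicographic folder order
        for doc in folders[name]:
            if doc.get("status") == "obsolete":  # assumed field and value
                state.pop(doc["id"], None)  # deleted or merged document
            else:
                state[doc["id"]] = doc  # new document or full replacement
    return state


# Hypothetical documents; the folder names follow the examples above.
folders = {
    "0000000099_0000000099": [
        {"id": "pub.1", "title": "A (revised)"},
        {"id": "pub.2", "status": "obsolete"},
    ],
    "0000000001_0000000098": [
        {"id": "pub.1", "title": "A"},
        {"id": "pub.2", "title": "B"},
    ],
}
state = apply_folders(folders)
print(sorted(state))
```

At full scale the same logic would typically write to a database or key-value store rather than a dict.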

New basesets

Over time, significant updates will happen across the data. To avoid having to process years of daily updates and see the same documents repeatedly, a new baseset will be created from time to time. Both the current and the previous baseset will continue to receive updates, to smooth the switchover. This also allows time to prepare for any format changes.

Each provided group of baseset & updates will be placed in a folder with a date, and so you can pick the latest date for the latest data.
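Picking the latest date can be automated. The sketch below selects the most recent release from a list of folder names, accepting both the YYYYMMDD form and the YYYY-MM-DD form that appears in some paths in this document. It assumes the names have already been extracted from a listing such as the output of aws s3 ls; the function name latest_release is hypothetical.

```python
import re
from typing import Iterable, Optional

# Matches 20240919, 20240919/, 2023-02-23 and 2023-02-23/
_RELEASE = re.compile(r"^(\d{4})-?(\d{2})-?(\d{2})/?$")


def latest_release(names: Iterable[str]) -> Optional[str]:
    """Return the folder name with the most recent date, or None."""
    best_key, best_name = None, None
    for name in names:
        name = name.strip()
        match = _RELEASE.match(name)
        if not match:
            continue  # skip anything that is not a dated release folder
        key = "".join(match.groups())  # normalise to YYYYMMDD for comparison
        if best_key is None or key > best_key:
            best_key, best_name = key, name.rstrip("/")
    return best_name


print(latest_release(["20220726/", "20240919/", "20230505/", "README.txt"]))
```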

Data formats

Shared

There are several structures in the data that are shared between different content types. The latest formats for these are:

People for publication authors, grant investigators, etc.

Funding for links to grants and funders.

Categories for classification into FOR, RCDC, etc.

Category type	Version	Status
arxiv_cs	2022	production
bra	1	deprecated
broad_research_areas	2020	production
for	1	deprecated
for_2020	2022	production
ford	2020	preview
hra	1	production
hlbs	2020	production
hrcs_hc	1	deprecated
hrcs_hc	2020	production
hrcs_rac	1	deprecated
hrcs_rac	2020	production
icrp_ct	1	deprecated
icrp_ct	2020	production
icrp_cso	1	deprecated
icrp_cso	2020	production
rcdc	1	deprecated
rcdc	2020	deprecated
rcdc	2023	production
rcdc	2024	preview
sdg	1	deprecated
sdg	2021	production
uoa	1	deprecated
uoa	2023	production

Organisations

S3 bucket path

The S3 bucket path to access organisations: s3://ai.dimensions.data/organisations/

Data format

The organisations data is delivered as one JSONL file 3-4 times a year. No updates are provided between those releases. The delivery contains documents that may be active (see JSON schema here), obsolete (see JSON schema here) or redirected (see JSON schema here).

Funder Groups

S3 bucket path

The S3 bucket path:

s3://ai.dimensions.data/funder_groups

Data format

Version 1 release notes

With the release in s3://ai.dimensions.data/funder_groups/2023-02-23/ (starting January release 2.94), we started providing documents that comply with this format: see JSON schema here.

Research Org Groups

S3 bucket path

The S3 bucket path:

s3://ai.dimensions.data/research_org_groups

Data format

Version 1 release notes

With the release in s3://ai.dimensions.data/research_org_groups/2023-02-23/ (starting January release 2.94), we started providing documents that comply with this format: see JSON schema here.

Publications

S3 bucket path

The S3 bucket path to access publications:

s3://ai.dimensions.data/publications

Data format

Version 11 release notes

Release available at s3://ai.dimensions.data/publications/20240919/. The release contains documents that may be active (see version 11.schema.json here) or obsolete (see JSON schema here). Fields added:

pubmed_publication_types
accepted_publication_date
submitted_publication_date
publication_updates (not yet populated)
event

Version 10 release notes

Release available at s3://ai.dimensions.data/publications/20230505/. The release contains documents that may be active (see version 10.schema.json here) or obsolete (see JSON schema here).

Fields added:

concepts_scores_v6
created_in_dimensions
doctype_classification_v1
doctype_is_citable_v1
mesh
pubmed_publication_types
source->issn
source->eissn

Fields deprecated for future removal:

author_affiliations
concepts
concepts_scores
mesh_headings
mesh_terms
open_access_categories

Deprecated fields were removed:

doctype_classification_v0
doctype_is_citable_v0
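Because fields are deprecated in one version and removed in a later one, code that reads several releases may want to prefer the newer field and fall back to the older one. The sketch below shows one way to do that. The helper first_present is hypothetical, and the placeholder values do not reflect the real structure of these fields; see the schemas for that.

```python
from typing import Any, Sequence


def first_present(record: dict, fields: Sequence[str], default: Any = None) -> Any:
    """Return the value of the first listed field that is present and not null."""
    for field in fields:
        value = record.get(field)
        if value is not None:
            return value
    return default


# Newest field first, deprecated field as fallback.
CONCEPT_FIELDS = ("concepts_scores_v6", "concepts_scores")

# Placeholder values only; real field contents follow the schemas.
older = {"id": "pub.1", "concepts_scores": ["old"]}
newer = {"id": "pub.2", "concepts_scores_v6": ["new"], "concepts_scores": ["old"]}

print(first_present(older, CONCEPT_FIELDS))
print(first_present(newer, CONCEPT_FIELDS))
```

Values from the two fields come from different extraction versions, so they are not necessarily comparable with each other.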

Version 9 release notes

Release available at s3://ai.dimensions.data/publications/20220726/. The release contains documents that may be active (see version 9.schema.json here) or obsolete (see JSON schema here).

Fields added:

repository_dois
funding_section
doctype_classification_v0
doctype_is_citable_v0

A few deprecated fields were removed:

for - all category systems are available as part of the categories
created_in_dimensions
version_of_record
language
journal_lists
journal - redundant, use source instead

Version 8 release notes

With the latest release in s3://ai.dimensions.data/publications/20220105/ we started providing the data as compressed gzip files. The release contains documents that may be active (see version 8.schema.json here) or obsolete (see JSON schema here).
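Since older releases are plain JSONL and newer ones are gzip-compressed, a reader can detect the compression from the first two bytes of the file instead of relying on a file extension, which is not specified here. This is a minimal sketch; the function name iter_records, the UTF-8 encoding, and the requirement of a seekable binary stream are assumptions.

```python
import gzip
import io
import json
from typing import BinaryIO, Iterator

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def iter_records(raw: BinaryIO) -> Iterator[dict]:
    """Yield JSON documents from a plain or gzip-compressed JSONL stream.

    `raw` must be a seekable binary stream, e.g. open(path, "rb").
    """
    head = raw.read(2)
    raw.seek(0)
    binary = gzip.GzipFile(fileobj=raw) if head == GZIP_MAGIC else raw
    with io.TextIOWrapper(binary, encoding="utf-8") as text:
        for line in text:
            if line.strip():
                yield json.loads(line)


# Hypothetical sample documents.
payload = b'{"id": "pub.1"}\n{"id": "pub.2"}\n'
plain = list(iter_records(io.BytesIO(payload)))
packed = list(iter_records(io.BytesIO(gzip.compress(payload))))
print(plain == packed)
```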

This format version is backwards compatible with versions 5, 6 and 7; no fields have been removed.

There are 2 new fields to be aware of:

open_access_linkout
copyright_statement

In addition, the categories list now provides a new preview version 2021 of sdg and a preview version 2020 of ford.

Version 7 release notes

This format version is backwards compatible with versions 5 and 6; no fields have been removed.

There are several new fields to be aware of:

online_publication_date and print_publication_date.
In addition to the current concepts list, concepts_scores provides relevance scores for each concept identified.
open_access_categories_v2 is in preview, with an updated definition of open access. When this is released into production we will notify users that they can now rely on it.
resulting_publication_doi allows linking from preprints to their final published version.
mesh_terms has been added, which includes all subheadings. The headings on their own remain under mesh_headings.
isbn, eisbn and arxiv_id.

Patents

S3 bucket path

The S3 bucket path to access patents:

s3://ai.dimensions.data/patents

Data format

Version 7 release notes

Latest release in s3://ai.dimensions.data/patents/20250702/ - see latest schema 7.schema.json.

Fields added:

npc_references

Deprecated fields:

publication_references

Removed fields:

patent_family

Version 6 release notes

Latest release in s3://ai.dimensions.data/patents/20231011/ - see latest schema 6.schema.json.

Fields added:

document_category
application_reference_id
inventors_details

Version 5 release notes

Latest release in s3://ai.dimensions.data/patents/20221220/ contains documents that may be obsolete or active - see 5.schema.json.

Fields added:

claims_amount
figures_amount
created_in_dimensions

Removed deprecated fields:

current_ifi_orgs
original_ifi_orgs

Version 4 release notes

Latest release in s3://ai.dimensions.data/patents/20220715/ contains documents that may be obsolete or active - see 4.schema.json. active can be either updates or new documents, while obsolete indicates that the previous version of this document should be deleted.

Fields added:

federal_support
orange_book
concepts_scores

Removed deprecated fields:

for - all category systems are available as part of the categories
citations

Fields marked as deprecated (not to be used):

original_ifi_orgs
current_ifi_orgs

Version 3 release notes

The latest release in s3://ai.dimensions.data/patents/20210309/ contains documents that may be active or obsolete in format version 3 (see JSON schema here). active can be either updates or new documents, while obsolete indicates that the previous version of this document should be deleted.

Grants

S3 bucket path

The S3 bucket path:

s3://ai.dimensions.data/grants

Data format

Version 7 release notes

With the release in s3://ai.dimensions.data/grants/20230726/ we started providing documents that comply with this format: see JSON schema here. Other changes in this version:

Added new field concepts_scores_v6 for Terms Extraction 6
Deprecated concepts_scores (since it’s based on the old Terms Extraction 5)

Version 6 release notes

With the release in s3://ai.dimensions.data/grants/20230425/ we started providing documents that comply with this format: see JSON schema here. Other changes in this version:

Added NIH activity codes

Version 5 release notes

With the release in s3://ai.dimensions.data/grants/20230320/ we started providing documents that comply with this format: see JSON schema here. Other changes in this version:

Added project numbers

Version 4 release notes

With the release in s3://ai.dimensions.data/grants/20230222/ we started providing documents that comply with this format: see JSON schema here. Other changes in this version:

Removed funder_groups from schema
Added funding schemes and keywords

Clinical Trials

S3 bucket path

The S3 bucket path:

s3://ai.dimensions.data/clinicaltrials

Data format

Version 8 release notes

The latest release is at s3://ai.dimensions.data/clinicaltrials/20250501/. There is no change in the schema: see JSON schema here.

A couple of changes regarding the rcdc classification:

rcdc_2024 - added under categories
rcdc_v1 - removed from categories

Version 7 release notes

The latest release at s3://ai.dimensions.data/clinicaltrials/20240604/ provides documents that comply with this format: see JSON schema here.

At the top level the following fields were added:

inclusion_criteria - Eligibility criteria for a person to participate in the clinical study
exclusion_criteria - Criteria that exclude a person from participating in the clinical study
results_first_posted - The date on which summary results information was first available
secondary_ids_json - Details about any identifier(s), other than the organization’s Unique Protocol Identification Number, assigned to the clinical study

A few deprecated older classification fields were removed:

for_2020, for, for_first, for_v2, broad_research_areas, hrcs_hc, hrcs_rac, hlbs, cancer_types, cso, units_of_assessment, sdg, hrcs_rac_v2 - These older classification fields are removed and replaced by the corresponding latest versions.
concepts_scores_5.4 - Fields from Terms extraction version 5.4

Version 6 release notes

The latest release at s3://ai.dimensions.data/clinicaltrials/20230522/ provides documents that comply with this format: see JSON schema here.

At the top level the following fields were added:

primary_completion_date - primary completion date provided by registries
concepts_scores_v6 - Fields from Terms extraction version 6

Version 5 release notes

The latest release at s3://ai.dimensions.data/clinicaltrials/20221219/ provides documents that comply with this format: see JSON schema here.

At the top level the following fields were added:

secondary_ids - List of secondary ids provided by registries
created_in_dimensions - Date when trials data is added to Dimensions

Version 4 release notes

The latest release at s3://ai.dimensions.data/clinicaltrials/20220712/ provides documents that comply with this format: see JSON schema here.

At the top level the following fields were added:

keywords - List of keywords from clinicaltrials.gov
enrollment_amount - Amount of planned or verified participants
overall_status - As provided by the registries
minimum_age - Minimum eligible age
maximum_age - Maximum eligible age
eligibility_criteria
study_type
Arrays of
- outcome_measure - “Measure for evaluating the effect of an intervention/treatment”
- study_arm - “A group or subgroup of participants that receives a specific intervention/treatment, or no intervention, according to the trial’s protocol”
- intervention - “process or action that is the focus of a clinical study”
- study_designs

Datasets

S3 bucket path

The S3 bucket path:

s3://ai.dimensions.data/datasets

Data format

Version 5

The latest release can be found in s3://ai.dimensions.data/datasets/20250511/. The records comply with this format: 5.schema.json.

At the top level the following fields were added:

file_details - Dataset file details, e.g. format and size
funding_details - Funding organisation and linked grant ID
publisher_name - Name of the publisher of the dataset document
references - Details about outgoing citations from the dataset document
resource_type - Normalised value of resource type
authors - The authors field is modified to provide more attributes, such as researcher IDs and affiliated organisation IDs
rcdc_2024 - Provided under categories

Version 4

The latest release can be found in s3://ai.dimensions.data/datasets/20240130/. The records comply with this format: 4.schema.json.

Changes in schema version:

Added new field concepts_scores_v6 for Terms Extraction 6
Deprecated concepts_scores (since it’s based on the old Terms Extraction 5)

Version 3

With version 3, concept scores were introduced. The latest release can be found in s3://ai.dimensions.data/datasets/20220907/. The records comply with this format: 3.schema.json.

Version 2

With the release in s3://ai.dimensions.data/datasets/20210929/ we started providing the data as compressed gzip files. The records comply with this format: 2.schema.json.

Policy Documents

S3 bucket path

The S3 bucket path to access policy documents:

s3://ai.dimensions.data/policydocuments

Data format

Version 4

The latest release in s3://ai.dimensions.data/policydocuments/20250429 contains documents in this format: see JSON schema here.

rcdc_v2024 is added

A few deprecated older classification fields were removed:

for_v2020, for_v1, bra_v1, hrcs_hc_v1, hrcs_rac_v1, hlbs_v1, cancer_types, cso, uoa_v1, sdg_v1, rcdc_v1 - These older classification fields are removed and replaced by the corresponding latest versions.
concepts_scores_5.4 - Fields from Terms extraction version 5.4

Version 3

The latest release in s3://ai.dimensions.data/policydocuments/20230626 contains documents in this format: see JSON schema here.

(Technical) Reports

The S3 bucket path to access reports:

s3://ai.dimensions.data/reports

Data format

Version 5

The latest release in s3://ai.dimensions.data/reports/20250704 contains documents in this format: JSON schema.

The following fields are added in the new schema:

grant_ids - Dimensions IDs of grants associated with the report.
publication_ids - Dimensions publication ids resolved from citations

Version 4

The latest release in s3://ai.dimensions.data/reports/20241019 contains documents in this format: JSON schema.

Changes in schema version:

rcdc_2024 is added under categories

A few deprecated older classification fields were removed:

for_2020, for_v2, bra, hrcs_hc, hrcs_rac, icrp_ct, icrp_cso, uoa, sdg, rcdc - These older classification fields are removed and replaced by the corresponding latest versions.