../../_images/badge-colab.svg ../../_images/badge-github-custom.svg

Extracting researchers based on affiliations and publications history

The purpose of this notebook is to demonstrate how to extract researchers data using the Dimensions API.

Specifically, we want to find out about all researchers based on these two criteria:

  1. Affiliation. That is, whether they are / have been affiliated to a specific GRID organization

  2. Publications. Whether they have published within a chosen time frame.

[2]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 25, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[3]:
!pip install dimcli -U --quiet

import dimcli
from dimcli.utils import *

import os, sys, time, json
import pandas as pd
from IPython.display import Image
from IPython.core.display import HTML

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

Select an organization ID

For the purpose of this exercise, we will are going to use grid.258806.1 and the time frame 2013-2018. Feel free though to change the parameters below as you want, eg by choosing another GRID organization.

[4]:
# sample org: grid.258806.1
GRIDID = "grid.258806.1" #@param {type:"string"}
START_YEAR = 2013 #@param {type:"slider", min:1900, max:2020, step: 1}
END_YEAR = 2018 #@param {type:"slider", min:1900, max:2020, step: 1}

if START_YEAR < END_YEAR: START_YEAR = END_YEAR
YEARS = f"[{START_YEAR}:{END_YEAR}]"

Background: understanding the data model

In order to process researchers affiliation data in the context of publications, we should first take the time to understand how this data is structured in Dimensions.

The JSON results of any query with shape search publications where .... return publications are composed by a list of publications. If we open up one single publication record we will immediately see that authors are stored in a nested object authors that contains a list of dictionaries. Each element in this dictionary represents one single publication author and includes other information e.g. name, surname, ID, the organizations he/she is affiliated with etc..

For example, in order to extract the second author of the tenth publication from our results we would do the following: results.publications[10]['authors'][1]:

# author info
...
    {'first_name': 'Noboru',
     'last_name': 'Sebe',
     'orcid': '',
     'current_organization_id': 'grid.258806.1',
     'researcher_id': 'ur.010647607673.28',
     'affiliations': [{'id': 'grid.258806.1',
       'name': 'Kyushu Institute of Technology',
       'city': 'Kitakyushu',
       'city_id': 1859307,
       'country': 'Japan',
       'country_code': 'JP',
       'state': None,
       'state_code': None}]}
 ...

Here’s a object model diagram summing up how data is structured.

[5]:
Image(url= "http://api-sample-data.dimensions.ai/diagrams/data_model_researchers_publications1.v2.jpg", width=800)
[5]:

There are a few important things to keep in mind:

  • Publication Authors VS Researchers. In Dimensions, publication authors don’t necessarily have a researcher ID (eg because they haven’t been disambiguated yet). So a publication may have N authors (stored in JSON within the authors key), but only a subset of these include a researcher_id link. PS see also the searching for researchers section of the API docs for more info on this topic.

  • Time of the affiliation. Researchers can be affiliated to a specific GRID organization either at the time of speaking (now), or at the time of writing (i.e. when the article was published). The DSL uses different properties to express this fact: current_research_org or simply research_orgs.

  • Denormalized fields. Both the Publication and Researcher sources include a research_orgs field. In both cases that’s a denormalized field ie a ‘shortcut version’ of data primarily found in the authors field of publications. However, although they share the same name, the two research_orgs fields don’t have the same meaning.

    • In publications, research_orgs contains the set of all authors’ affiliations for a single article.

    • In researchers, research_orgs contains the set of all research organizations a single individual has been affiliated to throughout his/her career (as far as Dimensions knows, of course!).

So, in the real world we often have scenarios like the following one:

[6]:
Image(url= "http://api-sample-data.dimensions.ai/diagrams/data_model_researchers_publications2.v2.jpg", width=1000)
[6]:

Methodology: two options available

It turns out that there are two possible ways to extract these data, depending on whether we start our queries from Publications or from Researchers.

  1. Starting from the Publication source, we can first filter publications based on our constraints (eg year range [2013-2018] and research_orgs="grid.258806.1" - but it could be any other query parameters); second, we would loop over all of these publications so to extract all relevant reasearchers using the affiliation data.

  2. Starting from the Researcher source, we would first filter researchers based on our constraints (eg with research_orgs="grid.258806.1"); second, we would search for publications linked to these researchers, which have been published in the time frame [2013-2018]; lastly, we extract all relevant reasearchers using the affiliation data.

As we will find out later, both approaches are perfectly valid and return the same results.

The first apporach is generally quicker as it has only two steps (while the second one has three).

In real-world situations though, deciding which approach is best depends on the specific query filters being used and on the impact these filters have on the overall performance/speed of the data extraction. There is no fixed rule and a bit of trial & error can go a long way in helping you optimize your data extraction algorithm!

Approach 1. From publications to researchers

Starting from the publications source, the steps are as follows:

  1. Filtering publications for year range [2013-2018] and research_orgs=“grid.258806.1”

    • i.e. search publications where year in [2013:2018] and research_orgs="grid.258806.1" return publications

  2. Looping over publications’ authors and extracting relevant researchers

    • if ['affiliation']['id'] == "grid.258806.1"

      • => that gives us the affiliations at the time of publishing

    • if ['current_organization_id'] == "grid.258806.1"

      • => that gives us the affiliations at the time of speaking

First off, we can to get all publications based on our search criteria by using a loop query.

[7]:
pubs = dsl.query_iterative(f"""search publications
                                where year in {YEARS} and research_orgs="{GRIDID}"
                            return publications[id+title+doi+year+type+authors+journal+issue+volume]""")
Starting iteration with limit=1000 skip=0 ...
0-1000 / 1052 (1.94s)
1000-1052 / 1052 (1.28s)
===
Records extracted: 1052

First we want to know how many researchers linked to these publications were affiliated to the GRID organization when the publication was created (note: they may be at at a different institution right now).

The affiliation data in publications represent exactly that: we can thus loop over them (for each publication/researcher) and keep only the ones matching our GRID ID.

TIP affiliations can be extracted easily thanks one of the ‘dataframe’ transformation methods in Dimcli: as_dataframe_authors_affiliations

[8]:
# extract affiliations from a publications list
affiliations = pubs.as_dataframe_authors_affiliations()
# select only affiliations for GRIDID
authors_historical = affiliations[affiliations['aff_id'] == GRIDID].copy()
# remove duplicates by eliminating publication-specific data
authors_historical.drop(columns=['pub_id'], inplace=True)
authors_historical.drop_duplicates('researcher_id', inplace=True)
print(f"===\nResearchers with affiliation to {GRIDID} at time of writing:", authors_historical.researcher_id.nunique(), "\n===")
# preview the data
authors_historical
===
Researchers with affiliation to grid.258806.1 at time of writing: 830
===
[8]:
aff_city aff_city_id aff_country aff_country_code aff_id aff_name aff_raw_affiliation aff_state aff_state_code researcher_id first_name last_name
0 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Department of Electrical and Electronic Engine... Meng Ge
5 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Department of Control Engineering, Kyushu Inst... ur.0724144151.56 Joo Kooi Tan
6 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Department of Electrical Engineering, Kyushu I... ur.010214262704.60 Gaber Magdy
11 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Department of Electrical Engineering, Kyushu I... ur.011456332771.53 Yasunori Mitani
14 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Kyushu Institute of Technology, Kitakyushu, 80... ur.013152454441.15 Kazuhiro Toyoda
... ... ... ... ... ... ... ... ... ... ... ... ...
5278 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology 九州工業大学 ur.013230043761.21 幸左 賢二
5283 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Kyushu Institute of Technology ur.015230506045.34 Hirohisa ISOGAI
5290 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Kyushu Institute of Technology, Japan ur.012270025364.95 Naoya Higuchi
5291 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Kyushu Institute of Technology, Japan ur.013003725541.53 Yasunobu Imamura
5294 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Kyushu Institute of Technology, Japan ur.015310517641.55 Takeshi Shinohara

830 rows × 12 columns

Note: * The first ‘publication-affiliations’ dataframe we get may contain duplicate records - eg if an author has more than one publication it’ll be listed twice. That’s why we have an extra step where we drop the pub_id column and simply count unique researchers, based on their researcher ID.

This can be achieved simply by taking into consideration a different field, called current_organization_id, available at the outer level of the JSON author structure (see the data model section above) - outside the affiliations list.

Luckily Dimcli includes another handy method for umpacking authors into a dataframe: as_dataframe_authors

[9]:
authors_current = pubs.as_dataframe_authors()
authors_current = authors_current[authors_current['current_organization_id'] == GRIDID].copy()
authors_current.drop(columns=['pub_id'], inplace=True)
authors_current.drop_duplicates('researcher_id', inplace=True)
print(f"===\nResearchers with affiliation to {GRIDID} at the time of speaking:", authors_current.researcher_id.nunique(), "\n===")
authors_current
===
Researchers with affiliation to grid.258806.1 at the time of speaking: 584
===
[9]:
affiliations corresponding current_organization_id first_name last_name orcid raw_affiliation researcher_id
5 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 Joo Kooi Tan None [Department of Control Engineering, Kyushu Ins... ur.0724144151.56
9 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 Yasunori Mitani None [Department of Electrical Engineering, Kyushu ... ur.011456332771.53
12 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 Kazuhiro Toyoda None [Kyushu Institute of Technology, Kitakyushu, 8... ur.013152454441.15
13 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 Mengu Cho None [Kyushu Institute of Technology, Kitakyushu, 8... ur.013152112405.28
14 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 Siowhwa Teo [0000-0001-9927-8309] [Graduate School of Life Science and Systems E... ur.010525770330.49
... ... ... ... ... ... ... ... ...
4763 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 古川 徹生 None [九州工業大学大学院生命体工学研究科] ur.013546017565.95
4770 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 西田 治男 None [九州工業大学大学院生命体工学研究科] ur.07662175025.31
4784 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 Naoya Higuchi None [Kyushu Institute of Technology, Japan] ur.012270025364.95
4785 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 Yasunobu Imamura None [Kyushu Institute of Technology, Japan] ur.013003725541.53
4788 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 Takeshi Shinohara None [Kyushu Institute of Technology, Japan] ur.015310517641.55

584 rows × 8 columns

Approach 2. From researchers to publications

Using this approach, we start our search from the ‘researchers’ database (instead of the ‘publications’ database).

There are 3 main steps:

  1. Filtering researchers with research_orgs=GRID-ID (note: this gives us affiliated researches at any point in time)

    • search researchers where research_orgs="grid.258806.1" return researchers

  2. Searching for publications linked to these researchers and linked to GRID-ID, which have been published in the time frame [2013-2018]

    • search publications where researchers.id in {LIST OF IDS} and year in [2013:2018] and research_orgs="grid.258806.1" return publications

    • NOTE: this a variation of the Approach-1 query above: we have just added the researchers IDs filter (thus reducing the search space)

  3. Extracting relevant researchers from publications, using the same exact steps as in approach 1 above.

    • if ['current_organization_id'] == "grid.258806.1"

      • => that gives us the affiliations at the time of speaking

    • or if ['affiliation']['id'] == "grid.258806.1"

      • => that gives us the affiliations at the time of publishing

[10]:
q = f"""search researchers where research_orgs="{GRIDID}"
        return researchers[basics]"""
researchers_json = dsl.query_iterative(q)
researchers = researchers_json.as_dataframe()
researchers.head()
Starting iteration with limit=1000 skip=0 ...
0-1000 / 11838 (5.43s)
1000-2000 / 11838 (3.73s)
2000-3000 / 11838 (2.46s)
3000-4000 / 11838 (1.84s)
4000-5000 / 11838 (3.44s)
5000-6000 / 11838 (2.64s)
6000-7000 / 11838 (2.29s)
7000-8000 / 11838 (1.98s)
8000-9000 / 11838 (2.25s)
9000-10000 / 11838 (2.40s)
10000-11000 / 11838 (3.11s)
11000-11838 / 11838 (3.06s)
===
Records extracted: 11838
[10]:
first_name id last_name research_orgs orcid_id
0 Makoto ur.07777767140.16 Kobayashi [{'acronym': 'KIT', 'city_name': 'Kitakyushu',... NaN
1 Ashraful M ur.07777460411.31 Hoque [{'city_name': 'Houston', 'country_name': 'Uni... [0000-0002-3970-8538]
2 Fujio ur.07777374057.66 Kubo [{'city_name': 'Hiroshima', 'country_name': 'J... NaN
3 Muhammad Akmal Kamarudin ur.07777204200.43 Kamarudin [{'acronym': 'UM', 'city_name': 'Kuala Lumpur'... [0000-0002-2256-5948]
4 Chinatsu ur.0777713743.09 Ikeura [{'acronym': 'KIT', 'city_name': 'Kitakyushu',... NaN

Now we need to select only the researchers who have published in the time frame [2013:2018]. So for each researcher ID we must extract the full publication history in order to verify their relevance.

The most efficient way to do this is to use a query that extracts the publication history for several researchers at the same time (so to avoid overruning our API quota), then, as a second step, producing a clean list of relevant researchers from it.

[15]:
results = []
researchers_ids = list(researchers['id'])
# no of researchers IDs per query: so to ensure we never hit the 1000 records limit per query
CHUNKS_SIZE = 300

q = """search publications
                where researchers.id in {}
                and year in {}
                and research_orgs="{}"
            return publications[id+title+doi+year+type+authors+journal+issue+volume] limit 1000"""


for chunk in chunks_of(researchers_ids, size=CHUNKS_SIZE):
    data = dsl.query(q.format(json.dumps(chunk), YEARS, GRIDID))
    try:
        results += data.publications
    except:
        pass

print("---\nFound", len(results), "publications for the given criteria (including duplicates)")

# simulate a DSL payload using Dimcli
pubs_v2 = dimcli.DslDataset.from_publications_list(results)

# transform to a dataframe to remove duplicates quickly
pubs_v2_df = pubs_v2.as_dataframe()
pubs_v2_df.drop_duplicates("id", inplace=True)
print("Final result:", len(pubs_v2_df), "unique publications")
Returned Publications: 59 (total = 59)
Time: 1.99s
Returned Publications: 49 (total = 49)
Time: 0.93s
Returned Publications: 48 (total = 48)
Time: 0.93s
Returned Publications: 61 (total = 61)
Time: 1.02s
Returned Publications: 36 (total = 36)
Time: 0.86s
Returned Publications: 41 (total = 41)
Time: 0.85s
Returned Publications: 64 (total = 64)
Time: 0.94s
Returned Publications: 36 (total = 36)
Time: 0.91s
Returned Publications: 29 (total = 29)
Time: 0.98s
Returned Publications: 48 (total = 48)
Time: 0.93s
Returned Publications: 40 (total = 40)
Time: 0.90s
Returned Publications: 23 (total = 23)
Time: 0.79s
Returned Publications: 45 (total = 45)
Time: 0.93s
Returned Publications: 53 (total = 53)
Time: 1.13s
Returned Publications: 38 (total = 38)
Time: 0.94s
Returned Publications: 61 (total = 61)
Time: 0.96s
Returned Publications: 36 (total = 36)
Time: 1.76s
Returned Publications: 59 (total = 59)
Time: 1.06s
Returned Publications: 52 (total = 52)
Time: 0.94s
Returned Publications: 47 (total = 47)
Time: 0.96s
Returned Publications: 25 (total = 25)
Time: 0.88s
Returned Publications: 52 (total = 52)
Time: 0.98s
Returned Publications: 67 (total = 67)
Time: 0.98s
Returned Publications: 72 (total = 72)
Time: 1.02s
Returned Publications: 26 (total = 26)
Time: 0.83s
Returned Publications: 64 (total = 64)
Time: 0.95s
Returned Publications: 52 (total = 52)
Time: 0.96s
Returned Publications: 42 (total = 42)
Time: 0.85s
Returned Publications: 75 (total = 75)
Time: 0.94s
Returned Publications: 81 (total = 81)
Time: 1.01s
Returned Publications: 99 (total = 99)
Time: 1.06s
Returned Publications: 22 (total = 22)
Time: 0.91s
Returned Publications: 109 (total = 109)
Time: 0.94s
Returned Publications: 48 (total = 48)
Time: 0.91s
Returned Publications: 38 (total = 38)
Time: 0.85s
Returned Publications: 78 (total = 78)
Time: 0.92s
Returned Publications: 53 (total = 53)
Time: 0.89s
Returned Publications: 72 (total = 72)
Time: 1.03s
Returned Publications: 50 (total = 50)
Time: 1.06s
Returned Publications: 15 (total = 15)
Time: 0.69s
---
Found 2065 publications for the given criteria (including duplicates)
Final result: 904 unique publications

This step is basically the same as in approach 1 above.

[16]:
# extract affiliations from a publications list
affiliations_v2 = pubs_v2.as_dataframe_authors_affiliations()
# select only affiliations for GRIDID
authors_historical_v2 = affiliations_v2[affiliations_v2['aff_id'] == GRIDID].copy()
# remove duplicates by eliminating publication-specific data
authors_historical_v2.drop(columns=['pub_id'], inplace=True)
authors_historical_v2.drop_duplicates('researcher_id', inplace=True)
print(f"===\nResearchers with affiliation to {GRIDID} at time of writing:", authors_historical_v2.researcher_id.nunique(), "\n===")
# preview the data
authors_historical_v2
===
Researchers with affiliation to grid.258806.1 at time of writing: 740
===
[16]:
aff_city aff_city_id aff_country aff_country_code aff_id aff_name aff_raw_affiliation aff_state aff_state_code researcher_id first_name last_name
0 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Graduate School of Life Science and Systems En... ur.010525770330.49 Siowhwa Teo
1 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Graduate School of Life Science and Systems En... ur.07706074430.00 Zhanglin Guo
2 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Graduate School of Life Science and Systems En... ur.012076416030.50 Zhenhua Xu
3 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Graduate School of Life Science and Systems En... ur.010227671107.36 Chu Zhang
4 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Graduate School of Life Science and Systems En... ur.014730512667.72 Yusuke Kamata
... ... ... ... ... ... ... ... ... ... ... ... ...
12176 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology 九州工業大学 ur.012307631725.34 熊本 拓哉
12202 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Department of Bioscience and Bioinformatics, K... ur.010340773642.58 Satoshi Fujii
12706 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology 九州工業大学大学院情報工学研究院 ur.010035501405.12 國近 秀信
12728 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology Graduate School of Life Science and Systems En... ur.01004062131.45 Momoko Kumemura
12752 Kitakyushu 1859307.0 Japan JP grid.258806.1 Kyushu Institute of Technology 九工大 ur.010024355257.03 河野 晴彦

740 rows × 12 columns

Also here, the procedure is exactly the same as in approach 1.

[17]:
authors_current_v2 = pubs_v2.as_dataframe_authors()
authors_current_v2 = authors_current_v2[authors_current_v2['current_organization_id'] == GRIDID].copy()
authors_current_v2.drop(columns=['pub_id'], inplace=True)
authors_current_v2.drop_duplicates('researcher_id', inplace=True)
print(f"===\nResearchers with affiliation to {GRIDID} at the time of speaking:", authors_current_v2.researcher_id.nunique(), "\n===")
authors_current_v2
===
Researchers with affiliation to grid.258806.1 at the time of speaking: 584
===
[17]:
affiliations corresponding current_organization_id first_name last_name orcid raw_affiliation researcher_id
0 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 Siowhwa Teo [0000-0001-9927-8309] [Graduate School of Life Science and Systems E... ur.010525770330.49
2 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 Zhenhua Xu None [Graduate School of Life Science and Systems E... ur.012076416030.50
3 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 Chu Zhang None [Graduate School of Life Science and Systems E... ur.010227671107.36
4 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 Yusuke Kamata None [Graduate School of Life Science and Systems E... ur.014730512667.72
6 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 Tingli Ma [0000-0002-3310-459X] [Graduate School of Life Science and Systems E... ur.01144540527.52
... ... ... ... ... ... ... ... ...
10482 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 毛利 恵美子 None [九州工業大学大学院] ur.010435415335.62
10525 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 田中 和博 None [九工大] ur.010456027505.69
11311 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 國近 秀信 None [九州工業大学大学院情報工学研究院] ur.010035501405.12
11328 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 Momoko Kumemura None [Graduate School of Life Science and Systems E... ur.01004062131.45
11344 [{'city': 'Kitakyushu', 'city_id': 1859307, 'c... grid.258806.1 河野 晴彦 None [九工大] ur.010024355257.03

584 rows × 8 columns

Conclusions

As anticipated above, both approaches are equally valid and in fact they return the same (or very similar) number of results. Let’s compare them:

[18]:
# create summary table
data = [['1', len(authors_current), len(authors_historical),
         """search publications where year in [2013:2018] and research_orgs="grid.258806.1" return publication""",
         ],
        ['2', len(authors_current_v2), len(authors_historical_v2),
         """search researchers where research_orgs="grid.258806.1" return researchers --- then --- search publications where researchers.id in {IDS} and year in [2013:2018] and research_orgs={GRIDID} return publications""",
        ]]

pd.DataFrame(data, columns = ['Method', 'Authors (current)', 'Authors (historical)',  'Query'])
[18]:
Method Authors (current) Authors (historical) Query
0 1 584 830 search publications where year in [2013:2018] ...
1 2 584 740 search researchers where research_orgs="grid.2...

In some cases you might encounter small differences in the total number of records returned by the two approaches (eg one method returns 1-2 extra records than the other one).

This is usually due to a synchronization delay between Dimensions databases (e.g. publications and researchers). The differences are negligible in most cases, but in general it’s enough to run same extraction again after a day or two for the problem to disappear.

It depends on which/how many filters are being used in order to identify a suitable results set for your research question.

  • The first approach is generally quicker as it has only two steps, as opposed to the second method that has three.

  • However if your initial publications query returns lots of results (eg for a large institution or big time frame), it may be quicker to try out method 2 instead.

  • The second approach can be handy if one wants to pre-filter researchers using one of the ther available properties (e.g. last_grant_year)

So, in general, deciding which approach is best depends on the specific query filters being used and on the impact these filters have on the overall performance/speed of the data extraction.

There is no fixed rule and a bit of trial & error can go a long way in helping you optimize your data extraction algorithm!



Note

The Dimensions Analytics API allows to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.

../../_images/badge-dimensions-api.svg