
Identifying the Industry Collaborators of an Academic Institution

Dimensions uses GRID identifiers for institutions, so you can use GRID metadata in Dimensions queries.

In this tutorial we identify all organizations that have an industry type.

This list of organizations is then used to identify industry collaborations for a chosen academic institution.

[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 25, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.utils import *

import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
#

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

1. Selecting an academic institution

For the purpose of this exercise, we will use University of Trento, Italy (grid.11696.39) as a starting point. You can pick any other GRID organization of course. Just use a DSL query or the GRID website to discover the ID of an organization that interests you.
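As a rough sketch of the "use a DSL query" route, the helper below builds a full-text organization search string. The function name `build_org_search_query` is hypothetical, and the `for "..."` and `limit` clauses reflect my understanding of the DSL rather than anything shown in this tutorial, so check them against the DSL documentation before relying on them.

```python
def build_org_search_query(name: str, limit: int = 10) -> str:
    """Build a DSL query string that looks up organizations by name (hypothetical helper)."""
    # Escape backslashes and double quotes so the name can sit inside a DSL string literal
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'search organizations for "{escaped}" return organizations[id+name] limit {limit}'


query = build_org_search_query("University of Trento")
print(query)
# prints: search organizations for "University of Trento" return organizations[id+name] limit 10

# With a logged-in client, the string could then be passed to dsl.query(query)
```

The `id` value of the matching row is what goes into the GRIDID parameter in the next cell.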

[3]:
#@markdown The main organization we are interested in:
GRIDID = "grid.11696.39" #@param {type:"string"}

#@markdown The start/end year of publications used to extract industry collaborations:
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

#
# gen link to Dimensions
#
try:
  gridname = dsl.query(f"""search organizations where id="{GRIDID}" return organizations[name]""", verbose=False).organizations[0]['name']
except Exception:
  gridname = ""
from IPython.display import display, HTML
display(HTML('GRID: <a href="{}" title="View selected organization in Dimensions">{} - {} &#x29c9;</a>'.format(dimensions_url(GRIDID), GRIDID, gridname)))
display(HTML('Time period: {} to {} <br /><br />'.format(YEAR_START, YEAR_END)))

Time period: 2000 to 2016

2. Extracting publications from industry collaborations

First of all we want to extract all GRID orgs with type='Company' using the API. Then we will use this list of organizations to identify industry collaborators for our chosen institution.

  • We can use Dimcli’s query_iterative method to automatically retrieve ‘Company’ GRID orgs in batches of 1000.

  • NOTE: this step retrieves tens of thousands of records from the API, so it may take a few minutes to complete.

[4]:
# get GRID IDs
company_grids = dsl.query_iterative("""search organizations where types="Company" return organizations[id]""")
Starting iteration with limit=1000 skip=0 ...
0-1000 / 30088 (0.63s)
1000-2000 / 30088 (0.57s)
2000-3000 / 30088 (0.69s)
3000-4000 / 30088 (0.52s)
4000-5000 / 30088 (0.51s)
5000-6000 / 30088 (0.61s)
6000-7000 / 30088 (0.52s)
7000-8000 / 30088 (0.56s)
8000-9000 / 30088 (2.24s)
9000-10000 / 30088 (0.56s)
10000-11000 / 30088 (0.57s)
11000-12000 / 30088 (0.58s)
12000-13000 / 30088 (0.62s)
13000-14000 / 30088 (1.74s)
14000-15000 / 30088 (0.58s)
15000-16000 / 30088 (0.49s)
16000-17000 / 30088 (0.58s)
17000-18000 / 30088 (0.53s)
18000-19000 / 30088 (0.57s)
19000-20000 / 30088 (0.50s)
20000-21000 / 30088 (0.51s)
21000-22000 / 30088 (0.51s)
22000-23000 / 30088 (0.54s)
23000-24000 / 30088 (0.50s)
24000-25000 / 30088 (0.53s)
25000-26000 / 30088 (0.62s)
26000-27000 / 30088 (0.49s)
27000-28000 / 30088 (0.48s)
28000-29000 / 30088 (0.56s)
29000-30000 / 30088 (0.90s)
30000-30088 / 30088 (0.61s)
===
Records extracted: 30088
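The log above shows the limit/skip pattern: each request asks for 1000 records starting at an increasing offset until the total is reached. The sketch below illustrates that pattern with a fake data source. It is not Dimcli's actual implementation, and `fetch_all` and `fake_fetch_page` are made-up names.

```python
from typing import Callable, Dict, List


def fetch_all(fetch_page: Callable[[int, int], Dict], limit: int = 1000) -> List[dict]:
    """Collect all records by requesting pages of `limit` records at increasing offsets."""
    records: List[dict] = []
    skip = 0
    while True:
        page = fetch_page(limit, skip)
        records.extend(page["records"])
        skip += limit
        # Stop once the next offset would fall past the reported total
        if skip >= page["total"]:
            break
    return records


# Stand-in for the API: 2,500 fake organization records
FAKE_DB = [{"id": f"grid.{i}"} for i in range(2500)]


def fake_fetch_page(limit: int, skip: int) -> Dict:
    """Return one page of fake records plus the overall total."""
    return {"total": len(FAKE_DB), "records": FAKE_DB[skip:skip + limit]}


all_records = fetch_all(fake_fetch_page, limit=1000)
print(len(all_records))  # prints: 2500
```

Three requests are made here (offsets 0, 1000 and 2000), which mirrors the 31 requests in the log for 30088 records.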

We can now set up a parametrized query that pulls Dimensions publications resulting from industry collaborations.

Together with IDs, DOIs and publication types, the publications generated from industry collaborations should include citation counts and authors info, so that we can draw up some useful statistics based on these metadata later on.
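To see how the placeholders get filled in, the sketch below formats a copy of the template defined in the next cell with the Trento GRID ID and a two-item list of company IDs (both appear in the sample data later in this tutorial). The point to note is that `json.dumps` renders a Python list with double-quoted strings, which is the list form the query expects.

```python
import json

# Copy of the query template defined in the next cell
query_template = """
    search publications
       where
        research_orgs.id = "{}"
        and research_orgs.id in {}
        and year in [{}:{}]
    return publications[id+doi+type+times_cited+year+authors]
    """

# Two company GRID IDs, used here only as sample values
sample_chunk = ["grid.14587.3f", "grid.424032.3"]

# json.dumps turns the list into ["grid.14587.3f", "grid.424032.3"]
query = query_template.format("grid.11696.39", json.dumps(sample_chunk), 2000, 2016)
print(query)
```

Using `str(sample_chunk)` instead would produce single-quoted strings, which is why the extraction loop relies on `json.dumps`.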

[5]:
query_template = """
    search publications
       where
        research_orgs.id = "{}"
        and research_orgs.id in {}
        and year in [{}:{}]
    return publications[id+doi+type+times_cited+year+authors]
    """
[6]:
gridis = list(company_grids.as_dataframe()['id'])

#
# loop through all grids

ITERATION_RECORDS = 1000  # Publication records per query iteration
GRID_RECORDS = 200       # grid IDs per query
VERBOSE = False          # set to True to view full extraction logs
print(f"===\nExtracting {GRIDID} publications with industry collaborators ...")
print("Records per query : ", ITERATION_RECORDS)
print("GRID IDs per query: ", GRID_RECORDS)
results = []


for chunk in progress(list(chunks_of(gridis, GRID_RECORDS))):
    query = query_template.format(GRIDID, json.dumps(chunk), YEAR_START, YEAR_END)
#     print(query)
    data = dsl.query_iterative(query, verbose=VERBOSE, limit=ITERATION_RECORDS)
    if data.errors:
        print("==\nIteration failed: due to an error no data was extracted for this iteration. \nTry adjusting the ITERATION_RECORDS or GRID_RECORDS parameters and rerun the extraction.")
    else:
        results += data.publications
    time.sleep(0.5)

#
# put the publication data into a dataframe, remove duplicates and save

pubs = pd.DataFrame(results)
# print("===\nIndustry Publications found: ", len(pubs))
pubs.drop_duplicates(subset='id', inplace=True)
print("Unique Industry Publications found: ", len(pubs))

#
# preview the data
print("===\nPreview:")
pubs.head(10)
===
Extracting grid.11696.39 publications with industry collaborators ...
Records per query :  1000
GRID IDs per query:  200
Unique Industry Publications found:  375
===
Preview:
[6]:
authors doi id times_cited type year
0 [{'affiliations': [{'city': 'Madrid', 'city_id... 10.1088/0264-9381/33/23/235015 pub.1059063534 7 article 2016
1 [{'affiliations': [{'city': 'Dublin', 'city_id... 10.1145/2984356.2984363 pub.1001653422 14 proceeding 2016
2 [{'affiliations': [{'city': 'Stuttgart', 'city... 10.1016/j.apnum.2016.02.001 pub.1038596770 12 article 2016
3 [{'affiliations': [{'city': 'Madrid', 'city_id... 10.1103/physrevlett.116.231101 pub.1001053038 313 article 2016
4 [{'affiliations': [{'city': 'Dublin', 'city_id... 10.1109/eucnc.2016.7561056 pub.1094950798 22 proceeding 2016
5 [{'affiliations': [{'city': 'Trento', 'city_id... 10.1140/epjds/s13688-016-0064-6 pub.1033140941 15 article 2016
6 [{'affiliations': [{'city': 'Trento', 'city_id... 10.1089/big.2014.0054 pub.1018945654 48 article 2015
7 [{'affiliations': [{'city': 'Madrid', 'city_id... 10.1088/1742-6596/610/1/012027 pub.1031150191 1 article 2015
8 [{'affiliations': [{'city': 'Madrid', 'city_id... 10.1088/1742-6596/610/1/012005 pub.1052522882 17 article 2015
9 [{'affiliations': [{'city': 'Madrid', 'city_id... 10.1088/1742-6596/610/1/012026 pub.1033837350 2 article 2015
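The extraction loop above relies on two ideas: splitting the long list of company IDs into fixed-size chunks, and removing duplicates afterwards, since a publication with co-authors from companies in two different chunks is returned by both queries. The sketch below shows both steps in plain Python. `split_into_chunks` is a hypothetical stand-in for what `chunks_of` appears to do, and the publication records are made up.

```python
from typing import Iterator, List, Sequence


def split_into_chunks(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


company_ids = [f"grid.{i}" for i in range(5)]
chunks = list(split_into_chunks(company_ids, 2))
print(chunks)
# prints: [['grid.0', 'grid.1'], ['grid.2', 'grid.3'], ['grid.4']]

# Pretend each chunk's query returned these publications; pub.2 involves
# companies from two different chunks, so it comes back twice.
results_per_chunk = [
    [{"id": "pub.1"}, {"id": "pub.2"}],
    [{"id": "pub.2"}, {"id": "pub.3"}],
    [],
]

unique = {}
for chunk_results in results_per_chunk:
    for pub in chunk_results:
        unique.setdefault(pub["id"], pub)  # keep the first copy of each publication

print(list(unique))
# prints: ['pub.1', 'pub.2', 'pub.3']
```

In the notebook the same deduplication is done with `drop_duplicates(subset='id')` on the dataframe.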

3. Analyses

In this section we will build some visualizations that help us understand the data we extracted.

3.1 Count of Publications per year from Industry Collaborations

A simple histogram shows the number of publications per year, broken down by publication type.

[7]:
px.histogram(pubs,
             x="year",
             color="type",
             title=f"Publications per year with industry collaborations for {GRIDID}")

3.2 Citations from Industry Collaboration

[8]:
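# sum only the citation counts; other columns (e.g. 'authors', 'doi') are not numeric
pubs_grouped = pubs.groupby('year', as_index=False)['times_cited'].sum()
px.bar(pubs_grouped,
       x="year",
       y="times_cited",
       title=f"Total citations per year for publications with industry collaborations for {GRIDID}")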

3.3 Top Industry Collaborators

In order to dig deeper into the industry affiliations we have to process the nested JSON data in the ‘authors’ column. By doing so, we can process authors & affiliations information and identify the ones belonging to the ‘industry’ set defined above.

For example, if we extract the authors data for the first publication/row (pubs.iloc[0]['authors']), this is what it’d look like:

[{'first_name': 'LUCA DALLA',
  'last_name': 'VALLE',
  'corresponding': '',
  'orcid': '',
  'current_organization_id': 'grid.11696.39',
  'researcher_id': 'ur.013645226073.38',
  'affiliations': [{'id': 'grid.11696.39',
    'name': 'University of Trento',
    'city': 'Trento',
    'city_id': 3165243,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]},
 {'first_name': 'ELENA CRISTINA',
  'last_name': 'RADA',
  'corresponding': '',
  'orcid': "['0000-0003-0807-1826']",
  'current_organization_id': 'grid.18147.3b',
  'researcher_id': 'ur.01344320306.26',
  'affiliations': [{'id': 'grid.11696.39',
    'name': 'University of Trento',
    'city': 'Trento',
    'city_id': 3165243,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]},
 {'first_name': 'MARCO',
  'last_name': 'RAGAZZI',
  'corresponding': '',
  'orcid': '',
  'current_organization_id': 'grid.11696.39',
  'researcher_id': 'ur.0655652202.53',
  'affiliations': [{'id': 'grid.11696.39',
    'name': 'University of Trento',
    'city': 'Trento',
    'city_id': 3165243,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]},
 {'first_name': 'MICHELE',
  'last_name': 'CARAVIELLO',
  'corresponding': '',
  'orcid': '',
  'current_organization_id': 'grid.14587.3f',
  'researcher_id': 'ur.016015622301.36',
  'affiliations': [{'id': 'grid.14587.3f',
    'name': 'Telecom Italia (Italy)',
    'city': 'Rome',
    'city_id': 3169070,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]}]

NOTE: Instead of iterating through the authors/affiliations data by building a new function, we can just take advantage of the DslDataset class in the Dimcli library. This class abstracts the notion of a Dimensions ‘results list’ and provides useful methods to quickly process authors and affiliations.
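For comparison, a hand-written version of that iteration might look like the sketch below. It walks the publication, author and affiliation levels and keeps only affiliations whose GRID ID is in the industry set. The function name and the output keys are illustrative choices (loosely mirroring the column names Dimcli produces), and the publication ID is made up.

```python
from typing import Dict, Iterable, List, Set


def industry_affiliation_rows(pubs: Iterable[Dict], industry_ids: Set[str]) -> List[Dict]:
    """Flatten pubs -> authors -> affiliations, keeping only industry affiliations."""
    rows = []
    for pub in pubs:
        for author in pub.get("authors") or []:
            for aff in author.get("affiliations") or []:
                if aff.get("id") in industry_ids:
                    rows.append({
                        "pub_id": pub["id"],
                        "first_name": author.get("first_name"),
                        "last_name": author.get("last_name"),
                        "aff_id": aff["id"],
                        "aff_name": aff.get("name"),
                        "aff_country": aff.get("country"),
                    })
    return rows


sample_pubs = [{
    "id": "pub.0000000001",  # made-up publication ID
    "authors": [
        {"first_name": "MARCO", "last_name": "RAGAZZI",
         "affiliations": [{"id": "grid.11696.39", "name": "University of Trento", "country": "Italy"}]},
        {"first_name": "MICHELE", "last_name": "CARAVIELLO",
         "affiliations": [{"id": "grid.14587.3f", "name": "Telecom Italia (Italy)", "country": "Italy"}]},
    ],
}]

rows = industry_affiliation_rows(sample_pubs, {"grid.14587.3f"})
print(rows)
```

Passing the industry IDs as a set keeps each membership check fast even with tens of thousands of company IDs.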

[10]:
from dimcli import DslDataset

# create a new DslDataset instance
pubsnew = DslDataset.from_publications_list(pubs)
# extract affiliations as a dataframe
affiliations = pubsnew.as_dataframe_authors_affiliations()
# focus only on affiliations including a grid from the industry set created above
affiliations = affiliations[affiliations['aff_id'].isin(gridis)]
# preview the data
affiliations.head(5)
[10]:
aff_city aff_city_id aff_country aff_country_code aff_id aff_name aff_raw_affiliation aff_state aff_state_code pub_id researcher_id first_name last_name
7 Hamburg 2911298.0 Germany DE grid.410308.e Airbus (Germany) Airbus Defence and Space, Claude-Dornier-Stras... pub.1059063534 N Brandt
8 Milan 3173435.0 Italy IT grid.424032.3 OHB (Italy) CGS S.p.A, Compagnia Generale per lo Spazio, V... pub.1059063534 ur.014542047336.90 A Bursi
15 Milan 3173435.0 Italy IT grid.424032.3 OHB (Italy) CGS S.p.A, Compagnia Generale per lo Spazio, V... pub.1059063534 D Desiderio
16 Milan 3173435.0 Italy IT grid.424032.3 OHB (Italy) CGS S.p.A, Compagnia Generale per lo Spazio, V... pub.1059063534 E Piersanti
19 Bristol 2654675.0 United Kingdom GB grid.7546.0 Airbus (United Kingdom) Airbus Defence and Space, Gunnels Wood Road, S... pub.1059063534 ur.010504106037.54 N Dunbar

Let’s now count how often each organization appears and create a chart summing up the top industry collaborators.

TIP Try zooming in on the left-hand side to put into focus the organizations that appear most frequently.

[11]:
px.histogram(affiliations,
             x="aff_name",
             height=900,
             title=f"Top Industry collaborators for {GRIDID}").update_xaxes(categoryorder="total descending")
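One thing to keep in mind when reading this chart: the histogram counts rows, and the affiliations table has one row per author affiliation. In the preview above, for instance, a single publication contributes three OHB (Italy) rows. The sketch below, using made-up rows, contrasts counting rows with counting distinct publications per organization. It is an optional alternative view, not part of the original analysis.

```python
from collections import Counter

# Hypothetical affiliation rows as (pub_id, aff_name) pairs
rows = [
    ("pub.1", "OHB (Italy)"),
    ("pub.1", "OHB (Italy)"),
    ("pub.1", "OHB (Italy)"),
    ("pub.1", "Airbus (Germany)"),
    ("pub.2", "Airbus (Germany)"),
]

# Counting rows, which is what a histogram over aff_name does
by_rows = Counter(name for _, name in rows)

# Counting distinct publications per organization
by_pubs = Counter(name for _, name in set(rows))

print(by_rows.most_common())
# prints: [('OHB (Italy)', 3), ('Airbus (Germany)', 2)]
print(by_pubs.most_common())
# prints: [('Airbus (Germany)', 2), ('OHB (Italy)', 1)]
```

With the real dataframe, a similar per-publication view could be obtained by dropping duplicate (pub_id, aff_id) pairs before plotting.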

3.4 Countries of Industry Collaborators

We can use the same dataset to segment the data by country.

[12]:
px.pie(affiliations,
       names="aff_country",
       height=600,
       title=f"Countries of collaborators for {GRIDID}")

3.5 Putting Countries and Collaborators together

TIP Click a country in the legend on the right to turn it on or off.

[13]:
px.histogram(affiliations,
             x="aff_name",
             height=900,
             color="aff_country",
             title=f"Top Countries and Industry collaborators for {gridname}-{GRIDID}",
             color_discrete_sequence=px.colors.diverging.Spectral)


Note

The Dimensions Analytics API lets you carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
