
Citation Analysis: Journals Cited by a Research Organization

This notebook shows how to use the Dimensions Analytics API to discover which academic journals are most frequently cited by authors affiliated with a selected research organization. These are the steps:

  1. We start from a specific organization GRID ID (and other parameters of choice).

  2. Using the publications API, we extract all publications authored by researchers at that institution. For each publication, we store the IDs of all outgoing citations using the reference_ids field.

  3. We query the API again to obtain other useful metadata for those outgoing citations, e.g. title, publisher, journal, etc.

  4. We analyse the data, in particular by segmenting it by journal and publisher.

[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Sep 22, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.utils import *
import sys, json, time, os
from tqdm.notebook import tqdm
import pandas as pd
import plotly.express as px
if not 'google.colab' in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
#

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.9.1)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.2
Method: dsl.ini file

1. Choosing a Research Organization

We can use the organizations API to find the GRID ID for the University of California, Berkeley.

[3]:
%%dsldf

search organizations for "berkeley university" return organizations
Returned Organizations: 1 (total = 1)
Time: 0.61s
[3]:
id name acronym city_name country_name latitude linkout longitude state_name types
0 grid.47840.3f University of California, Berkeley UCB Berkeley United States 37.87216 [http://www.berkeley.edu/] -122.258575 California [Education]

The ID we are looking for is grid.47840.3f.
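If preferred, the GRID ID can also be picked up programmatically from the query result instead of copying it by hand. A minimal sketch, assuming the first match returned by the search above is the organization we want:

# minimal sketch: extract the GRID ID programmatically from the first search result
res = dsl.query('search organizations for "berkeley university" return organizations')
grid_id = res.as_dataframe().iloc[0]["id"]
print(grid_id)  # grid.47840.3f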

1.1 Selecting a Field of Research ID

Similarly, we can use the API to identify relevant Field of Research (FoR) categories for publications from this organization.

By using a specific FoR category we can make the subsequent data extraction and analysis a bit more focused.

[4]:
%%dsldf

search publications
    where research_orgs.id = "grid.47840.3f"
    return category_for limit 10
Returned Category_for: 10
Time: 0.97s
[4]:
id name count
0 2206 06 Biological Sciences 42951
1 2202 02 Physical Sciences 42053
2 2209 09 Engineering 38325
3 2211 11 Medical and Health Sciences 33814
4 2203 03 Chemical Sciences 24475
5 2201 01 Mathematical Sciences 19433
6 2208 08 Information and Computing Sciences 17366
7 2581 0601 Biochemistry and Cell Biology 16245
8 2471 0306 Physical Chemistry (incl. Structural) 13440
9 2217 17 Psychology and Cognitive Sciences 13398

For example, let's focus on 08 Information and Computing Sciences, ID 2208.

Finally, we can also select a specific year range, e.g. the last few years.

Let’s save all of these variables so that we can reference them later on.

[5]:
GRIDID = "grid.47840.3f" #@param {type:"string"}

FOR_CODE = "2208"  #@param {type:"string"}


#@markdown The start/end year of the publications used to extract citations
YEAR_START = 2015 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2021 #@param {type: "slider", min: 1950, max: 2021}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

2. Getting the IDs of the outgoing citations

In this section we use the Publications API to extract the Dimensions IDs of all publications referenced by authors at the selected research organization.

These identifiers can be found in the reference_ids field.
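Before running the full extraction, it can help to look at the reference_ids of a single publication first. A minimal sketch, reusing one of the DOIs that also shows up in the results below:

# minimal sketch: inspect the outgoing citations of a single publication
sample = dsl.query("""search publications
                          where doi = "10.1145/3485007"
                          return publications[id+doi+reference_ids]""")
sample.publications[0].get("reference_ids", [])[:5]  # first few cited publication IDs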

[7]:
publications = dsl.query_iterative(f"""

    search publications
        where research_orgs.id = "{GRIDID}"
        and year in [{YEAR_START}:{YEAR_END}]
        and category_for.id="{FOR_CODE}"
        return publications[id+doi+reference_ids]

""")

#
# preview the data
pubs_and_citations = publications.as_dataframe().explode("reference_ids")
pubs_and_citations.head(5)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 5068 (1.31s)
1000-2000 / 5068 (2.01s)
2000-3000 / 5068 (1.07s)
3000-4000 / 5068 (1.17s)
4000-5000 / 5068 (1.33s)
5000-5068 / 5068 (1.23s)
===
Records extracted: 5068
[7]:
id doi reference_ids
0 pub.1146101788 10.1109/icicas53977.2021.00031 [pub.1121966168, pub.1121972405, pub.112197331...
1 pub.1144313713 10.1002/9780470015902.a0029363 [pub.1137478539, pub.1125842877, pub.112565184...
2 pub.1144210288 10.1093/jncics/pkab099 [pub.1143292627, pub.1009104520, pub.103834136...
3 pub.1142934064 10.1145/3485007 [pub.1007394032, pub.1004190151, pub.105272630...
4 pub.1141526845 10.1145/3478535 [pub.1023640338, pub.1127037784, pub.106438915...

2.1 Removing duplicates and counting most frequent citations

Since multiple authors/publications from our organization will be referencing the same target publications, we may have various duplicates in our reference_ids column.

So we want to remove those duplicates, while at the same time retaining that information by adding a new size column that counts how frequently each publication was cited.

This can be easily achieved using pandas' groupby function:

[10]:
# consider only IDs column
df = pubs_and_citations[['reference_ids']]
# group by ID and count
citations = df.groupby(df.columns.tolist(),as_index=False).size().sort_values("size", ascending=False)
# preview the data, most cited ID first
citations.head(10)
[10]:
reference_ids size
62077 pub.1093359587 202
69920 pub.1095689025 133
29065 pub.1038140272 96
7290 pub.1009767488 86
34655 pub.1045321436 76
39735 pub.1052031051 68
66651 pub.1094727707 68
44994 pub.1061179979 66
7503 pub.1010020120 63
62982 pub.1093626237 56

3. Enriching the citation IDs with other publication metadata

In this step we use the outgoing citation IDs obtained above to query the publications API again.

The goal is to retrieve more publication metadata so that we can group citations by criteria of interest, e.g. the journal they belong to. For example:

  • source_title

  • publisher

  • year

  • doi

NOTE Since we can have lots of publications to go through, the IDs list is chunked into smaller groups to ensure the resulting API query is never too long (more info here).
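The chunking itself relies on dimcli's chunks_of helper (imported earlier via dimcli.utils). A minimal sketch of how it behaves, using a small list of hypothetical IDs:

# minimal sketch: chunks_of splits a list into fixed-size batches (sample IDs are hypothetical)
sample_ids = ["pub.%d" % n for n in range(10)]
for chunk in chunks_of(sample_ids, 4):
    print(chunk)  # lists of at most 4 IDs each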

[11]:

#
# get a list of citation IDs
pubids = list(citations['reference_ids'])


#
# DSL query - PS change the return statement to extract different metadata of interest
query_template = """search publications
                    where id in {}
                    return publications[id+doi+journal+year+publisher+type+issn]
                    limit 1000"""


#
# loop through all references-publications IDs in chunks and query Dimensions
print(f"===\nExtracting publications data for {len(pubids)} citations...")
results = []
BATCHSIZE = 400
VERBOSE = False # set to True to see extraction logs

for chunk in tqdm(list(chunks_of(pubids, BATCHSIZE))):
    query = query_template.format(json.dumps(chunk))
    data = dsl.query(query, verbose=VERBOSE)
    results += data.publications
    time.sleep(0.5)

#
# save the cited publications data into a dataframe
pubs_cited = pd.DataFrame().from_dict(results)
print("===\nCited Publications found: ", len(pubs_cited))


#
# transform the 'journal' column because it contains nested data
temp = pubs_cited['journal'].apply(pd.Series).rename(columns={"id": "journal.id",
                                                              "title": "journal.title"}).drop([0], axis=1)
pubs_cited = pd.concat([pubs_cited.drop(['journal'], axis=1), temp], axis=1).sort_values('type')
pubs_cited.head(10)


===
Extracting publications data for 87159 citations...
===
Cited Publications found:  87139
[11]:
doi id publisher type year issn journal.id journal.title
87138 10.1214/aoms/1177704711 pub.1042438804 Institute of Mathematical Statistics article 1962.0 [0003-4851, 2168-8990] jour.1018844 The Annals of Mathematical Statistics
72855 10.1109/surv.2014.012214.00007 pub.1061446928 Institute of Electrical and Electronics Engine... article 2014.0 [1553-877X, 2373-745X] jour.1139536 IEEE Communications Surveys & Tutorials
72854 10.1109/surv.2014.032014.00094 pub.1061446943 Institute of Electrical and Electronics Engine... article 2014.0 [1553-877X, 2373-745X] jour.1139536 IEEE Communications Surveys & Tutorials
72853 10.1109/lsp.2014.2351822 pub.1061378903 Institute of Electrical and Electronics Engine... article 2014.0 [1070-9908, 1558-2361] jour.1033580 IEEE Signal Processing Letters
72852 10.1109/mm.2014.61 pub.1061408931 Institute of Electrical and Electronics Engine... article 2014.0 [0272-1732, 1937-4143] jour.1125669 IEEE Micro
72851 10.1109/mits.2014.2343262 pub.1061407712 Institute of Electrical and Electronics Engine... article 2014.0 [1939-1390, 1941-1197] jour.1140577 IEEE Intelligent Transportation Systems Magazine
72850 10.1109/lsp.2014.2334306 pub.1061378828 Institute of Electrical and Electronics Engine... article 2014.0 [1070-9908, 1558-2361] jour.1033580 IEEE Signal Processing Letters
72849 10.1109/mra.2014.2360283 pub.1061419755 Institute of Electrical and Electronics Engine... article 2014.0 [1070-9932, 1558-223X] jour.1033567 IEEE Robotics & Automation Magazine
72848 10.1109/msp.2014.107 pub.1061424107 Institute of Electrical and Electronics Engine... article 2015.0 [1540-7993, 1558-4046] jour.1033568 IEEE Security & Privacy
72847 10.1109/lsp.2015.2393295 pub.1061379116 Institute of Electrical and Electronics Engine... article 2015.0 [1070-9908, 1558-2361] jour.1033580 IEEE Signal Processing Letters

3.1 Adding the citations counts

We achieve this by joining this data with the citation counts we extracted earlier, that is, the citations dataframe.

Note: if there are a lot of publications, this step can take some time.

[12]:
pubs_cited = pubs_cited.merge(citations, left_on='id', right_on='reference_ids')

pubs_cited.head(10)
[12]:
doi id publisher type year issn journal.id journal.title reference_ids size
0 10.1214/aoms/1177704711 pub.1042438804 Institute of Mathematical Statistics article 1962.0 [0003-4851, 2168-8990] jour.1018844 The Annals of Mathematical Statistics pub.1042438804 1
1 10.1109/surv.2014.012214.00007 pub.1061446928 Institute of Electrical and Electronics Engine... article 2014.0 [1553-877X, 2373-745X] jour.1139536 IEEE Communications Surveys & Tutorials pub.1061446928 1
2 10.1109/surv.2014.032014.00094 pub.1061446943 Institute of Electrical and Electronics Engine... article 2014.0 [1553-877X, 2373-745X] jour.1139536 IEEE Communications Surveys & Tutorials pub.1061446943 1
3 10.1109/lsp.2014.2351822 pub.1061378903 Institute of Electrical and Electronics Engine... article 2014.0 [1070-9908, 1558-2361] jour.1033580 IEEE Signal Processing Letters pub.1061378903 1
4 10.1109/mm.2014.61 pub.1061408931 Institute of Electrical and Electronics Engine... article 2014.0 [0272-1732, 1937-4143] jour.1125669 IEEE Micro pub.1061408931 1
5 10.1109/mits.2014.2343262 pub.1061407712 Institute of Electrical and Electronics Engine... article 2014.0 [1939-1390, 1941-1197] jour.1140577 IEEE Intelligent Transportation Systems Magazine pub.1061407712 1
6 10.1109/lsp.2014.2334306 pub.1061378828 Institute of Electrical and Electronics Engine... article 2014.0 [1070-9908, 1558-2361] jour.1033580 IEEE Signal Processing Letters pub.1061378828 1
7 10.1109/mra.2014.2360283 pub.1061419755 Institute of Electrical and Electronics Engine... article 2014.0 [1070-9932, 1558-223X] jour.1033567 IEEE Robotics & Automation Magazine pub.1061419755 1
8 10.1109/msp.2014.107 pub.1061424107 Institute of Electrical and Electronics Engine... article 2015.0 [1540-7993, 1558-4046] jour.1033568 IEEE Security & Privacy pub.1061424107 1
9 10.1109/lsp.2015.2393295 pub.1061379116 Institute of Electrical and Electronics Engine... article 2015.0 [1070-9908, 1558-2361] jour.1033580 IEEE Signal Processing Letters pub.1061379116 1

4. Journal Analysis

Finally, we can analyze the cited publications by grouping them by source journal. This can be achieved easily thanks to pandas' DataFrame methods.

4.1 Number of Unique journals

[13]:
pubs_cited['journal.id'].describe()
[13]:
count            59877
unique            6577
top       jour.1017736
freq               779
Name: journal.id, dtype: object
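As a quick sanity check, the top value above (a journal ID) can be resolved to its title directly from the dataframe. A minimal sketch:

# minimal sketch: look up the title of the most frequently occurring journal ID
top_journal = pubs_cited['journal.id'].describe()['top']
pubs_cited.loc[pubs_cited['journal.id'] == top_journal, 'journal.title'].iloc[0]
# -> 'The Journal of Chemical Physics' in this run (cf. the ranking in the next section)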

4.2 Most frequent journals

[14]:
journals = pubs_cited.value_counts(['journal.title', 'publisher'])
journals = journals.to_frame().reset_index().rename(columns= {0: 'citations', 'journal.title' : 'title'})
journals.index.name = 'index'

#preview
journals.head(100)
[14]:
title publisher citations
index
0 The Journal of Chemical Physics AIP Publishing 779
1 ACM Transactions on Graphics Association for Computing Machinery (ACM) 680
2 IEEE Transactions on Information Theory Institute of Electrical and Electronics Engine... 655
3 Nature Springer Nature 594
4 Proceedings of the National Academy of Science... Proceedings of the National Academy of Sciences 592
... ... ... ...
95 Artificial Intelligence Elsevier 97
96 Psychological Review American Psychological Association (APA) 96
97 Expert Systems with Applications Elsevier 96
98 JAMA American Medical Association (AMA) 96
99 IEEE Transactions on Multimedia Institute of Electrical and Electronics Engine... 95

100 rows × 3 columns

4.3 Top 50 journals chart, by publisher

[15]:
px.bar(journals[:50],
       x="title", y="citations", color="publisher",
       height=900,
       title=f"Top 50 journals cited by {GRIDID} (focus: FoR {FOR_CODE} and time span {YEAR_START}:{YEAR_END})")

4.4 Top 20 journals by year of the cited publication

[16]:

THRESHOLD = 20  #@param {type: "slider", min: 10, max: 100}

# replace empty values with a placeholder
pubs_cited.fillna("-no value-", inplace=True)

# make publications list smaller by only showing top journals
pubs_citing_topjournals = pubs_cited[pubs_cited['journal.title'].isin(list(journals[:THRESHOLD]['title']))].sort_values('journal.title')

# build histogram
px.histogram(pubs_citing_topjournals,
             x="year",
             color="journal.title",
             height=600,
             title=f"Top {THRESHOLD} journals citing publications from {GRIDID} - by year")

Conclusions

In this notebook we have shown how to use the Dimensions Analytics API to discover which academic journals are most frequently cited by authors affiliated with a selected research organization.

This only scratches the surface of the possible applications of publication data, but hopefully it’ll give you a few basic tools to get started building your own applications. For more background, see the list of fields available via the Publications API.
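One quick way to explore those fields from within the notebook is the DSL describe command. A minimal sketch, simply printing a preview of the raw schema description:

# minimal sketch: inspect the fields available on the publications source
schema = dsl.query("describe source publications")
print(json.dumps(schema.json, indent=2)[:500])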



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials, and much more.
