Using Python for Bibliometric Analysis: Demo on Science Entrepreneurship

I needed to familiarize myself with the literature on science entrepreneurship (for reasons I will explain soon). After delving into bibliometrics and doing literature reviews repeatedly during my PhD, I have developed a system for efficiently getting acquainted with a new body of literature. In this post, I explain my process, hoping it helps others who are also entering a new field.

I typically follow these steps:

  1. Explore the Web of Knowledge using a keyword search
  2. Explore data in Python
  3. Create visualizations using VosViewer

The first step for me is usually just trying out different keywords in the Web of Knowledge. I then browse the first page of the latest articles and the top cited articles. I try to check whether these are related to my topic of interest.

For science entrepreneurship, I settled on the following keywords. I also narrowed the search down to the journals I know are relevant to technology and innovation management, as well as general management, and restricted it to papers published from 2010 onward. My keyword search was:

TS=(science OR technology ) AND TS=(startup* OR “start up” OR “new venture” OR entrepreneur* OR “new firm” OR “spin off” OR spinoff* OR SME OR SMEs) AND SO=(RESEARCH POLICY OR R D MANAGEMENT OR STRATEGIC MANAGEMENT JOURNAL OR JOURNAL OF PRODUCT INNOVATION MANAGEMENT OR ACADEMY OF MANAGEMENT REVIEW OR ACADEMY OF MANAGEMENT JOURNAL OR TECHNOVATION OR SCIENTOMETRICS OR TECHNOLOGICAL FORECASTING “AND” SOCIAL CHANGE OR TECHNOLOGY ANALYSIS STRATEGIC MANAGEMENT OR ORGANIZATION SCIENCE OR ADMINISTRATIVE SCIENCE QUARTERLY OR JOURNAL OF BUSINESS VENTURING OR INDUSTRY “AND” INNOVATION OR STRATEGIC ENTREPRENEURSHIP JOURNAL OR JOURNAL OF TECHNOLOGY TRANSFER OR JOURNAL OF ENGINEERING “AND” TECHNOLOGY MANAGEMENT OR JOURNAL OF MANAGEMENT OR JOURNAL OF MANAGEMENT STUDIES OR RESEARCH TECHNOLOGY MANAGEMENT OR ENTREPRENEURSHIP THEORY “AND” PRACTICE OR ACADEMY OF MANAGEMENT ANNALS OR ACADEMY OF MANAGEMENT PERSPECTIVES OR JOURNAL OF BUSINESS RESEARCH OR BRITISH JOURNAL OF MANAGEMENT OR EUROPEAN JOURNAL OF MANAGEMENT OR MANAGEMENT SCIENCE)

After exploring the results, I downloaded the articles, 1,412 in total. Since WOS only allows downloading 500 records at a time, I named the files 1-500.txt, 501-1000.txt and so on, and saved them all in a folder (named Raw in this case) on my computer.

Data Exploration in Python

In the following, I show the code to import the data into Python and format the articles into a pandas dataframe.

import re, csv, os 
import pandas as pd
import numpy as np
import nltk
import math
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('white')
from collections import Counter

columnnames =['PT','AU','DE', 'AF','TI','SO','LA','DT','ID','AB','C1','RP','EM','CR','NR','TC','U1','PU','PI','PA','SN','EI','J9','JI','PD','PY','VL','IS','BP','EP','DI','PG','WC','SC','GA','UT']

def convertWOScsv(filename):
    with open(filename, encoding='latin-1') as openfile:
        sampledata = openfile.read()
    # divide the export into individual records (records are separated by blank lines)
    individualrecords = sampledata.split('\n\n')
    databaseofWOS = []
    for recordindividual in individualrecords:
        onefile = {}
        # capture the text of each WOS field tag (PT, AU, TI, ...) up to the start of the next tag
        for x in columnnames:
            everyrow = re.compile('\n' + x + ' ' + '(.*?)\n[A-Z][A-Z1]', re.DOTALL)
            rowsdivision = everyrow.search(recordindividual)
            if rowsdivision:
                onefile[x] = rowsdivision.group(1)
        databaseofWOS.append(onefile)
    return databaseofWOS

def massconvertWOS(folder):
    publicationslist = []
    for file in os.listdir(folder):
        if file.endswith('.txt'):
            converttotable = convertWOScsv(os.path.join(folder, file))
            publicationslist += converttotable
    publicationslist = pd.DataFrame(publicationslist)
    publicationslist.dropna(how='all', inplace=True)
    publicationslist.reset_index(drop=True, inplace=True)
    # records without a publication year are assigned the current year (2019 at the time of writing)
    publicationslist['PY'] = publicationslist['PY'].fillna('').replace('', '2019').astype(int)
    # the captured times-cited field (TC) may include trailing text; keep only the leading number
    publicationslist['TC'] = publicationslist['TC'].apply(lambda x: int(x.split('\n')[0]))
    return publicationslist

df = massconvertWOS('Raw')
df = df.drop_duplicates('UT').reset_index(drop=True)

Below, I preview some of the downloaded articles, showing only the most relevant columns.

print('Number of Articles:', df.shape[0])
df.head()[['TI', 'AU', 'SO', 'PY']]
Number of Articles: 1412
  | TI | AU | SO | PY
0 | Non-linear effects of technological competence… | Deligianni, I; Voudouris, I; Spanos, Y; … | TECHNOVATION | 2019
1 | Creating new products from old ones: Consumer … | Robson, K; Wilson, M; Pitt, L | TECHNOVATION | 2019
2 | What company characteristics are associated wi… | Koski, H; Pajarinen, M; Rouvinen, P | INDUSTRY AND INNOVATION | 2019
3 | Through the Looking-Glass: The Impact of Regio… | Vedula, S; York, JG; Corbett, AC | JOURNAL OF MANAGEMENT STUDIES | 2019
4 | The role of incubators in overcoming technolog… | Yusubova, A; Andries, P; Clarysse, B | R & D MANAGEMENT | 2019

WOS is somewhat liberal here: a record can match even when its title and abstract do not literally contain your keywords (for example, via the Keywords Plus field), so some retrieved papers are not actually about the topic. To filter out papers that do not contain the keywords I want, I further filtered the dataset by checking the title, abstract and author keywords myself. Moreover, let's remove articles without any cited references.

df["txt"] = df["TI"].fillna("") + " " + df["DE"].fillna("") + " " + df["AB"].fillna("")
df["txt"] = df["txt"].apply(lambda x: x.replace('-', ' '))
df = df[df['txt'].apply(lambda x: any([y in x.lower() for y in ['scien', 'technolog']]))]
df = df[df['txt'].apply(lambda x: any([y in x.lower() for y in ['startup', 'start up', 'new venture', 'entrepreneur', 'new firm', 'spin off',
                                                                'spinoff', 'sme ', 'smes ']]))]
df = df[~df['CR'].isnull()] 
print('Number of Articles:', df.shape[0])
Number of Articles: 846

I can plot the number of articles over time:

df.groupby('PY').size().plot(kind='bar')

I can also look at the breakdown per journal:

#df.groupby('SO').size().sort_values().plot(kind='barh', figsize=[5,10])
soplot = df.pivot_table(index='PY', columns='SO', aggfunc='size').fillna(0) #.reset_index()
soplot = soplot[soplot.sum(axis=0).sort_values().index].reset_index().rename(columns={'PY':'Year'})
soplot['Year'] = pd.cut(soplot['Year'], [0, 2014, 2019], labels=['2010-2014', '2015-2019'])
soplot.groupby('Year').sum().T.plot(kind='barh', stacked=True, figsize=[5,10])
plt.ylabel('Journal')
plt.xlabel('Number of Articles')
plt.show()

I can look at the top cited articles. These point to the foundational material I should know before delving into the topic.

topcited = df['CR'].fillna('').apply(lambda x: [y.strip() for y in x.split('\n')]).sum()
pd.value_counts(topcited).head(10)
COHEN WM, 1990, ADMIN SCI QUART, V35, P128, DOI 10.2307/2393553                  115
Shane S, 2004, NEW HORIZ ENTREP, P1                                               88
Shane S, 2000, ACAD MANAGE REV, V25, P217, DOI 10.5465/amr.2000.2791611           87
Rothaermel FT, 2007, IND CORP CHANGE, V16, P691, DOI 10.1093/icc/dtm023           86
BARNEY J, 1991, J MANAGE, V17, P99, DOI 10.1177/014920639101700108                81
Shane S, 2000, ORGAN SCI, V11, P448, DOI 10.1287/orsc.11.4.448.14602              78
TEECE DJ, 1986, RES POLICY, V15, P285, DOI 10.1016/0048-7333(86)90027-2           77
Di Gregorio D, 2003, RES POLICY, V32, P209, DOI 10.1016/S0048-7333(02)00097-5     77
EISENHARDT KM, 1989, ACAD MANAGE REV, V14, P532, DOI 10.2307/258557               75
Nelson R.R., 1982, EVOLUTIONARY THEORY                                            69
dtype: int64

The articles above are not very specific to our topic of interest; they are foundational papers in innovation and management research. To surface papers more relevant to the topic, I can instead find which articles are most cited by the papers within this dataset, i.e. the papers that contain the keywords I am interested in. This is the internal citation count of the papers.

def createinttc(df):
    # count, for each article with a DOI, how many other articles in the dataset cite it
    df["CRparsed"] = df["CR"].fillna('').str.lower().astype(str)
    df["DI"] = df["DI"].fillna('').str.lower()
    df["intTC"] = df["DI"].apply(lambda x: sum([x in y for y in df["CRparsed"]]) if x != "" else 0)
    df["CRparsed"] = df["CR"].astype(str).apply(lambda x: [y.strip().lower() for y in x.split('\n')])
    return df

df = createinttc(df).reset_index(drop=True)
df.sort_values('intTC', ascending=False)[['TI', 'AU', 'SO', 'PY', 'intTC']].head(10)
    | TI | AU | SO | PY | intTC
401 | 30 years after Bayh-Dole: Reassessing academic… | Grimaldi, R; Kenney, M; Siegel, DS; W… | RESEARCH POLICY | 2011 | 45
301 | Academic engagement and commercialisation: A r… | Perkmann, M; Tartari, V; McKelvey, M; … | RESEARCH POLICY | 2013 | 41
428 | Why do academics engage with industry? The ent… | D’Este, P; Perkmann, M | JOURNAL OF TECHNOLOGY TRANSFER | 2011 | 32
402 | The impact of entrepreneurial capacity, experi… | Clarysse, B; Tartari, V; Salter, A | RESEARCH POLICY | 2011 | 26
407 | ENDOGENOUS GROWTH THROUGH KNOWLEDGE SPILLOVERS… | Delmar, F; Wennberg, K; Hellerstedt, K | STRATEGIC ENTREPRENEURSHIP JOURNAL | 2011 | 24
430 | Entrepreneurial effectiveness of European univ… | Van Looy, B; Landoni, P; Callaert, J; … | RESEARCH POLICY | 2011 | 23
398 | The Bayh-Dole Act and scientist entrepreneurship | Aldridge, TT; Audretsch, D | RESEARCH POLICY | 2011 | 20
400 | The effectiveness of university knowledge spil… | Wennberg, K; Wiklund, J; Wright, M | RESEARCH POLICY | 2011 | 19
515 | Convergence or path dependency in policies to … | Mustar, P; Wright, M | JOURNAL OF TECHNOLOGY TRANSFER | 2010 | 19
413 | Entrepreneurial Origin, Technological Knowledg… | Clarysse, B; Wright, M; Van de Velde, E | JOURNAL OF MANAGEMENT STUDIES | 2011 | 19

A complementary approach is to look at the articles that cite the most other papers in the dataset. This reveals which reviews already integrate the studies in our dataset; we can start reading from these, as they already cover a lot of the other papers.

doilist = [y for y in df['DI'].dropna().tolist() if y!='']
df['Citing'] = df['CR'].apply(lambda x: len([y for y in doilist if y in x]))
df.sort_values('Citing', ascending=False)[['TI', 'AU', 'SO' , 'PY',  'Citing', ]].head(10)
    | TI | AU | SO | PY | Citing
139 | Conceptualizing academic entrepreneurship ecos… | Hayter, CS; Nelson, AJ; Zayed, S; O’C… | JOURNAL OF TECHNOLOGY TRANSFER | 2018 | 75
168 | THE PSYCHOLOGICAL FOUNDATIONS OF UNIVERSITY SC… | Hmieleski, KM; Powell, EE | ACADEMY OF MANAGEMENT PERSPECTIVES | 2018 | 33
138 | Re-thinking university spin-off: a critical li… | Miranda, FJ; Chamorro, A; Rubio, S | JOURNAL OF TECHNOLOGY TRANSFER | 2018 | 31
122 | Public policy for academic entrepreneurship in… | Sandstrom, C; Wennberg, K; Wallin, MW; … | JOURNAL OF TECHNOLOGY TRANSFER | 2018 | 28
37  | Opening the black box of academic entrepreneur… | Skute, I | SCIENTOMETRICS | 2019 | 28
166 | RETHINKING THE COMMERCIALIZATION OF PUBLIC SCI… | Fini, R; Rasmussen, E; Siegel, D; Wik… | ACADEMY OF MANAGEMENT PERSPECTIVES | 2018 | 25
68  | The technology transfer ecosystem in academia.… | Good, M; Knockaert, M; Soppe, B; Wrig… | TECHNOVATION | 2019 | 24
40  | Theories from the Lab: How Research on Science… | Fini, R; Rasmussen, E; Wiklund, J; Wr… | JOURNAL OF MANAGEMENT STUDIES | 2019 | 22
659 | How can universities facilitate academic spin-… | Rasmussen, E; Wright, M | JOURNAL OF TECHNOLOGY TRANSFER | 2015 | 21
73  | Stimulating academic patenting in a university… | Backs, S; Gunther, M; Stummer, C | JOURNAL OF TECHNOLOGY TRANSFER | 2019 | 21

Bibliometric Analysis in VosViewer

To create visualizations of the papers, we take the following steps. First, we export the filtered dataset back into a WOS-formatted text file.

def convertWOStext(dataframe, outputtext):
    dataframe["PY"] = dataframe["PY"].astype(int)
    txtresult = ""
    for y in range(0, len(dataframe)):
        for x in columnnames:
            # skip missing fields (comparing directly against np.nan would always evaluate to True)
            if pd.notnull(dataframe[x].iloc[y]):
                txtresult += x + " " + str(dataframe[x].iloc[y]) + "\n"
        txtresult += "ER\n\n"
    with open(outputtext, "w", encoding='utf-8') as f:
        f.write(txtresult)

convertWOStext(df, 'df.txt')

We can then open the file in VosViewer. From there, we can create various visualizations. I like using bibliographic coupling to map all the papers in the dataset.

I then saved the map from VosViewer. This gives you two files: one with the data on each document and one with the network data. We can modify these files to make certain changes. First, the citation counts in the map reflect citations from all papers, including those outside the dataset; I want the internal citation counts to be shown instead, so I replace them.

def createvosfile1(filename, df, updatecit=False, newclusters=False, newname=None, artclusters=None):
    vosfile1 = pd.read_csv(filename, sep="\t")
    voscolumns = vosfile1.columns
    # recover the article title from the html description field and match it back to the dataframe
    vosfile1["title"] = vosfile1["description"].apply(lambda x: x.split("Title:</td><td>")[1])
    vosfile1["title"] = vosfile1["title"].apply(lambda x: x.split("</td></tr>")[0])
    df["TI2"] = df["TI"].apply(lambda x: " ".join(x.lower().split()))
    vosfile1 = vosfile1.merge(df[[x for x in df.columns if x not in voscolumns]], how="left", left_on="title", right_on="TI2")
    vosfile1["txt"] = vosfile1["TI"].fillna(" ") + " " + vosfile1["DE"].fillna(" ") + " " + vosfile1["AB"].fillna(" ")
    vosfile1["txt"] = vosfile1["txt"].apply(lambda x: x.lower())
    # replace the external citation counts with the internal ones computed earlier
    vosfile1["weight<Citations>"] = vosfile1["intTC"].fillna(0)
    vosfile1 = vosfile1.drop_duplicates('id')
    vosfile1['id'] = vosfile1.reset_index().index + 1
    if newclusters == True:
        # optionally overwrite the cluster assignments (artclusters must then be supplied)
        vosfile1['cluster'] = artclusters
    if updatecit == True:
        vosfile1[voscolumns].to_csv(newname, sep="\t", index=False)
    return vosfile1

df = createvosfile1('Processed/VosViewer_1_Original.txt', df, newname='Processed/VosViewer_1_intCit.txt', updatecit=True, newclusters=False)

The above network uses only the citation data of the publications. To improve it, I like integrating the textual data from the title, abstract and keywords. I followed the steps suggested here for cleaning the text (https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/). I then combine the two similarity measures to allow for hybrid clustering, following:

Liu, Xinhai, Shi Yu, Frizo Janssens, Wolfgang Glänzel, Yves Moreau, and Bart De Moor. “Weighted hybrid clustering by combining text mining and bibliometrics on a large‐scale journal database.” Journal of the American Society for Information Science and Technology 61, no. 6 (2010): 1105-1119.
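
In the createhybridnet function below, the two cosine-similarity matrices are combined on the angular scale: each similarity is converted to an angle with arccos, the angles are averaged with weight weightlex on the lexical part, and the result is converted back into a similarity. In other words, with s_bib the bibliographic-coupling similarity and s_text the tf-idf similarity, the hybrid similarity computed by the code is s_hybrid = cos((1 - weightlex) * arccos(s_bib) + weightlex * arccos(s_text)).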

# Bibliographic coupling
from scipy.sparse import coo_matrix
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

def createbibnet(df):
    # keep only cited references that appear more than once in the dataset
    allsources = Counter(df['CRparsed'].sum())
    allsources = [x for x in allsources if allsources[x] > 1]
    dfcr = df['CRparsed'].reset_index(drop=True)
    # build a sparse article-by-reference incidence matrix
    dfnet = []
    for i, n in enumerate(allsources):
        for y in dfcr[dfcr.apply(lambda x: n in x)].index:
            dfnet.append([i, y])
    dfnet_matrix = coo_matrix(([1] * len(dfnet), ([x[1] for x in dfnet], [x[0] for x in dfnet])),
                              shape=(dfcr.shape[0], len(allsources)))
    # coupling strength = cosine similarity between the articles' reference vectors
    return cosine_similarity(dfnet_matrix, dfnet_matrix)

#Lexical Coupling
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import string
from gensim.models.phrases import Phrases, Phraser

def clean(doc):
    stop = set(stopwords.words('english'))
    exclude = set(string.punctuation) 
    lemma = WordNetLemmatizer()
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    normalized = " ".join([x for x in normalized.split() if not any(c.isdigit() for c in x)])
    normalized = " ".join([x for x in normalized.split() if len(x)>3])
    return normalized

def bigrams(docs):
    phrases = Phrases(docs)
    bigram = Phraser(phrases)
    docs = docs.apply(lambda x: bigram[x])
    phrases = Phrases(docs)
    trigram = Phraser(phrases)
    docs = docs.apply(lambda x: trigram[x])
    return docs

def createtfidf(df, sheet_name):
    df["lemma"] = df["txt"].apply(lambda x: clean(x).split())
    df["lemma"] = bigrams(df["lemma"])
    vect = TfidfVectorizer(min_df=1)
    tfidftemp = vect.fit_transform([" ".join(x) for x in df["lemma"]])
    return cosine_similarity(tfidftemp) 

# Hybrid network
def createhybridnet(df, weightlex, sheet_name='Sheet1'):
    bibnet = createbibnet(df)
    tfidftemp = createtfidf(df, sheet_name)
    # combine the two similarities on the angular scale (weightlex = weight of the lexical part)
    hybnet = pd.DataFrame(np.cos((1 - weightlex) * np.arccos(bibnet) + weightlex * np.arccos(tfidftemp))).fillna(0)
    return hybnet

from itertools import combinations
def createvosviewer2filefromhybrid(hybridlexcit, minimumlink, outputfilename):
    # write a VosViewer network file: one line per pair of items (1-based ids) with the link strength
    forvisuals = []
    for x, y in combinations(hybridlexcit.index, 2):
        val = int(hybridlexcit.loc[x, y] * 100)
        if val > minimumlink:
            forvisuals.append([x, y, val])
    forvisuals = pd.DataFrame(forvisuals)
    forvisuals[0] = forvisuals[0] + 1
    forvisuals[1] = forvisuals[1] + 1
    forvisuals.to_csv(outputfilename, index=False, header=False)
    
dfhybrid = createhybridnet(df, 0.5)
createvosviewer2filefromhybrid(dfhybrid, 0, r'Processed/VosViewer_2_Hybrid.txt')

If we reimport these modified files into VosViewer, we come up with a visualization that incorporates both textual and citation data.

I can then spend a lot of time just exploring the network: looking at the papers in each cluster and checking which ones have high citations. I can also do this with the help of Python, after updating the dataframe with the clustering generated by VosViewer.

df = createvosfile1('Processed/VosViewer_1_Clus.txt', df)
df[df['cluster']==1].sort_values('intTC', ascending=False)[['TI', 'AU', 'SO', 'PY', 'intTC']].head(10)
    | TI | AU | SO | PY | intTC
404 | ENDOGENOUS GROWTH THROUGH KNOWLEDGE SPILLOVERS… | Delmar, F; Wennberg, K; Hellerstedt, K | STRATEGIC ENTREPRENEURSHIP JOURNAL | 2011 | 24
500 | Cognitive Processes of Opportunity Recognition… | Gregoire, DA; Barr, PS; Shepherd, DA | ORGANIZATION SCIENCE | 2010 | 11
439 | Managing knowledge assets under conditions of … | Allarakhia, M; Steven, W | TECHNOVATION | 2011 | 10
353 | Technology entrepreneurship | Beckman, C; Eisenhardt, K; Kotha, S; … | STRATEGIC ENTREPRENEURSHIP JOURNAL | 2012 | 9
484 | IAMOT and Education: Defining a Technology and… | Yanez, M; Khalil, TM; Walsh, ST | TECHNOVATION | 2010 | 8
343 | TECHNOLOGY-MARKET COMBINATIONS AND THE IDENTIF… | Gregoire, DA; Shepherd, DA | ACADEMY OF MANAGEMENT JOURNAL | 2012 | 8
443 | The Strategy-Technology Firm Fit Audit: A guid… | Walsh, ST; Linton, JD | TECHNOLOGICAL FORECASTING AND SOCIAL CHANGE | 2011 | 8
411 | The Cognitive Perspective in Entrepreneurship:… | Gregoire, DA; Corbett, AC; McMullen, JS | JOURNAL OF MANAGEMENT STUDIES | 2011 | 8
596 | Technology Business Incubation: An overview of… | Mian, S; Lamine, W; Fayolle, A | TECHNOVATION | 2016 | 6
303 | Local responses to global technological change… | Fink, M; Lang, R; Harms, R | TECHNOLOGICAL FORECASTING AND SOCIAL CHANGE | 2013 | 6
df[df['cluster']==2].sort_values('intTC', ascending=False)[['TI', 'AU', 'SO', 'PY', 'intTC']].head(10)
    | TI | AU | SO | PY | intTC
410 | Entrepreneurial Origin, Technological Knowledg… | Clarysse, B; Wright, M; Van de Velde, E | JOURNAL OF MANAGEMENT STUDIES | 2011 | 19
461 | On growth drivers of high-tech start-ups: Expl… | Colombo, MG; Grilli, L | JOURNAL OF BUSINESS VENTURING | 2010 | 17
387 | WHEN DOES CORPORATE VENTURE CAPITAL ADD VALUE … | Park, HD; Steensma, HK | STRATEGIC MANAGEMENT JOURNAL | 2012 | 10
514 | The M&A dynamics of European science-based ent… | Bonardo, D; Paleari, S; Vismara, S | JOURNAL OF TECHNOLOGY TRANSFER | 2010 | 9
506 | The role of incubator interactions in assistin… | Scillitoe, JL; Chakrabarti, AK | TECHNOVATION | 2010 | 9
423 | EXPLAINING GROWTH PATHS OF YOUNG TECHNOLOGY-BA… | Clarysse, B; Bruneel, J; Wright, M | STRATEGIC ENTREPRENEURSHIP JOURNAL | 2011 | 9
574 | CHANGING WITH THE TIMES: AN INTEGRATED VIEW OF… | Fisher, G; Kotha, S; Lahiri, A | ACADEMY OF MANAGEMENT REVIEW | 2016 | 9
354 | Amphibious entrepreneurs and the emergence of … | Powell, WW; Sandholtz, KW | STRATEGIC ENTREPRENEURSHIP JOURNAL | 2012 | 8
507 | A longitudinal study of success and failure am… | Gurdon, MA; Samsom, KJ | TECHNOVATION | 2010 | 8
324 | Are You Experienced or Are You Talented?: When… | Eesley, CE; Roberts, EB | STRATEGIC ENTREPRENEURSHIP JOURNAL | 2012 | 8
df[df['cluster']==3].sort_values('intTC', ascending=False)[['TI', 'AU', 'SO', 'PY', 'intTC']].head(10)
    | TI | AU | SO | PY | intTC
398 | 30 years after Bayh-Dole: Reassessing academic… | Grimaldi, R; Kenney, M; Siegel, DS; W… | RESEARCH POLICY | 2011 | 45
298 | Academic engagement and commercialisation: A r… | Perkmann, M; Tartari, V; McKelvey, M; … | RESEARCH POLICY | 2013 | 41
425 | Why do academics engage with industry? The ent… | D’Este, P; Perkmann, M | JOURNAL OF TECHNOLOGY TRANSFER | 2011 | 32
399 | The impact of entrepreneurial capacity, experi… | Clarysse, B; Tartari, V; Salter, A | RESEARCH POLICY | 2011 | 26
427 | Entrepreneurial effectiveness of European univ… | Van Looy, B; Landoni, P; Callaert, J; … | RESEARCH POLICY | 2011 | 23
395 | The Bayh-Dole Act and scientist entrepreneurship | Aldridge, TT; Audretsch, D | RESEARCH POLICY | 2011 | 20
397 | The effectiveness of university knowledge spil… | Wennberg, K; Wiklund, J; Wright, M | RESEARCH POLICY | 2011 | 19
392 | What motivates academic scientists to engage i… | Lam, A | RESEARCH POLICY | 2011 | 19
511 | Convergence or path dependency in policies to … | Mustar, P; Wright, M | JOURNAL OF TECHNOLOGY TRANSFER | 2010 | 19
479 | A knowledge-based typology of university spin-… | Bathelt, H; Kogler, DF; Munro, AK | TECHNOVATION | 2010 | 18

Path analysis with Pajek

Recently, there was an article in Scientometrics about main path analysis by Liu et al. It is supposed to help trace the development path of a scientific or technological field. Before reading it, I had been content with the capabilities of CitNetExplorer for showing the trends in my field of interest, but the technique intrigued me, as it may make the overarching trend in a field simpler to visualize. The only problem is that there is no real tutorial on how to do it. The only thing I found was a YouTube video using Pajek, which honestly was not very informative. On top of that, I had no experience with Pajek, and with its very intimidating interface I really had to tinker with it. Nonetheless, after playing with it, I hacked my way into generating my own main path analysis plots.

In the following, I will explain the process. Note that I do not have much experience with Pajek so there might be easier ways to do it.

Overview

The workflow I engineered was this (more explanation in the coming days):

  1. Download articles from Web of Knowledge
  2. Import articles to CitNetExplorer
  3. Export the citation network file from CitNetExplorer
  4. Reformat the file into a Pajek .net file (a rough conversion sketch follows this list)
  5. Import Pajek net file to Pajek
  6. Run Network -> Acyclic Network  -> Create Weighted Network + Vector -> Traversal Weights -> Search Path Link Count (SPLC). Note that you can choose other weights such as SPC and SPNP; in the article above, however, the authors recommend SPLC, arguing that it better reflects how knowledge diffuses in real life.
  7. Run Network -> Acyclic Network  -> Create (Sub)Network  -> Main Paths -> Global Search -> Key-Route
  8. Enter an arbitrary number of routes. I tried 1-50.
  9. Run Draw -> Network
  10. Run Layout -> Kamada Kawai -> Fix first and last vertex
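
For step 4, a minimal sketch of the conversion is shown below. It assumes you have exported the citation links as a tab-separated table with one row per citation; the file name and column names are placeholders to adapt to whatever CitNetExplorer actually gives you. The Pajek .net format itself is simply a *Vertices section followed by an *Arcs section with 1-based vertex numbers.

import pandas as pd

# Hypothetical input: a table of citation links with one row per citation
# (citing publication id, cited publication id); names are placeholders.
links = pd.read_csv('citation_links.txt', sep='\t')
links.columns = ['citing_id', 'cited_id']

# Pajek requires contiguous, 1-based vertex numbers
ids = sorted(set(links['citing_id']) | set(links['cited_id']))
number = {pub_id: i + 1 for i, pub_id in enumerate(ids)}

with open('citations.net', 'w', encoding='utf-8') as f:
    f.write(f'*Vertices {len(ids)}\n')
    for pub_id, i in number.items():
        f.write(f'{i} "{pub_id}"\n')                              # vertex label = publication id
    f.write('*Arcs\n')
    for citing, cited in zip(links['citing_id'], links['cited_id']):
        f.write(f'{number[citing]} {number[cited]}\n')            # directed edge: citing -> cited

The resulting citations.net file can then be loaded into Pajek, which covers step 5.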

Results

This is a sample map for the field of Fragment-based drug discovery.

 

[In progress. Updates in the coming days]

Bibliometrics Assisted Literature Review

When starting a PhD program, or any research project for that matter, one of the first things you have to do is the literature review. When I first started carrying out my review, I found searching the literature and organizing the readings excruciating. Where do you begin? In what order should you read your articles? Where do you stop reading? After delving into bibliometrics, I found that its tools really help make the literature review less painstaking and more efficient. In this post, I list my ideas on how various bibliometric techniques can aid in this task.

Literature Search

One of the first things to do is to download the literature. Many researchers do this by searching Google Scholar for the keywords they are familiar with. The problem with this approach, especially for beginning researchers, is that they do not yet know all the relevant keywords and thus exclude a lot of important papers. More advanced researchers can resort to the Web of Science or Scopus and apply Boolean operators to narrow or widen their search. Still, the problem persists: how can you ensure that you have not excluded valuable articles that do not use the keywords in their title, abstract or author-identified keywords?

Bibliometrics offers a helpful approach: to make your collection of articles comprehensive, you can grow it from a seed set. First, download a set of articles using keywords you are sure are related to your topic of interest. You can then grow this set by downloading the articles they frequently cite, setting a minimum threshold on how often an article must be cited before it is added. This can easily be done through software like CitNetExplorer, which exports the DOIs.

Extending this further, one can also download the citing articles. This is especially helpful for fields where advances occur constantly, making it difficult to track the keywords being used, and it also helps identify the adjacent fields the original field is extending into. This step can easily be done through the citation report feature of the Web of Science. As a caveat, one should set a threshold on how many papers in the original dataset an article must cite before it is added, to ensure that all the added papers are still relevant. The threshold can be absolute or relative: for instance, require that a paper cite at least 5 papers from the original dataset, or that at least 30% of its references point into it. One should also consider the journal and subject category the article belongs to. A rough sketch of such a filter is shown below.
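
As an illustration of these cut-offs, the sketch below assumes the seed set is in a dataframe df (with the WOS columns used elsewhere on this blog) and the candidate citing articles are in a second dataframe called candidates; both names and the thresholds are just placeholders.

import pandas as pd

# DOIs of the seed (original) dataset
seed_dois = set(df['DI'].dropna().str.lower())

def overlap_stats(cr_field):
    # count how many of a candidate's cited references point into the seed set
    refs = [r.strip().lower() for r in str(cr_field).split('\n') if r.strip()]
    hits = sum(any(doi and doi in r for doi in seed_dois) for r in refs)
    share = hits / len(refs) if refs else 0
    return pd.Series([hits, share])

candidates[['seed_cites', 'seed_share']] = candidates['CR'].fillna('').apply(overlap_stats)
# absolute threshold (cites >= 5 seed papers) or relative threshold (>= 30% of its references)
keep = candidates[(candidates['seed_cites'] >= 5) | (candidates['seed_share'] >= 0.3)]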

Organizing your Papers

Having downloaded the papers, it is important to organize them by topic. To help identify the subtopics within your main topic, you can create a rough co-occurrence map of the keywords, for example with VosViewer. This shows the different keywords used in your literature and how related they are to each other. A quick Python approximation is sketched below.
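
If you want a quick look before firing up VosViewer, a few lines of Python already give a rough version of the same idea: count how often pairs of author keywords (the DE field of the WOS export, which is semicolon-separated) occur in the same article. The dataframe df is assumed to be parsed from the WOS files as in the earlier posts.

from collections import Counter
from itertools import combinations

# count how often two author keywords (WOS field DE) appear in the same article
pair_counts = Counter()
for keywords in df['DE'].dropna():
    terms = sorted({k.strip().lower() for k in keywords.split(';') if k.strip()})
    pair_counts.update(combinations(terms, 2))

# the most frequent pairs hint at the subtopics in the collection
for (a, b), n in pair_counts.most_common(10):
    print(a, '--', b, ':', n)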

A more direct way of organizing the papers is to plot the bibliographic coupling network of the publications. This map positions papers according to how related they are, based on the references they share.

Reading Order

Now that you have organized your papers, there are many ways to read them, according to your preference. I propose subdividing them into core papers and current papers. Read the core papers first to contextualize the foundations of the field; these are identified by a high citation count within your set of papers. The current papers, on the other hand, show the current trends in the field and are identified by looking at the latest publications in the top journals of your field. These journals can themselves be identified by combining measures of citations, the number of relevant articles and the relatedness of keywords. A small sketch of this split is shown below.
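
A minimal sketch of that split, reusing the dataframe and the internal citation count (intTC) built in the earlier posts; the journal shortlist and the year cut-off are placeholders to adapt to your own field.

import pandas as pd

# core papers: most cited within the downloaded set
core = df.sort_values('intTC', ascending=False).head(20)

# current papers: recent publications in a shortlist of top journals (placeholders)
top_journals = ['RESEARCH POLICY', 'JOURNAL OF TECHNOLOGY TRANSFER']
current = df[(df['PY'] >= 2018) & (df['SO'].isin(top_journals))].sort_values('PY', ascending=False)

reading_list = pd.concat([core, current]).drop_duplicates('TI')[['TI', 'AU', 'SO', 'PY', 'intTC']]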

Literature Review

To carry out the actual literature review, everyone has their own system. I have fortunately found something that works for me: it involves combining Microsoft Access with qualitative data analysis software like Atlas.ti. I plan to share my system in the coming weeks.

NOTE: This is draft#1 and is still under revision.

Basic Network Analysis of High Tech Firms

At the Science, Business and Innovation department at VU Amsterdam, students frequently need to assess the strategies of various high tech firms. In this post, I will outline a basic toolkit that academic researchers can use to draw and analyze two basic networks of a company: its collaboration network and its knowledge network. The collaboration network refers to the explicit partnerships that members of a firm have with other institutions; it is usually obtained by looking at the co-authorships in a firm's works. The knowledge network, meanwhile, relates to the sources of knowledge a firm uses in its own innovation and can be traced through the citations in a company's output. The main difference between the two is that a company does not have to formally partner with another organization in order to learn from it; it can also do so by tracking the other company's activities or through informal social networks. This form of learning is not manifested through co-authorships but through citations. By analyzing the citation network, we can also see whether this knowledge relationship is one-sided or whether both companies cite each other's work.

Collaboration and Knowledge Network

In order to draw the various networks of a high tech firm, the first step is simply to look at the company website, which usually already contains a lot of information: the founders, the services and perhaps even the collaborations. With the basic company information known, it is possible to draw various network maps by looking at the firm's patents or publications.

Publications

One of the things I would check first, especially for a high tech startup, is the company's publications. High tech startups publish for a variety of reasons, such as marketing, sometimes using publications as a signal to investors that the company is innovative. Moreover, if a company is a pioneer, publishing can help it gain legitimacy for the emerging field it is part of. Using the Web of Science or Scopus, one can do a basic search on the firm name. In Web of Science's advanced search, you can use the tags OO for organization, OG for organization-enhanced and AD for address. I prefer the address tag, as the database's preprocessing can sometimes modify company names. The problem might be that you cannot find any publications because the company has only just started and has not published under its own name. In such cases, especially for academic spinoffs, you can resort to searching the founders' names; for many startups based in academia, the founder is often still affiliated with the university, so most of the company's publications are still tied to the originating university's name.
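
For example, for a hypothetical spin-off called ExampleTech located in Delft, the advanced search could look like:

AD=(ExampleTech AND Delft)

and, as a fallback when the company itself has no output yet, a search on a (hypothetical) founder combined with the location of the originating university:

AU=(Jansen J*) AND AD=(Delft)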

Patents

The other logical thing to search is patents. I have found the PatentsView platform, which covers the USPTO, to be a very reliable source. Its API allows automated downloading of patents (you just need to read the documentation on the website). The same comment as for publications applies: if the patents cannot be found by searching for the company, they may be registered under the university or just under the founder's name.
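
A minimal sketch of what such a query could look like with the requests library is below. The endpoint, query fields and company name are assumptions based on the legacy PatentsView query API; check the PatentsView documentation for the current endpoint and field names before relying on it.

import requests

# hypothetical company name; endpoint and field names follow the legacy PatentsView API (assumption)
url = 'https://api.patentsview.org/patents/query'
params = {
    'q': '{"_contains": {"assignee_organization": "ExampleTech"}}',
    'f': '["patent_number", "patent_title", "patent_date"]',
}
response = requests.get(url, params=params)
patents = response.json().get('patents') or []
for p in patents[:10]:
    print(p['patent_date'], p['patent_number'], p['patent_title'])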

Networks

Through these two methods, various interesting analyses can be carried out. To draw the knowledge network, I would look at the cited works of the company's publications and patents. For publications, this can easily be done through the cited works/authors feature in VosViewer. For patents, however, some preprocessing is needed to format the cited works before they can be fed to programs such as VosViewer, Gephi or Pajek.

To draw the collaboration network, we look at the co-authorships of publications or patents (for patents, co-inventorship). Once again, this is easy with VosViewer for publications, but patents require some preprocessing. A rough sketch of building such an edge list is shown below.
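
The core of that preprocessing is turning each document's author (or inventor) list into pairwise edges. A rough sketch using the AU field of a WOS dataframe parsed as in the earlier posts (for patents you would substitute the inventor list), writing a simple weighted edge list that Gephi can import:

import csv
from collections import Counter
from itertools import combinations

# build a weighted co-authorship edge list (authors are newline-separated in the parsed AU field)
edges = Counter()
for authors in df['AU'].dropna():
    names = sorted({a.strip() for a in str(authors).split('\n') if a.strip()})
    edges.update(combinations(names, 2))

with open('coauthorship_edges.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Source', 'Target', 'Weight'])   # header recognized by Gephi's spreadsheet import
    for (a, b), w in edges.items():
        writer.writerow([a, b, w])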

The Value of Citations

I attended the European Scientometrics Summer School last Sept. 16-23 in Berlin. For those not familiar with the field, scientometrics refers to the analysis of scientific publications through various statistical methods. As the amount of scientific output increases, scientometricians are needed to organize and make sense of all the data being generated. I found the talks very engaging, as they gave a tour of the methods in the field and their various applications. The organizers did a good job of providing the theoretical background of the concepts used in bibliometric analysis, while balancing it with computer laboratory sessions where we applied what we had learned. I greatly appreciated how they made sure we take measures like citations, impact factor and keyword usage with a grain of salt.

The discussion that caught my attention the most was on the merit of citations. Generally, people tend to take citations for granted: many academics consider them the currency of science, almost the measure of a scientist's worth. The thing, however, is that citations are affected by so many factors that great care should be taken in their analysis. Citation behavior varies per field and subfield and, as noted many times before, has a bias towards English publications. I particularly enjoyed this list of 15 reasons to cite another person's work [1], as presented by Sybille Hinze from DZHW Germany:

  • Paying homage to pioneers
  • Giving credit for related work (homage to peer)
  • Identifying methodology, equipment, etc.
  • Providing background reading
  • Correcting one’s own work
  • Correcting the work of others
  • Criticising previous work
  • Substantiating claims
  • Alerting to forthcoming work
  • Providing leads to poorly disseminated, poorly indexed, or uncited work
  • Authenticating data and classes of facts – physical constants, etc.
  • Identifying original publications in which an idea or concept was discussed
  • Identifying original publications or other work describing an eponymic concept or term
  • Disclaiming work or ideas of others (negative claim)
  • Disputing priority claims of others (negative homage)

[1] Weinstock, M. (1971). Citation Indexes. In: Encyclopedia of Library and Information Science. Vol. 5, p. 16-40, Marcel Dekker Inc., New York

Bibliometrics with Python

It’s been a few months since I last posted on this blog. While doing my PhD, I have been busy learning two things. First, since my background is in chemistry, specifically crystal engineering, I have been transitioning towards the social sciences; there was quite a lot of material to cover to keep up with the latest developments in business and innovation studies. Second, having no prior programming background, I had to spend some time learning the basics. I am happy with my progress in data science with languages such as Python, R and SQL, and tools like Tableau. I will cover the pros and cons of learning to program as a social scientist, and how to learn it efficiently, in another post.

For now, I just want to share a Python script I made to convert Web of Knowledge text files to a dataframe / CSV. This is useful if you want to check each publication manually in Excel before analysis in other bibliometric software such as VosViewer and CitNetExplorer. I also provide code to convert the files back to the original Web of Knowledge format.

Link to ConvertWOStoDataFrame.ipynb

Update: This post is outdated. When I was starting with bibliometrics, I did not realize that you can download a CSV file directly from the Web of Science and feed it directly to VosViewer and CitNetExplorer. Nonetheless, looking back, my lack of knowledge about this feature turned out to be a good thing, as it pushed me to start learning to program seriously in Python.

Bibliometrics as a Tool for Literature Review

Literature review can be a tedious process. With so many articles to read, new researchers in a field can find themselves stuck trying to stay on top of all the required reading. Bibliometrics can be a powerful tool to streamline the process, making article selection more efficient and adding a visual component to it.

Last November 10-11, I gave a talk on bibliometric methods at the 8th joint PhD workshop of VU Amsterdam and FH Munster. I got a really great response, with people asking me to write a manual on the topic. Though I only started using bibliometrics three months ago, I found learning the basics to be a very useful investment. In this post, I will try to create a simple manual on the basics of the method.

Benefits of Bibliometrics

Especially for researchers, here are some things you would be able to do after reading this post:

  • Get an overview of the important publications in your field of study
  • Generate a database of important researchers and institutes in your field
  • Visualize how your field is connected

Workflow

Though there are many ways to do this, I found that using the Web of Science as the database and the bibliometric software VosViewer and CitNetExplorer has the easiest learning curve. The process is generally composed of the following steps:

  1. Formulating keywords
  2. Downloading the articles from the database
  3. Generating the maps using the software

Formulating the Keywords

The first part is just a regular literature search on the Web of Science. Most scientists will already be familiar with this, having done literature searches in the past. Though the basic search usually suffices, it is more efficient to learn how to use the advanced search with Boolean operators.

For example, suppose you are researching entrepreneurship in the Netherlands. You want to search for the terms entrepreneurship and Netherlands together, and you might also want to include related words like business or industry, and even Holland and Dutch. With these in mind, your keyword search could be:

TS = ((entrepreneurship OR business OR industry) AND (Netherlands OR Holland OR Dutch))

This yielded 3,381 results as of Nov 2016. You can then take a preliminary look at the results and decide whether to reformulate the keywords or stick with them. The good thing is that you can easily change your keywords if the list of articles fails to reflect your intended outcome.

Downloading the articles

This part is the easiest yet most tedious. The problem with the Web of Science is that you can only download 500 records at a time. Thus, if you have 3,000 articles, you have to repeat the saving process 6 times.

On the results page, click the down arrow beside 'Save to EndNote online' and choose 'Save to Other File Formats'.

Afterwards, save the first 500 records by entering 1 to 500 in the records field. For the record content, choose the option that includes the cited references. Finally, click Send.

You will then have a text file containing information about the first 500 records.

To save the rest, click the down arrow again and save records 501-1000, 1001-1500 and so on.

Using the Software

With the articles downloaded, it is now possible to analyze them with the software. Download CitNetExplorer; it is just a matter of loading the text files into the program. It automatically generates a map of the most cited papers in your set. The software is smart in that even if an article does not contain the keywords you used, it can still be included if it is cited a lot by the papers in your database.

More importantly, it also shows the connections among these papers. From these, one can infer how the field developed and how ideas have evolved over time. Being able to visualize how the papers relate to one another makes the literature review a little easier.