Bibliometrics Assisted Literature Review

When starting a PhD program, or any research project for that matter, one of the first things you have to do is the literature review. When I first carried out a review, I found searching the literature and organizing the readings excruciating. Where do you begin? In what order should you read your articles? Where do you stop reading? After delving into bibliometrics, I found that its tools really help make the literature review less painful and more efficient. In this post, I list my ideas on how various bibliometric techniques can aid in this task.

Literature Search

One of the first things one has to do is to download the literature. Many researchers do this by searching Google Scholar for the keywords they are familiar with. The problem with this process is that beginning researchers, especially, do not know all the relevant keywords in the first place and thus exclude a lot of important papers. More advanced researchers can resort to the Web of Science or Scopus and apply various Boolean operators to narrow or widen their search. But the problem persists: how can you ensure that you have not excluded valuable articles that do not use your keywords in their title, abstract or author-identified keywords?

Bibliometrics has an approach that can be helpful. To ensure that your collection of articles is comprehensive, you can grow it from a seed of articles. To do so, you first download a set of articles through keywords that you are sure are related to your topic of interest. After downloading the data for this set of articles, you can grow the set by downloading the articles they frequently cite. One can set a minimum number of citations an article should receive from the set before it is downloaded. This can easily be done through software like CitNetExplorer, which exports the DOIs of these articles.
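To make the thresholding step concrete, here is a minimal sketch in Python. It is not how CitNetExplorer works internally; it only assumes that you already have, for each seed paper, the list of DOIs in its reference list. The function name `frequently_cited` and the example DOIs are made up for illustration.

```python
from collections import Counter

def frequently_cited(seed_references, min_citations=2):
    """Return references cited by at least `min_citations` seed papers
    that are not already part of the seed set."""
    counts = Counter()
    for refs in seed_references.values():
        counts.update(set(refs))  # count each reference once per citing paper
    return sorted(
        ref for ref, n in counts.items()
        if n >= min_citations and ref not in seed_references
    )

# Hypothetical seed set: paper DOI -> DOIs found in its reference list
seed_references = {
    "10.1000/a": ["10.1000/x", "10.1000/y"],
    "10.1000/b": ["10.1000/x", "10.1000/z", "10.1000/a"],
    "10.1000/c": ["10.1000/x", "10.1000/y", "10.1000/a"],
}

to_download = frequently_cited(seed_references, min_citations=2)
print(to_download)
```

Raising `min_citations` keeps the collection focused; lowering it makes the collection more comprehensive at the cost of more noise.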

Extending this further, another step is to download the citing articles. This is especially helpful for fields where advances occur constantly, making it difficult to track the keywords being used. It also allows one to identify the adjacent fields that the original field is extending to. This step can easily be done through the citation report feature of the Web of Science. As a caveat, though, one should set a threshold on how many papers from the original dataset a citing paper must cite before it is added, to ensure that all the added papers are still relevant. The threshold can be absolute or relative. For instance, one could require that a paper cites at least 5 papers from the original dataset, or that at least 30% of its references come from it. One should also consider the journal and category the article belongs to.
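The absolute and relative thresholds can be sketched as a simple filter. This is only an illustration under the assumption that each candidate citing paper comes with its list of cited references; the function name `is_relevant` and the identifiers are hypothetical, and the defaults of 5 papers and 30% are the example values from the text.

```python
def is_relevant(cited_refs, original_ids, min_count=5, min_share=0.30):
    """Keep a citing paper if it cites at least `min_count` papers from the
    original dataset, or if at least `min_share` of its references are in it."""
    refs = set(cited_refs)
    if not refs:
        return False
    overlap = len(refs & original_ids)
    return overlap >= min_count or overlap / len(refs) >= min_share

# Hypothetical original dataset and candidate citing papers
original_ids = {"p1", "p2", "p3", "p4", "p5", "p6"}
candidates = {
    # 5 of 25 references are in the dataset: passes the absolute threshold
    "c1": ["p1", "p2", "p3", "p4", "p5"] + [f"q{i}" for i in range(20)],
    # 1 of 3 references is in the dataset: passes the relative threshold
    "c2": ["p1", "q1", "q2"],
    # 1 of 10 references is in the dataset: fails both thresholds
    "c3": ["p1"] + [f"q{i}" for i in range(9)],
}

kept = [pid for pid, refs in candidates.items() if is_relevant(refs, original_ids)]
print(kept)
```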

Organizing your Papers

Having downloaded the papers, it is now important to organize them by topic. To help identify the subtopics within your main topic, you can create a rough co-occurrence map of the keywords. This can be done with software like VosViewer. The map shows you the different keywords used in your literature and how related they are to each other.
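The counting behind such a map is simple: two keywords co-occur whenever they appear on the same paper. The sketch below only produces raw pair counts from hypothetical keyword lists; VosViewer additionally normalizes the counts and lays out the map, which is not attempted here.

```python
from collections import Counter
from itertools import combinations

def keyword_cooccurrence(keyword_lists):
    """Count how often each pair of keywords appears on the same paper."""
    pairs = Counter()
    for keywords in keyword_lists:
        # Lowercase and deduplicate so spelling variants in case are merged
        unique = sorted({k.strip().lower() for k in keywords if k.strip()})
        pairs.update(combinations(unique, 2))
    return pairs

# Hypothetical author keywords, one list per paper
papers = [
    ["Bibliometrics", "Citation analysis", "Innovation"],
    ["bibliometrics", "innovation"],
    ["Citation analysis", "Bibliometrics"],
]

cooc = keyword_cooccurrence(papers)
for pair, count in cooc.most_common():
    print(pair, count)
```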

A more direct way of organizing the papers is to plot the bibliographic coupling network of the publications. This plot positions papers according to how related they are to each other, based on the references they share.
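As a rough illustration, the coupling strength between two papers can be taken as the number of references they have in common. The sketch below uses made-up paper and reference identifiers and leaves the plotting to dedicated software.

```python
from itertools import combinations

def bibliographic_coupling(references):
    """Return {(paper_a, paper_b): number of shared references} for every
    pair of papers that shares at least one reference."""
    ref_sets = {paper: set(refs) for paper, refs in references.items()}
    strengths = {}
    for a, b in combinations(sorted(ref_sets), 2):
        shared = len(ref_sets[a] & ref_sets[b])
        if shared:
            strengths[(a, b)] = shared
    return strengths

# Hypothetical papers and their reference lists
references = {
    "paper1": ["r1", "r2", "r3"],
    "paper2": ["r2", "r3", "r4"],
    "paper3": ["r5"],
}

coupling = bibliographic_coupling(references)
print(coupling)
```

Papers with no shared references, like `paper3` here, end up unconnected, which is one way to spot items that may not belong to the topic.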

Reading Order

Now that you have organized your papers, there are many ways to read them according to your preference. I propose subdividing them into core papers and current papers. You can read the core papers first to understand the foundations of the field. These core papers are identified by a high citation count within your set of papers. The current papers, on the other hand, show the current trends in the field. These are identified by looking at the latest publications in the top journals in your field. These journals can be identified by combining measures of citations, number of relevant articles and relatedness of keywords.

Literature Review

To carry out the actual literature review, everyone has their own system. I fortunately have found something that works for me. It involves combining Microsoft Access with a qualitative data analysis software like Atlas.Ti. I plan to share my system in the coming weeks.

NOTE: This is draft#1 and is still under revision.

Basic Network Analysis of High Tech Firms

At the Science, Business and Innovation department at VU Amsterdam, students frequently need to assess the strategies of various high tech firms. In this post, I outline a basic toolkit that academic researchers can use to draw and analyze two basic networks of a company: the knowledge network and the collaboration network. The collaboration network refers to the explicit partnerships that members of a firm have with other institutions. It is usually obtained by looking at the co-authorships in a firm’s works. The knowledge network, meanwhile, relates to the sources of knowledge that a firm uses in its own innovation. It can be traced by looking at the citations in a company’s output. The main difference between the two networks is that a company does not have to formally partner with another organization in order to learn from it; it can also do so by tracking the other company’s activities or through informal social networks. This form of learning is manifested not through co-authorships but through citations. By analyzing the citation network, we can see whether this knowledge relationship is one-sided or whether both companies cite each other’s works.
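To illustrate the last point, here is a small sketch that checks whether a citation relationship runs in both directions. It assumes the citations have already been aggregated to the organization level as (citing, cited) pairs; the organization names and function names are hypothetical.

```python
from collections import Counter

def citation_flows(citations):
    """Count directed citations between organizations, ignoring self-citations."""
    return Counter((src, dst) for src, dst in citations if src != dst)

def is_reciprocal(flows, org_a, org_b):
    """True if each organization cites the other at least once."""
    return flows[(org_a, org_b)] > 0 and flows[(org_b, org_a)] > 0

# Hypothetical (citing organization, cited organization) pairs
citations = [
    ("FirmA", "FirmB"),
    ("FirmA", "FirmB"),
    ("FirmB", "FirmA"),
    ("FirmA", "UniversityC"),
]

flows = citation_flows(citations)
print(is_reciprocal(flows, "FirmA", "FirmB"))        # two-sided relationship
print(is_reciprocal(flows, "FirmA", "UniversityC"))  # one-sided relationship
```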

Collaboration and Knowledge Network

In order to draw the various networks of a high tech firm, the first step is simply to look at the company website. It usually has tons of information about a company already. It shows its founders, its services and perhaps even its collaborations. With basic company information known, it is now possible to draw various network maps either by looking at the firm’s patents or publications.

Publications

One of the first things I would check, especially for a high tech startup, is the publication set of the company. High tech startups publish for a variety of reasons, such as marketing, sometimes using publications as a signal to investors that the company is innovative. Moreover, if a company is a pioneer in a field, publishing can help it gain legitimacy for the emerging field that it is part of. Using the Web of Science or Scopus, one could do a basic search of the firm name. In Web of Science’s advanced search, you could use the tags OO for organization, OG for organization-enhanced and AD for address. I prefer to use the address tag, as the database’s preprocessing algorithm can sometimes modify the names of companies. However, you might not find any publications because the company has just started and thus has not yet carried out any activity under its own name. In such cases, especially for academic spinoffs, you can resort to searching the founders’ names. For many startups based in academia, the founder might still be affiliated with the university, so most of the company’s publications remain tied to the originating university’s name.

Patents

The other logical thing to search for is patents. I have found the PatentsView platform, which covers the USPTO, to be a very reliable source for patents. Its API allows automated downloading of patents (you just need to read the documentation found on the website). As with publications, if the patents cannot be found through a company search, they may be registered under the university or under just the founder’s name.

Networks

With these two data sources, various interesting analyses can be carried out. To draw the knowledge network, I would look at the cited works of the publications/patents of the company. For publications, this can easily be done through the cited works/authors feature in VosViewer. For patents, however, the cited works first need to be preprocessed into a format that can then be fed to programs such as VosViewer / Gephi / Pajek.

To draw the collaboration network, we have to look at the co-authorships of publications or patents. Once again, this can easily be carried out with VosViewer for publications but preprocessing should be done for patents.
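As an example of what this preprocessing could look like, the sketch below turns a list of patents, each with the organizations named on it, into a weighted edge list in CSV form. The input structure and names are hypothetical, and the `Source`, `Target` and `Weight` column names are, to my knowledge, what Gephi's spreadsheet import looks for; check the documentation of the program you use.

```python
import csv
import io
from collections import Counter
from itertools import combinations

def collaboration_edges(records):
    """Count how often each pair of organizations appears on the same record."""
    edges = Counter()
    for parties in records:
        unique = sorted(set(parties))
        edges.update(combinations(unique, 2))
    return edges

def to_edge_list_csv(edges):
    """Serialize the edges as CSV text with Source, Target, Weight columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()

# Hypothetical patents, each listing the organizations named on it
patents = [
    ["StartupX", "University Y"],
    ["StartupX", "University Y", "Firm Z"],
    ["StartupX"],  # single-party patents add no edges
]

edges = collaboration_edges(patents)
csv_text = to_edge_list_csv(edges)
print(csv_text)
```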

The Value of Citations

I attended the European Scientometrics Summer School last Sept. 16-23 in Berlin. For those not familiar with the field, scientometrics refers to the analysis of scientific publications through various statistical methods. As the amount of scientific output increases, scientometricians are needed to organize and make sense of all the data being generated. I found the talks very engaging, as they gave a tour of the methods in the field and their various applications. The organizers did a good job of providing the theoretical background of various concepts used in bibliometric analysis while balancing it with computer laboratory sessions where we applied the concepts learned. I greatly appreciate how they wanted to ensure that we take various measures like citations, impact factor, keyword usage, etc. with a grain of salt.

The discussion that caught my attention the most was on the merit of citations. I think, generally, people tend to take citations for granted. Many academics consider citations the currency of science. They are almost the measure of a scientist’s worth. The thing, however, is that citations are affected by so many factors that great care should be taken in their analysis. They vary per field and per subfield and, as noted many times before, are biased towards English publications. I particularly enjoyed this list of 15 reasons to cite another person’s work[1] as presented by Sybille Hinze from DZHW Germany:

  • Paying homage to pioneers
  • Giving credit for related work (homage to peer)
  • Identifying methodology, equipment, etc.
  • Providing background reading
  • Correcting one’s own work
  • Correcting the work of others
  • Criticising previous work
  • Substantiating claims
  • Alerting to forthcoming work
  • Providing leads to poorly disseminated, poorly indexed, or uncited work
  • Authenticating data and classes of facts – physical constants, etc.
  • Identifying original publications in which an idea or concept was discussed
  • Identifying original publications or other work describing an eponymic concept or term
  • Disclaiming work or ideas of others (negative claim)
  • Disputing priority claims of others (negative homage)

[1] Weinstock, M. (1971). Citation Indexes. In: Encyclopedia of Library and Information Science. Vol. 5, p. 16-40, Marcel Dekker Inc., New York

Bibliometrics with Python

It’s been a few months since I last posted on this blog. As I am doing my PhD, I have been quite busy learning two things. First, since my background is in chemistry, specifically crystal engineering, I have been busy transitioning towards the social sciences. There’s quite a lot of material I had to cover to be able to keep up with the latest areas in Business and Innovation studies. Second, having no programming background, I had to spend some time learning the basics. I am happy with my progress in data science with languages such as Python, R and SQL, and other tools like Tableau. I will cover the pros and cons of learning programming as a social scientist, and how to actually learn it efficiently, in another post.

For now, I just want to share Python code I wrote to convert Web of Knowledge text files to a DataFrame / CSV. This is useful if you want to check each publication manually in Excel before analysis in other bibliometric software such as VosViewer and CitNetExplorer. I also provide code to convert these back to the original Web of Knowledge format.

Link to ConvertWOStoDataFrame.ipynb
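For readers who just want the gist, below is a minimal sketch of the idea, not the notebook itself. It assumes the usual plain-text export layout: each field starts with a two-letter tag, continuation lines are indented, `ER` ends a record and `EF` ends the file. The set of tags treated as multi-valued is my own assumption for illustration. It uses only the standard library; the resulting list of dictionaries can also be passed to `pandas.DataFrame`.

```python
import csv
import io

# Assumption: tags whose continuation lines are separate items (authors,
# addresses, cited references); other fields are wrapped text.
MULTI_VALUE_TAGS = {"AU", "AF", "C1", "CR"}

def parse_wos(text):
    """Parse a tagged plain-text export into a list of {tag: value} dicts."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        head, value = line[:2], line[3:].strip()
        if head == "ER":              # end of record
            records.append(current)
            current, tag = {}, None
        elif head in ("FN", "VR", "EF"):
            continue                  # file header and end-of-file markers
        elif head.strip():            # a new field tag
            tag = head
            current[tag] = [value]
        elif tag is not None:         # indented continuation line
            current[tag].append(value)
    return [
        {t: ("; " if t in MULTI_VALUE_TAGS else " ").join(v) for t, v in rec.items()}
        for rec in records
    ]

def to_csv(records):
    """Write the records as CSV text, one column per tag."""
    fields = sorted({t for rec in records for t in rec})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Made-up sample in the assumed layout
sample = """FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Smith, J
   Doe, A
TI A study of citation
   networks
PY 2015
ER

PT J
AU Lee, K
TI Another paper
PY 2016
ER

EF
"""

records = parse_wos(sample)
csv_text = to_csv(records)
print(csv_text)
```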

Update: This post is outdated. When I was starting with bibliometrics, I did not realize that you can download a CSV file directly from the Web of Science and that this file can be fed directly to VosViewer and CitNetExplorer. Nonetheless, looking back, my lack of knowledge about this feature turned out to be a good thing, as it pushed me to start learning seriously to program in Python.