Fully automated content analysis of citation environments: bringing bibliometrics and text mining together
Supported by the Federal Ministry of Education and Research Germany (BMBF)
Citations are at the core of measuring research performance and form the basis of bibliometrics. In bibliometrics, the study of measuring scientific publications, networks are built from references and then examined for hidden relationships. The number of citations a work receives is treated as equivalent to its relevance, which implicitly assumes that all citations are of equal value. This simplistic assumption draws only a superficial picture of the scientific landscape and has been a recurring point of controversy ever since bibliometric data came into use. In particular, it ignores the relationship of a citation to the text in which it appears, although citations can differ greatly in how they relate to the citing text. The simplification is sufficient for many purposes, but it rules out many interesting possibilities for analysis. Knowing what one has been cited for is far more valuable than merely knowing how often one has been cited.
To make citation analysis as a whole, and the quantification of scientific performance and the assessment of impact in particular, more meaningful, it is necessary to assess not only the number of citations received but also their context. One way of addressing this is to include information about the journal of the citing work, such as its subject classification. Another is to use syntactic features of the text excerpt that contains the reference, for example the position in the text or the distinction between direct and indirect quotations. A third variant is to analyze the semantic characteristics of the text excerpt: citations can be used, for example, to confirm, negate or extend statements. Various proposals for implementing these approaches have been published, and they improve our knowledge of the relationships between citing and cited works. However, they are not suited to showing the contribution of a cited publication, that is, the value of the cited document for the citing work. For example, complex studies are sometimes cited only for a particular methodological refinement, and other works only for coining a term. With the existing approaches, these relationships remain in the dark. At first glance, manually coding citations with qualitative techniques appears to be a possible solution. In practice, however, this approach is of very limited value, since a large number of citation contexts typically have to be examined. This type of analysis therefore requires automatic, scalable methods that uncover the topics for which a publication is cited without manual intervention and are thus able to open the black box of impact.
The research approach presented here combines ideas from classical bibliometrics with developments from computer science and thereby addresses the blind spot described above. Recent progress in big data and data mining, together with ever-increasing computing capacity, now makes it possible to evaluate large amounts of unstructured data automatically using natural language processing methods. This means that even the full texts of publications, which bibliometrics has hardly taken into account to date, can be evaluated automatically. Citations have a direct thematic relation to the text passage surrounding them, for example when they clarify terms, take up theories or support statements. The approach therefore concentrates on the text environments of citations and infers the thematic reference to the cited work from them. This reference then represents the value ("impact") that the cited work has for the citing work.
The aim of this project is to use automated text mining techniques to uncover what a publication was really cited for. This deeper thematic understanding of citations as the basis of bibliometrics opens up a wide range of new analytical possibilities for research.
This approach uses computational linguistics and data mining techniques to measure the impact and thematic references of publication texts. First, all citations of a defined target publication are identified within the full texts of a set of documents. The text environments of these citations are then extracted. After a number of preparatory natural language processing steps, the citation environments are grouped using topic modeling, a text clustering technique. The resulting groups represent the topics for which the target publication is used in other works. From the extracted topics, a topic profile can be generated that illustrates the actual impact, that is, the actual use of the target publication within the scientific community. Such an impact profile could indicate, for example, which types of studies build on a certain conceptual work or in which thematic contexts a method is applied. The creation of topic profiles from the full texts is planned to run fully automatically. It is important to emphasize that the topics are not predefined but are "learned" by the process itself.
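The first two steps described above, locating citations of a target publication and extracting their text environments, can be illustrated with a minimal Python sketch. It is not the project's implementation: the naive sentence splitter, the function names, the `window` parameter, and the example text with its "(Miller 2010)" marker are all assumptions made for illustration. The later grouping step (topic modeling) is not shown; it would take the extracted environments as its input documents.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def citation_environments(full_text, marker, window=1):
    """Return the text around each sentence that contains `marker`.

    `window` is the number of neighbouring sentences kept on each side
    of the citing sentence (an illustrative choice, not a project setting).
    """
    sentences = split_sentences(full_text)
    environments = []
    for i, sentence in enumerate(sentences):
        if marker in sentence:
            start = max(0, i - window)
            end = min(len(sentences), i + window + 1)
            environments.append(" ".join(sentences[start:end]))
    return environments


# Hypothetical citing document; "(Miller 2010)" stands in for the target publication.
doc = (
    "Network effects are hard to measure. "
    "We adopt the sampling scheme of (Miller 2010). "
    "It reduces bias in small samples. "
    "Other work focuses on theory. "
    "The term 'latent impact' was coined in (Miller 2010)."
)

environments = citation_environments(doc, "(Miller 2010)", window=1)
for env in environments:
    print(env)
```

In this toy document the same publication is cited once for a method and once for a term, which is exactly the kind of difference the subsequent grouping step is meant to surface. Real full texts would need a proper citation parser and sentence segmenter rather than string matching.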
The proposed project thus aims to develop a novel method for automatically extracting and evaluating the thematic context of citations. This fully automated content analysis of citation environments is particularly suited to in-depth thematic analysis of impact, understood as the actual use of scientific contributions. In addition, the method can easily be combined with metadata, such as the publication dates of citing works, to calculate thematic trends in the use of knowledge. Researchers can thus use it not only to quantify the impact and uptake of knowledge on a specific topic, but also to enrich bibliometric analyses, for example to uncover patterns of knowledge diffusion within and across scientific disciplines.
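As a rough illustration of combining topics with publication dates, the following Python sketch computes, for each year, the share of citation environments that fall on each topic. It assumes that every citation environment has already been assigned a single topic label; the function name, the labels and the data are hypothetical and serve only to show the calculation.

```python
from collections import Counter, defaultdict


def topic_shares_by_year(assignments):
    """Map each year to {topic: share of that year's citation environments}."""
    counts = defaultdict(Counter)
    for year, topic in assignments:
        counts[year][topic] += 1
    shares = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        shares[year] = {t: n / total for t, n in counts[year].items()}
    return shares


# Hypothetical data: (publication year of the citing work, assigned topic).
assignments = [
    (2015, "method"), (2015, "method"), (2015, "theory"), (2015, "method"),
    (2018, "method"), (2018, "theory"), (2018, "theory"), (2018, "theory"),
]

trend = topic_shares_by_year(assignments)
for year, shares in trend.items():
    print(year, shares)
```

In this invented example the target publication is cited mostly for its method in 2015 and mostly for its theory in 2018, a shift that a plain citation count would not reveal.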
In keeping with the project's orientation towards basic research, the development aims to make the results available to the scientific community. The central ideas will be presented to an expert audience in a way that allows others to implement the new method according to their own needs. To facilitate fast transfer within the scientific community, the software developed in the project will be made available to other researchers as documented source code, following the open source paradigm.
The project extends current bibliometric research by developing a new method to measure the thematic contribution of scientific publications and thus to quantify and analyze impact. The availability of the method provides a new basis for investigating the structure of scientific disciplines and the flow of knowledge within them. Releasing the computer program as source code under an open source license is intended to ensure further development beyond the funding period and to allow other researchers to test the method in different disciplines and for different analytical purposes.
The new methodology offers companies and public institutions a new tool for assessing the relevance of scientific publications and of the individuals and institutions that produce the findings. It is therefore of particular interest for public research evaluation and management as well as for research-related industries and R&D institutions. In these settings, it enables a novel approach to expert search: Which researcher or institute at which research institution is known in the scientific community for topic X? Which researcher or institute is qualified to review publications on topic Y? Such questions could be answered by a "backward search" across a large number of impact profiles. Furthermore, the procedure enables public institutions to evaluate the expertise of research institutions, and changes in an impact profile can be used to evaluate funding measures.
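One way to picture such a "backward search" is as a ranking over stored impact profiles. The Python sketch below assumes each profile is a mapping from topic to the share of citations that fall on that topic; the function name, the profile format and the example data are hypothetical and not taken from the project.

```python
def backward_search(profiles, topic, top_n=3):
    """Rank publications by the share of their citations that fall on `topic`."""
    scored = [(weights.get(topic, 0.0), pub) for pub, weights in profiles.items()]
    # Drop publications that are never cited for the topic.
    scored = [(score, pub) for score, pub in scored if score > 0]
    # Highest share first; ties are broken alphabetically by identifier.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(pub, score) for score, pub in scored[:top_n]]


# Hypothetical impact profiles: publication -> {topic: share of citations}.
profiles = {
    "paper_a": {"sampling": 0.7, "theory": 0.3},
    "paper_b": {"theory": 0.9, "sampling": 0.1},
    "paper_c": {"terminology": 1.0},
}

hits = backward_search(profiles, "sampling")
print(hits)
```

Mapping the ranked publications to their authors or institutions would then give candidate experts for the topic. That step requires author metadata and is left out here.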