Essays on text mining : methodological advances and practical applications to scientific texts
Rüdiger, Matthias Sebastian; Salge, Torsten-Oliver (Thesis advisor); Wentzel, Daniel (Thesis advisor)
Dissertation / PhD Thesis
Dissertation, Rheinisch-Westfälische Technische Hochschule Aachen, 2020
The building blocks of the science system are the scientific publications. They serve to present new findings and to tear down existing knowledge. As scientific publications increase in volume, and as science continues to fragment into more specialized fields, the difficulty of organizing all the information increases, too. To uphold the productivity of science, it is necessary to consolidate scientific information regularly: to organize it, to weigh the importance of different contributions, and to integrate the information into a coherent body of knowledge. This way, the latest state of research is disseminated in condensed form, allowing scientists to keep pace with new developments and discoveries. Two approaches have proven especially useful for this purpose, namely review articles and citation analysis. Both approaches, however, struggle with the increasing number of scientific publications, albeit in very different ways. While reviews are in dire need of support in dealing with the sheer volume of content, citation analysis has barely begun to consider content at all. Current advances in machine learning and text mining open up several opportunities to mitigate or even overcome these challenges. Against this background, I carry out two research projects in my thesis. The first project is dedicated to exploring the opportunities and limitations of text mining. Following calls for research on the further development of text mining as a method, as well as on its reliability and validity, my primary goal is to inform decisions about all critical steps in the process. These steps comprise the transformation of text into numbers, the selection of algorithms, and the evaluation of results. With the results of the first project, I provide technical guidance for the conduct of computer-aided literature reviews in particular and for the application of text mining in research in general.
Further, I contribute findings on the validity and reliability of text mining methods, information that has been largely absent from the extant literature. The results of the first project form the basis for the second project of my thesis: the development and application of two analytical instruments based on citation analysis in combination with text mining. Both instruments introduce the content dimension to citation analysis, one from a micro-perspective and one from a macro-perspective. The micro-perspective focuses on the contextual aspects of individual citations and the relationship between cited and citing work. In this way, the thematic contribution of academic research can be identified, and, more importantly, its actual reception and use by the academic community become apparent. In the macro-perspective, in contrast, science is considered as a whole, and flows of knowledge are observed within and between disciplines on a thematic level. Both instruments can be fully automated, and each is presented as part of a comprehensive case study to illustrate its analytical capabilities. The case studies take as an example the still-developing and diverse field of Information Systems (IS), which benefits particularly from the newly developed instruments. This thesis comprises four essays that are informed by the two research projects. The first project evaluates the application of text mining algorithms. Essay I compares algorithm performance and investigates the validity and reliability of result evaluation metrics. Essay II takes a close look at the text preprocessing and vectorization required for the subsequent application of text mining algorithms. Both essays are backed by comprehensive experiments based on automatically generated test data from Wikipedia. The second research project builds on the technical insights of the first project and presents the development of two instruments, both augmenting citation analysis with a content dimension.
Essay III takes a micro-perspective and investigates the contextual relationships between articles in the field of Information Systems that are linked via one or more citations; specifically, it evaluates the impact of articles contributing to the discourse around the Technology Acceptance Model. Essay IV takes a macro-perspective: it studies the flow of knowledge to and from the discipline of Information Systems and discusses the positioning of IS in the landscape of scientific fields. The findings of this thesis contribute to our understanding of how text mining methods can both facilitate text-analytical work and complement bibliometric research methods, and how those insights can be put into practice. The four essays collectively (a) advance knowledge about text mining as a research method, (b) enrich the set of instruments in bibliometrics, and (c) expand the knowledge of the field of Information Systems through the application of the novel instruments. The instruments serve to improve the introspection of science and illustrate how this can be achieved on two different levels of analysis, using the discipline of Information Systems as an example. In this way, I facilitate the study of knowledge production, dissemination, and absorption at a more granular level than before.