MA Thesis for Library and Information Science

An overview of my MA(LIS) thesis and its progress

(This document was originally written in German.)

General Information

Title: Publication Practices for Research Data in University Theses
Subtitle: An Examination of Publication Formats and Methods

University: Humboldt University of Berlin
Faculty: Faculty of Philosophy
Institute: Institute of Library and Information Science

Reviewer 1: Dr. Sarah Dellmann
Reviewer 2: Prof. Dr. Robert Jäschke

Exposé

Introduction

There are three publication forms for research data (RD) in academic theses (AT) (Reilly et al., 2011, pp. 5 f.):

  1. Data fully integrated into the AT (e.g., tables and graphics embedded in the PDF file of the AT),
  2. Data attached to the AT (e.g., files uploaded to the university’s publication server along with the PDF file of the thesis), and
  3. Data uploaded to a separate repository and referenced within the AT.

In the academic context, prescriptive articles from the DFG-funded project eDissPlus (Weisbrod et al., 2017; Kleineberg & Kaden, 2018; Weisbrod, 2018) as well as the Policy for Dissertation-related Research Data of the German National Library (Deutsche Nationalbibliothek [DNB], 2017) increasingly provide guidelines for handling RD in AT. However, comprehensive studies are lacking on how effective these guidelines are and how they are enforced among students (e.g., through corresponding examination regulations and university library consultations on the topic). So far, there are at most highly specialized, discipline-specific studies.

This master’s thesis aims to provide a more general, cross-disciplinary investigation of this question.

Research Question

Main Research Question

How were RD from AT published in the institutional repository of Leibniz University Hannover (LUH Repository) up to December 2023?

This can be divided into the following subordinate research questions:

  1. What proportion of AT had RD published as part of the PDF file?
  2. What proportion of AT had RD published as a separate file in the form of a supplement?
  3. What proportion of AT had RD published in a separate repository?
  4. How are RD in AT distinguished and linked with the text of the AT?
  5. How is it made visible in the metadata of AT that there are associated research data?

Subsidiary Research Question

To what extent have recommendations regarding RD in AT already been incorporated into examination regulations and other guiding documents at German universities?

Methodology

To answer these research questions, the work process for the master’s thesis is divided into four modules:

  1. The analysis of German doctoral regulations and overarching guidelines regarding RD.
  2. The manual classification of AT in the LUH Repository regarding RD.
  3. The evaluation of the results from the first two modules focusing on potential recommendations regarding RD.
  4. The training of a model for automatic classification of AT regarding RD based on the results of the preceding manual classification work.

Module 1: Doctoral Regulations

Here, the doctoral regulations and other relevant guiding documents of a simple random sample (n=173) of all universities in Germany that are entitled to award doctorates (N=313) are examined. The sample size was calculated for a confidence level of 95% and a margin of error of 5%.
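
The stated sample size can be reproduced with a minimal sketch, assuming Cochran's formula with finite population correction, a z-score of 1.96 for the 95% confidence level, and the conservative proportion p = 0.5. The proposal does not name the formula that was used, so the function name and its parameters are illustrative assumptions.

```python
import math


def sample_size(population: int,
                z_score: float = 1.96,        # assumed z-score for 95% confidence
                margin_of_error: float = 0.05,
                proportion: float = 0.5) -> int:
    """Cochran's sample size with finite population correction (assumed method)."""
    # Sample size for an infinitely large population.
    n0 = (z_score ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    # Correct for the finite population and round up to whole units.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(sample_size(313))  # 173
```

Under these assumptions, a population of 313 universities yields the sample size of 173 given above.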

Module 2: Manual AT Classification

Here, a stratified sample of AT in the LUH Repository is manually classified according to whether the AT

  1. have no RD,
  2. have RD as part of the PDF file,
  3. have RD as attached file(s), or
  4. have RD in an external repository.

The sample is stratified by the faculties of LUH and by four 3-year periods. For this module, administrative access to the LUH Repository will be obtained. The exact sample size can only be calculated once this access is available. The classification itself considers the content of the PDF file as well as the associated metadata in the LUH Repository.
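
The stratification described above can be sketched as follows. This is only an illustration under stated assumptions: the record fields `faculty` and `year`, the sampling fraction, the seed, and the first year of the observation window are hypothetical and not taken from the LUH Repository.

```python
import random
from collections import defaultdict


def period_of(year: int, first_year: int, span: int = 3) -> int:
    """Map a publication year to the index of its multi-year period."""
    return (year - first_year) // span


def stratified_sample(records, first_year, fraction, seed=42, span=3):
    """Draw a seeded random sample from every (faculty, period) stratum."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    strata = defaultdict(list)
    for rec in records:
        key = (rec["faculty"], period_of(rec["year"], first_year, span))
        strata[key].append(rec)

    sample = []
    for key in sorted(strata):  # sorted keys make the draw order stable
        members = strata[key]
        # Proportional allocation, but at least one thesis per stratum.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical metadata records standing in for repository entries.
records = [
    {"id": i, "faculty": f"Faculty {i % 3}", "year": 2012 + (i % 12)}
    for i in range(120)
]
sample = stratified_sample(records, first_year=2012, fraction=0.25)
print(len(sample))
```

Using a fixed seed means the same sample can be drawn again later, which helps when the classification has to be documented or repeated.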

Module 3: Evaluation & Recommendations

Here, the results of the first two modules are evaluated. Based on the data obtained, concepts are developed for improving the handling of RD in AT and for identifying the target groups that these efforts should primarily address.

Module 4: Training of the Classification Model

Here, the results of the previous classification work are used to train a model that can classify the remaining AT in the LUH Repository according to RD status. The training and construction of the model are expected to follow Younes and Scherp’s work on identifying and extracting datasets in scientific articles (missing reference).

Depending on whether LUH has the resources to check the results, either a one-step procedure (direct identification and extraction via a pre-trained language model such as DeBERTa in question-answering mode) or a two-step procedure (filtering via an MLP, followed by extraction via a pre-trained language model such as RoBERTa) will be used. According to current expectations, the former has higher precision, and therefore requires less post-processing, but lower recall; the latter has higher recall but lower precision.
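
The trade-off between the two procedures can be made concrete with the standard definitions of precision and recall. The following is a minimal sketch; the function name and the thesis identifiers are hypothetical and chosen only to illustrate the two expected behaviours, not to report results.

```python
def precision_recall(predicted: set, actual: set) -> tuple:
    """Return (precision, recall) for predicted vs. actual positives."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical IDs of theses that really contain research data.
actual = {1, 2, 3, 4}

# A cautious classifier: everything it flags is correct, but it misses cases.
print(precision_recall({1, 2}, actual))              # (1.0, 0.5)

# A generous classifier: it finds every case, but also flags false positives.
print(precision_recall({1, 2, 3, 4, 5, 6}, actual))  # precision 4/6, recall 1.0
```

High precision means few false positives to remove by hand, while high recall means few theses with RD are overlooked.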

Schedule

gantt
    title Schedule for the Master's Thesis
    dateFormat YYYY-MM-DD
    tickInterval 1month
    weekday monday
    todayMarker on
        Preparation       :v1, 2024-02-15, 14d
        Module 1          :m1, 2024-02-29, 14d
        Module 2          :m2, 2024-03-14, 30d
        Module 3          :m3, 2024-04-06, 14d
        Module 4          :m4, 2024-04-11, 40d
        Writing Phase     :s1, 2024-05-11, 34d
Figure: A provisional schedule for the completion of the master's thesis as a Gantt chart.

PDF Version

A German PDF version of this proposal (without the Gantt chart) can be downloaded here.

Current Status

  • Preparation Phase
  • Processing Phase
    • Module 1
      • List of all German universities
      • Filter list by eligibility for doctoral studies
      • Create script for seed-based random selection from university list (Result: downloadable here)
      • Take a simple random sample
      • Collect doctoral regulations & other relevant documents of the sample
      • Evaluate doctoral regulations of the sample
    • Module 2
      • Download metadata of all LUH Repository dissertations
      • Find a way to automatically download all relevant files
        • Check if DSpace 5 provides internal function (Result: not available)
        • Create script that downloads all PDF files and accompanying files
      • Create script to stratify dissertations into Year+Faculty groupings
      • Take stratified random sample
        • Reevaluate stratification based on output (Result: switch to three groupings of four years each instead of four groupings of three years each)
      • Download all relevant files
      • Decide on metadata scheme to classify research data for subsequent upload of classification into DSpace
      • Evaluate all dissertations
        • Check for internal research data
        • Check for accompanying research data
        • Check for external research data
    • Module 3
    • Module 4
      • Sort PDF files
      • Install Grobid
      • Convert PDF files to TEI-XML files
      • Sort TEI-XML files by language
      • Check TEI-XML data quality
      • Create CSV dataset (by paragraph)
      • Classify paragraphs of dataset
      • Write model training script
      • Evaluate performance
  • Writing Phase
    • Introduction
    • Module 1
    • Module 2
    • Module 3
    • Module 4
    • Conclusion
  • Submission
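
The Module 4 step "Create CSV dataset (by paragraph)" could be sketched as follows, assuming the TEI-XML files produced by Grobid use the standard TEI namespace and keep the body text in `p` elements. The function names, the column names, and the empty `label` column for the later manual classification are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Standard TEI namespace, as used in Grobid's TEI-XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def paragraphs_from_tei(tei_xml: str) -> list:
    """Extract the non-empty body paragraphs of a TEI-XML document."""
    root = ET.fromstring(tei_xml)
    paragraphs = []
    for p in root.iterfind(".//tei:body//tei:p", TEI_NS):
        # Join nested text (e.g. inside <ref>) and normalise whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


def write_paragraph_csv(doc_id: str, paragraphs: list, out) -> None:
    """Write one CSV row per paragraph, with an empty label column."""
    writer = csv.writer(out)
    writer.writerow(["doc_id", "paragraph_no", "text", "label"])
    for number, text in enumerate(paragraphs, start=1):
        writer.writerow([doc_id, number, text, ""])


# Hypothetical miniature TEI document for demonstration.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>'
    "<head>Intro</head><p>The data are in <ref>Table 1</ref>.</p>"
    "</div></body></text></TEI>"
)
buffer = io.StringIO()
write_paragraph_csv("thesis-001", paragraphs_from_tei(sample), buffer)
print(buffer.getvalue())
```

Splitting by paragraph keeps each training example short enough for a pre-trained language model, at the cost of losing context that spans paragraphs.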

References

2018

  1. B-FDM
    Zur Veröffentlichung dissertationsbezogener Forschungsdaten: Perspektiven und Kompetenzen von Promovierenden an Berliner Universitäten
    Michael Kleineberg and Ben Kaden
    Bausteine Forschungsdatenmanagement, Oct 2018
  2. O-BIB
    Pflichtablieferung von Dissertationen mit Forschungsdaten an die DNB – Anlagerungsformen und Datenmodell
    Dirk Weisbrod
    o-bib. Das offene Bibliotheksjournal, Jul 2018

2017

  1. HUB
    eDissPlus – Optionen für die Langzeitarchivierung dissertationsbezogener Forschungsdaten aus Sicht von Bibliotheken und Forschenden
    Dirk Weisbrod, Ben Kaden, and Michael Kleineberg
    In E-Science-Tage: Forschungsdaten managen, Jul 2017
  2. DNB
    Policy der Deutschen Nationalbibliothek für dissertationsbezogene Forschungsdaten
    Deutsche Nationalbibliothek [DNB]
    Jul 2017

2011

  1. OfDE
    Opportunities of Data Exchange: Report on Integration of Data and Publications
    Susan Reilly, Wouter Schallier, Sabine Schrimpf, and 2 more authors
    Jul 2011