Data Publishing and Archiving Routes

Aim: This presentation gives an overview of different ways of sharing and archiving data, to help researchers choose a data repository that suits their needs and resources.

Approach: Based on the information available in reports and recommendations of various research infrastructure projects, funder requirements, and published literature about data repositories, different ways of data sharing and archiving are identified, described and compared.

Results: Data archiving and publishing are an integral part of proper data management. In their Data Management Plans, researchers should consider, among other things, where and how they will store their data after the project ends and how they will make the data available for reuse. Funders sometimes encourage them to make their data FAIR (1): findable, accessible, interoperable and reusable. This means that the usability of the data, not just their storage in a repository, is an important issue (2). It can be argued that data that are properly documented, reviewed for quality, searchable in catalogues, and citable in articles can be “counted” in evaluations like a normal scientific publication (3). This may not be the case if the data are published on a researcher’s website or in a similar way, without the proper metadata and documentation that matter most for usability and without any guarantee that the data will remain available far into the future. Publishing and archiving data are complementary services, and both require that data are equipped with metadata and documentation. Archiving stores and preserves data in readable form for the long term, while publishing focuses more on data dissemination and findability. Several routes for publishing and archiving data are available today, and they can be roughly categorised as follows: journal supplementary material service, institutional data repository, general-purpose repository, and domain-specific repository (4). These service providers differ in the level and scope of the services they offer, and they may or may not be appropriate for the intended and permitted future use of a specific dataset. All of them will make the data available for reuse and will put some effort into data dissemination and promotion activities, but some will only allow a basic set of metadata and documentation and may not comply with the funder’s or journal publisher’s requirements.
Some are more likely to accept only complete, selected and reviewed datasets in a specific research domain, while others accept a wide range of data types from different domains and will also accept only the part of a dataset on which a publication was based. Some may offer expert advice and support in data management and publishing, while others provide a self-archiving service with minimal support for data providers and data users. In addition, there may be charges for publishing, and some publishers may keep the data behind a subscription wall or claim copyright over the data. Any route is likely to be sufficient for low-complexity data, for data that are not personal or sensitive, or when, for some reason, long-term availability of the data is not so important. If a dataset is worth archiving and preserving because it cannot be created again (e.g. social sciences data, meteorological data), or if the data are personal and sensitive in nature, researchers are strongly advised to archive and publish the data in one of the trustworthy data repositories. These are usually domain-specific and strive to offer a wide range of services to achieve the full potential of the FAIR principles. They cultivate specialist domain knowledge in data management and ensure that datasets are equipped with all relevant metadata, documentation, additional information and tools needed to understand the data; that ethical and legal issues are clarified (confidentiality, protection, licensing); that file formats are appropriate for long-term preservation; that access conditions are applied to assure confidentiality and data protection; that persistent identifiers are used for identification and citation purposes; and more. Trustworthiness is achieved through compliance with certification requirements, which cover organisational, technical, financial, legal, and other aspects of running a data repository.
The CoreTrustSeal (5) consists of 16 requirements that describe the characteristics of trustworthy repositories, and it offers a core level of certification for data repositories. Empirically comparing and evaluating different data repositories is not a simple task because criteria for such evaluations have yet to be developed. In addition, it has to be kept in mind that research infrastructures that facilitate scientific communication, by supporting the sharing, discovery and curation of data and publications produced by researchers, are developing rapidly (6). Comparisons between different fields of science are hard, if not impossible, because of substantial differences in the types of data and the ways of (re)using data in those fields. One possible approach is first to identify important characteristics of data publishing and then to assess individual repositories on these aspects (7). Another approach is to examine the perceptions and satisfaction of users who reuse data available in repositories (8).

Conclusion: No single solution for data publishing fits the full diversity of researchers’ needs, data types and data sizes. Not all data service providers invest the same amount of effort in making the data fully understandable and preparing them for long-term preservation. Research databases that contain personal and sensitive data, as well as datasets that cannot be created again, should be published and archived in a trustworthy domain-specific data repository to ensure their usability and longevity.

Location:
Date: September 21, 2018
Time: 11:55 - 12:10
Marijana Glavica, University of Zagreb
Irena Kranjec, University of Zagreb