Reproducibility Issues in Publications With Big Data

Aim

Research based on Big Data has become a dominant feature of many scientific fields, including high-throughput genomics, proteomics, medicine, biotechnology, and agronomy. In recent years it has been recognised that publications based on Big Data have specific issues that are not fully understood and that result in conflicting inferences on causality, leading to general mistrust of published results. The aim of this work is to bring into focus the issues that publishers of scientific journals face when Big Data results are presented. Emphasis is placed on key issues related to source data availability, algorithmic and software reliability, model validation, and data adjustment methodologies that aid cognitive inference.

Methods

The main concern in Big Data analysis is the extraction of key features and the adjustment for confounders, both of which are needed for unsupervised pattern recognition and/or supervised learning of functional relations and cognitive inference. The main methodologies are described, together with a critical review of algorithms for data regularization such as principal component analysis (PCA), principal component regression (PCR), partial least squares (PLS), elastic nets and LASSO, decision trees (DT), and artificial neural networks (ANN) (1-4). Since most Big Data research is based on large-scale observational studies, which are unbalanced and lead to significant biases, the positive and negative aspects of propensity score data adjustments, which aim to mimic the properties of randomized trials, are presented. Data bootstrapping is emphasized as the key methodology needed for model validation and cognitive inference from Big Data studies (5-6).
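As an illustration of the bootstrapping idea mentioned above, the following minimal sketch computes a percentile bootstrap confidence interval for a sample mean. The function name `bootstrap_ci`, the sample values, and the choice of 2000 resamples are assumptions made for this example only; they are not taken from the study.

```python
import random
import statistics


def bootstrap_ci(values, statistic=statistics.mean, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic` of `values`."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the original sample.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements, used only to exercise the function.
data = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 3.0, 2.2, 2.7, 2.6]
point = statistics.mean(data)
low, high = bootstrap_ci(data)
print(f"mean = {point:.2f}, 95% bootstrap interval = ({low:.2f}, {high:.2f})")
```

The same resampling loop can wrap any statistic, for example a model accuracy measure, by passing a different `statistic` callable.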

Results and Discussion

Science progresses by corroboration through publications in scientific journals, books, and electronic data exchange. Today's world of Big Data is the result of the integration of high-throughput data, large databases, and extensive networking, with 5G technologies on the horizon. Validation of data and of the inferences drawn from them has become a critical concern for publishers of science journals. Although reproducibility and availability of data have long been gold standards for valuable research, the volume of data and the complexity of algorithmic analysis and inference now confront scientific journals with new and difficult issues. It is believed that only about 40 % of recently published scientific results can be reproduced (7). Reproducing experimental data and algorithmic analyses, which are often not statistically sound, has become problematic for several reasons. Some of the main problems specific to Big Data studies are the difficulty or absence of collaborative reproduction of experimental data (parallel independent laboratory work), incomplete or faulty design of experiments (DOE), unvalidated software tools, unawareness of data confounding, and blind use of machine learning algorithms. Science journal editors should insist on a detailed exposition of the DOE model and on the availability of source data and applied software. The applied algorithmic inference should be explained in detail from the basics of probability theory and statistics, and then carried out with data science software tools. The key step in the analysis of high-dimensional Big Data sets is the identification of key features, i.e. regularization of the columns of the data matrix. Commonly applied methods are based on eigenvalue decomposition, leading to principal component analysis (PCA), principal component regression (PCR), and partial least squares (PLS). These methods are very effective but do not account for interactions between key features, i.e. synergism and antagonism between variables are not made explicit.
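The eigenvalue decomposition behind PCA can be sketched in a few lines. The example below is a minimal illustration, assuming a small simulated data matrix with three columns, two of which are strongly correlated; it extracts only the leading principal component by power iteration on the sample covariance matrix. The function name `first_principal_component` and the simulated data are hypothetical and serve only this example.

```python
import math
import random


def first_principal_component(rows, n_iter=200, seed=0):
    """Return (unit direction, explained variance ratio) of the leading PC."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the columns (p x p).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    rng = random.Random(seed)
    v = [rng.random() for _ in range(p)]
    for _ in range(n_iter):
        # Power iteration: multiply by the covariance matrix and renormalize.
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    return v, eigenvalue / total_variance


# Simulated data: column 2 is about twice column 1, column 3 is small noise.
rng = random.Random(1)
rows = []
for _ in range(200):
    t = rng.gauss(0, 1)
    rows.append([t + rng.gauss(0, 0.1), 2 * t + rng.gauss(0, 0.1),
                 rng.gauss(0, 0.1)])

direction, ratio = first_principal_component(rows)
print("leading direction:", [round(x, 3) for x in direction])
print("explained variance ratio:", round(ratio, 3))
```

In this sketch a single component captures most of the variance because the first two columns are nearly collinear; the component is a linear combination of columns and therefore carries no information about interactions between them.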
To avoid model bias and misconception, systematic tests of presumed nonlinearities are needed. Regularization that also searches for feature interactions is provided by decision trees (forests) and artificial neural networks (convolution layers). A critical point in Big Data pattern classification, for example when applied to medical diagnosis, is model evaluation by data bootstrapping. The accuracy of the models should be evaluated graphically by ROC curves and numerically by AUC values. The most difficult issues are related to data adjustments for deducing causality. Fully randomized trials, including Mendelian data, which are mainly reported for drug trials and pharmacokinetic studies, fulfil the requirements for eliminating feature confounding effects. However, most Big Data analyses are founded on observational studies, are profoundly exposed to confounding, and are not reliable for cognitive inference. Before accepting results for publication, science editors should recommend that researchers test their causal inferences by propensity score matching. Since Big Data models are inherently complex, thorough multiple cross-validation by bootstrapping, where possible with data from independent sources, should be required before acceptance for publication.
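The ROC and AUC evaluation recommended above can be sketched as follows. This is a minimal illustration on four hypothetical labelled scores; the function names `roc_auc` and `roc_points` are assumptions of this example. The AUC is computed as the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the decision threshold stepwise."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical classifier output: 1 = diseased, 0 = healthy.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
curve = roc_points(labels, scores)
print("AUC:", auc)
print("ROC points:", curve)
```

Here three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75. In a bootstrap evaluation, the same function would be applied to each resampled data set to obtain an interval rather than a single value.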

Conclusion

In view of critical concerns about the reproducibility and reliability of research publications based on Big Data projects, a checklist for validating the analysis and the inference of cognitive relations can be proposed. The main concern lies with the initial assumptions and the modelling hypotheses incorporated into the research DOE plan. Collaborative, multi-laboratory, parallel data acquisition is essential, especially for fundamental projects in molecular medicine and pharmacology. Validation of the algorithmic methodologies applied in the analysis should be based on feature space regularization followed by extensive bootstrap validation. Since most Big Data research projects are observational studies, feature confounding is present, and the data must be adjusted by matching with one of the propensity score methods (often a regularized logistic model).
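Propensity score matching with a regularized logistic model can be sketched on simulated data. The example below is an illustration under stated assumptions, not the procedure used in any particular study: one confounder drives both treatment assignment and outcome, the true treatment effect is set to 2.0, the propensity model is an L2-regularized logistic regression fitted by gradient descent, and each treated unit is matched to its nearest control (with replacement). All names and parameter values are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, t, l2=0.01, lr=0.5, n_iter=500):
    """L2-regularized logistic regression fitted by batch gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, ti in zip(X, t):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - ti
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
    return w, b


# Simulated observational study: confounder x affects treatment and outcome.
rng = random.Random(42)
TRUE_EFFECT = 2.0
X, treated, outcome = [], [], []
for _ in range(600):
    x = rng.gauss(0, 1)
    t = 1 if rng.random() < sigmoid(1.5 * x) else 0
    y = TRUE_EFFECT * t + 3.0 * x + rng.gauss(0, 0.5)
    X.append([x])
    treated.append(t)
    outcome.append(y)

# Propensity score: estimated probability of treatment given the confounder.
w, b = fit_logistic(X, treated)
scores = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

treated_idx = [i for i, t in enumerate(treated) if t == 1]
control_idx = [i for i, t in enumerate(treated) if t == 0]

# Naive comparison ignores confounding.
naive_effect = (sum(outcome[i] for i in treated_idx) / len(treated_idx)
                - sum(outcome[i] for i in control_idx) / len(control_idx))

# Nearest-neighbour matching on the propensity score, with replacement.
diffs = []
for i in treated_idx:
    j = min(control_idx, key=lambda k: abs(scores[k] - scores[i]))
    diffs.append(outcome[i] - outcome[j])
matched_effect = sum(diffs) / len(diffs)

print(f"naive estimate:   {naive_effect:.2f}")
print(f"matched estimate: {matched_effect:.2f} (simulated true effect {TRUE_EFFECT})")
```

In this simulation the naive difference of group means overstates the effect because treated units tend to have higher values of the confounder, while the matched estimate lies closer to the simulated value. Matching only adjusts for confounders that are measured and included in the propensity model.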

In conclusion, results should ideally be compared across multiple (more than two) independent institutions or laboratories that have different and unrelated key sources of potential bias and confounding. Source data and the software used for algorithmic inference should be made available under an open data policy.
