Reproducibility Issues in Publications With Big Data


Research based on Big Data has become a dominant feature of many scientific fields, such as high-throughput genomics, proteomics, medicine, biotechnology, agronomy and many others. In recent years it has been recognised that publications based on Big Data have very specific issues which are not fully understood and which result in conflicting inferences on causality, leading to general mistrust of published results. The aim of this work is to bring into focus the issues which publishers of scientific journals face when Big Data results are presented. Emphasized are key issues related to source data availability, reliability of algorithms and software, model validation, and data adjustment methodologies that aid cognitive inference.


The main concern in Big Data analysis is the extraction of key features and the adjustment for confounders, both needed for unsupervised pattern recognition and/or supervised learning of functional relations and cognitive inference. Described are the main methodologies, together with a critical review of algorithms for data regularization, such as principal component analysis and regression (PCA, PCR, PLS), penalized regression (LASSO, elastic net), decision trees (DT), and artificial neural networks (ANN) (1-4). Since most Big Data research is based on large-scale observational studies, which are unbalanced and lead to significant biases, the positive and negative aspects of propensity score data adjustments, used to mimic the properties of randomized trials, are presented. Application of data bootstrapping is emphasized as the key methodology needed for model validation and cognitive inference from Big Data studies (5-6).

Results and Discussion

Science progresses by corroboration through publications in scientific journals, books and electronic data exchange. Today's world of Big Data is the result of the integration of high-throughput data, large databases and extensive networking, with 5G technologies upcoming. Validation of data and inference of cognition become the critical aspects for publishers of science journals. Although reproducibility and availability of data have been the gold standard for valuable research, the present volume of data and the complexity of algorithmic analysis and inference confront scientific journals with new and difficult issues. It is believed that only about 40 % of recently published science results can be reproduced (7). Reproducing experimental data and algorithmic analyses, which are often not statistically sound, has become problematic for several reasons. Some of the main problems specific to Big Data studies are the difficulty and lack of collaborative reproduction of experimental data (parallel independent laboratory work), incomplete or faulty design of experiments (DOE), unvalidated software tools, unawareness of data confounding, and blind use of machine learning algorithms. Science journal editors should insist on a detailed exposition of the DOE model and on availability of source data and applied software. Applied algorithmic inference should be explained in detail starting from the basics of probability theory and statistics, and then carried out with data science software tools.

The key step in the analysis of high-dimensional big data sets is the identification of key features, i.e. regularization of the columns of the data matrix. Commonly applied methods are based on eigenvalue decomposition, leading to principal component analysis (PCA), principal component regression (PCR), and partial least squares (PLS). These methods are very effective but do not account for interactions between key features, i.e. synergism and antagonism between variables are not made explicit.
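
The eigenvalue decomposition behind PCA can be illustrated with a minimal sketch. It is not the method of any particular study: the data are synthetic, the function names (center_columns, covariance_matrix, first_principal_component) are hypothetical, and power iteration is used here only because it needs nothing beyond the Python standard library. In practice a validated numerical library would be the more reliable choice.

```python
import math
import random


def center_columns(rows):
    """Subtract the column mean from every entry of the data matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def covariance_matrix(centered):
    """Sample covariance matrix (p x p) of an already centered data matrix."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_principal_component(cov, iterations=500, seed=0):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    rng = random.Random(seed)
    p = len(cov)
    v = [rng.random() + 0.1 for _ in range(p)]  # arbitrary nonzero start
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v


# Synthetic data (assumption): column 2 is roughly twice column 1,
# column 3 is small independent noise.
rng = random.Random(1)
rows = []
for _ in range(200):
    t = rng.gauss(0.0, 1.0)
    rows.append([t, 2.0 * t + rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)])

cov = covariance_matrix(center_columns(rows))
eigenvalue, loadings = first_principal_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print("share of variance on first component:", round(explained, 3))
print("loadings:", [round(x, 3) for x in loadings])
```

The first component captures the shared linear trend of the two correlated columns. A product term between variables would not appear in such loadings, which is the limitation noted above.
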
In order to avoid model bias and misconception, systematic tests of presumed nonlinearities are needed. Regularization that includes a search for feature interactions is provided by decision trees (forests) and artificial neural networks (convolution layer). The critical point in big data pattern classification, for example when applied to medical diagnosis, is model evaluation by data bootstrapping. Accuracy of the models should be evaluated graphically by ROC curves and numerically by AUC values. The most difficult issues are related to data adjustments for causality deduction. Fully randomized trials, including Mendelian data, which are mainly reported for drug trials and pharmacokinetic studies, fulfil the requirements for elimination of feature confounding effects. However, most big data analyses are founded on observational studies, are profoundly exposed to confounding, and are not reliable for cognitive inference. Before accepting results for publication, science editors should recommend that researchers test their causal inferences by propensity score matching. Since Big Data models are inherently highly complex, thorough multiple cross-validation by bootstrapping, if possible on data from independent sources, should be required before acceptance for publication.
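
Bootstrap evaluation of a classifier by AUC can be sketched as follows. This is a minimal illustration under stated assumptions: the labels and scores are synthetic, the function names (auc, bootstrap_auc) are hypothetical, and only the evaluation of fixed scores is resampled. A fuller validation would also refit the model on each resample.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one.

    Ties count as one half (the Mann-Whitney formulation of the ROC area).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute AUC")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Point estimate of AUC with a percentile bootstrap interval."""
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            estimates.append(auc(ys, [scores[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return auc(labels, scores), lower, upper


# Synthetic diagnostic scores (assumption): cases score higher on average.
rng = random.Random(7)
labels = [i % 2 for i in range(120)]
scores = [rng.gauss(1.0 if y == 1 else 0.0, 1.0) for y in labels]

point, lower, upper = bootstrap_auc(labels, scores)
print("AUC estimate:", round(point, 3))
print("95 % bootstrap interval:", round(lower, 3), "to", round(upper, 3))
```

Reporting the interval alongside the point estimate shows how much the AUC depends on the particular sample, which is the information a reviewer needs to judge the stability of a reported classifier.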


In view of the critical importance of confidence in the reproducibility and reliability of research publications based on Big Data projects, a checklist for validation of analysis and inference of cognitive relations can be proposed. The main concern is the initial assumptions and the modelling hypotheses incorporated into the research DOE plan. Collaborative multi-laboratory parallel data acquisition is essential, especially for fundamental projects in molecular medicine and pharmacology. Validation of the algorithmic methodologies applied for analysis should be based on feature space regularization followed by extensive bootstrap validation. Since most Big Data research projects are observational studies, feature confounding is present and the data must be adjusted by matching with one of the propensity score methods (often a regularized logistic model).
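
Propensity score matching with a regularized logistic model can be sketched as below. Everything in the example is an assumption made for illustration: the data are synthetic with a single confounder and a true treatment effect of 2, the function names are hypothetical, the ridge penalty and the caliper of 0.2 standard deviations of the logit are arbitrary choices, and matching is 1:1 nearest neighbour with replacement.

```python
import math
import random
import statistics


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit_scores(X, w):
    """Linear predictor (logit of the propensity score) for every row."""
    return [w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)) for xi in X]


def fit_logistic_ridge(X, t, l2=0.01, lr=0.5, steps=500):
    """Logistic regression with a ridge penalty, fitted by gradient descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)  # w[0] is the intercept and is not penalized
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for xi, zi, ti in zip(X, logit_scores(X, w), t):
            err = sigmoid(zi) - ti
            grad[0] += err
            for j in range(p):
                grad[j + 1] += err * xi[j]
        for j in range(p + 1):
            penalty = l2 * w[j] if j > 0 else 0.0
            w[j] -= lr * (grad[j] / n + penalty)
    return w


def matched_effect(X, t, y, caliper_sd=0.2):
    """Mean outcome difference over treated units matched to controls."""
    z = logit_scores(X, fit_logistic_ridge(X, t))
    caliper = caliper_sd * statistics.pstdev(z)
    controls = [i for i, ti in enumerate(t) if ti == 0]
    diffs = []
    for i, ti in enumerate(t):
        if ti != 1:
            continue
        j = min(controls, key=lambda c: abs(z[c] - z[i]))  # nearest control
        if abs(z[j] - z[i]) <= caliper:  # discard poor matches
            diffs.append(y[i] - y[j])
    return sum(diffs) / len(diffs), len(diffs)


# Synthetic observational data (assumption): x drives both treatment and outcome.
rng = random.Random(42)
X, t, y = [], [], []
for _ in range(400):
    x = rng.gauss(0.0, 1.0)
    ti = 1 if rng.random() < sigmoid(1.5 * x) else 0
    X.append([x])
    t.append(ti)
    y.append(2.0 * ti + 3.0 * x + rng.gauss(0.0, 0.5))

treated = [yi for yi, ti in zip(y, t) if ti == 1]
control = [yi for yi, ti in zip(y, t) if ti == 0]
naive = statistics.mean(treated) - statistics.mean(control)
matched, n_pairs = matched_effect(X, t, y)
print("naive difference:", round(naive, 2))
print("matched estimate:", round(matched, 2), "from", n_pairs, "pairs")
```

In this constructed case the unadjusted difference absorbs the effect of the confounder, while the matched estimate should lie much closer to the simulated effect. Real studies have unmeasured confounders that matching cannot remove, so covariate balance after matching should still be checked and reported.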

In conclusion, results should ideally be compared across multiple (more than two) independent institutions (laboratories) which have different and unrelated key sources of potential bias and confounding. Source data and software for algorithmic inference should be made available under an open data policy.

  • Želimir Kurtanjek, University of Zagreb

    Želimir Kurtanjek is a retired professor of chemical engineering with an interest in biotechnology, biostatistics and big data analytics…