Sep 10, 2019 to Sep 11, 2019
(Europe/Berlin / UTC+2:00)


Prague, CZ

Contact Name


Sandro Fiore, CMCC



Meaningful advances in science and engineering are increasingly predicated on data-driven decision making. For these decisions to be valid, it is essential that one not only record the process by which results were produced, but also be able to reproduce the data involved at every step of that process. While we are all used to tracking source code revisions and keeping track of program inputs and outputs, the increased complexity of end-to-end computing pipelines, coupled with new big-data and machine learning algorithms, makes it significantly harder to track all of the steps and associated data that went into producing a result: for example, exactly what data was used for the training set versus the evaluation set, what cleaning was done, what analysis was performed on the results to evaluate performance, and what additional experiments were run. With the ever-increasing number, size, and complexity of the data sets used in data-intensive applications, reproducing results from these types of investigations becomes increasingly difficult. While no one deliberately sets out to create unreproducible results, recent surveys of the literature show that the ability to reproduce data-intensive results is the exception rather than the rule.

For these reasons, a symposium on issues, tools and infrastructure for data intensive applications is highly germane to the ParCo community.

In this symposium, we propose to review the current state of the art in reproducibility for data-intensive computing applications.

We will cover three primary topic areas:

  • Reproducibility challenges that are specific to science and engineering activities that have data-intensive computing as a core aspect of the process.
  • Infrastructure, tools, and methods that are currently available for reproducible data-intensive applications, and the gaps and challenges that need to be addressed.
  • How to increase the adoption of methods for reproducible data-intensive applications across the research community.