The two dominant paradigms for scientific discovery have historically been theory and experiment, with large-scale simulation emerging as the third paradigm in the 20th century. Over the past decade, a new paradigm for scientific discovery has been emerging, driven by the availability of exponentially increasing volumes of data from large instruments such as telescopes, colliders, and light sources, as well as by the proliferation of sensors and high-throughput analysis devices. These trends are popularly referred to as "Big Data", and they have given rise to a "fourth paradigm" of scientific discovery in which new knowledge and actionable insights are extracted through the analysis of massive datasets. The complexity and challenge of the fourth paradigm arise from the increasing velocity, heterogeneity, and volume of data generation.
Analysis of large volumes of complex data to derive knowledge requires "data-driven" computing, in which the computation is driven by data mining, queries, statistical analysis, and hypothesis formulation and validation. At the same time, large-scale simulations expected to run on future exascale ("Big Compute") systems bring their own data-intensive computing challenges. The requirements of Big Data and Big Compute are tightly intertwined, since both contribute to the shared goal of scientific discovery. For example, data-intensive simulations on exascale Big Compute systems will generate volumes of Big Data comparable to those produced by many scientific instruments. Likewise, the Big Data generated by the data-driven paradigm will need to be analyzed on exascale or extreme-scale Big Compute systems. As we head toward the exascale timeframe, it will be critical to exploit synergies between the two approaches, even though some fundamental differences between them may remain.
This talk will discuss some of these synergies in the context of challenges in data-intensive science and exascale computing. The material for this talk is drawn from a recent (March 2013) study led by the speaker on "Synergistic Challenges in Data-Intensive Science and Exascale Computing" for the US Department of Energy's Office of Science. Background material was drawn from an earlier (September 2009) DARPA Exascale Software Study, also led by the speaker. We would like to acknowledge the contributions of all participants in both studies.
BIO: Vivek Sarkar conducts research in multiple aspects of parallel software, including programming languages, program analysis, compiler optimizations, and runtimes for parallel and high-performance computer systems. He currently leads the Habanero Multicore Software Research project at Rice University and serves as Associate Director of the Center for Domain-Specific Computing, an NSF Expeditions project. Prior to joining Rice in July 2007, Vivek was Senior Manager of Programming Technologies at IBM Research. His responsibilities at IBM included leading IBM's research efforts in programming models, tools, and productivity in the PERCS project during 2002-2007 as part of the DARPA High Productivity Computing Systems program. His past projects include the X10 programming language, the Jikes Research Virtual Machine for the Java language, the MIT RAW multicore project, the ASTI optimizer used in IBM's XL Fortran product compilers, the PTRAN automatic parallelization system, and profile-directed partitioning and scheduling of Sisal programs. Vivek holds a B.Tech. degree from the Indian Institute of Technology, Kanpur, an M.S. degree from the University of Wisconsin-Madison, and a Ph.D. from Stanford University. He became a member of the IBM Academy of Technology in 1995, was named to the E.D. Butcher Chair in Engineering at Rice University in 2007, and was inducted as an ACM Fellow in 2008. Vivek has served as a member of the US Department of Energy's Advanced Scientific Computing Advisory Committee (ASCAC) since 2009.
REFERENCES:
http://science.energy.gov/~/media/ascr/ascac/pdf/reports/2013/ASCAC_Data_Intensive_Computing_report_final.pdf
http://www.cs.rice.edu/~vsarkar