DataClustering Handout March 2017

The challenges of extreme-scale computing systems span multiple dimensions, including architecture, energy constraints, memory scaling, limited I/O, and the scalability of software and applications. These constraints, together with the need for faster scientific discovery, create a demand for scalable, in-situ analysis. The larger the simulations run on extreme-scale systems, the greater the need to analyze data and derive insights effectively and quickly, within the constraints of limited storage space, deeper and more complex memory hierarchies, and the need to minimize data movement because of energy and I/O limits. The traditional model of storing raw and/or derived data and analyzing it later will become cost-prohibitive in the exascale computing realm. Furthermore, continuously involving a human in the loop to analyze data will become less effective because of the sheer size and complexity of the data. For in-situ analysis, designing analytics algorithms and software by simply extending assumptions from the off-line model may not work; analysis algorithms, runtimes, and software therefore need to be rethought and redesigned. To keep pace with the ever-increasing computational parallelism demanded by large-scale simulations, analysis algorithms must be customizable to the needs of the simulation and the data it produces.
The objective of this proposal is to address challenges in the design and development of scalable in-situ analytics algorithms and software based on “Scalable Thinking”. The proposed research and development includes scalable algorithms and software for spatio-temporal data clustering, anomaly detection, and learning data distributions, all designed for in-situ implementation and execution. These tasks are important for large-scale analysis and have wide applicability. Our design approach is driven by rethinking and reformulating algorithms within the constraints that in-situ analysis imposes at multiple levels. This is particularly important for data analytics in extreme-scale computing environments, because most existing techniques were developed, validated, and optimized on small data sets and may be neither scalable nor suitable for in-situ analytics on emerging extreme-scale systems. Another key component of our approach is co-design: developing scalable algorithms and software that take full advantage of new architectures rather than simply scaling existing techniques.