
#### Distributed Data Mining (DDM)

Kirk Borne, George Mason University, LSST-VAO discussion, March 24, 2011

#### The LSST Data Mining Challenges

1. Massive data stream: ~2 terabytes of image data per hour that must be mined in real time (for 10 years).
2. Massive 20-petabyte database: more than 50 billion objects need to be classified, and most will be monitored for important variations in real time.
3. Massive event stream: knowledge extraction in real time for ~100,000 events each night.

• Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3.
• Look at #2 and #3 in more detail ...

#### LSST data mining challenge #2

• Accurately characterize and classify 50 billion objects and 20 trillion source observations.
• Requires VO-accessible multi-wavelength data.
• Szalay's Law: astrophysical discovery potential grows as (number of data sources)².
• Benefits of very large datasets:
  – best statistical analysis of "typical" events
  – automated search for "rare" events

#### LSST data mining challenge #3

• Approximately 100,000 times each night for 10 years, LSST will obtain data on a new sky event, and we will be challenged with classifying these data.
• [Slide sequence: a flux-versus-time light curve is built up point by point; more data points help!]
• Characterize first! then Classify.
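The data rates quoted in challenges #1 and #2 can be sanity-checked with quick arithmetic. This sketch assumes roughly 10 observing hours per night, a figure not stated in the talk:

```python
# Back-of-envelope check of the LSST data rates quoted above.
TB_PER_HOUR = 2        # image stream rate from challenge #1
HOURS_PER_NIGHT = 10   # assumption, not from the talk
NIGHTS_PER_YEAR = 365
YEARS = 10

tb_per_night = TB_PER_HOUR * HOURS_PER_NIGHT          # 20 TB per night
tb_total = tb_per_night * NIGHTS_PER_YEAR * YEARS     # raw image stream over the survey
pb_total = tb_total / 1000                            # ~73 PB of raw images

print(f"{tb_per_night} TB/night, ~{pb_total:.0f} PB over {YEARS} years")
```

The ~20-petabyte database of challenge #2 is much smaller than the raw image total because it holds extracted catalog measurements, not pixels.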
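The "characterize first, then classify" workflow of challenge #3 can be sketched in a few lines. This is a toy illustration, not the LSST pipeline: the feature set (amplitude, median flux) and the burst threshold are invented for the example.

```python
import statistics

def characterize(times, fluxes):
    """Extract simple descriptors from a light curve (characterize first!)."""
    return {
        "amplitude": max(fluxes) - min(fluxes),   # peak-to-trough flux range
        "median": statistics.median(fluxes),      # typical (quiescent) flux level
        "n_points": len(fluxes),                  # more data points help!
    }

def classify(features, burst_threshold=5.0):
    """Toy rule: amplitude much larger than the median flux => transient."""
    if features["amplitude"] > burst_threshold * max(features["median"], 1e-9):
        return "transient candidate"
    return "quiet / periodic candidate"

# A rising burst caught mid-event:
lc_t = [0, 1, 2, 3, 4]
lc_f = [1.0, 1.1, 0.9, 12.0, 25.0]
print(classify(characterize(lc_t, lc_f)))   # transient candidate
```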
#### Characterization Use Case #1

• Feature detection and extraction:
  – Automated pipelines' tasks: Characterize!
    • Identify and describe features in the data
    • Extract feature descriptors from the data
    • Curate these features for scientific re-use
  – Human experts' tasks: Categorize and Classify!
    • Associate features with astrophysical processes
    • Find boundaries between feature sets and label them
  – Example: Star-Galaxy Separation

#### Characterization Use Case #2

• The clustering problem:
  – Finding clusters of objects within a data set
  – Pipeline: apply an optimal algorithm for finding friends-of-friends or nearest neighbors
    • N is > 10¹⁰, so what is the most efficient way to sort?
    • Number of dimensions ~ 1000, so we have an enormous subspace search problem
  – Scientist: determine the significance of the clusters (statistically and scientifically) – categorize!

#### Characterization Use Case #3

• Outlier detection (unknown unknowns):
  – Finding the objects and events that are outside the bounds of our expectations (outside known clusters)
  – These may be real scientific discoveries or garbage
  – Outlier detection is therefore useful for:
    • Novelty discovery – is my Nobel prize waiting?
    • Anomaly detection – is the detector system working?
    • Data quality assurance – is the data pipeline working?
  – How does one optimally find outliers in 10³-D parameter space? Or in interesting subspaces (in lower dimensions)?
  – How do we measure their "interestingness"?

#### Characterization Use Case #4

• The dimension reduction problem:
  – Finding correlations and "fundamental planes" of parameters
  – The number of attributes can be hundreds or thousands: the Curse of High Dimensionality!
  – Are there combinations (linear or non-linear functions) of observational parameters that correlate strongly with one another?
  – Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?
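Use Case #1's star-galaxy separation example amounts to finding a boundary between labeled feature sets. A minimal sketch, with a hypothetical 1-D "concentration" feature (point-like stars concentrated, extended galaxies diffuse) and a brute-force threshold search standing in for the expert's boundary-finding task:

```python
def fit_boundary(concentrations, labels):
    """Pick the 1-D threshold that best separates labeled stars from galaxies."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(concentrations)):       # try each observed value as a cut
        preds = ["star" if c >= t else "galaxy" for c in concentrations]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Expert-labeled training examples (feature values are invented for illustration):
conc   = [0.9, 0.8, 0.75, 0.3, 0.2, 0.15]
labels = ["star", "star", "star", "galaxy", "galaxy", "galaxy"]
t = fit_boundary(conc, labels)
print("decision boundary:", t)   # 0.75 separates the two groups perfectly
```

The pipeline then applies the learned boundary at scale; the expert's role is labeling and validating it.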
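The friends-of-friends clustering of Use Case #2 can be sketched with a union-find pass over pairwise distances. The O(N²) double loop below is exactly what does not scale to N > 10¹⁰, which is why the slide asks about efficient sorting and spatial indexing; the points and linking length are invented for the example.

```python
def friends_of_friends(points, linking_length):
    """Group points into clusters: two points are 'friends' if closer than the
    linking length; clusters are the transitive closure (friends of friends)."""
    n = len(points)
    parent = list(range(n))                      # union-find parent array

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path compression
            i = parent[i]
        return i

    for i in range(n):                           # O(N^2): fine for a sketch only
        for j in range(i + 1, n):
            d = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
            if d <= linking_length:
                parent[find(i)] = find(j)        # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

pts = [(0, 0), (0.4, 0), (0.8, 0), (5, 5), (5.3, 5)]
groups = friends_of_friends(pts, linking_length=0.5)
print(groups)   # two clusters: {0, 1, 2} linked transitively, and {3, 4}
```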
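For the outlier detection of Use Case #3, one simple family of methods scores each object by its distance to its k-th nearest neighbor: points far from every known cluster score highest. A 1-D sketch, with the score doubling as a crude "interestingness" measure (the data are invented):

```python
def outlier_scores(values, k=2):
    """Distance to the k-th nearest neighbour; large = far from all clusters."""
    scores = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(dists[k - 1])
    return scores

vals = [1.0, 1.1, 0.9, 1.05, 42.0]   # one obvious anomaly
scores = outlier_scores(vals)
top = vals[scores.index(max(scores))]
print("most 'interesting' point:", top)   # 42.0
```

Whether the top-scoring object is a discovery, a detector fault, or a pipeline bug is exactly the triage the slide describes.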
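The eigenvector search of Use Case #4 is, in its simplest form, principal component analysis. For two parameters the leading eigenvector of the covariance matrix has a closed form, which keeps this sketch dependency-free; the example data are invented:

```python
import math

def principal_axis(xs, ys):
    """Leading eigenvector of the 2x2 covariance matrix: the direction of the
    'fundamental plane' (here a line) along which two parameters co-vary."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # Closed-form angle of the first principal component for a 2x2 covariance.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)

# Two "observational parameters" that are really one underlying quantity:
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]     # ~ 2x, plus noise
vx, vy = principal_axis(xs, ys)
print(f"principal direction: ({vx:.2f}, {vy:.2f})")   # close to (1, 2)/sqrt(5)
```

In a real catalog with hundreds of attributes the same idea runs over the full covariance matrix, and the small-eigenvalue directions are the redundant dimensions to discard.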
#### The LSST Data Mining Challenges: What's the common theme?

• Need multi-wavelength data in all use cases!
• VO-accessible ancillary information is essential.
• Requirements for success:
  – Discovery of distributed data sources
  – Access to distributed data sources
  – Applying characterization and clustering (data mining) algorithms on distributed data: unsupervised and supervised machine learning

#### Data Bottleneck

• Mismatch:
  – Data volumes increase ~1000x in 10 years.
  – I/O bandwidth improves ~3x in 10 years.
• Therefore ... Distributed Data Mining.

#### Distributed Data Mining (DDM)

• DDM comes in two types:
  1. Mining of Distributed Data (MDD)
  2. Distributed Mining of Data (DMD)
• Type 1 takes many forms, with data being centralized (in whole or in partitions).
• Type 2 requires sophisticated algorithms that operate with data in situ: ship the code to the data.
• The computations are done on the data locally, with partial results shipped around to the different data nodes, and the DDM algorithm iterates until a solution is converged upon.
• This can be pipeline-initiated or scientist end-user-initiated.
• References: http://www.cs.umbc.edu/~hillol/DDMBIB/
• Ultimate goal: Knowledge Discovery through Data Discovery
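The Data Bottleneck slide's conclusion follows from one ratio. Taking the two growth figures at face value:

```python
# The mismatch on the "Data Bottleneck" slide, in one line each:
data_growth = 1000   # data volume growth over 10 years
io_growth = 3        # I/O bandwidth growth over the same period

# If moving a dataset takes time T today, in 10 years it takes ~T * 1000/3:
slowdown = data_growth / io_growth
print(f"moving the data becomes ~{slowdown:.0f}x more expensive, relatively")
```

A ~333x relative penalty on data movement is what makes shipping the code to the data, rather than the data to the code, the only viable option.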
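The "ship the code to the data" pattern described above can be sketched as a distributed k-means: each node runs the same local routine on its own partition and ships back only partial results (sums and counts), which the coordinator merges before the next iteration. This is a minimal single-process simulation of the idea, not any particular DDM framework:

```python
def local_stats(partition, centroids):
    """Runs *at* a data node: assign local points to the nearest centroid,
    return only partial results (sums, counts), never the raw data."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for x in partition:
        k = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        sums[k] += x
        counts[k] += 1
    return sums, counts

def distributed_kmeans(nodes, centroids, iterations=10):
    """Coordinator: ship the code to each node, merge the partial results,
    update the centroids, and iterate toward convergence."""
    for _ in range(iterations):
        totals = [0.0] * len(centroids)
        counts = [0] * len(centroids)
        for partition in nodes:                  # each call "runs remotely"
            s, c = local_stats(partition, centroids)
            totals = [t + si for t, si in zip(totals, s)]
            counts = [n + ci for n, ci in zip(counts, c)]
        centroids = [t / n if n else centroids[i]
                     for i, (t, n) in enumerate(zip(totals, counts))]
    return centroids

# Two data nodes, each holding part of two well-separated clusters:
node_a = [1.0, 1.2, 0.8, 10.1]
node_b = [0.9, 9.8, 10.0, 10.2]
centroids_final = distributed_kmeans([node_a, node_b], centroids=[0.0, 5.0])
print(centroids_final)   # converges to the two cluster means
```

Only centroid updates cross the network, so the traffic per iteration is proportional to the number of clusters, not the number of objects, which is the whole point when N is ~10¹⁰.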