Spatiotemporal Stream Mining Using TRACDS, Middle East

March 20, 2018 | Author: Anonymous | Category: N/A

Spatiotemporal Stream Mining using TRACDS
Middle East Technical University, October 31, 2012
Margaret H. Dunham, Michael Hahsler, Yu Su, Sudheer Chelluboina, and Hadil Shaiba
Computer Science and Engineering
This work is supported by NSF IIS-0948893.

10/31/2012, METU

IDA@SMU: Intelligent Data Analysis Lab
Team led by Margaret H. Dunham and Michael Hahsler

Mission: At IDA@SMU we create novel techniques inspired by knowledge discovery, data mining, machine learning, artificial intelligence and statistical analysis to work with data from various sources.

Current Focus
• Massive data stream modeling: TRACDS™
• Hurricane intensity prediction
• Effective metagenomic classification for the Human Genome Project
• Recommender systems: R/Apache Mahout

http://www.lyle.smu.edu/IDA

Outline

• Spatiotemporal Stream Data
• TRACDS
• Hurricane Intensity Prediction
• PIIH
• PIIH online

From Sensors to Streams
• Data captured and sent by a set of sensors is usually referred to as “stream data”.
• A stream is a real-time sequence of encoded signals that contain the desired information. It is a continuous, ordered sequence of items (ordered implicitly by arrival time, or explicitly by timestamp or geographic coordinates).
• It may be viewed as arriving in discrete time intervals.
• Stream data is infinite: the data keeps coming.
• Examples: weather data, network data (VoIP), traffic data.

Stream Data Format
• Events arrive in a stream.
• At any time t, we can view the state of the problem as represented by a vector of n numeric values: Vt = <S1t, S2t, …, Snt>

        V1    V2    …    Vq
  S1    S11   S12   …    S1q
  S2    S21   S22   …    S2q
  …     …     …     …    …
  Sn    Sn1   Sn2   …    Snq
        Time →

Modeling Stream Data
– Summarization (synopsis) of data
– Temporal and spatial
– Dynamic
– Continuous (infinite stream)
– Concept drift
  • Learn
  • Forget
– Sublinear growth rate: clustering

MM
A first-order Markov chain is a finite or countably infinite sequence of events {E1, E2, …} over discrete time points, where Pij = P(Ej | Ei), and at any time the future behavior of the process depends solely on the current state.

A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:
• S = {N1, N2, …, Nm}, and
• A = {Lij | i ∈ 1, 2, …, m, j ∈ 1, 2, …, m}, and each arc Lij = <Ni, Nj> is labeled with a transition probability Pij = P(Nj | Ni).
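The transition probabilities Pij = P(Nj | Ni) in the definition above can be estimated by counting observed transitions. The following is a minimal sketch, not code from the presentation; the function name and the example state sequence are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_transitions(states):
    """Estimate first-order transition probabilities P(Nj | Ni)
    from an observed sequence of state labels."""
    counts = defaultdict(Counter)
    # Count every consecutive pair (current state, next state).
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    # Normalize each row of counts so it sums to 1.
    return {
        src: {dst: c / sum(row.values()) for dst, c in row.items()}
        for src, row in counts.items()
    }

# Hypothetical state sequence, for illustration only.
P = estimate_transitions(["N1", "N2", "N1", "N3", "N1", "N2"])
```

Here N1 is followed by N2 twice and by N3 once, so the estimate for the arc from N1 to N2 is 2/3.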

Problem with Markov Chains
• The required structure of the MC may not be certain at model construction time.
• As the real world being modeled by the MC changes, so should the structure of the MC.
• Not scalable: grows linearly with the number of events.
• Our solution:
– Extensible Markov Model (EMM)
– Cluster real world events
– Allow Markov chain to grow and shrink dynamically

EMM (Extensible Markov Model)
• Time-varying discrete first-order Markov model
• Continuously evolves
• Nodes are clusters of real world states.
• Learning continues during application phase.
• Learning:
– Transition probabilities between nodes
– Node labels (centroid of cluster)
– Nodes are added and removed as data arrives
• Applications:
– Anomaly/rare event detection
– Prediction
– Classification

EMM Definition
Extensible Markov Model (EMM): at any time t, an EMM consists of an MC with a designated current node and algorithms to modify it, where the algorithms include:
• EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.
• EMMIncrement, which updates the MC at time t + 1 given the MC at time t and the clustering result at time t + 1.
• EMMDecrement, which removes nodes from the EMM when needed.

EMM Cluster
• Nearest neighbor (or any clustering technique)
• If no existing node is “close”, create a new node
• A cluster is labeled by the centroid of its members or by a Clustering Feature
• O(n), where n is the number of states
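The clustering and increment steps can be sketched together as a toy class. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class name, the Euclidean distance, the fixed `threshold` for "close", and the running-mean centroid update are all choices made here for the example.

```python
import math

class ToyEMM:
    """Toy EMM: nearest-centroid clustering plus transition counts."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed cutoff for "close"
        self.centroids = []         # node label = centroid of its members
        self.sizes = []             # number of points absorbed by each node
        self.counts = {}            # (i, j) -> observed i -> j transitions
        self.current = None         # index of the designated current node

    def _cluster(self, x):
        """Nearest node within the threshold, else a new node. O(n) in states."""
        best, best_d = None, float("inf")
        for i, c in enumerate(self.centroids):
            d = math.dist(x, c)
            if d < best_d:
                best, best_d = i, d
        if best is None or best_d > self.threshold:
            self.centroids.append(list(x))
            self.sizes.append(0)
            best = len(self.centroids) - 1
        return best

    def update(self, x):
        """Absorb x: update the matched centroid and the transition count."""
        j = self._cluster(x)
        n = self.sizes[j]
        self.centroids[j] = [(c * n + v) / (n + 1)
                             for c, v in zip(self.centroids[j], x)]
        self.sizes[j] = n + 1
        if self.current is not None:
            key = (self.current, j)
            self.counts[key] = self.counts.get(key, 0) + 1
        self.current = j
        return j

    def transition_probability(self, i, j):
        """Count of i -> j divided by all transitions leaving i."""
        out = sum(c for (a, _), c in self.counts.items() if a == i)
        return self.counts.get((i, j), 0) / out if out else 0.0

# Hypothetical 2-D stream: two well-separated groups of points.
emm = ToyEMM(threshold=1.0)
path = [emm.update(p) for p in [(0, 0), (0.2, 0.1), (5, 5), (0.1, 0.0), (5.1, 4.9)]]
```

The five points create only two nodes, which illustrates why the number of states can grow more slowly than the number of events.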

EMM Sublinear Growth

Servent Data

Growth Rate Automobile Traffic

Minnesota Traffic Data

EMM Learning

[Figure: EMM with nodes N1, N2, N3 and their transition probabilities]

EMM Forgetting

[Figure: EMM with nodes N1, N2, N3, N5, N6 and their transition probabilities]
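One simple way to picture forgetting is to delete a node together with every arc that touches it and then recompute probabilities from the remaining counts. The sketch below is only that simple picture: the function names and the counts are hypothetical, and the renormalization step is an assumption, since the slides do not say how EMMDecrement adjusts the remaining probabilities.

```python
def remove_node(counts, node):
    """Drop a node and every arc touching it from a transition-count table.

    counts maps (source, target) -> number of observed transitions.
    """
    return {(a, b): c for (a, b), c in counts.items() if node not in (a, b)}

def transition_probabilities(counts):
    """Turn the remaining counts into per-source probabilities (assumed rule)."""
    totals = {}
    for (a, _), c in counts.items():
        totals[a] = totals.get(a, 0) + c
    return {(a, b): c / totals[a] for (a, b), c in counts.items()}

# Hypothetical counts, for illustration only.
counts = {("N1", "N2"): 1, ("N1", "N3"): 2, ("N2", "N3"): 1, ("N3", "N1"): 3}
pruned = remove_node(counts, "N2")
probs = transition_probabilities(pruned)
```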

Outline
• Spatiotemporal Stream Data
• TRACDS
• Hurricane Intensity Prediction
• PIIH
• PIIH online

Traditional Stream Clustering

Standard Data Stream Clustering ignores temporal aspect of data


Stream Clustering
• Clusters change over time: they move
• Some techniques use micro-clusters and reclustering
• Reclustering is often done offline (in batch, while stream data keeps arriving)
• STREAM
– Partitions stream data into segments
– Clusters each segment (k-medians)
– Iteratively reclusters the centers of these clusters

S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. “Clustering data streams: Theory and practice.” IEEE Transactions on Knowledge and Data Engineering, 15(3):515-528, 2003.
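The segment-then-recluster idea behind STREAM can be sketched with a tiny one-dimensional k-medians. This is a simplified illustration, not the algorithm from the cited paper: it ignores the weighting of centers by cluster size and the approximation guarantees, and the initialization rule, function names, and data are assumptions.

```python
from statistics import median

def k_medians_1d(points, k, iterations=20):
    """Tiny 1-D k-medians: assign to nearest center, move center to the median."""
    points = sorted(points)
    # Assumed initialization: spread the initial centers across the sorted data.
    centers = [points[(2 * i + 1) * len(points) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        centers = [median(g) if g else c for g, c in zip(groups, centers)]
    return centers

def stream_cluster(stream, segment_size, k):
    """Cluster each segment, then recluster the collected segment centers."""
    collected = []
    for start in range(0, len(stream), segment_size):
        segment = stream[start:start + segment_size]
        collected.extend(k_medians_1d(segment, min(k, len(segment))))
    return sorted(k_medians_1d(collected, k))

# Hypothetical stream with two groups of values.
data = [1, 2, 3, 101, 102, 103, 2, 3, 4, 100, 101, 102]
centers = stream_cluster(data, segment_size=6, k=2)
```

Only the segment centers are kept between segments, so memory use depends on the number of segments and k rather than on the length of the stream.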

Temporal Relationship Among Clusters in Data Streams


TRACDS NOTE
• TRACDS is not:
– Another stream clustering algorithm
• TRACDS is:
– A new way of looking at clustering
– Built on top of an existing clustering algorithm
• TRACDS may be used with any stream clustering algorithm

TRAC-DS Overview


TRACDS Clustering Operations


TRACDS Example

[Figure: panels labeled C and EMM]

http://www.lyle.smu.edu/IDA/TRACDS

Outline
• Spatiotemporal Stream Data
• TRACDS
• Hurricane Intensity Prediction
• PIIH
• PIIH online

Lower 9th Ward of New Orleans, Louisiana, Feb 27, 2006 Photographer: Mackenzie Schott

Hurricanes
Hurricanes are tropical cyclones with sustained winds of at least 64 kt (119 km/h, 74 mph). The major issues in forecasting hurricanes are predicting their tracks and their intensities. Compared with track prediction, intensity prediction is still relatively inaccurate.

Time step [0h, 12h, 24h, …, 120h]

Hurricane Intensity Prediction
• Hurricane intensity:
– Maximum sustained surface wind.
– Highest wind speed averaged over 1 minute, measured 10 m above the surface.
“Maximum Sustained Wind”. Wikipedia. Wikimedia Foundation, 27 August 2011. Web. 4 December 2011. Retrieved from http://en.wikipedia.org/wiki/Maximum_sustained_wind.
• Rapid intensification:
– 24-h increase in maximum wind speed ≥ 30 knots.
“Rapid Intensification,” accessed on 10/24/12, http://www.hurrnet.com/tutorial/forecasts/intensity/rapid.htm.
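The two thresholds quoted in these slides (at least 64 kt for a hurricane, and a 24-hour increase of at least 30 knots for rapid intensification) are easy to check on a wind series sampled every 12 hours. A minimal sketch follows; the function names and the wind values are hypothetical.

```python
def is_hurricane(wind_kt):
    """Sustained wind of at least 64 kt."""
    return wind_kt >= 64

def rapid_intensification_steps(wind_kt, step_hours=12, window_hours=24, threshold_kt=30):
    """Indices i where the wind rose by >= threshold_kt over the preceding window."""
    lag = window_hours // step_hours  # number of records spanning the window
    return [i for i in range(lag, len(wind_kt))
            if wind_kt[i] - wind_kt[i - lag] >= threshold_kt]

# Hypothetical wind series in knots, one value every 12 hours.
winds = [30, 30, 35, 65, 75, 95, 95]
ri = rapid_intensification_steps(winds)
```

With 12-hour records the 24-hour window spans two steps, so each value is compared with the value two positions earlier.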

Predicting Intensity
• Statistical models predict intensity based on measured stream data.
• Current state of storm
• History of this storm
• How similar storms behaved in past
• Regression models are the most popular.
• NOAA (branch of U.S. Government)
– Collects stream data.
– Updates its models yearly based on data from the previous year.
– Makes predictions in a quasi-real-time manner.

Hurricane Intensity Prediction
“Objective: Improve forecast skill to accuracy and confidence levels required for decision‐making and risk management” NOAA’s National Weather Service Strategic Plan 2010-2020
• Very difficult to predict intensity (rapid intensification)
• National Hurricane Center (NHC) uses
– Dynamical models: computationally intensive and slow
– Statistical models: Statistical Hurricane Intensity Prediction Scheme (SHIPS)

[Figure: Path of Hurricane Katrina (2005); color shows intensity]
• Category 5 - 175 mph
• Damage: estimated $125 billion
• Fatalities: >1,800
“Hurricane Katrina – Most Destructive Hurricane Ever to Strike the U.S.”, August 28, 2005, February 12, 2007, http://www.katrina.noaa.gov/.

Current Storm – SANDY
http://www.nhc.noaa.gov/archive/2012/SANDY_graphics.shtml

Remote Sensing
• Storm features are gathered from observations of the earth using remote sensing.
• Real-time data are gathered every few hours and stored in large databases.
• Historical data covering more than 20 years of the earth's behavior is stored in the database.
• Methods:
– Satellite
– Buoy
– Ship
– Aircraft

Outline
• Spatiotemporal Stream Data
• TRACDS
• Hurricane Intensity Prediction
• PIIH
• PIIH online

Hurricane Data

[Figure: records for hurricane 1, hurricane 2, …, hurricane 274, each a sequence at 0h, 12h, 24h, …, with 16 predictors per record, covering 1982 to 2003]

The data contains 16 predictors. The dataset is formed by time-ordered records at 12-hour intervals and contains the hurricane data from seasons 1982 to 2003.

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
25,0,1,-5.83,668,0,140,14.9,-53.5,13.25,40.5,23,6.6,27,372.5,19600
25,0,1,-5.83,708,0,140,12.7,-53.45,13.65,37.5,17.5,5.69,4,317.5,19600
30,5,1,-3.58,682,150,135,12.75,-53.35,13.25,34,1.5,5.79,15,382.5,18225
35,5,1,-4.9,674,175,130,14.2,-53.35,13.4,33,-12,6.66,-13,497,16900
50,15,1,0.44,681,750,113.52,17.1,-53.15,13.2,35,-20,8.32,-7,855,12885.79
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
30,0,0.99,-7.02,656,0,124.55,19.05,-52.55,14.75,51,0.5,6.68,45,571.5,15512.49
30,0,0.98,-7.02,675,0,123.75,17.3,-52.6,14.15,54,5,6.63,22,519,15314.28
35,5,0.98,-4.16,722,175,119.55,17.9,-52.6,14.65,58,10,7.43,34,626.5,14292
65,30,0.97,4.09,635,1950,88.77,19.15,-52.1,14.7,54.5,27.5,8.63,33,1244.75,7879.26
75,10,0.97,6.25,724,750,70.08,17.8,-52.15,12.55,54,48.5,8.61,45,1335,4910.92
95,20,0.96,9.17,641,1900,37.59,14.85,-52.9,11.1,56.5,55,7.87,15,1410.75,1413.13
95,0,0.96,7.2,691,0,33.33,15.6,-53.45,9.25,51.5,44.5,8.97,32,1482,1110.98
95,0,0.95,0.82,713,0,35.62,17.9,-53.25,7.85,47,38,10.72,31,1700.5,1268.43
95,0,0.95,2.4,813,0,28.12,20.85,-52.65,7.25,45,45,12.84,63,1980.75,790.65
115,20,0.93,10.65,635,2300,-11.1,24.45,-52.7,4.55,41.5,57.5,15.81,24,2811.75,123.2
110,-5,0.93,14.51,622,-550,-26.24,30.7,-53.55,1.15,40.5,50.5,21.2,28,3377,688.71
90,-20,0.91,18.15,613,-1800,-17.97,37.05,-53.95,0,46,29.5,27.08,42,3334.5,322.99
70,-20,0.91,21.86,668,-1400,1.01,40.3,-53.7,0,52.5,20,30.72,41,2821,1.02
70,0,0.89,26.22,688,0,2.35,45.05,-52.7,0.25,50.5,37.5,35.18,31,3153.5,5.5
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
……
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

[Figure annotation: Intensity]
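In the sample records, rows of sixteen zeros appear between runs of non-zero records, which suggests they act as separators between storms. Assuming that reading, and assuming one comma-separated record per line, the data could be split into per-storm sequences as sketched below. The function name is hypothetical, and the short sample reuses a few rows from the slide.

```python
def split_storms(lines):
    """Group comma-separated records into storms, using all-zero rows as separators."""
    storms, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = [float(v) for v in line.split(",")]
        if all(v == 0 for v in record):
            # A separator row closes the storm collected so far, if any.
            if current:
                storms.append(current)
                current = []
        else:
            current.append(record)
    if current:
        storms.append(current)
    return storms

sample_lines = [
    "0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0",
    "25,0,1,-5.83,668,0,140,14.9,-53.5,13.25,40.5,23,6.6,27,372.5,19600",
    "30,5,1,-3.58,682,150,135,12.75,-53.35,13.25,34,1.5,5.79,15,382.5,18225",
    "0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0",
    "30,0,0.99,-7.02,656,0,124.55,19.05,-52.55,14.75,51,0.5,6.68,45,571.5,15512.49",
    "0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0",
]
storms = split_storms(sample_lines)
```

Each storm is then a time-ordered list of 16-value records, ready to be fed one record at a time to a stream model.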

Construct EMM


Use EMM for Prediction


EMM, TRACDS and Hurricane Data
• Approach: using TRACDS algorithms, construct multiple EMMs, one for each time point into the future for which predictions are to be made: 12 hours, …, 120 hours.
• NOAA provides 16 different features or predictors (attribute values).
• Clustering is performed based on a distance calculation from the input feature vector to the centroids of the clusters in the EMMs.
• However, the importance of these features to intensity prediction is not uniform.
• How can we determine a weight for each feature, to be used during clustering?

Weighted Feature Learning - Extensible Markov Model (WFL-EMM)
WFL-EMM assumes that the different predictors contribute differently to the prediction.

[Figure: weights for predictors f1, f2, …, f7 on a scale from 0 to 1, applied to input vectors V1, V2, …]

In WFL-EMM, a weight vector u = <u1, u2, …, un> indicates the weights for the different predictors, where ui ∈ [0, 1]. ui = 1 means the ith predictor is important and ui = 0 implies that the ith predictor is ignored.

Weighted Feature Learning - Extensible Markov Model (WFL-EMM)
GA Learning Process

The question is how to locate a fit weight vector u = <u1, u2, …, un> for hurricane intensity predictions.

A genetic algorithm (GA) is introduced in WFL-EMM to find the fittest weight vector, the one that gives the smallest prediction error.

Weighted Feature Learning - Extensible Markov Model (WFL-EMM)
Given a weight vector u = <u1, u2, …, un>, the data is transformed in two steps:

• Normalization: normalize every predictor to the range [0, 1]. First standardize the predictor values by
zi = (xi - mean(x)) / sd(x)
where mean(x) and sd(x) are the mean and standard deviation of the ith predictor. Then a non-linear normalization, governed by a damping coefficient, maps zi to the interval [0, 1].

• Transformation: assume a normalized record d = <d1, …, dn>. The record is then transformed as d’ = <u1 d1, …, un dn>.
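The two transformation steps can be sketched as follows. The non-linear normalization formula is not given in the text, so the logistic function used here is an assumption chosen only because it maps any real value into (0, 1), with `alpha` standing in for the damping coefficient. The use of the population standard deviation, the function names, and the example records are also assumptions.

```python
import math
from statistics import mean, pstdev

def standardize(column):
    """z = (x - mean) / sd for one predictor column (population sd assumed)."""
    m, sd = mean(column), pstdev(column)
    return [(x - m) / sd if sd else 0.0 for x in column]

def squash(z, alpha=1.0):
    """ASSUMPTION: logistic map into (0, 1); alpha acts as a damping coefficient."""
    return 1.0 / (1.0 + math.exp(-z / alpha))

def transform(records, weights, alpha=1.0):
    """Normalize each predictor column, then scale each value by its weight ui."""
    columns = list(zip(*records))
    normalized = [[squash(z, alpha) for z in standardize(col)] for col in columns]
    return [[u * d for u, d in zip(weights, row)] for row in zip(*normalized)]

# Hypothetical records with two predictors; the second predictor is ignored (u2 = 0).
out = transform([[1, 10], [3, 10], [5, 10]], weights=[1.0, 0.0])
```

After this transformation a plain Euclidean distance between records already reflects the weights, so the clustering step itself does not need to change.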


Weighted Feature Learning - Extensible Markov Model (WFL-EMM)
GA Learning Process
• The question is how to locate a fit weight vector u = <u1, u2, …, un> for hurricane intensity predictions.
• These weights are used during clustering: they are applied to the distance/similarity measure used for clustering.
• A genetic algorithm (GA) is introduced in WFL-EMM to find the fittest weight vector, the one that gives the smallest prediction error.

Weighted Feature Learning - Extensible Markov Model (WFL-EMM)
GAs try to locate a fit solution in a solution space. The weight vector u = <u1, u2, …, un> ranges over the space [0, 1]^n, since each ui is a real value in [0, 1].

[Figure: solution space with the fit solution marked]

Weighted Feature Learning - Extensible Markov Model (WFL-EMM)
GA Learning Process

Genetic algorithm evolution
Each time, two chromosomes are selected randomly from the ith population with probability proportional to their fitness, where a chromosome is the Gray code string of a weight vector u.

[Figure: Chromosome 1 and Chromosome 2 drawn from Population i]

Weighted Feature Learning - Extensible Markov Model (WFL-EMM)
GA Learning Process

Genetic algorithm evolution
• Crossover: Chromosome 1 and Chromosome 2 are combined into an offspring.
• Mutation: randomly alter one or more bits in the offspring based on a given probability.
• Inversion: randomly select a break point in a chromosome and then exchange the positions of the two pieces.
• New chromosome: calculate the fitness of the obtained chromosome and place it into population i+1.
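The chromosome encoding and the evolution operators described in these slides can be sketched as below. This is an illustrative sketch, not the authors' code: the 8-bit resolution per weight, the single-point crossover, and all function names are assumptions. Selection here expects fitness values where larger is better, so an error measure would first have to be converted, for example to 1 / (1 + error).

```python
import random

BITS = 8  # assumed number of bits per weight

def encode(weights):
    """Weights in [0, 1] -> Gray-coded bit string."""
    out = []
    for w in weights:
        n = round(w * (2**BITS - 1))
        gray = n ^ (n >> 1)  # binary to Gray code
        out.append(format(gray, f"0{BITS}b"))
    return "".join(out)

def decode(chromosome):
    """Gray-coded bit string -> list of weights in [0, 1]."""
    weights = []
    for i in range(0, len(chromosome), BITS):
        gray = int(chromosome[i:i + BITS], 2)
        n = 0
        while gray:  # Gray code to binary: XOR of all right shifts
            n ^= gray
            gray >>= 1
        weights.append(n / (2**BITS - 1))
    return weights

def select(population, fitness, rng):
    """Pick two parents with probability proportional to fitness."""
    return rng.choices(population, weights=fitness, k=2)

def crossover(a, b, rng):
    """Single-point crossover (assumed variant): head of a, tail of b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(chromosome, rate, rng):
    """Flip each bit independently with the given probability."""
    return "".join(("1" if bit == "0" else "0") if rng.random() < rate else bit
                   for bit in chromosome)

def invert(chromosome, rng):
    """Pick a break point and exchange the positions of the two pieces."""
    point = rng.randrange(1, len(chromosome))
    return chromosome[point:] + chromosome[:point]

rng = random.Random(0)
parent_a = encode([0.0, 1.0, 0.25])
parent_b = encode([1.0, 0.5, 0.75])
child = invert(mutate(crossover(parent_a, parent_b, rng), 0.05, rng), rng)
```

Gray coding means that neighboring integer values differ in a single bit, so a one-bit mutation can produce a small change in the decoded weight.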

Weighted Feature Learning - Extensible Markov Model (WFL-EMM)
GA Learning Process

Fitness of the chromosome
A chromosome is first decoded into a weight vector u. This u is applied to generate a GEMM from the training data. The fitness is then calculated as either the mean absolute deviation (MAD) or the root mean square error (RMSE) on the testing data. The fittest weight vector u is located during the evolution of the GA.
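The two error measures named for the fitness calculation have standard definitions, sketched below for paired lists of observed and predicted intensities. How the presentation turns the error into a fitness value is not shown, so this sketch stops at the error itself; the function names and the example values are hypothetical.

```python
import math

def mad(observed, predicted):
    """Mean absolute deviation between observations and predictions."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean square error between observations and predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

# Hypothetical intensities in knots.
observed = [60, 70, 80]
predicted = [65, 70, 70]
mad_value = mad(observed, predicted)
rmse_value = rmse(observed, predicted)
```

RMSE penalizes large misses more heavily than MAD, so the two choices can favor different weight vectors.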

Results - Experiment 2: Evaluating WFL-EMM by using k-fold cross validation technique over the dataset from 1982 to 2003 (set MAD as fitness).


Results
It is interesting to look at the weights of the features because these weights reveal information about what the main drivers of intensity change might be.

Learn feature weights using Genetic Algorithm. Weights for features over time.


PIIH – Prediction Intensity Interval Model for Hurricanes

Historic hurricane data
TRACDS™: data stream clustering + temporal order model

Features
• Current wind speed
• Various temperatures
• Time of the year
• Direction of movement
• GOES Satellite Data (IR)

Currently 23 features from the Statistical Hurricane Intensity Prediction Scheme (SHIPS)

Prediction using PIIH – Irene (2011)
• Current features of hurricane
• Aggregate possible future scenarios into a prediction

PIIH Output for Irene (2011)

Model        MAD      MSE
PIIH         14.28    310.79
SHIFOR 5*    12.64    229.49
LGEM         15.06    411.73
SHIPS        14.80    319.64
D-SHIPS      17.11    500.36

MAD: mean absolute deviation. MSE: mean squared error. * Baseline model.

PIIH Advantages
• Real time
• Dynamic
• Machine learning
• Confidence bands
• By analyzing the 2011 storms through Nate, we observed the following:
– 96.33% of observations fell within the 95% confidence band
– 92.8% of observations fell within the 90% confidence band
– 74.27% of observations fell within the 68% confidence band
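Coverage figures like the ones above are the fraction of observed values that fall inside the predicted band. A minimal sketch of that calculation follows; the function name and the numbers are hypothetical and are not the 2011 storm data.

```python
def coverage(observations, lower, upper):
    """Fraction of observations inside their [lower, upper] band (bounds inclusive)."""
    inside = sum(lo <= obs <= hi for obs, lo, hi in zip(observations, lower, upper))
    return inside / len(observations)

# Hypothetical observed intensities and band limits, in knots.
observed = [50, 60, 70, 80]
band_low = [45, 62, 65, 70]
band_high = [55, 70, 75, 85]
share = coverage(observed, band_low, band_high)
```

A well-calibrated 95% band should give a coverage close to 0.95 when evaluated over many forecasts.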

Outline
• Spatiotemporal Stream Data
• TRACDS
• Hurricane Intensity Prediction
• PIIH
• PIIH online

http://IDA.lyle.smu.edu/PIIH/

Future Work
1. Deploy model with NOAA
• Add decay model over land
• Evaluate additional features
• Predict rapid intensification
• Interface with NOAA’s systems
2. Improve the TRACDS™ model
• Data stream clustering
• Higher-order effects
• Improve model selection and outlier handling

PIIH Bibliography


Thank you! http://www.lyle.smu.edu/IDA
