楊恭娟
Kung-Jiuan Yang
博士論文(2013)
高效性部份週期樣式探勘演算法之研究
A Study on Efficient Algorithms for Partial Periodic Pattern Mining
關鍵字 Keywords
資料探勘、部分週期樣式探勘、事件序列、投影技術、最大子樣式法、上限邊界
Data mining、partial periodic pattern mining、event sequence、projection technique、upper-bound
摘要
近年來,資料探勘技術已被廣泛應用在各種資料的分析應用上,其中週期樣式探勘之相關應用,亦已成為受歡迎的研究議題之一。事實上,在實際生活的例子中,並不容易找到發生時間位置是固定且”完整”的週期樣式,因而衍生出”部份”週期樣式探勘之研究。然而,相較於完整週期樣式探勘,因部分週期樣式探勘允許在一些時間位置中忽略其事件的發生與否,故將會產生更大量的候選樣式集,進而造成須耗費更大量的時間成本來處理這些候選樣式集。因此如何設計出一高效性之演算法之議題便顯得相當重要。除此之外,傳統的部份週期樣式探勘僅以頻率出現的準則來挖掘出有價值之樣式,此方式也許會讓一些具有重要卻出現頻率較低的事件無法通過所設定門檻值因而被忽略。因此,本論文除了發展高效性之部分週期樣式探勘演算法之外,亦進一步提出以事件具有不同權重值與不同門檻值標準之相關議題,以求能從事件序列中挖掘出不同於僅以頻率出現為主要考量之部份週期樣式資訊。
關於部份週期樣式探勘的議題,本論文先提出一個以投影技術為基礎之PPA (Projection-based Pattern Mining Approach)演算法,能有效地應用於事件序列之部分週期樣式挖掘議題裡。為了有效提升執行效能,本論文提出另一個具有刪除及過濾機制之PRA (Pruning Redundancy Approach) 演算法,以有效降低不必要候選樣式之數量。在實驗評估裡,實驗結果顯示在長週期及低門檻值情況下所提的方法相較於傳統MSA(Max-Subpattern Mining Algorithm)演算法能有70%以上的效能改善率。
另一方面,由於在傳統部份週期樣式探勘議題裡,假設所有事件之重要性皆為相同,且僅以一最小支持度門檻值作為有效樣式之判定標準。因此本論文分別提出事件具有不同權重值(Weighted Partial Periodic Pattern Mining)或多重門檻值(Partial Periodic Pattern Mining with Multiple Minimum Constraints)考量之研究議題。其中,由於WPPM不具向下封閉性,因此提出一個具二階段式之PWA (Projection-based Weighted Mining Approach)演算法,其第一階段是以每個週期的最大權重值作為週期上限邊界值之高估模型,進而找出所有可能是高權重頻繁部份週期樣式之集合;接著在第二階段裡,執行一次資料掃描以獲得各樣式之實際平均權重值,以找出所需之樣式。另一個具有多重最小門檻限制之議題則是提出一個PAMMS (Projection-based Mining Approach with Multiple Minimum Supports)的演算法,PAMMS會以事件集合中的最小門檻值作為第一階段找出所有可能為高頻繁部份週期樣式之集合,在第二階段則是從這集合裡,獲得每個滿足自身實際支持度門檻之部分週期樣式集合。最後的實驗評估結果顯示在數個資料庫測試上,PWA及PAMMS可兼顧原有的探勘效率並能有效地找出重要的部分週期樣式。
Abstract
Data mining techniques have been widely used in a variety of data-analysis applications in recent years. To find useful rules or patterns in a single long-term time series data, the periodic pattern mining has become a very popular research topic. In real-life examples, "partial” periodic pattern mining is more flexible than “full” periodic patterns. The main reason is that the events of some time positions in a pattern can be uncertain. However, since partial periodic pattern mining can ignore the events of some time positions in a period, it has to generate a large number of candidate patterns in mining. Then, how to develop efficient partial periodic pattern mining algorithms for saving time cost is a critical issue. Besides, since most of studies related to partial periodic pattern mining only consider the supports of items in period segments, many useful patterns with low-frequency but high-significance in event sequence data may not be found. Hence, in this dissertation, we propose not only two efficient projection-based mining algorithms but also the two new issues, respectively named weighted partial periodic pattern mining (WPPP) and partial periodic pattern mining with multiple minimum constraints (PPPMM).
As to the traditional partial periodic pattern mining, the two algorithms, PPA (Projection-based Pattern Mining Approach) and PRA (Pruning Redundancy Approach), were proposed to enhance the execution efficiency in finding partial periodic patterns from a single event sequence with single or multiple events in a time point. Different from the PPA algorithm without any strategies, the PRA algorithm adopts two effective strategies, pruning and filtering, to reduce a large number of candidates in mining. The experimental results on several synthetic and real datasets showed the proposed approaches get up to 70% performance improvement when compared to the traditional MSA (Max-Subpattern Hit Set) algorithm.
For the issue of WPPP, since the downward-closure property cannot be kept in this problem, an effective upper-bound model, which the maximum weight of all events in a period segment as the upper-bound of any sub-pattern in that segment, is developed to achieve this goal. Based on the model, a two-phase mining approach PWA (Projection-based Weighted Mining Approach) is also presented to complete the WPPP mining tasks. For another issue PPPMM, an efficient two-phase mining approach PAMMS (Projection-based Mining Approach with Multiple Minimum Supports) is proposed to handle this problem. Especially, since the downward-closure property is not kept in the problem of PPPMM, the minimum constraint value of all events in mining is used to avoid information losing. Finally, the experimental results show that the performance of both PWA and PAMMS in terms of pruning effectiveness and execution efficiency on synthetic and real datasets.