redoules.github.io/blog/Time_Series_anomaly_detection.html
Guillaume 44f740504b added an article
about uploading data to a sharepoint site
2020-07-20 20:20:09 +02:00

276 lines
15 KiB
HTML

<!DOCTYPE html>
<html lang="fr">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="Data Science for Political and Social Phenomena">
<meta name="author" content="Guillaume Redoulès">
<link rel="icon" href="../favicon.ico">
<title>Time Series anomaly detection - Blog</title>
<!-- JQuery -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script>
window.jQuery || document.write('<script src="../theme/js/jquery.min.js"><\/script>')
</script>
<!-- Bootstrap core CSS -->
<link rel="stylesheet" href="../theme/css/bootstrap.css" />
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<link rel="stylesheet" type="text/css" href="../theme/css/ie10-viewport-bug-workaround.css" />
<!-- Custom styles for this template -->
<link rel="stylesheet" type="text/css" href="../theme/css/style.css" />
<link rel="stylesheet" type="text/css" href="../theme/css/notebooks.css" />
<link href='https://fonts.googleapis.com/css?family=PT+Serif:400,700|Roboto:400,500,700' rel='stylesheet' type='text/css'>
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<meta name="tags" content="Basics" />
</head>
<body>
<div class="navbar navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="..">Guillaume Redoulès</a>
</div>
<div class="navbar-collapse collapse" id="searchbar">
<ul class="nav navbar-nav navbar-right">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">About<span class="caret"></span></a>
<ul class="dropdown-menu">
<li><a href="../pages/about.html">About Guillaume</a></li>
<li><a href="https://github.com/redoules">GitHub</a></li>
<li><a href="https://www.linkedin.com/in/guillaume-redoul%C3%A8s-33923860/">LinkedIn</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Data Science<span class="caret"></span></a>
<ul class="dropdown-menu">
<li><a href="..#Blog">Blog</a></li>
<li><a href="..#Python">Python</a></li>
<li><a href="..#Bash">Bash</a></li>
<li><a href="..#SQL">SQL</a></li>
<li><a href="..#Mathematics">Mathematics</a></li>
<li><a href="..#Machine_Learning">Machine Learning</a></li>
<li><a href="..#Projects">Projects</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Projects<span class="caret"></span></a>
<ul class="dropdown-menu">
<li><a href="https://github.com/redoules/redoules.github.io">Notes (Github)</a></li>
</ul>
</li>
<!--<li class="dropdown">
<a href="../feeds/blog.rss.xml">Blog RSS</a>
</li>-->
</ul>
<form class="navbar-form" action="../search.html" onsubmit="return validateForm(this.elements['q'].value);">
<div class="form-group" style="display:inline;">
<div class="input-group" style="display:table;">
<span class="input-group-addon" style="width:1%;"><span class="glyphicon glyphicon-search"></span></span>
<input class="form-control search-query" name="q" id="tipue_search_input" placeholder="e.g. scikit KNN, pandas merge" required autocomplete="off" type="text">
</div>
</div>
</form>
</div>
<!--/.nav-collapse -->
</div>
</div>
<!-- end of header section -->
<div class="container">
<!-- <div class="alert alert-warning" role="alert">
Did you find this page useful? Please do me a quick favor and <a href="#" class="alert-link">endorse me for data science on LinkedIn</a>.
</div> -->
<section id="content" class="body">
<header>
<h1>
Time Series anomaly detection
</h1>
<ol class="breadcrumb">
<li>
<time class="published" datetime="2019-12-24T10:42:00+01:00">
24 décembre 2019
</time>
</li>
<li>Blog</li>
<li>Basics</li>
</ol>
</header>
<div class='article_content'>
<h1>Time series anomaly detection</h1>
<p>"An anomaly is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism." (Hawking 1980)
"Anomalies [...] may or not be harmful." (Esling 2012)</p>
<h2>Types of anomalies</h2>
<p>The anomalies in an industrial system are often influenced by external factors such as speed or product being manufactured. There external factors represent the context and should be added to the feature vector.</p>
<p>Furthermore, there might be a difference between what you detect and what are the people actually interested in on site. </p>
<p>On industrial systems, you would find different types of anomalies signatures. A bearing degrading or gear wear would result in a <em>progressive shift</em> from the normal state. Other pattern might be detected with such anomalies : the mean is going up or the amplitude of the phenomenon is increasing or a cyclic pattern appear more often. </p>
<p>When a component breaks or when something gets stuck the anomaly signature would result in a <em>persitent change</em>. This type of signature would also appear after a poorly performed maintenance. IN this case, a stepwise pattern appears in the time series data.</p>
<p>Other anomalies can appear in the data. For example, a measuring error or a short current spike caused by an induction peak can appear and is considered an anomaly because it is clearly out of trend. However, it is often the case that those anomalies are don't represent errors and are a normal part of the process. </p>
<p>In order to alert on the anomalies that represent an error or a degradation of the system and filter out the spike anomalies, some feature engineering has to be done. </p>
<h2>Feature extraction</h2>
<p>This represent the most important part of the analysis.
Either you use knowledge of the experts, intuition of literatures (especially for bearing and rotating machines).</p>
<p>Or you perform an automated feature extraction using packages such as :</p>
<ul>
<li><a href="https://github.com/benfulcher/hctsa">HTCSA (highly comparative time-series analysis)</a> is a library implementing more than 7000 features (use pyopy for Python on Linux and OSX). It allows to normalize and clster the data, produce low dimensional representation of the data, identify and discriminate features between different classes of time series, learn multivariate classification models, vizualise the data, etc.</li>
<li><a href="https://github.com/chlubba/catch22">Catch22</a> reduces the 7000 features coded in HTCSA to the 22 that produced the best results across 93 real world time-series datasets. </li>
<li><a href="https://github.com/blue-yonder/tsfresh">tsfresh</a> is a package that automatically calculates a large number of time series characteristics and contains methods to evaluate the explaining power and importance of such characteristics for regression or classification tasks</li>
</ul>
<p>A combinaison of both automatically extracted knowledge and human knowledge can be combined. For instance, you can filter the spikes with a rolling median and then use catch22 on the resulting data. Or you can in parallel use your knowledge about bearing degradation and some automatically extracted feature.</p>
<h2>Unsupervised Anomaly Detection algorithms</h2>
<p>When you are using unsupervised anomaly detection algorithm you postulate that the majority is normal and you try to find outliers. Those outliers are the anomalies. This approach is useful when you only have unlabeled data. </p>
<p>Algorithms used in this case are often :</p>
<ul>
<li>
<p>nearest neighbor / density based :</p>
<ul>
<li>Global : K-Nearest Neighbor (K-NN), DBSCAN</li>
<li>Local : Local Outlier Factor (LOF)</li>
</ul>
</li>
<li>
<p>Clustering based:</p>
<ul>
<li>Global : Cluster Based Local Outlier Factor (CBLOF/uCBLOF)</li>
<li>Local : Local Density Cluster-based Outlier Factor (LDCOF)</li>
</ul>
</li>
</ul>
<p>The tricky part is to set k, the number of clusters and the other hyperparameters.</p>
<p>Furthermore, this kind of alogrithms perform poorly against persitant changes because the normal and anormal states would be in two clusters but they would be identified as normal by the algorithm since they represent the majority of the data. </p>
<h2>Semi-supervised Anomaly Detection algorithms</h2>
<p>The first approach is to train the algorithm on healthy data and detect an anomaly when the distance between the measured point and the healthy cluster exceeds a value.
* Distance based measures to healthy states such as the measure of the Mahalanobis distance
<img alt="Mahalanobis distance" src="../images/time_series_anomaly_detection/distancefeatured-1.png"></p>
<p>You can also model the surface of the healthy state and detect an anomaly when the measure crosses the surface : </p>
<ul>
<li>
<p>Rich Representation of Healthy State:</p>
<ul>
<li>One-class Support Vector Machines (SVM)</li>
<li>One-class Neuronal Networks</li>
</ul>
</li>
</ul>
<p>Finally you can perform a dimension reduction of the space by finding new basis function of the state, and keeping only the n most important feature vector. An anomaly is detected when the reconstruction error grows because it is not part of what is considered normal.</p>
<ul>
<li>
<p>Reconstruction Error with Basis Functions :</p>
<ul>
<li>Principal Component Analysis (PCA)</li>
<li>Neuronal Network (Autoencoders)</li>
</ul>
</li>
</ul>
<p>Very important : Do not use dimensionality reduction (like PCA) before the anomaly detection because you would throw away all the anomalies. </p>
<p><img alt="PCA" src="../images/time_series_anomaly_detection/GaussianScatterPCA.svg"></p>
<p>This kind of semi supervised approach is strongly dependent on the data. Hence if you don't have a healthy state in the training set then the output of the algorithm won't be useful.</p>
<h2>Supervised anomaly detection algorithm</h2>
<p>Here, you apply classical classification methods for machine learning. However, be careful when training your classifiers because you have very imbalanced classes.</p>
<h2>Conclusions</h2>
<p>Anomalies may or may not be harmful! Hence you have to focus on the one that can damage your system.
Anomaly interpretation depend a lot on the context (spike, progressive change, persitent change)
Questions for feature extraction (collective, contextual or point like):</p>
<ul>
<li>which external influence ?</li>
<li>which kind of events should be detected ? </li>
</ul>
<p>Questions for choice of algorithm :</p>
<ul>
<li>Does data have labelled events ? -&gt; Supervised learning</li>
<li>Is healthy state marked ? -&gt; Semi Supervised</li>
<li>If no knowledge at all -&gt; Unsupervised</li>
</ul>
<p>Questions for model deployment</p>
<ul>
<li>When is information needed (real-time vs historic)?</li>
</ul>
</div>
<aside>
<div class="bug-reporting__panel">
<h3>Find an error or bug? Have a suggestion?</h3>
<p>Everything on this site is avaliable on GitHub. Head on over and <a href='https://github.com/redoules/redoules.github.io/issues/new'>submit an issue.</a> You can also message me directly by <a href='mailto:guillaume.redoules@gadz.org'>email</a>.</p>
</div>
</aside>
</section>
</div>
<!-- start of footer section -->
<footer class="footer">
<div class="container">
<p class="text-muted">
<center>This project contains 119 pages and is available on <a href="https://github.com/redoules/redoules.github.io">GitHub</a>.
<br/>
Copyright &copy; Guillaume Redoulès,
<time datetime="2018">2018</time>.
</center>
</p>
</div>
</footer>
<!-- This jQuery line finds any span that contains code highlighting classes and then selects the parent <pre> tag and adds a border. This is done as a workaround to visually distinguish the code inputs and outputs -->
<script>
$( ".hll, .n, .c, .err, .k, .o, .cm, .cp, .c1, .cs, .gd, .ge, .gr, .gh, .gi, .go, .gp, .gs, .gu, .gt, .kc, .kd, .kn, .kp, .kr, .kt, .m, .s, .na, .nb, .nc, .no, .nd, .ni, .ne, .nf, .nl, .nn, .nt, .nv, .ow, .w, .mf, .mh, .mi, .mo, .sb, .sc, .sd, .s2, .se, .sh, .si, .sx, .sr, .s1, .ss, .bp, .vc, .vg, .vi, .il" ).parent( "pre" ).css( "border", "1px solid #DEDEDE" );
</script>
<!-- Load Google Analytics -->
<script>
/*
(function(i, s, o, g, r, a, m) {
i['GoogleAnalyticsObject'] = r;
i[r] = i[r] || function() {
(i[r].q = i[r].q || []).push(arguments)
}, i[r].l = 1 * new Date();
a = s.createElement(o),
m = s.getElementsByTagName(o)[0];
a.async = 1;
a.src = g;
m.parentNode.insertBefore(a, m)
})(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');
ga('create', 'UA-66582-32', 'auto');
ga('send', 'pageview');
*/
</script>
<!-- End of Google Analytics -->
<!-- Bootstrap core JavaScript
================================================== -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="../theme/js/bootstrap.min.js"></script>
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<script src="../theme/js/ie10-viewport-bug-workaround.js"></script>
</body>
</html>