<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="Data Science for Political and Social Phenomena">
<meta name="author" content="Guillaume Redoulès">
<link rel="icon" href="../favicon.ico">
<title>Day 1 - Quartiles, Interquartile Range and standard deviation - Blog</title>
<!-- JQuery -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script>
window.jQuery || document.write('<script src="../theme/js/jquery.min.js"><\/script>')
</script>
<!-- Bootstrap core CSS -->
<link rel="stylesheet" href="../theme/css/bootstrap.css" />
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<link rel="stylesheet" type="text/css" href="../theme/css/ie10-viewport-bug-workaround.css" />
<!-- Custom styles for this template -->
<link rel="stylesheet" type="text/css" href="../theme/css/style.css" />
<link rel="stylesheet" type="text/css" href="../theme/css/notebooks.css" />
<link href='https://fonts.googleapis.com/css?family=PT+Serif:400,700|Roboto:400,500,700' rel='stylesheet' type='text/css'>
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<meta name="tags" content="Basics" />
</head>
<body>
<div class="navbar navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="..">Guillaume Redoulès</a>
</div>
<div class="navbar-collapse collapse" id="searchbar">
<ul class="nav navbar-nav navbar-right">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">About<span class="caret"></span></a>
<ul class="dropdown-menu">
<li><a href="../pages/about.html">About Guillaume</a></li>
<li><a href="https://github.com/redoules">GitHub</a></li>
<li><a href="https://www.linkedin.com/in/guillaume-redoul%C3%A8s-33923860/">LinkedIn</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Data Science<span class="caret"></span></a>
<ul class="dropdown-menu">
<li><a href="..#Blog">Blog</a></li>
<li><a href="..#Python">Python</a></li>
<li><a href="..#Bash">Bash</a></li>
<li><a href="..#SQL">SQL</a></li>
<li><a href="..#Mathematics">Mathematics</a></li>
<li><a href="..#Machine_Learning">Machine Learning</a></li>
<li><a href="..#Projects">Projects</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Projects<span class="caret"></span></a>
<ul class="dropdown-menu">
<li><a href="https://github.com/redoules/redoules.github.io">Notes (Github)</a></li>
</ul>
</li>
<!--<li class="dropdown">
<a href="../feeds/blog.rss.xml">Blog RSS</a>
</li>-->
</ul>
<form class="navbar-form" action="../search.html" onsubmit="return validateForm(this.elements['q'].value);">
<div class="form-group" style="display:inline;">
<div class="input-group" style="display:table;">
<span class="input-group-addon" style="width:1%;"><span class="glyphicon glyphicon-search"></span></span>
<input class="form-control search-query" name="q" id="tipue_search_input" placeholder="e.g. scikit KNN, pandas merge" required autocomplete="off" type="text">
</div>
</div>
</form>
</div>
<!--/.nav-collapse -->
</div>
</div>
<!-- end of header section -->
<div class="container">
<!-- <div class="alert alert-warning" role="alert">
Did you find this page useful? Please do me a quick favor and <a href="#" class="alert-link">endorse me for data science on LinkedIn</a>.
</div> -->
<section id="content" class="body">
<header>
<h1>
Day 1 - Quartiles, Interquartile Range and standard deviation
</h1>
<ol class="breadcrumb">
<li>
<time class="published" datetime="2018-11-08T22:22:00+01:00">
08 November 2018
</time>
</li>
<li>Blog</li>
<li>Basics</li>
</ol>
</header>
<div class='article_content'>
<h2>Quartile</h2>
<h3>Definition</h3>
<p>A quartile is a type of quantile: the three quartiles split the sorted data set into four parts of roughly equal size. The first quartile (Q1) is the median of the lower half of the data, i.e. the middle value between the smallest value and the median. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the median of the upper half, i.e. the middle value between the median and the largest value.</p>
<h3>Implementation in Python without using the scientific libraries</h3>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">median</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
    <span class="c1"># middle value of the sorted data, or the mean of the two middle values</span>
    <span class="n">l</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">l</span><span class="p">)</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">l</span><span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">l</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="n">l</span><span class="p">[(</span><span class="nb">len</span><span class="p">(</span><span class="n">l</span><span class="p">)</span><span class="o">//</span><span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)])</span> <span class="o">/</span> <span class="mi">2</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">l</span><span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">l</span><span class="p">)</span><span class="o">//</span><span class="mi">2</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">quartiles</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
    <span class="c1"># check the input is not empty</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">l</span><span class="p">:</span>
        <span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s1">&#39;no data points passed&#39;</span><span class="p">)</span>
    <span class="c1"># 1. order the data set</span>
    <span class="n">l</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
    <span class="c1"># 2. divide the data set in two halves</span>
    <span class="n">mid</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">l</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span>
    <span class="n">Q2</span> <span class="o">=</span> <span class="n">median</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
    <span class="k">if</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">l</span><span class="p">)</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span>
        <span class="c1"># even</span>
        <span class="n">Q1</span> <span class="o">=</span> <span class="n">median</span><span class="p">(</span><span class="n">l</span><span class="p">[:</span><span class="n">mid</span><span class="p">])</span>
        <span class="n">Q3</span> <span class="o">=</span> <span class="n">median</span><span class="p">(</span><span class="n">l</span><span class="p">[</span><span class="n">mid</span><span class="p">:])</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># odd: the median itself belongs to neither half</span>
        <span class="n">Q1</span> <span class="o">=</span> <span class="n">median</span><span class="p">(</span><span class="n">l</span><span class="p">[:</span><span class="n">mid</span><span class="p">])</span> <span class="c1"># same as even</span>
        <span class="n">Q3</span> <span class="o">=</span> <span class="n">median</span><span class="p">(</span><span class="n">l</span><span class="p">[</span><span class="n">mid</span><span class="o">+</span><span class="mi">1</span><span class="p">:])</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">Q1</span><span class="p">,</span> <span class="n">Q2</span><span class="p">,</span> <span class="n">Q3</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="n">L</span> <span class="o">=</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">8</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">12</span><span class="p">,</span><span class="mi">14</span><span class="p">,</span><span class="mi">21</span><span class="p">,</span><span class="mi">13</span><span class="p">,</span><span class="mi">18</span><span class="p">]</span>
<span class="n">Q1</span><span class="p">,</span> <span class="n">Q2</span><span class="p">,</span> <span class="n">Q3</span> <span class="o">=</span> <span class="n">quartiles</span><span class="p">(</span><span class="n">L</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Sample : </span><span class="si">{</span><span class="n">L</span><span class="si">}</span><span class="se">\n</span><span class="s2">Q1 : </span><span class="si">{</span><span class="n">Q1</span><span class="si">}</span><span class="s2">, Q2 : </span><span class="si">{</span><span class="n">Q2</span><span class="si">}</span><span class="s2">, Q3 : </span><span class="si">{</span><span class="n">Q3</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="err">Sample : [3, 7, 8, 5, 12, 14, 21, 13, 18]</span>
<span class="err">Q1 : 6.0, Q2 : 12, Q3 : 16.0</span>
</pre></div>
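<p>As a cross-check, Python's standard library offers <code>statistics.quantiles</code> (Python 3.8+). The sketch below is not part of the implementation above; it assumes the default <code>method="exclusive"</code>, which interpolates between data points instead of taking the medians of two halves. The two conventions agree on the sample above, but they can differ on other inputs.</p>

```python
from statistics import quantiles

L = [3, 7, 8, 5, 12, 14, 21, 13, 18]

# n=4 asks for the three cut points that split the data into four groups
stdlib_quartiles = quantiles(L, n=4)
print(stdlib_quartiles)  # [6.0, 12.0, 16.0]

# the conventions diverge on other samples, e.g. a four-point one:
# median-of-halves gives Q1 = 1.5, the exclusive method interpolates to 1.25
print(quantiles([1, 2, 3, 4], n=4)[0])  # 1.25
```

<p>Several quartile conventions are in common use, so it is worth stating which one was used when reporting results.</p>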
<h2>Interquartile Range</h2>
<h3>Definition</h3>
<p>The interquartile range (IQR) of a data set is the difference between its third (Q3) and first (Q1) quartiles: IQR = Q3 - Q1.</p>
<h3>Implementation in Python without using the scientific libraries</h3>
<div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Interquartile range : </span><span class="si">{</span><span class="n">Q3</span><span class="o">-</span><span class="n">Q1</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="err">Interquartile range : 10.0</span>
</pre></div>
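<p>The same computation can be wrapped in a single function. This is a minimal sketch: the name <code>interquartile_range</code> is hypothetical, and it leans on <code>statistics.median</code> from the standard library while keeping the median-of-halves convention used in this article.</p>

```python
from statistics import median

def interquartile_range(data):
    """Return Q3 - Q1, taking Q1 and Q3 as the medians of the two halves."""
    # each half needs at least one value, so require two data points
    if len(data) < 2:
        raise ValueError("at least two data points are required")
    ordered = sorted(data)
    mid = len(ordered) // 2
    lower = ordered[:mid]
    # skip the median itself when the number of points is odd
    upper = ordered[mid + 1:] if len(ordered) % 2 else ordered[mid:]
    return median(upper) - median(lower)

iqr = interquartile_range([3, 7, 8, 5, 12, 14, 21, 13, 18])
print(f"Interquartile range : {iqr}")  # Interquartile range : 10.0
```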
<h2>Standard deviation</h2>
<h3>Definition</h3>
<p>The standard deviation (σ) quantifies the amount of variation or dispersion in a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that they are spread out over a wider range of values.</p>
<p>The standard deviation can be computed with the formula:</p>
<p><img alt="Standard deviation" src="../images/stat_challenge/day1/ecart_type.png"></p>
<p>where μ is the mean:</p>
<p><img alt="Mean" src="../images/stat_challenge/day1/moyenne.png"></p>
<h3>Implementation in Python without using the scientific libraries</h3>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">math</span>

<span class="n">X</span> <span class="o">=</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span><span class="mi">40</span><span class="p">,</span><span class="mi">30</span><span class="p">,</span><span class="mi">50</span><span class="p">,</span><span class="mi">20</span><span class="p">]</span>
<span class="n">mean</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="c1"># keep X intact: store the squared deviations under a separate name</span>
<span class="n">squared_deviations</span> <span class="o">=</span> <span class="p">[(</span><span class="n">x</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">X</span><span class="p">]</span>
<span class="c1"># divide by the number of data points, then take the square root</span>
<span class="n">std</span> <span class="o">=</span> <span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">squared_deviations</span><span class="p">)</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">squared_deviations</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;The distribution </span><span class="si">{</span><span class="n">X</span><span class="si">}</span><span class="s2"> has a standard deviation of </span><span class="si">{</span><span class="n">std</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="err">The distribution [10, 40, 30, 50, 20] has a standard deviation of 14.142135623730951</span>
</pre></div>
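<p>The code above divides by the number of data points, which gives the population standard deviation. As a hedged cross-check that is not part of the original article, the standard library's <code>statistics</code> module provides <code>pstdev</code> for that case and <code>stdev</code> for the sample standard deviation, which divides by N - 1 instead.</p>

```python
from statistics import pstdev, stdev

X = [10, 40, 30, 50, 20]

# population standard deviation: sum of squared deviations divided by N
population_std = pstdev(X)
# sample standard deviation: divides by N - 1 instead (Bessel's correction)
sample_std = stdev(X)

print(f"population: {population_std:.4f}, sample: {sample_std:.4f}")
# population: 14.1421, sample: 15.8114
```

<p>The population value matches the hand-written computation; the sample value is larger because of the smaller divisor.</p>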
</div>
<aside>
<div class="bug-reporting__panel">
<h3>Find an error or bug? Have a suggestion?</h3>
<p>Everything on this site is available on GitHub. Head on over and <a href='https://github.com/redoules/redoules.github.io/issues/new'>submit an issue.</a> You can also message me directly by <a href='mailto:guillaume.redoules@gadz.org'>email</a>.</p>
</div>
</aside>
</section>
</div>
<!-- start of footer section -->
<footer class="footer">
<div class="container">
<p class="text-muted">
<center>This project contains 119 pages and is available on <a href="https://github.com/redoules/redoules.github.io">GitHub</a>.
<br/>
Copyright &copy; Guillaume Redoulès,
<time datetime="2018">2018</time>.
</center>
</p>
</div>
</footer>
<!-- This jQuery line finds any span that contains code highlighting classes and then selects the parent <pre> tag and adds a border. This is done as a workaround to visually distinguish the code inputs and outputs -->
<script>
$( ".hll, .n, .c, .err, .k, .o, .cm, .cp, .c1, .cs, .gd, .ge, .gr, .gh, .gi, .go, .gp, .gs, .gu, .gt, .kc, .kd, .kn, .kp, .kr, .kt, .m, .s, .na, .nb, .nc, .no, .nd, .ni, .ne, .nf, .nl, .nn, .nt, .nv, .ow, .w, .mf, .mh, .mi, .mo, .sb, .sc, .sd, .s2, .se, .sh, .si, .sx, .sr, .s1, .ss, .bp, .vc, .vg, .vi, .il" ).parent( "pre" ).css( "border", "1px solid #DEDEDE" );
</script>
<!-- Load Google Analytics -->
<script>
/*
(function(i, s, o, g, r, a, m) {
i['GoogleAnalyticsObject'] = r;
i[r] = i[r] || function() {
(i[r].q = i[r].q || []).push(arguments)
}, i[r].l = 1 * new Date();
a = s.createElement(o),
m = s.getElementsByTagName(o)[0];
a.async = 1;
a.src = g;
m.parentNode.insertBefore(a, m)
})(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');
ga('create', 'UA-66582-32', 'auto');
ga('send', 'pageview');
*/
</script>
<!-- End of Google Analytics -->
<!-- Bootstrap core JavaScript
================================================== -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="../theme/js/bootstrap.min.js"></script>
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<script src="../theme/js/ie10-viewport-bug-workaround.js"></script>
</body>
</html>