mirror of
https://github.com/redoules/redoules.github.io.git
synced 2025-12-13 00:09:33 +00:00
{"pages":[{"title":"About Guillaume Redoulès","text":"I am a data scientist and a mechanical engineer working on numerical methods for stress computations in the field of rocket propulsion. Prior to that, I obtained an MSc in Computational Fluid Dynamics and aerodynamics from Imperial College London. Email: guillaume.redoules@gadz.org Linkedin: Guillaume Redoulès Curriculum Vitae Experience Thermomechanical methods and tools engineer , Ariane Group , 2015 - Present In charge of tools and methods related to thermomechanical computations. Focal point for machine learning. Education MSc Advanced Computational Methods for Aeronautics, Flow Management and Fluid-Structure Interaction , Imperial College London, London. 2013 Dissertation: \"Estimator design for fluid flows\" Fields: Aeronautics, aerodynamics, computational fluid dynamics, numerical methods Arts et Métiers Paristech , France, 2011 Generalist engineering degree Fields: Mechanics, electrical engineering, casting, machining, project management, finance, IT, etc.","tags":"pages","url":"redoules.github.io/pages/about.html","loc":"redoules.github.io/pages/about.html"},{"title":"Kullback Leibler Divergence","text":"The Kullback-Leibler Divergence is also commonly called the KL Divergence. The KL divergence measures the difference between any two probability distributions. The KL divergence is defined as : $$D_{KL}(p||q)=\\int_{X} p(x)log\\frac{p(x)}{q(x)} dx$$ It is always non-negative and zero only when P is equal to Q. Indeed, when P is equal to Q, the log of 1 is zero, and that's why the divergence is zero. Otherwise, it is some positive quantity. It can be viewed as a distance measure, but in reality it is not one because it isn't a symmetric operator and because it doesn't satisfy the triangle inequality. $$D_{KL}(p||q)\\neq D_{KL}(q||p)$$ Where to use the KL-Divergence In supervised learning we are always trying to fit our data to a particular distribution. In that case \\(P\\) can be the unknown distribution. We usually want to build an estimated probability distribution \\(Q\\) based on the samples \\(X\\) . When the estimator is perfect, \\(P\\) and \\(Q\\) are the same, hence \\(D_{KL}(p||q) = 0\\) . This means that the KL-Divergence can be used as a measure of the error. Jonathon Shlens explains that the KL-Divergence can be interpreted as measuring the likelihood that samples from an empirical distribution \\(p\\) were generated by a fixed distribution \\(q\\) . $$D_{KL}(p||q)=\\int_{X} p(x)log\\frac{p(x)}{q(x)}dx$$ $$D_{KL}(p||q)=\\int_{X}\\left( -p(x)log q(x) +p(x) log p(x) \\right) dx$$ The entropy of p is defined as \\(H(p)=-\\int_{X}p(x) log(p(x)) dx\\) The cross entropy between p and q is defined as $$H(p,q)=-\\int_{X} p(x)log q(x) dx$$ Hence: $$D_{KL}(p||q)=H(p,q) - H(p)$$ In many machine learning algorithms, in particular deep learning, the optimization problem revolves around minimizing the cross-entropy. Taking into account what we learnt above : $$H(p,q) = H(p) + D_{KL}(p||q) $$ The cross entropy between two probability distributions \\(p\\) and \\(q\\) over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an \"artificial\" probability distribution \\(q\\) , rather than the \"true\" distribution \\(p\\) . 
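To make the decomposition above concrete, here is a minimal numpy sketch (an illustrative addition, with made-up discrete distributions p and q) checking numerically that D_KL(p||q) = H(p,q) - H(p) and that the divergence is not symmetric.

import numpy as np

# illustrative discrete distributions over the same 3-point support (made-up values)
p = np.array([0.36, 0.48, 0.16])
q = np.array([1/3, 1/3, 1/3])

def kl(p, q):
    # D_KL(p||q) = sum_x p(x) * log(p(x)/q(x))
    return np.sum(p * np.log(p / q))

H_p = -np.sum(p * np.log(p))    # entropy of p
H_pq = -np.sum(p * np.log(q))   # cross entropy between p and q

print(kl(p, q), H_pq - H_p)     # both ~0.085 : D_KL(p||q) = H(p,q) - H(p)
print(kl(p, q), kl(q, p))       # ~0.085 vs ~0.097 : the divergence is not symmetric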
This means that the cross entropy corresponds to the number of bits needed to encode the distribution \\(p\\) , given by \\(H(p)\\) , the entropy of \\(p\\) , plus the number of bits needed to encode the directed divergence between the two distributions \\(p\\) and \\(q\\) , given by \\(D_{KL}(p||q)\\) , the Kullback-Leibler Divergence.","tags":"Mathematics","url":"redoules.github.io/mathematics/KLD.html","loc":"redoules.github.io/mathematics/KLD.html"},{"title":"Downloading from quandl","text":"In this example we will download the price of bitcoin from quandl. import quandl btc_price_data = quandl . get ( \"BCHARTS/COINBASEEUR\" ) btc_price_data . 
head () Open High Low Close Volume (BTC) Volume (Currency) Weighted Price Date 2015-05-11 215.85 219.15 214.67 217.83 145.863137 31656.416712 217.028218 2015-05-12 218.04 218.50 214.16 215.50 127.225520 27467.730961 215.897966 2015-05-13 216.43 217.45 208.53 208.88 111.808285 24014.796862 214.785486 2015-05-14 209.23 209.83 204.94 207.93 148.228400 30825.088327 207.956696 2015-05-15 207.95 209.51 207.62 208.55 127.718800 26586.847126 208.167060","tags":"Python","url":"redoules.github.io/python/quandl.html","loc":"redoules.github.io/python/quandl.html"},{"title":"Computing the Mayer multiple","text":"I learnt about the Mayer multiple from The Investor's Podcast . The Mayer multiple is the ratio of the bitcoin price to its 200-day moving average. It is designed to understand the price of bitcoin without taking into account the short-term volatility. It helps investors filter out their emotions during a bull run. Let's see how to compute the Mayer multiple in Python. First, we need to import the data; we will use Quandl to download data from Coinbase import quandl btc_price_data = quandl . get ( \"BCHARTS/COINBASEEUR\" ) btc_price_data . tail () Open High Low Close Volume (BTC) Volume (Currency) Weighted Price Date 2018-12-12 2966.00 3076.71 2952.05 3026.00 1447.627465 4.372890e+06 3020.728514 2018-12-13 3025.19 3028.06 2861.15 2886.91 2125.242928 6.261750e+06 2946.369017 2018-12-14 2886.91 2919.00 2800.32 2835.50 2527.558347 7.256959e+06 2871.134083 2018-12-15 2835.49 2865.00 2781.47 2830.45 1267.004758 3.568614e+06 2816.575409 2018-12-16 2830.45 2830.45 2830.44 2830.45 0.144249 4.082886e+02 2830.447385 Next, we need to compute the 200-day moving average of the price of bitcoin moving_averages = btc_price_data [[ \"Open\" , \"High\" , \"Low\" , \"Close\" ]] . rolling ( window = 200 ) . mean () moving_averages . tail () Open High Low Close Date 2018-12-12 5507.88295 5611.72570 5380.60150 5491.43610 2018-12-13 5491.44155 5595.14440 5363.77840 5474.38560 2018-12-14 5474.39910 5577.89585 5347.27500 5457.99125 2018-12-15 5457.94930 5559.49145 5330.78235 5439.75570 2018-12-16 5439.71660 5540.89425 5313.62955 5422.17700 Finally, we can compute the ratio and plot it. % matplotlib inline import matplotlib.pyplot as plt plt . rcParams [ 'savefig.dpi' ] = 300 plt . rcParams [ 'figure.dpi' ] = 163 plt . rcParams [ 'figure.autolayout' ] = False plt . rcParams [ 'figure.figsize' ] = 20 , 12 plt . rcParams [ 'font.size' ] = 26 mayer_multiple = btc_price_data / moving_averages mayer_multiple [ \"High\" ] . plot () plt . title ( \"Mayer Multiple over time\" ) plt . ylabel ( \"Mayer Multiple\" ) plt . xlabel ( \"Time\" ) print ( f \"Mayer multiple {mayer_multiple.iloc[-1]['High']}\" ) print ( f \"Mayer multiple average {mayer_multiple.mean()['High']}\" ) Mayer multiple 0.5108290958630005 Mayer multiple average 1.3789102045356179 Lastly, I wanted to plot the distribution of the Mayer multiple import numpy as np x = mayer_multiple [ \"High\" ] . values x = x [ ~ np . isnan ( x )] n , bins , patches = plt . 
hist ( x , 100 , facecolor = 'green' , alpha = 0.75 , density = True ) plt . axvline ( x = 2.4 , color = \"red\" ) plt . annotate ( 'We are here today' , xy = ( mayer_multiple . iloc [ - 1 ][ \"High\" ], n [( np . abs ( bins - mayer_multiple . iloc [ - 1 ][ \"High\" ])) . argmin ()]), xytext = ( mayer_multiple . iloc [ - 1 ][ \"High\" ] * 3 , n . max () / 2 ), arrowprops = dict ( facecolor = 'black' , shrink = 0.05 ), ) plt . title ( \"Distribution of the Mayer multiple\" ) plt . plot () []","tags":"Cryptocurrencies","url":"redoules.github.io/cryptocurrencies/mayer_multiple.html","loc":"redoules.github.io/cryptocurrencies/mayer_multiple.html"},{"title":"Load a saved model","text":"A Keras model can be loaded from an hdf5 file. Be careful, a Keras-generated hdf5 file can contain either : the model weights (obtained by the .save_weights() method) the model weights and the model architecture (obtained by the .save() method) In our case, we will load both the model and its weights with the load_model function from the keras.models module from keras.models import load_model model = load_model ( \"my_model.h5\" ) type ( model ) keras.engine.sequential.Sequential","tags":"DL","url":"redoules.github.io/dl/keras_load.html","loc":"redoules.github.io/dl/keras_load.html"},{"title":"Get model input shape","text":"Information on the model can be accessed through the .input_shape attribute of the model object from keras.models import load_model model = load_model ( \"my_model.h5\" ) model . input_shape (None, 28, 28)","tags":"DL","url":"redoules.github.io/dl/keras_input.html","loc":"redoules.github.io/dl/keras_input.html"},{"title":"Get model info and number of parameters","text":"Information on the model can be accessed by the .summary() method from keras.models import load_model model = load_model ( \"my_model.h5\" ) model . summary () _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= flatten_2 (Flatten) (None, 784) 0 _________________________________________________________________ dense_3 (Dense) (None, 128) 100480 _________________________________________________________________ dense_4 (Dense) (None, 10) 1290 ================================================================= Total params: 101,770 Trainable params: 101,770 Non-trainable params: 0 _________________________________________________________________","tags":"DL","url":"redoules.github.io/dl/keras_info.html","loc":"redoules.github.io/dl/keras_info.html"},{"title":"Saving the model's weights after each epoch","text":"Let's see how we can save the model weights after every epoch. Let's first import some libraries import keras import numpy as np In this example, we will be using the fashion MNIST dataset to do some basic computer vision, where we will train a Keras neural network to classify items of clothing. In order to import the data we will be using the built-in function in Keras : keras . datasets . fashion_mnist . load_data () The model is a very simple neural network consisting of 2 fully connected layers. The model loss function is chosen in order to have a multiclass classifier : \"sparse_categorical_crossentropy\" Let's define a simple feedforward network. ##get and preprocess the data fashion_mnist = keras . datasets . fashion_mnist ( train_images , train_labels ), ( test_images , test_labels ) = fashion_mnist . load_data () train_images = train_images / 255.0 test_images = test_images / 255.0 ## define the model model = keras . 
Sequential ([ keras . layers . Flatten ( input_shape = ( 28 , 28 )), keras . layers . Dense ( 128 , activation = \"relu\" ), keras . layers . Dense ( 10 , activation = \"softmax\" ) ]) model . compile ( optimizer = \"adam\" , loss = \"sparse_categorical_crossentropy\" , metrics = [ \"acc\" ]) In order to automatically save the model weights to an hdf5 format after every epoch we need to import the ModelCheckpoint callback located in keras . callbacks from keras.callbacks import ModelCheckpoint We now need to define the ModelCheckpoint callback, which takes 7 arguments : filepath: string, path to save the model file. monitor: quantity to monitor verbose: verbosity mode, 0 or 1 save_best_only: if save_best_only=True , the latest best model according to the quantity monitored will not be overwritten. mode: one of {auto, min, max} If save_best_only=True , the decision to overwrite the current save file is made based on either the maximization or the minimization of the monitored quantity. save_weights_only: if True, then only the model's weights will be saved ( model.save_weights(filepath) ), else the full model is saved ( model.save(filepath) ). period: Interval (number of epochs) between checkpoints. The callback has to be added to the callbacks list in the fit method. save_to_hdf5 = ModelCheckpoint ( filepath = \"my_model.h5\" , monitor = 'acc' , verbose = 0 , save_best_only = True , save_weights_only = False , mode = 'auto' , period = 1 ) model . fit ( train_images , train_labels , epochs = 5 , callbacks = [ save_to_hdf5 ]) Epoch 1/5 60000/60000 [==============================] - 29s 478us/step - loss: 0.3975 - acc: 0.8575 Epoch 2/5 60000/60000 [==============================] - 98s 2ms/step - loss: 0.3498 - acc: 0.8721 Epoch 3/5 60000/60000 [==============================] - 95s 2ms/step - loss: 0.3213 - acc: 0.8825 Epoch 4/5 60000/60000 [==============================] - 64s 1ms/step - loss: 0.3021 - acc: 0.8887 Epoch 5/5 60000/60000 [==============================] - 61s 1ms/step - loss: 0.2855 - acc: 0.8953 <keras.callbacks.History at 0x1d70aa62f28> import os if \"my_model.h5\" in os . listdir () : print ( \"The model is saved to my_model.h5\" ) else : print ( 'no model saved to disk' ) The model is saved to my_model.h5","tags":"DL","url":"redoules.github.io/dl/model_checkpoint_keras.html","loc":"redoules.github.io/dl/model_checkpoint_keras.html"},{"title":"Logging the training progress in a CSV","text":"Let's see how we can log the progress and various metrics during the training process to a csv file. Let's first import some libraries import keras import numpy as np In this example, we will be using the fashion MNIST dataset to do some basic computer vision, where we will train a Keras neural network to classify items of clothing. In order to import the data we will be using the built-in function in Keras : keras . datasets . fashion_mnist . load_data () The model is a very simple neural network consisting of 2 fully connected layers. The model loss function is chosen in order to have a multiclass classifier : \"sparse_categorical_crossentropy\" Let's define a simple feedforward network. ##get and preprocess the data fashion_mnist = keras . datasets . fashion_mnist ( train_images , train_labels ), ( test_images , test_labels ) = fashion_mnist . load_data () train_images = train_images / 255.0 test_images = test_images / 255.0 ## define the model model = keras . Sequential ([ keras . layers . 
Flatten ( input_shape = ( 28 , 28 )), keras . layers . Dense ( 128 , activation = \"relu\" ), keras . layers . Dense ( 10 , activation = \"softmax\" ) ]) model . compile ( optimizer = \"adam\" , loss = \"sparse_categorical_crossentropy\" , metrics = [ \"accuracy\" , 'mae' ]) In order to stream the epoch results and metrics to a csv file, we define a CSV logger. It is a callback located in keras . callbacks Let's first import it from keras.callbacks import CSVLogger We now need to define the callback by specifying a file to be written to, the separator and whether to append to the file or erase it every time. The callback has to be added to the callbacks list in the fit method. csv_logger = CSVLogger ( filename = \"my_csv.csv\" , separator = ';' , append = False ) model . fit ( train_images , train_labels , epochs = 5 , callbacks = [ csv_logger ]) Epoch 1/5 60000/60000 [==============================] - 9s 148us/step - loss: 0.5020 - acc: 0.8234 - mean_absolute_error: 4.4200 Epoch 2/5 60000/60000 [==============================] - 8s 138us/step - loss: 0.3765 - acc: 0.8630 - mean_absolute_error: 4.4200 Epoch 3/5 60000/60000 [==============================] - 8s 129us/step - loss: 0.3371 - acc: 0.8789 - mean_absolute_error: 4.4200 Epoch 4/5 60000/60000 [==============================] - 8s 133us/step - loss: 0.3129 - acc: 0.8843 - mean_absolute_error: 4.4200 Epoch 5/5 60000/60000 [==============================] - 9s 151us/step - loss: 0.2952 - acc: 0.8916 - mean_absolute_error: 4.4200 <keras.callbacks.History at 0x1582adc6780> The results are stored in the my_csv.csv file and contain the epoch results import pandas as pd pd . read_csv ( \"my_csv.csv\" , sep = \";\" ) epoch acc loss mean_absolute_error 0 0 0.823400 0.502013 4.42 1 1 0.863050 0.376516 4.42 2 2 0.878867 0.337097 4.42 3 3 0.884317 0.312893 4.42 4 4 0.891583 0.295157 4.42","tags":"DL","url":"redoules.github.io/dl/csv_logger_keras.html","loc":"redoules.github.io/dl/csv_logger_keras.html"},{"title":"Random integer","text":"The random module comes with the randint function that returns a pseudo random number between 2 values : # Return a random integer N such that a <= N <= b. random . randint ( a , b ) import random min_int = 126 max_int = 211 print ( f \"My pseudo random number between {min_int} and {max_int} : {random.randint(min_int, max_int)}\" ) My pseudo random number between 126 and 211 : 206","tags":"Python","url":"redoules.github.io/python/randint.html","loc":"redoules.github.io/python/randint.html"},{"title":"Using tensorboard with Keras","text":"general workflow Let's see how we can get tensorboard to work with Keras-based TensorFlow code. import tensorflow as tf import keras import numpy as np In this example, we will be using the fashion MNIST dataset to do some basic computer vision, where we will train a Keras neural network to classify items of clothing. In order to import the data we will be using the built-in function in Keras : keras . datasets . fashion_mnist . load_data () The model is a very simple neural network consisting of 2 fully connected layers. The model loss function is chosen in order to have a multiclass classifier : \"sparse_categorical_crossentropy\" Finally, let's train the model for 5 epochs ##get and preprocess the data fashion_mnist = keras . datasets . 
fashion_mnist ( train_images , train_labels ), ( test_images , test_labels ) = fashion_mnist . load_data () train_images = train_images / 255.0 test_images = test_images / 255.0 ## define the model model = keras . Sequential ([ keras . layers . Flatten ( input_shape = ( 28 , 28 )), keras . layers . Dense ( 128 , activation = \"relu\" ), keras . layers . Dense ( 10 , activation = \"softmax\" ) ]) model . compile ( optimizer = \"adam\" , loss = \"sparse_categorical_crossentropy\" , metrics = [ \"accuracy\" ]) model . fit ( train_images , train_labels , epochs = 5 ) Epoch 1/5 60000/60000 [==============================] - 9s 143us/step - loss: 0.4939 - acc: 0.8254 Epoch 2/5 60000/60000 [==============================] - 11s 182us/step - loss: 0.3688 - acc: 0.8661 Epoch 3/5 60000/60000 [==============================] - 10s 169us/step - loss: 0.3305 - acc: 0.8798 Epoch 4/5 60000/60000 [==============================] - 21s 350us/step - loss: 0.3079 - acc: 0.8874 Epoch 5/5 60000/60000 [==============================] - 18s 302us/step - loss: 0.2889 - acc: 0.8927 <keras.callbacks.History at 0x235c1bc1be0> During the training we can see the progress, including the loss and the accuracy, in the output. test_loss , test_acc = model . evaluate ( test_images , test_labels ) print ( f \"Test accuracy : {test_acc}\" ) 10000/10000 [==============================] - 1s 67us/step Test accuracy : 0.8763 When the model finishes training, we get an accuracy of about 87%, and we output some sample predictions predictions = model . predict ( test_images ) print ( predictions [ 0 ]) [1.8075149e-05 3.6810281e-08 6.3094416e-07 5.1111499e-07 1.6264809e-06 3.5973577e-04 1.0840570e-06 3.1453002e-02 1.7062060e-06 9.6816361e-01] This kind of process only gives us minimal information during the training process. Setting up tensorboard To make it easier to understand, debug, and optimize TensorFlow programs, a suite of visualization tools called TensorBoard is included. You can use TensorBoard to visualize your TensorFlow graph, plot quantitative metrics about the execution of your graph, and show additional data like images that pass through it. Let's start by importing the time library and tensorboard itself. It can be found in tensorflow.python.keras.callbacks. from time import time from tensorflow.python.keras.callbacks import TensorBoard After having imported our data and defined our model, we specify a log directory where the training information will get written to. #keep in mind that we already imported the data and defined the model. tensorboard = TensorBoard ( log_dir = f \"logs/{time()}\" ) Finally, to tell Keras to call back to TensorBoard we refer to the instance of TensorBoard we created. model . compile ( optimizer = \"adam\" , loss = \"sparse_categorical_crossentropy\" , metrics = [ \"accuracy\" ]) Now, we need to execute the TensorBoard command pointing at the log directory previously specified. tensorboard --logdir=logs/ TensorBoard will return an http address TensorBoard 1.12.0 at http://localhost:6006 (Press CTRL+C to quit) Now, if we retrain, we can take a look in TensorBoard and start investigating the loss and accuracy model . 
fit ( train_images , train_labels , epochs = 5 , callbacks = [ tensorboard ]) Epoch 1/5 60000/60000 [==============================] - 41s 684us/step - loss: 0.4990 - acc: 0.8241 Epoch 2/5 60000/60000 [==============================] - 49s 812us/step - loss: 0.3765 - acc: 0.8648 Epoch 3/5 60000/60000 [==============================] - 46s 765us/step - loss: 0.3392 - acc: 0.8766 Epoch 4/5 60000/60000 [==============================] - 48s 794us/step - loss: 0.3135 - acc: 0.8836 Epoch 5/5 60000/60000 [==============================] - 49s 813us/step - loss: 0.2971 - acc: 0.8897 <keras.callbacks.History at 0x235be1c76d8> TensorBoard also gives access to a dynamic visualization of the graph","tags":"DL","url":"redoules.github.io/dl/tensorboard_keras.html","loc":"redoules.github.io/dl/tensorboard_keras.html"},{"title":"Install keras using conda","text":"Keras with the TensorFlow backend can be installed by running the following conda command conda install -c conda-forge keras tensorflow If you want an Intel CPU-optimized version, install tensorflow-mkl conda install -c conda-forge keras tensorflow-mkl A GPU-compatible version is also available conda install -c conda-forge keras tensorflow-gpu","tags":"DL","url":"redoules.github.io/dl/keras_install.html","loc":"redoules.github.io/dl/keras_install.html"},{"title":"Day 9 - Multiple Linear Regression","text":"Problem Here is a simple equation: $$Y=a+b_1\\cdot f_1+b_2\\cdot f_2+...+b_m\\cdot f_m$$ $$Y=a+\\sum_{i=1}^m b_i\\cdot f_i$$ for \\((m+1)\\) real constants \\((a,b_1, b_2, ..., b_m)\\) . We can say that the value of \\(Y\\) depends on \\(m\\) features. We study this equation for \\(n\\) different feature sets \\((f_1, f_2, ..., f_m)\\) and record each respective value of \\(Y\\) . If we have \\(q\\) new feature sets, and without accounting for bias and variance trade-offs, what is the value of \\(Y\\) for each of the sets? Python implementation import numpy as np m = 2 n = 7 x_1 = [ 0.18 , 0.89 ] y_1 = 109.85 x_2 = [ 1.0 , 0.26 ] y_2 = 155.72 x_3 = [ 0.92 , 0.11 ] y_3 = 137.66 x_4 = [ 0.07 , 0.37 ] y_4 = 76.17 x_5 = [ 0.85 , 0.16 ] y_5 = 139.75 x_6 = [ 0.99 , 0.41 ] y_6 = 162.6 x_7 = [ 0.87 , 0.47 ] y_7 = 151.77 q_1 = [ 0.49 , 0.18 ] q_2 = [ 0.57 , 0.83 ] q_3 = [ 0.56 , 0.64 ] q_4 = [ 0.76 , 0.18 ] With scikit learn X = np . array ([ x_1 , x_2 , x_3 , x_4 , x_5 , x_6 , x_7 ]) Y = np . array ([ y_1 , y_2 , y_3 , y_4 , y_5 , y_6 , y_7 ]) X_q = np . array ([ q_1 , q_2 , q_3 , q_4 ]) from sklearn import linear_model lm = linear_model . LinearRegression () lm . fit ( X , Y ) lm . predict ( X_q ) array([105.21455835, 142.67095131, 132.93605469, 129.70175405]) without scikit learn (but with numpy) from numpy.linalg import inv #center X_R = X - np . mean ( X , axis = 0 ) a = np . mean ( Y ) Y_R = Y - a #calculate b B = inv ( X_R . T @X_R ) @X_R.T@Y_R #predict X_new_R = X_q - np . mean ( X , axis = 0 ) Y_new_R = X_new_R @B Y_new = Y_new_R + a Y_new array([105.21455835, 142.67095131, 132.93605469, 129.70175405])
","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day9.html","loc":"redoules.github.io/blog/Statistics_10days-day9.html"},{"title":"Multiple Linear Regression","text":"If \\(Y\\) is linearly dependent only on \\(X\\) , then we can use the ordinary least square regression line, \\(\\hat{Y}=a+bX\\) . However, if \\(Y\\) shows linear dependency on \\(m\\) variables \\(X_1\\) , \\(X_2\\) , ..., \\(X_m\\) , then we need to find the values of \\(a\\) and \\(m\\) other constants ( \\(b_1\\) , \\(b_2\\) , ..., \\(b_m\\) ). We can then write the regression equation as: $$\\hat{Y}=a+\\sum_{i=1}^{m}b_iX_i$$ Matrix Form of the Regression Equation Let's consider that \\(Y\\) depends on two variables, \\(X_1\\) and \\(X_2\\) . We write the regression relation as \\(\\hat{Y}=a+b_1X_1+b_2X_2\\) . Consider the following matrix operation: $$\\begin{bmatrix} 1 & X_1 & X_2\\\\ \\end{bmatrix}\\cdot\\begin{bmatrix} a \\\\ b_1\\\\ b_2\\\\ \\end{bmatrix}=a+b_1X_1+b_2X_2$$ We define two matrices, \\(X\\) and \\(B\\) as: $$X=\\begin{bmatrix}1 & X_1 & X_2\\\\\\end{bmatrix}$$ $$B=\\begin{bmatrix}a \\\\b_1\\\\b_2\\\\\\end{bmatrix}$$ Now, we rewrite the regression relation as \\(\\hat{Y}=X\\cdot B\\) . This transforms the regression relation into matrix form. Generalized Matrix Form We will consider that \\(Y\\) shows a linear relationship with \\(m\\) variables, \\(X_1\\) , \\(X_2\\) , ..., \\(X_m\\) . 
Let's say that we made \\(n\\) observations on different tuples \\((x_1, x_2, ..., x_m)\\) : \\(y_1=a+b_1\\cdot x_{1,1} + b_2\\cdot x_{2,1} + ... + b_m\\cdot x_{m,1}\\) \\(y_2=a+b_1\\cdot x_{1,2} + b_2\\cdot x_{2,2} + ... + b_m\\cdot x_{m,2}\\) \\(...\\) \\(y_n=a+b_1\\cdot x_{1,n} + b_2\\cdot x_{2,n} + ... + b_m\\cdot x_{m,n}\\) Now, we can find the matrices: $$X=\\begin{bmatrix}1 & x_{1,1} & x_{2,1} & x_{3,1} & ... & x_{m,1} \\\\1 & x_{1,2} & x_{2,2} & x_{3,2} & ... & x_{m,2} \\\\1 & x_{1,3} & x_{2,3} & x_{3,3} & ... & x_{m,3} \\\\... & ... & ... & ... & ... & ... \\\\1 & x_{1,n} & x_{2,n} & x_{3,n} & ... & x_{m,n} \\\\\\end{bmatrix}$$ $$Y=\\begin{bmatrix}y_1 \\\\y_2\\\\y_3\\\\...\\\\y_n\\\\\\end{bmatrix}$$ Finding the Matrix B We know that \\(Y=X\\cdot B\\) $$\\Rightarrow X^T\\cdot Y=X^T\\cdot X \\cdot B$$ $$\\Rightarrow (X^T\\cdot X)^{-1}\\cdot X^T \\cdot Y=I\\cdot B$$ $$\\Rightarrow B= (X^T\\cdot X)^{-1}\\cdot X^T \\cdot Y$$ Finding the Value of Y Suppose we want to find the value of \\(Y\\) for some tuple \\((x_1, x_2, ..., x_m)\\) ; then $$Y=\\begin{bmatrix} 1 & x_1 & x_2 & ... & x_m\\\\ \\end{bmatrix}\\cdot B$$ Multiple Regression in Python We can use the fit function in the sklearn.linear_model.LinearRegression class. from sklearn import linear_model x = [[ 5 , 7 ], [ 6 , 6 ], [ 7 , 4 ], [ 8 , 5 ], [ 9 , 6 ]] y = [ 10 , 20 , 60 , 40 , 50 ] lm = linear_model . LinearRegression () lm . fit ( x , y ) a = lm . intercept_ b = lm . coef_ print ( f \"Linear regression coefficients between Y and X : a={a}, b_0={b[0]}, b_1={b[1]}\" ) Linear regression coefficients between Y and X : a=51.953488372092984, b_0=6.65116279069768, b_1=-11.162790697674419
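As a sketch of the matrix formula B = (X^T X)^-1 X^T Y derived above (an addition for illustration, assuming numpy is available), we can build the design matrix with a leading column of ones for the intercept and verify that the normal equations reproduce the sklearn coefficients printed above.

import numpy as np
from numpy.linalg import inv

x = np.array([[5, 7], [6, 6], [7, 4], [8, 5], [9, 6]], dtype=float)
y = np.array([10, 20, 60, 40, 50], dtype=float)

# design matrix: a leading column of ones carries the intercept a
X = np.hstack([np.ones((len(x), 1)), x])

# normal equations: B = (X^T X)^-1 X^T Y, giving [a, b_1, b_2]
B = inv(X.T @ X) @ X.T @ y
print(B)  # should match the sklearn intercept and coefficients above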
","tags":"Machine Learning","url":"redoules.github.io/machine-learning/Multiple_Linear_Regression.html","loc":"redoules.github.io/machine-learning/Multiple_Linear_Regression.html"},{"title":"Least Square Regression Line","text":"Linear Regression If our data shows a linear relationship between \\(X\\) and \\(Y\\) , then the straight line which best describes the relationship is the regression line. The regression line is given by \\(\\hat{Y}=a+bX\\) . Finding the value of b The value of \\(b\\) can be calculated using either of the following formulae: \\(b=\\frac{n\\sum(x_iy_i)-(\\sum x_i)(\\sum y_i)}{n\\sum(x_i^2)-(\\sum x_i)^2}\\) \\(b=\\rho\\frac{\\sigma_Y}{\\sigma_X}\\) , where \\(\\rho\\) is the Pearson correlation coefficient, \\(\\sigma_X\\) is the standard deviation of \\(X\\) and \\(\\sigma_Y\\) is the standard deviation of \\(Y\\) . Finding the value of a \\(a=\\bar{y}-b\\cdot\\bar{x}\\) , where \\(\\bar{x}\\) is the mean of \\(X\\) and \\(\\bar{y}\\) is the mean of \\(Y\\) . Coefficient of determination ( \\(R^2\\) ) The coefficient of determination can be computed with : \\(R^2 = \\frac{SSR}{SST}=1-\\frac{SSE}{SST}\\) Where : \\(SST\\) is the total Sum of Squares : \\(SST=\\sum (y_i-\\bar{y})^2\\) \\(SSR\\) is the regression Sum of Squares : \\(SSR=\\sum (\\hat{y_i}-\\bar{y})^2\\) \\(SSE\\) is the error Sum of Squares : \\(SSE=\\sum (\\hat{y_i}-y_i)^2\\) If \\(SSE\\) is small, we can assume that our fit is good. Linear Regression in Python We can use the fit function in the sklearn.linear_model.LinearRegression class. from sklearn import linear_model import numpy as np xl = [ 1 , 2 , 3 , 4 , 5 ] x = np . asarray ( xl ) . reshape ( - 1 , 1 ) y = [ 2 , 1 , 4 , 3 , 5 ] lm = linear_model . LinearRegression () lm . fit ( x , y ) print ( f 'a = {lm.intercept_}' ) print ( f 'b = {lm.coef_[0]}' ) print ( \"Where Y=a+b*X\" ) a = 0.5999999999999996 b = 0.8000000000000002 Where Y=a+b*X
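The closed-form expressions above can be checked directly; the following minimal numpy sketch (not part of the original post) computes b, a and the coefficient of determination R^2 for the same five points used in the sklearn example.

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 1, 4, 3, 5], dtype=float)
n = len(x)

# slope and intercept from the closed-form formulas
b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
a = y.mean() - b * x.mean()

# coefficient of determination from the sums of squares
y_hat = a + b * x
SST = np.sum((y - y.mean())**2)
SSR = np.sum((y_hat - y.mean())**2)
SSE = np.sum((y_hat - y)**2)
print(a, b)                      # 0.6 and 0.8, as above
print(SSR / SST, 1 - SSE / SST)  # both 0.64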
","tags":"Machine Learning","url":"redoules.github.io/machine-learning/LeastSquareRegressionLine.html","loc":"redoules.github.io/machine-learning/LeastSquareRegressionLine.html"},{"title":"Day 8 - Least Square Regression Line","text":"Least Square Regression Line Problem A group of five students enrolls in Statistics immediately after taking a Math aptitude test. Each student's Math aptitude test score, \\(x\\) , and Statistics course grade, \\(y\\) , can be expressed as the following list \\((x,y)\\) of points: \\((95, 85)\\) \\((85, 95)\\) \\((80, 70)\\) \\((70, 65)\\) \\((60, 70)\\) If a student scored an 80 on the Math aptitude test, what grade would we expect them to achieve in Statistics? Determine the equation of the best-fit line using the least squares method, then compute and print the value of \\(y\\) when \\(x=80\\) . 
X = [ 95 , 85 , 80 , 70 , 60 ] Y = [ 85 , 95 , 70 , 65 , 70 ] n = len ( X ) def cov ( X , Y , n ): x_mean = 1 / n * sum ( X ) y_mean = 1 / n * sum ( Y ) return 1 / n * sum ([( X [ i ] - x_mean ) * ( Y [ i ] - y_mean ) for i in range ( n )]) def stdv ( X , mu_x , n ): return ( sum ([( x - mu_x ) ** 2 for x in X ]) / n ) ** 0.5 def pearson_1 ( X , Y , n ): std_x = stdv ( X , 1 / n * sum ( X ), n ) std_y = stdv ( Y , 1 / n * sum ( Y ), n ) return cov ( X , Y , n ) / ( std_x * std_y ) b = pearson_1 ( X , Y , n ) * stdv ( Y , sum ( Y ) / n , n ) / stdv ( X , sum ( X ) / n , n ) a = sum ( Y ) / n - b * sum ( X ) / n print ( f \"If a student scored 80 on the math test, he would most likely score a {round(a+80*b,3)} in statistics\" ) If a student scored 80 on the math test, he would most likely score a 78.288 in statistics Pearson correlation coefficient Problem The regression line of \\(y\\) on \\(x\\) is \\(3x+4y+8=0\\) , and the regression line of \\(x\\) on \\(y\\) is \\(4x+3y+7=0\\) . What is the value of the Pearson correlation coefficient? Mathematical explanation The initial equation system is : $$ \\left\\{\\begin{array}{ l r } 3x+4y+8=0 & (1)\\\\ 4x+3y+7=0 & (2)\\\\ \\end{array} \\right. $$ So we can rewrite the 2 lines this way : $$ \\left\\{\\begin{array}{ l r } y=-2+(\\frac{-3}{4})x & (1)\\\\ x=-\\frac{7}{4}+(-\\frac{3}{4})y & (2)\\\\ \\end{array} \\right. $$ so \\(b_1=-\\frac{3}{4}\\) and \\(b_2=-\\frac{3}{4}\\) When we apply Pearson's coefficient formula : let \\(p\\) be the Pearson coefficient let \\(\\sigma_X\\) be the standard deviation of \\(x\\) let \\(\\sigma_Y\\) be the standard deviation of \\(y\\) We hence have $$ \\left\\{\\begin{array}{ l r } p=b_1\\left(\\frac{\\sigma_X}{\\sigma_Y}\\right) & (1)\\\\ p=b_2\\left(\\frac{\\sigma_Y}{\\sigma_X}\\right) & (2)\\\\ \\end{array} \\right. $$ by multiplying these 2 equations together we get $$p^2=b_1\\cdot b_2$$ $$p^2=\\left(-\\frac{3}{4}\\right)\\left(-\\frac{3}{4}\\right)$$ $$p^2=\\frac{9}{16}$$ finally we get \\(p=\\left(-\\frac{3}{4}\\right)\\) or \\(p=\\left(\\frac{3}{4}\\right)\\) Since \\(X\\) and \\(Y\\) are negatively correlated we have \\(p=\\left(-\\frac{3}{4}\\right)\\)
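As a quick cross-check of the regression answer above (an illustrative addition, assuming numpy is available), a degree-1 least-squares fit gives the same prediction at x = 80.

import numpy as np

x = [95, 85, 80, 70, 60]
y = [85, 95, 70, 65, 70]

b, a = np.polyfit(x, y, 1)   # least-squares slope and intercept
print(round(a + b * 80, 3))  # 78.288, matching the result above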
","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day8.html","loc":"redoules.github.io/blog/Statistics_10days-day8.html"},{"title":"Spearman's Rank Correlation Coefficient","text":"A rank correlation is any of several statistics that measure an ordinal association, that is, the relationship between rankings of different ordinal variables or different rankings of the same variable, where a \"ranking\" is the assignment of the ordering labels \"first\", \"second\", \"third\", etc. to different observations of a particular variable. A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them. We have two random variables \\(X\\) and \\(Y\\) : \\(X=\\{x_1, x_2, x_3, ..., x_n\\}\\) \\(Y=\\{y_1, y_2, y_3, ..., y_n\\}\\) if \\(Rank_X\\) and \\(Rank_Y\\) denote the respective ranks of each data point, then the Spearman's rank correlation coefficient, \\(r_s\\) , is the Pearson correlation coefficient of \\(Rank_X\\) and \\(Rank_Y\\) . What does it mean? The Spearman rank correlation coefficient is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function. The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson's correlation assesses linear relationships, Spearman's correlation assesses monotonic relationships (whether linear or not). 
If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other. A Spearman correlation of 1 results when the two variables being compared are monotonically related, even if their relationship is not linear. This means that all data-points with greater x-values than that of a given data-point will have greater y-values as well. In contrast, this does not give a perfect Pearson correlation. Example \\(X=\\{0.2, 1.3, 0.2, 1.1, 1.4, 1.5\\}\\) \\(Y=\\{1.9, 2.2, 3.1, 1.2, 2.2, 2.2\\}\\) $$ Rank_X \\quad \\begin{bmatrix} X: & 0.2 & 1.3 & 0.2 & 1.1 & 1.4 & 1.5 \\\\ Rank: & 1 & 3 & 1 & 2 & 4 & 5 \\end{bmatrix} \\quad $$ so, \\(Rank_X = \\{1, 3, 1, 2, 4, 5\\}\\) (tied values share the same rank) similarly, \\(Rank_Y=\\{2,3,4,1,3,3\\}\\) \\(r_s\\) equals the Pearson correlation coefficient of \\(Rank_X\\) and \\(Rank_Y\\) , meaning that \\(r_s=0.158114\\) Special case : \\(X\\) and \\(Y\\) don't contain duplicates $$r_s=1-\\frac{6\\sum d_i^2}{n(n^2-1)}$$ Where, \\(d_i\\) is the difference between the respective values of \\(Rank_X\\) and \\(Rank_Y\\) .
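The example value can be reproduced in a couple of lines of numpy (a sketch added for illustration, using the ranks exactly as assigned above), since Spearman's coefficient is the Pearson correlation of the rank vectors.

import numpy as np

rank_x = [1, 3, 1, 2, 4, 5]
rank_y = [2, 3, 4, 1, 3, 3]

# Spearman's r_s is the Pearson correlation of the ranks
r_s = np.corrcoef(rank_x, rank_y)[0, 1]
print(round(r_s, 6))  # 0.158114

Note that library routines that average tied ranks, such as scipy.stats.spearmanr, can return a slightly different value for data containing ties.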
","tags":"Mathematics","url":"redoules.github.io/mathematics/spearman.html","loc":"redoules.github.io/mathematics/spearman.html"},{"title":"Pearson correlation coefficient","text":"Covariance This is a measure of how two random variables change together, or the strength of their correlation. Consider two random variables, \\(X\\) and \\(Y\\) , each with \\(n\\) values (i.e., \\(x_1\\) , \\(x_2\\) , \\(...\\) , \\(x_n\\) and \\(y_1\\) , \\(y_2\\) , \\(...\\) , \\(y_n\\) ). The covariance of \\(X\\) and \\(Y\\) can be found using any of the following equivalent formulas: $$cov(X,Y)=\\frac{1}{n}\\sum_{i=1}^{n}(x_i-\\bar{x})\\cdot(y_i-\\bar{y})$$ or $$cov(X,Y)=\\frac{1}{n^2}\\sum_{i=1}^{n}\\sum_{j=1}^{n}\\frac{1}{2}(x_i-x_j)\\cdot(y_i-y_j)$$ $$cov(X,Y)=\\frac{1}{n^2}\\sum_{i}\\sum_{j\\gt i}^{n}(x_i-x_j)\\cdot(y_i-y_j)$$ where, \\(\\bar{x}\\) is the mean of \\(X\\) (or \\(\\mu_X\\) ) and \\(\\bar{y}\\) is the mean of \\(Y\\) (or \\(\\mu_Y\\) ) Pearson correlation coefficient The Pearson correlation coefficient, \\(\\rho_{X,Y}\\) , is given by : $$\\rho_{X,Y}=\\frac{cov(X,Y)}{\\sigma_X\\sigma_Y}=\\frac{\\sum_{i}(x_i-\\bar{x})(y_i-\\bar{y})}{n\\sigma_X\\sigma_Y}$$ Here, \\(\\sigma_X\\) is the standard deviation of \\(X\\) and \\(\\sigma_Y\\) is the standard deviation of \\(Y\\) . You may also see \\(\\rho_{X,Y}\\) written as \\(r_{X,Y}\\) . The Pearson correlation coefficient is a measure of the linear correlation between two variables X and Y.
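A small numpy sketch (an illustrative addition with randomly generated data) confirms that the mean-centred form and the pairwise-difference form of the covariance above agree, and that dividing by the product of the standard deviations gives the Pearson coefficient.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.normal(size=50)
n = len(x)

# mean-centred form of the covariance
cov_centred = np.mean((x - x.mean()) * (y - y.mean()))

# pairwise-difference form, summing over j > i
cov_pairwise = sum((x[i] - x[j]) * (y[i] - y[j])
                   for i in range(n) for j in range(i + 1, n)) / n**2

print(np.isclose(cov_centred, cov_pairwise))  # True
print(cov_centred / (x.std() * y.std()))      # the Pearson coefficient rho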
","tags":"Mathematics","url":"redoules.github.io/mathematics/pearson.html","loc":"redoules.github.io/mathematics/pearson.html"},{"title":"Day 7 - Pearson and spearman correlations","text":"Pearson correlation coefficient Problem Given two n-element data sets, \\(X\\) and \\(Y\\) , calculate the value of the Pearson correlation coefficient. Python implementation Using the formula $$\\rho_{X,Y}=\\frac{cov(X,Y)}{\\sigma_X\\sigma_Y}$$ where $$cov(X,Y)=\\frac{1}{n}\\sum_{i=1}^{n}(x_i-\\bar{x})\\cdot(y_i-\\bar{y})$$ n = 10 X = [ 10 , 9.8 , 8 , 7.8 , 7.7 , 7 , 6 , 5 , 4 , 2 ] Y = [ 200 , 44 , 32 , 24 , 22 , 17 , 15 , 12 , 8 , 4 ] def cov ( X , Y , n ): x_mean = 1 / n * sum ( X ) y_mean = 1 / n * sum ( Y ) return 1 / n * sum ([( X [ i ] - x_mean ) * ( Y [ i ] - y_mean ) for i in range ( n )]) def stdv ( X , mu_x , n ): return ( sum ([( x - mu_x ) ** 2 for x in X ]) / n ) ** 0.5 def pearson_1 ( X , Y , n ): std_x = stdv ( X , 1 / n * sum ( X ), n ) std_y = stdv ( Y , 1 / n * sum ( Y ), n ) return cov ( X , Y , n ) / ( std_x * std_y ) pearson_1 ( X , Y , n ) 0.6124721937208479 Python implementation Using the formula $$\\rho_{X,Y}=\\frac{\\sum_{i}(x_i-\\bar{x})(y_i-\\bar{y})}{n\\sigma_X\\sigma_Y}$$ def pearson_2 ( X , Y , n ): std_x = stdv ( X , 1 / n * sum ( X ), n ) std_y = stdv ( Y , 1 / n * sum ( Y ), n ) x_mean = 1 / n * sum ( X ) y_mean = 1 / n * sum ( Y ) return sum ([( X [ i ] - x_mean ) * ( Y [ i ] - y_mean ) for i in range ( n )]) / ( n * std_x * std_y ) pearson_2 ( X , Y , n ) 0.6124721937208479 Spearman's rank correlation coefficient Problem Given two \\(n\\) -element data sets, \\(X\\) and \\(Y\\) , calculate the value of Spearman's rank correlation coefficient. 
Python implementation We know that in this case, the values in each dataset are unique. Hence we can use the formula : $$r_s=1-\\frac{6\\sum d_i^2}{n(n^2-1)}$$ n = 10 X = [ 10 , 9.8 , 8 , 7.8 , 7.7 , 1.7 , 6 , 5 , 1.4 , 2 ] Y = [ 200 , 44 , 32 , 24 , 22 , 17 , 15 , 12 , 8 , 4 ] def spearman_rank ( X , Y , n ): rank_X = [ sorted ( X ) . index ( v ) + 1 for v in X ] rank_Y = [ sorted ( Y ) . index ( v ) + 1 for v in Y ] d = [( rank_X [ i ] - rank_Y [ i ]) ** 2 for i in range ( n )] return 1 - ( 6 * sum ( d )) / ( n * ( n * n - 1 )) spearman_rank ( X , Y , n ) 0.9030303030303031","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day7.html","loc":"redoules.github.io/blog/Statistics_10days-day7.html"},{"title":"Day 6 - The Central Limit Theorem","text":"Problem 1 A large elevator can transport a maximum of \\(9800\\) kg. Suppose a load of cargo containing \\(49\\) boxes must be transported via the elevator. The box weight of this type of cargo follows a distribution with a mean of \\(\\mu=205\\) kg and a standard deviation of \\(\\sigma=15\\) kg. 
Based on this information, what is the probability that all boxes can be safely loaded into the freight elevator and transported? Mathematical explanation This problem can be tackled with the central limit theorem. Since the number of boxes is \"large\", the sum of the weights approaches a normal distribution with : \\(\\mu' = n\\mu\\) \\(\\sigma'=\\sigma\\sqrt{n}\\) To get the probability that the total mass of all the boxes is under a certain weight, we can evaluate the cumulative distribution function : $$P(x<9800) = F_X(9800)$$ max_load = 9800 n = 49 mu = 205 st_dev = 15 import math def cumulative ( x , mean , sd ): return 0.5 * ( 1 + math . erf (( x - mean ) / ( sd * math . sqrt ( 2 )))) mu_group = n * mu st_dev_group = st_dev * math . sqrt ( n ) print ( f \"Probability that all the boxes can be lifted by the elevator : {cumulative(max_load, mu_group, st_dev_group)}\" ) Probability that all the boxes can be lifted by the elevator : 0.009815328628645315 Problem 2 The number of tickets purchased by each student for the University X vs. University Y football game follows a distribution that has a mean of \\(\\mu=2.4\\) and a standard deviation of \\(\\sigma=2.0\\) . A few hours before the game starts, \\(100\\) eager students line up to purchase last-minute tickets. If there are only \\(250\\) tickets left, what is the probability that all students will be able to purchase tickets? Mathematical explanation We want to know if the sum of all the purchases will exceed the total supply of tickets. Since the number of students is relatively large, the central limit theorem applies and the probability that all the students will be able to buy a ticket can be computed. The total number of tickets bought approximately follows a normal distribution of mean \\(\\mu'=n\\mu\\) and of standard deviation \\(\\sigma'=\\sigma\\sqrt{n}\\) ticket_supply = 250 n_students = 100 mu = 2.4 st_dev = 2 mu_group = n_students * mu st_dev_group = st_dev * math . sqrt ( n_students ) print ( f \"Probability that all the students can purchase tickets : {cumulative(ticket_supply, mu_group, st_dev_group)}\" ) Probability that all the students can purchase tickets : 0.691462461274013 Problem 3 You have a sample of \\(100\\) values from a population with mean \\(\\mu=500\\) and with standard deviation \\(\\sigma=80\\) . Compute the interval that covers the middle \\(95\\%\\) of the distribution of the sample mean; in other words, compute \\(A\\) and \\(B\\) such that \\(P(A<x<B)=0.95\\) . Use the value of \\(z=1.96\\) . Note that \\(z\\) is the z-score. Mathematical explanation The margin of error can be computed with : $$MoE = \\frac{z\\sigma}{\\sqrt{n}}$$ Knowing this, we can figure out which values of x bound the middle 95% of the distribution of the sample mean: this leaves 0.025 of probability on each side so that the total is 1. zScore = 1.96 std = 80 n = 100 mean = 500 marginOfError = zScore * std / math . sqrt ( n ) print ( \"A =\" , mean - marginOfError ) print ( \"B =\" , mean + marginOfError ) A = 484.32 B = 515.68
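As a quick cross-check of the three results above, here is a minimal sketch using the standard library's statistics.NormalDist (assuming Python 3.8+, and reusing the parameters given in the problems):
from statistics import NormalDist
import math

# Problem 1: P(total weight of the 49 boxes <= 9800 kg), CLT approximation
p1 = NormalDist(mu=49 * 205, sigma=15 * math.sqrt(49)).cdf(9800)
# Problem 2: P(total number of tickets bought by 100 students <= 250)
p2 = NormalDist(mu=100 * 2.4, sigma=2 * math.sqrt(100)).cdf(250)
# Problem 3: bounds of the middle 95% of the distribution of the sample mean
moe = 1.96 * 80 / math.sqrt(100)
print(round(p1, 6), round(p2, 6), 500 - moe, 500 + moe)
The printed values should agree with the results computed above (about 0.009815, 0.691462, 484.32 and 515.68).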
","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day6.html","loc":"redoules.github.io/blog/Statistics_10days-day6.html"},{"title":"Day 5 - Poisson and Normal distributions","text":"Poisson Distribution Problem 1 A random variable, \\(X\\) , follows a Poisson distribution with a mean of 2.5. Find the probability with which the random variable \\(X\\) is equal to 5. Mathematical explanation In this case, the answer is straightforward, we just need to compute the value of the Poisson distribution of mean 2.5 at 5: $$P(\\lambda = 2.5, x=5)=\\frac{\\lambda^ke^{-\\lambda}}{k!}$$ $$P(\\lambda = 2.5, x=5)=\\frac{2.5^5e^{-2.5}}{5!}$$ def factorial ( k ): return 1 if k <= 1 else k * factorial ( k - 1 ) from math import exp def poisson ( l , k ): return ( l ** k * exp ( - l )) / factorial ( k ) l = 2.5 k = 5 print ( f 'Probability that a random variable X following a Poisson distribution of mean {l} equals {k} : {round(poisson(l,k),3)}' ) Probability that a random variable X following a Poisson distribution of mean 2.5 equals 5 : 0.067 Problem 2 The manager of an industrial plant is planning to buy a machine of either type \\(A\\) or type \\(B\\) . For each day's operation: The number of repairs, \\(X\\) , that machine \\(A\\) needs is a Poisson random variable with mean 0.88. The daily cost of operating \\(A\\) is \\(C_A=160+40X^2\\) . 
The number of repairs, \\(Y\\) , that machine \\(B\\) needs is a Poisson random variable with mean 1.55. The daily cost of operating \\(B\\) is \\(C_B=128+40Y^2\\) . Assume that the repairs take a negligible amount of time and the machines are maintained nightly to ensure that they operate like new at the start of each day. What is the expected daily cost for each machine? Mathematical explanation The daily cost of each machine is an affine function of the square of a Poisson random variable. $$C_Z = a + bZ^2$$ Since the expectation is a linear operator : $$E[C_Z] = a + bE[Z^2]$$ Knowing that \\(Z\\) follows a Poisson distribution of mean \\(\\lambda\\) we have : $$E[C_Z] = a+ b(\\lambda + \\lambda^2)$$ averageX = 0.88 averageY = 1.55 CostX = 160 + 40 * ( averageX + averageX ** 2 ) CostY = 128 + 40 * ( averageY + averageY ** 2 ) print ( f 'Expected cost to run machine A : {round(CostX, 3)}' ) print ( f 'Expected cost to run machine B : {round(CostY, 3)}' ) Expected cost to run machine A : 226.176 Expected cost to run machine B : 286.1 Normal Distribution Problem 1 In a certain plant, the time taken to assemble a car is a random variable, \\(X\\) , having a normal distribution with a mean of 20 hours and a standard deviation of 2 hours. What is the probability that a car can be assembled at this plant in: Less than 19.5 hours? Between 20 and 22 hours? Mathematical explanation \\(X\\) is a real-valued random variable following a normal distribution : the probability of assembling the car in less than 19.5 hours is the cumulative distribution function of X evaluated at 19.5: $$P(X\\leq 19.5)=F_X(19.5)$$ For a normal distribution, the cumulative distribution function is : $$\\Phi(x)=\\frac{1}{2}\\left(1+erf\\left(\\frac{x-\\mu}{\\sigma\\sqrt{2}}\\right)\\right)$$ import math def cumulative ( x , mean , sd ): return 0.5 * ( 1 + math . erf (( x - mean ) / ( sd * math . sqrt ( 2 )))) mean = 20 sd = 2 print ( f 'Probability that the car is built in less than 19.5 hours : {round(cumulative(19.5,mean,sd),3)}' ) Probability that the car is built in less than 19.5 hours : 0.401 Similarly, the probability that a car is built between 20 and 22 hours can be computed with the cumulative distribution function: $$P(20\\leq x\\leq 22) = F_X(22)-F_X(20)$$ print ( f 'Probability that the car is built between 20 and 22 hours : {round(cumulative(22,mean,sd)-cumulative(20,mean,sd),3)}' ) Probability that the car is built between 20 and 22 hours : 0.341 Problem 2 The final grades for a Physics exam taken by a large group of students have a mean of \\(\\mu=70\\) and a standard deviation of \\(\\sigma=10\\) . If we can approximate the distribution of these grades by a normal distribution, what percentage of the students: Scored higher than 80 (i.e., have a \\(grade \\gt 80\\) )? Passed the test (i.e., have a \\(grade \\gt 60\\) )? Failed the test (i.e., have a \\(grade \\lt 60\\) )? 
Mathematical explanation Here again, we need to apply the cumulative distribution function to get the probabilities : Probability that they scored higher than 80 : $$P(X\\gt80) = 1- P(X\\lt80)$$ $$P(X\\gt80) = 1- F_X(80)$$ mean = 70 sd = 10 print ( f 'Probability that the student scored higher than 80 : {round(1- cumulative(80,mean,sd),3)}' ) Probability that the student scored higher than 80 : 0.159 Probability that they passed the test : $$P(X\\gt60) = 1- P(X\\lt60)$$ $$P(X\\gt60) = 1- F_X(60)$$ print ( f 'Probability that the student passed the test : {round(1- cumulative(60,mean,sd),3)}' ) Probability that the student passed the test : 0.841 Probability that they failed : $$P(X\\lt60) = F_X(60)$$ print ( f 'Probability that the student failed the test: {round(cumulative(60,mean,sd),3)}' ) Probability that the student failed the test: 0.159
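For readers who have scipy installed, the same answers can be cross-checked with scipy.stats; this is a sketch under that assumption, reusing the parameters of the problems above:
from scipy.stats import norm, poisson

# Poisson problems
print(round(poisson.pmf(5, 2.5), 3))                        # P(X = 5) for a mean of 2.5
print(round(160 + 40 * (0.88 + 0.88 ** 2), 3))              # expected daily cost of machine A
print(round(128 + 40 * (1.55 + 1.55 ** 2), 3))              # expected daily cost of machine B
# Normal distribution problems (sf is the survival function, 1 - cdf)
print(round(norm.cdf(19.5, loc=20, scale=2), 3))            # assembled in less than 19.5 hours
print(round(norm.cdf(22, 20, 2) - norm.cdf(20, 20, 2), 3))  # assembled between 20 and 22 hours
print(round(norm.sf(80, 70, 10), 3))                        # grade > 80
print(round(norm.sf(60, 70, 10), 3))                        # grade > 60
print(round(norm.cdf(60, 70, 10), 3))                       # grade < 60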
important'} },\" + \" linebreaks: { automatic: \"+ linebreak +\", width: '90% container' },\" + \" }, \" + \"}); \" + \"if ('default' !== 'default') {\" + \"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"}\"; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); }","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day5.html","loc":"redoules.github.io/blog/Statistics_10days-day5.html"},{"title":"Normal Distribution","text":"Normal Distribution The probability density of normal distribution is: $$\\mathcal{N}(\\mu,\\sigma^2)=\\frac{1}{\\sigma\\sqrt{2\\pi}}e^{-\\frac{(x-\\mu)^2}{2\\sigma^2}}$$ where, \\(\\mu\\) is the mean (or expectation) of the distribution. It is also equal to median and mode of the distribution. \\(\\sigma^2\\) is the variance. * \\(\\sigma\\) is the standard deviation. Standard Normal Distribution If \\(\\mu=0\\) and \\(\\sigma=1\\) , then the normal distribution is known as standard normal distribution: $$\\phi(x)=\\frac{e^{-\\frac{x^2}{2}}}{\\sigma\\sqrt{2\\pi}}$$ Every normal distribution can be represented as standard normal distribution: $$\\mathcal{N}(\\mu,\\sigma^2)=\\frac{1}{\\sigma}\\phi(\\frac{x-\\mu}{\\sigma})$$ Cumulative Probability Consider a real-valued random variable, \\(X\\) . The cumulative distribution function of \\(X\\) (or just the distribution function of \\(X\\) ) evaluated at \\(x\\) is the probability that \\(X\\) will take a value less than or equal to \\(x\\) : $$F_X(x)=P(X\\leq x)$$ also, $$P(a\\leq X\\leq b)=P(a\\lt X\\lt b)=F_X(b)-F_X(a)$$ the cumulative distribution function for a function with normal distribution is: $$\\Phi(x)=\\frac{1}{2}\\left(1+erf\\left(\\frac{x-\\mu}{\\sigma\\sqrt{2}}\\right)\\right)$$ where \\(erf\\) is the error function: $$erf(z)=\\frac{2}{\\sqrt{\\pi}}\\int_0^ze^{-x^2}dx$$ if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var align = \"center\", indent = \"0em\", linebreak = \"false\"; if (false) { align = (screen.width < 768) ? \"left\" : align; indent = (screen.width < 768) ? \"0em\" : indent; linebreak = (screen.width < 768) ? 'true' : linebreak; } var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? 
\"innerHTML\" : \"text\")] = \"MathJax.Hub.Config({\" + \" config: ['MMLorHTML.js'],\" + \" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } },\" + \" jax: ['input/TeX','input/MathML','output/HTML-CSS'],\" + \" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js'],\" + \" displayAlign: '\"+ align +\"',\" + \" displayIndent: '\"+ indent +\"',\" + \" showMathMenu: true,\" + \" messageStyle: 'normal',\" + \" tex2jax: { \" + \" inlineMath: [ ['\\\\\\\\(','\\\\\\\\)'] ], \" + \" displayMath: [ ['$$','$$'] ],\" + \" processEscapes: true,\" + \" preview: 'TeX',\" + \" }, \" + \" 'HTML-CSS': { \" + \" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} },\" + \" linebreaks: { automatic: \"+ linebreak +\", width: '90% container' },\" + \" }, \" + \"}); \" + \"if ('default' !== 'default') {\" + \"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"}\"; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); }","tags":"Mathematics","url":"redoules.github.io/mathematics/normal.html","loc":"redoules.github.io/mathematics/normal.html"},{"title":"Poisson Distribution","text":"Poisson Experiment Poisson experiment is a statistical experiment that has the following properties: The outcome of each trial is either success or failure. The average number of successes ( \\(\\lambda\\) ) that occurs in a specified region is known. The probability that a success will occur is proportional to the size of the region. The probability that a success will occur in an extremely small region is virtually zero. Poisson Distribution A Poisson random variable is the number of successes that result from a Poisson experiment. The probability distribution of a Poisson random variable is called a Poisson distribution: $$P(k,\\lambda)=\\frac{\\lambda^ke^{-\\lambda}}{k!}$$ where : \\(\\lambda\\) is the average number of successes that occur in a specified region. \\(k\\) is the actual number of successes that occur in a specified region. * \\(P(k,\\lambda)\\) is the Poisson probability, which is the probability of getting exactly \\(k\\) successes when the average number of successes is \\(\\lambda\\) . Example The average number of goals in the soccer world cup is 2.5. The probability that 4 goals are scored is then: $$p(\\lambda=2.5,k=4)=\\frac{2.5^4e^{-2.5}}{4!}=0.133$$ Expectation for the Poisson distribution Consider some Poisson random variable, \\(X\\) . Let \\(E[X]\\) be the expectation of \\(X\\) . Find the value of \\(E[X^2]\\) . Let \\(Var(X)\\) be the variance of \\(X\\) . 
Recall that if a random variable has a Poisson distribution, then: \\(E[X]=\\lambda\\) \\(Var[X]=\\lambda\\) Now, we'll use the following property of expectation and variance for any random variable, \\(X\\) : $$Var(X)=E[X^2]-(E[X])^2$$ $$E[X^2]=Var(X)+(E[X])^2$$ So, for any random variable having a Poisson distribution, the above result can be rewritten as: $$E[X^2]=\\lambda + \\lambda^2$$","tags":"Mathematics","url":"redoules.github.io/mathematics/poisson.html","loc":"redoules.github.io/mathematics/poisson.html"},{"title":"Geometric distribution","text":"Negative Binomial Experiment A negative binomial experiment is a statistical experiment that has the following properties: The experiment consists of n repeated trials. The trials are independent. The outcome of each trial is either success (s) or failure (f). \\(P(s)\\) is the same for every trial. 
The experiment continues until x successes are observed. If \\(X\\) is the number of experiments until the \\(x^{th}\\) success occurs, then \\(X\\) is a discrete random variable called a negative binomial random variable. Negative Binomial Distribution Consider the following probability mass function: $$b^*(x,n,p) = {\\binom{n-1}{x-1}}p^xq^{n-x}$$ The function above is negative binomial and has the following properties: The number of successes to be observed is \\(x\\) . The total number of trials is \\(n\\) . The probability of success of 1 trial is \\(p\\) . The probability of failure of 1 trial \\(q\\) , where \\(q=1-p\\) . \\(b^*(x,n,p)\\) is the negative binomial probability , meaning the probability of having exactly \\(x-1\\) successes out of the first \\(n-1\\) trials and then the \\(x^{th}\\) success on the \\(n^{th}\\) trial. Geometric Distribution The geometric distribution is a special case of the negative binomial distribution that deals with the number of Bernoulli trials required to get a success (i.e., counting the number of failures before the first success). Recall that \\(X\\) is the number of successes in \\(n\\) independent Bernoulli trials, so for each \\(i\\) (where \\(1\\leq i\\leq n\\) ): $$X_i = \\begin{cases} 1 & \\text{if the } i^{th} \\text{ trial is a success} \\\\ 0 & \\text{otherwise} \\end{cases}$$ The geometric distribution is a negative binomial distribution where the number of successes is 1. We express this with the following formula: $$g(n,p)=q^{n-1}p$$ Example Bob is a high school basketball player. He is a 70% free throw shooter, meaning his probability of making a free throw is 0.7. What is the probability that Bob makes his first free throw on his fifth shot? For this experiment, n=5, p=0.7 and q=0.3, so : $$g(n=5, p=0.7)=0.3^4 \\times 0.7=0.00567$$ 
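The free-throw example can be checked with a couple of lines of Python; a minimal sketch of the formula g(n,p)=q^{n-1}p (the function name geometric_pmf is ours):
def geometric_pmf(n, p):
    # probability that the first success happens on the n-th trial
    return (1 - p) ** (n - 1) * p

print(round(geometric_pmf(5, 0.7), 5))  # first made free throw on the fifth shot -> 0.00567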
important'} },\" + \" linebreaks: { automatic: \"+ linebreak +\", width: '90% container' },\" + \" }, \" + \"}); \" + \"if ('default' !== 'default') {\" + \"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"}\"; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); }","tags":"Mathematics","url":"redoules.github.io/mathematics/Geometric.html","loc":"redoules.github.io/mathematics/Geometric.html"},{"title":"Binomial distribution","text":"Binomial Experiment A binomial experiment (or Bernoulli trial) is a statistical experiment that has the following properties: The experiment consists of n repeated trials. The trials are independent. The outcome of each trial is either success (s) or failure (f). Binomial Distribution We define a binomial process to be a binomial experiment meeting the following conditions: The number of successes is \\(x\\) . The total number of trials is \\(n\\) . The probability of success of 1 trial is \\(p\\) . The probability of failure of 1 trial \\(q\\) , where \\(q=1-p\\) . \\(b(x,n,p)\\) is the binomial probability , meaning the probability of having exactly \\(x\\) successes out of \\(n\\) trials. The binomial random variable is the number of successes, \\(x\\) , out of \\(n\\) trials. The binomial distribution is the probability distribution for the binomial random variable, given by the following probability mass function: $$b(x,n,p) = \\frac{n!}{x!(n-x)!}p^xq^{n-x}$$ Python code for the Binomial distribution import math def bi_dist ( x , n , p ): b = ( math . factorial ( n ) / ( math . factorial ( x ) * math . factorial ( n - x ))) * ( p ** x ) * (( 1 - p ) ** ( n - x )) return ( b ) Using numpy import numpy as np n = 10 #number of coin toss p = 0.5 #probability samples = 1000 #number of samples s = np . random . binomial ( n , p , samples ) Using the stats module from scipy from scipy.stats import binom n = 10 #number of coin toss p = 0.5 #probability samples = 1000 #number of samples s = binom . rvs ( n , p , size = samples ) Cumulative probabilities A cumulative probability refers to the probability that the value of a random variable falls within a specified range. Frequently, cumulative probabilities refer to the probability that a random variable is less than or equal to a specified value. A fair coin is tossed 10 times. 
Probability of getting 5 heads The probability of getting exactly 5 heads is: $$b(x=5, n=10, p=0.5)=0.246$$ Probability of getting at least 5 heads The probability of getting at least 5 heads is: $$b(x\\geq 5, n=10, p=0.5)= \\sum_{r=5}^{10} b(x=r, n=10, p=0.5)$$ $$b(x\\geq 5, n=10, p=0.5)= 0.623$$ Probability of getting at most 5 heads The probability of getting at most 5 heads is: $$b(x\\leq 5, n=10, p=0.5)= \\sum_{r=0}^{5} b(x=r, n=10, p=0.5)$$ $$b(x\\leq 5, n=10, p=0.5)= 0.623$$","tags":"Mathematics","url":"redoules.github.io/mathematics/Binomial.html","loc":"redoules.github.io/mathematics/Binomial.html"},{"title":"Day 4 - Binomial and geometric distributions","text":"Binomial distribution Problem 1 The ratio of boys to girls for babies born in Russia is \\(r=\\frac{N_b}{N_g}=1.09\\) . If there is 1 child born per birth, what proportion of Russian families with exactly 6 children will have at least 3 boys? 
Mathematical explanation Let's first compute the probability of having a boy : $$p_b=\\frac{N_b}{N_b+N_g}$$ where: \\(N_b\\) is the number of boys \\(N_g\\) is the number of girls \\(r=\\frac{N_b}{N_g}\\) $$p_b=\\frac{1}{1+\\frac{1}{r}}$$ $$p_b=\\frac{r}{r+1}$$ r = 1.09 p_b = r / ( r + 1 ) print ( f \"The probability of having a boy is p={p_b:3f}\" ) The probability of having a boy is p=0.521531 The probability of getting exactly 3 boys in 6 children is given by : $$b(x=3, n=6, p=p_b)$$ In order to compute the proportion of Russian families with exactly 6 children that will have at least 3 boys, we need to compute the cumulative probability $$b(x\\geq 3, n=6, p=p_b) = \\sum_{i=3}^{6} b(x=i, n=6, p=p_b)$$ Let's code it ! import math def bi_dist ( x , n , p ): b = ( math . factorial ( n ) / ( math . factorial ( x ) * math . factorial ( n - x ))) * ( p ** x ) * (( 1 - p ) ** ( n - x )) return ( b ) b , p , n = 0 , p_b , 6 for i in range ( 3 , 7 ): b += bi_dist ( i , n , p ) print ( f \"probability of getting at least 3 boys in a family with exactly 6 children : {b:.3f}\" ) probability of getting at least 3 boys in a family with exactly 6 children : 0.696 Problem 2 A manufacturer of metal pistons finds that, on average, 12% of the pistons they manufacture are rejected because they are incorrectly sized. What is the probability that a batch of 10 pistons will contain: No more than 2 rejects? At least 2 rejects? Mathematical explanation On average 12% of the pistons are rejected, which means that a piston has a probability of \\(p_{rejected}=0.12\\) of being rejected. The probability of getting no more than 2 faulty pistons in a batch is : $$p(reject\\leq 2) = b(x\\leq 2, n= 10, p=p_{rejected})$$ $$p(reject\\leq 2) = \\sum_{i=0}^{2} b(x=i, n=10, p=p_{rejected})$$ b , p , n = 0 , 12 / 100 , 10 for i in range ( 0 , 3 ): b += bi_dist ( i , n , p ) print ( f \"The probability of getting no more than 2 faulty pistons in a batch is : {b:.3f}\" ) The probability of getting no more than 2 faulty pistons in a batch is : 0.891 The probability that a batch of 10 pistons will contain at least 2 rejects : $$p(reject\\geq 2) = b(x\\geq 2, n= 10, p=p_{rejected})$$ $$p(reject\\geq 2) = \\sum_{i=2}^{10} b(x=i, n=10, p=p_{rejected})$$ b , p , n = 0 , 12 / 100 , 10 for i in range ( 2 , 11 ): b += bi_dist ( i , n , p ) print ( f \"The probability of getting at least 2 faulty pistons in a batch is : {b:.3f}\" ) The probability of getting at least 2 faulty pistons in a batch is : 0.342 Geometric distribution Problem 1 The probability that a machine produces a defective product is \\(\\frac{1}{3}\\) . What is the probability that the first defect is found during the fifth inspection? Mathematical explanation In this case, we will use a geometric distribution to evaluate the probability : \\(n=5\\) \\(p=\\frac{1}{3}\\) Hence, the probability that the first defect is found during the fifth inspection is \\(g(n=5,p=1/3)\\) p , n = 1 / 3 , 5 print ( f \"The probability that the first defect is found during the fifth inspection is {round(((1-p)**(n-1)) * p, 3)}\" ) The probability that the first defect is found during the fifth inspection is 0.066 Problem 2 The probability that a machine produces a defective product is \\(\\frac{1}{3}\\) . What is the probability that the first defect is found during the first 5 inspections? 
Mathematical explanation In this problem, we need to compute the cumulative distribution function $$p(x \\leq 5) = \\sum_{i=1}^{5} g(n=i,p=1/3)$$ p_x5 = 0 p = 1 / 3 n = 5 for i in range ( 1 , n + 1 ): p_x5 += ( 1 - p ) ** ( i - 1 ) * p print ( f \"The probability that the first defect is found during the first 5 inspections is {round(p_x5, 3)}\" ) The probability that the first defect is found during the first 5 inspections is 0.868","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day4.html","loc":"redoules.github.io/blog/Statistics_10days-day4.html"},{"title":"Permutations and combinations","text":"Finding patterns in the possible ways events can occur is very useful in helping us count the number of desirable events in our sample space. Two of the easiest methods for doing this are with permutations (when order matters) and combinations (when order doesn't matter). Permutations An ordered arrangement of r objects from a set, A, of n objects (where \\(0\\lt r \\leq n\\) ) is called an r-element permutation of A. You can also think of this as a permutation of A's elements taken r at a time. 
The number of r-element permutations of an n-object set is denoted by the following formula: $$_{n}P_{r}=\\frac{n!}{(n-r)!}$$ Combinations An unordered arrangement of r objects from a set, A, of n objects (where \\(0\\lt r \\leq n\\) ) is called an r-element combination of A. You can also think of this as a combination of A's elements taken r at a time. Because the only difference between permutations and combinations is that combinations are unordered, we can easily find the number of r-element combinations by dividing out the permutations (r!): $$_{n}C_{r}=\\frac{_{n}P_{r}}{r!}=\\frac{n!}{r!(n-r)!}$$ When we talk about combinations, we're talking about the number of subsets of size r that can be made from a set of size n. In fact, \\(_{n}C_{r}\\) is often referred to as \"n choose r\", because it's counting the number of r-element combinations that can be chosen from a set of n elements. 
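Since Python 3.8, both counts are available in the standard library as math.perm and math.comb; a short sketch with arbitrary illustration values n=5 and r=3:
import math

n, r = 5, 3
print(math.perm(n, r))   # ordered arrangements: 5!/(5-3)! = 60
print(math.comb(n, r))   # unordered selections: 5!/(3!*2!) = 10
# the same values obtained directly from the factorial definitions
print(math.factorial(n) // math.factorial(n - r))
print(math.factorial(n) // (math.factorial(r) * math.factorial(n - r)))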
important'} },\" + \" linebreaks: { automatic: \"+ linebreak +\", width: '90% container' },\" + \" }, \" + \"}); \" + \"if ('default' !== 'default') {\" + \"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"}\"; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); }","tags":"Mathematics","url":"redoules.github.io/mathematics/Premutations_and_combinations.html","loc":"redoules.github.io/mathematics/Premutations_and_combinations.html"},{"title":"Conditional Probability","text":"This is defined as the probability of an event occurring, assuming that one or more other events have already occurred. Two events, A and B are considered to be independent if event A has no effect on the probability of event B (i.e. P(B|A)=P(A)). If events A and B are not independent, then we must consider the probability that both events occur. This can be referred to as the intersection of events A and B, defined as P(A∩B) = P(B|A)P(A). We can then use this definition to find the conditional probability by dividing the probability of the intersection of the two events (A∩B) by the probability of the event that is assumed to have already occurred (event A): $$ P(B|A)=\\frac{P(A\\cap B)}{P(A)}$$ if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var align = \"center\", indent = \"0em\", linebreak = \"false\"; if (false) { align = (screen.width < 768) ? \"left\" : align; indent = (screen.width < 768) ? \"0em\" : indent; linebreak = (screen.width < 768) ? 'true' : linebreak; } var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? \"innerHTML\" : \"text\")] = \"MathJax.Hub.Config({\" + \" config: ['MMLorHTML.js'],\" + \" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } },\" + \" jax: ['input/TeX','input/MathML','output/HTML-CSS'],\" + \" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js'],\" + \" displayAlign: '\"+ align +\"',\" + \" displayIndent: '\"+ indent +\"',\" + \" showMathMenu: true,\" + \" messageStyle: 'normal',\" + \" tex2jax: { \" + \" inlineMath: [ ['\\\\\\\\(','\\\\\\\\)'] ], \" + \" displayMath: [ ['$$','$$'] ],\" + \" processEscapes: true,\" + \" preview: 'TeX',\" + \" }, \" + \" 'HTML-CSS': { \" + \" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! 
important'} },\" + \" linebreaks: { automatic: \"+ linebreak +\", width: '90% container' },\" + \" }, \" + \"}); \" + \"if ('default' !== 'default') {\" + \"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"}\"; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); }","tags":"Mathematics","url":"redoules.github.io/mathematics/cond_prob.html","loc":"redoules.github.io/mathematics/cond_prob.html"},{"title":"Bayes' Theorem","text":"Let A and B be two events such that P(A|B) denotes the probability of the occurrence of A given that B has occurred and denotes the probability of the occurrence B of given that A has occurred, then: $$ P(A|B)=\\frac{P(B|A)P(A)}{P(B)}$$ $$ P(A|B)=\\frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|A^c)P(A^c)}$$ if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var align = \"center\", indent = \"0em\", linebreak = \"false\"; if (false) { align = (screen.width < 768) ? \"left\" : align; indent = (screen.width < 768) ? \"0em\" : indent; linebreak = (screen.width < 768) ? 'true' : linebreak; } var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? \"innerHTML\" : \"text\")] = \"MathJax.Hub.Config({\" + \" config: ['MMLorHTML.js'],\" + \" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } },\" + \" jax: ['input/TeX','input/MathML','output/HTML-CSS'],\" + \" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js'],\" + \" displayAlign: '\"+ align +\"',\" + \" displayIndent: '\"+ indent +\"',\" + \" showMathMenu: true,\" + \" messageStyle: 'normal',\" + \" tex2jax: { \" + \" inlineMath: [ ['\\\\\\\\(','\\\\\\\\)'] ], \" + \" displayMath: [ ['$$','$$'] ],\" + \" processEscapes: true,\" + \" preview: 'TeX',\" + \" }, \" + \" 'HTML-CSS': { \" + \" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! 
important'} },\" + \" linebreaks: { automatic: \"+ linebreak +\", width: '90% container' },\" + \" }, \" + \"}); \" + \"if ('default' !== 'default') {\" + \"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"}\"; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); }","tags":"Mathematics","url":"redoules.github.io/mathematics/Bayes.html","loc":"redoules.github.io/mathematics/Bayes.html"},{"title":"Day 3 - Conditionnal probability","text":"Conditionnal probability Problem Suppose a family has 2 children, one of which is a boy. What is the probability that both children are boys? Mathematical explanation Let's look at the possible outcomes : .dataframe tbody tr th:only-of-type { vertical-align: middle; }</p> <div class=\"highlight\"><pre><span></span><span class=\"na\">.dataframe</span> <span class=\"no\">tbody</span> <span class=\"no\">tr</span> <span class=\"no\">th</span> <span class=\"err\">{</span> <span class=\"nl\">vertical-align:</span> <span class=\"nf\">top</span><span class=\"c\">;</span> <span class=\"err\">}</span> <span class=\"na\">.dataframe</span> <span class=\"no\">thead</span> <span class=\"no\">th</span> <span class=\"err\">{</span> <span class=\"nl\">text-align:</span> <span class=\"nf\">right</span><span class=\"c\">;</span> <span class=\"err\">}</span> </pre></div> <p> B G B BB BG G GB GG We know that at least one of the children is a boy, so only \"GG\" is not possible. The event where the family has a new boy is then \"BB\". Hence the probability is : $$\\frac{BB}{BB+GB+BG}=\\frac{1}{3}$$ Draw 2 cards from a deck Problem You 2 draw cards from a standard 52-card deck without replacing them. What is the probability that both cards are of the same suit? Mathematical explanation There are 13 cards of each suit. Draw one card. It can be anything with probability of 1. Now there are 51 cards left and 12 of them are the same suit as the first card you drew. So the chance the second card matches the 1st is \\(\\frac{12}{51}\\) . Drawing marbles Problem A bag contains 3 red marbles and 4 blue marbles. Then, 2 marbles are drawn from the bag, at random, without replacement. If the first marble drawn is red, what is the probability that the second marble is blue? 
Mathematical explanation On the first draw, the probabilities are the following : we call B the event \"a blue ball is drawn\" and R the event \"a red ball is drawn\" \\(P(B)=\\frac{4}{7}\\) \\(P(R)=\\frac{3}{7}\\) On the second draw, if a red ball has been drawn at first, the probabilities are : \\(P(B|R)=\\frac{4}{6}\\) \\(P(R|R)=\\frac{2}{6}\\) Hence, the probability of drawing a blue ball if the first ball drawn was red is \\(\\frac{4}{6}=\\frac{2}{3}\\)","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day3.html","loc":"redoules.github.io/blog/Statistics_10days-day3.html"},{"title":"Day 2 - Probability, Compound Event Probability","text":"Basic probability with dice Problem In this challenge, we practice calculating probability. In a single toss of 2 fair (evenly-weighted) six-sided dice, find the probability that their sum will be at most 9. Mathematical explanation A nice way to think about sums-of-two-dice problems is to lay out the sums in a 6-by-6 grid in the obvious manner. 
+ | 1 2 3 4 5 6
1 | 2 3 4 5 6 7
2 | 3 4 5 6 7 8
3 | 4 5 6 7 8 9
4 | 5 6 7 8 9 10
5 | 6 7 8 9 10 11
6 | 7 8 9 10 11 12
We see that identical values sit on the same diagonal. The number of elements on a diagonal varies from 1 to 6 and then back to 1. Let's call A the sum of the 2 dice. $$P(A\\leq9)=\\sum_{i=2}^{9} P(A = i)$$ $$P(A\\leq9)=1-P(A\\gt9)$$ $$P(A\\leq9)=1-\\sum_{i=10}^{12} P(A = i)$$ The value of \\(P(A = i) = \\frac{i-1}{36}\\) if \\(i \\leq 7\\) and \\(P(A = i) = \\frac{13-i}{36}\\) otherwise, hence $$P(A\\leq9)=1-\\sum_{i=10}^{12} \\frac{13-i}{36}$$ $$P(A\\leq9)= 1-\\frac{6}{36}$$ $$P(A\\leq9)= \\frac{5}{6}$$ Let's program it sum ([ 1 for d1 in range ( 1 , 7 ) for d2 in range ( 1 , 7 ) if d1 + d2 <= 9 ]) / 36 0.8333333333333334 More dice Problem In a single toss of 2 fair (evenly-weighted) six-sided dice, find the probability that the values rolled by each die will be different and the two dice have a sum of 6. Mathematical explanation Let's consider 2 events : A and B. A compound event is a combination of 2 or more simple events. If A and B are simple events, then A∪B denotes the occurrence of either A or B. A∩B denotes the occurrence of A and B together. We denote A the event \"the values of the two dice are different\". The opposite event A' is \"the values of the two dice are the same\". $$P(A) = 1-P(A')$$ $$P(A)=1-\\frac{6}{36}$$ $$P(A)=\\frac{5}{6}$$ We denote B the event \"the two dice have a sum of 6\"; its probability can be read off the grid above : $$P(B)=\\frac{5}{36}$$ Given that the sum is 6, the probability that the two dice show different values is : $$P(A|B) = 4/5$$ (of the 5 outcomes that sum to 6, only (3,3) has identical values). The probability that both A and B occur is equal to P(A∩B). Since \\(P(A|B)=\\frac{P(A∩B)}{P(B)}\\) $$P(A∩B)=P(B)*P(A|B)$$ $$P(A∩B)=5/36*4/5$$ $$P(A∩B)=1/9$$ Let's program it sum ([ 1 for d1 in range ( 1 , 7 ) for d2 in range ( 1 , 7 ) if ( d1 + d2 == 6 ) and ( d1 != d2 )]) / 36 0.1111111111111111 Compound Event Probability Problem There are 3 urns labeled X, Y, and Z. Urn X contains 4 red balls and 3 black balls. Urn Y contains 5 red balls and 4 black balls. Urn Z contains 4 red balls and 4 black balls. One ball is drawn from each of the urns. What is the probability that, of the 3 balls drawn, 2 are red and 1 is black? Mathematical explanation Let's write the different probabilities:
Urn X : red ball \\(\\frac{4}{7}\\) , black ball \\(\\frac{3}{7}\\)
Urn Y : red ball \\(\\frac{5}{9}\\) , black ball \\(\\frac{4}{9}\\)
Urn Z : red ball \\(\\frac{1}{2}\\) , black ball \\(\\frac{1}{2}\\)
Addition rule A and B are said to be mutually exclusive or disjoint if they have no events in common (i.e., A∩B=∅ and P(A∩B)=0). The probability of any of 2 or more events occurring is the union (∪) of events. 
Because disjoint probabilities have no common events, the probability of the union of disjoint events is the sum of the events' individual probabilities. A and B are said to be collectively exhaustive if their union covers all events in the sample space (i.e., A∪B=S and P(A∪B)=1). This brings us to our next fundamental rule of probability: if 2 events, A and B, are disjoint, then the probability of either event is the sum of the probabilities of the 2 events (i.e., P(A or B) = P(A)+P(B)). Multiplication rule If the outcome of the first event (A) has no impact on the second event (B), then they are considered to be independent (e.g., tossing a fair coin). This brings us to the next fundamental rule of probability: the multiplication rule. It states that if two events, A and B, are independent, then the probability of both events is the product of the probabilities for each event (i.e., P(A and B) = P(A)xP(B)). The chance of all events occurring in a sequence of events is called the intersection (∩) of those events. The balls drawn from the urns are independent, hence : p = P(2 red (R) and 1 black (B)) $$p = P(RRB) + P(RBR) + P(BRR)$$ Each of those 3 probabilities is equal to the product of the probabilities of drawing each ball \\(P(RRB) = P(R|X) * P(R|Y) * P(B|Z) = 4/7*5/9*1/2\\) \\(P(RRB) = 20/126\\) \\(P(RBR) = 16/126\\) \\(P(BRR) = 15/126\\) this leads to \\(p = 51/126\\) and finally $$p = \\frac{17}{42}$$ Let's program it X = 3 * [ \"B\" ] + 4 * [ \"R\" ] Y = 4 * [ \"B\" ] + 5 * [ \"R\" ] Z = 4 * [ \"B\" ] + 4 * [ \"R\" ] target = [ \"BRR\" , \"RRB\" , \"RBR\" ] sum ([ 1 for x in X for y in Y for z in Z if x + y + z in target ]) / sum ([ 1 for x in X for y in Y for z in Z ]) 0.40476190476190477
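For an exact check of the result above, the same sum can be carried out with the standard library's fractions module; a minimal sketch:
from fractions import Fraction

p_RRB = Fraction(4, 7) * Fraction(5, 9) * Fraction(1, 2)
p_RBR = Fraction(4, 7) * Fraction(4, 9) * Fraction(1, 2)
p_BRR = Fraction(3, 7) * Fraction(5, 9) * Fraction(1, 2)
p = p_RRB + p_RBR + p_BRR
print(p, float(p))   # 17/42 and its decimal value, matching the enumeration above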
important'} },\" + \" linebreaks: { automatic: \"+ linebreak +\", width: '90% container' },\" + \" }, \" + \"}); \" + \"if ('default' !== 'default') {\" + \"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"}\"; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); }","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day2.html","loc":"redoules.github.io/blog/Statistics_10days-day2.html"},{"title":"Day 1 - Quartiles, Interquartile Range and standard deviation","text":"Quartile Definition A quartile is a type of quantile. The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highest value of the data set. Implementation in python without using the scientific libraries def median ( l ): l = sorted ( l ) if len ( l ) % 2 == 0 : return ( l [ len ( l ) // 2 ] + l [( len ( l ) // 2 - 1 )]) / 2 else : return l [ len ( l ) // 2 ] def quartiles ( l ): # check the input is not empty if not l : raise StatsError ( 'no data points passed' ) # 1. order the data set l = sorted ( l ) # 2. divide the data set in two halves mid = int ( len ( l ) / 2 ) Q2 = median ( l ) if ( len ( l ) % 2 == 0 ): # even Q1 = median ( l [: mid ]) Q3 = median ( l [ mid :]) else : # odd Q1 = median ( l [: mid ]) # same as even Q3 = median ( l [ mid + 1 :]) return ( Q1 , Q2 , Q3 ) L = [ 3 , 7 , 8 , 5 , 12 , 14 , 21 , 13 , 18 ] Q1 , Q2 , Q3 = quartiles ( L ) print ( f \"Sample : {L} \\n Q1 : {Q1}, Q2 : {Q2}, Q3 : {Q3}\" ) Sample : [3, 7, 8, 5, 12, 14, 21, 13, 18] Q1 : 6.0, Q2 : 12, Q3 : 16.0 Interquartile Range Definition The interquartile range of an array is the difference between its first (Q1) and third (Q3) quartiles. Hence the interquartile range is Q3-Q1 Implementation in python without using the scientific libraries print ( f \"Interquatile range : {Q3-Q1}\" ) Interquatile range : 10.0 Standard deviation Definition The standard deviation (σ) is a measure that is used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values. The standard deviation can be computed with the formula: where µ is the mean : Implementation in python without using the scientific libraries import math X = [ 10 , 40 , 30 , 50 , 20 ] mean = sum ( X ) / len ( X ) X = [( x - mean ) ** 2 for x in X ] std = math . 
sqrt ( sum ( X ) / len ( X ) ) print ( f \"The distribution {X} has a standard deviation of {std}\" ) The distribution [400.0, 100.0, 0.0, 400.0, 100.0] has a standard deviation of 14.142135623730951","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day1.html","loc":"redoules.github.io/blog/Statistics_10days-day1.html"},{"title":"Counting values in an array","text":"Using lists If you want to count the number of occurrences of an element in a list you can use the .count() function of the list object arr = [ 1 , 2 , 3 , 3 , 4 , 5 , 3 , 6 , 7 , 7 ] print ( f 'Array : {arr} \\n ' ) print ( f 'The number 3 appears {arr.count(3)} times in the list' ) print ( f 'The number 7 appears {arr.count(7)} times in the list' ) print ( f 'The number 4 appears {arr.count(4)} times in the list' ) Array : [1, 2, 3, 3, 4, 5, 3, 6, 7, 7] The number 3 appears 3 times in the list The number 7 appears 2 times in the list The number 4 appears 1 times in the list Using collections you can get a dictionary of the number of occurrences of each element in a list thanks to the collections module, like this import collections collections . Counter ( arr ) Counter({1: 1, 2: 1, 3: 3, 4: 1, 5: 1, 6: 1, 7: 2}) Using numpy You can get a similar result with numpy by using the unique function import numpy as np arr = np . array ( arr ) unique , counts = np . unique ( arr , return_counts = True ) dict ( zip ( unique , counts )) {1: 1, 2: 1, 3: 3, 4: 1, 5: 1, 6: 1, 7: 2}","tags":"Python","url":"redoules.github.io/python/counting.html","loc":"redoules.github.io/python/counting.html"},{"title":"Building a dictionary using comprehension","text":"An easy way to create a dictionary in python is to use the comprehension syntax. It can be more expressive, hence easier to read. d = { key : value for ( key , value ) in iterable } In the example below we use a dictionary comprehension to build a dictionary from a source list. iterable = list ( range ( 10 )) d = { str ( value ): value ** 2 for value in iterable } # create a dictionary linking the string value of a number with the square value of this number print ( d ) {'0': 0, '1': 1, '2': 4, '3': 9, '4': 16, '5': 25, '6': 36, '7': 49, '8': 64, '9': 81} Of course, you can use another iterable and repack it with the comprehension syntax. In the following example, we convert a list of tuples into a dictionary. iterable = [( \"France\" , 67.12e6 ), ( \"UK\" , 66.02e6 ), ( \"USA\" , 325.7e6 ), ( \"China\" , 1386e6 ), ( \"Germany\" , 82.79e6 )] population = { key : value for ( key , value ) in iterable } print ( population ) {'France': 67120000.0, 'UK': 66020000.0, 'USA': 325700000.0, 'China': 1386000000.0, 'Germany': 82790000.0}","tags":"Python","url":"redoules.github.io/python/dict_comprehension.html","loc":"redoules.github.io/python/dict_comprehension.html"},{"title":"Extracting unique values from a list or an array","text":"Using lists An easy way to extract the unique values of a list in python is to convert the list to a set. A set is an unordered collection of items. Every element is unique (no duplicates) and must be immutable. 
my_list = [ 10 , 20 , 30 , 40 , 20 , 50 , 60 , 40 ] print ( f \"Original List : {my_list}\" ) my_set = set ( my_list ) my_new_list = list ( my_set ) # the set is converted back to a list with the list() function print ( f \"List of unique numbers : {my_new_list}\" ) Original List : [10, 20, 30, 40, 20, 50, 60, 40] List of unique numbers : [40, 10, 50, 20, 60, 30] Using numpy If you are using numpy you can extract the unique values of an array with the unique function builtin numpy: import numpy as np arr = np . array ( my_list ) print ( f 'Initial numpy array : {arr} \\n ' ) unique_arr = np . unique ( arr ) print ( f 'Numpy array with unique values : {unique_arr}' ) Initial numpy array : [10 20 30 40 20 50 60 40] Numpy array with unique values : [10 20 30 40 50 60]","tags":"Python","url":"redoules.github.io/python/unique.html","loc":"redoules.github.io/python/unique.html"},{"title":"Sorting an array","text":"Using lists Python provides an iterator to sort an array sorted() you can use it this way : import random # Random lists from [0-999] interval arr = [ random . randint ( 0 , 1000 ) for r in range ( 10 )] print ( f 'Initial random list : {arr} \\n ' ) reversed_arr = list ( sorted ( arr )) print ( f 'Sorted list : {reversed_arr}' ) Initial random list : [277, 347, 976, 367, 604, 878, 148, 670, 229, 432] Sorted list : [148, 229, 277, 347, 367, 432, 604, 670, 878, 976] it is also possible to use the sort function from the list object # Random lists from [0-999] interval arr = [ random . randint ( 0 , 1000 ) for r in range ( 10 )] print ( f 'Initial random list : {arr} \\n ' ) arr . sort () print ( f 'Sorted list : {arr}' ) Initial random list : [727, 759, 68, 103, 23, 90, 258, 737, 791, 567] Sorted list : [23, 68, 90, 103, 258, 567, 727, 737, 759, 791] Using numpy If you are using numpy you can sort an array by creating a view on the array: import numpy as np arr = np . random . random ( 5 ) print ( f 'Initial random array : {arr} \\n ' ) sorted_arr = np . sort ( arr ) print ( f 'Sorted array : {sorted_arr}' ) Initial random array : [0.40021786 0.13876208 0.19939047 0.46015169 0.43734158] Sorted array : [0.13876208 0.19939047 0.40021786 0.43734158 0.46015169]","tags":"Python","url":"redoules.github.io/python/sorting.html","loc":"redoules.github.io/python/sorting.html"},{"title":"Day 0 - Median, mean, mode and weighted mean","text":"A reminder The median The median is the value separating the higher half from the lower half of a data sample. For a data set, it may be thought of as the middle value. For a continuous probability distribution, the median is the value such that a number is equally likely to fall above or below it. The mean The arithmetic mean (or simply mean) of a sample is the sum of the sampled values divided by the number of items. The mode The mode of a set of data values is the value that appears most often. It is the value x at which its probability mass function takes its maximum value. In other words, it is the value that is most likely to be sampled. Implementation in python without using the scientific libraries def median ( l ): l = sorted ( l ) if len ( l ) % 2 == 0 : return ( l [ len ( l ) // 2 ] + l [( len ( l ) // 2 - 1 )]) / 2 else : return l [ len ( l ) // 2 ] def mean ( l ): return sum ( l ) / len ( l ) def mode ( data ): dico = { x : data . count ( x ) for x in list ( set ( data ))} return sorted ( sorted ( dico . 
items ()), key = lambda x : x [ 1 ], reverse = True )[ 0 ][ 0 ] L = [ 64630 , 11735 , 14216 , 99233 , 14470 , 4978 , 73429 , 38120 , 51135 , 67060 , 4978 , 73429 ] print ( f \"Sample : {L} \\n Mean : {mean(L)}, Median : {median(L)}, Mode : {mode(L)}\" ) Sample : [64630, 11735, 14216, 99233, 14470, 4978, 73429, 38120, 51135, 67060, 4978, 73429] Mean : 43117.75, Median : 44627.5, Mode : 4978 The weighted average The weighted arithmetic mean is similar to an ordinary arithmetic mean (the most common type of average), except that instead of each of the data points contributing equally to the final average, some data points contribute more than others. data = [ 10 , 40 , 30 , 50 , 20 ] weights = [ 1 , 2 , 3 , 4 , 5 ] sum_X = sum ([ x * w for x , w in zip ( data , weights )]) print ( round (( sum_X / sum ( weights )), 1 )) 32.0","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day0.html","loc":"redoules.github.io/blog/Statistics_10days-day0.html"},{"title":"Create a simple bash function","text":"A basic function The synthaxe to define a function is : #!/bin/bash # Basic function my_function () { echo Text displayed by my_function } #once defined, you can use it like so : my_function and it should return user@bash : ./my_function.sh Text displayed by my_function Function with arguments When used, the arguments are specified directly after the function name. Whithin the function they are accessible this the $ symbol followed by the number of the arguement. Hence $1 will take the value of the first arguement, $2 will take the value of the second arguement and so on. #!/bin/bash # Passing arguments to a function say_hello () { echo Hello $1 } say_hello Guillaume and it should return user@bash : ./function_arguements.sh Hello Guillaume Overriding Commands Using the previous example, let's override the echo function in order to make it say hello. To do so, you just need to name the function with the same name as the command you want to replace. When you are calling the original function, make sure you are using the builtin keyword #!/bin/bash # Overriding a function echo () { builtin echo Hello $1 } echo Guillaume user@bash : ./function_arguements.sh Hello Guillaume Returning values Use the keyword return to send back a value to the main program. The returned value will be stored in the $? variable #!/bin/bash # Retruning a value secret_number () { return 126 } secret_number echo The secret number is $? This code should return user@bash : ./retrun_value.sh The secret number is 126","tags":"Linux","url":"redoules.github.io/linux/simple_bash_function.html","loc":"redoules.github.io/linux/simple_bash_function.html"},{"title":"Number of edges in a Complete graph","text":"A complete graph contains \\(\\frac{n(n-1)}{2}\\) edges where \\(n\\) is the number of vertices (or nodes). if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var align = \"center\", indent = \"0em\", linebreak = \"false\"; if (false) { align = (screen.width < 768) ? \"left\" : align; indent = (screen.width < 768) ? \"0em\" : indent; linebreak = (screen.width < 768) ? 'true' : linebreak; } var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? 
\"innerHTML\" : \"text\")] = \"MathJax.Hub.Config({\" + \" config: ['MMLorHTML.js'],\" + \" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } },\" + \" jax: ['input/TeX','input/MathML','output/HTML-CSS'],\" + \" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js'],\" + \" displayAlign: '\"+ align +\"',\" + \" displayIndent: '\"+ indent +\"',\" + \" showMathMenu: true,\" + \" messageStyle: 'normal',\" + \" tex2jax: { \" + \" inlineMath: [ ['\\\\\\\\(','\\\\\\\\)'] ], \" + \" displayMath: [ ['$$','$$'] ],\" + \" processEscapes: true,\" + \" preview: 'TeX',\" + \" }, \" + \" 'HTML-CSS': { \" + \" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} },\" + \" linebreaks: { automatic: \"+ linebreak +\", width: '90% container' },\" + \" }, \" + \"}); \" + \"if ('default' !== 'default') {\" + \"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {\" + \"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;\" + \"VARIANT['normal'].fonts.unshift('MathJax_default');\" + \"VARIANT['bold'].fonts.unshift('MathJax_default-bold');\" + \"VARIANT['italic'].fonts.unshift('MathJax_default-italic');\" + \"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');\" + \"});\" + \"}\"; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); }","tags":"Mathematics","url":"redoules.github.io/mathematics/Number_edges_Complete_graph.html","loc":"redoules.github.io/mathematics/Number_edges_Complete_graph.html"},{"title":"Reverse an array","text":"Using lists Python provides an iterator to reverse an array reversed() you can use it this way : arr = list ( range ( 5 )) print ( f 'Initial array : {arr} \\n ' ) reversed_arr = list ( reversed ( arr )) print ( f 'Reversed array : {reversed_arr}' ) Initial array : [0, 1, 2, 3, 4] Reversed array : [4, 3, 2, 1, 0] Using numpy If you are using numpy you can reverse an array by creating a view on the array: import numpy as np arr = np . arange ( 5 ) print ( f 'Initial array : {arr} \\n ' ) reversed_arr = arr [:: - 1 ] print ( f 'Reversed array : {reversed_arr}' ) Initial array : [0 1 2 3 4] Reversed array : [4 3 2 1 0]","tags":"Python","url":"redoules.github.io/python/reverse.html","loc":"redoules.github.io/python/reverse.html"},{"title":"Advice for designing your own libraries","text":"Advice for designing your own libraries When designing your own library make sure to think of the following things. I will add new paragraphs to this article as I dicover new good practices. Use standard python objects Try to use standard python objects as much as possible. That way, your library becomes compatible with all the other python libaries. For instance, when I created SAMpy : a library for reading and writing SAMCEF results, it returned dictonnaries, lists and pandas dataframes. Hence the results extracted from SAMCEF where compatible with all the scientific stack of python. Limit the number of functionnalities Following the same logic as before, the objects should do only one thing but do it well. 
Indeed, having a simple interface will reduce the complexity of your code and make it easier to use your library. Again, with SAMpy, I decided to strictly limit the functionalities to reading and writing SAMCEF files. Define an exception class for your library You should define your own exceptions in order to make it easier for your users to debug their code thanks to clearer messages that convey more meaning. That way, the user will know if the error comes from your library or something else. Bonus if you group similar exceptions in a hierarchy of inherited Exception classes. Example : let's create an Exception related to the age of a person : def check_age ( age ): if age < 0 or age > 130 : raise ValueError If the user inputs an invalid age, the ValueError exception will be raised. That's fine, but imagine you want to provide more feedback to users who don't know the internals of your library. Let's now create a self-explanatory Exception class AgeInvalidError ( ValueError ): pass def check_age ( age ): if age < 0 or age > 130 : raise AgeInvalidError ( age ) You can also add some helpful text to guide your users along the way: class AgeInvalidError ( ValueError ): def __init__ ( self , age ): super () . __init__ ( f \"Age invalid ({age}), must be between 0 and 130\" ) def check_age ( age ): if age < 0 or age > 130 : raise AgeInvalidError ( age ) If you want to group all the logically linked exceptions, you can create a base class and inherit from it : class BaseAgeInvalidError ( ValueError ): pass class TooYoungError ( BaseAgeInvalidError ): pass class TooOldError ( BaseAgeInvalidError ): pass def check_age ( age ): if age < 0 : raise TooYoungError ( age ) elif age > 130 : raise TooOldError ( age ) Structure your repository You should have a file structure in your repository. It will help other contributors, especially future contributors. A nice directory structure for your project should look like this: README.md LICENSE setup.py requirements.txt ./MyPackage ./docs ./tests For the README, some prefer to use reStructuredText; I personally prefer Markdown. choosealicense.com will help you pick the license to use for your project. For package and distribution management, create a setup.py file at the root of the directory. The list of dependencies required to test, build and generate the docs is listed in a pip requirements file placed at the root of the directory and named requirements.txt Put the documentation of your library in the docs directory. Put your tests in the tests directory. Since your tests will need to import your library, I recommend modifying the path to resolve your package properly. In order to do so, you can create a context.py file located in the tests directory : import os import sys sys . path . insert ( 0 , os . path . abspath ( os . path . join ( os . path . dirname ( __file__ ), '..' ))) import MyPackage Then within your individual test files you can import your package like so : from .context import MyPackage Finally, your code will go into the MyPackage directory Test your code Once your library is in production, you have to guarantee some level of forward compatibility. Once your interface is defined, write some tests. In the future, when your code is modified, having those tests will make sure that the behaviour of your functions and objects won't be altered. Document your code Of course, you should have documentation to go along with your library. Make sure to add a lot of common examples, as most users tend to learn from examples. 
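For instance, a documented function could pair a short description with a doctest-style example that users can copy directly (a minimal, hypothetical sketch; the function below is not part of SAMpy):
def fahrenheit_to_celsius ( temperature ):
    \"\"\"Convert a temperature from degrees Fahrenheit to degrees Celsius.

    Example
    -------
    >>> fahrenheit_to_celsius(212.0)
    100.0
    \"\"\"
    # the classic conversion formula, kept simple on purpose
    return ( temperature - 32.0 ) * 5.0 / 9.0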
I recommend writing your documentation using Sphinx.","tags":"Python","url":"redoules.github.io/python/design_own_libs.html","loc":"redoules.github.io/python/design_own_libs.html"},{"title":"Safely creating a folder if it doesn't exist","text":"Safely creating a folder if it doesn't exist When you are writing to files in python, if the file doesn't exist it will be created. However, if you are trying to write a file in a directory that doesn't exist, an exception will be returned FileNotFoundError : [ Errno 2 ] No such file or directory : \"directory\" This article will teach you how to make sure the target directory exists. If it doesn't, the function will create that directory. First, let's import os and make sure that the \"test_directory\" doesn't exist import os os . path . exists ( \". \\\\ test_directory\" ) False copy the ensure_dir function in your code. This function will handle the creation of the directory. Credit goes to Parand posted on StackOverflow def ensure_dir ( file_path ): directory = os . path . dirname ( file_path ) if not os . path . exists ( directory ): os . makedirs ( directory ) Let's now use the function and create a folder named \"test_directory\" ensure_dir ( \". \\\\ test_directory\" ) If we test for the existence of the directory, the exists function will now return True os . path . exists ( \". \\\\ test_directory\" ) True","tags":"Python","url":"redoules.github.io/python/ensure_dir.html","loc":"redoules.github.io/python/ensure_dir.html"},{"title":"List all files in a directory","text":"Listing all the files in a directory Let's start with the basics, the most staigthforward way to list all the files in a direcoty is to use a combinaison of the listdir function and isfile form os.path. You can use a list comprehension to store all the results in a list. mypath = \"./test_directory/\" from os import listdir from os.path import isfile , join [ f for f in listdir ( mypath ) if isfile ( join ( mypath , f ))] ['logfile.log', 'myfile.txt', 'super_music.mp3', 'textfile.txt'] Listing all the files of a certain type in a directory similarly, if you want to filter only a certain kind of file based on its extension you can use the endswith method. In the following example, we will filter all the \"txt\" files contained in the directory [ f for f in listdir ( mypath ) if f . endswith ( '.' + \"txt\" )] ['myfile.txt', 'textfile.txt'] Listing all the files matching a pattern in a directory The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. You can use the *, ?, and character ranges expressed with [] wildcards import glob glob . glob ( \"*.txt\" ) ['myfile.txt'] Listing files recusively If you want to list all files recursively you can select all the sub-directories using the \"**\" wildcard import glob glob . glob ( mypath + '/**/*.txt' , recursive = True ) ['./test_directory\\\\myfile.txt', './test_directory\\\\textfile.txt', './test_directory\\\\subdir1\\\\file_hidden_in_a_sub_direcotry.txt'] Using a regular expression If you'd rather use a regular expression to select the files, the pathlib library provides the rglob function. from pathlib import Path list ( Path ( \"./test_directory/\" ) . rglob ( \"*.[tT][xX][tT]\" )) [WindowsPath('test_directory/myfile.txt'), WindowsPath('test_directory/textfile.txt'), WindowsPath('test_directory/subdir1/file_hidden_in_a_sub_direcotry.txt')] Using regular expressions you can for example select multiple types of files. 
In the following example, we list all the files that finish either with \"txt\" or with \"log\". list ( Path ( \"./test_directory/\" ) . rglob ( \"*.[tl][xo][tg]\" )) [WindowsPath('test_directory/logfile.log'), WindowsPath('test_directory/myfile.txt'), WindowsPath('test_directory/textfile.txt'), WindowsPath('test_directory/subdir1/file_hidden_in_a_sub_direcotry.txt')]","tags":"Python","url":"redoules.github.io/python/list_files_directory.html","loc":"redoules.github.io/python/list_files_directory.html"},{"title":"Using Dask on infiniband","text":"InfiniBand (abbreviated IB) is a computer-networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used as either a direct or switched interconnect between servers and storage systems, as well as an interconnect between storage systems. (source Wikipedia). If you want to leverage this high speed network instead of the regular ethernet network, you have to specify to the scheduler that you want to used infiniband as your interface. Assuming that you Infiniband interface is ib0 , you would call the scheduler like this : dask-scheduler --interface ib0 --scheduler-file ./cluster.yaml you would have to call the worker using the same interface : dask-worker --interface ib0 --scheduler-file ./cluster.yaml","tags":"Python","url":"redoules.github.io/python/dask_infiniband.html","loc":"redoules.github.io/python/dask_infiniband.html"},{"title":"Clearing the current cell in the notebook","text":"In python, you can clear the output of a cell by importing the IPython.display module and using the clear_output function from IPython.display import clear_output print ( \"text to be cleared\" ) clear_output () As you can see, the text \"text to be cleared\" is not displayed because the function clear_output has been called afterward","tags":"Jupyter","url":"redoules.github.io/jupyter/clear_cell.html","loc":"redoules.github.io/jupyter/clear_cell.html"},{"title":"What's inside my .bashrc ?","text":"############ # Anaconda # ############ export PATH = \"/station/guillaume/anaconda3/bin: $PATH \" alias python = '/station/guillaume/anaconda3/bin/python' ######### # Alias # ######### alias ding = 'echo -e \"\\a\"' alias calc = 'python -ic \"from __future__ import division; from math import *\"' alias h = \"history|grep \" alias f = \"find . |grep \" alias p = \"ps aux |grep \" alias cdl = \"cd /data/guillaume\" alias cp = \"rsync -avz --progress\" alias grep = \"grep --color=auto\" alias ls = \"ls -hN --color=auto --group-directories-first\" alias ll = \"ls -hal\" alias sv = \"ssh compute_cluster\" alias ms = \"ls\" alias jl = \"jupyter lab\" alias lst = \"jupyter notebook list\" ########################## # bashrc personnalisation# ########################## force_color_prompt = yes export EDITOR = nano export BROWSER = \"firefox '%s' &\" if [ -n \" $force_color_prompt \" ] ; then if [ -x /usr/bin/tput ] && tput setaf 1 > & /dev/null ; then # We have color support; assume it's compliant with Ecma-48 # (ISO/IEC-6429). (Lack of such support is extremely rare, and such # a case would tend to support setf rather than setaf.) 
color_prompt = yes else color_prompt = fi fi if [ \" $color_prompt \" = yes ] ; then #\\h : hostname #\\u : user #\\w : current working directory #\\d : date #\\t : time yellow = 226 green = 83 pink = 198 blue = 34 PS1 = \"\\[\\033[38;5;22m\\]\\u\\[ $( tput sgr0 ) \\]\\[\\033[38;5;163m\\]@\\[ $( tput sgr0 ) \\]\\[\\033[38;5;22m\\]\\h\\[ $( tput sgr0 ) \\]\\[\\033[38;5;162m\\]:\\[ $( tput sgr0 ) \\]\\[\\033[38;5;172m\\]{\\[ $( tput sgr0 ) \\]\\[\\033[38;5;39m\\]\\w\\[ $( tput sgr0 ) \\]\\[\\033[38;5;172m\\]}\\[ $( tput sgr0 ) \\]\\[\\033[38;5;162m\\]>\\[ $( tput sgr0 ) \\]\"","tags":"Linux","url":"redoules.github.io/linux/bashrc.html","loc":"redoules.github.io/linux/bashrc.html"},{"title":"Efficient extraction of eigenvalues from a list of tensors","text":"When you manipulate FEM results you generally have either a: scalar field, vector field, * tensor field. With tensorial results, it is often useful to extract the eigenvalues in order to find the principal values. I have found that it is easier to store the components of the tensors in a 6 column pandas dataframe (because of the symmetric property of stress and strain tensors) import pandas as pd node = [ 1001 , 1002 , 1003 , 1004 ] #when dealing with FEM results you should remember at which element/node the result is computed (in the example, let's assume that we look at node from 1001 to 1004) tensor1 = [ 1 , 1 , 1 , 0 , 0 , 0 ] #eigen : 1 tensor2 = [ 4 , - 1 , 0 , 2 , 2 , 1 ] #eigen : 5.58443, -1.77931, -0.805118 tensor3 = [ 1 , 6 , 5 , 3 , 3 , 1 ] #eigen : 8.85036, 4.46542, -1.31577 tensor4 = [ 1 , 2 , 3 , 0 , 0 , 0 ] #eigen : 1, 2, 3 df = pd . DataFrame ([ tensor1 , tensor2 , tensor3 , tensor4 ], columns = [ \"XX\" , \"YY\" , \"ZZ\" , \"XY\" , \"XZ\" , \"YZ\" ]) df . index = node df .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } XX YY ZZ XY XZ YZ 1001 1 1 1 0 0 0 1002 4 -1 0 2 2 1 1003 1 6 5 3 3 1 1004 1 2 3 0 0 0 If you want to extract the eigenvalues of a tensor with numpy you have to pass a n by n ndarray to the eigenvalue function. In order to avoid having to loop over each node, this oneliner is highly optimized and will help you invert a large number of tensors efficiently. The steps are basically, create a list of n by n values (here n=3) in the right order => reshape it to a list of tensors => pass it to the eigenvals function import numpy as np from numpy import linalg as LA eigenvals = LA . eigvals ( df [[ \"XX\" , \"XY\" , \"XZ\" , \"XY\" , \"YY\" , \"YZ\" , \"XZ\" , \"YZ\" , \"ZZ\" ]] . values . reshape ( len ( df ), 3 , 3 )) eigenvals array([[ 1. , 1. , 1. ], [ 5.58442834, -0.80511809, -1.77931025], [-1.31577211, 8.85035616, 4.46541595], [ 1. , 2. , 3. ]])","tags":"Python","url":"redoules.github.io/python/Efficient_extraction_of_eigenvalues_from_a_list_of_tensors.html","loc":"redoules.github.io/python/Efficient_extraction_of_eigenvalues_from_a_list_of_tensors.html"},{"title":"Optimized numpy random number generation on Intel CPU","text":"Python Intel distribution Make sure you have a python intel distribution. When you startup python you should see somethine like : Python 3.6.2 |Intel Corporation| (default, Aug 15 2017, 11:34:02) [MSC v.1900 64 bit (AMD64)] Type 'copyright', 'credits' or 'license' for more information IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help. 
If not, you can force the installation of the intel optimized python with : conda update --all conda config --add channels intel conda install numpy --channel intel --override-channels oh and by the way, make sure you a running an Intel CPU ;) Comparing numpy.random with numpy.random_intel Let's now test both the rand function with and without the Intel optimization import numpy as np from numpy import random , random_intel % timeit np . random . rand ( 10 ** 5 ) 1.06 ms ± 91.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) % timeit np . random_intel . rand ( 10 ** 5 ) 225 µs ± 3.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)","tags":"Python","url":"redoules.github.io/python/Optimized_numpy_random_intel.html","loc":"redoules.github.io/python/Optimized_numpy_random_intel.html"},{"title":"How to check Linux process information?","text":"How to check Linux process information (CPU usage, memory, user information, etc.)? You need to use the ps command combined with the grep command. In the example, we want to check the information on the nginx process : ps aux | grep nginx It would return the output : root 9976 0.0 0.0 12272 108 ? S<s Aug12 0:00 nginx: master process /usr/bin/nginx -g pid /run/nginx.pid; daemon on; master_process on; http 16780 0.0 0.0 12384 684 ? S< Aug12 4:11 nginx: worker process http 16781 0.0 0.0 12556 708 ? S< Aug12 0:24 nginx: worker process http 16782 0.0 0.1 12292 744 ? S< Aug12 2:43 nginx: worker process http 16783 0.0 0.1 12276 872 ? S< Aug12 0:24 nginx: worker process admin 17612 0.0 0.1 5120 864 pts/4 S+ 11:22 0:00 grep --color=auto nginx The columns have the following order : USER;PID;%CPU;%MEM;VSZ;RSS;TTY;STAT;START;TIME;COMMAND USER = user owning the process PID = process ID of the process %CPU = It is the CPU time used divided by the time the process has been running. %MEM = ratio of the process's resident set size to the physical memory on the machine VSZ = virtual memory usage of entire process (in KiB) RSS = resident set size, the non-swapped physical memory that a task has used (in KiB) TTY = controlling tty (terminal) STAT = multi-character process state START = starting time or date of the process TIME = cumulative CPU time COMMAND = command with all its arguments Interactive display If you want an interactive display showing in real time the statistics of the running process you can use the top command. If htop is available on you system, use this instead.","tags":"Linux","url":"redoules.github.io/linux/linux_process_information.html","loc":"redoules.github.io/linux/linux_process_information.html"},{"title":"Check the size of a directory","text":"How do you Check the size of a directory in linux? The du command will come handy for this task. Let's say we want to know the size of the directory named recommandations , we would run the following command du -sh recommendations It would return the output : 9.9M recommendations","tags":"Linux","url":"redoules.github.io/linux/directory_size.html","loc":"redoules.github.io/linux/directory_size.html"},{"title":"Check for free disk space","text":"How do you check for free disk space on linux? The df command will come handy for this task. 
Run the command with the following arguments df -ah Will return in a human readable format for all drives a readout of all your filesystems Filesystem Size Used Avail Use% Mounted on /dev/md0 2.4G 1.3G 1.1G 54% / none 348M 4.0K 348M 1% /dev none 0 0 0 - /dev/pts none 0 0 0 - /proc none 0 0 0 - /sys /tmp 350M 1.3M 349M 1% /tmp /run 350M 3.2M 347M 1% /run /dev/shm 350M 12K 350M 1% /dev/shm /proc/bus/usb 0 0 0 - /proc/bus/usb securityfs 0 0 0 - /sys/kernel/security /dev/md3 1.8T 372G 1.5T 21% /volume2 /dev/vg1000/lv 1.8T 1.5T 340G 82% /volume1 /dev/sdq1 7.4G 3.8G 3.5G 52% /volumeUSB3/usbshare3-1 /dev/sdr 294G 146G 134G 53% /volumeUSB2/usbshare none 0 0 0 - /proc/fs/nfsd none 0 0 0 - /config The free disk space can be read in the Avail column","tags":"Linux","url":"redoules.github.io/linux/free_disk_space_linux.html","loc":"redoules.github.io/linux/free_disk_space_linux.html"},{"title":"Check your current ip address","text":"Check your current ip address Run the command ip addr show and that will give you every information available 4: eth0: <> mtu 1500 group default qlen 1 link/ether 5c:51:4f:41:7a:b1 inet 169.254.33.33/16 brd 169.254.255.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::390a:f69e:1ba2:2121/64 scope global dynamic valid_lft forever preferred_lft forever 3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 group default qlen 1 link/ether 0a:00:27:00:00:03 inet 192.168.56.1/24 brd 192.168.56.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::84cd:374e:843f:f82f/64 scope global dynamic valid_lft forever preferred_lft forever 15: eth2: <> mtu 1500 group default qlen 1 link/ether 00:ff:d2:8a:19:c3 inet 169.254.40.62/16 brd 169.254.255.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::f199:2ca0:7aff:283e/64 scope global dynamic valid_lft forever preferred_lft forever 1: lo: <LOOPBACK,UP> mtu 1500 group default qlen 1 link/loopback 00:00:00:00:00:00 inet 127.0.0.1/8 brd 127.255.255.255 scope global dynamic valid_lft forever preferred_lft forever inet6 ::1/128 scope global dynamic valid_lft forever preferred_lft forever 5: wifi0: <BROADCAST,MULTICAST,UP> mtu 1500 group default qlen 1 link/ieee802.11 5c:51:4f:41:7a:ad inet 192.168.1.1/24 brd 192.168.1.255 scope global dynamic valid_lft 42720sec preferred_lft 42720sec inet6 fe80::395f:3594:1dc2:57e3/64 scope global dynamic valid_lft forever preferred_lft forever 21: wifi1: <> mtu 1500 group default qlen 1 link/ieee802.11 5c:51:4f:41:7a:ae inet 169.254.12.77/16 brd 169.254.255.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::58d5:630:cbbd:c4d/64 scope global dynamic valid_lft forever preferred_lft forever 12: eth3: <> mtu 1472 group default qlen 1 link/ether 00:00:00:00:00:00:00:e0:00:00:00:00:00:00:00:00 inet6 fe80::100:7f:fffe/64 scope global dynamic valid_lft forever preferred_lft forever 10: eth4: <BROADCAST,MULTICAST,UP> mtu 1500 group default qlen 1 link/ether 22:b7:57:52:5f:ff inet 192.168.42.106/24 brd 192.168.42.255 scope global dynamic valid_lft 6659sec preferred_lft 6659sec inet6 fe80::5110:eb6f:deb0:45c4/64 scope global dynamic valid_lft forever preferred_lft forever You can select only one interface ip addr show eth0 and only the relevant information will be displayed 4: eth0: <> mtu 1500 group default qlen 1 link/ether 5c:51:4f:41:7a:b1 inet 169.254.33.33/16 brd 169.254.255.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::390a:f69e:1ba2:2121/64 scope global dynamic valid_lft forever preferred_lft 
forever Check your current ip address the old way The ifconfig command will return information regarding your network interfaces. Let's try it: ifconfig eth1 Link encap:Ethernet HWaddr 0a:00:27:00:00:03 inet adr:192.168.56.1 Bcast:192.168.56.255 Masque:255.255.255.0 adr inet6: fe80::84cd:374e:843f:f82f/64 Scope:Global UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 Packets reçus:0 erreurs:0 :0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B) eth4 Link encap:Ethernet HWaddr 22:b7:57:52:5f:ff inet adr:192.168.42.106 Bcast:192.168.42.255 Masque:255.255.255.0 adr inet6: fe80::5110:eb6f:deb0:45c4/64 Scope:Global UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 Packets reçus:0 erreurs:0 :0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B) lo Link encap:Boucle locale inet adr:127.0.0.1 Masque:255.0.0.0 adr inet6: ::1/128 Scope:Global UP LOOPBACK RUNNING MTU:1500 Metric:1 Packets reçus:0 erreurs:0 :0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B) wifi0 Link encap:UNSPEC HWaddr 5C-51-4F-41-7A-AD-00-00-00-00-00-00-00-00-00-00 inet adr:192.168.1.1 Bcast:192.168.1.255 Masque:255.255.255.0 adr inet6: fe80::395f:3594:1dc2:57e3/64 Scope:Global UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 Packets reçus:0 erreurs:0 :0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B) On the left column you have the list of network adapters. The lo is the local loopback, it is an interface that points to the localhost. Interfaces starting with eth refer to wired connections over ethernet (or sometimes USB in the case of a phone acting as an access point over USB). Interfaces starting with wlan or wifi refer to wireless connections. On the right column you have some information corresponding to the interface such as the IPv4, the IPv6, the mask, some statistics about the interface and so on.","tags":"Linux","url":"redoules.github.io/linux/get_ip_linux.html","loc":"redoules.github.io/linux/get_ip_linux.html"},{"title":"Check the version of the kernel currently running","text":"Check the version of the kernel currently running The uname command will give you the version of the kernel. In order to get a more useful output, type uname -a This will return : the hostname os name kernel release/version architecture * etc. variations If you only want the kernel version you can type uname -v if you only want the kernel release you can type uname -r","tags":"Linux","url":"redoules.github.io/linux/version_kernel.html","loc":"redoules.github.io/linux/version_kernel.html"},{"title":"Running the notebook on a remote server","text":"Jupyter hub With JupyterHub you can create a multi-user Hub which spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server. Project Jupyter created JupyterHub to support many users. The Hub can offer notebook servers to a class of students, a corporate data science workgroup, a scientific research project, or a high performance computing group. However, if you are the only one using the server and you just want a simple way to run the notebook on your server and access it through the web interface on a light client without having to install and configure the jupyter hub, you can do the following. 
Problem with jupyter notebook On your server, run the command jupyter-notebook you should get something like : [I 11:18:44.514 NotebookApp] Serving notebooks from local directory: /volume2/homes/admin [I 11:18:44.515 NotebookApp] 0 active kernels [I 11:18:44.516 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/?token=357587e3269b0f20f2b7e1918492890ae7573ac7ef1d2023 [I 11:18:44.516 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). [W 11:18:44.519 NotebookApp] No web browser found: could not locate runnable browser. [C 11:18:44.520 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://localhost:8888/?token=357587e3269b0f20f2b7e1918492890ae7573ac7ef1d2023 and if you try to connect to your server ip (in my example : http://192.168.1.2:8888/?token=357587e3269b0f20f2b7e1918492890ae7573ac7ef1d2023) you will get an \"ERR_CONNECTION_REFUSED\" error. This is because, by default, Jupyter Notebook only accepts connections from localhost. Allowing connexions from other sources From any IP The simplest way to avoid the connection error is to allow the notebook to accept connections from any ip jupyter-notebook --ip = * you will get something like [W 11:26:45.285 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended. [I 11:26:45.342 NotebookApp] Serving notebooks from local directory: /volume2/homes/admin [I 11:26:45.342 NotebookApp] 0 active kernels [I 11:26:45.343 NotebookApp] The Jupyter Notebook is running at: http://[all ip addresses on your system]:8888/?token=52af33d628881824968b4031967e8541a27cc28b1720c199 [I 11:26:45.343 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). [W 11:26:45.346 NotebookApp] No web browser found: could not locate runnable browser. [C 11:26:45.347 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://localhost:8888/?token=52af33d628881824968b4031967e8541a27cc28b1720c199 and if you connect form a remote client (192.168.1.1 in my example), the following line will be added to the output : [I 11:26:54.798 NotebookApp] 302 GET /?token=52af33d628881824968b4031967e8541a27cc28b1720c199 (192.168.1.1) 111.17ms note that you should only do that if you are the only one using the server because the connection is not encypted. From a specific IP You can also, explicitly specify the ip of the client jupyter-notebook --ip = 192 .168.1.1 [I 11:44:58.104 NotebookApp] JupyterLab extension loaded from C:\\Users\\Guillaume\\Miniconda3\\lib\\site-packages\\jupyterlab [I 11:44:58.104 NotebookApp] JupyterLab application directory is C:\\Users\\Guillaume\\Miniconda3\\share\\jupyter\\lab [I 11:44:58.244 NotebookApp] Serving notebooks from local directory: C:\\Users\\Guillaume [I 11:44:58.245 NotebookApp] 0 active kernels [I 11:44:58.245 NotebookApp] The Jupyter Notebook is running at: http://192.168.1.1:8888/?token=503576dd8fa87d1f2c416df307e9b900e520b4942e317b32 [I 11:44:58.245 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). 
[C 11:44:58.258 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://192.168.1.1:8888/?token=503576dd8fa87d1f2c416df307e9b900e520b4942e317b32 [I 11:44:59.083 NotebookApp] Accepting one-time-token-authenticated connection from 192.168.1.1","tags":"Jupyter","url":"redoules.github.io/jupyter/remote_run_notebook.html","loc":"redoules.github.io/jupyter/remote_run_notebook.html"},{"title":"Running multiple calls to a function in parallel with Dask","text":"Dask.distributed is a lightweight library for distributed computing in Python. It allows to create a compute graph. Dask distributed is architectured around 3 parts : the dask-scheduler the dask-worker(s) the dask client Dask architecture The Dask scheduler is a centrally managed, distributed, dynamic task scheduler. It recieves tasks from a/multiple client(s) and spread them across one or multiple dask-worker(s). Dask-scheduler is an event based asynchronous dynamic scheduler, meaning that mutliple clients can submit a list of task to be executed on multiple workers. Internally, the task are represented as a directed acyclic graph. Both new clients and new workers can be connected or disconnected during the execution of the task graph. Tasks can be submited with the function client . submit ( function , * args , ** kwargs ) or by using objects from the dask library such as dask.dataframe, dask.bag or dask.array Setup In this example, we will use a distributed scheduler on a single machine with multiple workers and a single client. We will use the client to submit some tasks to the scheduler. The scheduler will then dispatch those tasks to the workers. The process can be monitored in real time through a web application. For this example, all the computations will be run on a local computer. However dask can scale to a large HPC cluster. First we have to launch the dask-scheduler; from the command line, input dask-scheduler Next, you can load the web dashboard. In order to do so, the scheduler returns the number of the port you have to connect to in the line starting with \"bokeh at :\". The default port is 8787. Since we are running all the programs on the same computer, we just have to login to http://127.0.0.1:8787/status Finally, we have to launch the dask-worker(s). If you want to run the worker(s) on the same computer as the scheduler the type : dask-worker 127 .0.0.1:8786 otherwise, make sure you are inputing the ip address of the computer hosting the dask-scheduler. You can launch as many workers as you want. In this example, we will run 3 workers on the local machine. Use the dask workers within your python code We will now see how to submit multiple calls to a fucntion in parallel on the dask-workers. Import the required libraries and define the function to be executed. import numpy as np import pandas as pd from distributed import Client #function used to do parallel computing on def compute_pi_MonteCarlo ( Nb_Data ): \"\"\" computes the value of pi using the monte carlo method \"\"\" Radius = 1 Nb_Data = int ( round ( Nb_Data )) x = np . random . uniform ( - Radius , Radius , Nb_Data ) y = np . random . uniform ( - Radius , Radius , Nb_Data ) pi_mc = 4 * np . sum ( np . power ( x , 2 ) + np . power ( y , 2 ) < Radius ** 2 ) / Nb_Data err = 100 * np . abs ( pi_mc - np . pi ) / np . pi return [ Nb_Data , pi_mc , err ] In order to connect to the scheduler, we create a client. 
client = Client ( '127.0.0.1:8786' ) client Client Scheduler: tcp://127.0.0.1:8786 Dashboard: http://127.0.0.1:8787/status Cluster Workers: 3 Cores: 12 Memory: 25.48 GB We submit tasks using the submit method data = [ client . submit ( compute_pi_MonteCarlo , Nb_Data ) for Nb_Data in np . logspace ( 3 , 7 , num = 1200 , dtype = int )] If you look at http://127.0.0.1:8787/status you will see the tasks beeing completed. Once competed, gather the data: data = client . gather ( data ) df = pd . DataFrame ( data ) df . columns = [ \"number of points for MonteCarlo\" , \"value of pi\" , \"error (%)\" ] df . tail () .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } number of points for MonteCarlo value of pi error (%) 1195 9697405 3.141296 0.009454 1196 9772184 3.141058 0.017008 1197 9847540 3.141616 0.000739 1198 9923477 3.141009 0.018574 1199 10000000 3.141032 0.017833 There, we have completed a simple example on how to use dask to run multiple functions in parallel. Full source code: import numpy as np import pandas as pd from distributed import Client #function used to do parallel computing on def compute_pi_MonteCarlo ( Nb_Data ): \"\"\" computes the value of pi using the monte carlo method \"\"\" Radius = 1 Nb_Data = int ( round ( Nb_Data )) x = np . random . uniform ( - Radius , Radius , Nb_Data ) y = np . random . uniform ( - Radius , Radius , Nb_Data ) pi_mc = 4 * np . sum ( np . power ( x , 2 ) + np . power ( y , 2 ) < Radius ** 2 ) / Nb_Data err = 100 * np . abs ( pi_mc - np . pi ) / np . pi return [ Nb_Data , pi_mc , err ] #connect to the scheduler client = Client ( '127.0.0.1:8786' ) #submit tasks data = [ client . submit ( compute_pi_MonteCarlo , Nb_Data ) for Nb_Data in np . logspace ( 3 , 7 , num = 1200 , dtype = int )] #gather the results data = client . gather ( data ) df = pd . DataFrame ( data ) df . columns = [ \"number of points for MonteCarlo\" , \"value of pi\" , \"error (%)\" ] df . tail () A word on the environement variables On Windows, to make sure that you can run dask-scheduler and dask-worker from the command line, you have to add the location of the executable to your path. On linux, you can append the location of the dask-worker and scheduler to the path variable with the command export PATH = $PATH :/path/to/dask","tags":"Python","url":"redoules.github.io/python/dask_distributed_parallelism.html","loc":"redoules.github.io/python/dask_distributed_parallelism.html"},{"title":"Plotting data using log axis","text":"Plotting in log axis with matplotlib import matplotlib.pyplot as plt % matplotlib inline import numpy as np x = np . linspace ( 0.1 , 20 ) y = 20 * np . exp ( - x / 10.0 ) Plotting using the standard function then specifying the axis scale One of the easiest way to plot in a log plot is to specify the plot normally and then specify which axis is to be plotted with a log scale. This can be specified by the function set_xscale or set_yscale # Normal plot fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . plot ( x , y ) ax . grid () plt . show () # Log x axis plot fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . plot ( x , y ) ax . set_xscale ( 'log' ) ax . grid () plt . show () # Log x axis plot fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . plot ( x , y ) ax . set_yscale ( 'log' ) ax . grid () plt . show () # Log x axis plot fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . plot ( x , y ) ax . 
set_xscale ( 'log' ) ax . set_yscale ( 'log' ) ax . grid () plt . show () Plotting using the matplotlib defined function Matplotlib has the function : semilogx, semilogy and loglog that can help you avoid having to specify the axis scale. # Plot using semilogx fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . semilogx ( x , y ) ax . grid () plt . show () # Plot using semilogy fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . semilogy ( x , y ) ax . grid () plt . show () # Plot using loglog fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . loglog ( x , y ) ax . grid () plt . show ()","tags":"Python","url":"redoules.github.io/python/logplot.html","loc":"redoules.github.io/python/logplot.html"},{"title":"Downloading a static webpage with python","text":"If you are using python legacy (aka python 2) first of all, stop ! Furthermore, this method won't work in python legacy # Import modules from urllib.request import urlopen The webpage source code can be downloaded with the command urlopen url = \"http://example.com/\" #create a HTTP request in order to read the page page = urlopen ( url ) . read () The source code will be stored in the variable page as a string print ( page ) b '<!doctype html>\\n<html>\\n<head>\\n <title>Example Domain</title>\\n\\n <meta charset=\"utf-8\" />\\n <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\" />\\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />\\n <style type=\"text/css\">\\n body {\\n background-color: #f0f0f2;\\n margin: 0;\\n padding: 0;\\n font-family: \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;\\n \\n }\\n div {\\n width: 600px;\\n margin: 5em auto;\\n padding: 50px;\\n background-color: #fff;\\n border-radius: 1em;\\n }\\n a:link, a:visited {\\n color: #38488f;\\n text-decoration: none;\\n }\\n @media (max-width: 700px) {\\n body {\\n background-color: #fff;\\n }\\n div {\\n width: auto;\\n margin: 0 auto;\\n border-radius: 0;\\n padding: 1em;\\n }\\n }\\n </style> \\n</head>\\n\\n<body>\\n<div>\\n <h1>Example Domain</h1>\\n <p>This domain is established to be used for illustrative examples in documents. You may use this\\n domain in examples without prior coordination or asking for permission.</p>\\n <p><a href=\"http://www.iana.org/domains/example\">More information...</a></p>\\n</div>\\n</body>\\n</html>\\n' Additionally, you can beautifulsoup in order to make it easier to work with html from bs4 import BeautifulSoup soup = BeautifulSoup ( page , 'lxml' ) soup . prettify () print ( soup ) & lt ;! 
DOCTYPE html & gt ; & lt ; html & gt ; & lt ; head & gt ; & lt ; title & gt ; Example Domain & lt ;/ title & gt ; & lt ; meta charset = \"utf-8\" /& gt ; & lt ; meta content = \"text/html; charset=utf-8\" http-equiv = \"Content-type\" /& gt ; & lt ; meta content = \"width=device-width, initial-scale=1\" name = \"viewport\" /& gt ; & lt ; style type = \"text/css\" & gt ; body { background-color : #f0f0f2 ; margin : 0 ; padding : 0 ; font-family : \"Open Sans\" , \"Helvetica Neue\" , Helvetica , Arial , sans-serif ; } div { width : 600 px ; margin : 5 em auto ; padding : 50 px ; background-color : #fff ; border-radius : 1 em ; } a : link , a : visited { color : #38488f ; text-decoration : none ; } @ media ( max-width : 700px ) { body { background-color : #fff ; } div { width : auto ; margin : 0 auto ; border-radius : 0 ; padding : 1 em ; } } & lt ;/ style & gt ; & lt ;/ head & gt ; & lt ; body & gt ; & lt ; div & gt ; & lt ; h1 & gt ; Example Domain & lt ;/ h1 & gt ; & lt ; p & gt ; This domain is established to be used for illustrative examples in documents . You may use this domain in examples without prior coordination or asking for permission .& lt ;/ p & gt ; & lt ; p & gt ;& lt ; a href = \"http://www.iana.org/domains/example\" & gt ; More information ...& lt ;/ a & gt ;& lt ;/ p & gt ; & lt ;/ div & gt ; & lt ;/ body & gt ; & lt ;/ html & gt ;","tags":"Python","url":"redoules.github.io/python/download_page.html","loc":"redoules.github.io/python/download_page.html"},{"title":"Getting stock market data","text":"Start by importing the packages. We will need pandas and the pandas_datareader. # Import modules import pandas as pd from pandas_datareader import data Datareader allows you to import data from the internet. I have found that Quandl and robinhood works the best as a source for stockmarket data. Note that if you want an other type of data (e.g. GDP, inflation, etc.) other sources exist. #import stock from robinhood aapl_robinhood = data . DataReader ( 'AAPL' , 'robinhood' , '1980-01-01' ) aapl_robinhood . head () .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } close_price high_price interpolated low_price open_price session volume symbol begins_at AAPL 2017-08-04 153.996200 154.990700 False 153.306900 153.681100 reg 20559852 2017-08-07 156.379100 156.487400 False 154.272000 154.655900 reg 21870321 2017-08-08 157.629700 159.352900 False 155.847400 156.172300 reg 36205896 2017-08-09 158.594700 158.801500 False 156.674500 156.822200 reg 26131530 2017-08-10 153.543100 158.169600 False 152.861000 158.070700 reg 40804273 #import stock from quandl aapl_quandl = data . DataReader ( 'AAPL' , 'quandl' , '1980-01-01' ) aapl_quandl . 
head () .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } Open High Low Close Volume ExDividend SplitRatio AdjOpen AdjHigh AdjLow AdjClose AdjVolume Date 2018-03-27 173.68 175.15 166.92 168.340 38962839.0 0.0 1.0 173.68 175.15 166.92 168.340 38962839.0 2018-03-26 168.07 173.10 166.44 172.770 36272617.0 0.0 1.0 168.07 173.10 166.44 172.770 36272617.0 2018-03-23 168.39 169.92 164.94 164.940 40248954.0 0.0 1.0 168.39 169.92 164.94 164.940 40248954.0 2018-03-22 170.00 172.68 168.60 168.845 41051076.0 0.0 1.0 170.00 172.68 168.60 168.845 41051076.0 2018-03-21 175.04 175.09 171.26 171.270 35247358.0 0.0 1.0 175.04 175.09 171.26 171.270 35247358.0","tags":"Python","url":"redoules.github.io/python/stock_pandas.html","loc":"redoules.github.io/python/stock_pandas.html"},{"title":"Moving average with pandas","text":"# Import modules import pandas as pd from pandas_datareader import data , wb #import packages from pandas_datareader import data aapl = data . DataReader ( 'AAPL' , 'quandl' , '1980-01-01' ) aapl . head () .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } Open High Low Close Volume ExDividend SplitRatio AdjOpen AdjHigh AdjLow AdjClose AdjVolume Date 2018-03-27 173.68 175.15 166.92 168.340 38962839.0 0.0 1.0 173.68 175.15 166.92 168.340 38962839.0 2018-03-26 168.07 173.10 166.44 172.770 36272617.0 0.0 1.0 168.07 173.10 166.44 172.770 36272617.0 2018-03-23 168.39 169.92 164.94 164.940 40248954.0 0.0 1.0 168.39 169.92 164.94 164.940 40248954.0 2018-03-22 170.00 172.68 168.60 168.845 41051076.0 0.0 1.0 170.00 172.68 168.60 168.845 41051076.0 2018-03-21 175.04 175.09 171.26 171.270 35247358.0 0.0 1.0 175.04 175.09 171.26 171.270 35247358.0 In order to computer the moving average, we will use the rolling function. #120 days moving average moving_averages = aapl [[ \"Open\" , \"High\" , \"Low\" , \"Close\" , \"Volume\" ]] . rolling ( window = 120 ) . mean () moving_averages . tail () .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } Open High Low Close Volume Date 1980-12-18 28.457667 28.551917 28.385000 28.385000 139495.000000 1980-12-17 28.410750 28.502917 28.338083 28.338083 141772.500000 1980-12-16 28.362833 28.453917 28.289167 28.289167 141256.666667 1980-12-15 28.335750 28.426833 28.262083 28.262083 144321.666667 1980-12-12 28.310750 28.402833 28.238167 28.238167 159625.000000 % matplotlib inline import matplotlib.pyplot as plt plt . plot ( aapl . index , aapl . Open , label = 'Open price' ) plt . plot ( moving_averages . index , moving_averages . Open , label = \"120 MA Open price\" ) plt . legend () plt . show ()","tags":"Python","url":"redoules.github.io/python/Moving_average_pandas.html","loc":"redoules.github.io/python/Moving_average_pandas.html"},{"title":"Keywords to use with WHERE","text":"Keywords to use with WHERE #load the extension % load_ext sql #connect to the database % sql sqlite : /// mydatabase . db 'Connected: @mydatabase.db' Assignment operator The assignment operator is =. % sql SELECT * FROM tutyfrutty WHERE color = \"red\" * sqlite:///mydatabase.db Done. index fruit color kcal 2 Apple red 52 7 Cranberry red 308 Comparison operators Comparison operation can be done in a SQL querry. 
They are the following : Equality : = Greater than : > greater than or equal to : >= less than : < less than or equal to : <= not equal to : <>, != not greater than : !> not less than : !< % sql SELECT * FROM tutyfrutty WHERE kcal = 47 * sqlite:///mydatabase.db Done. index fruit color kcal 1 Orange orange 47 % sql SELECT * FROM tutyfrutty WHERE kcal > 47 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 2 Apple red 52 7 Cranberry red 308 % sql SELECT * FROM tutyfrutty WHERE kcal >= 47 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 7 Cranberry red 308 % sql SELECT * FROM tutyfrutty WHERE kcal < 47 * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 4 lime green 30 5 plum purple 28 % sql SELECT * FROM tutyfrutty WHERE kcal <= 47 * sqlite:///mydatabase.db Done. index fruit color kcal 1 Orange orange 47 3 lemon yellow 15 4 lime green 30 5 plum purple 28 % sql SELECT * FROM tutyfrutty WHERE kcal <> 47 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 7 Cranberry red 308 Logical operators Logical operators test a condition and return a boolean. The logical operators in SQL are : ALL : true if all the conditions are true AND : true if both conditions are true ANY : true if any one of the conditions is true BETWEEN : true if the operand is within a range of values EXISTS : true if the subquery contains any rows IN : true if the operand is equal to one of a list of values LIKE : true if a pattern is matched NOT : true if the operand is false, false otherwise OR : true if either condition is true SOME : true if any of the conditions is true % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" AND kcal < 100 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 3 lemon yellow 15 % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR kcal > 300 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 3 lemon yellow 15 7 Cranberry red 308 % sql SELECT * FROM tutyfrutty WHERE fruit LIKE 'l%' * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 4 lime green 30 % sql SELECT * FROM tutyfrutty WHERE NOT color = \"yellow\" * sqlite:///mydatabase.db Done. index fruit color kcal 1 Orange orange 47 2 Apple red 52 4 lime green 30 5 plum purple 28 7 Cranberry red 308 % sql SELECT * FROM tutyfrutty WHERE kcal BETWEEN 40 AND 100 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 Bitwise operators Some bitwise operators exist in SQL. They will not be demonstrated here. They are the following : AND : & OR : | XOR : ^ NOT : ~","tags":"SQL","url":"redoules.github.io/sql/WHERE_SQL_keywords.html","loc":"redoules.github.io/sql/WHERE_SQL_keywords.html"},{"title":"Sorting results","text":"Sorting results in SQL Sorting results can be achieved by using a modifier command at the end of the SQL query #load the extension % load_ext sql #connect to the database % sql sqlite : /// mydatabase . db 'Connected: @mydatabase.db' The results can be sorted with the command ORDER BY SELECT column-list FROM table_name [WHERE condition] [ORDER BY column1, column2, .. columnN] [ASC | DESC] Let's show an example where we extract the fruits that are either yellow or red % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR color = \"red\" * sqlite:///mydatabase.db Done.
index fruit color kcal 0 Banana yellow 89 2 Apple red 52 3 lemon yellow 15 7 Cranberry red 308 Ascending sort % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR color = \"red\" ORDER BY kcal ASC * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 2 Apple red 52 0 Banana yellow 89 7 Cranberry red 308 descending sort % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR color = \"red\" ORDER BY kcal DESC * sqlite:///mydatabase.db Done. index fruit color kcal 7 Cranberry red 308 0 Banana yellow 89 2 Apple red 52 3 lemon yellow 15 Sort by multiple columns You can sort by more than one column. Just specify multiple columns in the ORDER BY keyword. In the example, we will sort alphabetically on the color column first and sort alphabetically on the fruit column % sql SELECT * FROM tutyfrutty ORDER BY color , fruit ASC * sqlite:///mydatabase.db Done. index fruit color kcal 4 lime green 30 1 Orange orange 47 5 plum purple 28 2 Apple red 52 7 Cranberry red 308 0 Banana yellow 89 3 lemon yellow 15","tags":"SQL","url":"redoules.github.io/sql/Sorting_results.html","loc":"redoules.github.io/sql/Sorting_results.html"},{"title":"Filter content of a TABLE","text":"Filter content of a TABLE in SQL In this example, we will display the content of a table but we will filter out the results. Since we are working in the notebook, we will load the sql extension in order to manipulate the database. The database mydatabase.db is a SQLite database already created before the example. #load the extension % load_ext sql #connect to the database % sql sqlite : /// mydatabase . db 'Connected: @mydatabase.db' Filter content matching exactly a condition We want to extract all the entries in a dataframe that match a certain condition, in order to do so, we will use the following command : SELECT * FROM TABLE WHERE column=\"condition\" In our example, we will filter all the entries in the tutyfrutty table whose color is yellow % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 3 lemon yellow 15 Complex conditions You can build more complex conditions by using the keywords OR and AND In the following example, we will filter all entries that are either yellow or red % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR color = \"red\" * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 2 Apple red 52 3 lemon yellow 15 7 Cranberry red 308 Note : when combining multiple conditions with AND and OR, be careful to use parentesis where needed Conditions matching a pattern You can also use the LIKE keyword in order to find all entries that match a certain pattern. In our example, we want to find all fruits begining with a \"l\". In order to do so, we will use the LIKE keyword and the wildcard \"%\" meaning any string % sql SELECT * FROM tutyfrutty WHERE fruit LIKE \"l%\" * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 4 lime green 30 Numerical conditions When we are working with numerical data, we can use the GREATER THAN > and SMALLER THAN < operators % sql SELECT * FROM tutyfrutty WHERE kcal < 47 * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 4 lime green 30 5 plum purple 28 If we want the condition to be inclusive we can use the operator <= (alternatively >=) % sql SELECT * FROM tutyfrutty WHERE kcal <= 47 * sqlite:///mydatabase.db Done. 
index fruit color kcal 1 Orange orange 47 3 lemon yellow 15 4 lime green 30 5 plum purple 28","tags":"SQL","url":"redoules.github.io/sql/display_table_filter.html","loc":"redoules.github.io/sql/display_table_filter.html"},{"title":"Displaying the content of a TABLE","text":"Displaying the content of a TABLE in SQL In this very simple example we will see how to display the content of a table. Since we are working in the notebook, we will load the sql extension in order to manipulate the database. The database mydatabase.db is a SQLite database already created before the example. #load the extension % load_ext sql #connect to the database % sql sqlite : /// mydatabase . db 'Connected: @mydatabase.db' In order to extract all the values from a table, we will use the following command : SELECT * FROM TABLE In our example, we want to display the data contained in the table named tutyfrutty % sql SELECT * FROM tutyfrutty * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 7 Cranberry red 308","tags":"SQL","url":"redoules.github.io/sql/display_table.html","loc":"redoules.github.io/sql/display_table.html"},{"title":"Opening a file with python","text":"This short article show you how to open a file using python. We will use the with keyword in order to avoid having to close the file. There is no need to import anything in order to open a file. All the function related to file manipulation are part of the python standard library In order to open a file, we will use the function open. This function takes two arguments : the path of the file the mode you want to open the file The mode can be : 'r' : read 'w' : write 'a' : append (writes at the end of the file) 'b' : binary mode 'x' : exclusive creation 't' : text mode (by default) Note that if the file does not exit it will be created if you use the following options \"w\", \"a\", \"x\". If you try to open a non existing file in read mode 'r', a FileNotFoundError will be returned. It is possible to combine multiple options together. For instance, you can open a file in binary mode for writing using the 'wb' option. Python distinguishes between binary and text I/O. Files opened in binary mode return contents as bytes objects without any decoding. In text mode , the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given. Writing to a file Let's first open (create) a text file a write a string to it. filepath = \". \\\\ myfile.txt\" with open ( filepath , 'w' ) as f : f . write ( \"Hello world !\" ) Reading a file we can now see how to read the content of a file. To do so, we will use the 'r' option with open ( filepath , \"r\" ) as f : content = f . read () print ( content ) Hello world ! A word on the with keyword In python the with keyword is used when working with unmanaged resources (like file streams). The python documentation tells us that : The with statement clarifies code that previously would use try...finally blocks to ensure that clean-up code is executed. In this section, I'll discuss the statement as it will commonly be used. In the next section, I'll examine the implementation details and show how to write objects for use with this statement. 
The with statement is a control-flow structure whose basic structure is: with expression [ as variable ]: with - block The expression is evaluated, and it should result in an object that supports the context management protocol (that is, has __enter__() and __exit__() methods).","tags":"Python","url":"redoules.github.io/python/Opening_file.html","loc":"redoules.github.io/python/Opening_file.html"},{"title":"Opening a SQLite database with python","text":"This short article shows you how to connect to a SQLite database using python. We will use the with keyword in order to avoid having to close the database. In order to connect to the database, we will have to import sqlite3 import sqlite3 from sqlite3 import Error In python, the with keyword is used when working with unmanaged resources (like file streams). The python documentation tells us that : The with statement clarifies code that previously would use try...finally blocks to ensure that clean-up code is executed. In this section, I'll discuss the statement as it will commonly be used. In the next section, I'll examine the implementation details and show how to write objects for use with this statement. The with statement is a control-flow structure whose basic structure is: with expression [ as variable ]: with - block The expression is evaluated, and it should result in an object that supports the context management protocol (that is, has __enter__() and __exit__() methods). db_file = \". \\\\ mydatabase.db\" try : with sqlite3 . connect ( db_file ) as conn : print ( \"Connected to the database\" ) #your code here except Error as e : print ( e ) Connected to the database","tags":"Python","url":"redoules.github.io/python/Opening_SQLite_database.html","loc":"redoules.github.io/python/Opening_SQLite_database.html"},{"title":"Reading data from a sql database with pandas","text":"When manipulating your data using pandas, it is sometimes useful to pull data from a database. In this tutorial, we will see how to query a dataframe from a sqlite table. Note that it would also work with any other sql database as long as you change the connection to the one that suits your needs. First let's import pandas and sqlite3 import pandas as pd import sqlite3 from sqlite3 import Error We want to store the table tutyfrutty in our dataframe. To do so, we will query all the elements present in the tutyfrutty TABLE with the command : SELECT * FROM tutyfrutty db_file = \". \\\\ mydatabase.db\" try : with sqlite3 . connect ( db_file ) as conn : df = pd . read_sql ( \"SELECT * FROM tutyfrutty\" , conn ) del df [ \"index\" ] #just delete the index column that was stored in the table except Error as e : print ( e ) df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 6 Cranberry red 308 7 Cranberry red 308","tags":"Python","url":"redoules.github.io/python/Reading_data_from_a_sql_database_with_pandas.html","loc":"redoules.github.io/python/Reading_data_from_a_sql_database_with_pandas.html"},{"title":"Writing data to a sql database with pandas","text":"When manipulating your data using pandas, it is sometimes useful to store a dataframe. Pandas provides multiple ways to export dataframes. The most common are exporting to a csv, a pickle, to hdf or to excel. However, exporting to a sql database can prove very useful.
Indeed, having a well structured database is great for storing all the data related to your analysis in one place. In this tutorial, we will see how to store a dataframe in a new table of a sqlite database. Note that it would also work with any other sql database as long as you change the connection to the one that suits your needs. First let's import pandas and sqlite3 import pandas as pd import sqlite3 from sqlite3 import Error # Example dataframe raw_data = { 'fruit' : [ 'Banana' , 'Orange' , 'Apple' , 'lemon' , \"lime\" , \"plum\" ], 'color' : [ 'yellow' , 'orange' , 'red' , 'yellow' , \"green\" , \"purple\" ], 'kcal' : [ 89 , 47 , 52 , 15 , 30 , 28 ] } df = pd . DataFrame ( raw_data , columns = [ 'fruit' , 'color' , 'kcal' ]) df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 Now that the DataFrame has been created, let's push it to the sqlite database called mydatabase.db in a new table called tutyfrutty db_file = \". \\\\ mydatabase.db\" try : with sqlite3 . connect ( db_file ) as conn : df . to_sql ( \"tutyfrutty\" , conn ) except Error as e : print ( e ) except ValueError : print ( \"The TABLE tutyfrutty already exists, read below to understand how to handle this case\" ) Note that if the table tutyfrutty already exists, the to_sql function will raise a ValueError. This is where the if_exists option comes into play. Let's look at the docstring of this function : \"\"\" if_exists : {'fail', 'replace', 'append'}, default 'fail' - fail: If table exists, do nothing. - replace: If table exists, drop it, recreate it, and insert data. - append: If table exists, insert data. Create if does not exist. \"\"\" Let's say I want to update my dataframe with a new row df . loc [ len ( df ) + 1 ] = [ 'Cranberry' , 'red' , 308 ] df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 7 Cranberry red 308 8 Cranberry red 308 I can now replace the table with the new values using the \"replace\" option db_file = \". \\\\ mydatabase.db\" try : with sqlite3 . connect ( db_file ) as conn : df . to_sql ( \"tutyfrutty\" , conn , if_exists = \"replace\" ) except Error as e : print ( e )","tags":"Python","url":"redoules.github.io/python/Writing_data_to_a_sql_database_with_pandas.html","loc":"redoules.github.io/python/Writing_data_to_a_sql_database_with_pandas.html"},{"title":"Creating a sqlite database","text":"When you want to start using databases, SQLite is a great tool. It provides an easy on-ramp to learn and prototype your database with a SQL-compatible engine. First, let's import the libraries we need import sqlite3 from sqlite3 import Error SQLite doesn't need a database server; however, you have to start by creating an empty database file import os def check_for_db_file (): if os . path . exists ( \"mydatabase.db\" ): print ( \"the database is ready\" ) else : print ( \"no database found\" ) check_for_db_file () no database found Let's then create a function that will connect to a database, print the version of sqlite and then close the connection to the database.
def create_database ( db_file ): \"\"\" create a database connection to a SQLite database \"\"\" try : with sqlite3 . connect ( db_file ) as conn : print ( \"database created with sqlite3 version {0}\" . format ( sqlite3 . version )) except Error as e : print ( e ) create_database ( \".\\mydatabase.db\" ) database created with sqlite3 version 2.6.0 check_for_db_file () the database is ready You're all set. From now on, you can open the database and write sql queries into it.","tags":"Python","url":"redoules.github.io/python/Creating_a_sqlite_database.html","loc":"redoules.github.io/python/Creating_a_sqlite_database.html"},{"title":"Setting up the notebook for plotting with matplotlib","text":"Importing Matplotlib First we need to import pyplot, a collection of command style functions that make matplotlib work like MATLAB. Let's also use the magic command %matplotlib inline in order to display the figures in the notebook import matplotlib.pyplot as plt % matplotlib inline # this doubles image size, but we'll do it manually below # %config InlineBackend.figure_format = 'retina' The following parameters are recommended for matplotlib; they will make matplotlib output a better quality image # %load snippets/matplot_setup.py plt . rcParams [ 'savefig.dpi' ] = 300 plt . rcParams [ 'figure.dpi' ] = 163 plt . rcParams [ 'figure.autolayout' ] = False plt . rcParams [ 'figure.figsize' ] = 20 , 12 plt . rcParams [ 'axes.labelsize' ] = 18 plt . rcParams [ 'axes.titlesize' ] = 20 plt . rcParams [ 'font.size' ] = 16 plt . rcParams [ 'lines.linewidth' ] = 2.0 plt . rcParams [ 'lines.markersize' ] = 8 plt . rcParams [ 'legend.fontsize' ] = 14 plt . rcParams [ 'text.usetex' ] = False # True activates latex output in fonts! plt . rcParams [ 'font.family' ] = \"serif\" plt . rcParams [ 'font.serif' ] = \"cm\" plt . rcParams [ 'text.latex.preamble' ] = \" \\\\ usepackage{subdepth}, \\\\ usepackage{type1cm}\" You can change the second line in order to fit your display. 163 dpi corresponds to a Dell Ultra HD 4k P2715Q. You can check your screen's dpi count at http://dpi.lv/","tags":"Python","url":"redoules.github.io/python/Setting_up_the_notebook_for_plotting_with_matplotlib.html","loc":"redoules.github.io/python/Setting_up_the_notebook_for_plotting_with_matplotlib.html"},{"title":"Why using a blockchain is a bad idea for your business","text":"What does having a blockchain imply? storage costs : everyone maintaining the ledger needs to store every transaction bandwidth costs : everyone has to broadcast every transaction computational costs : every node has to validate the blockchain control : the creator does not control the blockchain, everyone collectively controls it development costs : developing on a blockchain is way harder than on a traditional database What to ask a business when they tell you that they are using a blockchain? When a business tells you about their innovative technology leveraging the power of the blockchain, this should immediately spark some questions : What is the consensus algorithm? who is responsible for validating the consensus rules? what is the nature of the participation? is it open to access? is it open to innovation? is it a public ledger? is it transparent? does it improve accountability? is it cross borders?
how is it validated?","tags":"Cryptocurrencies","url":"redoules.github.io/cryptocurrencies/blockchain_bad.html","loc":"redoules.github.io/cryptocurrencies/blockchain_bad.html"},{"title":"Synology NFS share","text":"Setting up a NFS share Login to your DSM admin account, open the \"Control Panel\" and go to \"File Services\" Make sure NFS is enabled Back in the control panel, go to \"Shared Folder\" Select the folder you want to share and click \"Edit\" Go to the \"NFS Permissions\" tab and click \"Create\", add the IP of the device you want to mount the mapped drive on. Make sure you copy the \"Mount path\"","tags":"Linux","url":"redoules.github.io/linux/share_nfs_share.html","loc":"redoules.github.io/linux/share_nfs_share.html"},{"title":"Mount a NFS share using fstab","text":"Mount nfs using fstab The fstab file, generally located at /etc/fstab, lists the different partitions and where to mount them on the filesystem. You can edit this file as root by using the following command sudo nano /etc/fstab In the following example, we want to mount a NFS v3 share from : server : 192.168.1.2 mountpoint (on the server) : /volumeUSB2/usbshare mount location (on the client) : /mnt we specify 192.168.1.2:/volumeUSB2/usbshare /mnt nfs nfsvers = 3 ,users 0 0 The client will then automatically mount the share on /mnt at startup. Related You can reload the fstab file using this method : https://redoules.github.io/linux/Reloading_fstab.html You can create a NFS share on a Synology using this method : https://redoules.github.io/linux/share_nfs_share.html","tags":"Linux","url":"redoules.github.io/linux/mount_nfs_share_fstab.html","loc":"redoules.github.io/linux/mount_nfs_share_fstab.html"},{"title":"Installing bitcoind on raspberry pi","text":"Installing bitcoind on linux Running a full bitcoin node helps the bitcoin network to accept, validate and relay transactions. If you want to volunteer some spare computing and bandwidth resources to run a full node and allow Bitcoin to continue to grow, you can grab an inexpensive and power efficient raspberry pi and turn it into a full node. There are plenty of tutorials on the Internet explaining how to install a bitcoin full node; this tutorial won't go over setting up a raspberry pi and using ssh. In order to store the full blockchain we will mount a network drive and tell bitcoind to use this mapped drive as the data directory. Download the bitcoin client Go to https://bitcoin.org/en/download Copy the URL for the ARM 32 bit version and download it onto your raspberry pi. wget https://bitcoin.org/bin/bitcoin-core-0.15.1/bitcoin-0.15.1-arm-linux-gnueabihf.tar.gz Locate the downloaded file and extract it using the argument xzf tar xzf bitcoin-0.15.1-arm-linux-gnueabihf.tar.gz A new directory bitcoin-0.15.1 will be created; it contains the files we need to install the software Install the bitcoin client We will install the content by copying the binaries located in the bin folder into /usr/local/bin by using the install command. You must use sudo because it will write data to a system directory sudo install -m 0755 -o root -g root -t /usr/local/bin bitcoin-0.15.1/bin/* Launch the bitcoin core client by running bitcoind -daemon Configuration of the node Start your node at boot Starting your node automatically at boot time is a good idea because it doesn't require a manual action from the user. The simplest way to achieve this is to create a cron job. Run the following command crontab -e Select the text editor of your choice, then add the following line at the end of the file @reboot bitcoind -daemon Save the file and exit; the updated crontab file will be installed for you.
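Before moving on, it is worth checking that the daemon really comes back after a reboot. A minimal check, assuming the binaries were installed in /usr/local/bin as described above : pgrep bitcoind # should print the process id of the running daemon bitcoin-cli getblockchaininfo # should answer with the current block count and verification progress If either command stays silent or errors out, the cron job did not start bitcoind.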
Full Node If you can afford to download and store the whole blockchain, you can run a full node. At the time of writing, the blockchain is 150 GB ( https://blockchain.info/fr/charts/blocks-size ). Three ways to store this are : use a microSD card with 256 GB or more add a thumb drive or an external drive to your raspberry pi mount a network drive from a NAS If you have purchased a big SD card then you can leave the default location for the blockchain data (~/.bitcoin/). Otherwise, you will have to change the datadir location to where your drive is mounted (in my case I have mounted it to /mnt) In order to configure your bitcoin client, edit/create the file bitcoin.conf located in ~/.bitcoin/ nano ~/.bitcoin/bitcoin.conf copy the following text # From redoules.github.io # This config should be placed in following path: # ~/.bitcoin/bitcoin.conf # [core] # Specify a non-default location to store blockchain and other data. datadir=/mnt # Set database cache size in megabytes; machines sync faster with a larger cache. Recommend setting as high as possible based upon mach$ dbcache=100 # Keep at most <n> unconnectable transactions in memory. maxorphantx=10 # Keep the transaction memory pool below <n> megabytes. maxmempool=50 # [network] # Maintain at most N connections to peers. maxconnections=40 # Tries to keep outbound traffic under the given target (in MiB per 24h), 0 = no limit. maxuploadtarget=5000 Check https://jlopp.github.io/bitcoin-core-config-generator , it is a handy site for editing the bitcoin.conf file Pruning node If you don't want to store the entire blockchain you can run a pruning node, which reduces storage requirements by enabling pruning (deleting) of old blocks. Let's say you want to allocate at most 5 GB to the blockchain; then specify prune=5000 in your bitcoin.conf file. Edit/create the file bitcoin.conf located in ~/.bitcoin/ nano ~/.bitcoin/bitcoin.conf copy the following text # From redoules.github.io # This config should be placed in following path: # ~/.bitcoin/bitcoin.conf # [core] # Set database cache size in megabytes; machines sync faster with a larger cache. Recommend setting as high as possible based upon mach$ dbcache=100 # Keep at most <n> unconnectable transactions in memory. maxorphantx=10 # Keep the transaction memory pool below <n> megabytes. maxmempool=50 # Reduce storage requirements by only storing most recent N MiB of block. This mode is incompatible with -txindex and -rescan. WARNING: Reverting this setting requires re-downloading the entire blockchain. (default: 0 = disable pruning blocks, 1 = allow manual pruning via RPC, greater than 550 = automatically prune blocks to stay under target size in MiB). prune=5000 # [network] # Maintain at most N connections to peers. maxconnections=40 # Tries to keep outbound traffic under the given target (in MiB per 24h), 0 = no limit. maxuploadtarget=5000 Checking if your node is public One of the best ways to help the bitcoin network is to allow your node to be visible and to propagate blocks to other nodes. The bitcoin protocol uses port 8333; other clients should be able to share information with your client.
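A quick local heuristic before testing from the outside (a sketch, assuming the default settings shown above): bitcoin-cli getconnectioncount # prints the number of peers currently connected Bitcoin Core opens at most 8 outbound connections by default, so a count that stays above 8 usually means that other nodes are reaching you on port 8333.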
Run ifconfig and check if you have an ipv6 address (look for adr inet6:) IPV6 Get the global ipv6 address of your raspberry pi Link encap:Ethernet HWaddr xx:xx:xx:xx:xx:xx inet adr:192.168.1.x Bcast:192.168.1.255 Masque:255.255.255.0 adr inet6: xxxx::xxxx:xxxx:xxxx:xxxx/64 Scope:Lien adr inet6: xxxx:xxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx/64 Scope:Global UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:42681744 errors:0 dropped:0 overruns:0 frame:0 TX packets:38447218 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 lg file transmission:1000 RX bytes:3044414780 (2.8 GiB) TX bytes:2599878680 (2.4 GiB) It is located between adr inet6: and Scope:Global adr inet6: xxxx:xxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx/64 Scope:Global Copy this address and paste it into the search field on https://bitnodes.earn.com/ If your node is visible, it will appear on the website IPV4 If you don't have an ipv6 address, you will have to open port 8333 on your router and redirect it to the internal IP of your raspberry pi. It is not detailed here because the configuration depends on your router.","tags":"Cryptocurrencies","url":"redoules.github.io/cryptocurrencies/Installing_bitcoind_on_raspberry_pi.html","loc":"redoules.github.io/cryptocurrencies/Installing_bitcoind_on_raspberry_pi.html"},{"title":"Reloading .bashrc","text":"Reload .bashrc The .bashrc file, located at ~/.bashrc, allows a user to personalize their bash shell. If you edit this file, the changes won't be loaded without logging out and back in. However, you can use the following command to do it source ~/.bashrc","tags":"Linux","url":"redoules.github.io/linux/Reloading_.bashrc.html","loc":"redoules.github.io/linux/Reloading_.bashrc.html"},{"title":"Reloading fstab","text":"Reload fstab The fstab file, generally located at /etc/fstab, lists the different partitions and where to mount them on the filesystem. If you edit this file, the changes won't be mounted automatically. You either have to reboot your system or use the following command as root mount -a","tags":"Linux","url":"redoules.github.io/linux/Reloading_fstab.html","loc":"redoules.github.io/linux/Reloading_fstab.html"},{"title":"Updating all python package with anaconda","text":"Updating anaconda packages All packages managed by conda can be updated with the following command : conda update --all Updating other packages with pip For the other packages, the pip package manager can be used. Unfortunately pip doesn't have the same update-all functionality. import pip from subprocess import call for dist in pip . get_installed_distributions (): print ( \"updating {0}\" . format ( dist )) call ( \"pip install --upgrade \" + dist . project_name , shell = True )","tags":"Python","url":"redoules.github.io/python/updating_all_python_package_with_anaconda.html","loc":"redoules.github.io/python/updating_all_python_package_with_anaconda.html"},{"title":"Saving a matplotlib figure with a high resolution","text":"Creating a matplotlib figure #Importing matplotlib % matplotlib inline import matplotlib.pyplot as plt import numpy as np Drawing a figure # Fixing random state for reproducibility np . random . seed ( 19680801 ) mu , sigma = 100 , 15 x = mu + sigma * np . random . randn ( 10000 ) # the histogram of the data n , bins , patches = plt . hist ( x , 50 , normed = 1 , facecolor = 'g' , alpha = 0.75 ) plt . xlabel ( 'Smarts' ) plt . ylabel ( 'Probability' ) plt . title ( 'Histogram of IQ' ) plt . text ( 60 , . 025 , r '$\mu=100,\ \sigma=15$' ) plt . axis ([ 40 , 160 , 0 , 0.03 ]) plt . grid ( True ) plt .
show () Saving the figure normally, one would use the following code plt . savefig ( 'filename.png' ) <matplotlib.figure.Figure at 0x2e45e92f400> The figure in then exported to the file \"filename.png\" with a standard resolution. In adittion, you can specify the dpi arg to some scalar value, for example: plt . savefig ( 'filename_hi_dpi.png' , dpi = 300 ) <matplotlib.figure.Figure at 0x2e462164898>","tags":"Python","url":"redoules.github.io/python/Saving_a_matplotlib_figure_with_a_high_resolution.html","loc":"redoules.github.io/python/Saving_a_matplotlib_figure_with_a_high_resolution.html"},{"title":"Iterating over a DataFrame","text":"Create a sample dataframe # Import modules import pandas as pd # Example dataframe raw_data = { 'fruit' : [ 'Banana' , 'Orange' , 'Apple' , 'lemon' , \"lime\" , \"plum\" ], 'color' : [ 'yellow' , 'orange' , 'red' , 'yellow' , \"green\" , \"purple\" ], 'kcal' : [ 89 , 47 , 52 , 15 , 30 , 28 ] } df = pd . DataFrame ( raw_data , columns = [ 'fruit' , 'color' , 'kcal' ]) df .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 Using the iterrows method Pandas DataFrames can return a generator with the iterrrows method. It can then be used to loop over the rows of the DataFrame for index , row in df . iterrows (): print ( \"At line {0} there is a {1} which is {2} and contains {3} kcal\" . format ( index , row [ \"fruit\" ], row [ \"color\" ], row [ \"kcal\" ])) At line 0 there is a Banana which is yellow and contains 89 kcal At line 1 there is a Orange which is orange and contains 47 kcal At line 2 there is a Apple which is red and contains 52 kcal At line 3 there is a lemon which is yellow and contains 15 kcal At line 4 there is a lime which is green and contains 30 kcal At line 5 there is a plum which is purple and contains 28 kcal","tags":"Python","url":"redoules.github.io/python/Iterating_over_a_dataframe.html","loc":"redoules.github.io/python/Iterating_over_a_dataframe.html"},{"title":"Article Recommander","text":"import pandas as pd import numpy as np % matplotlib inline Loading data and preprocessing we first learn the pickled article database. We will be cleaning it and separating the interesting articles from the uninteresting ones. df = pd . read_pickle ( './article.pkl' ) del df [ \"html\" ] del df [ \"image\" ] del df [ \"URL\" ] del df [ \"hash\" ] del df [ \"source\" ] df [ \"label\" ] = df [ \"note\" ] . apply ( lambda x : 0 if x <= 0 else 1 ) df . head ( 5 ) .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } authors note resume texte titre label 0 [Danny Bradbury, Marco Santori, Adam Draper, M... -10.0 Black Market Reloaded, a black market site tha... Black Market Reloaded, a black market site tha... Black Market Reloaded back online after source... 0 1 [Emily Spaven, Stan Higgins, Emilyspaven] 1.0 The UK Home Office believes the government sho... The UK Home Office believes the government sho... Home Office: UK Should Create a Crime-Fighting... 1 2 [Pete Rizzo, Alex Batlin, Yessi Bello Perez, P... -10.0 Though lofty in its ideals, lead developer Dan... A new social messaging app is aiming to disrup... Gems Bitcoin App Lets Users Earn Money From So... 0 3 [Nermin Hajdarbegovic, Stan Higgins, Pete Rizz... 
3.0 US satellite service provider DISH Network has... US satellite service provider DISH Network has... DISH Becomes World's Largest Company to Accept... 1 4 [Stan Higgins, Bailey Reutzel, Garrett Keirns,... -10.0 An unidentified 28-year-old man was robbed of ... An unidentified 28-year-old man was robbed of ... Bitcoin Stolen at Gunpoint in New York City Ro... 0 Basic statistics on the dataset let's explore the dataset and extract some numbers : * the number of article liked/disliked df [ \"label\" ] . value_counts () 0 879 1 324 Name: label, dtype: int64 Create the full content column df [ 'full_content' ] = df . titre + ' ' + df . resume #exclude the full texte of the article for the moment df . head ( 1 ) .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } authors note resume texte titre label full_content 0 [Danny Bradbury, Marco Santori, Adam Draper, M... -10.0 Black Market Reloaded, a black market site tha... Black Market Reloaded, a black market site tha... Black Market Reloaded back online after source... 0 Black Market Reloaded back online after source... from sklearn.model_selection import train_test_split training , testing = train_test_split ( df , # The dataset we want to split train_size = 0.75 , # The proportional size of our training set stratify = df . label , # The labels are used for stratification random_state = 400 # Use the same random state for reproducibility ) training . head ( 5 ) .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } authors note resume texte titre label full_content 748 [Jon Brodkin] -10.0 Amazon, Reddit, Mozilla, and other Internet co... Amazon, Reddit, Mozilla, and other Internet co... Amazon and Reddit try to save net neutrality r... 0 Amazon and Reddit try to save net neutrality r... 1183 [Jon Brodkin] -10.0 (The Time Warner involved in this transaction ... A group of mostly Democratic senators led by A... Democrats urge Trump administration to block A... 0 Democrats urge Trump administration to block A... 769 [Joseph Brogan] -10.0 On Twitter, bad news comes at all hours, with ... On Twitter, bad news comes at all hours, with ... Some of the best art on Twitter comes from the... 0 Some of the best art on Twitter comes from the... 57 [Michael Del Castillo, Pete Rizzo, Trond Vidar... -10.0 Publicly traded online travel service Webjet i... Publicly traded online travel service Webjet i... Webjet Ethereum Pilot Targets Hotel Industry's... 0 Webjet Ethereum Pilot Targets Hotel Industry's... 892 [Andrew Cunningham] 10.0 What has changed on the 2017 MacBook, then?\\nI... Andrew Cunningham\\n\\nAndrew Cunningham\\n\\nAndr... Mini-review: The 2017 MacBook could actually b... 1 Mini-review: The 2017 MacBook could actually b... from sklearn.feature_extraction.text import TfidfVectorizer , CountVectorizer from sklearn.svm import LinearSVC , SVC from sklearn.pipeline import Pipeline from sklearn.model_selection import cross_val_predict from utils.plotting import pipeline_performance steps = ( ( 'vectorizer' , TfidfVectorizer ()), ( 'classifier' , LinearSVC ()) ) pipeline = Pipeline ( steps ) predicted_labels = cross_val_predict ( pipeline , training . full_content , training . label ) pipeline_performance ( training . label , predicted_labels ) pipeline = pipeline . fit ( training . titre , training . 
label ) Accuracy = 80.6% Confusion matrix, without normalization [[624 35] [140 103]] import re from utils.plotting import print_top_features from sklearn.model_selection import GridSearchCV def mask_integers ( s ): return re . sub ( r '\\d+' , 'INTMASK' , s ) steps = ( ( 'vectorizer' , TfidfVectorizer ()), ( 'classifier' , LinearSVC ()) ) pipeline = Pipeline ( steps ) gs_params = { #'vectorizer__use_idf': (True, False), 'vectorizer__lowercase' : [ True , False ], 'vectorizer__stop_words' : [ 'english' , None ], 'vectorizer__ngram_range' : [( 1 , 1 ), ( 1 , 2 ), ( 2 , 2 )], 'vectorizer__preprocessor' : [ mask_integers , None ], 'classifier__C' : np . linspace ( 5 , 20 , 25 ) } gs = GridSearchCV ( pipeline , gs_params , n_jobs = 1 ) gs . fit ( training . full_content , training . label ) print ( gs . best_params_ ) print ( gs . best_score_ ) pipeline1 = gs . best_estimator_ predicted_labels = pipeline1 . predict ( testing . full_content ) pipeline_performance ( testing . label , predicted_labels ) print_top_features ( pipeline1 , n_features = 10 ) aaa = gs . predict ( testing . full_content ) == testing . label aaa = aaa [ testing . label == 1 ] testing [ \"titre\" ] . iloc [ ~ aaa . values ] #pipeline1.predict([\"windows xbox bitcoin\"]) from sklearn.externals import joblib joblib . dump ( pipeline1 , 'classifier.pkl' ) gs . predict ([ 'Google' ]) array([1], dtype=int64) steps = ( ( 'vectorizer' , TfidfVectorizer ()), ( 'classifier' , SVC ()) ) pipeline = Pipeline ( steps ) gs_params = { #'vectorizer__use_idf': (True, False), 'vectorizer__stop_words' : [ 'english' , None ], 'vectorizer__ngram_range' : [( 1 , 1 ), ( 1 , 2 ), ( 2 , 2 )], 'vectorizer__preprocessor' : [ mask_integers , None ], 'classifier__C' : np . linspace ( 5 , 20 , 25 ) } gs = GridSearchCV ( pipeline , gs_params , n_jobs = 1 ) gs . fit ( training . full_content , training . label ) print ( gs . best_params_ ) print ( gs . best_score_ ) pipeline1 = gs . best_estimator_ predicted_labels = pipeline1 . predict ( testing . full_content ) pipeline_performance ( testing . 
label , predicted_labels ) print_top_features ( pipeline1 , n_features = 10 ) {'classifier__C': 5.0, 'vectorizer__ngram_range': (1, 1), 'vectorizer__preprocessor': <function mask_integers at 0x00000237491B67B8>, 'vectorizer__stop_words': 'english'} 0.711180124224 Accuracy = 71.2% Confusion matrix, without normalization [[153 0] [ 62 0]] --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-9-3e0781e307fb> in <module>() 25 pipeline_performance(testing.label, predicted_labels) 26 ---> 27 print_top_features(pipeline1, n_features=10) C:\\Users\\Guillaume\\Documents\\Code\\recommandation\\utils\\plotting.py in print_top_features(pipeline, vectorizer_name, classifier_name, n_features) 81 def print_top_features(pipeline, vectorizer_name='vectorizer', classifier_name='classifier', n_features=7): 82 vocabulary = np.array(pipeline.named_steps[vectorizer_name].get_feature_names()) ---> 83 coefs = pipeline.named_steps[classifier_name].coef_[0] 84 top_feature_idx = np.argsort(coefs) 85 top_features = vocabulary[top_feature_idx] C:\\Users\\Guillaume\\Anaconda3\\lib\\site-packages\\sklearn\\svm\\base.py in coef_(self) 483 def coef_(self): 484 if self.kernel != 'linear': --> 485 raise ValueError('coef_ is only available when using a ' 486 'linear kernel') 487 ValueError: coef_ is only available when using a linear kernel from sklearn.naive_bayes import BernoulliNB steps = ( ( 'vectorizer' , TfidfVectorizer ()), ( 'classifier' , BernoulliNB ()) ) pipeline2 = Pipeline ( steps ) gs_params = { 'vectorizer__stop_words' : [ 'english' , None ], 'vectorizer__ngram_range' : [( 1 , 1 ), ( 1 , 2 ), ( 2 , 2 )], 'vectorizer__preprocessor' : [ mask_integers , None ], 'classifier__alpha' : np . linspace ( 0 , 1 , 5 ), 'classifier__fit_prior' : [ True , False ] } gs = GridSearchCV ( pipeline2 , gs_params , n_jobs = 1 ) gs . fit ( training . full_content , training . label ) print ( gs . best_params_ ) print ( gs . best_score_ ) pipeline2 = gs . best_estimator_ predicted_labels = pipeline2 . predict ( testing . full_content ) pipeline_performance ( testing . 
label , predicted_labels ) print_top_features ( pipeline2 , n_features = 10 ) C:\\Users\\Guillaume\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:801: RuntimeWarning: divide by zero encountered in log self.feature_log_prob_ = (np.log(smoothed_fc) - C:\\Users\\Guillaume\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:820: RuntimeWarning: divide by zero encountered in log neg_prob = np.log(1 - np.exp(self.feature_log_prob_)) C:\\Users\\Guillaume\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:823: RuntimeWarning: invalid value encountered in add jll += self.class_log_prior_ + neg_prob.sum(axis=1) C:\\Users\\Guillaume\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:801: RuntimeWarning: divide by
C:\\Users\\Guillaume\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:801: RuntimeWarning: divide by zero encountered in log self.feature_log_prob_ = (np.log(smoothed_fc) -
C:\\Users\\Guillaume\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:820: RuntimeWarning: divide by zero encountered in log neg_prob = np.log(1 - np.exp(self.feature_log_prob_))
C:\\Users\\Guillaume\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:823: RuntimeWarning: invalid value encountered in add jll += self.class_log_prior_ + neg_prob.sum(axis=1)
{'classifier__alpha': 0.25, 'classifier__fit_prior': True, 'vectorizer__ngram_range': (1, 1), 'vectorizer__preprocessor': <function mask_integers at 0x00000237491B67B8>, 'vectorizer__stop_words': 'english'}
0.805900621118
Accuracy = 78.1%
Confusion matrix, without normalization
[[140 13]
 [ 34 28]]
Top like features: ['use' 'just' 'year' 'price' 'time' 'Bitcoin' 'bitcoin' 'new' 'The' 'INTMASK']
---
Top dislike features: ['ABBA' 'cable' 'cab' 'byte' 'publication' 'bye' 'publications' 'publicity' 'buyer' 'publicizing']
from sklearn.naive_bayes import MultinomialNB

steps = (
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB())
)
pipeline3 = Pipeline(steps)

gs_params = {
    'vectorizer__stop_words': ['english', None],
    'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vectorizer__preprocessor': [mask_integers, None],
    'classifier__alpha': np.linspace(0, 1, 5),
    'classifier__fit_prior': [True, False]
}
gs = GridSearchCV(pipeline3, gs_params, n_jobs=1)
gs.fit(training.full_content, training.label)
print(gs.best_params_)
print(gs.best_score_)

pipeline3 = gs.best_estimator_
predicted_labels = pipeline3.predict(testing.full_content)
pipeline_performance(testing.label, predicted_labels)
print_top_features(pipeline3, n_features=10)
C:\\Users\\Guillaume\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:699: RuntimeWarning: divide by zero encountered in log self.feature_log_prob_ = (np.log(smoothed_fc) -
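These divide-by-zero warnings most likely come from the alpha = 0 point in the search grid: np.linspace(0, 1, 5) includes 0, which turns off additive smoothing, so any word that never occurs in one class makes np.log(smoothed_fc) evaluate log(0). A minimal sketch, assuming the same grid otherwise (gs_params_positive_alpha and the chosen values are hypothetical, not part of the original run), keeps alpha strictly positive so the fit should run without the warnings:
import numpy as np

# Hypothetical variant of the grid above: keep alpha strictly positive so that
# np.log(smoothed_fc) never receives a zero count (alpha = 0 means no smoothing).
gs_params_positive_alpha = {
    'vectorizer__stop_words': ['english', None],
    'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vectorizer__preprocessor': [mask_integers, None],  # mask_integers as defined earlier in this article
    'classifier__alpha': np.linspace(0.25, 1, 4),       # drops the alpha = 0 point
    'classifier__fit_prior': [True, False]
}
Since the selected models below use alpha = 0.25 and alpha = 0.5, dropping the zero point should not change which estimator the grid search picks.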
{'classifier__alpha': 0.5, 'classifier__fit_prior': False, 'vectorizer__ngram_range': (1, 1), 'vectorizer__preprocessor': <function mask_integers at 0x00000237491B67B8>, 'vectorizer__stop_words': 'english'}
0.80900621118
Accuracy = 79.1%
Confusion matrix, without normalization
[[141 12]
 [ 33 29]]
Top like features: ['time' 'Google' 'Pro' 'Apple' 'new' 'The' 'Bitcoin' 'price' 'bitcoin' 'INTMASK']
---
Top dislike features: ['ABBA' 'categories' 'catching' 'catalyst' 'catalog' 'casually' 'casts' 'cast' 'cashier' 'ran']","tags":"Machine Learning","url":"redoules.github.io/machine-learning/Source code for the recommandation engine for articles.html","loc":"redoules.github.io/machine-learning/Source code for the recommandation engine for articles.html"}]}