redoules.github.io/tipuesearch_content.json
redoules 7e8e645115 added an article
stat challenge day 1
2018-11-08 23:41:12 +01:00


{"pages":[{"title":"About Guillaume Redoulès","text":"I am a data scientist and a mechanical engineer working on numerical methods for stress computations in the field of rocket propulsion. Prior to that, I earned an MSc in Computational Fluid Dynamics and aerodynamics from Imperial College London. Email: guillaume.redoules@gadz.org Linkedin: Guillaume Redoulès Curriculum Vitae Experience Thermomechanical method and tools engineer , Ariane Group , 2015 - Present In charge of tools and methods related to thermomechanical computations. Focal point for machine learning. Education MSc Advanced Computational Methods for Aeronautics, Flow Management and Fluid-Structure Interaction , Imperial College London, London. 2013 Dissertation: \"Estimator design for fluid flows\" Fields: Aeronautics, aerodynamics, computational fluid dynamics, numerical methods Arts et Métiers Paristech , France, 2011 Generalist engineering degree Fields: Mechanics, electrical engineering, casting, machining, project management, finance, IT, etc.","tags":"pages","url":"redoules.github.io/pages/about.html","loc":"redoules.github.io/pages/about.html"},{"title":"Day 1 - Quartiles, Interquartile Range and standard deviation","text":"Quartile Definition A quartile is a type of quantile. The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highest value of the data set. Implementation in python without using the scientific libraries def median ( l ): l = sorted ( l ) if len ( l ) % 2 == 0 : return ( l [ len ( l ) // 2 ] + l [( len ( l ) // 2 - 1 )]) / 2 else : return l [ len ( l ) // 2 ] def quartiles ( l ): # check the input is not empty if not l : raise ValueError ( 'no data points passed' ) # 1. order the data set l = sorted ( l ) # 2. 
divide the data set in two halves mid = int ( len ( l ) / 2 ) Q2 = median ( l ) if ( len ( l ) % 2 == 0 ): # even Q1 = median ( l [: mid ]) Q3 = median ( l [ mid :]) else : # odd Q1 = median ( l [: mid ]) # same as even Q3 = median ( l [ mid + 1 :]) return ( Q1 , Q2 , Q3 ) L = [ 3 , 7 , 8 , 5 , 12 , 14 , 21 , 13 , 18 ] Q1 , Q2 , Q3 = quartiles ( L ) print ( f \"Sample : {L} \\n Q1 : {Q1}, Q2 : {Q2}, Q3 : {Q3}\" ) Sample : [3, 7, 8, 5, 12, 14, 21, 13, 18] Q1 : 6.0, Q2 : 12, Q3 : 16.0 Interquartile Range Definition The interquartile range of an array is the difference between its third (Q3) and first (Q1) quartiles. Hence the interquartile range is Q3-Q1 Implementation in python without using the scientific libraries print ( f \"Interquartile range : {Q3-Q1}\" ) Interquartile range : 10.0 Standard deviation Definition The standard deviation (σ) is a measure that is used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values. The standard deviation can be computed with the formula: $\\sigma = \\sqrt{\\frac{1}{N}\\sum_{i=1}^{N}(x_i - \\mu)^2}$ where $\\mu$ is the mean : $\\mu = \\frac{1}{N}\\sum_{i=1}^{N} x_i$ Implementation in python without using the scientific libraries import math X = [ 10 , 40 , 30 , 50 , 20 ] mean = sum ( X ) / len ( X ) X = [( x - mean ) ** 2 for x in X ] std = math . 
sqrt ( sum ( X ) / len ( X ) ) print ( f \"The distribution {X} has a standard deviation of {std}\" ) The distribution [400.0, 100.0, 0.0, 400.0, 100.0] has a standard deviation of 14.142135623730951","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day1.html","loc":"redoules.github.io/blog/Statistics_10days-day1.html"},{"title":"Counting values in an array","text":"Using lists If you want to count the number of occurrences of an element in a list you can use the .count() method of the list object arr = [ 1 , 2 , 3 , 3 , 4 , 5 , 3 , 6 , 7 , 7 ] print ( f 'Array : {arr} \\n ' ) print ( f 'The number 3 appears {arr.count(3)} times in the list' ) print ( f 'The number 7 appears {arr.count(7)} times in the list' ) print ( f 'The number 4 appears {arr.count(4)} times in the list' ) Array : [1, 2, 3, 3, 4, 5, 3, 6, 7, 7] The number 3 appears 3 times in the list The number 7 appears 2 times in the list The number 4 appears 1 times in the list Using collections You can get a dictionary of the number of occurrences of each element in a list thanks to the Counter class from the collections module like this import collections collections . Counter ( arr ) Counter({1: 1, 2: 1, 3: 3, 4: 1, 5: 1, 6: 1, 7: 2}) Using numpy You can get a similar result with numpy by using the unique function import numpy as np arr = np . array ( arr ) unique , counts = np . unique ( arr , return_counts = True ) dict ( zip ( unique , counts )) {1: 1, 2: 1, 3: 3, 4: 1, 5: 1, 6: 1, 7: 2}","tags":"Python","url":"redoules.github.io/python/counting.html","loc":"redoules.github.io/python/counting.html"},{"title":"Building a dictionary using comprehension","text":"An easy way to create a dictionary in python is to use the comprehension syntax. It can be more expressive and hence easier to read. d = { key : value for ( key , value ) in iterable } In the example below we use a dictionary comprehension to build a dictionary from a source list. 
iterable = list ( range ( 10 )) d = { str ( value ): value ** 2 for value in iterable } # create a dictionary linking the string value of a number with the square value of this number print ( d ) {'0': 0, '1': 1, '2': 4, '3': 9, '4': 16, '5': 25, '6': 36, '7': 49, '8': 64, '9': 81} Of course, you can use another iterable and repack it with the comprehension syntax. In the following example, we convert a list of tuples into a dictionary. iterable = [( \"France\" , 67.12e6 ), ( \"UK\" , 66.02e6 ), ( \"USA\" , 325.7e6 ), ( \"China\" , 1386e6 ), ( \"Germany\" , 82.79e6 )] population = { key : value for ( key , value ) in iterable } print ( population ) {'France': 67120000.0, 'UK': 66020000.0, 'USA': 325700000.0, 'China': 1386000000.0, 'Germany': 82790000.0}","tags":"Python","url":"redoules.github.io/python/dict_comprehension.html","loc":"redoules.github.io/python/dict_comprehension.html"},{"title":"Extracting unique values from a list or an array","text":"Using lists An easy way to extract the unique values of a list in python is to convert the list to a set. A set is an unordered collection of items. Every element is unique (no duplicates) and must be immutable. my_list = [ 10 , 20 , 30 , 40 , 20 , 50 , 60 , 40 ] print ( f \"Original List : {my_list}\" ) my_set = set ( my_list ) my_new_list = list ( my_set ) # the set is converted back to a list with the list() function print ( f \"List of unique numbers : {my_new_list}\" ) Original List : [10, 20, 30, 40, 20, 50, 60, 40] List of unique numbers : [40, 10, 50, 20, 60, 30] Using numpy If you are using numpy you can extract the unique values of an array with the unique function built into numpy: import numpy as np arr = np . array ( my_list ) print ( f 'Initial numpy array : {arr} \\n ' ) unique_arr = np . 
unique ( arr ) print ( f 'Numpy array with unique values : {unique_arr}' ) Initial numpy array : [10 20 30 40 20 50 60 40] Numpy array with unique values : [10 20 30 40 50 60]","tags":"Python","url":"redoules.github.io/python/unique.html","loc":"redoules.github.io/python/unique.html"},{"title":"Sorting an array","text":"Using lists Python provides a built-in function, sorted(), to sort an array; you can use it this way : import random # Random lists from the [0-1000] interval arr = [ random . randint ( 0 , 1000 ) for r in range ( 10 )] print ( f 'Initial random list : {arr} \\n ' ) sorted_arr = list ( sorted ( arr )) print ( f 'Sorted list : {sorted_arr}' ) Initial random list : [277, 347, 976, 367, 604, 878, 148, 670, 229, 432] Sorted list : [148, 229, 277, 347, 367, 432, 604, 670, 878, 976] It is also possible to use the sort method of the list object # Random lists from the [0-1000] interval arr = [ random . randint ( 0 , 1000 ) for r in range ( 10 )] print ( f 'Initial random list : {arr} \\n ' ) arr . sort () print ( f 'Sorted list : {arr}' ) Initial random list : [727, 759, 68, 103, 23, 90, 258, 737, 791, 567] Sorted list : [23, 68, 90, 103, 258, 567, 727, 737, 759, 791] Using numpy If you are using numpy you can sort an array with the sort function, which returns a sorted copy of the array: import numpy as np arr = np . random . random ( 5 ) print ( f 'Initial random array : {arr} \\n ' ) sorted_arr = np . sort ( arr ) print ( f 'Sorted array : {sorted_arr}' ) Initial random array : [0.40021786 0.13876208 0.19939047 0.46015169 0.43734158] Sorted array : [0.13876208 0.19939047 0.40021786 0.43734158 0.46015169]","tags":"Python","url":"redoules.github.io/python/sorting.html","loc":"redoules.github.io/python/sorting.html"},{"title":"Day 0 - Median, mean, mode and weighted mean","text":"A reminder The median The median is the value separating the higher half from the lower half of a data sample. For a data set, it may be thought of as the middle value. 
For a continuous probability distribution, the median is the value such that a number is equally likely to fall above or below it. The mean The arithmetic mean (or simply mean) of a sample is the sum of the sampled values divided by the number of items. The mode The mode of a set of data values is the value that appears most often. It is the value x at which its probability mass function takes its maximum value. In other words, it is the value that is most likely to be sampled. Implementation in python without using the scientific libraries def median ( l ): l = sorted ( l ) if len ( l ) % 2 == 0 : return ( l [ len ( l ) // 2 ] + l [( len ( l ) // 2 - 1 )]) / 2 else : return l [ len ( l ) // 2 ] def mean ( l ): return sum ( l ) / len ( l ) def mode ( data ): dico = { x : data . count ( x ) for x in list ( set ( data ))} return sorted ( sorted ( dico . items ()), key = lambda x : x [ 1 ], reverse = True )[ 0 ][ 0 ] L = [ 64630 , 11735 , 14216 , 99233 , 14470 , 4978 , 73429 , 38120 , 51135 , 67060 , 4978 , 73429 ] print ( f \"Sample : {L} \\n Mean : {mean(L)}, Median : {median(L)}, Mode : {mode(L)}\" ) Sample : [64630, 11735, 14216, 99233, 14470, 4978, 73429, 38120, 51135, 67060, 4978, 73429] Mean : 43117.75, Median : 44627.5, Mode : 4978 The weighted average The weighted arithmetic mean is similar to an ordinary arithmetic mean (the most common type of average), except that instead of each of the data points contributing equally to the final average, some data points contribute more than others. 
data = [ 10 , 40 , 30 , 50 , 20 ] weights = [ 1 , 2 , 3 , 4 , 5 ] sum_X = sum ([ x * w for x , w in zip ( data , weights )]) print ( round (( sum_X / sum ( weights )), 1 )) 32.0","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day0.html","loc":"redoules.github.io/blog/Statistics_10days-day0.html"},{"title":"Create a simple bash function","text":"A basic function The syntax to define a function is : #!/bin/bash # Basic function my_function () { echo Text displayed by my_function } #once defined, you can use it like so : my_function and it should return user@bash : ./my_function.sh Text displayed by my_function Function with arguments When used, the arguments are specified directly after the function name. Within the function, they are accessible via the $ symbol followed by the number of the argument. Hence $1 will take the value of the first argument, $2 will take the value of the second argument, and so on. #!/bin/bash # Passing arguments to a function say_hello () { echo Hello $1 } say_hello Guillaume and it should return user@bash : ./function_arguments.sh Hello Guillaume Overriding Commands Using the previous example, let's override the echo command in order to make it say hello. To do so, you just need to name the function with the same name as the command you want to replace. When calling the original command, make sure you use the builtin keyword #!/bin/bash # Overriding a function echo () { builtin echo Hello $1 } echo Guillaume user@bash : ./function_arguments.sh Hello Guillaume Returning values Use the keyword return to send back a value to the main program. The returned value will be stored in the $? variable #!/bin/bash # Returning a value secret_number () { return 126 } secret_number echo The secret number is $? 
This code should return user@bash : ./return_value.sh The secret number is 126","tags":"Linux","url":"redoules.github.io/linux/simple_bash_function.html","loc":"redoules.github.io/linux/simple_bash_function.html"},{"title":"Number of edges in a Complete graph","text":"A complete graph contains $\\frac{n(n-1)}{2}$ edges where $n$ is the number of vertices (or nodes).","tags":"Mathematics","url":"redoules.github.io/mathematics/Number_edges_Complete_graph.html","loc":"redoules.github.io/mathematics/Number_edges_Complete_graph.html"},{"title":"Reverse an array","text":"Using lists Python provides a built-in function, reversed(), to reverse an array; you can use it this way : arr = list ( range ( 5 )) print ( f 'Initial array : {arr} \\n ' ) reversed_arr = list ( reversed ( arr )) print ( f 'Reversed array : {reversed_arr}' ) Initial array : [0, 1, 2, 3, 4] Reversed array : [4, 3, 2, 1, 0] Using numpy If you are using numpy you can reverse an array by creating a view on the array: import numpy as np arr = np . arange ( 5 ) print ( f 'Initial array : {arr} \\n ' ) reversed_arr = arr [:: - 1 ] print ( f 'Reversed array : {reversed_arr}' ) Initial array : [0 1 2 3 4] Reversed array : [4 3 2 1 0]","tags":"Python","url":"redoules.github.io/python/reverse.html","loc":"redoules.github.io/python/reverse.html"},{"title":"Advice for designing your own libraries","text":"Advice for designing your own libraries When designing your own library, make sure to think of the following things. I will add new paragraphs to this article as I discover new good practices. Use standard python objects Try to use standard python objects as much as possible. That way, your library becomes compatible with all the other python libraries. For instance, when I created SAMpy : a library for reading and writing SAMCEF results, it returned dictionaries, lists and pandas dataframes. Hence the results extracted from SAMCEF were compatible with all the scientific stack of python. 
Limit the number of functionalities Following the same logic as before, the objects should do only one thing but do it well. Indeed, having a simple interface will reduce the complexity of your code and make it easier to use your library. Again, with SAMpy, I decided to strictly limit the functionalities to reading and writing SAMCEF files. Define an exception class for your library You should define your own exceptions in order to make it easier for your users to debug their code thanks to clearer messages that convey more meaning. That way, the user will know if the error comes from your library or something else. Bonus if you group similar exceptions in a hierarchy of inherited Exception classes. Example : let's create an Exception related to the age of a person : def check_age ( age ): if age < 0 or age > 130 : raise ValueError If the user entered an invalid age, the ValueError exception would be thrown. That's fine, but imagine you want to provide more feedback to users who don't know the internals of your library. Let's now create a self-explanatory Exception class AgeInvalidError ( ValueError ): pass def check_age ( age ): if age < 0 or age > 130 : raise AgeInvalidError ( age ) You can also add some helpful text to guide your users along the way: class AgeInvalidError ( ValueError ): pass def check_age ( age ): if age < 0 or age > 130 : raise AgeInvalidError ( f \"Age {age} is invalid, must be between 0 and 130\" ) If you want to group all the logically linked exceptions, you can create a base class and inherit from it : class BaseAgeInvalidError ( ValueError ): pass class TooYoungError ( BaseAgeInvalidError ): pass class TooOldError ( BaseAgeInvalidError ): pass def check_age ( age ): if age < 0 : raise TooYoungError ( age ) elif age > 130 : raise TooOldError ( age ) Structure your repository You should define a clear file structure for your repository. It will help other contributors, especially future contributors. 
A nice directory structure for your project should look like this: README.md LICENSE setup.py requirements.txt ./MyPackage ./docs ./tests Some prefer to use reStructuredText; I personally prefer Markdown choosealicense.com will help you pick the license to use for your project. For package and distribution management, create a setup.py file at the root of the directory The list of dependencies required to test, build and generate the doc is listed in a pip requirements file placed at the root of the directory and named requirements.txt Put the documentation of your library in the docs directory. Put your tests in the tests directory. Since your tests will need to import your library, I recommend modifying the path to resolve your package properly. In order to do so, you can create a context.py file located in the tests directory : import os import sys sys . path . insert ( 0 , os . path . abspath ( os . path . join ( os . path . dirname ( __file__ ), '..' ))) import MyPackage Then within your individual test files you can import your package like so : from .context import MyPackage Finally, your code will go into the MyPackage directory Test your code Once your library is in production, you have to guarantee some level of backward compatibility. Once your interface is defined, write some tests. In the future, when your code is modified, having those tests will make sure that the behaviour of your functions and objects won't be altered. Document your code Of course, you should have a documentation to go along with your library. Make sure to add a lot of common examples as most users tend to learn from examples. 
I recommend writing your documentation using Sphinx.","tags":"Python","url":"redoules.github.io/python/design_own_libs.html","loc":"redoules.github.io/python/design_own_libs.html"},{"title":"Safely creating a folder if it doesn't exist","text":"Safely creating a folder if it doesn't exist When you are writing to files in python, if the file doesn't exist it will be created. However, if you are trying to write a file in a directory that doesn't exist, an exception will be raised FileNotFoundError : [ Errno 2 ] No such file or directory : \"directory\" This article will teach you how to make sure the target directory exists. If it doesn't, the function will create that directory. First, let's import os and make sure that the \"test_directory\" doesn't exist import os os . path . exists ( \". \\\\ test_directory\" ) False Copy the ensure_dir function into your code. This function will handle the creation of the directory. Credit goes to Parand, who posted it on StackOverflow def ensure_dir ( file_path ): directory = os . path . dirname ( file_path ) if not os . path . exists ( directory ): os . makedirs ( directory ) Let's now use the function and create a folder named \"test_directory\" ensure_dir ( \". \\\\ test_directory\" ) If we test for the existence of the directory, the exists function will now return True os . path . exists ( \". \\\\ test_directory\" ) True","tags":"Python","url":"redoules.github.io/python/ensure_dir.html","loc":"redoules.github.io/python/ensure_dir.html"},{"title":"List all files in a directory","text":"Listing all the files in a directory Let's start with the basics, the most straightforward way to list all the files in a directory is to use a combination of the listdir function and isfile from os.path. You can use a list comprehension to store all the results in a list. 
mypath = \"./test_directory/\" from os import listdir from os.path import isfile , join [ f for f in listdir ( mypath ) if isfile ( join ( mypath , f ))] ['logfile.log', 'myfile.txt', 'super_music.mp3', 'textfile.txt'] Listing all the files of a certain type in a directory Similarly, if you want to filter only a certain kind of file based on its extension you can use the endswith method. In the following example, we will filter all the \"txt\" files contained in the directory [ f for f in listdir ( mypath ) if f . endswith ( '.' + \"txt\" )] ['myfile.txt', 'textfile.txt'] Listing all the files matching a pattern in a directory The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. You can use the *, ?, and character ranges expressed with [] wildcards import glob glob . glob ( \"*.txt\" ) ['myfile.txt'] Listing files recursively If you want to list all files recursively you can select all the sub-directories using the \"**\" wildcard import glob glob . glob ( mypath + '/**/*.txt' , recursive = True ) ['./test_directory\\\\myfile.txt', './test_directory\\\\textfile.txt', './test_directory\\\\subdir1\\\\file_hidden_in_a_sub_direcotry.txt'] Using character ranges If you'd rather match files using [] character ranges in the pattern, the pathlib library provides the rglob function, which searches recursively. from pathlib import Path list ( Path ( \"./test_directory/\" ) . rglob ( \"*.[tT][xX][tT]\" )) [WindowsPath('test_directory/myfile.txt'), WindowsPath('test_directory/textfile.txt'), WindowsPath('test_directory/subdir1/file_hidden_in_a_sub_direcotry.txt')] Using character ranges you can, for example, select multiple types of files. In the following example, we list all the files that finish either with \"txt\" or with \"log\". list ( Path ( \"./test_directory/\" ) . 
rglob ( \"*.[tl][xo][tg]\" )) [WindowsPath('test_directory/logfile.log'), WindowsPath('test_directory/myfile.txt'), WindowsPath('test_directory/textfile.txt'), WindowsPath('test_directory/subdir1/file_hidden_in_a_sub_direcotry.txt')]","tags":"Python","url":"redoules.github.io/python/list_files_directory.html","loc":"redoules.github.io/python/list_files_directory.html"},{"title":"Using Dask on infiniband","text":"InfiniBand (abbreviated IB) is a computer-networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used as either a direct or switched interconnect between servers and storage systems, as well as an interconnect between storage systems. (source Wikipedia). If you want to leverage this high-speed network instead of the regular Ethernet network, you have to tell the scheduler to use InfiniBand as its interface. 
Assuming that your InfiniBand interface is ib0 , you would call the scheduler like this : dask-scheduler --interface ib0 --scheduler-file ./cluster.yaml You would also have to call the workers using the same interface : dask-worker --interface ib0 --scheduler-file ./cluster.yaml","tags":"Python","url":"redoules.github.io/python/dask_infiniband.html","loc":"redoules.github.io/python/dask_infiniband.html"},{"title":"Clearing the current cell in the notebook","text":"In python, you can clear the output of a cell by importing the IPython.display module and using the clear_output function from IPython.display import clear_output print ( \"text to be cleared\" ) clear_output () As you can see, the text \"text to be cleared\" is not displayed because the function clear_output has been called afterward","tags":"Jupyter","url":"redoules.github.io/jupyter/clear_cell.html","loc":"redoules.github.io/jupyter/clear_cell.html"},{"title":"What's inside my .bashrc ?","text":"############ # Anaconda # ############ export PATH = \"/station/guillaume/anaconda3/bin: $PATH \" alias python = '/station/guillaume/anaconda3/bin/python' ######### # Alias # ######### alias ding = 'echo -e \"\\a\"' alias calc = 'python -ic \"from __future__ import division; from math import *\"' alias h = \"history|grep \" alias f = \"find . 
|grep \" alias p = \"ps aux |grep \" alias cdl = \"cd /data/guillaume\" alias cp = \"rsync -avz --progress\" alias grep = \"grep --color=auto\" alias ls = \"ls -hN --color=auto --group-directories-first\" alias ll = \"ls -hal\" alias sv = \"ssh compute_cluster\" alias ms = \"ls\" alias jl = \"jupyter lab\" alias lst = \"jupyter notebook list\" ########################## # bashrc personnalisation# ########################## force_color_prompt = yes export EDITOR = nano export BROWSER = \"firefox '%s' &\" if [ -n \" $force_color_prompt \" ] ; then if [ -x /usr/bin/tput ] && tput setaf 1 > & /dev/null ; then # We have color support; assume it's compliant with Ecma-48 # (ISO/IEC-6429). (Lack of such support is extremely rare, and such # a case would tend to support setf rather than setaf.) color_prompt = yes else color_prompt = fi fi if [ \" $color_prompt \" = yes ] ; then #\\h : hostname #\\u : user #\\w : current working directory #\\d : date #\\t : time yellow = 226 green = 83 pink = 198 blue = 34 PS1 = \"\\[\\033[38;5;22m\\]\\u\\[ $( tput sgr0 ) \\]\\[\\033[38;5;163m\\]@\\[ $( tput sgr0 ) \\]\\[\\033[38;5;22m\\]\\h\\[ $( tput sgr0 ) \\]\\[\\033[38;5;162m\\]:\\[ $( tput sgr0 ) \\]\\[\\033[38;5;172m\\]{\\[ $( tput sgr0 ) \\]\\[\\033[38;5;39m\\]\\w\\[ $( tput sgr0 ) \\]\\[\\033[38;5;172m\\]}\\[ $( tput sgr0 ) \\]\\[\\033[38;5;162m\\]>\\[ $( tput sgr0 ) \\]\"","tags":"Linux","url":"redoules.github.io/linux/bashrc.html","loc":"redoules.github.io/linux/bashrc.html"},{"title":"Efficient extraction of eigenvalues from a list of tensors","text":"When you manipulate FEM results you generally have either a scalar field, a vector field, or a tensor field. With tensorial results, it is often useful to extract the eigenvalues in order to find the principal values. 
I have found that it is easier to store the components of the tensors in a 6 column pandas dataframe (because of the symmetric property of stress and strain tensors) import pandas as pd node = [ 1001 , 1002 , 1003 , 1004 ] #when dealing with FEM results you should remember at which element/node the result is computed (in the example, let's assume that we look at nodes 1001 to 1004) tensor1 = [ 1 , 1 , 1 , 0 , 0 , 0 ] #eigen : 1 tensor2 = [ 4 , - 1 , 0 , 2 , 2 , 1 ] #eigen : 5.58443, -1.77931, -0.805118 tensor3 = [ 1 , 6 , 5 , 3 , 3 , 1 ] #eigen : 8.85036, 4.46542, -1.31577 tensor4 = [ 1 , 2 , 3 , 0 , 0 , 0 ] #eigen : 1, 2, 3 df = pd . DataFrame ([ tensor1 , tensor2 , tensor3 , tensor4 ], columns = [ \"XX\" , \"YY\" , \"ZZ\" , \"XY\" , \"XZ\" , \"YZ\" ]) df . index = node df XX YY ZZ XY XZ YZ 1001 1 1 1 0 0 0 1002 4 -1 0 2 2 1 1003 1 6 5 3 3 1 1004 1 2 3 0 0 0 If you want to extract the eigenvalues of a tensor with numpy you have to pass an n by n ndarray to the eigenvalue function. In order to avoid having to loop over each node, this one-liner is highly optimized and will help you compute the eigenvalues of a large number of tensors efficiently. The steps are basically: create a list of n by n values (here n=3) in the right order => reshape it to a list of tensors => pass it to the eigvals function import numpy as np from numpy import linalg as LA eigenvals = LA . eigvals ( df [[ \"XX\" , \"XY\" , \"XZ\" , \"XY\" , \"YY\" , \"YZ\" , \"XZ\" , \"YZ\" , \"ZZ\" ]] . values . reshape ( len ( df ), 3 , 3 )) eigenvals array([[ 1. , 1. , 1. ], [ 5.58442834, -0.80511809, -1.77931025], [-1.31577211, 8.85035616, 4.46541595], [ 1. , 2. , 3. 
]])","tags":"Python","url":"redoules.github.io/python/Efficient_extraction_of_eigenvalues_from_a_list_of_tensors.html","loc":"redoules.github.io/python/Efficient_extraction_of_eigenvalues_from_a_list_of_tensors.html"},{"title":"Optimized numpy random number generation on Intel CPU","text":"Python Intel distribution Make sure you have a python intel distribution. When you start up python you should see something like : Python 3.6.2 |Intel Corporation| (default, Aug 15 2017, 11:34:02) [MSC v.1900 64 bit (AMD64)] Type 'copyright', 'credits' or 'license' for more information IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help. If not, you can force the installation of the intel optimized python with : conda update --all conda config --add channels intel conda install numpy --channel intel --override-channels Oh, and by the way, make sure you are running an Intel CPU ;) Comparing numpy.random with numpy.random_intel Let's now test the rand function both with and without the Intel optimization import numpy as np from numpy import random , random_intel % timeit np . random . rand ( 10 ** 5 ) 1.06 ms ± 91.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) % timeit np . random_intel . rand ( 10 ** 5 ) 225 µs ± 3.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)","tags":"Python","url":"redoules.github.io/python/Optimized_numpy_random_intel.html","loc":"redoules.github.io/python/Optimized_numpy_random_intel.html"},{"title":"How to check Linux process information?","text":"How to check Linux process information (CPU usage, memory, user information, etc.)? You need to use the ps command combined with the grep command. In the example, we want to check the information on the nginx process : ps aux | grep nginx It would return the output : root 9976 0.0 0.0 12272 108 ? S<s Aug12 0:00 nginx: master process /usr/bin/nginx -g pid /run/nginx.pid; daemon on; master_process on; http 16780 0.0 0.0 12384 684 ? 
S< Aug12 4:11 nginx: worker process http 16781 0.0 0.0 12556 708 ? S< Aug12 0:24 nginx: worker process http 16782 0.0 0.1 12292 744 ? S< Aug12 2:43 nginx: worker process http 16783 0.0 0.1 12276 872 ? S< Aug12 0:24 nginx: worker process admin 17612 0.0 0.1 5120 864 pts/4 S+ 11:22 0:00 grep --color=auto nginx The columns have the following order : USER;PID;%CPU;%MEM;VSZ;RSS;TTY;STAT;START;TIME;COMMAND USER = user owning the process PID = process ID of the process %CPU = CPU time used divided by the time the process has been running %MEM = ratio of the process's resident set size to the physical memory on the machine VSZ = virtual memory usage of entire process (in KiB) RSS = resident set size, the non-swapped physical memory that a task has used (in KiB) TTY = controlling tty (terminal) STAT = multi-character process state START = starting time or date of the process TIME = cumulative CPU time COMMAND = command with all its arguments Interactive display If you want an interactive display showing the statistics of the running processes in real time, you can use the top command. If htop is available on your system, use it instead.","tags":"Linux","url":"redoules.github.io/linux/linux_process_information.html","loc":"redoules.github.io/linux/linux_process_information.html"},{"title":"Check the size of a directory","text":"How do you check the size of a directory in Linux? The du command will come in handy for this task. Let's say we want to know the size of the directory named recommendations , we would run the following command du -sh recommendations It would return the output : 9.9M recommendations","tags":"Linux","url":"redoules.github.io/linux/directory_size.html","loc":"redoules.github.io/linux/directory_size.html"},{"title":"Check for free disk space","text":"How do you check for free disk space on Linux? The df command will come in handy for this task. 
Run the command with the following arguments df -ah It will return, in a human-readable format and for all drives, a readout of all your filesystems Filesystem Size Used Avail Use% Mounted on /dev/md0 2.4G 1.3G 1.1G 54% / none 348M 4.0K 348M 1% /dev none 0 0 0 - /dev/pts none 0 0 0 - /proc none 0 0 0 - /sys /tmp 350M 1.3M 349M 1% /tmp /run 350M 3.2M 347M 1% /run /dev/shm 350M 12K 350M 1% /dev/shm /proc/bus/usb 0 0 0 - /proc/bus/usb securityfs 0 0 0 - /sys/kernel/security /dev/md3 1.8T 372G 1.5T 21% /volume2 /dev/vg1000/lv 1.8T 1.5T 340G 82% /volume1 /dev/sdq1 7.4G 3.8G 3.5G 52% /volumeUSB3/usbshare3-1 /dev/sdr 294G 146G 134G 53% /volumeUSB2/usbshare none 0 0 0 - /proc/fs/nfsd none 0 0 0 - /config The free disk space can be read in the Avail column","tags":"Linux","url":"redoules.github.io/linux/free_disk_space_linux.html","loc":"redoules.github.io/linux/free_disk_space_linux.html"},{"title":"Check your current ip address","text":"Check your current ip address Run the command ip addr show and it will give you all the available information 4: eth0: <> mtu 1500 group default qlen 1 link/ether 5c:51:4f:41:7a:b1 inet 169.254.33.33/16 brd 169.254.255.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::390a:f69e:1ba2:2121/64 scope global dynamic valid_lft forever preferred_lft forever 3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 group default qlen 1 link/ether 0a:00:27:00:00:03 inet 192.168.56.1/24 brd 192.168.56.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::84cd:374e:843f:f82f/64 scope global dynamic valid_lft forever preferred_lft forever 15: eth2: <> mtu 1500 group default qlen 1 link/ether 00:ff:d2:8a:19:c3 inet 169.254.40.62/16 brd 169.254.255.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::f199:2ca0:7aff:283e/64 scope global dynamic valid_lft forever preferred_lft forever 1: lo: <LOOPBACK,UP> mtu 1500 group default qlen 1 link/loopback 00:00:00:00:00:00 inet 127.0.0.1/8 brd 127.255.255.255 
scope global dynamic valid_lft forever preferred_lft forever inet6 ::1/128 scope global dynamic valid_lft forever preferred_lft forever 5: wifi0: <BROADCAST,MULTICAST,UP> mtu 1500 group default qlen 1 link/ieee802.11 5c:51:4f:41:7a:ad inet 192.168.1.1/24 brd 192.168.1.255 scope global dynamic valid_lft 42720sec preferred_lft 42720sec inet6 fe80::395f:3594:1dc2:57e3/64 scope global dynamic valid_lft forever preferred_lft forever 21: wifi1: <> mtu 1500 group default qlen 1 link/ieee802.11 5c:51:4f:41:7a:ae inet 169.254.12.77/16 brd 169.254.255.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::58d5:630:cbbd:c4d/64 scope global dynamic valid_lft forever preferred_lft forever 12: eth3: <> mtu 1472 group default qlen 1 link/ether 00:00:00:00:00:00:00:e0:00:00:00:00:00:00:00:00 inet6 fe80::100:7f:fffe/64 scope global dynamic valid_lft forever preferred_lft forever 10: eth4: <BROADCAST,MULTICAST,UP> mtu 1500 group default qlen 1 link/ether 22:b7:57:52:5f:ff inet 192.168.42.106/24 brd 192.168.42.255 scope global dynamic valid_lft 6659sec preferred_lft 6659sec inet6 fe80::5110:eb6f:deb0:45c4/64 scope global dynamic valid_lft forever preferred_lft forever You can select only one interface ip addr show eth0 and only the relevant information will be displayed 4: eth0: <> mtu 1500 group default qlen 1 link/ether 5c:51:4f:41:7a:b1 inet 169.254.33.33/16 brd 169.254.255.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::390a:f69e:1ba2:2121/64 scope global dynamic valid_lft forever preferred_lft forever Check your current ip address the old way The ifconfig command will return information regarding your network interfaces. 
Let's try it: ifconfig eth1 Link encap:Ethernet HWaddr 0a:00:27:00:00:03 inet adr:192.168.56.1 Bcast:192.168.56.255 Masque:255.255.255.0 adr inet6: fe80::84cd:374e:843f:f82f/64 Scope:Global UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 Packets reçus:0 erreurs:0 :0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B) eth4 Link encap:Ethernet HWaddr 22:b7:57:52:5f:ff inet adr:192.168.42.106 Bcast:192.168.42.255 Masque:255.255.255.0 adr inet6: fe80::5110:eb6f:deb0:45c4/64 Scope:Global UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 Packets reçus:0 erreurs:0 :0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B) lo Link encap:Boucle locale inet adr:127.0.0.1 Masque:255.0.0.0 adr inet6: ::1/128 Scope:Global UP LOOPBACK RUNNING MTU:1500 Metric:1 Packets reçus:0 erreurs:0 :0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B) wifi0 Link encap:UNSPEC HWaddr 5C-51-4F-41-7A-AD-00-00-00-00-00-00-00-00-00-00 inet adr:192.168.1.1 Bcast:192.168.1.255 Masque:255.255.255.0 adr inet6: fe80::395f:3594:1dc2:57e3/64 Scope:Global UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 Packets reçus:0 erreurs:0 :0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B) On the left column you have the list of network adapters. The lo is the local loopback, it is an interface that points to the localhost. Interfaces starting with eth refer to wired connections over ethernet (or sometimes USB in the case of a phone acting as an access point over USB). Interfaces starting with wlan or wifi refer to wireless connections. 
In the right column you have information about the interface such as the IPv4 address, the IPv6 address, the mask, some statistics about the interface and so on.","tags":"Linux","url":"redoules.github.io/linux/get_ip_linux.html","loc":"redoules.github.io/linux/get_ip_linux.html"},{"title":"Check the version of the kernel currently running","text":"Check the version of the kernel currently running The uname command will give you the version of the kernel. In order to get a more useful output, type uname -a This will return: the hostname, the OS name, the kernel release/version, the architecture, etc. Variations If you only want the kernel version you can type uname -v if you only want the kernel release you can type uname -r","tags":"Linux","url":"redoules.github.io/linux/version_kernel.html","loc":"redoules.github.io/linux/version_kernel.html"},{"title":"Running the notebook on a remote server","text":"Jupyter hub With JupyterHub you can create a multi-user Hub which spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server. Project Jupyter created JupyterHub to support many users. The Hub can offer notebook servers to a class of students, a corporate data science workgroup, a scientific research project, or a high performance computing group. However, if you are the only one using the server and you just want a simple way to run the notebook on your server and access it through the web interface on a light client without having to install and configure JupyterHub, you can do the following.
Problem with jupyter notebook On your server, run the command jupyter-notebook and you should get something like: [I 11:18:44.514 NotebookApp] Serving notebooks from local directory: /volume2/homes/admin [I 11:18:44.515 NotebookApp] 0 active kernels [I 11:18:44.516 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/?token=357587e3269b0f20f2b7e1918492890ae7573ac7ef1d2023 [I 11:18:44.516 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). [W 11:18:44.519 NotebookApp] No web browser found: could not locate runnable browser. [C 11:18:44.520 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://localhost:8888/?token=357587e3269b0f20f2b7e1918492890ae7573ac7ef1d2023 and if you try to connect to your server IP (in my example : http://192.168.1.2:8888/?token=357587e3269b0f20f2b7e1918492890ae7573ac7ef1d2023) you will get an \"ERR_CONNECTION_REFUSED\" error. This is because, by default, Jupyter Notebook only accepts connections from localhost. Allowing connections from other sources From any IP The simplest way to avoid the connection error is to allow the notebook to accept connections from any IP jupyter-notebook --ip = * you will get something like [W 11:26:45.285 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended. [I 11:26:45.342 NotebookApp] Serving notebooks from local directory: /volume2/homes/admin [I 11:26:45.342 NotebookApp] 0 active kernels [I 11:26:45.343 NotebookApp] The Jupyter Notebook is running at: http://[all ip addresses on your system]:8888/?token=52af33d628881824968b4031967e8541a27cc28b1720c199 [I 11:26:45.343 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). [W 11:26:45.346 NotebookApp] No web browser found: could not locate runnable browser.
[C 11:26:45.347 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://localhost:8888/?token=52af33d628881824968b4031967e8541a27cc28b1720c199 and if you connect from a remote client (192.168.1.1 in my example), the following line will be added to the output: [I 11:26:54.798 NotebookApp] 302 GET /?token=52af33d628881824968b4031967e8541a27cc28b1720c199 (192.168.1.1) 111.17ms Note that you should only do that if you are the only one using the server because the connection is not encrypted. From a specific IP You can also explicitly specify the IP of the client jupyter-notebook --ip = 192 .168.1.1 [I 11:44:58.104 NotebookApp] JupyterLab extension loaded from C:\\Users\\Guillaume\\Miniconda3\\lib\\site-packages\\jupyterlab [I 11:44:58.104 NotebookApp] JupyterLab application directory is C:\\Users\\Guillaume\\Miniconda3\\share\\jupyter\\lab [I 11:44:58.244 NotebookApp] Serving notebooks from local directory: C:\\Users\\Guillaume [I 11:44:58.245 NotebookApp] 0 active kernels [I 11:44:58.245 NotebookApp] The Jupyter Notebook is running at: http://192.168.1.1:8888/?token=503576dd8fa87d1f2c416df307e9b900e520b4942e317b32 [I 11:44:58.245 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). [C 11:44:58.258 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://192.168.1.1:8888/?token=503576dd8fa87d1f2c416df307e9b900e520b4942e317b32 [I 11:44:59.083 NotebookApp] Accepting one-time-token-authenticated connection from 192.168.1.1","tags":"Jupyter","url":"redoules.github.io/jupyter/remote_run_notebook.html","loc":"redoules.github.io/jupyter/remote_run_notebook.html"},{"title":"Running multiple calls to a function in parallel with Dask","text":"Dask.distributed is a lightweight library for distributed computing in Python. It allows you to create a compute graph.
Dask distributed is built around 3 parts : the dask-scheduler the dask-worker(s) the dask client Dask architecture The Dask scheduler is a centrally managed, distributed, dynamic task scheduler. It receives tasks from one or multiple client(s) and spreads them across one or multiple dask-worker(s). Dask-scheduler is an event-based asynchronous dynamic scheduler, meaning that multiple clients can submit lists of tasks to be executed on multiple workers. Internally, the tasks are represented as a directed acyclic graph. Both new clients and new workers can be connected or disconnected during the execution of the task graph. Tasks can be submitted with the function client . submit ( function , * args , ** kwargs ) or by using objects from the dask library such as dask.dataframe, dask.bag or dask.array Setup In this example, we will use a distributed scheduler on a single machine with multiple workers and a single client. We will use the client to submit some tasks to the scheduler. The scheduler will then dispatch those tasks to the workers. The process can be monitored in real time through a web application. For this example, all the computations will be run on a local computer. However dask can scale to a large HPC cluster. First we have to launch the dask-scheduler; from the command line, input dask-scheduler Next, you can load the web dashboard. In order to do so, the scheduler prints the port number you have to connect to in the line starting with \"bokeh at :\". The default port is 8787. Since we are running all the programs on the same computer, we just have to log in to http://127.0.0.1:8787/status Finally, we have to launch the dask-worker(s). If you want to run the worker(s) on the same computer as the scheduler, then type : dask-worker 127 .0.0.1:8786 otherwise, make sure you are inputting the IP address of the computer hosting the dask-scheduler. You can launch as many workers as you want. In this example, we will run 3 workers on the local machine.
Use the dask workers within your python code We will now see how to submit multiple calls to a function in parallel on the dask-workers. Import the required libraries and define the function to be executed. import numpy as np import pandas as pd from distributed import Client #function used to do parallel computing on def compute_pi_MonteCarlo ( Nb_Data ): \"\"\" computes the value of pi using the monte carlo method \"\"\" Radius = 1 Nb_Data = int ( round ( Nb_Data )) x = np . random . uniform ( - Radius , Radius , Nb_Data ) y = np . random . uniform ( - Radius , Radius , Nb_Data ) pi_mc = 4 * np . sum ( np . power ( x , 2 ) + np . power ( y , 2 ) < Radius ** 2 ) / Nb_Data err = 100 * np . abs ( pi_mc - np . pi ) / np . pi return [ Nb_Data , pi_mc , err ] In order to connect to the scheduler, we create a client. client = Client ( '127.0.0.1:8786' ) client Client Scheduler: tcp://127.0.0.1:8786 Dashboard: http://127.0.0.1:8787/status Cluster Workers: 3 Cores: 12 Memory: 25.48 GB We submit tasks using the submit method data = [ client . submit ( compute_pi_MonteCarlo , Nb_Data ) for Nb_Data in np . logspace ( 3 , 7 , num = 1200 , dtype = int )] If you look at http://127.0.0.1:8787/status you will see the tasks being completed. Once completed, gather the data: data = client . gather ( data ) df = pd . DataFrame ( data ) df . columns = [ \"number of points for MonteCarlo\" , \"value of pi\" , \"error (%)\" ] df . tail () number of points for MonteCarlo value of pi error (%) 1195 9697405 3.141296 0.009454 1196 9772184 3.141058 0.017008 1197 9847540 3.141616 0.000739 1198 9923477 3.141009 0.018574 1199 10000000 3.141032 0.017833 There, we have completed a simple example on how to use dask to run multiple functions in parallel.
Full source code: import numpy as np import pandas as pd from distributed import Client #function used to do parallel computing on def compute_pi_MonteCarlo ( Nb_Data ): \"\"\" computes the value of pi using the monte carlo method \"\"\" Radius = 1 Nb_Data = int ( round ( Nb_Data )) x = np . random . uniform ( - Radius , Radius , Nb_Data ) y = np . random . uniform ( - Radius , Radius , Nb_Data ) pi_mc = 4 * np . sum ( np . power ( x , 2 ) + np . power ( y , 2 ) < Radius ** 2 ) / Nb_Data err = 100 * np . abs ( pi_mc - np . pi ) / np . pi return [ Nb_Data , pi_mc , err ] #connect to the scheduler client = Client ( '127.0.0.1:8786' ) #submit tasks data = [ client . submit ( compute_pi_MonteCarlo , Nb_Data ) for Nb_Data in np . logspace ( 3 , 7 , num = 1200 , dtype = int )] #gather the results data = client . gather ( data ) df = pd . DataFrame ( data ) df . columns = [ \"number of points for MonteCarlo\" , \"value of pi\" , \"error (%)\" ] df . tail () A word on the environment variables On Windows, to make sure that you can run dask-scheduler and dask-worker from the command line, you have to add the location of the executable to your PATH. On Linux, you can append the location of the dask-worker and scheduler to the PATH variable with the command export PATH = $PATH :/path/to/dask","tags":"Python","url":"redoules.github.io/python/dask_distributed_parallelism.html","loc":"redoules.github.io/python/dask_distributed_parallelism.html"},{"title":"Plotting data using log axis","text":"Plotting in log axis with matplotlib import matplotlib.pyplot as plt % matplotlib inline import numpy as np x = np . linspace ( 0.1 , 20 ) y = 20 * np . exp ( - x / 10.0 ) Plotting using the standard function then specifying the axis scale One of the easiest ways to create a log plot is to build the plot normally and then specify which axis is to be plotted with a log scale. This can be done with the functions set_xscale or set_yscale # Normal plot fig = plt . figure () ax = fig .
add_subplot ( 1 , 1 , 1 ) ax . plot ( x , y ) ax . grid () plt . show () # Log x axis plot fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . plot ( x , y ) ax . set_xscale ( 'log' ) ax . grid () plt . show () # Log y axis plot fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . plot ( x , y ) ax . set_yscale ( 'log' ) ax . grid () plt . show () # Log-log plot fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . plot ( x , y ) ax . set_xscale ( 'log' ) ax . set_yscale ( 'log' ) ax . grid () plt . show () Plotting using the matplotlib defined functions Matplotlib has the functions semilogx, semilogy and loglog that can help you avoid having to specify the axis scale. # Plot using semilogx fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . semilogx ( x , y ) ax . grid () plt . show () # Plot using semilogy fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . semilogy ( x , y ) ax . grid () plt . show () # Plot using loglog fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . loglog ( x , y ) ax . grid () plt . show ()","tags":"Python","url":"redoules.github.io/python/logplot.html","loc":"redoules.github.io/python/logplot.html"},{"title":"Downloading a static webpage with python","text":"If you are using legacy Python (aka Python 2), first of all, stop! Furthermore, this method won't work in legacy Python # Import modules from urllib.request import urlopen The webpage source code can be downloaded with the command urlopen url = \"http://example.com/\" #create an HTTP request in order to read the page page = urlopen ( url ) .
read () The source code will be stored in the variable page as a string print ( page ) b '&lt;!doctype html&gt;\\n&lt;html&gt;\\n&lt;head&gt;\\n &lt;title&gt;Example Domain&lt;/title&gt;\\n\\n &lt;meta charset=\"utf-8\" /&gt;\\n &lt;meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\" /&gt;\\n &lt;meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" /&gt;\\n &lt;style type=\"text/css\"&gt;\\n body {\\n background-color: #f0f0f2;\\n margin: 0;\\n padding: 0;\\n font-family: \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;\\n \\n }\\n div {\\n width: 600px;\\n margin: 5em auto;\\n padding: 50px;\\n background-color: #fff;\\n border-radius: 1em;\\n }\\n a:link, a:visited {\\n color: #38488f;\\n text-decoration: none;\\n }\\n @media (max-width: 700px) {\\n body {\\n background-color: #fff;\\n }\\n div {\\n width: auto;\\n margin: 0 auto;\\n border-radius: 0;\\n padding: 1em;\\n }\\n }\\n &lt;/style&gt; \\n&lt;/head&gt;\\n\\n&lt;body&gt;\\n&lt;div&gt;\\n &lt;h1&gt;Example Domain&lt;/h1&gt;\\n &lt;p&gt;This domain is established to be used for illustrative examples in documents. You may use this\\n domain in examples without prior coordination or asking for permission.&lt;/p&gt;\\n &lt;p&gt;&lt;a href=\"http://www.iana.org/domains/example\"&gt;More information...&lt;/a&gt;&lt;/p&gt;\\n&lt;/div&gt;\\n&lt;/body&gt;\\n&lt;/html&gt;\\n' Additionally, you can use BeautifulSoup in order to make it easier to work with HTML from bs4 import BeautifulSoup soup = BeautifulSoup ( page , 'lxml' ) soup . prettify () print ( soup ) & lt ;!
DOCTYPE html & gt ; & lt ; html & gt ; & lt ; head & gt ; & lt ; title & gt ; Example Domain & lt ;/ title & gt ; & lt ; meta charset = \"utf-8\" /& gt ; & lt ; meta content = \"text/html; charset=utf-8\" http-equiv = \"Content-type\" /& gt ; & lt ; meta content = \"width=device-width, initial-scale=1\" name = \"viewport\" /& gt ; & lt ; style type = \"text/css\" & gt ; body { background-color : #f0f0f2 ; margin : 0 ; padding : 0 ; font-family : \"Open Sans\" , \"Helvetica Neue\" , Helvetica , Arial , sans-serif ; } div { width : 600 px ; margin : 5 em auto ; padding : 50 px ; background-color : #fff ; border-radius : 1 em ; } a : link , a : visited { color : #38488f ; text-decoration : none ; } @ media ( max-width : 700px ) { body { background-color : #fff ; } div { width : auto ; margin : 0 auto ; border-radius : 0 ; padding : 1 em ; } } & lt ;/ style & gt ; & lt ;/ head & gt ; & lt ; body & gt ; & lt ; div & gt ; & lt ; h1 & gt ; Example Domain & lt ;/ h1 & gt ; & lt ; p & gt ; This domain is established to be used for illustrative examples in documents . You may use this domain in examples without prior coordination or asking for permission .& lt ;/ p & gt ; & lt ; p & gt ;& lt ; a href = \"http://www.iana.org/domains/example\" & gt ; More information ...& lt ;/ a & gt ;& lt ;/ p & gt ; & lt ;/ div & gt ; & lt ;/ body & gt ; & lt ;/ html & gt ;","tags":"Python","url":"redoules.github.io/python/download_page.html","loc":"redoules.github.io/python/download_page.html"},{"title":"Getting stock market data","text":"Start by importing the packages. We will need pandas and the pandas_datareader. # Import modules import pandas as pd from pandas_datareader import data Datareader allows you to import data from the internet. I have found that Quandl and Robinhood work best as sources for stock market data. Note that if you want another type of data (e.g. GDP, inflation, etc.) other sources exist. #import stock from robinhood aapl_robinhood = data .
DataReader ( 'AAPL' , 'robinhood' , '1980-01-01' ) aapl_robinhood . head () close_price high_price interpolated low_price open_price session volume symbol begins_at AAPL 2017-08-04 153.996200 154.990700 False 153.306900 153.681100 reg 20559852 2017-08-07 156.379100 156.487400 False 154.272000 154.655900 reg 21870321 2017-08-08 157.629700 159.352900 False 155.847400 156.172300 reg 36205896 2017-08-09 158.594700 158.801500 False 156.674500 156.822200 reg 26131530 2017-08-10 153.543100 158.169600 False 152.861000 158.070700 reg 40804273 #import stock from quandl aapl_quandl = data . DataReader ( 'AAPL' , 'quandl' , '1980-01-01' ) aapl_quandl . head () Open High Low Close Volume ExDividend SplitRatio AdjOpen AdjHigh AdjLow AdjClose AdjVolume Date 2018-03-27 173.68 175.15 166.92 168.340 38962839.0 0.0 1.0 173.68 175.15 166.92 168.340 38962839.0 2018-03-26 168.07 173.10 166.44 172.770 36272617.0 0.0 1.0 168.07 173.10 166.44 172.770 36272617.0 2018-03-23 168.39 169.92 164.94 164.940 40248954.0 0.0 1.0 168.39 169.92 164.94 164.940 40248954.0 2018-03-22 170.00 172.68 168.60 168.845 41051076.0 0.0 1.0 170.00 172.68 168.60 168.845 41051076.0 2018-03-21 175.04 175.09 171.26 171.270 35247358.0 0.0 1.0 175.04 175.09 171.26 171.270 35247358.0","tags":"Python","url":"redoules.github.io/python/stock_pandas.html","loc":"redoules.github.io/python/stock_pandas.html"},{"title":"Moving average with pandas","text":"# Import modules import pandas as pd from pandas_datareader import data , wb #import packages from pandas_datareader import data aapl = data . DataReader ( 'AAPL' , 'quandl' , '1980-01-01' ) aapl .
head () Open High Low Close Volume ExDividend SplitRatio AdjOpen AdjHigh AdjLow AdjClose AdjVolume Date 2018-03-27 173.68 175.15 166.92 168.340 38962839.0 0.0 1.0 173.68 175.15 166.92 168.340 38962839.0 2018-03-26 168.07 173.10 166.44 172.770 36272617.0 0.0 1.0 168.07 173.10 166.44 172.770 36272617.0 2018-03-23 168.39 169.92 164.94 164.940 40248954.0 0.0 1.0 168.39 169.92 164.94 164.940 40248954.0 2018-03-22 170.00 172.68 168.60 168.845 41051076.0 0.0 1.0 170.00 172.68 168.60 168.845 41051076.0 2018-03-21 175.04 175.09 171.26 171.270 35247358.0 0.0 1.0 175.04 175.09 171.26 171.270 35247358.0 In order to compute the moving average, we will use the rolling function. #120 days moving average moving_averages = aapl [[ \"Open\" , \"High\" , \"Low\" , \"Close\" , \"Volume\" ]] . rolling ( window = 120 ) . mean () moving_averages . tail () Open High Low Close Volume Date 1980-12-18 28.457667 28.551917 28.385000 28.385000 139495.000000 1980-12-17 28.410750 28.502917 28.338083 28.338083 141772.500000 1980-12-16 28.362833 28.453917 28.289167 28.289167 141256.666667 1980-12-15 28.335750 28.426833 28.262083 28.262083 144321.666667 1980-12-12 28.310750 28.402833 28.238167 28.238167 159625.000000 % matplotlib inline import matplotlib.pyplot as plt plt . plot ( aapl . index , aapl . Open , label = 'Open price' ) plt . plot ( moving_averages . index , moving_averages . Open , label = \"120 MA Open price\" ) plt . legend () plt .
show ()","tags":"Python","url":"redoules.github.io/python/Moving_average_pandas.html","loc":"redoules.github.io/python/Moving_average_pandas.html"},{"title":"Keywords to use with WHERE","text":"Keywords to use with WHERE #load the extension % load_ext sql #connect to the database % sql sqlite : /// mydatabase . db 'Connected: @mydatabase.db' Assignment operator The assignment operator is =. % sql SELECT * FROM tutyfrutty WHERE color = \"red\" * sqlite:///mydatabase.db Done. index fruit color kcal 2 Apple red 52 7 Cranberry red 308 Comparison operators Comparison operations can be done in a SQL query. They are the following : Equality : = Greater than : > greater than or equal to : >= less than : < less than or equal to : <= not equal to : <>, != not greater than : !> not less than : !< % sql SELECT * FROM tutyfrutty WHERE kcal = 47 * sqlite:///mydatabase.db Done. index fruit color kcal 1 Orange orange 47 % sql SELECT * FROM tutyfrutty WHERE kcal > 47 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 2 Apple red 52 7 Cranberry red 308 % sql SELECT * FROM tutyfrutty WHERE kcal >= 47 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 7 Cranberry red 308 % sql SELECT * FROM tutyfrutty WHERE kcal < 47 * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 4 lime green 30 5 plum purple 28 % sql SELECT * FROM tutyfrutty WHERE kcal <= 47 * sqlite:///mydatabase.db Done. index fruit color kcal 1 Orange orange 47 3 lemon yellow 15 4 lime green 30 5 plum purple 28 % sql SELECT * FROM tutyfrutty WHERE kcal <> 47 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 7 Cranberry red 308 Logical operators Logical operators test a condition and return a boolean.
The logical operators in SQL are : ALL : true if all the conditions are true AND : true if both conditions are true ANY : true if any one of the conditions is true BETWEEN : true if the operand is within a range of values EXISTS : true if the subquery contains any rows IN : true if the condition is present in a row LIKE : true if a pattern is matched NOT : true if the operand is false, false otherwise OR : true if either condition is true SOME : true if any of the conditions is true % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" AND kcal < 100 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 3 lemon yellow 15 % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR kcal > 300 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 3 lemon yellow 15 7 Cranberry red 308 % sql SELECT * FROM tutyfrutty WHERE fruit LIKE 'l%' * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 4 lime green 30 % sql SELECT * FROM tutyfrutty WHERE NOT color = \"yellow\" * sqlite:///mydatabase.db Done. index fruit color kcal 1 Orange orange 47 2 Apple red 52 4 lime green 30 5 plum purple 28 7 Cranberry red 308 % sql SELECT * FROM tutyfrutty WHERE kcal BETWEEN 40 AND 100 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 Bitwise operators Some bitwise operators exist in SQL. They will not be demonstrated here. They are the following : AND : & OR : | XOR : &#94; NOT : ~","tags":"SQL","url":"redoules.github.io/sql/WHERE_SQL_keywords.html","loc":"redoules.github.io/sql/WHERE_SQL_keywords.html"},{"title":"Sorting results","text":"Sorting results in SQL Sorting results can be achieved by using a modifier command at the end of the SQL query #load the extension % load_ext sql #connect to the database % sql sqlite : /// mydatabase .
db 'Connected: @mydatabase.db' The results can be sorted with the command ORDER BY SELECT column-list FROM table_name [WHERE condition] [ORDER BY column1, column2, .. columnN] [ASC | DESC] Let's show an example where we extract the fruits that are either yellow or red % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR color = \"red\" * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 2 Apple red 52 3 lemon yellow 15 7 Cranberry red 308 Ascending sort % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR color = \"red\" ORDER BY kcal ASC * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 2 Apple red 52 0 Banana yellow 89 7 Cranberry red 308 Descending sort % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR color = \"red\" ORDER BY kcal DESC * sqlite:///mydatabase.db Done. index fruit color kcal 7 Cranberry red 308 0 Banana yellow 89 2 Apple red 52 3 lemon yellow 15 Sort by multiple columns You can sort by more than one column. Just specify multiple columns in the ORDER BY keyword. In the example, we will sort alphabetically on the color column first and then alphabetically on the fruit column % sql SELECT * FROM tutyfrutty ORDER BY color , fruit ASC * sqlite:///mydatabase.db Done. index fruit color kcal 4 lime green 30 1 Orange orange 47 5 plum purple 28 2 Apple red 52 7 Cranberry red 308 0 Banana yellow 89 3 lemon yellow 15","tags":"SQL","url":"redoules.github.io/sql/Sorting_results.html","loc":"redoules.github.io/sql/Sorting_results.html"},{"title":"Filter content of a TABLE","text":"Filter content of a TABLE in SQL In this example, we will display the content of a table but we will filter out the results. Since we are working in the notebook, we will load the sql extension in order to manipulate the database. The database mydatabase.db is a SQLite database already created before the example. #load the extension % load_ext sql #connect to the database % sql sqlite : /// mydatabase .
db 'Connected: @mydatabase.db' Filter content matching exactly a condition We want to extract all the entries in a dataframe that match a certain condition, in order to do so, we will use the following command : SELECT * FROM TABLE WHERE column=\"condition\" In our example, we will filter all the entries in the tutyfrutty table whose color is yellow % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 3 lemon yellow 15 Complex conditions You can build more complex conditions by using the keywords OR and AND. In the following example, we will filter all entries that are either yellow or red % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR color = \"red\" * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 2 Apple red 52 3 lemon yellow 15 7 Cranberry red 308 Note: when combining multiple conditions with AND and OR, be careful to use parentheses where needed Conditions matching a pattern You can also use the LIKE keyword in order to find all entries that match a certain pattern. In our example, we want to find all fruits beginning with an \"l\". In order to do so, we will use the LIKE keyword and the wildcard \"%\" meaning any string % sql SELECT * FROM tutyfrutty WHERE fruit LIKE \"l%\" * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 4 lime green 30 Numerical conditions When we are working with numerical data, we can use the GREATER THAN > and SMALLER THAN < operators % sql SELECT * FROM tutyfrutty WHERE kcal < 47 * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 4 lime green 30 5 plum purple 28 If we want the condition to be inclusive we can use the operator <= (alternatively >=) % sql SELECT * FROM tutyfrutty WHERE kcal <= 47 * sqlite:///mydatabase.db Done.
index fruit color kcal 1 Orange orange 47 3 lemon yellow 15 4 lime green 30 5 plum purple 28","tags":"SQL","url":"redoules.github.io/sql/display_table_filter.html","loc":"redoules.github.io/sql/display_table_filter.html"},{"title":"Displaying the content of a TABLE","text":"Displaying the content of a TABLE in SQL In this very simple example we will see how to display the content of a table. Since we are working in the notebook, we will load the sql extension in order to manipulate the database. The database mydatabase.db is a SQLite database already created before the example. #load the extension % load_ext sql #connect to the database % sql sqlite : /// mydatabase . db 'Connected: @mydatabase.db' In order to extract all the values from a table, we will use the following command : SELECT * FROM TABLE In our example, we want to display the data contained in the table named tutyfrutty % sql SELECT * FROM tutyfrutty * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 7 Cranberry red 308","tags":"SQL","url":"redoules.github.io/sql/display_table.html","loc":"redoules.github.io/sql/display_table.html"},{"title":"Opening a file with python","text":"This short article shows you how to open a file using Python. We will use the with keyword in order to avoid having to close the file. There is no need to import anything in order to open a file. All the functions related to file manipulation are part of the Python standard library. In order to open a file, we will use the function open. This function takes two arguments: the path of the file and the mode in which you want to open the file The mode can be : 'r' : read 'w' : write 'a' : append (writes at the end of the file) 'b' : binary mode 'x' : exclusive creation 't' : text mode (by default) Note that if the file does not exist it will be created if you use the following options \"w\", \"a\", \"x\".
If you try to open a non existing file in read mode 'r', a FileNotFoundError will be raised. It is possible to combine multiple options together. For instance, you can open a file in binary mode for writing using the 'wb' option. Python distinguishes between binary and text I/O. Files opened in binary mode return contents as bytes objects without any decoding. In text mode , the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given. Writing to a file Let's first open (create) a text file and write a string to it. filepath = \". \\\\ myfile.txt\" with open ( filepath , 'w' ) as f : f . write ( \"Hello world !\" ) Reading a file We can now see how to read the content of a file. To do so, we will use the 'r' option with open ( filepath , \"r\" ) as f : content = f . read () print ( content ) Hello world ! A word on the with keyword In python the with keyword is used when working with unmanaged resources (like file streams). The python documentation tells us that : The with statement clarifies code that previously would use try...finally blocks to ensure that clean-up code is executed. In this section, I'll discuss the statement as it will commonly be used. In the next section, I'll examine the implementation details and show how to write objects for use with this statement. The with statement is a control-flow structure whose basic structure is: with expression [ as variable ]: with - block The expression is evaluated, and it should result in an object that supports the context management protocol (that is, has __enter__ () and __exit__ () methods).","tags":"Python","url":"redoules.github.io/python/Opening_file.html","loc":"redoules.github.io/python/Opening_file.html"},{"title":"Opening a SQLite database with python","text":"This short article shows you how to connect to a SQLite database using python. We will use the with keyword in order to avoid having to close the database. 
In order to connect to the database, we will have to import sqlite3 import sqlite3 from sqlite3 import Error In python the with keyword is used when working with unmanaged resources (like file streams). The python documentation tells us that : The with statement clarifies code that previously would use try...finally blocks to ensure that clean-up code is executed. In this section, I'll discuss the statement as it will commonly be used. In the next section, I'll examine the implementation details and show how to write objects for use with this statement. The with statement is a control-flow structure whose basic structure is: with expression [ as variable ]: with - block The expression is evaluated, and it should result in an object that supports the context management protocol (that is, has __enter__ () and __exit__ () methods). db_file = \". \\\\ mydatabase.db\" try : with sqlite3 . connect ( db_file ) as conn : print ( \"Connected to the database\" ) #your code here except Error as e : print ( e ) Connected to the database","tags":"Python","url":"redoules.github.io/python/Opening_SQLite_database.html","loc":"redoules.github.io/python/Opening_SQLite_database.html"},{"title":"Reading data from a sql database with pandas","text":"When manipulating your data using pandas, it is sometimes useful to pull data from a database. In this tutorial, we will see how to query a dataframe from a sqlite table. Note that it would also work with any other sql database as long as you change the connection to the one that suits your needs. First let's import pandas and sqlite3 import pandas as pd import sqlite3 from sqlite3 import Error We want to store the table tutyfrutty in our dataframe. To do so, we will query all the elements present in the tutyfrutty TABLE with the command : SELECT * FROM tutyfrutty db_file = \". \\\\ mydatabase.db\" try : with sqlite3 . connect ( db_file ) as conn : df = pd . 
read_sql ( \"SELECT * FROM tutyfrutty\" , conn ) del df [ \"index\" ] #just delete the index column that was stored in the table except Error as e : print ( e ) df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 6 Cranberry red 308 7 Cranberry red 308","tags":"Python","url":"redoules.github.io/python/Reading_data_from_a_sql_database_with_pandas.html","loc":"redoules.github.io/python/Reading_data_from_a_sql_database_with_pandas.html"},{"title":"Writing data to a sql database with pandas","text":"When manipulating your data using pandas, it is sometimes useful to store a dataframe. Pandas provides multiple ways to export dataframes. The most common consist of exporting to a csv, a pickle, to hdf or to excel. However, exporting to a sql database can prove very useful. Indeed, having a well structured database is great for storing all the data related to your analysis in one place. In this tutorial, we will see how to store a dataframe in a new table of a sqlite database. Note that it would also work with any other sql database as long as you change the connection to the one that suits your needs. First let's import pandas and sqlite3 import pandas as pd import sqlite3 from sqlite3 import Error # Example dataframe raw_data = { 'fruit' : [ 'Banana' , 'Orange' , 'Apple' , 'lemon' , \"lime\" , \"plum\" ], 'color' : [ 'yellow' , 'orange' , 'red' , 'yellow' , \"green\" , \"purple\" ], 'kcal' : [ 89 , 47 , 52 , 15 , 30 , 28 ] } df = pd . 
DataFrame ( raw_data , columns = [ 'fruit' , 'color' , 'kcal' ]) df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 Now that the DataFrame has been created, let's push it to the sqlite database called mydatabase.db in a new table called tutyfrutty db_file = \". \\\\ mydatabase.db\" try : with sqlite3 . connect ( db_file ) as conn : df . to_sql ( \"tutyfrutty\" , conn ) except Error as e : print ( e ) except ValueError : print ( \"The TABLE tutyfrutty already exists, read below to understand how to handle this case\" ) Note that if the table tutyfrutty already exists, the to_sql function will raise a ValueError. This is where the if_exists option comes into play. Let's look at the docstring of this function : \"\"\" if_exists : {'fail', 'replace', 'append'}, default 'fail' - fail: If table exists, do nothing. - replace: If table exists, drop it, recreate it, and insert data. - append: If table exists, insert data. Create if does not exist. \"\"\" Let's say I want to update my dataframe with some new rows df . loc [ len ( df ) + 1 ] = [ 'Cranberry' , 'red' , 308 ] df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 7 Cranberry red 308 8 Cranberry red 308 I can now replace the table with the new values using the \"replace\" option db_file = \". \\\\ mydatabase.db\" try : with sqlite3 . connect ( db_file ) as conn : df . 
to_sql ( \"tutyfrutty\" , conn , if_exists = \"replace\" ) except Error as e : print ( e )","tags":"Python","url":"redoules.github.io/python/Writing_data_to_a_sql_database_with_pandas.html","loc":"redoules.github.io/python/Writing_data_to_a_sql_database_with_pandas.html"},{"title":"Creating a sqlite database","text":"When you want to start using databases, SQLite is a great tool. It provides an easy onramp to learn and prototype your database with a SQL compatible database. First, let's import the libraries we need import sqlite3 from sqlite3 import Error SQLite doesn't need a database server, however, you have to start by creating an empty database file import os def check_for_db_file (): if os . path . exists ( \"mydatabase.db\" ): print ( \"the database is ready\" ) else : print ( \"no database found\" ) check_for_db_file () no database found Let's then create a function that will connect to a database, print the version of sqlite and then close the connection to the database. def create_database ( db_file ): \"\"\" create a database connection to a SQLite database \"\"\" try : with sqlite3 . connect ( db_file ) as conn : print ( \"database created with sqlite3 version {0}\" . format ( sqlite3 . version )) except Error as e : print ( e ) create_database ( \".\\mydatabase.db\" ) database created with sqlite3 version 2.6.0 check_for_db_file () the database is ready You're all set. From now on, you can open the database and write sql queries into it.","tags":"Python","url":"redoules.github.io/python/Creating_a_sqlite_database.html","loc":"redoules.github.io/python/Creating_a_sqlite_database.html"},{"title":"Setting up the notebook for plotting with matplotlib","text":"Importing Matplotlib First we need to import pyplot, a collection of command style functions that make matplotlib work like MATLAB. 
Let's also use the magic command %matplotlib inline in order to display the figures in the notebook import matplotlib.pyplot as plt % matplotlib inline # this doubles image size, but we'll do it manually below # %config InlineBackend.figure_format = 'retina' The following parameters are recommended for matplotlib; they will make matplotlib output a better quality image # %load snippets/matplot_setup.py plt . rcParams [ 'savefig.dpi' ] = 300 plt . rcParams [ 'figure.dpi' ] = 163 plt . rcParams [ 'figure.autolayout' ] = False plt . rcParams [ 'figure.figsize' ] = 20 , 12 plt . rcParams [ 'axes.labelsize' ] = 18 plt . rcParams [ 'axes.titlesize' ] = 20 plt . rcParams [ 'font.size' ] = 16 plt . rcParams [ 'lines.linewidth' ] = 2.0 plt . rcParams [ 'lines.markersize' ] = 8 plt . rcParams [ 'legend.fontsize' ] = 14 plt . rcParams [ 'text.usetex' ] = False # True activates latex output in fonts! plt . rcParams [ 'font.family' ] = \"serif\" plt . rcParams [ 'font.serif' ] = \"cm\" plt . rcParams [ 'text.latex.preamble' ] = \" \\\\ usepackage{subdepth}, \\\\ usepackage{type1cm}\" You can change the second line in order to fit your display. 163 dpi corresponds to a Dell Ultra HD 4k P2715Q. You can check your screen's dpi count at http://dpi.lv/","tags":"Python","url":"redoules.github.io/python/Setting_up_the_notebook_for_plotting_with_matplotlib.html","loc":"redoules.github.io/python/Setting_up_the_notebook_for_plotting_with_matplotlib.html"},{"title":"Why using a blockchain is a bad idea for your business","text":"What does having a blockchain imply? 
storage costs : everyone maintaining the ledger needs to store every transaction bandwidth costs : everyone has to broadcast every transaction computational costs : every node has to validate the blockchain control : the creator does not control the blockchain, everyone collectively controls it development costs : developing on a blockchain is way harder than on a traditional database What to ask a business when they tell you that they are using a blockchain? When a business is telling you about their innovative technology leveraging the power of the blockchain this should immediately spark some questions : What is the consensus algorithm? who is responsible for validating the consensus rules? what is the nature of the participation ? is it open to access? is it open to innovation? is it a public ledger? is it transparent? does it improve accountability? is it cross borders? how is it validated?","tags":"Cryptocurrencies","url":"redoules.github.io/cryptocurrencies/blockchain_bad.html","loc":"redoules.github.io/cryptocurrencies/blockchain_bad.html"},{"title":"Synology NFS share","text":"Setting up an NFS share Log in to your DSM admin account, open the \"Control Panel\" and go to \"File Services\" Make sure NFS is enabled Back in the control panel, go to \"Shared Folder\" Select the folder you want to share and click \"Edit\" Go to the \"NFS Permissions\" tab and click \"Create\", add the ip of the device you want to mount the mapped drive on. Make sure you copy the \"Mount path\"","tags":"Linux","url":"redoules.github.io/linux/share_nfs_share.html","loc":"redoules.github.io/linux/share_nfs_share.html"},{"title":"Mount a NFS share using fstab","text":"Mount nfs using fstab The fstab file, generally located at /etc/fstab lists the different partitions and where to load them on the filesystem. 
You can edit this file as root by using the following command sudo nano /etc/fstab In the following example, we want to mount a NFS v3 share from : server : 192.168.1.2 mountpoint (on the server) : /volumeUSB2/usbshare * mount location (on the client) : /mnt we specify 192.168.1.2:/volumeUSB2/usbshare /mnt nfs nfsvers=3,users 0 0 the client will then automatically mount the share on /mnt at startup. Related : you can reload the fstab file using this method : https://redoules.github.io/linux/Reloading_fstab.html You can create a NFS share on a Synology using the method : https://redoules.github.io/linux/share_nfs_share.html","tags":"Linux","url":"redoules.github.io/linux/mount_nfs_share_fstab.html","loc":"redoules.github.io/linux/mount_nfs_share_fstab.html"},{"title":"Installing bitcoind on raspberry pi","text":"Installing bitcoind on linux Running a full bitcoin node helps the bitcoin network to accept, validate and relay transactions. If you want to volunteer some spare computing and bandwidth resources to run a full node and allow Bitcoin to continue to grow you can grab an inexpensive and power efficient raspberry pi and turn it into a full node. There are plenty of tutorials on the Internet explaining how to install a bitcoin full node; this tutorial won't go over setting up a raspberry pi and using ssh. In order to store the full blockchain we will mount a network drive and tell bitcoind to use this mapped drive as the data directory. Download the bitcoin client Go to https://bitcoin.org/en/download Copy the URL for the ARM 32 bit version and download it onto your raspberry pi. 
wget https://bitcoin.org/bin/bitcoin-core-0.15.1/bitcoin-0.15.1-arm-linux-gnueabihf.tar.gz Locate the downloaded file and extract it using the argument xzf tar xzf bitcoin-0.15.1-arm-linux-gnueabihf.tar.gz a new directory bitcoin-0.15.1 will be created; it contains the files we need to install the software Install the bitcoin client We will install the content by copying the binaries located in the bin folder into /usr/local/bin by using the install command. You must use sudo because it will write data to a system directory sudo install -m 0755 -o root -g root -t /usr/local/bin bitcoin-0.15.1/bin/* Launch the bitcoin core client by running bitcoind -daemon Configuration of the node Start your node at boot Starting your node automatically at boot time is a good idea because it doesn't require a manual action from the user. The simplest way to achieve this is to create a cronjob. Run the following command crontab -e Select the text editor of your choice, then add the following line at the end of the file @reboot bitcoind -daemon Save the file and exit; the updated crontab file will be installed for you. Full Node If you can afford to download and store all the blockchain, you can run a full node. At the time of writing, the blockchain is 150 GB ( https://blockchain.info/fr/charts/blocks-size ). Three ways to store this are : use a microSD with 256 GB or more add a thumbdrive or an external drive to your raspberry pi * mount a network drive from a NAS If you have purchased a big SD card then you can leave the default location for the blockchain data (~/.bitcoin/). 
Otherwise, you will have to change the datadir location to where your drive is mounted (in my case I have mounted it to /mnt) In order to configure your bitcoin client, edit/create the file bitcoin.conf located in ~/.bitcoin/ nano ~/.bitcoin/bitcoin.conf copy the following text # From redoules.github.io # This config should be placed in following path: # ~/.bitcoin/bitcoin.conf # [core] # Specify a non-default location to store blockchain and other data. datadir=/mnt # Set database cache size in megabytes; machines sync faster with a larger cache. Recommend setting as high as possible based upon mach$ dbcache=100 # Keep at most <n> unconnectable transactions in memory. maxorphantx=10 # Keep the transaction memory pool below <n> megabytes. maxmempool=50 # [network] # Maintain at most N connections to peers. maxconnections=40 # Tries to keep outbound traffic under the given target (in MiB per 24h), 0 = no limit. maxuploadtarget=5000 Check https://jlopp.github.io/bitcoin-core-config-generator ; it is a handy site to edit the bitcoin.conf file Pruning node If you don't want to store the entire blockchain you can run a pruning node which reduces storage requirements by enabling pruning (deleting) of old blocks. Let's say you want to allocate at most 5 GB to the blockchain, then specify prune=5000 into your bitcoin.conf file. Edit/create the file bitcoin.conf located in ~/.bitcoin/ nano ~/.bitcoin/bitcoin.conf copy the following text # From redoules.github.io # This config should be placed in following path: # ~/.bitcoin/bitcoin.conf # [core] # Set database cache size in megabytes; machines sync faster with a larger cache. Recommend setting as high as possible based upon mach$ dbcache=100 # Keep at most <n> unconnectable transactions in memory. maxorphantx=10 # Keep the transaction memory pool below <n> megabytes. maxmempool=50 # Reduce storage requirements by only storing most recent N MiB of block. This mode is incompatible with -txindex and -rescan. 
WARNING: Reverting this setting requires re-downloading the entire blockchain. (default: 0 = disable pruning blocks, 1 = allow manual pruning via RPC, greater than 550 = automatically prune blocks to stay under target size in MiB). prune=5000 # [network] # Maintain at most N connections to peers. maxconnections=40 # Tries to keep outbound traffic under the given target (in MiB per 24h), 0 = no limit. maxuploadtarget=5000 Checking if your node is public One of the best ways to help the bitcoin network is to allow your node to be visible and to propagate blocks to other nodes. The bitcoin protocol uses port 8333, other clients should be able to share information with your client. Run ifconfig and check if you have an ipv6 address (look for adr inet6:) IPV6 Get the global ipv6 address of your raspberry pi Link encap:Ethernet HWaddr xx:xx:xx:xx:xx:xx inet adr:192.168.1.x Bcast:192.168.1.255 Masque:255.255.255.0 adr inet6: xxxx::xxxx:xxxx:xxxx:xxxx/64 Scope:Lien adr inet6: xxxx:xxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx/64 Scope:Global UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:42681744 errors:0 dropped:0 overruns:0 frame:0 TX packets:38447218 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 lg file transmission:1000 RX bytes:3044414780 (2.8 GiB) TX bytes:2599878680 (2.4 GiB) it is the adr inet6 line marked Scope:Global adr inet6: xxxx:xxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx/64 Scope:Global Copy this address and paste it into the search field on https://bitnodes.earn.com/ If your node is visible, it will appear on the website IPV4 If you don't have an ipv6 address, you will have to open port 8333 on your router and redirect it to the internal IP of your raspberry pi. 
It is not detailed here because the configuration depends on your router.","tags":"Cryptocurrencies","url":"redoules.github.io/cryptocurrencies/Installing_bitcoind_on_raspberry_pi.html","loc":"redoules.github.io/cryptocurrencies/Installing_bitcoind_on_raspberry_pi.html"},{"title":"Reloading .bashrc","text":"Reload .bashrc The .bashrc file, located at ~/.bashrc allows a user to personalize their bash shell. If you edit this file, the changes won't be loaded without logging out and back in. However, you can use the following command to do it source ~/.bashrc","tags":"Linux","url":"redoules.github.io/linux/Reloading_.bashrc.html","loc":"redoules.github.io/linux/Reloading_.bashrc.html"},{"title":"Reloading fstab","text":"Reload fstab The fstab file, generally located at /etc/fstab lists the different partitions and where to load them on the filesystem. If you edit this file, the changes won't be automounted. You either have to reboot your system or use the following command as root mount -a","tags":"Linux","url":"redoules.github.io/linux/Reloading_fstab.html","loc":"redoules.github.io/linux/Reloading_fstab.html"},{"title":"Updating all python package with anaconda","text":"Updating anaconda packages All packages managed by conda can be updated with the following command : conda update --all Updating other packages with pip For the other packages, the pip package manager can be used. Unfortunately pip doesn't have the same update-all functionality. import pip from subprocess import call for dist in pip . get_installed_distributions (): print ( \"updating {0}\" . format ( dist )) call ( \"pip install --upgrade \" + dist . 
project_name , shell = True )","tags":"Python","url":"redoules.github.io/python/updating_all_python_package_with_anaconda.html","loc":"redoules.github.io/python/updating_all_python_package_with_anaconda.html"},{"title":"Saving a matplotlib figure with a high resolution","text":"creating a matplotlib figure #Importing matplotlib % matplotlib inline import matplotlib.pyplot as plt import numpy as np Drawing a figure # Fixing random state for reproducibility np . random . seed ( 19680801 ) mu , sigma = 100 , 15 x = mu + sigma * np . random . randn ( 10000 ) # the histogram of the data n , bins , patches = plt . hist ( x , 50 , normed = 1 , facecolor = 'g' , alpha = 0.75 ) plt . xlabel ( 'Smarts' ) plt . ylabel ( 'Probability' ) plt . title ( 'Histogram of IQ' ) plt . text ( 60 , . 025 , r '$\\mu=100,\\ \\sigma=15$' ) plt . axis ([ 40 , 160 , 0 , 0.03 ]) plt . grid ( True ) plt . show () Saving the figure Normally, one would use the following code plt . savefig ( 'filename.png' ) &lt;matplotlib.figure.Figure at 0x2e45e92f400&gt; The figure is then exported to the file \"filename.png\" with a standard resolution. In addition, you can specify the dpi arg to some scalar value, for example: plt . savefig ( 'filename_hi_dpi.png' , dpi = 300 ) &lt;matplotlib.figure.Figure at 0x2e462164898&gt;","tags":"Python","url":"redoules.github.io/python/Saving_a_matplotlib_figure_with_a_high_resolution.html","loc":"redoules.github.io/python/Saving_a_matplotlib_figure_with_a_high_resolution.html"},{"title":"Iterating over a DataFrame","text":"Create a sample dataframe # Import modules import pandas as pd # Example dataframe raw_data = { 'fruit' : [ 'Banana' , 'Orange' , 'Apple' , 'lemon' , \"lime\" , \"plum\" ], 'color' : [ 'yellow' , 'orange' , 'red' , 'yellow' , \"green\" , \"purple\" ], 'kcal' : [ 89 , 47 , 52 , 15 , 30 , 28 ] } df = pd . 
DataFrame ( raw_data , columns = [ 'fruit' , 'color' , 'kcal' ]) df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 Using the iterrows method Pandas DataFrames can return a generator with the iterrows method. It can then be used to loop over the rows of the DataFrame for index , row in df . iterrows (): print ( \"At line {0} there is a {1} which is {2} and contains {3} kcal\" . format ( index , row [ \"fruit\" ], row [ \"color\" ], row [ \"kcal\" ])) At line 0 there is a Banana which is yellow and contains 89 kcal At line 1 there is a Orange which is orange and contains 47 kcal At line 2 there is a Apple which is red and contains 52 kcal At line 3 there is a lemon which is yellow and contains 15 kcal At line 4 there is a lime which is green and contains 30 kcal At line 5 there is a plum which is purple and contains 28 kcal","tags":"Python","url":"redoules.github.io/python/Iterating_over_a_dataframe.html","loc":"redoules.github.io/python/Iterating_over_a_dataframe.html"},{"title":"Article Recommander","text":"import pandas as pd import numpy as np % matplotlib inline Loading data and preprocessing We first load the pickled article database. We will be cleaning it and separating the interesting articles from the uninteresting ones. df = pd . read_pickle ( './article.pkl' ) del df [ \"html\" ] del df [ \"image\" ] del df [ \"URL\" ] del df [ \"hash\" ] del df [ \"source\" ] df [ \"label\" ] = df [ \"note\" ] . apply ( lambda x : 0 if x <= 0 else 1 ) df . head ( 5 ) authors note resume texte titre label 0 [Danny Bradbury, Marco Santori, Adam Draper, M... 
-10.0 Black Market Reloaded, a black market site tha... Black Market Reloaded, a black market site tha... Black Market Reloaded back online after source... 0 1 [Emily Spaven, Stan Higgins, Emilyspaven] 1.0 The UK Home Office believes the government sho... The UK Home Office believes the government sho... Home Office: UK Should Create a Crime-Fighting... 1 2 [Pete Rizzo, Alex Batlin, Yessi Bello Perez, P... -10.0 Though lofty in its ideals, lead developer Dan... A new social messaging app is aiming to disrup... Gems Bitcoin App Lets Users Earn Money From So... 0 3 [Nermin Hajdarbegovic, Stan Higgins, Pete Rizz... 3.0 US satellite service provider DISH Network has... US satellite service provider DISH Network has... DISH Becomes World's Largest Company to Accept... 1 4 [Stan Higgins, Bailey Reutzel, Garrett Keirns,... -10.0 An unidentified 28-year-old man was robbed of ... An unidentified 28-year-old man was robbed of ... Bitcoin Stolen at Gunpoint in New York City Ro... 0 Basic statistics on the dataset Let's explore the dataset and extract some numbers : * the number of articles liked/disliked df [ \"label\" ] . value_counts () 0 879 1 324 Name: label, dtype: int64 Create the full content column df [ 'full_content' ] = df . titre + ' ' + df . resume #exclude the full texte of the article for the moment df . head ( 1 ) authors note resume texte titre label full_content 0 [Danny Bradbury, Marco Santori, Adam Draper, M... -10.0 Black Market Reloaded, a black market site tha... Black Market Reloaded, a black market site tha... Black Market Reloaded back online after source... 0 Black Market Reloaded back online after source... 
from sklearn.model_selection import train_test_split training , testing = train_test_split ( df , # The dataset we want to split train_size = 0.75 , # The proportional size of our training set stratify = df . label , # The labels are used for stratification random_state = 400 # Use the same random state for reproducibility ) training . head ( 5 ) authors note resume texte titre label full_content 748 [Jon Brodkin] -10.0 Amazon, Reddit, Mozilla, and other Internet co... Amazon, Reddit, Mozilla, and other Internet co... Amazon and Reddit try to save net neutrality r... 0 Amazon and Reddit try to save net neutrality r... 1183 [Jon Brodkin] -10.0 (The Time Warner involved in this transaction ... A group of mostly Democratic senators led by A... Democrats urge Trump administration to block A... 0 Democrats urge Trump administration to block A... 769 [Joseph Brogan] -10.0 On Twitter, bad news comes at all hours, with ... On Twitter, bad news comes at all hours, with ... Some of the best art on Twitter comes from the... 0 Some of the best art on Twitter comes from the... 57 [Michael Del Castillo, Pete Rizzo, Trond Vidar... -10.0 Publicly traded online travel service Webjet i... Publicly traded online travel service Webjet i... Webjet Ethereum Pilot Targets Hotel Industry's... 0 Webjet Ethereum Pilot Targets Hotel Industry's... 892 [Andrew Cunningham] 10.0 What has changed on the 2017 MacBook, then?\\nI... Andrew Cunningham\\n\\nAndrew Cunningham\\n\\nAndr... Mini-review: The 2017 MacBook could actually b... 1 Mini-review: The 2017 MacBook could actually b... 
from sklearn.feature_extraction.text import TfidfVectorizer , CountVectorizer from sklearn.svm import LinearSVC , SVC from sklearn.pipeline import Pipeline from sklearn.model_selection import cross_val_predict from utils.plotting import pipeline_performance steps = ( ( 'vectorizer' , TfidfVectorizer ()), ( 'classifier' , LinearSVC ()) ) pipeline = Pipeline ( steps ) predicted_labels = cross_val_predict ( pipeline , training . full_content , training . label ) pipeline_performance ( training . label , predicted_labels ) pipeline = pipeline . fit ( training . titre , training . label ) Accuracy = 80.6% Confusion matrix, without normalization [[624 35] [140 103]] import re from utils.plotting import print_top_features from sklearn.model_selection import GridSearchCV def mask_integers ( s ): return re . sub ( r '\\d+' , 'INTMASK' , s ) steps = ( ( 'vectorizer' , TfidfVectorizer ()), ( 'classifier' , LinearSVC ()) ) pipeline = Pipeline ( steps ) gs_params = { #'vectorizer__use_idf': (True, False), 'vectorizer__lowercase' : [ True , False ], 'vectorizer__stop_words' : [ 'english' , None ], 'vectorizer__ngram_range' : [( 1 , 1 ), ( 1 , 2 ), ( 2 , 2 )], 'vectorizer__preprocessor' : [ mask_integers , None ], 'classifier__C' : np . linspace ( 5 , 20 , 25 ) } gs = GridSearchCV ( pipeline , gs_params , n_jobs = 1 ) gs . fit ( training . full_content , training . label ) print ( gs . best_params_ ) print ( gs . best_score_ ) pipeline1 = gs . best_estimator_ predicted_labels = pipeline1 . predict ( testing . full_content ) pipeline_performance ( testing . label , predicted_labels ) print_top_features ( pipeline1 , n_features = 10 ) aaa = gs . predict ( testing . full_content ) == testing . label aaa = aaa [ testing . label == 1 ] testing [ \"titre\" ] . iloc [ ~ aaa . values ] #pipeline1.predict([\"windows xbox bitcoin\"]) from sklearn.externals import joblib joblib . dump ( pipeline1 , 'classifier.pkl' ) gs . 
predict ([ 'Google' ]) array([1], dtype=int64) steps = ( ( 'vectorizer' , TfidfVectorizer ()), ( 'classifier' , SVC ()) ) pipeline = Pipeline ( steps ) gs_params = { #'vectorizer__use_idf': (True, False), 'vectorizer__stop_words' : [ 'english' , None ], 'vectorizer__ngram_range' : [( 1 , 1 ), ( 1 , 2 ), ( 2 , 2 )], 'vectorizer__preprocessor' : [ mask_integers , None ], 'classifier__C' : np . linspace ( 5 , 20 , 25 ) } gs = GridSearchCV ( pipeline , gs_params , n_jobs = 1 ) gs . fit ( training . full_content , training . label ) print ( gs . best_params_ ) print ( gs . best_score_ ) pipeline1 = gs . best_estimator_ predicted_labels = pipeline1 . predict ( testing . full_content ) pipeline_performance ( testing . label , predicted_labels ) print_top_features ( pipeline1 , n_features = 10 ) {'classifier__C': 5.0, 'vectorizer__ngram_range': (1, 1), 'vectorizer__preprocessor': &lt;function mask_integers at 0x00000237491B67B8&gt;, 'vectorizer__stop_words': 'english'} 0.711180124224 Accuracy = 71.2% Confusion matrix, without normalization [[153 0] [ 62 0]] --------------------------------------------------------------------------- ValueError Traceback (most recent call last) &lt;ipython-input-9-3e0781e307fb&gt; in &lt;module&gt;() 25 pipeline_performance(testing.label, predicted_labels) 26 ---&gt; 27 print_top_features(pipeline1, n_features=10) C:\\Users\\Guillaume\\Documents\\Code\\recommandation\\utils\\plotting.py in print_top_features(pipeline, vectorizer_name, classifier_name, n_features) 81 def print_top_features(pipeline, vectorizer_name='vectorizer', classifier_name='classifier', n_features=7): 82 vocabulary = np.array(pipeline.named_steps[vectorizer_name].get_feature_names()) ---&gt; 83 coefs = pipeline.named_steps[classifier_name].coef_[0] 84 top_feature_idx = np.argsort(coefs) 85 top_features = vocabulary[top_feature_idx] C:\\Users\\Guillaume\\Anaconda3\\lib\\site-packages\\sklearn\\svm\\base.py in coef_(self) 483 def coef_(self): 484 if self.kernel != 
'linear': --&gt; 485 raise ValueError('coef_ is only available when using a ' 486 'linear kernel') 487 ValueError: coef_ is only available when using a linear kernel from sklearn.naive_bayes import BernoulliNB steps = ( ( 'vectorizer' , TfidfVectorizer ()), ( 'classifier' , BernoulliNB ()) ) pipeline2 = Pipeline ( steps ) gs_params = { 'vectorizer__stop_words' : [ 'english' , None ], 'vectorizer__ngram_range' : [( 1 , 1 ), ( 1 , 2 ), ( 2 , 2 )], 'vectorizer__preprocessor' : [ mask_integers , None ], 'classifier__alpha' : np . linspace ( 0 , 1 , 5 ), 'classifier__fit_prior' : [ True , False ] } gs = GridSearchCV ( pipeline2 , gs_params , n_jobs = 1 ) gs . fit ( training . full_content , training . label ) print ( gs . best_params_ ) print ( gs . best_score_ ) pipeline2 = gs . best_estimator_ predicted_labels = pipeline2 . predict ( testing . full_content ) pipeline_performance ( testing . label , predicted_labels ) print_top_features ( pipeline2 , n_features = 10 ) C:\\Users\\Guillaume\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:801: RuntimeWarning: divide by zero encountered in log self.feature_log_prob_ = (np.log(smoothed_fc) - C:\\Users\\Guillaume\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:820: RuntimeWarning: divide by zero encountered in log neg_prob = np.log(1 - np.exp(self.feature_log_prob_)) C:\\Users\\Guillaume\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:823: RuntimeWarning: invalid value encountered in add jll += self.class_log_prior_ + neg_prob.sum(axis=1) {'classifier__alpha': 0.25, 'classifier__fit_prior': True, 'vectorizer__ngram_range': (1, 1), 'vectorizer__preprocessor': &lt;function mask_integers at 0x00000237491B67B8&gt;, 'vectorizer__stop_words': 'english'} 0.805900621118 Accuracy = 78.1% Confusion matrix, without normalization [[140 13] [ 34 28]] Top like features: ['use' 'just' 'year' 'price' 'time' 'Bitcoin' 'bitcoin' 'new' 'The' 'INTMASK'] --- Top dislike features: ['ABBA' 'cable' 'cab' 'byte' 'publication' 'bye' 'publications' 'publicity' 'buyer' 'publicizing'] from sklearn.naive_bayes import MultinomialNB steps = ( ( 'vectorizer' , TfidfVectorizer ()), ( 'classifier' , MultinomialNB ()) ) pipeline3 = Pipeline ( steps ) gs_params = { 'vectorizer__stop_words' : [ 'english' , None ], 'vectorizer__ngram_range' : [( 1 , 1 ), ( 1 , 2 ), ( 2 , 2 )], 'vectorizer__preprocessor' : [ mask_integers , None ], 'classifier__alpha' : np . linspace ( 0 , 1 , 5 ), 'classifier__fit_prior' : [ True , False ] } gs = GridSearchCV ( pipeline3 , gs_params , n_jobs = 1 ) gs . 
fit ( training . full_content , training . label ) print ( gs . best_params_ ) print ( gs . best_score_ ) pipeline3 = gs . best_estimator_ predicted_labels = pipeline3 . predict ( testing . full_content ) pipeline_performance ( testing . label , predicted_labels ) print_top_features ( pipeline3 , n_features = 10 ) C:\\Users\\Guillaume\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:699: RuntimeWarning: divide by zero encountered in log self.feature_log_prob_ = (np.log(smoothed_fc) - {'classifier__alpha': 0.5, 'classifier__fit_prior': False, 'vectorizer__ngram_range': (1, 1), 'vectorizer__preprocessor': &lt;function mask_integers at 0x00000237491B67B8&gt;, 'vectorizer__stop_words': 'english'} 0.80900621118 Accuracy = 79.1% Confusion matrix, without normalization [[141 12] [ 33 29]] Top like features: ['time' 'Google' 'Pro' 'Apple' 'new' 'The' 'Bitcoin' 'price' 'bitcoin' 'INTMASK'] --- Top dislike features: ['ABBA' 'categories' 'catching' 'catalyst' 'catalog' 'casually' 'casts' 'cast' 'cashier' 'ran']","tags":"Machine 
Learning","url":"redoules.github.io/machine-learning/Source code for the recommandation engine for articles.html","loc":"redoules.github.io/machine-learning/Source code for the recommandation engine for articles.html"}]}