Adding colour map to stacked bar plot

We start with the bar plot I created here, which gave us this…

Now we want to give it some nice colours.

We can do this by importing the colour map from Matplotlib and extracting a range of colours. I like the Viridis colormap, so I'll use that. Since we have 5 variables in our stacked bar plot, we only want 5 colours from it.

from matplotlib import cm

# get the colormap and extract 5 colours
viridis = cm.get_cmap('viridis', 5)

You can get the individual values like this…

print(viridis.colors)

Which should give you something like this

[[0.267004 0.004874 0.329415 1.      ]
 [0.229739 0.322361 0.545706 1.      ]
 [0.127568 0.566949 0.550556 1.      ]
 [0.369214 0.788888 0.382914 1.      ]
 [0.993248 0.906157 0.143936 1.      ]]

Now we can assign each of these colours to our plot. Let's do the first one as an example…

p1 = plt.bar(xMain,
             dataVar1,
             color=viridis.colors[0])

Once you do the same for the other 4, you should get something that looks like this.

So much nicer!

The complete code looks like this:

#!/usr/local/bin/python3
# -*- coding: utf-8 -*-
#
# make a stackbar for data

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm

# open the file
df = pd.read_csv('data_file.csv')

# get the data you want to plot
dataVar1 = df['column1']
dataVar2 = df['column2']
dataVar3 = df['column3']
dataVar4 = df['column4']
dataVar5 = df['column5']

# number of entries (x axis)
xMain = np.arange(30)

# color map
viridis = cm.get_cmap('viridis', 5)

# plot each data
p1 = plt.bar(xMain,
             dataVar1,
             color=viridis.colors[0])

p2 = plt.bar(xMain, 
             dataVar2, 
             bottom=dataVar1, 
             color=viridis.colors[1])

p3 = plt.bar(xMain,
             dataVar3,
             bottom=dataVar1+dataVar2,
             color=viridis.colors[2])

p4 = plt.bar(xMain,
             dataVar4,
             bottom=dataVar1+dataVar2+dataVar3,
             color=viridis.colors[3])

p5 = plt.bar(xMain,
             dataVar5,
             bottom=dataVar1+dataVar2+dataVar3+dataVar4,
             color=viridis.colors[4])

# save the graph
plt.savefig('output_fig.png')

Stacked bar plot using matplotlib

How to create a simple stacked bar chart.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# open data file
df = pd.read_csv('data_file.csv')

# get the data you want to plot
dataVar1 = df['column1']
dataVar2 = df['column2']
dataVar3 = df['column3']
dataVar4 = df['column4']
dataVar5 = df['column5']

# number of entries (x axis)
xMain = np.arange(30)

# plot each data
p1 = plt.bar(xMain, 
             dataVar1)

p2 = plt.bar(xMain, 
             dataVar2,
             bottom=dataVar1)

p3 = plt.bar(xMain,
             dataVar3,
             bottom=dataVar1+dataVar2)

p4 = plt.bar(xMain,
             dataVar4,
             bottom=dataVar1+dataVar2+dataVar3)

p5 = plt.bar(xMain,
             dataVar5,
             bottom=dataVar1+dataVar2+dataVar3+dataVar4)

# save the graph
plt.savefig('output_fig.png')

The bottom=dataVarN option is needed to stack the data; otherwise each new set of data will be drawn starting at y=0.

If you do it right, you’ll get something like this.
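Writing out the running sums by hand gets tedious once you have many series. Here's a sketch of the same stacking done with np.cumsum, assuming the five series live in one 2-D array (random data stands in for the CSV columns):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # draw without a display
import matplotlib.pyplot as plt

data = np.random.rand(5, 30)  # 5 series, 30 bars each
xMain = np.arange(30)

# the bottom of series i is the sum of all the series below it
bottoms = np.vstack([np.zeros(30), np.cumsum(data, axis=0)[:-1]])

for series, bottom in zip(data, bottoms):
    plt.bar(xMain, series, bottom=bottom)

plt.savefig('output_fig.png')
```

This scales to any number of series without hand-writing the bottom sums.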

Changing mongo’s db location

Stopping mongo

If it's already running, stop it

brew services stop mongo

If all goes well, it should return something like this

Stopping `mongodb`... (might take a while)
==> Successfully stopped `mongodb` (label: homebrew.mxcl.mongodb)

Locate the mongo config file and open it

It should be here…

/usr/local/etc/mongo.conf

…which I will edit using Vim

sudo vim /usr/local/etc/mongo.conf

Edit the storage path

The file should look something like this…

systemLog:
  destination: file
  path: /usr/local/var/log/mongodb/mongo.log
  logAppend: true
storage:
  dbPath: /usr/local/var/mongodb
net:
  bindIp: 127.0.0.1

Line 6 (dbPath) is the one we are after; change it to the location you want.

dbPath: /my/new/path/is/here

Save the file and close it.

Start mongo again

brew services start mongo

If everything goes well, Mongo should have added a whole bunch of files to the new directory.

CSV files into SQLite database

This bit of code will transfer a whole bunch of CSV files in the 'logs/' directory into a single SQLite db file called repo_data.db.

import pandas as pd
import re
import os
import sqlite3 as sql

# create list of all the different files
list_files = [x for x in os.listdir('logs') if x.endswith('.csv')]

# create a new file with a connection
db_con = sql.connect('repo_data.db')

# for 8-bit string instead of unicode...why?
db_con.text_factory = str

# now go through the things and create a table for each
for str_raw_name in list_files:

    # clean name and lower
    str_name = re.sub('.csv', '', str_raw_name).lower()

    # open the file and create 
    df = pd.read_csv('logs/%s' % str_raw_name)

    # pass on to the database
    df.to_sql(str_name, con = db_con, index=False)

    print('DONE', str_name)

NOTE: I kept getting an error about some 8-bit blah blah blah text thing. I googled it and fixed it (line 13), but I still have no idea what it means.
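To sanity-check what ended up in the database, you can list its tables from sqlite_master. A minimal sketch using an in-memory database with one stand-in table (swap ':memory:' for 'repo_data.db' to inspect the real file):

```python
import sqlite3 as sql

db_check = sql.connect(':memory:')  # use 'repo_data.db' for the real file
db_check.execute('CREATE TABLE demo_log (id INTEGER)')  # stand-in for a csv table

list_tables = [row[0] for row in
               db_check.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(list_tables)
```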

Create Word document with Python

How to create a simple word document using a python script.

First, download python-docx.

pip3 install python-docx

yes…I’m using Python3 now.

Then, this is how simple the rest is…

from docx import Document

str_title = 'This is the document Title'

# create the document
doc_output = Document()

# add the header
doc_output.add_heading(str_title, 0)

# add a paragraph
doc_output.add_paragraph('This is a paragraph')

# save the file
doc_output.save('testing.docx')

Using the .isin() function in Pandas

Searching and selecting data from a dataframe using the .isin() function.

The .isin() function is a powerful tool that can help you search for a number of values in a data frame.

This is how it’s done.

We start by creating a simple data frame

import pandas as pd

df = pd.DataFrame({'col_1':[1,2,3,4],'col_2':[2,3,4,1]})

The data frame should look something like this

   col_1  col_2
0      1      2
1      2      3
2      3      4
3      4      1

Now, we will use the .isin() function to select all the rows that have either the number 1 or 4 in col_1, and we’ll put them in a new data frame called df_14.

We do this by using a list of the values and placing that list in the .isin() function.

either like this, if we have a short list…

df_14 = df[df['col_1'].isin([1,4])]

…or like this, when we have a longer list.

list_numbers = [1,4]
df_14 = df[df['col_1'].isin(list_numbers)]

The new data frame should look like this

   col_1  col_2
0      1      2
3      4      1

Simple.

But what if we want all the values that DO NOT match those in the list?

We can do this by adding ==False to the condition, like this

df_not_14 = df[df['col_1'].isin([1,4])==False]

and this new data frame would look like this

   col_1  col_2
1      2      3
2      3      4
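Pandas also has a tidier way to write the negation: the ~ operator inverts the boolean mask directly and gives the same result as ==False.

```python
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2, 3, 4], 'col_2': [2, 3, 4, 1]})

# ~ flips every True/False in the mask, keeping rows NOT in the list
df_not_14 = df[~df['col_1'].isin([1, 4])]
print(df_not_14)
```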

Finding and deleting empty files

A short Python script to find and delete all empty files.

#!/usr/local/bin/python
# -*- coding: utf-8 -*-
#
# search and delete empty files

import os

# this is the directory
str_directory = 'myfolder'

# get list of all files in the directory and remove possible hidden files
list_files = [x for x in os.listdir(str_directory) if x[0]!='.']

# now loop through the files and remove empty ones
for each_file in list_files:
    file_path = '%s/%s' % (str_directory, each_file)

    # check size and delete if 0
    if os.path.getsize(file_path)==0:
        os.remove(file_path)
    else:
        pass
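The same job can also be done with pathlib, which makes it easy to guard against directories as well as hidden files. A self-contained sketch that builds a throwaway folder so it can run anywhere:

```python
import tempfile
from pathlib import Path

# throwaway directory with one empty and one non-empty file
str_directory = tempfile.mkdtemp()
(Path(str_directory) / 'empty.txt').touch()
(Path(str_directory) / 'full.txt').write_text('data')

# delete every zero-byte regular file, skipping hidden files
for p in Path(str_directory).iterdir():
    if p.is_file() and not p.name.startswith('.') and p.stat().st_size == 0:
        p.unlink()

list_left = sorted(x.name for x in Path(str_directory).iterdir())
print(list_left)  # -> ['full.txt']
```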

From csv to json

EDIT: The original script is for idiots…smart people use the to_dict() function, then save as a JSON dump.

import json

dataDict = df.to_dict(orient='list')

with open('path/to/file.json', 'w') as fw:
    fw.write(json.dumps(dataDict))
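One thing to note: orient='list' produces one list per column. If you want one dictionary per row instead (the shape the longer script below builds), orient='records' does that:

```python
import json
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# one dictionary per row, rather than one list per column
list_records = df.to_dict(orient='records')
print(json.dumps(list_records))  # -> [{"a": 1, "b": 3}, {"a": 2, "b": 4}]
```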

This is the stupid version …

#!/usr/local/bin/python
# -*- coding: utf-8 -*-

import json
import pandas as pd

# this is the file that will be open
str_file_path = 'sample_pages_data.csv'

# get file name for output
str_name = str_file_path[:-4]

# open the dataframe
df_data = pd.read_csv(str_file_path, encoding='utf-8')

# list of columns
list_cols = df_data.columns

# the output list
list_output = []

# iterate through each row
for df_row in df_data.iterrows():

    # dictionary for each entry
    dic_data = {}

    # loop through each columns
    for str_column in list_cols:

        # add data to dictionary
        dic_data[str_column] = df_row[1][str_column]

    # append dictionary to list_output
    list_output.append(dic_data)

# from dictionary to json and save output file
json_data = json.dumps(list_output, indent=2)

# save json file
with open('%s.json' % str_name, 'w') as fw:
    fw.write(json_data)

Dealing with dates in Pandas

A few lines on how to deal with dates and date formatting issues in Pandas.

First things first, import the pandas module and read the CSV file

>>> import pandas as pd
>>> df = pd.read_csv('path/to/file.csv', encoding='utf-8')

So, we have a simple data frame like this…

>>> df
        start         end
0  2001-06-01  2004-02-01
1  2001-11-01  2003-12-01
2  2005-04-01  2007-03-01
3  2005-05-01  2007-03-01

…and we want to calculate the amount of time between the start and the end.

The main problem is that the dates are read as strings (when read from a file), and therefore there is very little we can do with them right now.

To change the column to dates, we can use the to_datetime function in pandas. We also don’t want to be completely destructive and risk messing up the data, so we put the newly formatted dates into new columns (start_d and end_d), like this…

>>> df['start_d'] = pd.to_datetime(df['start'])
>>> df['end_d'] = pd.to_datetime(df['end'])

Now, our data frame should look a bit like this

>>> df
        start         end    start_d      end_d
0  2001-06-01  2004-02-01 2001-06-01 2004-02-01
1  2001-11-01  2003-12-01 2001-11-01 2003-12-01
2  2005-04-01  2007-03-01 2005-04-01 2007-03-01
3  2005-05-01  2007-03-01 2005-05-01 2007-03-01

To calculate the length of time between start and end, we simply subtract start_d from end_d, like this

>>> df['len'] = df['end_d'] - df['start_d']

which will result in the difference being calculated in days, leaving the data frame looking like this

        start         end    start_d      end_d      len
0  2001-06-01  2004-02-01 2001-06-01 2004-02-01 975 days
1  2001-11-01  2003-12-01 2001-11-01 2003-12-01 760 days
2  2005-04-01  2007-03-01 2005-04-01 2007-03-01 699 days
3  2005-05-01  2007-03-01 2005-05-01 2007-03-01 669 days
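If you want the length as a plain integer rather than a Timedelta, the .dt.days accessor strips the 'days' unit. A quick sketch with the first row from above:

```python
import pandas as pd

df = pd.DataFrame({'start': ['2001-06-01'], 'end': ['2004-02-01']})
df['start_d'] = pd.to_datetime(df['start'])
df['end_d'] = pd.to_datetime(df['end'])

# .dt.days converts the Timedelta column to plain integers
df['len_days'] = (df['end_d'] - df['start_d']).dt.days
print(df['len_days'].iloc[0])  # -> 975
```

Plain integers are handier if you want to plot or aggregate the durations later.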