CSV files into SQLite database

This bit of code will transfer a whole bunch of csv files in the ‘logs/’ directory into a single SQLite db file called repo_data.db.

import pandas as pd
import re
import os
import sqlite3 as sql

# create list of all the different files
list_files = [x for x in os.listdir('logs') if x.endswith('.csv')]

# create a new file with a connection
db_con = sql.connect('repo_data.db')

# for 8-bit string instead of unicode...why?
db_con.text_factory = str

# now go through the things and create a table for each
for str_raw_name in list_files:

    # clean name and lower
    str_name = re.sub('.csv', '', str_raw_name).lower()

    # open the file and create 
    df = pd.read_csv('logs/%s' % str_raw_name)

    # pass on to the database
    df.to_sql(str_name, con = db_con, index=False)

    print 'DONE', str_name

NOTE: I kept getting an error about some 8-bit blah blah blah text thing. I googled it and fixed it (line 13), but I still have no idea what it means.

Using the .isin() function in Pandas


Searching and selecting data from a dataframe using the .isin() function.

The .isin() function is a powerful tool that can help you search search for a number of values in a data frame.

This is how it’s done.

We start by creating a simple data frame

import pandas as pd

df = pd.DataFrame({'col_1':[1,2,3,4],'col_2':[2,3,4,1]})

The data frame should look something like this

col_1  col_2
0      1      2
1      2      3
2      3      4
3      4      1

Now, we will use the .isin() function to select all the rows that have either the number 1 or 4 in col_1, and we’ll put them in a new data frame called df_14.

We do this by using a list of the values and placing that list in the .isin() function.

either like this, if we have a short list…

df_14 = df[df['col_1'].isin([1,4])]

…or like this, when we have a longer list.

list_numbers = [1,4]
df_14 = df[df['col_1'].isin(list_numbers)]

The new data frame should look like this

col_1  col_2
0      1      2
3      4      1

Simple.

But what if we want all the values that DO NOT match those in the list?

We can do this by adding ==False to the function. Like this

df_not_14 = df[df['col_1'].isin([1,4])==False]

and this new data frame would look like this

col_1  col_2
1      2      3
2      3      4

Finding and deleting empty file

A short Python script to find and delete all empty files.

#!/usr/local/bin/python
# -*- coding: utf-8 -*-
#
# search and delete empty files

import os

# this is the directory
str_directory = 'myfolder'

# get list of all files in the directory and remove possible hidden files
list_files = [x for x in os.listdir(str_directory) if x[0]!='.']

# now loop through the files and remove empty ones
for each_file in list_files:
    file_path = '%s/%s' % (str_directory, each_file)

    # check size and delete if 0
    if os.path.getsize(file_path)==0:
        os.remove(file_path)
    else:
        pass

Dealing with dates in Pandas

A few lines on how to deal with dates and date formating issues in Pandas.

First thing first, import the pandas module and read the csv file

>>> import pandas as pd
>>> df = pd.read_csv('path/to/file.csv', encoding='utf-8')

So, we have a simple data frame like this…

>>> df
        start         end
0  2001-06-01  2004-02-01
1  2001-11-01  2003-12-01
2  2005-04-01  2007-03-01
3  2005-05-01  2007-03-01

…and we want to calculate the amount of time between the start and the end.

The main problems is that the dates are read as string (when read from a file), and therefore there is very little we can do with this right now.

To change the column to dates, we can use the to_datetime function in pandas. We also don’t want to be completely destructive and risk messing up the data, so we put the newly formatted dates into new columns (start_d and end_d), like this…

>>> df['start_d'] = pd.to_datetime(df['start'])
>>> df['end_d'] = pd.to_datetime(df['end'])

Now, our data frame should look a bit like this

>>> df
        start         end    start_d      end_d
0  2001-06-01  2004-02-01 2001-06-01 2004-02-01
1  2001-11-01  2003-12-01 2001-11-01 2003-12-01
2  2005-04-01  2007-03-01 2005-04-01 2007-03-01
3  2005-05-01  2007-03-01 2005-05-01 2007-03-01

to calculate the length of time between start and end, we simply subtract start_d from end_d, like this

>>> df['len'] = df['end_d'] - df['start_d']

which will result in the difference being calculated in days, leaving the data frame looking like this

        start         end    start_d      end_d      len
0  2001-06-01  2004-02-01 2001-06-01 2004-02-01 975 days
1  2001-11-01  2003-12-01 2001-11-01 2003-12-01 760 days
2  2005-04-01  2007-03-01 2005-04-01 2007-03-01 699 days
3  2005-05-01  2007-03-01 2005-05-01 2007-03-01 669 days