Dealing with dates in Pandas

A few lines on how to deal with dates and date formating issues in Pandas.

First thing first, import the pandas module and read the csv file

>>> import pandas as pd
>>> df = pd.read_csv('path/to/file.csv', encoding='utf-8')

So, we have a simple data frame like this…

>>> df
        start         end
0  2001-06-01  2004-02-01
1  2001-11-01  2003-12-01
2  2005-04-01  2007-03-01
3  2005-05-01  2007-03-01

…and we want to calculate the amount of time between the start and the end.

The main problems is that the dates are read as string (when read from a file), and therefore there is very little we can do with this right now.

To change the column to dates, we can use the to_datetime function in pandas. We also don’t want to be completely destructive and risk messing up the data, so we put the newly formatted dates into new columns (start_d and end_d), like this…

>>> df['start_d'] = pd.to_datetime(df['start'])
>>> df['end_d'] = pd.to_datetime(df['end'])

Now, our data frame should look a bit like this

>>> df
        start         end    start_d      end_d
0  2001-06-01  2004-02-01 2001-06-01 2004-02-01
1  2001-11-01  2003-12-01 2001-11-01 2003-12-01
2  2005-04-01  2007-03-01 2005-04-01 2007-03-01
3  2005-05-01  2007-03-01 2005-05-01 2007-03-01

to calculate the length of time between start and end, we simply subtract start_d from end_d, like this

>>> df['len'] = df['end_d'] - df['start_d']

which will result in the difference being calculated in days, leaving the data frame looking like this

        start         end    start_d      end_d      len
0  2001-06-01  2004-02-01 2001-06-01 2004-02-01 975 days
1  2001-11-01  2003-12-01 2001-11-01 2003-12-01 760 days
2  2005-04-01  2007-03-01 2005-04-01 2007-03-01 699 days
3  2005-05-01  2007-03-01 2005-05-01 2007-03-01 669 days