5.3 working with time series#

Important

This lesson is still under development.

In this file you will learn about following concepts of pandas

  • DateTimeIndex

  • TimeStamp

  • freq

  • Timedelta

  • offsets

  • resampling

  • Period

import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 20)

print(np.__version__)
print(pd.__version__)
1.26.3
1.5.3

Let’s define a dataframe and check its index

df = pd.DataFrame(np.arange(31))

print(df)
     0
0    0
1    1
2    2
3    3
4    4
..  ..
26  26
27  27
28  28
29  29
30  30

[31 rows x 1 columns]

Since dataframe is nothing but numpy arrays with indexes which means each row and column has a label (index). Therefore, we can also interpret dataframes as indexed numpy arrays. When we create a dataframe, pandas automatically assigns a suitable index (row labels) to it.

print(df.index)
RangeIndex(start=0, stop=31, step=1)

The index is a range from 0 to 1 with step size of 1 and is of of type RangeIndex.

we can verify the type of index

print(type(df.index))
<class 'pandas.core.indexes.range.RangeIndex'>

DateTimeIndex#

Let’s define a more useful index for the dataframe i.e., dates with daily time step

index = [f"2011-01-{i}" for i in range(1, 32)]

print(index)
['2011-01-1', '2011-01-2', '2011-01-3', '2011-01-4', '2011-01-5', '2011-01-6', '2011-01-7', '2011-01-8', '2011-01-9', '2011-01-10', '2011-01-11', '2011-01-12', '2011-01-13', '2011-01-14', '2011-01-15', '2011-01-16', '2011-01-17', '2011-01-18', '2011-01-19', '2011-01-20', '2011-01-21', '2011-01-22', '2011-01-23', '2011-01-24', '2011-01-25', '2011-01-26', '2011-01-27', '2011-01-28', '2011-01-29', '2011-01-30', '2011-01-31']

At this point the index is a list of strings where each string indicates a day/date.

Now let’s assign this index to our dataframe

df.index = index

print(df)
             0
2011-01-1    0
2011-01-2    1
2011-01-3    2
2011-01-4    3
2011-01-5    4
...         ..
2011-01-27  26
2011-01-28  27
2011-01-29  28
2011-01-30  29
2011-01-31  30

[31 rows x 1 columns]

We can see that the index of our the dataframe is now the date. But does pandas recognizes this new index as date or does it considers it still as strings?

print(type(df.index))
<class 'pandas.core.indexes.base.Index'>

So it turns out that pandas does not recognize the index as date/time

Therefore, we can explicitly tell pandas that the new index is a date and time index

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
               '2011-01-09', '2011-01-10', '2011-01-11', '2011-01-12',
               '2011-01-13', '2011-01-14', '2011-01-15', '2011-01-16',
               '2011-01-17', '2011-01-18', '2011-01-19', '2011-01-20',
               '2011-01-21', '2011-01-22', '2011-01-23', '2011-01-24',
               '2011-01-25', '2011-01-26', '2011-01-27', '2011-01-28',
               '2011-01-29', '2011-01-30', '2011-01-31'],
              dtype='datetime64[ns]', freq=None)

The to_datetime function of pandas converts an array of dates into DateTimeIndex object. It can accepts dates in a wide range of formats. We can verify the type of our new index.

print(type(index))
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>

now we have created an index whose type is DateTimeIndex i.e. pandas recognizes it as date/time. Let’s assign this as index to the dataframe.

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
               '2011-01-09', '2011-01-10', '2011-01-11', '2011-01-12',
               '2011-01-13', '2011-01-14', '2011-01-15', '2011-01-16',
               '2011-01-17', '2011-01-18', '2011-01-19', '2011-01-20',
               '2011-01-21', '2011-01-22', '2011-01-23', '2011-01-24',
               '2011-01-25', '2011-01-26', '2011-01-27', '2011-01-28',
               '2011-01-29', '2011-01-30', '2011-01-31'],
              dtype='datetime64[ns]', freq=None)

So now the type of index of the dataframe is date/time. Now we can perform slicing based upon time, for example we can ask pandas to return rows which are after 15 January 2011 as below

print(df[df.index>pd.Timestamp("20110115")])
             0
2011-01-16  15
2011-01-17  16
2011-01-18  17
2011-01-19  18
2011-01-20  19
2011-01-21  20
2011-01-22  21
2011-01-23  22
2011-01-24  23
2011-01-25  24
2011-01-26  25
2011-01-27  26
2011-01-28  27
2011-01-29  28
2011-01-30  29
2011-01-31  30

Had we done it earlier (before converting our index to pd.DateTimeIndex, we would have got error

creating datetime index#

Above we converted a normal index which was of type list into DateTimeIndex using to_datetime function. We can directly create DateTimeIndex using date_range function.

index = pd.date_range(start="20110101", freq="D", periods=31)

print(type(index))
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
print(index)
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
               '2011-01-09', '2011-01-10', '2011-01-11', '2011-01-12',
               '2011-01-13', '2011-01-14', '2011-01-15', '2011-01-16',
               '2011-01-17', '2011-01-18', '2011-01-19', '2011-01-20',
               '2011-01-21', '2011-01-22', '2011-01-23', '2011-01-24',
               '2011-01-25', '2011-01-26', '2011-01-27', '2011-01-28',
               '2011-01-29', '2011-01-30', '2011-01-31'],
              dtype='datetime64[ns]', freq='D')

We can also define the frequency or time-step of our DateTimeIndex.

index = pd.date_range(start="20110101", end="20110131", freq="D")

print(type(index))
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
print(index)
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
               '2011-01-09', '2011-01-10', '2011-01-11', '2011-01-12',
               '2011-01-13', '2011-01-14', '2011-01-15', '2011-01-16',
               '2011-01-17', '2011-01-18', '2011-01-19', '2011-01-20',
               '2011-01-21', '2011-01-22', '2011-01-23', '2011-01-24',
               '2011-01-25', '2011-01-26', '2011-01-27', '2011-01-28',
               '2011-01-29', '2011-01-30', '2011-01-31'],
              dtype='datetime64[ns]', freq='D')

TimeStamps#

The DateTimeIndex is indeed an array of TimeStamps i.e. each member of DateTimeIndex is a TimeStamp.

print(df.index[0])
2011-01-01 00:00:00
print(type(df.index[0]))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

we can check whether a TimeStamp is in a DateTimeIndex or not

print(index[1] in index)
True

we can also compare two TimeStamps

print(index[0] > index[1])
False

freq#

the index (of dataframe) has a special attribute called freq which defines the time-step of the index. It is only available for the index of type DateTimeIndex.

None

There can be two reasons for the freq to be None. Either the data/DateTimeIndex does not have constant time-steps. In such a case freq (time-step) can not be computed. But sometimes even if the index is of type DateTimeIndex and has constant time-step but it can have None freq. This is what happened above. In both cases we can ask pandas to infer the freq/time-step of the index.

D

Now we can assign the frequency to the DataFrame.index (not DataFrame). This is kind of reminding the DataFrame that this is the time-step of your index.

<Day>

we can see once ‘reminded’, the pandas now tells us the frequency of its index.

print(type(df.index.freq))
<class 'pandas._libs.tslibs.offsets.Day'>
print(type(df.index.freqstr))
<class 'str'>

forcing a frequency#

df = pd.DataFrame(np.arange(31), index=pd.date_range("20110101", periods=31, freq="D"))
print(df)
             0
2011-01-01   0
2011-01-02   1
2011-01-03   2
2011-01-04   3
2011-01-05   4
...         ..
2011-01-27  26
2011-01-28  27
2011-01-29  28
2011-01-30  29
2011-01-31  30

[31 rows x 1 columns]
<Day>
df = df.drop(labels="2011-01-03")

print(df)
             0
2011-01-01   0
2011-01-02   1
2011-01-04   3
2011-01-05   4
2011-01-06   5
...         ..
2011-01-27  26
2011-01-28  27
2011-01-29  28
2011-01-30  29
2011-01-31  30

[30 rows x 1 columns]
None

if we forcefully try to assign a frequency, pandas will throw ValueError.

# Try by uncommenting following line

# df.index.freq = "D"  # -> ValueError

Resampling#

Resampling means changing the frequency of time series.

One major advantage of having a frequency i.e freq attribute defined is that we can easily change the frequency/time-step of the data (time series).

df.asfreq('D')
0
2011-01-01 0.0
2011-01-02 1.0
2011-01-03 NaN
2011-01-04 3.0
2011-01-05 4.0
... ...
2011-01-27 26.0
2011-01-28 27.0
2011-01-29 28.0
2011-01-30 29.0
2011-01-31 30.0

31 rows × 1 columns



Above when we tried to resample our time series data at daily time step, the time steps where we did not have any value, were assigned NaN values.

upsampling#

This refers to changing the time step from larger to smaller such as from daily to hourly

df = pd.DataFrame(np.random.randint(0, 5, 5),
                  index=pd.date_range("20110101", periods=5, freq="D"),
                  columns=['a'])
print(df)
            a
2011-01-01  3
2011-01-02  1
2011-01-03  1
2011-01-04  3
2011-01-05  0
<pandas.core.resample.DatetimeIndexResampler object at 0x7ffad3f50100>

Until now we have told pandas to resample at a particular time step but we have not told which method to use. We can as an example use the mean to resample.

df.resample("6H").mean()
a
2011-01-01 00:00:00 3.0
2011-01-01 06:00:00 NaN
2011-01-01 12:00:00 NaN
2011-01-01 18:00:00 NaN
2011-01-02 00:00:00 1.0
2011-01-02 06:00:00 NaN
2011-01-02 12:00:00 NaN
2011-01-02 18:00:00 NaN
2011-01-03 00:00:00 1.0
2011-01-03 06:00:00 NaN
2011-01-03 12:00:00 NaN
2011-01-03 18:00:00 NaN
2011-01-04 00:00:00 3.0
2011-01-04 06:00:00 NaN
2011-01-04 12:00:00 NaN
2011-01-04 18:00:00 NaN
2011-01-05 00:00:00 0.0


But this did not fill the NaNs in our new data.

df.resample("6H").ffill()
a
2011-01-01 00:00:00 3
2011-01-01 06:00:00 3
2011-01-01 12:00:00 3
2011-01-01 18:00:00 3
2011-01-02 00:00:00 1
2011-01-02 06:00:00 1
2011-01-02 12:00:00 1
2011-01-02 18:00:00 1
2011-01-03 00:00:00 1
2011-01-03 06:00:00 1
2011-01-03 12:00:00 1
2011-01-03 18:00:00 1
2011-01-04 00:00:00 3
2011-01-04 06:00:00 3
2011-01-04 12:00:00 3
2011-01-04 18:00:00 3
2011-01-05 00:00:00 0


.mean() returns us a pandas object. We can in fact call ffill on it as well.

df.resample("6H").mean().ffill()
a
2011-01-01 00:00:00 3.0
2011-01-01 06:00:00 3.0
2011-01-01 12:00:00 3.0
2011-01-01 18:00:00 3.0
2011-01-02 00:00:00 1.0
2011-01-02 06:00:00 1.0
2011-01-02 12:00:00 1.0
2011-01-02 18:00:00 1.0
2011-01-03 00:00:00 1.0
2011-01-03 06:00:00 1.0
2011-01-03 12:00:00 1.0
2011-01-03 18:00:00 1.0
2011-01-04 00:00:00 3.0
2011-01-04 06:00:00 3.0
2011-01-04 12:00:00 3.0
2011-01-04 18:00:00 3.0
2011-01-05 00:00:00 0.0


A better way to resample is apply some interpolation method. For example linear interpolation.

df.resample("6H").interpolate(method="linear")
a
2011-01-01 00:00:00 3.00
2011-01-01 06:00:00 2.50
2011-01-01 12:00:00 2.00
2011-01-01 18:00:00 1.50
2011-01-02 00:00:00 1.00
2011-01-02 06:00:00 1.00
2011-01-02 12:00:00 1.00
2011-01-02 18:00:00 1.00
2011-01-03 00:00:00 1.00
2011-01-03 06:00:00 1.50
2011-01-03 12:00:00 2.00
2011-01-03 18:00:00 2.50
2011-01-04 00:00:00 3.00
2011-01-04 06:00:00 2.25
2011-01-04 12:00:00 1.50
2011-01-04 18:00:00 0.75
2011-01-05 00:00:00 0.00


Sometimes, we may wish to equally distribute a quantity during upsampling For example if we have total amount of rainfall for a day, then linearly interpolating daily rainfall values to hourly will be wrong. In such a case we will wish to distribute daily rainfall to equally to hourly steps

df1 = df.resample('6H').mean().ffill()
df1['a'] = df1['a'] / df1.groupby('a')['a'].transform(len)  # len/'size'
print(df1)
                         a
2011-01-01 00:00:00  0.375
2011-01-01 06:00:00  0.375
2011-01-01 12:00:00  0.375
2011-01-01 18:00:00  0.375
2011-01-02 00:00:00  0.125
2011-01-02 06:00:00  0.125
2011-01-02 12:00:00  0.125
2011-01-02 18:00:00  0.125
2011-01-03 00:00:00  0.125
2011-01-03 06:00:00  0.125
2011-01-03 12:00:00  0.125
2011-01-03 18:00:00  0.125
2011-01-04 00:00:00  0.375
2011-01-04 06:00:00  0.375
2011-01-04 12:00:00  0.375
2011-01-04 18:00:00  0.375
2011-01-05 00:00:00  0.000

downsampling#

It refers to resampling from low time step to high time step e.g. from hourly to daily

df = pd.DataFrame(np.random.randint(0, 5, 24),
                  index=pd.date_range("20110101", periods=24, freq="H"),
                  columns=['a'])
print(df)
                     a
2011-01-01 00:00:00  1
2011-01-01 01:00:00  4
2011-01-01 02:00:00  3
2011-01-01 03:00:00  1
2011-01-01 04:00:00  4
...                 ..
2011-01-01 19:00:00  3
2011-01-01 20:00:00  4
2011-01-01 21:00:00  3
2011-01-01 22:00:00  1
2011-01-01 23:00:00  1

[24 rows x 1 columns]
df.resample("6H").mean()
a
2011-01-01 00:00:00 2.500000
2011-01-01 06:00:00 2.666667
2011-01-01 12:00:00 1.666667
2011-01-01 18:00:00 2.000000


df.resample("6H").sum()
a
2011-01-01 00:00:00 15
2011-01-01 06:00:00 16
2011-01-01 12:00:00 10
2011-01-01 18:00:00 12


inconsistent time step#

Sometimes we have quantities, which are not measured at exactly the same frequency where we want. For example below data is measured with inconsistent time steps.

df = pd.DataFrame([np.nan, 1100, 1400, np.nan, 14000],
                   index=pd.to_datetime(["2011-05-25 10:00:00",
                                         "2011-05-25 16:40:00",
                                         "2011-05-25 17:06:00",
                                         "2011-05-25 17:10:00",
                                         "2011-05-25 17:24:00"]),
                  columns=['a'])

print(df)
                           a
2011-05-25 10:00:00      NaN
2011-05-25 16:40:00   1100.0
2011-05-25 17:06:00   1400.0
2011-05-25 17:10:00      NaN
2011-05-25 17:24:00  14000.0

Our target is to convert this data to 6 minute. A naive way would be to change the frequency and do not fill the new nans.

df.resample('6Min').first()
a
2011-05-25 10:00:00 NaN
2011-05-25 10:06:00 NaN
2011-05-25 10:12:00 NaN
2011-05-25 10:18:00 NaN
2011-05-25 10:24:00 NaN
... ...
2011-05-25 17:00:00 NaN
2011-05-25 17:06:00 1400.0
2011-05-25 17:12:00 NaN
2011-05-25 17:18:00 NaN
2011-05-25 17:24:00 14000.0

75 rows × 1 columns



You see the number of values change from 5 to 75

a better option will be to do backfill or forward fill

df.resample('6min').bfill(limit=1)
a
2011-05-25 10:00:00 NaN
2011-05-25 10:06:00 NaN
2011-05-25 10:12:00 NaN
2011-05-25 10:18:00 NaN
2011-05-25 10:24:00 NaN
... ...
2011-05-25 17:00:00 1400.0
2011-05-25 17:06:00 1400.0
2011-05-25 17:12:00 NaN
2011-05-25 17:18:00 14000.0
2011-05-25 17:24:00 14000.0

75 rows × 1 columns



it will be even better to do a linear interpolation between available values.

df.resample('6min').interpolate()
a
2011-05-25 10:00:00 NaN
2011-05-25 10:06:00 NaN
2011-05-25 10:12:00 NaN
2011-05-25 10:18:00 NaN
2011-05-25 10:24:00 NaN
... ...
2011-05-25 17:00:00 NaN
2011-05-25 17:06:00 1400.0
2011-05-25 17:12:00 5600.0
2011-05-25 17:18:00 9800.0
2011-05-25 17:24:00 14000.0

75 rows × 1 columns



df.resample('6min').interpolate('nearest')
a
2011-05-25 10:00:00 NaN
2011-05-25 10:06:00 NaN
2011-05-25 10:12:00 NaN
2011-05-25 10:18:00 NaN
2011-05-25 10:24:00 NaN
... ...
2011-05-25 17:00:00 NaN
2011-05-25 17:06:00 1400.0
2011-05-25 17:12:00 1400.0
2011-05-25 17:18:00 14000.0
2011-05-25 17:24:00 14000.0

75 rows × 1 columns



Period#

A Period is an interval between two TimeStamps. Therefore a Period has start_time and end_time attributes

p = pd.Period("1979-02-01")

print(type(p))
<class 'pandas._libs.tslibs.period.Period'>
print(p.start_time)
1979-02-01 00:00:00
print(p.end_time)
1979-02-01 23:59:59.999999999
print(type(p.start_time)), print(type(p.end_time))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

(None, None)
print(p.freq)
<Day>

PeriodIndex#

Similar to the concept of DateTimeIndex is the concept of PeriodIndex. Just as a DateTimeIndex can be considered as an array of TimeStamps, a PeriodIndex is array of Period.

pidx = pd.period_range("20110101", "20121231", freq="M")
print(pidx)
PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',
             '2011-07', '2011-08', '2011-09', '2011-10', '2011-11', '2011-12',
             '2012-01', '2012-02', '2012-03', '2012-04', '2012-05', '2012-06',
             '2012-07', '2012-08', '2012-09', '2012-10', '2012-11', '2012-12'],
            dtype='period[M]')
print(type(pidx))
<class 'pandas.core.indexes.period.PeriodIndex'>

each member of PeriodIndex array i.e., pidx is a Period

print(type(pidx[0]))
<class 'pandas._libs.tslibs.period.Period'>

For an overview of difference between TimeStamp and PeriodIndex, see this

Total running time of the script: ( 0 minutes 0.268 seconds)

Gallery generated by Sphinx-Gallery