5.1 introduction#

import time
import numpy as np
import pandas as pd

print(time.asctime())
print(pd.__version__, np.__version__)
Mon Nov 11 19:32:24 2024
1.5.3 1.26.4

Suppose we have an array [0.4, 0.3, 0.5, 0.2, 0.6, 0.3]. Let’s say the values in this array represent concentrations in water measured every hour from 13 pm to 19 pm. However, with just an array, we don’t have the ability to encode this information. If we want to add the (temporal) reference of each value we have to add it ourself for example by saving that in a separate array. Pandas comes with this in-built ability that we can add reference or labels to arrays. Every array in pandas has two kinds of references. The reference for the rows which is called index and the reference for the columns which is called columns. Therefore we can call pandas a library which have referenced/labelled arrays.

The core data structure in pandas is DataFrame which consists of one or more columns. A single column in a DataFrame is a Series.

df = pd.DataFrame(np.random.random((10, 3)))
print(df)
          0         1         2
0  0.819711  0.469719  0.047138
1  0.995136  0.799338  0.014367
2  0.220825  0.169214  0.077614
3  0.472161  0.297704  0.213042
4  0.096419  0.898009  0.074485
5  0.210453  0.086511  0.951945
6  0.461789  0.427172  0.143734
7  0.906187  0.200208  0.497694
8  0.477416  0.382160  0.434236
9  0.428262  0.910327  0.273702

The data in columns is stored as numpy arrays. Therefore, a DataFrames and Series have a lot of characteristics similar to that of numpy arrays.

print(df.shape)
(10, 3)

By default the columns names are just integers starting from 0, however we can define the column names ourselves as well.

df = pd.DataFrame(np.random.random((10, 3)), columns=['a', 'b', 'c'])
print(df)
          a         b         c
0  0.643263  0.594007  0.442692
1  0.286817  0.179552  0.060201
2  0.463696  0.816541  0.600126
3  0.150433  0.366931  0.906603
4  0.288914  0.052968  0.615138
5  0.158648  0.694414  0.973317
6  0.744459  0.436719  0.882664
7  0.175492  0.733806  0.173178
8  0.051456  0.641243  0.885918
9  0.474923  0.216120  0.883705
print(df.columns)
Index(['a', 'b', 'c'], dtype='object')

The columns are list like structures. However they are not exactly lists.

print(type(df.columns))
<class 'pandas.core.indexes.base.Index'>

We can however, convert the columns to list though.

['a', 'b', 'c']
print(type(df.columns.to_list()))
<class 'list'>

The default label for the rows i.e. index consists of numbers starting from 0.

print(df.index)
RangeIndex(start=0, stop=10, step=1)

However, we can set index of our choice as well.

df = pd.DataFrame(np.random.random((10, 3)),
                  columns=['a', 'b', 'c'],
                 index=[2000+i for i in range(10)])
print(df)
             a         b         c
2000  0.978551  0.820233  0.227459
2001  0.317172  0.157027  0.695845
2002  0.767413  0.547235  0.369868
2003  0.425290  0.032933  0.923786
2004  0.251980  0.580116  0.266811
2005  0.947292  0.581080  0.211948
2006  0.868994  0.163130  0.771538
2007  0.627722  0.624592  0.296365
2008  0.337227  0.355143  0.516186
2009  0.441642  0.761122  0.799617
print(df.index)
Int64Index([2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009], dtype='int64')

The default name of index is None.

None

However, we can set the name of index as well.

df.index.name = 'years'
print(df)
              a         b         c
years
2000   0.978551  0.820233  0.227459
2001   0.317172  0.157027  0.695845
2002   0.767413  0.547235  0.369868
2003   0.425290  0.032933  0.923786
2004   0.251980  0.580116  0.266811
2005   0.947292  0.581080  0.211948
2006   0.868994  0.163130  0.771538
2007   0.627722  0.624592  0.296365
2008   0.337227  0.355143  0.516186
2009   0.441642  0.761122  0.799617
years
print(type(df))
<class 'pandas.core.frame.DataFrame'>
df = pd.DataFrame(np.random.randint(0, 10, (10, 1)),
                  columns=['a'],
                 index=[2000+i for i in range(10)])
print(df)
      a
2000  1
2001  6
2002  8
2003  5
2004  0
2005  6
2006  2
2007  8
2008  2
2009  3
print(type(df))
<class 'pandas.core.frame.DataFrame'>
print(df.columns)
Index(['a'], dtype='object')

Series#

A Series consists of a single column. It can be constructed using pd.Series.

s = pd.Series(np.random.random(10))
print(s)
0    0.240092
1    0.316543
2    0.336670
3    0.256285
4    0.662836
5    0.947630
6    0.188108
7    0.525331
8    0.612323
9    0.212473
dtype: float64
print(type(s))
<class 'pandas.core.series.Series'>
print(s.shape)
(10,)
print(s.name)
None
s = pd.Series(np.random.random(10),
              name="a")
print(s)
0    0.342054
1    0.232343
2    0.547047
3    0.942240
4    0.766106
5    0.313809
6    0.783459
7    0.325427
8    0.380389
9    0.418716
Name: a, dtype: float64
print(s.name)
a

the Series is literally the data structure for a single column of a DataFrame.

df = pd.DataFrame(np.random.random((10, 3)),
                  columns=['a', 'b', 'c'],
                 index=[2000+i for i in range(10)])
print(df)
             a         b         c
2000  0.611481  0.646587  0.884365
2001  0.884323  0.265456  0.499323
2002  0.035322  0.413177  0.404945
2003  0.138943  0.159374  0.096236
2004  0.007718  0.173929  0.156492
2005  0.244785  0.159650  0.829702
2006  0.226925  0.632816  0.999603
2007  0.251388  0.524466  0.769429
2008  0.024436  0.454759  0.473400
2009  0.611078  0.184098  0.176071

A single column in a DataFrame is a Series.

print(type(df['a']))
<class 'pandas.core.series.Series'>
s = pd.Series(np.random.random(10),
              index=[2000+i for i in range(10)],
              name="a")
print(s)
2000    0.520191
2001    0.355112
2002    0.645158
2003    0.384966
2004    0.996232
2005    0.068262
2006    0.720012
2007    0.611447
2008    0.931790
2009    0.001561
Name: a, dtype: float64

Since pandas is based upon numpy arrays. We can extract actual numpy arrays from DataFrame using .values method.

print(df.values)
[[0.61148101 0.64658729 0.8843646 ]
 [0.88432322 0.26545644 0.499323  ]
 [0.03532233 0.41317685 0.40494501]
 [0.13894266 0.15937354 0.0962357 ]
 [0.00771782 0.17392933 0.15649219]
 [0.24478489 0.15965015 0.82970221]
 [0.22692541 0.63281557 0.99960347]
 [0.25138839 0.52446576 0.76942873]
 [0.0244364  0.4547589  0.47339962]
 [0.61107766 0.1840978  0.17607135]]
print(type(df.values))
<class 'numpy.ndarray'>
df = pd.DataFrame(np.random.randint(0, 14, (10, 3)),
                  columns=['a', 'b', 'c'],
                 index=[2000+i for i in range(10)])
print(df)
       a   b   c
2000  12  10   1
2001   6   5   2
2002   2   0   4
2003  12   1  12
2004  11   1   0
2005   1   9  13
2006   6   3  11
2007   9   0  11
2008   7   7   0
2009   5   8   2
print(type(df.values))
<class 'numpy.ndarray'>
(10, 3)
a b c
count 10.000000 10.000000 10.000000
mean 7.100000 4.400000 5.600000
std 3.900142 3.893014 5.440588
min 1.000000 0.000000 0.000000
25% 5.250000 1.000000 1.250000
50% 6.500000 4.000000 3.000000
75% 10.500000 7.750000 11.000000
max 12.000000 10.000000 13.000000


a b c
2000 12 10 1
2001 6 5 2
2002 2 0 4
2003 12 1 12
2004 11 1 0


a b c
2000 12 10 1
2001 6 5 2
2002 2 0 4
2003 12 1 12
2004 11 1 0
2005 1 9 13
2006 6 3 11
2007 9 0 11


Get the last N rows of a DataFrame

a b c
2005 1 9 13
2006 6 3 11
2007 9 0 11
2008 7 7 0
2009 5 8 2


a b c
2003 12 1 12
2004 11 1 0
2005 1 9 13
2006 6 3 11
2007 9 0 11
2008 7 7 0
2009 5 8 2


a    7.1
b    4.4
c    5.6
dtype: float64
{'a': {2000: 12, 2001: 6, 2002: 2, 2003: 12, 2004: 11, 2005: 1, 2006: 6, 2007: 9, 2008: 7, 2009: 5}, 'b': {2000: 10, 2001: 5, 2002: 0, 2003: 1, 2004: 1, 2005: 9, 2006: 3, 2007: 0, 2008: 7, 2009: 8}, 'c': {2000: 1, 2001: 2, 2002: 4, 2003: 12, 2004: 0, 2005: 13, 2006: 11, 2007: 11, 2008: 0, 2009: 2}}
df.to_dict('list')
{'a': [12, 6, 2, 12, 11, 1, 6, 9, 7, 5], 'b': [10, 5, 0, 1, 1, 9, 3, 0, 7, 8], 'c': [1, 2, 4, 12, 0, 13, 11, 11, 0, 2]}
df['d'] = np.random.randint(0, 10, (10,))
print(df)
       a   b   c  d
2000  12  10   1  4
2001   6   5   2  4
2002   2   0   4  6
2003  12   1  12  9
2004  11   1   0  5
2005   1   9  13  5
2006   6   3  11  5
2007   9   0  11  5
2008   7   7   0  0
2009   5   8   2  8
df.pop('d')
print(df)
       a   b   c
2000  12  10   1
2001   6   5   2
2002   2   0   4
2003  12   1  12
2004  11   1   0
2005   1   9  13
2006   6   3  11
2007   9   0  11
2008   7   7   0
2009   5   8   2
df.columns = ['x', 'y', 'z']
print(df)
       x   y   z
2000  12  10   1
2001   6   5   2
2002   2   0   4
2003  12   1  12
2004  11   1   0
2005   1   9  13
2006   6   3  11
2007   9   0  11
2008   7   7   0
2009   5   8   2

row count of pandas dataframe

len(df.index)
10
print(df.shape[0])
10

change the order of DataFrame columns

cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
print(df)
       z   x   y
2000   1  12  10
2001   2   6   5
2002   4   2   0
2003  12  12   1
2004   0  11   1
2005  13   1   9
2006  11   6   3
2007  11   9   0
2008   0   7   7
2009   2   5   8

drop rows of Pandas DataFrame whose value in a certain column is NaN

          0         1         2
0  0.792738 -2.566105  0.434209
1  1.059024 -0.031717  0.358016
2  0.301728  0.249756 -0.682732
3 -0.944110  1.394479  1.437554
4  1.052420  1.079581  0.149270
5 -1.030245 -0.500275 -2.613331
df.iloc[::2,0] = np.nan; df.iloc[::4,2] = np.nan; df.iloc[::3,2] = np.nan
print(df)
          0         1         2
0       NaN -2.566105       NaN
1  1.059024 -0.031717  0.358016
2       NaN  0.249756 -0.682732
3 -0.944110  1.394479       NaN
4       NaN  1.079581       NaN
5 -1.030245 -0.500275 -2.613331

dropping all rows having NaN values

0 1 2
1 1.059024 -0.031717 0.358016
5 -1.030245 -0.500275 -2.613331


dropping NaN in specific columns

print(df[df[2].notna()])
          0         1         2
1  1.059024 -0.031717  0.358016
2       NaN  0.249756 -0.682732
5 -1.030245 -0.500275 -2.613331

count the NaN values in a column in DataFrame

df = pd.DataFrame(np.random.randn(6,3))
df.iloc[::2,0] = np.nan; df.iloc[::4,2] = np.nan; df.iloc[::3,2] = np.nan
print(df)
          0         1         2
0       NaN  0.139581       NaN
1 -1.509370  1.762801 -1.067253
2       NaN  0.095276 -0.777572
3  0.492278  0.061235       NaN
4       NaN -0.038476       NaN
5 -0.067574 -0.885610 -0.721969
df.isna().sum()
0    3
1    0
2    3
dtype: int64

for columns

df.isnull().sum(axis = 0)
0    3
1    0
2    3
dtype: int64

for rows

df.isnull().sum(axis = 1)
0    2
1    0
2    1
3    1
4    2
5    0
dtype: int64

check if any value is NaN in a DataFrame

df = pd.DataFrame(np.random.randn(6,3))
df.iloc[::2,0] = np.nan; df.iloc[::4,2] = np.nan; df.iloc[::3,2] = np.nan
print(df)
          0         1         2
0       NaN  0.163625       NaN
1 -1.087252  2.721640 -0.415199
2       NaN  1.432081 -0.653284
3  1.336880  0.197294       NaN
4       NaN  0.015589       NaN
5  1.388623  0.203641  1.691277

how many NaN

0 1 2
0 True False True
1 False False False
2 True False False
3 False False True
4 True False True
5 False False False


column wise

df.isnull().any()
0     True
1    False
2     True
dtype: bool

if there is any NaN in entire data

df.isnull().any().any()
True

replace NaN values by Zeroes in a column of a Dataframe?

df = pd.DataFrame(np.random.randn(6,3))
df.iloc[::2,0] = np.nan; df.iloc[::4,2] = np.nan; df.iloc[::3,2] = np.nan
print(df)
          0         1         2
0       NaN -0.168681       NaN
1  0.563927  0.017890 -1.375824
2       NaN -0.810597 -1.174800
3 -0.722840  0.883346       NaN
4       NaN -0.310959       NaN
5  0.940451 -1.261573 -1.127101
0 1 2
0 0.000000 -0.168681 0.000000
1 0.563927 0.017890 -1.375824
2 0.000000 -0.810597 -1.174800
3 -0.722840 0.883346 0.000000
4 0.000000 -0.310959 0.000000
5 0.940451 -1.261573 -1.127101


To fill the NaNs in only one column

df[2].fillna(0, inplace=True)
print(df)
          0         1         2
0       NaN -0.168681  0.000000
1  0.563927  0.017890 -1.375824
2       NaN -0.810597 -1.174800
3 -0.722840  0.883346  0.000000
4       NaN -0.310959  0.000000
5  0.940451 -1.261573 -1.127101

check if a column exists in Pandas

          0         1         2
0  0.775496  0.028101  0.380929
1 -0.932216  0.469528 -0.663859
2 -0.616558 -1.267830  0.580904
3 -1.582028 -0.355916  0.460871
4  0.672658 -1.117510  0.144625
5  0.613005  0.732261 -0.276066
if 0 in df.columns:
     print("true")
true

Python dict into a dataframe

d = {
    '2012-06-08': 388,
    '2012-06-09': 388,
    '2012-06-10': 388,
    '2012-06-11': 389,
    '2012-06-12': 389,
    '2012-06-13': 389,
    '2012-06-14': 389,
    '2012-06-15': 389,
    '2012-06-16': 389,
    '2012-06-17': 389,
    '2012-06-18': 390,
    '2012-06-19': 390,
    '2012-06-20': 390,
}
pd.DataFrame(d.items())
0 1
0 2012-06-08 388
1 2012-06-09 388
2 2012-06-10 388
3 2012-06-11 389
4 2012-06-12 389
5 2012-06-13 389
6 2012-06-14 389
7 2012-06-15 389
8 2012-06-16 389
9 2012-06-17 389
10 2012-06-18 390
11 2012-06-19 390
12 2012-06-20 390


pd.DataFrame(d.items(), columns=['Date', 'DateValue'])
Date DateValue
0 2012-06-08 388
1 2012-06-09 388
2 2012-06-10 388
3 2012-06-11 389
4 2012-06-12 389
5 2012-06-13 389
6 2012-06-14 389
7 2012-06-15 389
8 2012-06-16 389
9 2012-06-17 389
10 2012-06-18 390
11 2012-06-19 390
12 2012-06-20 390


uncomment following line pd.DataFrame(d) # ValueError: If using all scalar values, you must pass an index

2012-06-08 2012-06-09 2012-06-10 2012-06-11 2012-06-12 2012-06-13 2012-06-14 2012-06-15 2012-06-16 2012-06-17 2012-06-18 2012-06-19 2012-06-20
0 388 388 388 389 389 389 389 389 389 389 390 390 390


pd.DataFrame.from_dict(d, orient='index', columns=['DateVaue'])
DateVaue
2012-06-08 388
2012-06-09 388
2012-06-10 388
2012-06-11 389
2012-06-12 389
2012-06-13 389
2012-06-14 389
2012-06-15 389
2012-06-16 389
2012-06-17 389
2012-06-18 390
2012-06-19 390
2012-06-20 390


Count the frequency that a value occurs in a dataframe column

df = pd.DataFrame(np.random.randint(0, 14, (10, 3)),
                  columns=['a', 'b', 'c'],
                 index=[2000+i for i in range(10)])
df['a'].value_counts()
6     2
2     2
12    2
9     1
11    1
0     1
5     1
Name: a, dtype: int64
for index, row in df.iterrows():
    print(index, row, '\n')
2000 a     6
b    12
c     5
Name: 2000, dtype: int64

2001 a     2
b     1
c    10
Name: 2001, dtype: int64

2002 a     6
b     0
c    12
Name: 2002, dtype: int64

2003 a    9
b    8
c    7
Name: 2003, dtype: int64

2004 a    11
b    11
c     4
Name: 2004, dtype: int64

2005 a    0
b    0
c    7
Name: 2005, dtype: int64

2006 a    12
b     6
c     8
Name: 2006, dtype: int64

2007 a    12
b     5
c     1
Name: 2007, dtype: int64

2008 a     2
b     2
c    13
Name: 2008, dtype: int64

2009 a     5
b     2
c    13
Name: 2009, dtype: int64
df = pd.DataFrame(np.random.randint(0, 14, (10, 3)),
                  columns=['a', 'b', 'c'])
print(df)
    a   b   c
0   9  10  13
1   7   7   3
2   5   2  13
3   8   8   7
4   7   6   3
5   3   1   5
6   4  10   7
7   0   8   5
8  12  12  12
9  13   4   9
print(df['a']/df['b'])
0    0.900000
1    1.000000
2    2.500000
3    1.000000
4    1.166667
5    3.000000
6    0.400000
7    0.000000
8    1.000000
9    3.250000
dtype: float64

add an empty column to a dataframe?

df["d"] = ""
print(df)
    a   b   c d
0   9  10  13
1   7   7   3
2   5   2  13
3   8   8   7
4   7   6   3
5   3   1   5
6   4  10   7
7   0   8   5
8  12  12  12
9  13   4   9
print(df['d'])
0
1
2
3
4
5
6
7
8
9
Name: d, dtype: object
df["d"] = np.nan
print(df)
    a   b   c   d
0   9  10  13 NaN
1   7   7   3 NaN
2   5   2  13 NaN
3   8   8   7 NaN
4   7   6   3 NaN
5   3   1   5 NaN
6   4  10   7 NaN
7   0   8   5 NaN
8  12  12  12 NaN
9  13   4   9 NaN

What does axis in pandas mean?

df.mean(axis=0)
a    6.8
b    6.8
c    7.7
d    NaN
dtype: float64
df.mean(axis=1)
0    10.666667
1     5.666667
2     6.666667
3     7.666667
4     5.333333
5     3.000000
6     7.000000
7     4.333333
8    12.000000
9     8.666667
dtype: float64

Replace NaN with blank/empty string

a b c d
0 NaN 10 13.0 NaN
1 7.0 7 3.0 NaN
2 5.0 2 13.0 NaN
3 8.0 8 7.0 NaN
4 7.0 6 3.0 NaN
5 3.0 1 5.0 NaN
6 4.0 10 7.0 NaN
7 0.0 8 5.0 NaN
8 12.0 12 12.0 NaN
9 13.0 4 NaN NaN


a b c d
0 9 10 13
1 7 7 3
2 5 2 13
3 8 8 7
4 7 6 3
5 3 1 5
6 4 10 7
7 0 8 5
8 12 12 12
9 13 4 9


Rename specific column(s) in pandas

df = pd.DataFrame(np.random.randint(0, 14, (10, 3)), columns=['a', 'b', 'c'])
print(df)
    a   b   c
0   2   4  12
1   9  13   2
2  11   1   4
3   9   9   5
4   9   5   4
5  12   0   1
6  10   2  11
7   6   6   1
8  12   7   4
9   6   9   7
df.rename(columns={'a':'log(A)'}, inplace=True)
print(df)
   log(A)   b   c
0       2   4  12
1       9  13   2
2      11   1   4
3       9   9   5
4       9   5   4
5      12   0   1
6      10   2  11
7       6   6   1
8      12   7   4
9       6   9   7

print DataFrame without index

print(df)
   log(A)   b   c
0       2   4  12
1       9  13   2
2      11   1   4
3       9   9   5
4       9   5   4
5      12   0   1
6      10   2  11
7       6   6   1
8      12   7   4
9       6   9   7
/home/docs/checkouts/readthedocs.org/user_builds/python-seekho/checkouts/latest/scripts/pandas/dataframe_vs_series.py:478: FutureWarning: this method is deprecated in favour of `Styler.hide(axis="index")`
  df.style.hide_index()
log(A) b c
2 4 12
9 13 2
11 1 4
9 9 5
9 5 4
12 0 1
10 2 11
6 6 1
12 7 4
6 9 7


replace nan values with average of columns

log(A) b c
0 2 4 12
1 9 13 2
2 11 1 4
3 9 9 5
4 9 5 4
5 12 0 1
6 10 2 11
7 6 6 1
8 12 7 4
9 6 9 7


retrieve the number of columns in a dataframe?

3
print(df.shape[1])
3

We can create empty DataFrame by telling how many columns should exist or how many rows should exist.

df = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
print(df)
Empty DataFrame
Columns: [A, B, C, D, E, F, G]
Index: []
print(df.shape)
(0, 7)
df = pd.DataFrame(index=range(1,8))
print(df)
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5, 6, 7]
print(df.shape)
(7, 0)

Total running time of the script: ( 0 minutes 0.461 seconds)

Gallery generated by Sphinx-Gallery