5.1 introduction

import time
import numpy as np
import pandas as pd

print(time.asctime())
print(pd.__version__, np.__version__)

Mon Nov 11 07:57:06 2024
1.5.3 1.26.4

Suppose we have an array [0.4, 0.3, 0.5, 0.2, 0.6, 0.3], and let's say its values represent concentrations in water measured every hour between 13:00 and 19:00. With just an array, we have no way to encode this information. If we want to attach the (temporal) reference of each value, we have to do it ourselves, for example by keeping it in a separate array. Pandas has the built-in ability to attach references, or labels, to arrays. A two-dimensional pandas object has two kinds of references: the reference for the rows, which is called the index, and the reference for the columns, which is called columns. We can therefore think of pandas as a library that provides referenced/labelled arrays.
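As a minimal sketch of this idea, the six concentrations can be paired with hour labels in a pandas Series. The label strings ("13:00", "14:00", ...) and the variable names are assumptions made for illustration; they are not part of the original data.

```python
import pandas as pd

# Concentrations from the text above.
values = [0.4, 0.3, 0.5, 0.2, 0.6, 0.3]

# Hypothetical hour labels, one per measurement (assumed for illustration).
hours = [f"{h}:00" for h in range(13, 19)]

conc = pd.Series(values, index=hours, name="concentration")

# Values can now be looked up by label instead of by position.
print(conc["15:00"])   # 0.5
print(conc.idxmax())   # label of the highest concentration: 17:00
```

The label travels with the value, so no separate array of times has to be kept in sync by hand.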

The core data structure in pandas is the DataFrame, which consists of one or more columns. A single column in a DataFrame is a Series.

df = pd.DataFrame(np.random.random((10, 3)))
print(df)

          0         1         2
0  0.498174  0.649376  0.048874
1  0.353728  0.448276  0.431891
2  0.295985  0.655199  0.701662
3  0.666585  0.655752  0.832106
4  0.856753  0.005483  0.801618
5  0.640038  0.006691  0.430169
6  0.061012  0.097369  0.979786
7  0.412577  0.444025  0.203257
8  0.787196  0.781598  0.108454
9  0.554003  0.492394  0.228487

The data in the columns is stored as numpy arrays. DataFrames and Series therefore share many characteristics with numpy arrays.

print(df.shape)

(10, 3)
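Beyond shape, a few other numpy-style attributes and operations carry over. The sketch below uses a small, made-up frame so that the results are easy to check by eye.

```python
import numpy as np
import pandas as pd

# A small deterministic frame: rows are [0, 1], [2, 3], [4, 5].
df = pd.DataFrame(np.arange(6.0).reshape(3, 2))

print(df.ndim)     # 2, like a 2-D numpy array
print(df.size)     # 6 elements in total
print(df.dtypes)   # one dtype per column

# Arithmetic is element-wise, as with numpy arrays.
doubled = df * 2
print(doubled.iloc[2, 1])   # 10.0

# numpy functions can be applied directly and return a DataFrame.
roots = np.sqrt(df)
print(type(roots))
```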

By default, the column names are just integers starting from 0. However, we can define the column names ourselves as well.

df = pd.DataFrame(np.random.random((10, 3)), columns=['a', 'b', 'c'])
print(df)

          a         b         c
0  0.035141  0.473271  0.811923
1  0.153618  0.477829  0.335278
2  0.374927  0.574451  0.088222
3  0.658233  0.904410  0.100726
4  0.837925  0.099438  0.436024
5  0.004975  0.474035  0.296443
6  0.058987  0.768426  0.999078
7  0.009682  0.161709  0.496965
8  0.297341  0.461146  0.764938
9  0.027988  0.193115  0.286918

print(df.columns)

Index(['a', 'b', 'c'], dtype='object')

The columns are a list-like structure, but they are not exactly a list.

print(type(df.columns))

<class 'pandas.core.indexes.base.Index'>

We can, however, convert the columns to a list.

df.columns.to_list()

['a', 'b', 'c']

print(type(df.columns.to_list()))

<class 'list'>

The default labels for the rows, i.e. the index, are integers starting from 0.

print(df.index)

RangeIndex(start=0, stop=10, step=1)

However, we can set an index of our choice as well.

df = pd.DataFrame(np.random.random((10, 3)),
                  columns=['a', 'b', 'c'],
                 index=[2000+i for i in range(10)])
print(df)

             a         b         c
2000  0.037827  0.138579  0.172953
2001  0.834903  0.210243  0.097174
2002  0.595763  0.158262  0.736529
2003  0.765841  0.699503  0.640029
2004  0.647052  0.274776  0.680227
2005  0.749939  0.162144  0.606576
2006  0.899868  0.916047  0.322556
2007  0.664681  0.451863  0.801402
2008  0.137936  0.706174  0.185160
2009  0.396465  0.250700  0.834271

print(df.index)

Int64Index([2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009], dtype='int64')
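Once the index holds meaningful labels, rows can be selected by label as well as by position. The following is a small sketch on a deterministic frame (the values are made up) showing .loc for labels and .iloc for positions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(10, 3),
                  columns=['a', 'b', 'c'],
                  index=[2000 + i for i in range(10)])

# .loc selects by label: the row labelled 2003.
print(df.loc[2003])

# .iloc selects by position: the fourth row, which is the same row here.
print(df.iloc[3])

# Label slices with .loc include both end points (2003, 2004 and 2005).
print(df.loc[2003:2005, 'a'])
```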

By default, the index has no name (its name is None).

print(df.index.name)

None

However, we can set the name of the index as well.

df.index.name = 'years'
print(df)

              a         b         c
years
2000   0.037827  0.138579  0.172953
2001   0.834903  0.210243  0.097174
2002   0.595763  0.158262  0.736529
2003   0.765841  0.699503  0.640029
2004   0.647052  0.274776  0.680227
2005   0.749939  0.162144  0.606576
2006   0.899868  0.916047  0.322556
2007   0.664681  0.451863  0.801402
2008   0.137936  0.706174  0.185160
2009   0.396465  0.250700  0.834271

print(df.index.name)

years

print(type(df))

<class 'pandas.core.frame.DataFrame'>

df = pd.DataFrame(np.random.randint(0, 10, (10, 1)),
                  columns=['a'],
                 index=[2000+i for i in range(10)])
print(df)

print(type(df))

<class 'pandas.core.frame.DataFrame'>

print(df.columns)

Index(['a'], dtype='object')

Series

A Series consists of a single column. It can be constructed using pd.Series.

s = pd.Series(np.random.random(10))
print(s)

0    0.086779
1    0.323645
2    0.813329
3    0.955254
4    0.254758
5    0.273233
6    0.799228
7    0.687978
8    0.835114
9    0.279034
dtype: float64

print(type(s))

<class 'pandas.core.series.Series'>

print(s.shape)

(10,)

print(s.name)

None

s = pd.Series(np.random.random(10),
              name="a")
print(s)

0    0.304266
1    0.526392
2    0.532155
3    0.050524
4    0.561465
5    0.206713
6    0.119534
7    0.404703
8    0.312129
9    0.055173
Name: a, dtype: float64

print(s.name)

A Series is literally the data structure for a single column of a DataFrame.

df = pd.DataFrame(np.random.random((10, 3)),
                  columns=['a', 'b', 'c'],
                 index=[2000+i for i in range(10)])
print(df)

             a         b         c
2000  0.664250  0.476066  0.694364
2001  0.918895  0.977018  0.727915
2002  0.220773  0.545780  0.505667
2003  0.786437  0.937649  0.616812
2004  0.362441  0.090276  0.569802
2005  0.080560  0.028406  0.646057
2006  0.194305  0.195712  0.869170
2007  0.416491  0.861553  0.002006
2008  0.030182  0.022711  0.550587
2009  0.984036  0.294418  0.039045

A single column in a DataFrame is a Series.

print(type(df['a']))

<class 'pandas.core.series.Series'>

s = pd.Series(np.random.random(10),
              index=[2000+i for i in range(10)],
              name="a")
print(s)

2000    0.979264
2001    0.551812
2002    0.360790
2003    0.243930
2004    0.280986
2005    0.941323
2006    0.795894
2007    0.685180
2008    0.845721
2009    0.569461
Name: a, dtype: float64
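A column selected from a DataFrame carries the same index as the frame, and its name is the column label. A brief sketch on a small, made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2),
                  columns=['a', 'b'],
                  index=[2000, 2001, 2002])

col = df['a']
print(col.name)                     # a
print(col.index.equals(df.index))   # True

# Double brackets select a list of columns and return a DataFrame instead.
print(type(df[['a']]))
```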

Because pandas is built on numpy arrays, we can extract the underlying numpy array from a DataFrame using the .values attribute.

print(df.values)

[[0.66424986 0.47606614 0.6943637 ]
 [0.91889498 0.97701754 0.72791462]
 [0.2207728  0.54577997 0.50566695]
 [0.78643683 0.93764933 0.6168125 ]
 [0.36244087 0.09027634 0.56980208]
 [0.08056014 0.02840586 0.64605661]
 [0.19430505 0.19571235 0.86917014]
 [0.41649121 0.86155327 0.00200583]
 [0.03018163 0.02271147 0.55058707]
 [0.9840357  0.29441764 0.03904548]]
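Alongside .values, pandas also offers the to_numpy() method, which the pandas documentation recommends for new code. A minimal sketch on a small, made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.5, 4.5]})

arr = df.to_numpy()
print(type(arr))   # <class 'numpy.ndarray'>
print(arr.dtype)   # float64: int and float columns are upcast to a common dtype

# A single column converts to a 1-D array.
print(df['a'].to_numpy().shape)   # (2,)
```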

print(type(df.values))

<class 'numpy.ndarray'>

df = pd.DataFrame(np.random.randint(0, 14, (10, 3)),
                  columns=['a', 'b', 'c'],
                 index=[2000+i for i in range(10)])
print(df)

print(type(df.values))

<class 'numpy.ndarray'>

print(df.values.shape)

(10, 3)

df.describe()

	a	b	c
count	10.000000	10.000000	10.000000
mean	5.600000	8.500000	7.400000
std	4.060651	3.807887	4.299871
min	0.000000	0.000000	1.000000
25%	2.250000	6.250000	4.250000
50%	6.000000	10.000000	7.000000
75%	8.500000	11.500000	11.250000
max	12.000000	12.000000	13.000000

Get the first N rows of a DataFrame (5 by default)

df.head()

	a	b	c
2000	7	12	13
2001	2	10	3
2002	9	10	8
2003	5	6	13
2004	0	10	12

df.head(8)

	a	b	c
2000	7	12	13
2001	2	10	3
2002	9	10	8
2003	5	6	13
2004	0	10	12
2005	10	6	1
2006	1	12	4
2007	12	7	6

Get the last N rows of a DataFrame

df.tail()

	a	b	c
2005	10	6	1
2006	1	12	4
2007	12	7	6
2008	3	12	5
2009	7	0	9

df.tail(7)

	a	b	c
2003	5	6	13
2004	0	10	12
2005	10	6	1
2006	1	12	4
2007	12	7	6
2008	3	12	5
2009	7	0	9

df.mean()

a    5.6
b    8.5
c    7.4
dtype: float64

df.to_dict()

{'a': {2000: 7, 2001: 2, 2002: 9, 2003: 5, 2004: 0, 2005: 10, 2006: 1, 2007: 12, 2008: 3, 2009: 7}, 'b': {2000: 12, 2001: 10, 2002: 10, 2003: 6, 2004: 10, 2005: 6, 2006: 12, 2007: 7, 2008: 12, 2009: 0}, 'c': {2000: 13, 2001: 3, 2002: 8, 2003: 13, 2004: 12, 2005: 1, 2006: 4, 2007: 6, 2008: 5, 2009: 9}}

df.to_dict('list')

{'a': [7, 2, 9, 5, 0, 10, 1, 12, 3, 7], 'b': [12, 10, 10, 6, 10, 6, 12, 7, 12, 0], 'c': [13, 3, 8, 13, 12, 1, 4, 6, 5, 9]}

df['d'] = np.random.randint(0, 10, (10,))
print(df)

       a   b   c  d
2000   7  12  13  2
2001   2  10   3  1
2002   9  10   8  0
2003   5   6  13  7
2004   0  10  12  5
2005  10   6   1  3
2006   1  12   4  3
2007  12   7   6  9
2008   3  12   5  5
2009   7   0   9  4

df.pop('d')
print(df)
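Note that pop modifies the DataFrame in place and returns the removed column. When the original should stay intact, drop is one alternative. The sketch below contrasts the two on a small, made-up frame.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'd': [5, 6]})

# pop removes the column in place and returns it as a Series.
removed = df.pop('d')
print(list(df.columns))   # ['a', 'b']
print(removed.tolist())   # [5, 6]

# drop returns a new DataFrame and leaves the original untouched.
smaller = df.drop(columns=['b'])
print(list(smaller.columns))   # ['a']
print(list(df.columns))        # ['a', 'b']
```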

df.columns = ['x', 'y', 'z']
print(df)

Row count of a pandas DataFrame

len(df.index)

print(df.shape[0])

Change the order of DataFrame columns (here, move the last column to the front)

cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
print(df)

Drop rows of a pandas DataFrame whose value in a certain column is NaN

df = pd.DataFrame(np.random.randn(6,3))
print(df)

          0         1         2
0  1.660940  0.874616 -0.500677
1 -1.534451  0.513203  0.392755
2  0.409555 -0.577343 -0.097435
3 -1.708633  1.218809 -1.094980
4  0.059053  0.036667 -0.318926
5 -0.300077 -1.311863 -0.806702

df.iloc[::2,0] = np.nan; df.iloc[::4,2] = np.nan; df.iloc[::3,2] = np.nan
print(df)

          0         1         2
0       NaN  0.874616       NaN
1 -1.534451  0.513203  0.392755
2       NaN -0.577343 -0.097435
3 -1.708633  1.218809       NaN
4       NaN  0.036667       NaN
5 -0.300077 -1.311863 -0.806702

Drop every row that contains at least one NaN value

df.dropna()

	0	1	2
1	-1.534451	0.513203	0.392755
5	-0.300077	-1.311863	-0.806702

Drop only the rows that have NaN in a specific column

print(df[df[2].notna()])

          0         1         2
1 -1.534451  0.513203  0.392755
2       NaN -0.577343 -0.097435
5 -0.300077 -1.311863 -0.806702
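The same filtering can also be written with the subset argument of dropna. A short sketch, using a small made-up frame, showing that both spellings give the same result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [np.nan, 1.0, np.nan],
                   1: [0.5, 0.6, 0.7],
                   2: [np.nan, 2.0, 3.0]})

# Boolean-mask version, as used above.
by_mask = df[df[2].notna()]

# dropna restricted to column 2; NaNs in other columns are ignored.
by_subset = df.dropna(subset=[2])

print(by_subset)
print(by_mask.equals(by_subset))   # True
```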

Count the NaN values in each column of a DataFrame

df = pd.DataFrame(np.random.randn(6,3))
df.iloc[::2,0] = np.nan; df.iloc[::4,2] = np.nan; df.iloc[::3,2] = np.nan
print(df)

          0         1         2
0       NaN -0.040283       NaN
1  1.411649  0.837492 -0.490277
2       NaN -1.853904  1.035163
3  0.230543 -0.365505       NaN
4       NaN -0.364073       NaN
5 -0.146592  0.381510  0.868098

df.isna().sum()

0    3
1    0
2    3
dtype: int64

for columns

df.isnull().sum(axis = 0)

0    3
1    0
2    3
dtype: int64

for rows

df.isnull().sum(axis = 1)

0    2
1    0
2    1
3    1
4    2
5    0
dtype: int64

Check if any value is NaN in a DataFrame

df = pd.DataFrame(np.random.randn(6,3))
df.iloc[::2,0] = np.nan; df.iloc[::4,2] = np.nan; df.iloc[::3,2] = np.nan
print(df)

          0         1         2
0       NaN -2.701309       NaN
1 -1.346370  2.021702  0.518913
2       NaN -0.220619 -0.083774
3  0.706715 -0.475726       NaN
4       NaN  0.607583       NaN
5  1.864375  1.427532 -0.553547

Boolean mask marking where the NaN values are

df.isnull()

	0	1	2
0	True	False	True
1	False	False	False
2	True	False	False
3	False	False	True
4	True	False	True
5	False	False	False

Column-wise: does each column contain any NaN?

df.isnull().any()

0     True
1    False
2     True
dtype: bool

Is there any NaN in the entire DataFrame?

df.isnull().any().any()

True

Replace NaN values with zeros in a DataFrame

df = pd.DataFrame(np.random.randn(6,3))
df.iloc[::2,0] = np.nan; df.iloc[::4,2] = np.nan; df.iloc[::3,2] = np.nan
print(df)

          0         1         2
0       NaN  0.586977       NaN
1 -2.020800  2.246211 -0.138082
2       NaN -0.003072 -0.902660
3 -0.455902  1.308343       NaN
4       NaN  2.010836       NaN
5 -0.598552 -0.270577  0.004179

df.fillna(0)

	0	1	2
0	0.000000	0.586977	0.000000
1	-2.020800	2.246211	-0.138082
2	0.000000	-0.003072	-0.902660
3	-0.455902	1.308343	0.000000
4	0.000000	2.010836	0.000000
5	-0.598552	-0.270577	0.004179

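To fill the NaNs in only one column, assign the filled column back to the DataFrame. Calling fillna with inplace=True on a selected column is discouraged in newer pandas versions.

df[2] = df[2].fillna(0)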
print(df)

          0         1         2
0       NaN  0.586977  0.000000
1 -2.020800  2.246211 -0.138082
2       NaN -0.003072 -0.902660
3 -0.455902  1.308343  0.000000
4       NaN  2.010836  0.000000
5 -0.598552 -0.270577  0.004179
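Another way to fill a single column is to pass fillna a dict that maps column labels to fill values. This is a sketch on a small, made-up frame; the original frame is left unchanged because fillna returns a new object.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [np.nan, 1.0], 1: [2.0, 3.0], 2: [np.nan, 4.0]})

# Keys are column labels, values are the fill value for that column.
filled = df.fillna({2: 0})
print(filled)

# Column 0 keeps its NaN because it is not listed in the dict.
print(filled[0].isna().sum())   # 1
```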

Check if a column exists in a pandas DataFrame

df = pd.DataFrame(np.random.randn(6,3))
print(df)

          0         1         2
0  0.322260 -0.163406  0.755192
1 -1.429486  1.118667 -0.738101
2 -0.137188  1.157078 -0.016540
3 -1.187170 -0.086152  0.832547
4 -1.666758  0.856316  1.873881
5 -3.547606  2.329907  1.286130

if 0 in df.columns:
    print("true")

true

Convert a Python dict into a DataFrame

d = {
    '2012-06-08': 388,
    '2012-06-09': 388,
    '2012-06-10': 388,
    '2012-06-11': 389,
    '2012-06-12': 389,
    '2012-06-13': 389,
    '2012-06-14': 389,
    '2012-06-15': 389,
    '2012-06-16': 389,
    '2012-06-17': 389,
    '2012-06-18': 390,
    '2012-06-19': 390,
    '2012-06-20': 390,
}

pd.DataFrame(d.items())

	0	1
0	2012-06-08	388
1	2012-06-09	388
2	2012-06-10	388
3	2012-06-11	389
4	2012-06-12	389
5	2012-06-13	389
6	2012-06-14	389
7	2012-06-15	389
8	2012-06-16	389
9	2012-06-17	389
10	2012-06-18	390
11	2012-06-19	390
12	2012-06-20	390

pd.DataFrame(d.items(), columns=['Date', 'DateValue'])

	Date	DateValue
0	2012-06-08	388
1	2012-06-09	388
2	2012-06-10	388
3	2012-06-11	389
4	2012-06-12	389
5	2012-06-13	389
6	2012-06-14	389
7	2012-06-15	389
8	2012-06-16	389
9	2012-06-17	389
10	2012-06-18	390
11	2012-06-19	390
12	2012-06-20	390

Passing the dict directly fails because all of its values are scalars. Uncomment the following line to see the error.

# pd.DataFrame(d)  # ValueError: If using all scalar values, you must pass an index

pd.DataFrame([d])

	2012-06-08	2012-06-09	2012-06-10	2012-06-11	2012-06-12	2012-06-13	2012-06-14	2012-06-15	2012-06-16	2012-06-17	2012-06-18	2012-06-19	2012-06-20
0	388	388	388	389	389	389	389	389	389	389	390	390	390

pd.DataFrame.from_dict(d, orient='index', columns=['DateValue'])

	DateValue
2012-06-08	388
2012-06-09	388
2012-06-10	388
2012-06-11	389
2012-06-12	389
2012-06-13	389
2012-06-14	389
2012-06-15	389
2012-06-16	389
2012-06-17	389
2012-06-18	390
2012-06-19	390
2012-06-20	390
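One more route goes through a Series, which also makes it easy to parse the string keys as real dates. This is a sketch under assumptions: the column names Date and DateValue are reused from above, and only three of the dict entries are kept for brevity.

```python
import pandas as pd

# A shortened version of the dict above.
d = {'2012-06-08': 388, '2012-06-11': 389, '2012-06-18': 390}

s = pd.Series(d, name='DateValue')
s.index = pd.to_datetime(s.index)   # parse the string keys as dates
s.index.name = 'Date'

# reset_index turns the named index into an ordinary column.
df = s.reset_index()
print(df)
print(df.dtypes)
```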

Count the frequency that a value occurs in a dataframe column

df = pd.DataFrame(np.random.randint(0, 14, (10, 3)),
                  columns=['a', 'b', 'c'],
                 index=[2000+i for i in range(10)])
df['a'].value_counts()

  4
  2
   1
   1
   1
   1
Name: a, dtype: int64

Iterate over the rows of a DataFrame

for index, row in df.iterrows():
    print(index, row, '\n')

2000 a    12
b     8
c     5
Name: 2000, dtype: int64

2001 a    10
b     9
c     2
Name: 2001, dtype: int64

2002 a    8
b    1
c    6
Name: 2002, dtype: int64

2003 a    5
b    3
c    4
Name: 2003, dtype: int64

2004 a    10
b     3
c     5
Name: 2004, dtype: int64

2005 a    12
b     8
c     9
Name: 2005, dtype: int64

2006 a    10
b     2
c     3
Name: 2006, dtype: int64

2007 a     1
b    13
c     2
Name: 2007, dtype: int64

2008 a    0
b    8
c    4
Name: 2008, dtype: int64

2009 a    10
b     3
c     1
Name: 2009, dtype: int64
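iterrows builds a full Series for every row, which can be slow on large frames. As a hedged alternative, the sketch below uses itertuples, and then shows that a simple row-wise computation can often be written as a column expression with no loop at all. The two rows are copied from the output above.

```python
import pandas as pd

df = pd.DataFrame({'a': [12, 10], 'b': [8, 9], 'c': [5, 2]},
                  index=[2000, 2001])

# Each row arrives as a namedtuple; the index label is in .Index.
for row in df.itertuples():
    print(row.Index, row.a, row.b, row.c)

# Many row loops can be replaced by a column-wise expression.
totals = df['a'] + df['b'] + df['c']
print(totals)
```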

df = pd.DataFrame(np.random.randint(0, 14, (10, 3)),
                  columns=['a', 'b', 'c'])
print(df)

print(df['a']/df['b'])

0    2.000000
1    0.416667
2         inf
3         inf
4    0.166667
5    1.600000
6    0.200000
7    1.000000
8    2.600000
9    9.000000
dtype: float64
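The inf entries above come from dividing by zero. If such results should be treated as missing, one option is to replace them with NaN afterwards. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 1, 0], 'b': [5, 0, 0]})

ratio = df['a'] / df['b']
print(ratio)   # 2.0, inf, NaN (0/0 gives NaN rather than inf)

# Treat infinite results as missing values.
cleaned = ratio.replace([np.inf, -np.inf], np.nan)
print(cleaned)
```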

Add an empty column to a DataFrame

df["d"] = ""
print(df)

    a   b   c d
0  10   5  10
1   5  12   8
2   1   0   0
3  10   0   2
4   1   6   4
5   8   5   0
6   1   5   7
7   3   3   6
8  13   5   3
9   9   1   1

print(df['d'])

0
1
2
3
4
5
6
7
8
9
Name: d, dtype: object

df["d"] = np.nan
print(df)

    a   b   c   d
0  10   5  10 NaN
1   5  12   8 NaN
2   1   0   0 NaN
3  10   0   2 NaN
4   1   6   4 NaN
5   8   5   0 NaN
6   1   5   7 NaN
7   3   3   6 NaN
8  13   5   3 NaN
9   9   1   1 NaN

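What does axis in pandas mean? With axis=0 the operation runs down the rows and returns one result per column. With axis=1 it runs across the columns and returns one result per row.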

df.mean(axis=0)

a    6.1
b    4.2
c    4.1
d    NaN
dtype: float64

df.mean(axis=1)

0    8.333333
1    8.333333
2    0.333333
3    4.000000
4    3.666667
5    4.333333
6    4.333333
7    4.000000
8    7.000000
9    3.666667
dtype: float64
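The two axis values are easier to tell apart on a tiny frame where the sums can be checked by hand. The numbers below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0: collapse the rows, leaving one result per column.
col_sums = df.sum(axis=0)
print(col_sums)   # a -> 6, b -> 60

# axis=1: collapse the columns, leaving one result per row.
row_sums = df.sum(axis=1)
print(row_sums)   # 11, 22, 33

# The string aliases 'index' and 'columns' mean the same as 0 and 1.
print(df.sum(axis='columns').equals(row_sums))   # True
```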

Replace NaN with a blank/empty string

df.replace(9, np.nan)

	a	b	c	d
0	10.0	5	10	NaN
1	5.0	12	8	NaN
2	1.0	0	0	NaN
3	10.0	0	2	NaN
4	1.0	6	4	NaN
5	8.0	5	0	NaN
6	1.0	5	7	NaN
7	3.0	3	6	NaN
8	13.0	5	3	NaN
9	NaN	1	1	NaN

df.replace(np.nan, '')

	a	b	c
0	10	5	10
1	5	12	8
2	1	0	0
3	10	0	2
4	1	6	4
5	8	5	0
6	1	5	7
7	3	3	6
8	13	5	3
9	9	1	1

Rename specific column(s) in pandas

df = pd.DataFrame(np.random.randint(0, 14, (10, 3)), columns=['a', 'b', 'c'])
print(df)

df.rename(columns={'a':'log(A)'}, inplace=True)
print(df)

   log(A)  b   c
0      13  5  13
1       8  8   2
2       9  9   6
3      11  7   9
4      12  2   3
5       9  9   1
6      10  9   7
7       7  3   1
8       1  5  12
9       4  3   9

Print a DataFrame without the index

print(df)

   log(A)  b   c
0      13  5  13
1       8  8   2
2       9  9   6
3      11  7   9
4      12  2   3
5       9  9   1
6      10  9   7
7       7  3   1
8       1  5  12
9       4  3   9

df.style.hide(axis="index")

log(A)	b	c
13	5	13
8	8	2
9	9	6
11	7	9
12	2	3
9	9	1
10	9	7
7	3	1
1	5	12
4	3	9
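The Styler approach targets rendered (HTML) output. For plain-text printing, one option is to_string with index=False. A minimal sketch, reusing the first two rows shown above:

```python
import pandas as pd

df = pd.DataFrame({'log(A)': [13, 8], 'b': [5, 8], 'c': [13, 2]})

# to_string returns the table as a str; index=False leaves out the row labels.
text = df.to_string(index=False)
print(text)
```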

Replace NaN values with the mean of each column

df.fillna(df.mean())

	log(A)	b	c
0	13	5	13
1	8	8	2
2	9	9	6
3	11	7	9
4	12	2	3
5	9	9	1
6	10	9	7
7	7	3	1
8	1	5	12
9	4	3	9

Retrieve the number of columns in a DataFrame

len(df.columns)

print(df.shape[1])

We can create an empty DataFrame by specifying only its column labels or only its row labels.

df = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
print(df)

Empty DataFrame
Columns: [A, B, C, D, E, F, G]
Index: []

print(df.shape)

(0, 7)

df = pd.DataFrame(index=range(1,8))
print(df)

Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5, 6, 7]

print(df.shape)

(7, 0)
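When both the index and the columns are given but no data, the frame is no longer empty: every cell is filled with NaN. The sketch below illustrates this; the labels are arbitrary, and the dtype remark reflects the pandas versions this was written against.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(1, 4), columns=['A', 'B'])
print(df.shape)                # (3, 2)
print(df.isna().all().all())   # True: every cell is NaN

# Without data the columns get a generic dtype; passing np.nan explicitly
# as the data gives float columns instead.
typed = pd.DataFrame(np.nan, index=range(1, 4), columns=['A', 'B'])
print(typed.dtypes)
```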
