5.1 introduction
import time
import numpy as np
import pandas as pd
print(time.asctime())
print(pd.__version__, np.__version__)
Mon Nov 11 07:57:06 2024
1.5.3 1.26.4
Suppose we have an array [0.4, 0.3, 0.5, 0.2, 0.6, 0.3]. Let’s say
the values in this array represent concentrations in water measured
every hour from 13:00 to 18:00. A bare array cannot encode this
information. To attach the (temporal) reference of each value,
we have to store it ourselves, for example in a separate array.
Pandas has a built-in ability to attach references, or labels, to arrays.
A two-dimensional array in pandas has two kinds of references: the reference for the rows, which
is called the index, and the reference for the columns, which is called columns.
We can therefore think of pandas as a library of referenced (labelled) arrays.
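As a minimal sketch of this idea, the six concentrations can be stored in a pandas Series whose index carries the measurement hour. The hour labels and the names `values`, `hours` and `conc` are assumptions made for illustration.

```python
import pandas as pd

# Concentrations from the text; the hour labels are assumed for illustration.
values = [0.4, 0.3, 0.5, 0.2, 0.6, 0.3]
hours = ["13:00", "14:00", "15:00", "16:00", "17:00", "18:00"]

conc = pd.Series(values, index=hours, name="concentration")
print(conc)

# A value can now be looked up by its label rather than by its position.
print(conc["15:00"])
```

The labels travel with the data, so no separate array of times has to be kept in sync by hand.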
The core data structure in pandas is the DataFrame, which consists of one or more
columns. A single column in a DataFrame is a Series.
df = pd.DataFrame(np.random.random((10, 3)))
print(df)
0 1 2
0 0.498174 0.649376 0.048874
1 0.353728 0.448276 0.431891
2 0.295985 0.655199 0.701662
3 0.666585 0.655752 0.832106
4 0.856753 0.005483 0.801618
5 0.640038 0.006691 0.430169
6 0.061012 0.097369 0.979786
7 0.412577 0.444025 0.203257
8 0.787196 0.781598 0.108454
9 0.554003 0.492394 0.228487
The data in the columns is stored as numpy arrays. Therefore, DataFrames and Series share many characteristics with numpy arrays.
print(df.shape)
(10, 3)
By default the column names are just integers starting from 0. However, we can define the column names ourselves as well.
df = pd.DataFrame(np.random.random((10, 3)), columns=['a', 'b', 'c'])
print(df)
a b c
0 0.035141 0.473271 0.811923
1 0.153618 0.477829 0.335278
2 0.374927 0.574451 0.088222
3 0.658233 0.904410 0.100726
4 0.837925 0.099438 0.436024
5 0.004975 0.474035 0.296443
6 0.058987 0.768426 0.999078
7 0.009682 0.161709 0.496965
8 0.297341 0.461146 0.764938
9 0.027988 0.193115 0.286918
print(df.columns)
Index(['a', 'b', 'c'], dtype='object')
The columns are a list-like structure. However, they are not exactly a list.
print(type(df.columns))
<class 'pandas.core.indexes.base.Index'>
However, we can convert the columns to a list.
print(df.columns.to_list())
['a', 'b', 'c']
print(type(df.columns.to_list()))
<class 'list'>
The default label for the rows, i.e. the index, consists of integers starting from 0.
print(df.index)
RangeIndex(start=0, stop=10, step=1)
However, we can set an index of our choice as well.
df = pd.DataFrame(np.random.random((10, 3)),
columns=['a', 'b', 'c'],
index=[2000+i for i in range(10)])
print(df)
a b c
2000 0.037827 0.138579 0.172953
2001 0.834903 0.210243 0.097174
2002 0.595763 0.158262 0.736529
2003 0.765841 0.699503 0.640029
2004 0.647052 0.274776 0.680227
2005 0.749939 0.162144 0.606576
2006 0.899868 0.916047 0.322556
2007 0.664681 0.451863 0.801402
2008 0.137936 0.706174 0.185160
2009 0.396465 0.250700 0.834271
print(df.index)
Int64Index([2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009], dtype='int64')
The default name of the index is None.
print(df.index.name)
None
However, we can set the name of the index as well.
df.index.name = 'years'
print(df)
a b c
years
2000 0.037827 0.138579 0.172953
2001 0.834903 0.210243 0.097174
2002 0.595763 0.158262 0.736529
2003 0.765841 0.699503 0.640029
2004 0.647052 0.274776 0.680227
2005 0.749939 0.162144 0.606576
2006 0.899868 0.916047 0.322556
2007 0.664681 0.451863 0.801402
2008 0.137936 0.706174 0.185160
2009 0.396465 0.250700 0.834271
print(df.index.name)
years
print(type(df))
<class 'pandas.core.frame.DataFrame'>
df = pd.DataFrame(np.random.randint(0, 10, (10, 1)),
columns=['a'],
index=[2000+i for i in range(10)])
print(df)
a
2000 0
2001 9
2002 0
2003 0
2004 6
2005 7
2006 0
2007 7
2008 1
2009 7
print(type(df))
<class 'pandas.core.frame.DataFrame'>
print(df.columns)
Index(['a'], dtype='object')
Series#
A Series consists of a single column. It can be constructed using pd.Series.
s = pd.Series(np.random.random(10))
print(s)
0 0.086779
1 0.323645
2 0.813329
3 0.955254
4 0.254758
5 0.273233
6 0.799228
7 0.687978
8 0.835114
9 0.279034
dtype: float64
print(type(s))
<class 'pandas.core.series.Series'>
print(s.shape)
(10,)
print(s.name)
None
s = pd.Series(np.random.random(10),
name="a")
print(s)
0 0.304266
1 0.526392
2 0.532155
3 0.050524
4 0.561465
5 0.206713
6 0.119534
7 0.404703
8 0.312129
9 0.055173
Name: a, dtype: float64
print(s.name)
a
The Series is literally the data structure for a single column of a DataFrame.
df = pd.DataFrame(np.random.random((10, 3)),
columns=['a', 'b', 'c'],
index=[2000+i for i in range(10)])
print(df)
a b c
2000 0.664250 0.476066 0.694364
2001 0.918895 0.977018 0.727915
2002 0.220773 0.545780 0.505667
2003 0.786437 0.937649 0.616812
2004 0.362441 0.090276 0.569802
2005 0.080560 0.028406 0.646057
2006 0.194305 0.195712 0.869170
2007 0.416491 0.861553 0.002006
2008 0.030182 0.022711 0.550587
2009 0.984036 0.294418 0.039045
A single column in a DataFrame is a Series.
print(type(df['a']))
<class 'pandas.core.series.Series'>
s = pd.Series(np.random.random(10),
index=[2000+i for i in range(10)],
name="a")
print(s)
2000 0.979264
2001 0.551812
2002 0.360790
2003 0.243930
2004 0.280986
2005 0.941323
2006 0.795894
2007 0.685180
2008 0.845721
2009 0.569461
Name: a, dtype: float64
Since pandas is built on numpy arrays, we can extract the underlying numpy array from a DataFrame using the .values attribute.
print(df.values)
[[0.66424986 0.47606614 0.6943637 ]
[0.91889498 0.97701754 0.72791462]
[0.2207728 0.54577997 0.50566695]
[0.78643683 0.93764933 0.6168125 ]
[0.36244087 0.09027634 0.56980208]
[0.08056014 0.02840586 0.64605661]
[0.19430505 0.19571235 0.86917014]
[0.41649121 0.86155327 0.00200583]
[0.03018163 0.02271147 0.55058707]
[0.9840357 0.29441764 0.03904548]]
print(type(df.values))
<class 'numpy.ndarray'>
df = pd.DataFrame(np.random.randint(0, 14, (10, 3)),
columns=['a', 'b', 'c'],
index=[2000+i for i in range(10)])
print(df)
a b c
2000 7 12 13
2001 2 10 3
2002 9 10 8
2003 5 6 13
2004 0 10 12
2005 10 6 1
2006 1 12 4
2007 12 7 6
2008 3 12 5
2009 7 0 9
print(type(df.values))
<class 'numpy.ndarray'>
print(df.values.shape)
(10, 3)
Get the first N rows of a DataFrame
df.head()
df.head(8)
Get the last N rows of a DataFrame
df.tail()
df.tail(7)
df.mean()
a 5.6
b 8.5
c 7.4
dtype: float64
df.to_dict()
{'a': {2000: 7, 2001: 2, 2002: 9, 2003: 5, 2004: 0, 2005: 10, 2006: 1, 2007: 12, 2008: 3, 2009: 7}, 'b': {2000: 12, 2001: 10, 2002: 10, 2003: 6, 2004: 10, 2005: 6, 2006: 12, 2007: 7, 2008: 12, 2009: 0}, 'c': {2000: 13, 2001: 3, 2002: 8, 2003: 13, 2004: 12, 2005: 1, 2006: 4, 2007: 6, 2008: 5, 2009: 9}}
df.to_dict('list')
{'a': [7, 2, 9, 5, 0, 10, 1, 12, 3, 7], 'b': [12, 10, 10, 6, 10, 6, 12, 7, 12, 0], 'c': [13, 3, 8, 13, 12, 1, 4, 6, 5, 9]}
df['d'] = np.random.randint(0, 10, (10,))
print(df)
a b c d
2000 7 12 13 2
2001 2 10 3 1
2002 9 10 8 0
2003 5 6 13 7
2004 0 10 12 5
2005 10 6 1 3
2006 1 12 4 3
2007 12 7 6 9
2008 3 12 5 5
2009 7 0 9 4
a b c
2000 7 12 13
2001 2 10 3
2002 9 10 8
2003 5 6 13
2004 0 10 12
2005 10 6 1
2006 1 12 4
2007 12 7 6
2008 3 12 5
2009 7 0 9
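The table above no longer contains column d, but the call that removed it is not shown. The following is a minimal sketch of one way to remove a column, assuming `DataFrame.drop`; the original may have used a different call, and the small frame `demo` is hypothetical.

```python
import pandas as pd

# Hypothetical small frame standing in for the one in the text.
demo = pd.DataFrame({"a": [7, 2], "b": [12, 10], "c": [13, 3], "d": [2, 1]})

# drop returns a new frame; the original changes only if we reassign it.
demo = demo.drop(columns=["d"])
print(demo.columns.to_list())
```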
df.columns = ['x', 'y', 'z']
print(df)
x y z
2000 7 12 13
2001 2 10 3
2002 9 10 8
2003 5 6 13
2004 0 10 12
2005 10 6 1
2006 1 12 4
2007 12 7 6
2008 3 12 5
2009 7 0 9
Row count of a pandas DataFrame
len(df.index)
10
print(df.shape[0])
10
change the order of DataFrame columns
z x y
2000 13 7 12
2001 3 2 10
2002 8 9 10
2003 13 5 6
2004 12 0 10
2005 1 10 6
2006 4 1 12
2007 6 12 7
2008 5 3 12
2009 9 7 0
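The table above shows the columns in the order z, x, y, but the expression that produced it is not shown. A minimal sketch, assuming selection with a list of labels; the frame `demo` and the name `reordered` are hypothetical.

```python
import pandas as pd

# Hypothetical small frame with the same column names as in the text.
demo = pd.DataFrame({"x": [7, 2], "y": [12, 10], "z": [13, 3]})

# Selecting with a list of labels returns the columns in the listed order.
reordered = demo[["z", "x", "y"]]
print(reordered)
```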
drop rows of Pandas DataFrame whose value in a certain column is NaN
df = pd.DataFrame(np.random.randn(6,3))
print(df)
0 1 2
0 1.660940 0.874616 -0.500677
1 -1.534451 0.513203 0.392755
2 0.409555 -0.577343 -0.097435
3 -1.708633 1.218809 -1.094980
4 0.059053 0.036667 -0.318926
5 -0.300077 -1.311863 -0.806702
df.iloc[::2,0] = np.nan; df.iloc[::4,2] = np.nan; df.iloc[::3,2] = np.nan
print(df)
0 1 2
0 NaN 0.874616 NaN
1 -1.534451 0.513203 0.392755
2 NaN -0.577343 -0.097435
3 -1.708633 1.218809 NaN
4 NaN 0.036667 NaN
5 -0.300077 -1.311863 -0.806702
dropping all rows having NaN values
dropping NaN in specific columns
0 1 2
1 -1.534451 0.513203 0.392755
2 NaN -0.577343 -0.097435
5 -0.300077 -1.311863 -0.806702
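The two cases named above can be sketched with `DataFrame.dropna`. The frame `demo` is hypothetical; it uses fixed values but the same NaN pattern as the frame printed above, so the surviving row labels can be compared.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same NaN pattern as the one printed above.
demo = pd.DataFrame(np.arange(18, dtype=float).reshape(6, 3))
demo.iloc[::2, 0] = np.nan
demo.iloc[::4, 2] = np.nan
demo.iloc[::3, 2] = np.nan

# Drop every row that contains at least one NaN.
no_nan_rows = demo.dropna()

# Drop only the rows where column 2 is NaN; NaN in other columns is kept.
no_nan_in_col2 = demo.dropna(subset=[2])

print(no_nan_rows.index.to_list())
print(no_nan_in_col2.index.to_list())
```

With `subset`, row 2 survives even though its column 0 is NaN, which matches the three-row table above.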
count the NaN values in a column in DataFrame
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[::2,0] = np.nan; df.iloc[::4,2] = np.nan; df.iloc[::3,2] = np.nan
print(df)
0 1 2
0 NaN -0.040283 NaN
1 1.411649 0.837492 -0.490277
2 NaN -1.853904 1.035163
3 0.230543 -0.365505 NaN
4 NaN -0.364073 NaN
5 -0.146592 0.381510 0.868098
df.isna().sum()
0 3
1 0
2 3
dtype: int64
for columns
df.isnull().sum(axis = 0)
0 3
1 0
2 3
dtype: int64
for rows
df.isnull().sum(axis = 1)
0 2
1 0
2 1
3 1
4 2
5 0
dtype: int64
check if any value is NaN in a DataFrame
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[::2,0] = np.nan; df.iloc[::4,2] = np.nan; df.iloc[::3,2] = np.nan
print(df)
0 1 2
0 NaN -2.701309 NaN
1 -1.346370 2.021702 0.518913
2 NaN -0.220619 -0.083774
3 0.706715 -0.475726 NaN
4 NaN 0.607583 NaN
5 1.864375 1.427532 -0.553547
check column-wise whether any value is NaN
df.isnull().any()
0 True
1 False
2 True
dtype: bool
check if there is any NaN in the entire data
df.isnull().any().any()
True
replace NaN values by zeros in a column of a DataFrame
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[::2,0] = np.nan; df.iloc[::4,2] = np.nan; df.iloc[::3,2] = np.nan
print(df)
0 1 2
0 NaN 0.586977 NaN
1 -2.020800 2.246211 -0.138082
2 NaN -0.003072 -0.902660
3 -0.455902 1.308343 NaN
4 NaN 2.010836 NaN
5 -0.598552 -0.270577 0.004179
df.fillna(0)
To fill the NaNs in only one column
0 1 2
0 NaN 0.586977 0.000000
1 -2.020800 2.246211 -0.138082
2 NaN -0.003072 -0.902660
3 -0.455902 1.308343 0.000000
4 NaN 2.010836 0.000000
5 -0.598552 -0.270577 0.004179
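In the table above only column 2 has been filled, while column 0 still contains NaN. A minimal sketch of filling a single column, assuming the filled column is assigned back; the frame `demo` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with NaN in columns 0 and 2.
demo = pd.DataFrame({
    0: [np.nan, 1.0, np.nan],
    1: [0.5, 0.6, 0.7],
    2: [np.nan, 2.0, 3.0],
})

# Fill NaN in column 2 only, by assigning the filled column back.
demo[2] = demo[2].fillna(0)
print(demo)
```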
check if a column exists in Pandas
df = pd.DataFrame(np.random.randn(6,3))
print(df)
0 1 2
0 0.322260 -0.163406 0.755192
1 -1.429486 1.118667 -0.738101
2 -0.137188 1.157078 -0.016540
3 -1.187170 -0.086152 0.832547
4 -1.666758 0.856316 1.873881
5 -3.547606 2.329907 1.286130
if 0 in df.columns:
    print("true")
true
Python dict into a dataframe
d = {
'2012-06-08': 388,
'2012-06-09': 388,
'2012-06-10': 388,
'2012-06-11': 389,
'2012-06-12': 389,
'2012-06-13': 389,
'2012-06-14': 389,
'2012-06-15': 389,
'2012-06-16': 389,
'2012-06-17': 389,
'2012-06-18': 390,
'2012-06-19': 390,
'2012-06-20': 390,
}
pd.DataFrame(d.items())
pd.DataFrame(d.items(), columns=['Date', 'DateValue'])
# Uncommenting the following line raises
# ValueError: If using all scalar values, you must pass an index
# pd.DataFrame(d)
pd.DataFrame([d])
pd.DataFrame.from_dict(d, orient='index', columns=['DateValue'])
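The constructions above produce frames of different shapes from the same dictionary. The sketch below uses a shortened, hypothetical dictionary `d` with three entries and compares the resulting shapes; the names `pairs`, `one_row` and `by_index` are introduced here for illustration.

```python
import pandas as pd

# Shortened, hypothetical version of the dictionary in the text.
d = {"2012-06-08": 388, "2012-06-09": 388, "2012-06-10": 389}

# One row per (key, value) pair, two columns.
pairs = pd.DataFrame(list(d.items()), columns=["Date", "DateValue"])

# One row in total, one column per key.
one_row = pd.DataFrame([d])

# Keys become the index, values go into a single column.
by_index = pd.DataFrame.from_dict(d, orient="index", columns=["DateValue"])

print(pairs.shape, one_row.shape, by_index.shape)
```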
Count the frequency that a value occurs in a dataframe column
df = pd.DataFrame(np.random.randint(0, 14, (10, 3)),
columns=['a', 'b', 'c'],
index=[2000+i for i in range(10)])
df['a'].value_counts()
10 4
12 2
8 1
5 1
1 1
0 1
Name: a, dtype: int64
for index, row in df.iterrows():
    print(index, row, '\n')
2000 a 12
b 8
c 5
Name: 2000, dtype: int64
2001 a 10
b 9
c 2
Name: 2001, dtype: int64
2002 a 8
b 1
c 6
Name: 2002, dtype: int64
2003 a 5
b 3
c 4
Name: 2003, dtype: int64
2004 a 10
b 3
c 5
Name: 2004, dtype: int64
2005 a 12
b 8
c 9
Name: 2005, dtype: int64
2006 a 10
b 2
c 3
Name: 2006, dtype: int64
2007 a 1
b 13
c 2
Name: 2007, dtype: int64
2008 a 0
b 8
c 4
Name: 2008, dtype: int64
2009 a 10
b 3
c 1
Name: 2009, dtype: int64
df = pd.DataFrame(np.random.randint(0, 14, (10, 3)),
columns=['a', 'b', 'c'])
print(df)
a b c
0 10 5 10
1 5 12 8
2 1 0 0
3 10 0 2
4 1 6 4
5 8 5 0
6 1 5 7
7 3 3 6
8 13 5 3
9 9 1 1
0 2.000000
1 0.416667
2 inf
3 inf
4 0.166667
5 1.600000
6 0.200000
7 1.000000
8 2.600000
9 9.000000
dtype: float64
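The Series above equals column a divided by column b, row by row, although the expression that produced it is not shown. A minimal sketch, assuming plain column arithmetic; the frame `demo` holds the first four rows of the printed frame and the name `ratio` is hypothetical.

```python
import numpy as np
import pandas as pd

# First four rows of columns a and b from the frame printed above.
demo = pd.DataFrame({"a": [10, 5, 1, 10], "b": [5, 12, 0, 0]})

# Arithmetic between two columns is element-wise and aligned on the index.
# A non-zero value divided by zero gives inf instead of raising an error.
ratio = demo["a"] / demo["b"]
print(ratio)
```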
add an empty column to a dataframe?
a b c d
0 10 5 10
1 5 12 8
2 1 0 0
3 10 0 2
4 1 6 4
5 8 5 0
6 1 5 7
7 3 3 6
8 13 5 3
9 9 1 1
print(df['d'])
0
1
2
3
4
5
6
7
8
9
Name: d, dtype: object
a b c d
0 10 5 10 NaN
1 5 12 8 NaN
2 1 0 0 NaN
3 10 0 2 NaN
4 1 6 4 NaN
5 8 5 0 NaN
6 1 5 7 NaN
7 3 3 6 NaN
8 13 5 3 NaN
9 9 1 1 NaN
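The outputs above show column d first as blanks and then as NaN, but the assignments are not shown. A minimal sketch of both variants, assuming a scalar is assigned to a new column; the frames `blank` and `missing` are hypothetical.

```python
import numpy as np
import pandas as pd

base = pd.DataFrame({"a": [10, 5, 1], "b": [5, 12, 0]})

# Variant 1: an empty string in every row; the column prints as blanks.
blank = base.copy()
blank["d"] = ""

# Variant 2: NaN in every row; the column is treated as missing data.
missing = base.copy()
missing["d"] = np.nan

print(blank)
print(missing)
```

The NaN variant is usually the safer placeholder for numeric data, since methods such as `mean` skip NaN by default, as the `df.mean(axis=0)` output in the next section illustrates.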
What does axis in pandas mean?
df.mean(axis=0)
a 6.1
b 4.2
c 4.1
d NaN
dtype: float64
df.mean(axis=1)
0 8.333333
1 8.333333
2 0.333333
3 4.000000
4 3.666667
5 4.333333
6 4.333333
7 4.000000
8 7.000000
9 3.666667
dtype: float64
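A rough way to read these two calls: `axis=0` runs the operation down the rows and returns one value per column, while `axis=1` runs it across the columns and returns one value per row. A small sketch with a hypothetical frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 3], "b": [5, 7], "c": [9, 11]})

# axis=0: collapse the rows, giving one mean per column.
col_means = demo.mean(axis=0)

# axis=1: collapse the columns, giving one mean per row.
row_means = demo.mean(axis=1)

print(col_means)
print(row_means)
```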
Replace NaN with blank/empty string
df.replace(9, np.nan)
df.replace(np.nan, '')
Rename specific column(s) in pandas
df = pd.DataFrame(np.random.randint(0, 14, (10, 3)), columns=['a', 'b', 'c'])
print(df)
a b c
0 13 5 13
1 8 8 2
2 9 9 6
3 11 7 9
4 12 2 3
5 9 9 1
6 10 9 7
7 7 3 1
8 1 5 12
9 4 3 9
log(A) b c
0 13 5 13
1 8 8 2
2 9 9 6
3 11 7 9
4 12 2 3
5 9 9 1
6 10 9 7
7 7 3 1
8 1 5 12
9 4 3 9
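In the table above column a appears under the name log(A), but the call that renamed it is not shown. A minimal sketch, assuming `DataFrame.rename` with a mapping; the frame `demo` and the name `renamed` are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({"a": [13, 8], "b": [5, 8], "c": [13, 2]})

# Only the columns named in the mapping are renamed; the rest are untouched.
renamed = demo.rename(columns={"a": "log(A)"})
print(renamed.columns.to_list())
```

`rename` returns a new frame, so `demo` keeps its original column names unless the result is assigned back.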
print DataFrame without index
print(df)
log(A) b c
0 13 5 13
1 8 8 2
2 9 9 6
3 11 7 9
4 12 2 3
5 9 9 1
6 10 9 7
7 7 3 1
8 1 5 12
9 4 3 9
df.style.hide(axis="index")  # hide_index() is deprecated in favour of Styler.hide(axis="index")
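For plain-text output, one option is `DataFrame.to_string` with `index=False`. This is a sketch of an alternative, not necessarily the approach the original script intended; the frame `demo` is hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({"a": [13, 8], "b": [5, 8]})

# to_string renders the frame as text; index=False leaves out the row labels.
text = demo.to_string(index=False)
print(text)
```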
replace nan values with average of columns
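No code is shown for this item. A minimal sketch, assuming `fillna` is given the column means; the frame `demo` and the name `filled` are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 6.0]})

# demo.mean() skips NaN by default and returns one mean per column;
# fillna then matches those means to the columns by label.
filled = demo.fillna(demo.mean())
print(filled)
```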
retrieve the number of columns in a dataframe?
len(df.columns)
3
print(df.shape[1])
3
We can create an empty DataFrame by specifying only its column labels or only its row index.
df = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
print(df)
Empty DataFrame
Columns: [A, B, C, D, E, F, G]
Index: []
print(df.shape)
(0, 7)
df = pd.DataFrame(index=range(1,8))
print(df)
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5, 6, 7]
print(df.shape)
(7, 0)