5.2 indexing and slicing
Contents
Note
Click here to download the full example code or to run this example in your browser via Binder
5.2 indexing and slicing#
This lesson shows how to select rows or columns from pandas dataframe.
import pandas as pd
print(pd.__version__)
1.5.3
Let’s create a dataframe
df = pd.DataFrame({
'name': ['Ali', 'Hasan', 'Husain', 'Ali', 'Muhammad', 'Jafar', 'Musa', 'Raza', 'Naqi'],
'age': [63, 47, 57, 57, 57, 65, 55, 55, 40],
'other': ['muavia','muavia','yazid', 'walid','hisham', 'mansur', 'harun', 'mamun', 'mutaz'],
'cityb': ['MKH', 'MAD', 'MAD', 'MAD', 'MAD', 'MAD', 'MAD', 'MAD', 'MAD'],
'duration':[29, 10, 11, 34, 19, 32, 35, 20, 34],
'YoM': [40, 50, 61, 95, 114, 148, 183, 203, 254],
'dynasty':[None, 'umayad', 'umayad', 'umayad', 'umayad', 'abbasid', 'abbasid', 'abbasid', 'abbasid'],
'cityd': ['NJF', 'MAD', 'KBL', 'MAD', 'MAD', 'MAD', 'BGD', 'MAS', 'SAM']
},
index=['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eigth', 'tenth']
)
print(df)
name age other cityb duration YoM dynasty cityd
first Ali 63 muavia MKH 29 40 None NJF
second Hasan 47 muavia MAD 10 50 umayad MAD
third Husain 57 yazid MAD 11 61 umayad KBL
fourth Ali 57 walid MAD 34 95 umayad MAD
fifth Muhammad 57 hisham MAD 19 114 umayad MAD
sixth Jafar 65 mansur MAD 32 148 abbasid MAD
seventh Musa 55 harun MAD 35 183 abbasid BGD
eigth Raza 55 mamun MAD 20 203 abbasid MAS
tenth Naqi 40 mutaz MAD 34 254 abbasid SAM
print(df.shape)
(9, 8)
The indexing operator, []
, can be used for slicing and for selecting rows and columns.
However, it can not be used for both purposes (slicing and selecting rows/columns) simultaneously.
We can select a single column from dataframe as below
print(df['name'])
first Ali
second Hasan
third Husain
fourth Ali
fifth Muhammad
sixth Jafar
seventh Musa
eigth Raza
tenth Naqi
Name: name, dtype: object
For selecting multiple columns, we must pass a list of columns.
print(df[['other', 'YoM']])
other YoM
first muavia 40
second muavia 50
third yazid 61
fourth walid 95
fifth hisham 114
sixth mansur 148
seventh harun 183
eigth mamun 203
tenth mutaz 254
when slice notation :
is used, then selection happens either by row labels
or by integer location
Select rows starting from index of second till tenth
print(df['second':'tenth'])
name age other cityb duration YoM dynasty cityd
second Hasan 47 muavia MAD 10 50 umayad MAD
third Husain 57 yazid MAD 11 61 umayad KBL
fourth Ali 57 walid MAD 34 95 umayad MAD
fifth Muhammad 57 hisham MAD 19 114 umayad MAD
sixth Jafar 65 mansur MAD 32 148 abbasid MAD
seventh Musa 55 harun MAD 35 183 abbasid BGD
eigth Raza 55 mamun MAD 20 203 abbasid MAS
tenth Naqi 40 mutaz MAD 34 254 abbasid SAM
Select every second row starting from 3rd till 7th
print(df[2:6:2])
name age other cityb duration YoM dynasty cityd
third Husain 57 yazid MAD 11 61 umayad KBL
fifth Muhammad 57 hisham MAD 19 114 umayad MAD
However, there are more specific methods for indexing and slicing a dataframe. These
are loc
, iloc
, at
and iat
.
at
and iat
are meant to access a scalar, i.e, a single element in the dataframe,
while loc
and iloc
are used to access several elements at the same time,
potentially to perform vectorized operations
loc
#
only work on index
label based
It is used when we want to select rows or columns from a dataframe using the names of columns or the name of index.
The index operator []
after .loc
can have two values/identifiers separated
by comma “,”. The first identifier (before comma) tells which row/rows we want to
select and second identifier tells, which columns we want to select.
For example if we want to select a row whose index is “third”, we can use loc
.
print(df.loc['third'])
name Husain
age 57
other yazid
cityb MAD
duration 11
YoM 61
dynasty umayad
cityd KBL
Name: third, dtype: object
Above we did not specify the second identifier i.e. there is no comma. This is because the if we don’t specify the columns, it will give all the columns.
We can select multiple rows with .loc with a list of strings
print(df.loc[['second', 'fourth', 'sixth']])
name age other cityb duration YoM dynasty cityd
second Hasan 47 muavia MAD 10 50 umayad MAD
fourth Ali 57 walid MAD 34 95 umayad MAD
sixth Jafar 65 mansur MAD 32 148 abbasid MAD
Selecting multiple rows with .loc with slice notation :
print(df.loc['second':'fifth'])
name age other cityb duration YoM dynasty cityd
second Hasan 47 muavia MAD 10 50 umayad MAD
third Husain 57 yazid MAD 11 61 umayad KBL
fourth Ali 57 walid MAD 34 95 umayad MAD
fifth Muhammad 57 hisham MAD 19 114 umayad MAD
In following code, we simultaneously select rows and columns by their labels. Before comman, we tell which rows we want and after comma we tell which columns we want.
print(df.loc[['fifth', 'sixth'], 'other':])
other cityb duration YoM dynasty cityd
fifth hisham MAD 19 114 umayad MAD
sixth mansur MAD 32 148 abbasid MAD
If we want to select all the rows, we can use colon i.e. :
.
print(df.loc[:, 'name':'cityd':2])
name other duration dynasty
first Ali muavia 29 None
second Hasan muavia 10 umayad
third Husain yazid 11 umayad
fourth Ali walid 34 umayad
fifth Muhammad hisham 19 umayad
sixth Jafar mansur 32 abbasid
seventh Musa harun 35 abbasid
eigth Raza mamun 20 abbasid
tenth Naqi mutaz 34 abbasid
Above we wanted to select all the rows (indicated by :
) and every second
columns starting from name to cityd.
We can also select rows/columns with conditions. For example if we want rows where age is above 50, we can do as below
name age other cityb duration YoM dynasty cityd
first Ali 63 muavia MKH 29 40 None NJF
third Husain 57 yazid MAD 11 61 umayad KBL
fourth Ali 57 walid MAD 34 95 umayad MAD
fifth Muhammad 57 hisham MAD 19 114 umayad MAD
sixth Jafar 65 mansur MAD 32 148 abbasid MAD
seventh Musa 55 harun MAD 35 183 abbasid BGD
eigth Raza 55 mamun MAD 20 203 abbasid MAS
In above code, df['age']>50
, is the condition. The output of df['age']>50
is
a boolean array. Thus when we pass a boolean array to loc
, it returns us rows based
upon the specified condition.
We can have multiple conditions as well
name age other cityb duration YoM dynasty cityd
fourth Ali 57 walid MAD 34 95 umayad MAD
sixth Jafar 65 mansur MAD 32 148 abbasid MAD
seventh Musa 55 harun MAD 35 183 abbasid BGD
We can not do above conditioning with strings. What if we want
all rows where other is either muavia or yazid. In such
a case we can provide all the values as a list inside the isin
method.
name age other cityb duration YoM dynasty cityd
first Ali 63 muavia MKH 29 40 None NJF
second Hasan 47 muavia MAD 10 50 umayad MAD
third Husain 57 yazid MAD 11 61 umayad KBL
We can even combine boolean indexing/condition with label based indexing.
other YoM
first muavia 40
second muavia 50
third yazid 61
fourth walid 95
fifth hisham 114
sixth mansur 148
seventh harun 183
eigth mamun 203
tenth mutaz 254
Question: Write the names of the people who were in ‘umayad’ dynasty using loc?
iloc
#
integer location based
work on position
iloc
is used to select rows and columns from dataframe by their location/index/position value.
If we don’t know the actual names of columns and just want the columns by their locations/position,
we can use iloc. For example if we want the last row from dataframe, we can do as below
print(df.iloc[-1])
name Naqi
age 40
other mutaz
cityb MAD
duration 34
YoM 254
dynasty abbasid
cityd SAM
Name: tenth, dtype: object
If we want the last column, we can do as below
print(df.iloc[:, -1])
first NJF
second MAD
third KBL
fourth MAD
fifth MAD
sixth MAD
seventh BGD
eigth MAS
tenth SAM
Name: cityd, dtype: object
The :
above tells that we want all rows.
If we want 5th row, we can dow as below
print(df.iloc[4])
name Muhammad
age 57
other hisham
cityb MAD
duration 19
YoM 114
dynasty umayad
cityd MAD
Name: fifth, dtype: object
If we want to select multiple rows, we need to pass a list.
Select thrid and second last row
print(df.iloc[[2, -2]])
name age other cityb duration YoM dynasty cityd
third Husain 57 yazid MAD 11 61 umayad KBL
eigth Raza 55 mamun MAD 20 203 abbasid MAS
Selecting multiple rows with .iloc with slice notation
Select every second row starting from first till 8th
print(df.iloc[:7:2])
name age other cityb duration YoM dynasty cityd
first Ali 63 muavia MKH 29 40 None NJF
third Husain 57 yazid MAD 11 61 umayad KBL
fifth Muhammad 57 hisham MAD 19 114 umayad MAD
seventh Musa 55 harun MAD 35 183 abbasid BGD
Simultaneous selection of rows and columns
Select second and fifth row and third column
print(df.iloc[[1,4], 2])
second muavia
fifth hisham
Name: other, dtype: object
As we did with loc
, we can also use a boolean array for selection to iloc
.
Select 3rd and fifth column but where age is greater than 30
other duration
first muavia 29
second muavia 10
third yazid 11
fourth walid 34
fifth hisham 19
sixth mansur 32
seventh harun 35
eigth mamun 20
tenth mutaz 34
Question:
Select first and last rows and from first and last columns using iloc
at
#
Selection with .at is nearly identical to .loc but it only selects a single ‘cell’ in your DataFrame. We usually refer to this cell as a scalar value. To use .at, pass it both a row and column label separated by a comma.
print(df.at['sixth', 'duration'])
32
iat
#
Selection with iat is nearly identical to iloc but it only selects a single scalar value. You must pass it an integer for both the row and column locations
print(df.iat[2, 5])
61
Total running time of the script: ( 0 minutes 0.123 seconds)