5.2 indexing and slicing
Contents
Note
Click here to download the full example code or to run this example in your browser via Binder
5.2 indexing and slicing#
This lesson shows how to select rows or columns from pandas dataframe.
import pandas as pd
print(pd.__version__)
1.5.3
pd.set_option('display.max_columns', 11)
Let’s create a dataframe
df = pd.DataFrame({
'name': ['Ali', 'Hasan', 'Husain', 'Ali', 'Muhammad', 'Jafar', 'Musa', 'Raza', 'Taqi', 'Naqi', 'Askari'],
'age': [63, 47, 57, 57, 57, 65, 55, 55, 24, 40, 27],
'other': ['muavia','muavia','yazid','walid','hisham','mansur','harun','mamun','Mutasim','Mutasim','mutaz'],
'cityb': ['MKH', 'MAD', 'MAD', 'MAD', 'MAD', 'MAD', 'MAD', 'MAD', 'MAD', 'MAD', 'MAD'],
'duration':[29, 10, 11, 34, 19, 32, 35, 20, 16, 34, 6],
'YoB_H': [-22, 3, 4, 38, 56, 83, 128, 148, 195, 212, 232],
'YOB_G': [600, 625, 626, 659, 676, 702, 745, 766, 811, 828, 844],
'YoM_H': [40, 50, 61, 95, 114, 148, 183, 203, 220, 254, 260],
'YoM_G': [661, 670, 680, 712, 733, 765, 799, 818, 835, 868, 874],
'dynasty':[None,'umayad','umayad','umayad','umayad','abbasid','abbasid','abbasid','abbasid','abbasid','abbasid'],
'cityd': ['NJF', 'MAD', 'KBL', 'MAD', 'MAD', 'MAD', 'BGD', 'MAS', 'BGH', 'SAM', 'SAM']
},
index=['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eigth', 'ninth', 'tenth', 'eleventh']
)
print(df)
name age other cityb duration YoB_H YOB_G YoM_H YoM_G \
first Ali 63 muavia MKH 29 -22 600 40 661
second Hasan 47 muavia MAD 10 3 625 50 670
third Husain 57 yazid MAD 11 4 626 61 680
fourth Ali 57 walid MAD 34 38 659 95 712
fifth Muhammad 57 hisham MAD 19 56 676 114 733
sixth Jafar 65 mansur MAD 32 83 702 148 765
seventh Musa 55 harun MAD 35 128 745 183 799
eigth Raza 55 mamun MAD 20 148 766 203 818
ninth Taqi 24 Mutasim MAD 16 195 811 220 835
tenth Naqi 40 Mutasim MAD 34 212 828 254 868
eleventh Askari 27 mutaz MAD 6 232 844 260 874
dynasty cityd
first None NJF
second umayad MAD
third umayad KBL
fourth umayad MAD
fifth umayad MAD
sixth abbasid MAD
seventh abbasid BGD
eigth abbasid MAS
ninth abbasid BGH
tenth abbasid SAM
eleventh abbasid SAM
print(df.shape)
(11, 11)
The indexing operator, [], can be used for slicing and for selecting rows and columns.
However, it can not be used for both purposes (slicing and selecting rows/columns) simultaneously.
We can select a single column from dataframe as below
print(df['name'])
first Ali
second Hasan
third Husain
fourth Ali
fifth Muhammad
sixth Jafar
seventh Musa
eigth Raza
ninth Taqi
tenth Naqi
eleventh Askari
Name: name, dtype: object
For selecting multiple columns, we must pass a list of columns.
print(df[['other', 'YoM_H']])
other YoM_H
first muavia 40
second muavia 50
third yazid 61
fourth walid 95
fifth hisham 114
sixth mansur 148
seventh harun 183
eigth mamun 203
ninth Mutasim 220
tenth Mutasim 254
eleventh mutaz 260
when slice notation : is used, then selection happens either by row labels
or by integer location
Select rows starting from index of second till tenth
print(df['second':'tenth'])
name age other cityb duration YoB_H YOB_G YoM_H YoM_G \
second Hasan 47 muavia MAD 10 3 625 50 670
third Husain 57 yazid MAD 11 4 626 61 680
fourth Ali 57 walid MAD 34 38 659 95 712
fifth Muhammad 57 hisham MAD 19 56 676 114 733
sixth Jafar 65 mansur MAD 32 83 702 148 765
seventh Musa 55 harun MAD 35 128 745 183 799
eigth Raza 55 mamun MAD 20 148 766 203 818
ninth Taqi 24 Mutasim MAD 16 195 811 220 835
tenth Naqi 40 Mutasim MAD 34 212 828 254 868
dynasty cityd
second umayad MAD
third umayad KBL
fourth umayad MAD
fifth umayad MAD
sixth abbasid MAD
seventh abbasid BGD
eigth abbasid MAS
ninth abbasid BGH
tenth abbasid SAM
Select every second row starting from 3rd till 7th
print(df[2:6:2])
name age other cityb duration YoB_H YOB_G YoM_H YoM_G \
third Husain 57 yazid MAD 11 4 626 61 680
fifth Muhammad 57 hisham MAD 19 56 676 114 733
dynasty cityd
third umayad KBL
fifth umayad MAD
However, there are more specific methods for indexing and slicing a dataframe. These
are loc, iloc, at and iat.
at and iat are meant to access a scalar, i.e, a single element in the dataframe,
while loc and iloc are used to access several elements at the same time,
potentially to perform vectorized operations
loc#
only work on index
label based
It is used when we want to select rows or columns from a dataframe using the names of columns or the name of index.
The index operator [] after .loc can have two values/identifiers separated
by comma “,”. The first identifier (before comma) tells which row/rows we want to
select and second identifier tells, which columns we want to select.
For example if we want to select a row whose index is “third”, we can use loc.
print(df.loc['third'])
name Husain
age 57
other yazid
cityb MAD
duration 11
YoB_H 4
YOB_G 626
YoM_H 61
YoM_G 680
dynasty umayad
cityd KBL
Name: third, dtype: object
Above we did not specify the second identifier i.e. there is no comma. This is because the if we don’t specify the columns, it will give all the columns.
We can select multiple rows with .loc with a list of strings
print(df.loc[['second', 'fourth', 'sixth']])
name age other cityb duration YoB_H YOB_G YoM_H YoM_G \
second Hasan 47 muavia MAD 10 3 625 50 670
fourth Ali 57 walid MAD 34 38 659 95 712
sixth Jafar 65 mansur MAD 32 83 702 148 765
dynasty cityd
second umayad MAD
fourth umayad MAD
sixth abbasid MAD
Selecting multiple rows with .loc with slice notation :
print(df.loc['second':'fifth'])
name age other cityb duration YoB_H YOB_G YoM_H YoM_G \
second Hasan 47 muavia MAD 10 3 625 50 670
third Husain 57 yazid MAD 11 4 626 61 680
fourth Ali 57 walid MAD 34 38 659 95 712
fifth Muhammad 57 hisham MAD 19 56 676 114 733
dynasty cityd
second umayad MAD
third umayad KBL
fourth umayad MAD
fifth umayad MAD
In following code, we simultaneously select rows and columns by their labels. Before comman, we tell which rows we want and after comma we tell which columns we want.
print(df.loc[['fifth', 'sixth'], 'other':])
other cityb duration YoB_H YOB_G YoM_H YoM_G dynasty cityd
fifth hisham MAD 19 56 676 114 733 umayad MAD
sixth mansur MAD 32 83 702 148 765 abbasid MAD
If we want to select all the rows, we can use colon i.e. :.
print(df.loc[:, 'name':'cityd':2])
name other duration YOB_G YoM_G cityd
first Ali muavia 29 600 661 NJF
second Hasan muavia 10 625 670 MAD
third Husain yazid 11 626 680 KBL
fourth Ali walid 34 659 712 MAD
fifth Muhammad hisham 19 676 733 MAD
sixth Jafar mansur 32 702 765 MAD
seventh Musa harun 35 745 799 BGD
eigth Raza mamun 20 766 818 MAS
ninth Taqi Mutasim 16 811 835 BGH
tenth Naqi Mutasim 34 828 868 SAM
eleventh Askari mutaz 6 844 874 SAM
Above we wanted to select all the rows (indicated by :) and every second
columns starting from name to cityd.
We can also select rows/columns with conditions. For example if we want rows where age is above 50, we can do as below
name age other cityb duration YoB_H YOB_G YoM_H YoM_G \
first Ali 63 muavia MKH 29 -22 600 40 661
third Husain 57 yazid MAD 11 4 626 61 680
fourth Ali 57 walid MAD 34 38 659 95 712
fifth Muhammad 57 hisham MAD 19 56 676 114 733
sixth Jafar 65 mansur MAD 32 83 702 148 765
seventh Musa 55 harun MAD 35 128 745 183 799
eigth Raza 55 mamun MAD 20 148 766 203 818
dynasty cityd
first None NJF
third umayad KBL
fourth umayad MAD
fifth umayad MAD
sixth abbasid MAD
seventh abbasid BGD
eigth abbasid MAS
In above code, df['age']>50, is the condition. The output of df['age']>50 is
a boolean array. Thus when we pass a boolean array to loc, it returns us rows based
upon the specified condition.
We can have multiple conditions as well
name age other cityb duration YoB_H YOB_G YoM_H YoM_G \
fourth Ali 57 walid MAD 34 38 659 95 712
sixth Jafar 65 mansur MAD 32 83 702 148 765
seventh Musa 55 harun MAD 35 128 745 183 799
dynasty cityd
fourth umayad MAD
sixth abbasid MAD
seventh abbasid BGD
We can not do above conditioning with strings. What if we want
all rows where other is either muavia or yazid. In such
a case we can provide all the values as a list inside the isin
method.
name age other cityb duration YoB_H YOB_G YoM_H YoM_G \
first Ali 63 muavia MKH 29 -22 600 40 661
second Hasan 47 muavia MAD 10 3 625 50 670
third Husain 57 yazid MAD 11 4 626 61 680
dynasty cityd
first None NJF
second umayad MAD
third umayad KBL
We can even combine boolean indexing/condition with label based indexing.
other YoM_H
first muavia 40
second muavia 50
third yazid 61
fourth walid 95
fifth hisham 114
sixth mansur 148
seventh harun 183
eigth mamun 203
tenth Mutasim 254
Question: Write the names of the people who were in umayad dynasty using loc?
iloc#
integer location based
work on position
iloc is used to select rows and columns from dataframe by their location/index/position value.
If we don’t know the actual names of columns and just want the columns by their locations/position,
we can use iloc. For example if we want the last row from dataframe, we can do as below
print(df.iloc[-1])
name Askari
age 27
other mutaz
cityb MAD
duration 6
YoB_H 232
YOB_G 844
YoM_H 260
YoM_G 874
dynasty abbasid
cityd SAM
Name: eleventh, dtype: object
If we want the last column, we can do as below
print(df.iloc[:, -1])
first NJF
second MAD
third KBL
fourth MAD
fifth MAD
sixth MAD
seventh BGD
eigth MAS
ninth BGH
tenth SAM
eleventh SAM
Name: cityd, dtype: object
The : above tells that we want all rows.
If we want 5th row, we can dow as below
print(df.iloc[4])
name Muhammad
age 57
other hisham
cityb MAD
duration 19
YoB_H 56
YOB_G 676
YoM_H 114
YoM_G 733
dynasty umayad
cityd MAD
Name: fifth, dtype: object
If we want to select multiple rows, we need to pass a list.
Select thrid and second last row
print(df.iloc[[2, -2]])
name age other cityb duration YoB_H YOB_G YoM_H YoM_G \
third Husain 57 yazid MAD 11 4 626 61 680
tenth Naqi 40 Mutasim MAD 34 212 828 254 868
dynasty cityd
third umayad KBL
tenth abbasid SAM
Selecting multiple rows with .iloc with slice notation
Select every second row starting from first till 8th
print(df.iloc[:7:2])
name age other cityb duration YoB_H YOB_G YoM_H YoM_G \
first Ali 63 muavia MKH 29 -22 600 40 661
third Husain 57 yazid MAD 11 4 626 61 680
fifth Muhammad 57 hisham MAD 19 56 676 114 733
seventh Musa 55 harun MAD 35 128 745 183 799
dynasty cityd
first None NJF
third umayad KBL
fifth umayad MAD
seventh abbasid BGD
Simultaneous selection of rows and columns
Select second and fifth row and third column
print(df.iloc[[1,4], 2])
second muavia
fifth hisham
Name: other, dtype: object
As we did with loc, we can also use a boolean array for selection to iloc.
Select 3rd and fifth column but where age is greater than 30
other duration
first muavia 29
second muavia 10
third yazid 11
fourth walid 34
fifth hisham 19
sixth mansur 32
seventh harun 35
eigth mamun 20
tenth Mutasim 34
Question:
Select first and last rows and from first and last columns using iloc
at#
Selection with .at is nearly identical to .loc but it only selects a single ‘cell’ in your DataFrame. We usually refer to this cell as a scalar value. To use .at, pass it both a row and column label separated by a comma.
print(df.at['sixth', 'duration'])
32
iat#
Selection with iat is nearly identical to iloc but it only selects a single scalar value. You must pass it an integer for both the row and column locations
print(df.iat[2, 5])
4
Question:
Calculate the average age of people in umayad and abbasid dynasties using loc and iloc?
Question: Calculate the years in office by subtracting YoM_H from YoM_H of the above row and find out which one had the longest and shortest stay in office? Since there is no row above the first row, consider 11 as the YoM_H of the row above.
Question: Who lived the longest and shortest?
Total running time of the script: ( 0 minutes 0.060 seconds)