5.2 indexing and slicing#

This lesson shows how to select rows or columns from pandas dataframe.

import pandas as pd
print(pd.__version__)
1.5.3
pd.set_option('display.max_columns', 11)

Let’s create a dataframe

df = pd.DataFrame({
    'name':  ['Ali', 'Hasan', 'Husain', 'Ali', 'Muhammad', 'Jafar', 'Musa', 'Raza', 'Taqi', 'Naqi', 'Askari'],
    'age':   [63,     47,      57,       57,     57,         65,      55,     55,    24,     40,     27],
    'other': ['muavia','muavia','yazid','walid','hisham','mansur','harun','mamun','Mutasim','Mutasim','mutaz'],
    'cityb': ['MKH',    'MAD',   'MAD',  'MAD',  'MAD',   'MAD',   'MAD',  'MAD',  'MAD',    'MAD',   'MAD'],
    'duration':[29,      10,      11,       34,     19,    32,      35,     20,     16,      34,       6],
    'YoB_H': [-22,       3,       4,        38,     56,    83,      128,    148,    195,     212,      232],
    'YOB_G': [600,       625,     626,      659,    676,   702,     745,    766,    811,     828,      844],
    'YoM_H': [40,        50,      61,       95,     114,   148,     183,    203,    220,     254,      260],
    'YoM_G': [661,       670,     680,      712,    733,   765,     799,    818,    835,     868,      874],
    'dynasty':[None,'umayad','umayad','umayad','umayad','abbasid','abbasid','abbasid','abbasid','abbasid','abbasid'],
    'cityd':  ['NJF',  'MAD',   'KBL',    'MAD',  'MAD',      'MAD',   'BGD',  'MAS',  'BGH',   'SAM',    'SAM']
},
    index=['first', 'second', 'third', 'fourth', 'fifth',  'sixth', 'seventh', 'eigth', 'ninth', 'tenth', 'eleventh']
)
print(df)
              name  age    other cityb  duration  YoB_H  YOB_G  YoM_H  YoM_G  \
first          Ali   63   muavia   MKH        29    -22    600     40    661
second       Hasan   47   muavia   MAD        10      3    625     50    670
third       Husain   57    yazid   MAD        11      4    626     61    680
fourth         Ali   57    walid   MAD        34     38    659     95    712
fifth     Muhammad   57   hisham   MAD        19     56    676    114    733
sixth        Jafar   65   mansur   MAD        32     83    702    148    765
seventh       Musa   55    harun   MAD        35    128    745    183    799
eigth         Raza   55    mamun   MAD        20    148    766    203    818
ninth         Taqi   24  Mutasim   MAD        16    195    811    220    835
tenth         Naqi   40  Mutasim   MAD        34    212    828    254    868
eleventh    Askari   27    mutaz   MAD         6    232    844    260    874

          dynasty cityd
first        None   NJF
second     umayad   MAD
third      umayad   KBL
fourth     umayad   MAD
fifth      umayad   MAD
sixth     abbasid   MAD
seventh   abbasid   BGD
eigth     abbasid   MAS
ninth     abbasid   BGH
tenth     abbasid   SAM
eleventh  abbasid   SAM
print(df.shape)
(11, 11)

The indexing operator, [], can be used for slicing and for selecting rows and columns. However, it can not be used for both purposes (slicing and selecting rows/columns) simultaneously.

We can select a single column from dataframe as below

print(df['name'])
first            Ali
second         Hasan
third         Husain
fourth           Ali
fifth       Muhammad
sixth          Jafar
seventh         Musa
eigth           Raza
ninth           Taqi
tenth           Naqi
eleventh      Askari
Name: name, dtype: object

For selecting multiple columns, we must pass a list of columns.

print(df[['other', 'YoM_H']])
            other  YoM_H
first      muavia     40
second     muavia     50
third       yazid     61
fourth      walid     95
fifth      hisham    114
sixth      mansur    148
seventh     harun    183
eigth       mamun    203
ninth     Mutasim    220
tenth     Mutasim    254
eleventh    mutaz    260

when slice notation : is used, then selection happens either by row labels or by integer location

Select rows starting from index of second till tenth

print(df['second':'tenth'])
             name  age    other cityb  duration  YoB_H  YOB_G  YoM_H  YoM_G  \
second      Hasan   47   muavia   MAD        10      3    625     50    670
third      Husain   57    yazid   MAD        11      4    626     61    680
fourth        Ali   57    walid   MAD        34     38    659     95    712
fifth    Muhammad   57   hisham   MAD        19     56    676    114    733
sixth       Jafar   65   mansur   MAD        32     83    702    148    765
seventh      Musa   55    harun   MAD        35    128    745    183    799
eigth        Raza   55    mamun   MAD        20    148    766    203    818
ninth        Taqi   24  Mutasim   MAD        16    195    811    220    835
tenth        Naqi   40  Mutasim   MAD        34    212    828    254    868

         dynasty cityd
second    umayad   MAD
third     umayad   KBL
fourth    umayad   MAD
fifth     umayad   MAD
sixth    abbasid   MAD
seventh  abbasid   BGD
eigth    abbasid   MAS
ninth    abbasid   BGH
tenth    abbasid   SAM

Select every second row starting from 3rd till 7th

print(df[2:6:2])
           name  age   other cityb  duration  YoB_H  YOB_G  YoM_H  YoM_G  \
third    Husain   57   yazid   MAD        11      4    626     61    680
fifth  Muhammad   57  hisham   MAD        19     56    676    114    733

      dynasty cityd
third  umayad   KBL
fifth  umayad   MAD

However, there are more specific methods for indexing and slicing a dataframe. These are loc, iloc, at and iat. at and iat are meant to access a scalar, i.e, a single element in the dataframe, while loc and iloc are used to access several elements at the same time, potentially to perform vectorized operations

loc#

  • only work on index

  • label based

It is used when we want to select rows or columns from a dataframe using the names of columns or the name of index.

The index operator [] after .loc can have two values/identifiers separated by comma “,”. The first identifier (before comma) tells which row/rows we want to select and second identifier tells, which columns we want to select.

For example if we want to select a row whose index is “third”, we can use loc.

print(df.loc['third'])
name        Husain
age             57
other        yazid
cityb          MAD
duration        11
YoB_H            4
YOB_G          626
YoM_H           61
YoM_G          680
dynasty     umayad
cityd          KBL
Name: third, dtype: object

Above we did not specify the second identifier i.e. there is no comma. This is because the if we don’t specify the columns, it will give all the columns.

We can select multiple rows with .loc with a list of strings

print(df.loc[['second', 'fourth', 'sixth']])
         name  age   other cityb  duration  YoB_H  YOB_G  YoM_H  YoM_G  \
second  Hasan   47  muavia   MAD        10      3    625     50    670
fourth    Ali   57   walid   MAD        34     38    659     95    712
sixth   Jafar   65  mansur   MAD        32     83    702    148    765

        dynasty cityd
second   umayad   MAD
fourth   umayad   MAD
sixth   abbasid   MAD

Selecting multiple rows with .loc with slice notation :

print(df.loc['second':'fifth'])
            name  age   other cityb  duration  YoB_H  YOB_G  YoM_H  YoM_G  \
second     Hasan   47  muavia   MAD        10      3    625     50    670
third     Husain   57   yazid   MAD        11      4    626     61    680
fourth       Ali   57   walid   MAD        34     38    659     95    712
fifth   Muhammad   57  hisham   MAD        19     56    676    114    733

       dynasty cityd
second  umayad   MAD
third   umayad   KBL
fourth  umayad   MAD
fifth   umayad   MAD

In following code, we simultaneously select rows and columns by their labels. Before comman, we tell which rows we want and after comma we tell which columns we want.

print(df.loc[['fifth', 'sixth'], 'other':])
        other cityb  duration  YoB_H  YOB_G  YoM_H  YoM_G  dynasty cityd
fifth  hisham   MAD        19     56    676    114    733   umayad   MAD
sixth  mansur   MAD        32     83    702    148    765  abbasid   MAD

If we want to select all the rows, we can use colon i.e. :.

print(df.loc[:, 'name':'cityd':2])
              name    other  duration  YOB_G  YoM_G cityd
first          Ali   muavia        29    600    661   NJF
second       Hasan   muavia        10    625    670   MAD
third       Husain    yazid        11    626    680   KBL
fourth         Ali    walid        34    659    712   MAD
fifth     Muhammad   hisham        19    676    733   MAD
sixth        Jafar   mansur        32    702    765   MAD
seventh       Musa    harun        35    745    799   BGD
eigth         Raza    mamun        20    766    818   MAS
ninth         Taqi  Mutasim        16    811    835   BGH
tenth         Naqi  Mutasim        34    828    868   SAM
eleventh    Askari    mutaz         6    844    874   SAM

Above we wanted to select all the rows (indicated by :) and every second columns starting from name to cityd.

We can also select rows/columns with conditions. For example if we want rows where age is above 50, we can do as below

print(df.loc[df['age']>50])
             name  age   other cityb  duration  YoB_H  YOB_G  YoM_H  YoM_G  \
first         Ali   63  muavia   MKH        29    -22    600     40    661
third      Husain   57   yazid   MAD        11      4    626     61    680
fourth        Ali   57   walid   MAD        34     38    659     95    712
fifth    Muhammad   57  hisham   MAD        19     56    676    114    733
sixth       Jafar   65  mansur   MAD        32     83    702    148    765
seventh      Musa   55   harun   MAD        35    128    745    183    799
eigth        Raza   55   mamun   MAD        20    148    766    203    818

         dynasty cityd
first       None   NJF
third     umayad   KBL
fourth    umayad   MAD
fifth     umayad   MAD
sixth    abbasid   MAD
seventh  abbasid   BGD
eigth    abbasid   MAS

In above code, df['age']>50, is the condition. The output of df['age']>50 is a boolean array. Thus when we pass a boolean array to loc, it returns us rows based upon the specified condition.

We can have multiple conditions as well

print(df.loc[(df['age']>50) & (df['duration']>30)])
          name  age   other cityb  duration  YoB_H  YOB_G  YoM_H  YoM_G  \
fourth     Ali   57   walid   MAD        34     38    659     95    712
sixth    Jafar   65  mansur   MAD        32     83    702    148    765
seventh   Musa   55   harun   MAD        35    128    745    183    799

         dynasty cityd
fourth    umayad   MAD
sixth    abbasid   MAD
seventh  abbasid   BGD

We can not do above conditioning with strings. What if we want all rows where other is either muavia or yazid. In such a case we can provide all the values as a list inside the isin method.

print(df.loc[df['other'].isin(['muavia', 'yazid'])])
          name  age   other cityb  duration  YoB_H  YOB_G  YoM_H  YoM_G  \
first      Ali   63  muavia   MKH        29    -22    600     40    661
second   Hasan   47  muavia   MAD        10      3    625     50    670
third   Husain   57   yazid   MAD        11      4    626     61    680

       dynasty cityd
first     None   NJF
second  umayad   MAD
third   umayad   KBL

We can even combine boolean indexing/condition with label based indexing.

print(df.loc[df['age'] > 30, ['other', 'YoM_H']])
           other  YoM_H
first     muavia     40
second    muavia     50
third      yazid     61
fourth     walid     95
fifth     hisham    114
sixth     mansur    148
seventh    harun    183
eigth      mamun    203
tenth    Mutasim    254

Question: Write the names of the people who were in umayad dynasty using loc?

iloc#

  • integer location based

  • work on position

iloc is used to select rows and columns from dataframe by their location/index/position value. If we don’t know the actual names of columns and just want the columns by their locations/position, we can use iloc. For example if we want the last row from dataframe, we can do as below

print(df.iloc[-1])
name         Askari
age              27
other         mutaz
cityb           MAD
duration          6
YoB_H           232
YOB_G           844
YoM_H           260
YoM_G           874
dynasty     abbasid
cityd           SAM
Name: eleventh, dtype: object

If we want the last column, we can do as below

print(df.iloc[:, -1])
first       NJF
second      MAD
third       KBL
fourth      MAD
fifth       MAD
sixth       MAD
seventh     BGD
eigth       MAS
ninth       BGH
tenth       SAM
eleventh    SAM
Name: cityd, dtype: object

The : above tells that we want all rows.

If we want 5th row, we can dow as below

print(df.iloc[4])
name        Muhammad
age               57
other         hisham
cityb            MAD
duration          19
YoB_H             56
YOB_G            676
YoM_H            114
YoM_G            733
dynasty       umayad
cityd            MAD
Name: fifth, dtype: object

If we want to select multiple rows, we need to pass a list.

Select thrid and second last row

print(df.iloc[[2, -2]])
         name  age    other cityb  duration  YoB_H  YOB_G  YoM_H  YoM_G  \
third  Husain   57    yazid   MAD        11      4    626     61    680
tenth    Naqi   40  Mutasim   MAD        34    212    828    254    868

       dynasty cityd
third   umayad   KBL
tenth  abbasid   SAM

Selecting multiple rows with .iloc with slice notation

Select every second row starting from first till 8th

print(df.iloc[:7:2])
             name  age   other cityb  duration  YoB_H  YOB_G  YoM_H  YoM_G  \
first         Ali   63  muavia   MKH        29    -22    600     40    661
third      Husain   57   yazid   MAD        11      4    626     61    680
fifth    Muhammad   57  hisham   MAD        19     56    676    114    733
seventh      Musa   55   harun   MAD        35    128    745    183    799

         dynasty cityd
first       None   NJF
third     umayad   KBL
fifth     umayad   MAD
seventh  abbasid   BGD

Simultaneous selection of rows and columns

Select second and fifth row and third column

print(df.iloc[[1,4], 2])
second    muavia
fifth     hisham
Name: other, dtype: object

As we did with loc, we can also use a boolean array for selection to iloc.

Select 3rd and fifth column but where age is greater than 30

print(df.iloc[(df['age'] > 30).values, [2, 4]])
           other  duration
first     muavia        29
second    muavia        10
third      yazid        11
fourth     walid        34
fifth     hisham        19
sixth     mansur        32
seventh    harun        35
eigth      mamun        20
tenth    Mutasim        34

Question: Select first and last rows and from first and last columns using iloc

at#

Selection with .at is nearly identical to .loc but it only selects a single ‘cell’ in your DataFrame. We usually refer to this cell as a scalar value. To use .at, pass it both a row and column label separated by a comma.

print(df.at['sixth', 'duration'])
32

iat#

Selection with iat is nearly identical to iloc but it only selects a single scalar value. You must pass it an integer for both the row and column locations

print(df.iat[2, 5])
4

Question: Calculate the average age of people in umayad and abbasid dynasties using loc and iloc?

Question: Calculate the years in office by subtracting YoM_H from YoM_H of the above row and find out which one had the longest and shortest stay in office? Since there is no row above the first row, consider 11 as the YoM_H of the row above.

Question: Who lived the longest and shortest?

Total running time of the script: ( 0 minutes 0.060 seconds)

Gallery generated by Sphinx-Gallery