5.2 indexing and slicing#

This lesson shows how to select rows or columns from pandas dataframe.

import pandas as pd
print(pd.__version__)
1.5.3

Let’s create a dataframe

df = pd.DataFrame({
    'name': ['Ali', 'Hasan', 'Husain', 'Ali', 'Muhammad', 'Jafar', 'Musa', 'Raza', 'Naqi'],
    'age':   [63,     47,      57,       57,     57,         65,      55,     55,      40],
    'other': ['muavia','muavia','yazid',  'walid','hisham', 'mansur', 'harun',  'mamun',  'mutaz'],
    'cityb': ['MKH', 'MAD',   'MAD',   'MAD',  'MAD',   'MAD',  'MAD',  'MAD', 'MAD'],
    'duration':[29,   10,      11,       34,     19,         32,      35,     20,      34],
    'YoM':   [40,     50,      61,       95,     114,        148,     183,    203,     254],
    'dynasty':[None, 'umayad', 'umayad', 'umayad', 'umayad', 'abbasid', 'abbasid', 'abbasid', 'abbasid'],
    'cityd':  ['NJF',  'MAD',   'KBL',    'MAD',  'MAD',      'MAD',   'BGD',  'MAS',   'SAM']
},
    index=['first', 'second', 'third', 'fourth', 'fifth',  'sixth', 'seventh', 'eigth', 'tenth']
)
print(df)
             name  age   other cityb  duration  YoM  dynasty cityd
first         Ali   63  muavia   MKH        29   40     None   NJF
second      Hasan   47  muavia   MAD        10   50   umayad   MAD
third      Husain   57   yazid   MAD        11   61   umayad   KBL
fourth        Ali   57   walid   MAD        34   95   umayad   MAD
fifth    Muhammad   57  hisham   MAD        19  114   umayad   MAD
sixth       Jafar   65  mansur   MAD        32  148  abbasid   MAD
seventh      Musa   55   harun   MAD        35  183  abbasid   BGD
eigth        Raza   55   mamun   MAD        20  203  abbasid   MAS
tenth        Naqi   40   mutaz   MAD        34  254  abbasid   SAM
print(df.shape)
(9, 8)

The indexing operator, [], can be used for slicing and for selecting rows and columns. However, it can not be used for both purposes (slicing and selecting rows/columns) simultaneously.

We can select a single column from dataframe as below

print(df['name'])
first           Ali
second        Hasan
third        Husain
fourth          Ali
fifth      Muhammad
sixth         Jafar
seventh        Musa
eigth          Raza
tenth          Naqi
Name: name, dtype: object

For selecting multiple columns, we must pass a list of columns.

print(df[['other', 'YoM']])
          other  YoM
first    muavia   40
second   muavia   50
third     yazid   61
fourth    walid   95
fifth    hisham  114
sixth    mansur  148
seventh   harun  183
eigth     mamun  203
tenth     mutaz  254

when slice notation : is used, then selection happens either by row labels or by integer location

Select rows starting from index of second till tenth

print(df['second':'tenth'])
             name  age   other cityb  duration  YoM  dynasty cityd
second      Hasan   47  muavia   MAD        10   50   umayad   MAD
third      Husain   57   yazid   MAD        11   61   umayad   KBL
fourth        Ali   57   walid   MAD        34   95   umayad   MAD
fifth    Muhammad   57  hisham   MAD        19  114   umayad   MAD
sixth       Jafar   65  mansur   MAD        32  148  abbasid   MAD
seventh      Musa   55   harun   MAD        35  183  abbasid   BGD
eigth        Raza   55   mamun   MAD        20  203  abbasid   MAS
tenth        Naqi   40   mutaz   MAD        34  254  abbasid   SAM

Select every second row starting from 3rd till 7th

print(df[2:6:2])
           name  age   other cityb  duration  YoM dynasty cityd
third    Husain   57   yazid   MAD        11   61  umayad   KBL
fifth  Muhammad   57  hisham   MAD        19  114  umayad   MAD

However, there are more specific methods for indexing and slicing a dataframe. These are loc, iloc, at and iat. at and iat are meant to access a scalar, i.e, a single element in the dataframe, while loc and iloc are used to access several elements at the same time, potentially to perform vectorized operations

loc#

  • only work on index

  • label based

It is used when we want to select rows or columns from a dataframe using the names of columns or the name of index.

The index operator [] after .loc can have two values/identifiers separated by comma “,”. The first identifier (before comma) tells which row/rows we want to select and second identifier tells, which columns we want to select.

For example if we want to select a row whose index is “third”, we can use loc.

print(df.loc['third'])
name        Husain
age             57
other        yazid
cityb          MAD
duration        11
YoM             61
dynasty     umayad
cityd          KBL
Name: third, dtype: object

Above we did not specify the second identifier i.e. there is no comma. This is because the if we don’t specify the columns, it will give all the columns.

We can select multiple rows with .loc with a list of strings

print(df.loc[['second', 'fourth', 'sixth']])
         name  age   other cityb  duration  YoM  dynasty cityd
second  Hasan   47  muavia   MAD        10   50   umayad   MAD
fourth    Ali   57   walid   MAD        34   95   umayad   MAD
sixth   Jafar   65  mansur   MAD        32  148  abbasid   MAD

Selecting multiple rows with .loc with slice notation :

print(df.loc['second':'fifth'])
            name  age   other cityb  duration  YoM dynasty cityd
second     Hasan   47  muavia   MAD        10   50  umayad   MAD
third     Husain   57   yazid   MAD        11   61  umayad   KBL
fourth       Ali   57   walid   MAD        34   95  umayad   MAD
fifth   Muhammad   57  hisham   MAD        19  114  umayad   MAD

In following code, we simultaneously select rows and columns by their labels. Before comman, we tell which rows we want and after comma we tell which columns we want.

print(df.loc[['fifth', 'sixth'], 'other':])
        other cityb  duration  YoM  dynasty cityd
fifth  hisham   MAD        19  114   umayad   MAD
sixth  mansur   MAD        32  148  abbasid   MAD

If we want to select all the rows, we can use colon i.e. :.

print(df.loc[:, 'name':'cityd':2])
             name   other  duration  dynasty
first         Ali  muavia        29     None
second      Hasan  muavia        10   umayad
third      Husain   yazid        11   umayad
fourth        Ali   walid        34   umayad
fifth    Muhammad  hisham        19   umayad
sixth       Jafar  mansur        32  abbasid
seventh      Musa   harun        35  abbasid
eigth        Raza   mamun        20  abbasid
tenth        Naqi   mutaz        34  abbasid

Above we wanted to select all the rows (indicated by :) and every second columns starting from name to cityd.

We can also select rows/columns with conditions. For example if we want rows where age is above 50, we can do as below

print(df.loc[df['age']>50])
             name  age   other cityb  duration  YoM  dynasty cityd
first         Ali   63  muavia   MKH        29   40     None   NJF
third      Husain   57   yazid   MAD        11   61   umayad   KBL
fourth        Ali   57   walid   MAD        34   95   umayad   MAD
fifth    Muhammad   57  hisham   MAD        19  114   umayad   MAD
sixth       Jafar   65  mansur   MAD        32  148  abbasid   MAD
seventh      Musa   55   harun   MAD        35  183  abbasid   BGD
eigth        Raza   55   mamun   MAD        20  203  abbasid   MAS

In above code, df['age']>50, is the condition. The output of df['age']>50 is a boolean array. Thus when we pass a boolean array to loc, it returns us rows based upon the specified condition.

We can have multiple conditions as well

print(df.loc[(df['age']>50) & (df['duration']>30)])
          name  age   other cityb  duration  YoM  dynasty cityd
fourth     Ali   57   walid   MAD        34   95   umayad   MAD
sixth    Jafar   65  mansur   MAD        32  148  abbasid   MAD
seventh   Musa   55   harun   MAD        35  183  abbasid   BGD

We can not do above conditioning with strings. What if we want all rows where other is either muavia or yazid. In such a case we can provide all the values as a list inside the isin method.

print(df.loc[df['other'].isin(['muavia', 'yazid'])])
          name  age   other cityb  duration  YoM dynasty cityd
first      Ali   63  muavia   MKH        29   40    None   NJF
second   Hasan   47  muavia   MAD        10   50  umayad   MAD
third   Husain   57   yazid   MAD        11   61  umayad   KBL

We can even combine boolean indexing/condition with label based indexing.

print(df.loc[df['age'] > 30, ['other', 'YoM']])
          other  YoM
first    muavia   40
second   muavia   50
third     yazid   61
fourth    walid   95
fifth    hisham  114
sixth    mansur  148
seventh   harun  183
eigth     mamun  203
tenth     mutaz  254

Question: Write the names of the people who were in ‘umayad’ dynasty using loc?

iloc#

  • integer location based

  • work on position

iloc is used to select rows and columns from dataframe by their location/index/position value. If we don’t know the actual names of columns and just want the columns by their locations/position, we can use iloc. For example if we want the last row from dataframe, we can do as below

print(df.iloc[-1])
name           Naqi
age              40
other         mutaz
cityb           MAD
duration         34
YoM             254
dynasty     abbasid
cityd           SAM
Name: tenth, dtype: object

If we want the last column, we can do as below

print(df.iloc[:, -1])
first      NJF
second     MAD
third      KBL
fourth     MAD
fifth      MAD
sixth      MAD
seventh    BGD
eigth      MAS
tenth      SAM
Name: cityd, dtype: object

The : above tells that we want all rows.

If we want 5th row, we can dow as below

print(df.iloc[4])
name        Muhammad
age               57
other         hisham
cityb            MAD
duration          19
YoM              114
dynasty       umayad
cityd            MAD
Name: fifth, dtype: object

If we want to select multiple rows, we need to pass a list.

Select thrid and second last row

print(df.iloc[[2, -2]])
         name  age  other cityb  duration  YoM  dynasty cityd
third  Husain   57  yazid   MAD        11   61   umayad   KBL
eigth    Raza   55  mamun   MAD        20  203  abbasid   MAS

Selecting multiple rows with .iloc with slice notation

Select every second row starting from first till 8th

print(df.iloc[:7:2])
             name  age   other cityb  duration  YoM  dynasty cityd
first         Ali   63  muavia   MKH        29   40     None   NJF
third      Husain   57   yazid   MAD        11   61   umayad   KBL
fifth    Muhammad   57  hisham   MAD        19  114   umayad   MAD
seventh      Musa   55   harun   MAD        35  183  abbasid   BGD

Simultaneous selection of rows and columns

Select second and fifth row and third column

print(df.iloc[[1,4], 2])
second    muavia
fifth     hisham
Name: other, dtype: object

As we did with loc, we can also use a boolean array for selection to iloc.

Select 3rd and fifth column but where age is greater than 30

print(df.iloc[(df['age'] > 30).values, [2, 4]])
          other  duration
first    muavia        29
second   muavia        10
third     yazid        11
fourth    walid        34
fifth    hisham        19
sixth    mansur        32
seventh   harun        35
eigth     mamun        20
tenth     mutaz        34

Question: Select first and last rows and from first and last columns using iloc

at#

Selection with .at is nearly identical to .loc but it only selects a single ‘cell’ in your DataFrame. We usually refer to this cell as a scalar value. To use .at, pass it both a row and column label separated by a comma.

print(df.at['sixth', 'duration'])
32

iat#

Selection with iat is nearly identical to iloc but it only selects a single scalar value. You must pass it an integer for both the row and column locations

print(df.iat[2, 5])
61

Total running time of the script: ( 0 minutes 0.123 seconds)

Gallery generated by Sphinx-Gallery