.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/pandas/speeding_up.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_pandas_speeding_up.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_pandas_speeding_up.py:


======================
5.8 efficient pandas
======================
This file shows how to use pandas efficiently.

.. important::
  This lesson is still under development.

.. GENERATED FROM PYTHON SOURCE LINES 11-19

.. code-block:: default

    import time
    from typing import Union

    import numpy as np
    import pandas as pd

    print(pd.__version__, np.__version__)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    1.5.3 1.26.4


.. GENERATED FROM PYTHON SOURCE LINES 20-21

Define a function which returns the memory used by a dataframe in megabytes
(the byte count divided by ``1024**2``).

.. GENERATED FROM PYTHON SOURCE LINES 21-26

.. code-block:: default


    def memory_usage(dataframe):
        return round(dataframe.memory_usage().sum() / 1024**2, 4)


.. GENERATED FROM PYTHON SOURCE LINES 27-29

don't use csv for large data
-----------------------------

.. GENERATED FROM PYTHON SOURCE LINES 29-40

.. code-block:: default


    def dump_and_load(dataframe:pd.DataFrame):
        st = time.time()
        dataframe.to_csv("File.csv")
        pd.read_csv("File.csv")
        return round(time.time() - st, 3)

    df = pd.DataFrame(np.random.random((100, 10)))

    dump_and_load(df)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0.004


.. GENERATED FROM PYTHON SOURCE LINES 41-45

.. code-block:: default


    df = pd.DataFrame(np.random.random((1000, 20)))
    dump_and_load(df)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0.032


.. GENERATED FROM PYTHON SOURCE LINES 46-50

.. code-block:: default


    df = pd.DataFrame(np.random.random((10_000, 50)))
    dump_and_load(df)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0.779


.. GENERATED FROM PYTHON SOURCE LINES 51-55

.. code-block:: default


    df = pd.DataFrame(np.random.random((100_000, 50)))
    dump_and_load(df)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    7.517


.. GENERATED FROM PYTHON SOURCE LINES 56-68

.. code-block:: default


    def dump_and_load_parquet(dataframe:pd.DataFrame):

        dataframe.columns = dataframe.columns.map(str)  # parquet expects column names to be string

        st = time.time()
        dataframe.to_parquet("File.pq")
        pd.read_parquet("File.pq")
        return round(time.time() - st, 3)

    dump_and_load_parquet(df)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0.598


.. GENERATED FROM PYTHON SOURCE LINES 69-70

Use the categorical type instead of the string type for columns that hold only a few distinct values.
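
The effect of the categorical type can be illustrated with a minimal sketch.
The data below is hypothetical (strings drawn from three made-up labels), and
the exact numbers will vary between pandas versions; ``deep=True`` is used so
that the Python string objects themselves are counted.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Hypothetical data: many strings drawn from only three distinct labels.
    rng = np.random.default_rng(0)
    labels = pd.Series(rng.choice(["low", "medium", "high"], size=100_000))

    as_object = labels.astype(object)
    as_category = labels.astype("category")

    # deep=True counts the string objects, not just the pointers to them.
    object_mb = as_object.memory_usage(deep=True) / 1024**2
    category_mb = as_category.memory_usage(deep=True) / 1024**2
    print(f"object: {object_mb:.2f} Mb, category: {category_mb:.2f} Mb")

A categorical column stores each distinct label once and keeps a small integer
code per row, so the saving grows with the number of repeated values.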

.. GENERATED FROM PYTHON SOURCE LINES 73-75

don't think in terms of rows, but in terms of columns
------------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 75-77

.. code-block:: default

    df = pd.DataFrame(np.random.random((5000, 4)), columns=['a', 'b', 'c', 'd'])
    print(df)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                 a         b         c         d
    0     0.929256  0.725679  0.310663  0.964345
    1     0.721586  0.637415  0.332982  0.689528
    2     0.933878  0.268280  0.083572  0.828697
    3     0.015732  0.994194  0.238057  0.191573
    4     0.509337  0.371048  0.327277  0.294127
    ...        ...       ...       ...       ...
    4995  0.930997  0.454958  0.712097  0.878669
    4996  0.640371  0.206201  0.820313  0.351462
    4997  0.426621  0.693299  0.592340  0.791978
    4998  0.240445  0.387177  0.034030  0.380769
    4999  0.582327  0.699955  0.371898  0.463552

    [5000 rows x 4 columns]


.. GENERATED FROM PYTHON SOURCE LINES 78-81

Iterating over rows is a lot slower than iterating over columns.
This is mainly because pandas is column-oriented: each column holds values of a single dtype and can be handed out as one array.
A row, in contrast, has to be assembled from all columns into a new Series on every ``df.iloc[row_idx]`` call.

.. GENERATED FROM PYTHON SOURCE LINES 81-88

.. code-block:: default


    start = time.time()
    for col in df.columns:
        for val in df[col]:
            pass
    print(time.time() - start)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.002445697784423828


.. GENERATED FROM PYTHON SOURCE LINES 89-96

.. code-block:: default


    start = time.time()
    for row_idx in range(len(df)):
        for val in df.iloc[row_idx]:
            pass
    print(time.time() - start)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.17723488807678223


.. GENERATED FROM PYTHON SOURCE LINES 97-104

.. code-block:: default

    start = time.time()
    for idx in range(len(df)):
        row = df.iloc[idx]
        row.loc['a'] = row.loc['a'] + row.loc['b']
        df.iloc[idx] = row
    print(time.time() - start)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.8116934299468994


.. GENERATED FROM PYTHON SOURCE LINES 105-106

.. code-block:: default

    print(df)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                 a         b         c         d
    0     1.654934  0.725679  0.310663  0.964345
    1     1.359001  0.637415  0.332982  0.689528
    2     1.202158  0.268280  0.083572  0.828697
    3     1.009926  0.994194  0.238057  0.191573
    4     0.880384  0.371048  0.327277  0.294127
    ...        ...       ...       ...       ...
    4995  1.385956  0.454958  0.712097  0.878669
    4996  0.846572  0.206201  0.820313  0.351462
    4997  1.119921  0.693299  0.592340  0.791978
    4998  0.627622  0.387177  0.034030  0.380769
    4999  1.282282  0.699955  0.371898  0.463552

    [5000 rows x 4 columns]


.. GENERATED FROM PYTHON SOURCE LINES 107-112

.. code-block:: default


    start = time.time()
    df['a'] = df['a'] + df['b']
    print(time.time() - start)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.0006124973297119141


.. GENERATED FROM PYTHON SOURCE LINES 113-114

Use vectorized operations instead of iterating or using the ``apply`` method.
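
As a rough illustration, the sketch below computes the same column sum once
with a row-wise ``apply`` and once as a vectorized expression. The frame and
the variable names are made up for this example, and the timings depend on
the machine.

.. code-block:: python

    import time

    import numpy as np
    import pandas as pd

    frame = pd.DataFrame(np.random.random((5000, 2)), columns=["a", "b"])

    # Row-wise apply: the lambda is called once per row in Python.
    start = time.perf_counter()
    via_apply = frame.apply(lambda row: row["a"] + row["b"], axis=1)
    apply_seconds = time.perf_counter() - start

    # Vectorized: one operation on whole columns.
    start = time.perf_counter()
    vectorized = frame["a"] + frame["b"]
    vectorized_seconds = time.perf_counter() - start

    print(f"apply: {apply_seconds:.4f} s, vectorized: {vectorized_seconds:.4f} s")

Both approaches give the same values; only the time spent differs.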

.. GENERATED FROM PYTHON SOURCE LINES 116-117

Use method chaining instead of creating a new dataframe after every operation.
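
A minimal sketch of what chaining looks like, using a made-up frame and
made-up column names. The chained form is not necessarily faster; its main
benefit is that no named intermediate dataframes are kept around.

.. code-block:: python

    import numpy as np
    import pandas as pd

    frame = pd.DataFrame(np.random.random((1000, 3)), columns=["a", "b", "c"])

    # Step-by-step style: every intermediate result gets its own name.
    step1 = frame[frame["a"] > 0.5]
    step2 = step1.assign(total=step1["a"] + step1["b"])
    step3 = step2.sort_values("total", ascending=False)
    stepwise = step3.head(10)

    # Chained style: the same operations written as a single expression.
    chained = (
        frame
        .query("a > 0.5")
        .assign(total=lambda d: d["a"] + d["b"])
        .sort_values("total", ascending=False)
        .head(10)
    )

    print(stepwise.equals(chained))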

.. GENERATED FROM PYTHON SOURCE LINES 119-122

reduce memory consumption
--------------------------
Let's create a dataframe with a column that contains only integers.

.. GENERATED FROM PYTHON SOURCE LINES 122-125

.. code-block:: default

    df = pd.DataFrame(np.random.randint(0, 256, 10000000))

    print(df.dtypes)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0    int64
    dtype: object


.. GENERATED FROM PYTHON SOURCE LINES 126-130

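The default type of the column is ``int64`` (the trailing ``dtype: object`` line
describes the Series returned by ``df.dtypes``, not the column itself). This means
pandas reserves 8 bytes for every value, no matter how small the integers are.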

The memory consumed by the dataframe currently is:

.. GENERATED FROM PYTHON SOURCE LINES 130-132

.. code-block:: default

    print(f"{memory_usage(df)} Mb")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    76.2941 Mb


.. GENERATED FROM PYTHON SOURCE LINES 133-135

However, when we check the minimum and maximum values in our dataframe,
we see that they only range from 0 to 255.

.. GENERATED FROM PYTHON SOURCE LINES 135-138

.. code-block:: default


    print(df[0].min(), df[0].max())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0 255


.. GENERATED FROM PYTHON SOURCE LINES 139-144

This means we can store our data as ``int16``. With ``int64``, we are
assigning far more memory to our data than necessary.

We can verify that the minimum and maximum values in the column lie between the
lower and upper limits of ``np.int16``.

.. GENERATED FROM PYTHON SOURCE LINES 144-146

.. code-block:: default

    print(df[0].min() > np.iinfo(np.int16).min and df[0].max() < np.iinfo(np.int16).max)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    True


.. GENERATED FROM PYTHON SOURCE LINES 147-149

So now let's convert the data type of our column to ``np.int16`` and check the
memory consumption again.

.. GENERATED FROM PYTHON SOURCE LINES 149-154

.. code-block:: default


    df[0] = df[0].astype(np.int16)

    print(f"{memory_usage(df)} Mb")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    19.0736 Mb
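
Since the values here only span 0 to 255, an unsigned 8-bit integer would be
enough as well. The following sketch (with a smaller, made-up series) checks
the range with ``np.iinfo`` and compares the size of the two representations.

.. code-block:: python

    import numpy as np
    import pandas as pd

    values = pd.Series(np.random.randint(0, 256, 1_000_000))

    # np.iinfo reports the representable range of an integer dtype.
    limits = np.iinfo(np.uint8)
    fits = values.min() >= limits.min and values.max() <= limits.max
    print(fits)

    compact = values.astype(np.uint8)
    print(f"int16: {values.astype(np.int16).nbytes / 1024**2:.2f} Mb")
    print(f"uint8: {compact.nbytes / 1024**2:.2f} Mb")

Unsigned types only work when the column can never hold negative values.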


.. GENERATED FROM PYTHON SOURCE LINES 155-157

Converting from ``int64`` to ``int16`` reduced the memory usage by a factor of four.
Now let's do the same with floats.

.. GENERATED FROM PYTHON SOURCE LINES 157-172

.. code-block:: default


    df = pd.DataFrame(np.random.random(10000000))

    print(df.dtypes)

    print(f"Initial memory: {memory_usage(df)} Mb")

    print(f"min: {df[0].min()} max:  {df[0].max()}")

    print(df[0].min() > np.finfo(np.float16).min and df[0].max() < np.finfo(np.float16).max)

    df[0] = df[0].astype(np.float16)

    print(f"Final memory:  {memory_usage(df)} Mb")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0    float64
    dtype: object
    Initial memory: 76.2941 Mb
    min: 2.4135430432004057e-07 max:  0.9999999696427855
    True
    Final memory:  19.0736 Mb
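
One caveat with floats: a range check says nothing about precision.
``float16`` keeps roughly three significant decimal digits, so values that fit
its range can still be rounded noticeably. The sketch below, on made-up data,
measures the rounding error introduced by such a conversion.

.. code-block:: python

    import numpy as np

    # The largest finite float16 value, and two values that get rounded.
    print(np.finfo(np.float16).max)
    print(np.float16(2049.0))
    print(np.float16(0.123456789))

    # Round-trip random numbers through float16 and measure the error.
    values = np.random.random(1_000_000)
    roundtrip = values.astype(np.float16).astype(np.float64)
    max_error = np.abs(roundtrip - values).max()
    print(f"largest absolute rounding error: {max_error:.2e}")

Whether this loss is acceptable depends on how the column is used later.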


.. GENERATED FROM PYTHON SOURCE LINES 173-177

We can write helper functions to convert the column types in a dataframe.
Below, we write functions which check the data in each column of a dataframe
and assign the smallest dtype (and hence the least memory) whose range covers the data in that column.
Note that these checks look only at the range of the values, not at their precision.

.. GENERATED FROM PYTHON SOURCE LINES 177-245

.. code-block:: default


    def int8(array:Union[np.ndarray, pd.Series])->bool:
        return array.min() > np.iinfo(np.int8).min and array.max() < np.iinfo(np.int8).max

    def int16(array:Union[np.ndarray, pd.Series])->bool:
        return array.min() > np.iinfo(np.int16).min and array.max() < np.iinfo(np.int16).max

    def int32(array:Union[np.ndarray, pd.Series])->bool:
        return array.min() > np.iinfo(np.int32).min and array.max() < np.iinfo(np.int32).max

    def int64(array:Union[np.ndarray, pd.Series])->bool:
        return array.min() > np.iinfo(np.int64).min and array.max() < np.iinfo(np.int64).max

    def float16(array:Union[np.ndarray, pd.Series])->bool:
        return array.min() > np.finfo(np.float16).min and array.max() < np.finfo(np.float16).max

    def float32(array:Union[np.ndarray, pd.Series])->bool:
        return array.min() > np.finfo(np.float32).min and array.max() < np.finfo(np.float32).max


    def maybe_convert_int(series:pd.Series)->pd.Series:
        if int8(series):
            return series.astype(np.int8)
        if int16(series):
            return series.astype(np.int16)
        if int32(series):
            return series.astype(np.int32)
        if int64(series):
            return series.astype(np.int64)
        return series


    def maybe_convert_float(series:pd.Series)->pd.Series:

        if float16(series):
            return series.astype(np.float16)
        if float32(series):
            return series.astype(np.float32)

        return series


    def maybe_reduce_memory(dataframe:pd.DataFrame, hints=None)->pd.DataFrame:

        init_memory = memory_usage(dataframe)

        if hints:
            assert len(hints) == len(dataframe.columns)
        else:
            hints = {col:'' for col in dataframe.columns}

        for col in dataframe.columns:
            col_dtype = str(dataframe[col].dtypes)

            if col_dtype in  ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']:

                if 'int' in  hints[col]:
                    dataframe[col] = maybe_convert_int(dataframe[col])
                elif 'float' in hints[col]:
                    dataframe[col] = maybe_convert_float(dataframe[col])
                elif 'int' in col_dtype:
                    dataframe[col] = maybe_convert_int(dataframe[col])
                elif 'float' in col_dtype:
                    dataframe[col] = maybe_convert_float(dataframe[col])

        print(f"memory reduced from {init_memory} to {memory_usage(dataframe)}")
        return dataframe
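
For comparison, pandas has a built-in option for this kind of conversion:
``pd.to_numeric`` accepts a ``downcast`` argument. The sketch below applies it
column by column to a made-up frame. It is an alternative to the helpers
above, not a reimplementation of them; for example, ``downcast="float"`` does
not go below ``float32``.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Hypothetical frame with two integer columns and one float column.
    frame = pd.DataFrame({
        "small": np.random.randint(-126, 126, 100_000, dtype=np.int64),
        "large": np.random.randint(0, 2147483640, 100_000, dtype=np.int64),
        "real": np.random.random(100_000),
    })

    downcast = frame.copy()
    for col in downcast.columns:
        if pd.api.types.is_integer_dtype(downcast[col]):
            downcast[col] = pd.to_numeric(downcast[col], downcast="integer")
        elif pd.api.types.is_float_dtype(downcast[col]):
            downcast[col] = pd.to_numeric(downcast[col], downcast="float")

    print(downcast.dtypes)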


.. GENERATED FROM PYTHON SOURCE LINES 246-247

Now we can test how much memory our ``maybe_reduce_memory`` function saves.

.. GENERATED FROM PYTHON SOURCE LINES 247-255

.. code-block:: default


    df = pd.DataFrame(np.column_stack([
        np.random.randint(-126, 126, 100_000),
        np.random.randint(-31000, 32760, 100_000),
        np.random.randint(0, 2147483640, 100_000),
    ]))

    print(df.shape)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    (100000, 3)


.. GENERATED FROM PYTHON SOURCE LINES 256-257

Print the original dtypes

.. GENERATED FROM PYTHON SOURCE LINES 257-259

.. code-block:: default


    print(df.dtypes)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0    int64
    1    int64
    2    int64
    dtype: object


.. GENERATED FROM PYTHON SOURCE LINES 260-263

.. code-block:: default


    maybe_reduce_memory(df)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    memory reduced from 2.2889 to 0.6677


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>0</th>
          <th>1</th>
          <th>2</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>20</td>
          <td>16499</td>
          <td>2144803705</td>
        </tr>
        <tr>
          <th>1</th>
          <td>-112</td>
          <td>22360</td>
          <td>445275584</td>
        </tr>
        <tr>
          <th>2</th>
          <td>-116</td>
          <td>25044</td>
          <td>1744279481</td>
        </tr>
        <tr>
          <th>3</th>
          <td>-5</td>
          <td>14708</td>
          <td>2023913446</td>
        </tr>
        <tr>
          <th>4</th>
          <td>-90</td>
          <td>15687</td>
          <td>1051366168</td>
        </tr>
        <tr>
          <th>...</th>
          <td>...</td>
          <td>...</td>
          <td>...</td>
        </tr>
        <tr>
          <th>99995</th>
          <td>-96</td>
          <td>-20089</td>
          <td>1040281795</td>
        </tr>
        <tr>
          <th>99996</th>
          <td>80</td>
          <td>5655</td>
          <td>2092774776</td>
        </tr>
        <tr>
          <th>99997</th>
          <td>-48</td>
          <td>-22207</td>
          <td>485169336</td>
        </tr>
        <tr>
          <th>99998</th>
          <td>49</td>
          <td>6304</td>
          <td>1617105304</td>
        </tr>
        <tr>
          <th>99999</th>
          <td>39</td>
          <td>-10877</td>
          <td>1464378367</td>
        </tr>
      </tbody>
    </table>
    <p>100000 rows × 3 columns</p>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 264-265

Print the converted dtypes

.. GENERATED FROM PYTHON SOURCE LINES 265-268

.. code-block:: default


    print(df.dtypes)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0     int8
    1    int16
    2    int32
    dtype: object


.. GENERATED FROM PYTHON SOURCE LINES 269-270

Test with dataframe containing floats

.. GENERATED FROM PYTHON SOURCE LINES 270-280

.. code-block:: default


    df = pd.DataFrame(np.column_stack([
        np.random.randint(-126, 65000, 100_000) * 1.0,
        np.random.randint(-31000, 100_000, 100_000)*1.0,
    ]))

    print(df.dtypes)
    maybe_reduce_memory(df)
    print(df.dtypes)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0    float64
    1    float64
    dtype: object
    memory reduced from 1.526 to 0.5723
    0    float16
    1    float32
    dtype: object


.. GENERATED FROM PYTHON SOURCE LINES 281-294

.. code-block:: default


    df = pd.DataFrame(np.column_stack([
        np.random.randint(-126, 126, 100_000),
        np.random.randint(-31000, 32760, 100_000),
        np.random.randint(0, 2147483640, 100_000),
        np.random.randint(-126, 65000, 100_000) * 1.0,
        np.random.randint(-31000, 100_000, 100_000)*1.0,
    ]))

    print(df.dtypes)
    maybe_reduce_memory(df)
    print(df.dtypes)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0    float64
    1    float64
    2    float64
    3    float64
    4    float64
    dtype: object
    memory reduced from 3.8148 to 1.3353
    0    float16
    1    float16
    2    float32
    3    float16
    4    float32
    dtype: object

Because ``np.column_stack`` promotes all columns to ``float64``, the integer
columns are treated as floats here. Passing ``hints`` tells the function which
columns actually hold integers.


.. GENERATED FROM PYTHON SOURCE LINES 295-309

.. code-block:: default


    df = pd.DataFrame(np.column_stack([
        np.random.randint(-126, 126, 100_000),
        np.random.randint(-31000, 32760, 100_000),
        np.random.randint(0, 2147483640, 100_000),
        np.random.randint(-126, 65000, 100_000) * 1.0,
        np.random.randint(-31000, 100_000, 100_000)*1.0,
    ]))

    print(df.dtypes)
    maybe_reduce_memory(df, hints={0: "int", 1: "int", 2: "int",
                                   3: "float", 4: "float"})
    print(df.dtypes)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0    float64
    1    float64
    2    float64
    3    float64
    4    float64
    dtype: object
    memory reduced from 3.8148 to 1.2399
    0       int8
    1      int16
    2      int32
    3    float16
    4    float32
    dtype: object


.. GENERATED FROM PYTHON SOURCE LINES 310-312

For smaller dataframes, the difference may not seem like much, but
when we scale things up, the difference becomes significant.

.. GENERATED FROM PYTHON SOURCE LINES 312-326

.. code-block:: default


    df = pd.DataFrame(np.column_stack([
        np.random.randint(-126, 126, 1000_000),
        np.random.randint(-31000, 32760, 1000_000),
        np.random.randint(0, 2147483640, 1000_000),
        np.random.randint(-126, 65000, 1000_000) * 1.0,
        np.random.randint(-31000, 100_000, 1000_000)*1.0,
    ]))

    print(df.dtypes)
    maybe_reduce_memory(df, hints={0: "int", 1: "int", 2: "int",
                                   3: "float", 4: "float"})
    print(df.dtypes)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0    float64
    1    float64
    2    float64
    3    float64
    4    float64
    dtype: object
    memory reduced from 38.1471 to 12.3979
    0       int8
    1      int16
    2      int32
    3    float16
    4    float32
    dtype: object


.. GENERATED FROM PYTHON SOURCE LINES 327-328



.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  10.524 seconds)


.. _sphx_glr_download_auto_examples_pandas_speeding_up.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/AtrCheema/python-seekho/master?urlpath=lab/tree/notebooks/auto_examples/pandas/speeding_up.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: speeding_up.py <speeding_up.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: speeding_up.ipynb <speeding_up.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_