5.4 reading/writing

This file describes how to read data from files and write data into files using pandas.

Important

This lesson is still under development.

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/AtrCheema/AI4Water/master/ai4water/datasets/arg_busan.csv")

type(df)

df

	index	tide_cm	wat_temp_c	sal_psu	air_temp_c	pcp_mm	pcp3_mm	pcp6_mm	pcp12_mm	wind_dir_deg	wind_speed_mps	air_p_hpa	mslp_hpa	rel_hum	ecoli	16s	inti1	Total_args	tetx_coppml	sul1_coppml	blaTEM_coppml	aac_coppml	Total_otus	otu_5575	otu_273	otu_94
0	6/19/2018 0:00	36.407149	19.321232	33.956058	19.780000	0.0	0.0	0.0	0.0	159.533333	0.960000	1002.856667	1007.256667	95.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	6/19/2018 0:30	35.562515	19.320124	33.950508	19.093333	0.0	0.0	0.0	0.0	86.596667	0.163333	1002.300000	1006.700000	95.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	6/19/2018 1:00	34.808016	19.319666	33.942532	18.733333	0.0	0.0	0.0	0.0	2.260000	0.080000	1001.973333	1006.373333	95.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	6/19/2018 1:30	30.645216	19.320406	33.931263	18.760000	0.0	0.0	0.0	0.0	62.710000	0.193333	1001.776667	1006.120000	95.006667	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	6/19/2018 2:00	26.608980	19.326729	33.917961	18.633333	0.0	0.0	0.0	0.0	63.446667	0.510000	1001.743333	1006.103333	95.006667	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1441	9/7/2019 22:00	-3.989912	20.990612	33.776449	23.700000	0.0	0.0	0.0	0.5	203.760000	6.506667	1003.446667	1007.746667	88.170000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1442	9/7/2019 22:30	-2.807042	21.012014	33.702310	23.620000	0.0	0.0	0.0	0.0	205.353333	5.633333	1003.520000	1007.820000	88.256667	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1443	9/7/2019 23:00	-3.471326	20.831739	33.726177	23.666667	0.0	0.0	0.0	0.0	202.540000	4.480000	1003.610000	1007.910000	87.833333	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1444	9/7/2019 23:30	0.707771	21.006086	33.716274	23.633333	0.0	0.0	0.0	0.0	207.206667	4.946667	1003.633333	1007.933333	88.370000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1445	9/8/2019 0:00	1.011731	20.896149	33.729773	23.600000	0.0	0.0	0.0	0.0	210.200000	4.400000	1003.700000	1008.000000	87.700000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

1446 rows × 26 columns

df.to_csv("arg_busan.csv")

The index of df is 0, 1, 2, … By default, the to_csv function writes this index to the CSV file as an extra first column.

print(df.index)

RangeIndex(start=0, stop=1446, step=1)

print(df.index.name)

None
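To see what this default means in practice, here is a minimal sketch that writes a small dataframe to an in-memory buffer instead of a file. The dataframe `small` and its values are hypothetical stand-ins, not the Busan data.

```python
import io

import pandas as pd

# Hypothetical stand-in frame with the default RangeIndex
small = pd.DataFrame({"tide_cm": [36.4, 35.6], "sal_psu": [33.9, 33.8]})

buffer = io.StringIO()
small.to_csv(buffer)  # the index is written as an unnamed first column
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back without index_col turns the old index
# into an ordinary column that pandas names "Unnamed: 0"
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.columns.tolist())
```

Because the index has no name, the header of the first column is empty, which is why pandas invents a name for it when the file is read back.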

We can avoid writing the index to the CSV file by setting index=False.

df.to_csv("arg_busan.csv", index=False)
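As a quick check, the sketch below writes a small hypothetical dataframe with index=False and reads it back. It uses an in-memory buffer rather than a file, and the column values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in frame
small = pd.DataFrame({"tide_cm": [36.4, 35.6], "pcp_mm": [0.0, 0.5]})

buffer = io.StringIO()
small.to_csv(buffer, index=False)  # only the data columns are written

restored = pd.read_csv(io.StringIO(buffer.getvalue()))
print(restored.columns.tolist())

# Raises an AssertionError if the round trip changed the data
pd.testing.assert_frame_equal(small, restored)
```

With a default RangeIndex nothing is lost by dropping the index, since read_csv creates the same index again.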

We can also explicitly tell pandas which label to use for the index column when writing it to the CSV file.

df.to_csv("arg_busan.csv", index_label="index")
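A named index column is easier to restore later. The following sketch assumes a hypothetical label `row_id` and a small made-up dataframe, and uses the index_col argument of read_csv to turn that column back into the index.

```python
import io

import pandas as pd

# Hypothetical stand-in frame
small = pd.DataFrame({"tide_cm": [36.4, 35.6]})

buffer = io.StringIO()
small.to_csv(buffer, index_label="row_id")  # "row_id" is an illustrative label
header = buffer.getvalue().splitlines()[0]
print(header)

# index_col selects the labelled column as the index when reading
restored = pd.read_csv(io.StringIO(buffer.getvalue()), index_col="row_id")
print(restored.index.name)
```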

If we want to save a dataframe to an Excel file, we can do it as follows.

df.to_excel("arg_busan.xlsx")  # we must have ``openpyxl`` package for that

We can define the sheet name and exclude the index as follows.

df.to_excel("arg_busan.xlsx", index=False, sheet_name="data")

To read the Excel file as a dataframe, we can use the read_excel function.

df = pd.read_excel("arg_busan.xlsx")
print(df)

               index    tide_cm  wat_temp_c  ...  otu_5575  otu_273  otu_94
0     6/19/2018 0:00  36.407149   19.321232  ...       NaN      NaN     NaN
1     6/19/2018 0:30  35.562515   19.320124  ...       NaN      NaN     NaN
2     6/19/2018 1:00  34.808016   19.319666  ...       NaN      NaN     NaN
3     6/19/2018 1:30  30.645216   19.320406  ...       NaN      NaN     NaN
4     6/19/2018 2:00  26.608980   19.326729  ...       NaN      NaN     NaN
...              ...        ...         ...  ...       ...      ...     ...
1441  9/7/2019 22:00  -3.989912   20.990612  ...       NaN      NaN     NaN
1442  9/7/2019 22:30  -2.807042   21.012014  ...       NaN      NaN     NaN
1443  9/7/2019 23:00  -3.471326   20.831739  ...       NaN      NaN     NaN
1444  9/7/2019 23:30   0.707771   21.006086  ...       NaN      NaN     NaN
1445   9/8/2019 0:00   1.011731   20.896149  ...       NaN      NaN     NaN

[1446 rows x 26 columns]
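When a workbook has a named sheet, read_excel can be pointed at it with the sheet_name argument. The sketch below is a minimal round trip under a few assumptions: the dataframe `small` and the file name `example.xlsx` are hypothetical, the file lives in a temporary directory, and the Excel step is skipped when the ``openpyxl`` package is not installed.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Hypothetical stand-in frame
small = pd.DataFrame({"tide_cm": [36.4, 35.6]})

if importlib.util.find_spec("openpyxl") is None:
    # Writing and reading .xlsx files needs openpyxl
    restored = None
    print("openpyxl is not installed; skipping the Excel round trip")
else:
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "example.xlsx")
        small.to_excel(path, index=False, sheet_name="data")
        # Read the sheet called "data" rather than relying on sheet order
        restored = pd.read_excel(path, sheet_name="data")
    print(restored)
```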

We can tell pandas which column should be used as the index of the dataframe.

df = pd.read_excel("arg_busan.xlsx", index_col="index")
print(df)

                  tide_cm  wat_temp_c    sal_psu  ...  otu_5575  otu_273  otu_94
index                                             ...
6/19/2018 0:00  36.407149   19.321232  33.956058  ...       NaN      NaN     NaN
6/19/2018 0:30  35.562515   19.320124  33.950508  ...       NaN      NaN     NaN
6/19/2018 1:00  34.808016   19.319666  33.942532  ...       NaN      NaN     NaN
6/19/2018 1:30  30.645216   19.320406  33.931263  ...       NaN      NaN     NaN
6/19/2018 2:00  26.608980   19.326729  33.917961  ...       NaN      NaN     NaN
...                   ...         ...        ...  ...       ...      ...     ...
9/7/2019 22:00  -3.989912   20.990612  33.776449  ...       NaN      NaN     NaN
9/7/2019 22:30  -2.807042   21.012014  33.702310  ...       NaN      NaN     NaN
9/7/2019 23:00  -3.471326   20.831739  33.726177  ...       NaN      NaN     NaN
9/7/2019 23:30   0.707771   21.006086  33.716274  ...       NaN      NaN     NaN
9/8/2019 0:00    1.011731   20.896149  33.729773  ...       NaN      NaN     NaN

[1446 rows x 25 columns]

print(type(df.index))

<class 'pandas.core.indexes.base.Index'>

Although the index of the dataframe holds dates and times, pandas does not recognize it as date and time; it treats the values as plain strings. Setting parse_dates=True tells pandas to parse the index as dates.

df = pd.read_excel("arg_busan.xlsx", index_col="index", parse_dates=True)

print(df)

                       tide_cm  wat_temp_c  ...  otu_273  otu_94
index                                       ...
2018-06-19 00:00:00  36.407149   19.321232  ...      NaN     NaN
2018-06-19 00:30:00  35.562515   19.320124  ...      NaN     NaN
2018-06-19 01:00:00  34.808016   19.319666  ...      NaN     NaN
2018-06-19 01:30:00  30.645216   19.320406  ...      NaN     NaN
2018-06-19 02:00:00  26.608980   19.326729  ...      NaN     NaN
...                        ...         ...  ...      ...     ...
2019-09-07 22:00:00  -3.989912   20.990612  ...      NaN     NaN
2019-09-07 22:30:00  -2.807042   21.012014  ...      NaN     NaN
2019-09-07 23:00:00  -3.471326   20.831739  ...      NaN     NaN
2019-09-07 23:30:00   0.707771   21.006086  ...      NaN     NaN
2019-09-08 00:00:00   1.011731   20.896149  ...      NaN     NaN

[1446 rows x 25 columns]

Now the index of the dataframe is read as a DatetimeIndex.

print(type(df.index))

<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
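The difference between a string index and a DatetimeIndex can be sketched with a few rows copied from the output above. This example is only an illustration: it reads from an in-memory CSV text instead of the Excel file, and it converts the index with pd.to_datetime and an explicit format string as an alternative to parse_dates=True.

```python
import io

import pandas as pd

# Three rows taken from the printed dataframe above
csv_text = """index,tide_cm
6/19/2018 0:00,36.407149
6/19/2018 0:30,35.562515
9/8/2019 0:00,1.011731
"""

# Without date parsing the index is a generic Index of strings
raw = pd.read_csv(io.StringIO(csv_text), index_col="index")
print(type(raw.index).__name__)

# Convert the strings to timestamps (month/day/year hour:minute)
parsed = raw.copy()
parsed.index = pd.to_datetime(raw.index, format="%m/%d/%Y %H:%M")
print(type(parsed.index).__name__)

# A DatetimeIndex supports selection by date and date attributes
june_19 = parsed.loc["2018-06-19"]
print(june_19)
print(parsed.index.year.unique().tolist())
```

Selecting all rows of one day with a date string, or grouping by year, only works once the index holds real timestamps.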
