Introduction

Data processing is very important part of analyzing data. It is because the data available are not always in desire format. Before analyzing data processing such as cleaning, merging, etc. are required. NumPy, SciPy and Pandas are some of the tools available in python for data processing. Working with pandas are fast and easy compared to other tools. Pandas integrates with matplotlib library which makes very good tool for analyzing data.

 

Data structures

Pandas provide two useful data structures i.e Series and DataFrame.

Series

Series is one dimensional array that can store various data types. The row label are called index. Tuple, list and dictionary can be converted into series.

Converting tuple into series
import pandas as pd
tup = ("Ram", 123, "Kathmandu", 50000)
df = pd.Series(tup)

#type of df
print("Type of df is: ", type(df))

#printing the series
print(df

Output

Type of df is:  <class 'pandas.core.series.Series'>
0          Ram
1          123
2    Kathmandu
3        50000
dtype: object

Here, the index is set from 0 to 3. We can provide custom index to this df as

import pandas as pd
tup = ("Ram", 123, "Kathmandu", 50000)
df = pd.Series(tup, index = ["Name", "Id", "Address", "Salary"])

#printing the series
print(df)

Output

Name             Ram
Id               123
Address    Kathmandu
Salary         50000
dtype: object

 

Converting list into series
import pandas as pd
lst = ["Ram", 123, "Kathmandu", 50000]
df = pd.Series(lst)
print(df)

Output

0          Ram
1          123
2    Kathmandu
3        50000
dtype: object

Same as tuple,  conversion of list to series, the index is set from 0 to 3. We can also set custom index to df as

import pandas as pd
lst = ["Ram", 123, "Kathmandu", 50000]
df = pd.Series(lst, index = ["Name", "Id", "Address", "Salary"])
print(df)

Output

Name             Ram
Id               123
Address    Kathmandu
Salary         50000
dtype: object

 

Converting dictionary to series
import pandas as pd
dict = {"Name" : "Ram", "Id": 123,"Address":"Kathmandu", "Salary": 50000}
df = pd.Series(dict)
print(df)

Output

Name             Ram
Id               123
Address    Kathmandu
Salary         50000
dtype: object

Elements of series can be accessed using index name as

print(df[["Name", "Address"]])

Output

Name             Ram
Address    Kathmandu
dtype: object

 

DataFrame

DataFrame is most widely used data structures of pandas. Series is used to work with one dimensional array only but DataFrame works with two dimensional array. DataFrame has two indices i.e row index and column index. Text files and spreadsheet are read using DataFrame so it is mostly used data structure. Most common method of creating DataFrame is using dictionary of equal length list.

import pandas as pd
dict = {
			"Name" : ["Ram", "Shyam", "Gita"],
			"Id" : [1,2,3],
			"Address": ["KTM", "PKR", "BHK"],
			"Salary": [1000, 2000, 3000],
			}
			
df = pd.DataFrame(dict)
print(df)

Output

    Name  Id Address  Salary
0    Ram   1     KTM    1000
1  Shyam   2     PKR    2000
2   Gita   3     BHK    3000

We can see that the column index are set to Name, Id, Address and Salary. But the row index are set to 0 to 2 and we can give custom name to row index as

df.index = ["One", "Two", "Three"]
print(df)

Output

        Name  Id Address  Salary
One      Ram   1     KTM    1000
Two    Shyam   2     PKR    2000
Three   Gita   3     BHK    3000

It is possible to set any value of DataFrame to row index using set_index() function

df = df.set_index(["Name"])

Output

Name    Id Address  Salary
Ram     1     KTM    1000
Shyam   2     PKR    2000
Gita    3     BHK    3000

We can add value to previous dataframe as

import pandas as pd
dict = {
			"Name" : ["Ram", "Shyam", "Gita"],
			"Id" : [1,2,3],
			"Address": ["KTM", "PKR", "BHK"],
			"Salary": [1000, 2000, 3000],
			}
			
df = pd.DataFrame(dict)

print("Original dataframe: \n")
print(df)
print("\n")

#adding value to dataframe
df["Post"] = ["Manager", "CEO", "worker"]

print("after addition of value to dataframe: \n")
print(df)

Output

Original dataframe:

    Name  Id Address  Salary
0    Ram   1     KTM    1000
1  Shyam   2     PKR    2000
2   Gita   3     BHK    3000


after addition of value to dataframe:

    Name  Id Address  Salary     Post
0    Ram   1     KTM    1000  Manager
1  Shyam   2     PKR    2000      CEO
2   Gita   3     BHK    3000   worker

We can also add common value to DataFrame

#adding common value to dataframe
df["Company"] = "CodeFires"

print("after addition of common value to dataframe: \n")
print(df)

Output

Original dataframe:

    Name  Id Address  Salary
0    Ram   1     KTM    1000
1  Shyam   2     PKR    2000
2   Gita   3     BHK    3000


after addition of common value to dataframe:

    Name  Id Address  Salary     Post    Company
0    Ram   1     KTM    1000  Manager  CodeFires
1  Shyam   2     PKR    2000      CEO  CodeFires
2   Gita   3     BHK    3000   worker  CodeFires

Same as Series we can set any value of DataFrame to row index

print("Before adding row index: \n")
print(df)

#adding value "Name" to row index
df = df.set_index("Name")

print("After setting row index: \n")
print(df)

Output

Before adding row index:

    Name  Id Address  Salary     Post    Company
0    Ram   1     KTM    1000  Manager  CodeFires
1  Shyam   2     PKR    2000      CEO  CodeFires
2   Gita   3     BHK    3000   worker  CodeFires
After setting row index:

       Id Address  Salary     Post    Company
Name
Ram     1     KTM    1000  Manager  CodeFires
Shyam   2     PKR    2000      CEO  CodeFires
Gita    3     BHK    3000   worker  CodeFires

We can access data using row index and column index

#using column index
print("Data accessing using column index: \n")
print(df["Salary"])
print("\n")

#using row index
print("Data accessing using row index: \n")
print(df.loc["Gita", : ])

Output

Data accessing using column index:

Name
Ram      1000
Shyam    2000
Gita     3000
Name: Salary, dtype: int64


Data accessing using row index:

Id                 3
Address          BHK
Salary          3000
Post          worker
Company    CodeFires
Name: Gita, dtype: object

At first we use column index for accessing data of DataFrame. We've used "Salary" as column index and it printed out the salary of all persons. For accessing row index we used loc attribute of DataFrame. We use "Gita" as row index which printed out all the information about that specific person.

 

We can access specific data of DataFrame. Let's say we want to access salary of person Gita, so we can do this way

#accessing specific element
print(df.loc["Gita", "Salary"])

Output

3000

 

Deleting data of DataFrame
Using del
#deleting data of dataframe
del df["Company"]
print("New DataFrame after delete: \n")
print(df)

Output

New DataFrame after delete:

       Id Address  Salary     Post
Name
Ram     1     KTM    1000  Manager
Shyam   2     PKR    2000      CEO
Gita    3     BHK    3000   worker

 

Using drop()
#deleting data of dataframe
print("New DataFrame after delete: \n")
print(df.drop("Salary", axis = 1))

Output

New DataFrame after delete:

       Id Address     Post    Company
Name
Ram     1     KTM  Manager  CodeFires
Shyam   2     PKR      CEO  CodeFires
Gita    3     BHK   worker  CodeFires

In first case we used del to delete the Company column and in latter case we used drop function to delete Salary column. Use use axis = 1 to specify column index. If we want to drop row wise we can use index = 0.

 

Reading Files

We can read various spreadsheet using pandas. For demonstration  we will access csv files . Also make sure that the csv file is located in current working directory or pass the full path of that csv file 

import pandas as pd
df = pd.read_csv("order.csv")
print(df.head())

 Output

   id  groupid  bookid  quantity
0   1        1       6         1
1   2        1       1         4
2   3        2      10         1
3   4        2       7        20
4   5        2       8         1

head() function prints first five data from csv files. 

print(pd.tail())

Output

   id  groupid  bookid  quantity
3   4        2       7        20
4   5        2       8         1
5   6        3       1         4
6   7        3       6         5
7   8        3       7         2

tail() function prints out last five data from csv file.

Actually, we can pass the numbers in head() and tail() function to tell them what number of data we want to access. Let's print first three and last three data from csv file. Here's how

print("First 2 data: \n")
#to print first three data from file
print(df.head(3))
print("\n")

print("Last 2 data: \n")
#to print last three data from file
print(df.tail(3))

Output

First 2 data:

   id  groupid  bookid  quantity
0   1        1       6         1
1   2        1       1         4
2   3        2      10         1


Last 2 data:

   id  groupid  bookid  quantity
5   6        3       1         4
6   7        3       6         5
7   8        3       7         2

We can get all the name of columns in datasets

#to print all the columns in dataset
print(df.columns.values)

Output

['id' 'groupid' 'bookid' 'quantity']

These are the columns present in this csv file.

We also can print all the elements present in specific columns as

#to print all values in specific column
print(df["quantity"])

Output

0     1
1     4
2     1
3    20
4     1
5     4
6     5
7     2
Name: quantity, dtype: int64

 

Let's say we want to know how many books have same bookid. We can do this using value_counts() funcation

#to check no of books having same bookid
print(df["bookid"].value_counts())

Output

7     2
6     2
1     2
10    1
8     1
Name: bookid, dtype: int64

Observing this data, we can say that two books have bookid 1, 6, 7 

Sometimes data could be missing. We can check is there any value correspond to null or not using isnull() function

print(df["quantity"].isnull())

Output

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
Name: quantity, dtype: bool

Here there is no missing values in our dataset. If there is any missing values we can replace them with mean of current data.

 

Indexing or slicing Dataframe

For slicing Dataframe we use iloc and loc method. Usually, we pass number(index) of row and column to iloc method and for loc we use name of columns

Using iloc
Syntax
iloc[row_selection, column_selection]
print(df.iloc[:4, 0:3])

Output

id  groupid  bookid
0   1        1       6
1   2        1       1
2   3        2      10
3   4        2       7

Here we got the output having 4 rows and three columns.

Using loc
Syntax
loc[row_selection, ["name of columns"]]
 
print(df.loc[:, ["bookid"]])

Output

bookid
0       6
1       1
2      10
3       7
4       8
5       1
6       6
7       7
We also can access multiple columns as
print(df.loc[:, ["groupid","bookid", "quantity"]])

Output

       groupid  bookid  quantity
0        1       6         1
1        1       1         4
2        2      10         1
3        2       7        20
4        2       8         1
5        3       1         4
6        3       6         5
7        3       7         2
Filtering

Let's say we want to filter the original dataframe such a way that the quantity of book ordered are greater than 5. So here's how it can be done

df2 = df[df["quantity"]>4]
print(df2)

Output

    id  groupid  bookid  quantity
3   4        2       7        20
6   7        3       6         5

If we take a look at this new dataframe we see that the row index is unordered that is because while filtering those index were taken out to new dataframe along with other data. We can reset those index using reset_index() function

index  id  groupid  bookid  quantity
0      3   4        2       7        20
1      6   7        3       6         5

 

We've reset the index. Have we managed to reset index actually? Let's print newly formed dataframe

#to reset index 
print(df2.reset_index())
print("\n")
#lets print df2
print(df2)

Output

index  id  groupid  bookid  quantity
0      3   4        2       7        20
1      6   7        3       6         5


   id  groupid  bookid  quantity
3   4        2       7        20
6   7        3       6         5

While applying reset_index() function those row index got reset but it didn't really affected the dataframe. So what if we want to actually make change in dataframe

#to reset index 
df2.reset_index(inplace = True)

#lets print df2
print(df2)

Output

index  id  groupid  bookid  quantity
0      3   4        2       7        20
1      6   7        3       6         5

So by using inplace = True we can make change in original (modified) dataframe

 

Conclusion

Pandas is powerful python module for processing of data. It is used to read, filter, analyze the spreadsheets. Raw data that are available may or may not be suitable for retrieve the conclusion from it. So they need to be filtered, fill the null values, sorting, grouping and so on.

Happy Learning :-)