This is the second chapter of the series "Complete Pandas Tutorial from Start to End". If you haven't seen the introductory post, I encourage you to work through it hands-on first: Chapter 1 Link.
Contents of Pandas Tutorial Chapter 2:
- Dataframe operations for getting a high-level understanding of the data.
- Different ways to select particular columns and rows, and to filter the data.
1. Dataframe Basic Methods.
As you know, in the previous chapter we learned about the data frame and series data structures and how to import a dataset. After importing your data, the most important next step is to understand it at a high level. Whether you are a Kaggle competition winner or working at a top-notch MNC, every data scientist starts by getting this overview. Pandas provides a collection of data frame methods and attributes that give a high-level overview of the data.
Let's go back to our previous example: importing the iris dataset into a data frame.
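As a stand-in for the import step, here is a minimal sketch that builds a small data frame by hand. The five rows and the column names (sepal_length, sepal_width, petal_length, petal_width, species) are assumptions made for illustration; in practice you would load the full iris file the way Chapter 1 does.

```python
import pandas as pd

# Small iris-style sample for illustration only.
# Column names are assumed; your file may label them differently.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0, 6.3, 5.8],
        "sepal_width": [3.5, 3.0, 3.2, 3.3, 2.7],
        "petal_length": [1.4, 1.4, 4.7, 6.0, 5.1],
        "petal_width": [0.2, 0.2, 1.4, 2.5, 1.9],
        "species": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
    }
)

# Peek at the first rows to confirm the data looks as expected.
print(df.head())
```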
To retrieve statistical information about a data frame, we have the describe() method.
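A minimal sketch of the call, assuming a small iris-style sample with made-up column names rather than the full dataset:

```python
import pandas as pd

# Small iris-style sample; column names are assumed for illustration.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0, 6.3, 5.8],
        "petal_length": [1.4, 1.4, 4.7, 6.0, 5.1],
        "species": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
    }
)

# By default describe() summarises only the numeric columns,
# so the text column "species" is left out of the result.
summary = df.describe()
print(summary)
```

The result is itself a data frame, so a single statistic can be read with, for example, summary.loc["mean", "sepal_length"].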
You can clearly see the high-level statistical overview of the data, i.e. count, mean, std, min, the 25%, 50% and 75% quartiles, and max of sepal length, sepal width, petal length, and petal width.
Apart from this, some other basic attributes of a pandas data frame are:
The shape attribute returns a tuple with the dimensions of the data frame, as (rows, columns).
The ndim attribute returns the number of dimensions of the underlying data (2 for a data frame, 1 for a series).
The size attribute returns the total number of elements in the underlying data (rows times columns for a data frame).
The dtypes attribute returns the data type of each column.
The values attribute returns the underlying data as a NumPy ndarray.
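The attributes listed above can be tried out as follows. This is a small sketch on an assumed three-row sample, not the full iris data; note that none of them takes parentheses, since they are attributes rather than methods.

```python
import pandas as pd

# Tiny iris-style sample; column names are assumed for illustration.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0],
        "species": ["setosa", "setosa", "versicolor"],
    }
)

print(df.shape)   # (rows, columns) as a tuple
print(df.ndim)    # number of dimensions
print(df.size)    # total number of elements
print(df.dtypes)  # data type of every column
print(df.values)  # the data as a NumPy ndarray

# A single column is a series, and its values are an ndarray too.
column_values = df["sepal_length"].values
print(type(column_values))
```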
2. Different ways to select particular Columns and Rows.
To select particular columns in a data frame, use any of the following syntax.
    # 1. Create a list of columns
    column_list = ["First", "Second"]
    # 2. Select columns in the dataframe by passing the column list to df
    df[column_list]
    # 3. Another way, using loc
    df.loc[:, column_list]
    # 4. Another way, using iloc (this way accepts the index of columns)
    df.iloc[:, 2:5]
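Note: iloc takes [rows, columns] as integer positions, i.e. [START_ROW_INDEX:END_ROW_INDEX+1, START_COLUMN_INDEX:END_COLUMN_INDEX+1], because the end of a position slice is excluded. loc takes [rows, columns] as labels: we specify the rows first and, after the comma, pass the list of selected column names. Unlike iloc, a label slice in loc includes its end label.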
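To make the difference concrete, here is a short sketch on an assumed iris-style sample (the column names are illustrative). It shows the three column-selection styles returning the same result, and how slice ends behave in iloc versus loc.

```python
import pandas as pd

# Small iris-style sample; column names are assumed for illustration.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0, 6.3, 5.8],
        "sepal_width": [3.5, 3.0, 3.2, 3.3, 2.7],
        "petal_length": [1.4, 1.4, 4.7, 6.0, 5.1],
        "petal_width": [0.2, 0.2, 1.4, 2.5, 1.9],
        "species": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
    }
)

column_list = ["sepal_length", "petal_length"]

by_brackets = df[column_list]         # plain bracket selection
by_loc = df.loc[:, column_list]       # all rows, columns by label
by_iloc = df.iloc[:, [0, 2]]          # all rows, columns by position

# Position slices exclude the end: rows 0-1, columns at positions 2, 3, 4.
position_slice = df.iloc[0:2, 2:5]

# Label slices include the end: rows 0, 1, 2 and three columns.
label_slice = df.loc[0:2, "sepal_length":"petal_length"]

print(by_brackets)
print(position_slice)
print(label_slice)
```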
To select particular rows in a data frame, use any of the following syntax.
    # 1. Using numerical indexes - iloc
    # 2. Using labels as index - loc (the example below works when the default index is used)
    row_index_to_select = [0, 1, 2, 3]
    df.loc[row_index_to_select]
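The two row-selection styles can be compared side by side. The sketch below uses an assumed iris-style sample, and the string labels "a" to "e" are hypothetical, added only to show what changes once the index is no longer the default one.

```python
import pandas as pd

# Small iris-style sample; column names are assumed for illustration.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0, 6.3, 5.8],
        "species": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
    }
)

row_index_to_select = [0, 1, 2, 3]

# With the default index (0, 1, 2, ...) positions and labels coincide,
# so iloc and loc return the same rows.
by_position = df.iloc[row_index_to_select]
by_label = df.loc[row_index_to_select]

# With a custom index (hypothetical labels), loc needs the labels
# while iloc keeps using positions.
relabeled = df.copy()
relabeled.index = ["a", "b", "c", "d", "e"]
first_two_by_label = relabeled.loc[["a", "b"]]
first_two_by_position = relabeled.iloc[[0, 1]]

print(by_label)
print(first_two_by_label)
```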
loc is used when the index holds labels. Because rows are looked up directly through the index, searching becomes really fast.
Please follow the example below for better clarity.
For loc, let's make sepal length the index, and then search for the rows which have a sepal length of 5.0.
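A minimal sketch of that lookup, assuming an iris-style sample in which the column is named sepal_length (the name and the five rows are assumptions for illustration):

```python
import pandas as pd

# Small iris-style sample; column names are assumed for illustration.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 5.0, 7.0, 5.0, 5.8],
        "petal_length": [1.4, 1.4, 4.7, 1.6, 5.1],
        "species": ["setosa", "setosa", "versicolor", "setosa", "virginica"],
    }
)

# Move the sepal_length column into the index.
indexed = df.set_index("sepal_length")

# Look up every row whose index label is 5.0.
# Two rows share this label here, so the result is a data frame.
rows = indexed.loc[5.0]
print(rows)
```

If only one row carried the label 5.0, indexed.loc[5.0] would return a series instead; passing a list, as in indexed.loc[[5.0]], always gives a data frame.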
In real-world examples, a case can arise where you have to filter records on a column that is not in the index. In that case you can filter with conditions on the column values.
As an example, say we need to select the records in the iris dataset whose petal length is greater than 5 and whose sepal length is greater than 6.
NOTE: Please wrap each condition in brackets. The & operator binds more tightly than comparisons such as >, so without the brackets you will encounter an error in the expression.
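The filter described above could look like the following sketch. The sample rows and the column names petal_length and sepal_length are assumptions for illustration.

```python
import pandas as pd

# Small iris-style sample; column names are assumed for illustration.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0, 6.3, 5.8],
        "petal_length": [1.4, 1.4, 4.7, 6.0, 5.1],
        "species": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
    }
)

# Each comparison yields a boolean series; & combines them row by row.
# The brackets around each comparison are required.
mask = (df["petal_length"] > 5) & (df["sepal_length"] > 6)

# Keep only the rows where both conditions hold.
filtered = df[mask]
print(filtered)
```

The same pattern works with | for "or" and ~ for "not", again with each condition in its own brackets.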
Congrats! You have finished the second chapter. Now, you can read the final chapter.
If you have any queries or suggestions, please leave a comment below. I will help you as soon as possible.