
Pandas DataFrame

The DataFrame is a data structure provided by the Pandas Library. The DataFrame is a 2-dimensional labeled array and can be compared to tabular data having rows and columns. The data inside a DataFrame can be heterogeneous (different data types).

Each column of the DataFrame normally represents an attribute and when individually treated, is known as a Series. Each column can have a distinct data type and can be accessed by a label. Each row value can be accessed by either an integer index or a label.

The size of a DataFrame is not fixed, and the values inside a DataFrame can be changed (it is mutable). The DataFrame() function of Pandas is used to create DataFrame objects.


The DataFrame Function

The DataFrame() function of the Pandas library creates a 2-dimensional, size-mutable, heterogeneous tabular data structure. The DataFrame() function returns a DataFrame object which contains axis labels (labels for both rows and columns).

Syntax:

DataFrame(data, index, columns)

The data parameter is the input data of the DataFrame. The data can be a dictionary, NumPy array, or a DataFrame. If data is a dictionary, it can contain Series, arrays, constants, dataclass instances, or list-like objects. The index and columns parameters are the index labels and column labels for the resulting DataFrame, respectively.

If no index and columns are specified, the defaults will be integers in the range 0 to n-1. If data already contains column labels and column labels are explicitly given as well, this leads to column selection instead (i.e. only the column labels which exist in data and are also mentioned in the columns parameter will be shown).

DataFrame Creation

The DataFrame can be created in several ways as follows:

DataFrame object from list of lists

Creating a DataFrame object from a list of lists containing strings, integers, and float values.

import pandas as pd

listoflist = [['tom', 10, 5], ['sam', 12, 5.3], ['david', 15, 5.8]]
list_dataframe = pd.DataFrame(listoflist, index=['person 1', 'person 2', 'person 3'],
                              columns = ['Name', 'Age', 'Height'])
print(list_dataframe)

#Output:
           Name  Age  Height
person 1    tom   10     5.0
person 2    sam   12     5.3
person 3  david   15     5.8

DataFrame object from dictionary of lists

The keys of the dictionary will be taken as column labels for the DataFrame.

dictoflists = {'column 1': ['abc', 'def'], 'column 2': ['123', '456'],
                'column 3': [23, 32], 'column 4': [5.6, 8.9]}
dictlist_dataframe = pd.DataFrame(dictoflists, columns = ['column 1', 'column 3'])
print(dictlist_dataframe)

#Output:
  column 1  column 3
0      abc        23
1      def        32

The output shows that ‘column 1’ and ‘column 3’ are selected because they appear both in the dictionary keys and in the columns parameter of the DataFrame() function; the other columns are not included in the DataFrame object.

DataFrame object from list of dictionaries

listofdicts = [{'col a': ['abc', 12], 'col b': ['bcd', 34]},
               {'col a': ['cde', 24], 'col b': ['def', 68],
                'col c': [43, 15], 'col d': [3.6, 0.9]}]
dictlist_dataframe = pd.DataFrame(listofdicts)
print(dictlist_dataframe)

#Output:
       col a      col b     col c       col d
0  [abc, 12]  [bcd, 34]       NaN         NaN
1  [cde, 24]  [def, 68]  [43, 15]  [3.6, 0.9]

The output shows that when a row is missing values for some columns, the DataFrame() function fills them with NaN (Not a Number).

DataFrame object from zip

The built-in zip() function pairs up the elements of the input lists into tuples; wrapping the result in list() gives a list of tuples that can be passed to DataFrame().

day = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
date = ['14-06-2021', '22-06-2021', '30-06-2021',
        '08-07-2021', '16-07-2021', '24-07-2021', '01-08-2021']
listoftuple = list(zip(day, date))
zip_dataframe = pd.DataFrame(listoftuple, columns = ['Day', 'Date'])
print(zip_dataframe)

#Output:
   Day        Date
0  mon  14-06-2021
1  tue  22-06-2021
2  wed  30-06-2021
3  thu  08-07-2021
4  fri  16-07-2021
5  sat  24-07-2021
6  sun  01-08-2021

DataFrame object from NumPy array

For this, we need to import the NumPy (Numerical Python) library and use its array() function to create a NumPy array.

import numpy as np
two_d_arr = np.array([[1, 2, 3, 4], [3, 6, 9, 12], [2, 4, 6, 8]])
numpyarray_dataframe = pd.DataFrame(two_d_arr)
print(numpyarray_dataframe)

#Output:
   0  1  2   3
0  1  2  3   4
1  3  6  9  12
2  2  4  6   8

DataFrame object from Dictionary of Series

A Series can be interpreted as a single column of a DataFrame, or as a 1-dimensional labeled array. A Series object can be created using the Series() function of the Pandas library.

dictofseries = {'Series 1' : pd.Series([0.1, 0.2, 0.3], index=['a', 'b', 'c']),
     'Series 2' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
series_dataframe = pd.DataFrame(dictofseries)
print(series_dataframe)

#Output:
   Series 1  Series 2
a       0.1       1.0
b       0.2       2.0
c       0.3       3.0
d       NaN       4.0

DataFrame object from datetime

The date_range() function creates a sequence of date-time values (a DatetimeIndex) from a specified start date, number of periods, and frequency.

# creating 5 date-time values starting at '2019-02-24' with a one-minute frequency
date_ranges = pd.date_range('2019-02-24', periods=5, freq='T')
dict_datetime = {'Date': date_ranges, 'Val': range(1, 2 * len(date_ranges), 2)}
datetime_dataframe = pd.DataFrame(dict_datetime)
print(datetime_dataframe)

#Output:
                 Date  Val
0 2019-02-24 00:00:00    1
1 2019-02-24 00:01:00    3
2 2019-02-24 00:02:00    5
3 2019-02-24 00:03:00    7
4 2019-02-24 00:04:00    9

DataFrame object from MultiIndex

Creating a DataFrame using the MultiIndex functionality of Pandas. The MultiIndex.from_product() function creates index values from the Cartesian product of the two input lists.

listoflist = [['a', 'b', 'c', 'd'],['one','two']]
data = [[1, 2], [3, 4], [5, 6], [7, 8],
        [0, 0.5], [1.5, 2.5], [3.5, 4.5], [5.5, 6.5]]
idx = pd.MultiIndex.from_product(listoflist)
multiindex_dataframe = pd.DataFrame(data, index=idx, columns=['col 1', 'col2'])
print(multiindex_dataframe)

#Output:
       col 1  col2
a one    1.0   2.0
  two    3.0   4.0
b one    5.0   6.0
  two    7.0   8.0
c one    0.0   0.5
  two    1.5   2.5
d one    3.5   4.5
  two    5.5   6.5

DataFrame object from a text file and a CSV file

For this, we read data from text and CSV files using the Pandas read_csv() function. The file path (including the file name) is passed as an argument to read_csv(). If the file uses a delimiter other than a comma, it can be passed through the sep parameter.

txt_dataframe = pd.read_csv("C:/Users/risha/Documents/textdf_data.txt")
print(txt_dataframe)
#Output:
   a b c d e f g h
0  1 2 1 4 3 6 5 9
1  0 1 0 3 2 7 6 5
2  8 2 3 4 5 6 7 2

csv_dataframe = pd.read_csv("C:/Users/risha/Documents/csv_data.csv")
print(csv_dataframe.head())
#Output:
  Series reference  ... Revised
0   PPIQ.SQU900000  ...    1184
1   PPIQ.SQU900001  ...    1181
[2 rows x 5 columns]

DataFrame object from Excel file

For this, we read data from an Excel file using the Pandas read_excel() function. The file path (including the file name) is passed as an argument to read_excel().

excel_dataframe = pd.read_excel("C:/Users/risha/Documents/exceldf_data.xlsx")
print(excel_dataframe.head())

#Output:
  Buying  Maint Doors Persons Lug_boot Safety Overall
0  vhigh  vhigh     2       2    small    low   unacc
1  vhigh  vhigh     2       2    small    med   unacc

Summary of a DataFrame

The info() method prints a concise summary of the DataFrame it is called on. It is useful when thoroughly analyzing a given dataset.

DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, 
show_counts=None, null_counts=None)

The info() method displays information about a DataFrame including the index dtype and columns, non-null values, and memory usage. The method takes several parameters as input:

  • verbose: bool, optional: The verbose parameter is used to declare whether to print a full summary or not.
  • buf: writable buffer, defaults to sys.stdout: The buf parameter is used to decide where to send the output. By default, the output is printed to sys.stdout
  • max_cols: int, optional: The max_cols parameter is used to define the maximum number of columns to be printed. If the DataFrame has more columns than the max_cols, the rest of the data is truncated.
  • memory_usage: bool, str, optional: The memory_usage parameter specifies whether the total memory usage of the DataFrame columns is to be displayed or not.
    • True : Always shows the memory usage.
    • False : Never shows the memory usage.
    • ‘deep’ : Is equivalent to True with deep introspection, and the memory usage is shown in human-readable units (base-2 representation).
  • show_counts: bool, optional: The show_counts parameter defines whether to display non-null counts.
    • True : Always shows counts.
    • False : Never shows the counts.
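The buf parameter is not used in the example below; as a small sketch (with a hypothetical DataFrame), the summary can be captured in a string buffer instead of being printed to sys.stdout:

import io
import pandas as pd

# A hypothetical DataFrame used only to illustrate the buf parameter
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Write the info() summary into a StringIO buffer instead of printing it
buffer = io.StringIO()
df.info(buf=buffer)
print(buffer.getvalue())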

Example:

Making a sample DataFrame and printing it.

import pandas as pd

# Making a sample DataFrame
int_values = [1, 2, 3, 4, 5]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"integer_col": int_values, "text_col": text_values,
                  "float_col": float_values})

# Printing the DataFrame
df

Output:

   integer_col text_col  float_col
0            1        A        0.0
1            2        B        0.5
2            3        C        1.0
3            4        D        1.5
4            5        E        2.0
# using the info method
df.info()

Output:

(The summary lists the RangeIndex with 5 entries, the three columns with their non-null counts and dtypes (int64, object, and float64), and the total memory usage.)
# using the info method with custom parameters
df.info(verbose=True, max_cols=2, memory_usage='deep')

Removing duplicate index in a DataFrame

We sometimes get a duplicate index column that looks identical to the existing index of the DataFrame. It is usually named “Unnamed: 0” (written “Unnamed:0” in the examples below). A duplicate index column does not add any value to the data set.

Methods to remove duplicate index

There are various methods to remove a duplicate index from the DataFrame:

The drop() method

The drop() method removes rows or columns, either by specifying label names and the corresponding axis, or by directly specifying index or column names.

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

A Sample DataFrame

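As a stand-in sample, the following sketch builds a small DataFrame with such a duplicate column by hand (the data and the exact label "Unnamed:0" are assumed here to match the code below; pandas itself usually names the column "Unnamed: 0"):

import pandas as pd

# A small DataFrame with a duplicated index column, as typically produced
# by writing a DataFrame to CSV with its index and reading it back
DataFrame = pd.DataFrame({'Unnamed:0': [0, 1, 2],
                          'text_col': ['A', 'B', 'C'],
                          'float_col': [0.0, 0.5, 1.0]})
print(DataFrame)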

Example:

Removing the duplicate column from the DataFrame using the drop() method.

# Dropping the column "Unnamed:0"; axis=1 refers to columns
DataFrame = DataFrame.drop('Unnamed:0', axis=1)

The set_index() method

In this method, instead of removing the duplicate column, we can make it the index of the DataFrame.

DataFrame.set_index(keys, drop=True, append=False, 
inplace=False, verify_integrity=False)

Example:

Removing the duplicate index from the DataFrame using set_index() method

# Setting "Unnamed:0" as the index of the DataFrame
DataFrame = DataFrame.set_index('Unnamed:0')
The "Unnamed:0" column now serves as the index of the DataFrame.

The read_csv() method

Instead of removing the duplicate index later, we can define our index while importing the file, using the index_col parameter of the read_csv() method.

Example:

Removing the duplicate column from the DataFrame using the read_csv() method.

# index_col=[0] makes the first column as the index of the DataFrame
DataFrame = pd.read_csv(dataurl, index_col=[0])
The first column of the file is used directly as the index, so no duplicate “Unnamed:0” column appears.

Sample items from DataFrame

The sample() method returns a sample of items from a DataFrame from an axis of the object.

DataFrame.sample(n=None, frac=None, replace=False, weights=None, 
random_state=None, axis=None, ignore_index=False)
  • n: int, optional: The parameter n takes the number of items to be returned from the axis. It cannot be used together with frac. It defaults to 1 if frac is None.
  • frac: float, optional: The frac parameter returns a random fraction of the rows instead of a fixed count. It cannot be used together with n.
  • replace: bool, default False: The replace parameter allows or disallows sampling the same row more than once. Sampling with replacement makes it possible to pick the exact same row multiple times.
  • weights: str or ndarray-like, optional: The weights parameter assigns a weight or priority to every row. The greater the weight, the higher its chance of being sampled. It defaults to None, in which case every row has an equal probability of being chosen.
  • random_state: int, array-like, BitGenerator, np.random.RandomState, optional: The random_state parameter accepts an integer, an array-like, or a random number generator, and uses it as the seed for the pseudo-random number generator, making the sample reproducible.
  • axis: 0 or ‘index’, 1 or ‘columns’, default None: The axis parameter specifies which axis to sample; it accepts the axis number or name as input.
    • 0 or ‘index’ to sample rows
    • 1 or ‘columns’ to sample columns
  • ignore_index: bool, default False: When set to True, the ignore_index parameter relabels the resulting index as 0, 1, ..., n-1.
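A minimal sketch of how these parameters are used (the DataFrame here is a hypothetical example, not from this section):

import pandas as pd

# A hypothetical DataFrame used only to illustrate sample()
df = pd.DataFrame({"value": range(10)})

# Three random rows; random_state makes the selection reproducible
print(df.sample(n=3, random_state=42))

# Roughly half of the rows, chosen at random (frac cannot be combined with n)
print(df.sample(frac=0.5, random_state=42))

# Sampling with replacement may return the same row more than once
print(df.sample(n=5, replace=True, random_state=42))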

Rows and Columns of a DataFrame

DataFrames have two dimensions, which are referred to as:

  • 0 refers to Index label or Rows
  • 1 refers to Columns

To select a particular observation in the DataFrame, we need to specify two attributes, the row label, and the corresponding column label. These two dimensions are known as the DataFrame Axes.

We have a predefined attribute in our DataFrame which can be accessed as:

DataFrame.axes

Printing all the axes

Let us take a sample DataFrame and print its axes attribute.

import pandas as pd

# Making a sample DataFrame
int_values = [0, 1, 2, 3, 4]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
                  "float_col": float_values})

# Printing the axes attribute
df.axes
# Output
[RangeIndex(start=0, stop=5, step=1),
 Index(['int_col', 'text_col', 'float_col'], dtype='object')]

Printing a single axis

We can also access the row axis and the column axis separately by indexing the axes list with square brackets.

# Printing the first row axis
df.axes[0]
# Output
RangeIndex(start=0, stop=5, step=1)

Printing one element from an axis

Now, to print a particular element of an axis, we index first into the axes list and then into the chosen axis.

# Printing the second element of the row axis
df.axes[0][1]

# Output
1

Setting index of a DataFrame

We can manually set the index of a DataFrame using the existing columns, with the help of the set_index method. Using this function, we can set the DataFrame index using one or more existing columns or arrays (of the correct length).

The new index can either replace the existing index or expand on it.
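For example, a minimal sketch (with hypothetical data) of the "expand" case, where append=True keeps the existing index and adds the new column as an extra level:

import pandas as pd

# Hypothetical data used only to illustrate append=True
df = pd.DataFrame({'key': ['x', 'y'], 'value': [1, 2]})

# The existing RangeIndex is kept and 'key' is added as a second level,
# producing a MultiIndex instead of replacing the index
expanded = df.set_index('key', append=True)
print(expanded.index)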

Printing index of a DataFrame

Before understanding how to change the index of a DataFrame, we first discuss how to print the index of an existing DataFrame.

import pandas as pd

# Making a sample DataFrame
int_values = [0, 1, 2, 3, 4]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
                  "float_col": float_values})

# Printing the index of DataFrame df
df.index

Output:

RangeIndex(start=0, stop=5, step=1)

Data type of a DataFrame index

By default, the index of a DataFrame is a RangeIndex, a memory-saving special case of the integer index (Int64Index in older versions of pandas), which was introduced in the Data Series section.

# Printing the data-type of the index of a DataFrame
type(df.index)

Output:

pandas.core.indexes.range.RangeIndex

Changing the index of a DataFrame

Using RangeIndex() method

The RangeIndex() function is used to implement an index with a monotonic integer range. The syntax of the method is given as follows:

RangeIndex(start=None, stop=None, step=None)

Example:

import pandas as pd

# Making a sample DataFrame
int_values = [0, 1, 2, 3, 4]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
                  "float_col": float_values})

# Change the index using RangeIndex() Method
df.index = pd.RangeIndex(start=5, stop=10, step=1)
# Output: df now uses the index values 5 to 9
   int_col text_col  float_col
5        0        A        0.0
6        1        B        0.5
7        2        C        1.0
8        3        D        1.5
9        4        E        2.0

Now, let us check the data type of our newly created index using the type() method.

# Printing the data-type of the index of a DataFrame
type(df.index)
# Output
pandas.core.indexes.range.RangeIndex

Using the set_index() method

This method, instead of removing the duplicate column, sets it as the index of the DataFrame.

DataFrame.set_index(keys, drop=True, append=False, 
inplace=False, verify_integrity=False)

Sample DataFrame: the DataFrame with the duplicate "Unnamed:0" column from the section above.

Example:

Setting index using various parameters of the set_index() function.

# Setting "Unnamed:0" as the index of the DataFrame
df = df.set_index('Unnamed:0')
1 35
# Setting "text_col" as the index of the DataFrame
df = df.set_index('text_col', inplace=True)
Set Index Dataframe
# Setting "float_col" as the index of the DataFrame
df.set_index('float_col', drop=False)
Set Index Dataframe 1

Extracting from DataFrames by Label

We can extract data using either the ‘.’ notation or the get() method, both of which work for DataFrames too. Since DataFrames have two dimensions, we also get a wide range of options to slice and extract data beyond these two.
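As a quick sketch of the first two options (with a hypothetical DataFrame), the ‘.’ notation and get() both return a column:

import pandas as pd

# A hypothetical DataFrame used only to illustrate '.' notation and get()
df = pd.DataFrame({"int_col": [0, 1, 2], "text_col": ["A", "B", "C"]})

# Dot notation returns the column as a Series (it works only when the column
# name is a valid Python identifier and does not clash with DataFrame attributes)
print(df.text_col)

# get() also returns the column, but returns a default instead of raising
# an error when the label does not exist
print(df.get("text_col"))
print(df.get("missing_col", default="not found"))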

Extraction using the loc attribute

This is the most typical way to extract data by label in Pandas. The loc attribute enables us to access a group of rows and columns by labels or a boolean array.

Considering a sample DataFrame:

import pandas as pd

# Making a sample DataFrame
int_values = [0, 1, 2, 3, 4]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
                  "float_col": float_values})
1 36

We use the loc attribute to access the row labeled by ‘D‘.

# Extracting from DataFrames by label 'D'
df.loc['D']
# Output
int_col        3
text_col       D
float_col    1.5
Name: D, dtype: object

Now, let us check the data type of the slice extracted by the loc attribute.

# Determining the data type of the slice extracted by the loc attribute
type(df.loc['D'])

# Output
pandas.core.series.Series

This series represents all the column values for the row labeled by ‘D‘.

If we want to extract a particular element from the DataFrame, we can chain square brackets after loc or pass both labels to it at once.

We will extract the value for the float_col for the row labeled by ‘C‘.

# Extracting from DataFrames by label 'C' corresponding to 'float_col'
df.loc['C']['float_col']
# Can also be written as: df.loc['C','float_col']

# Output
1.0

Extracting from DataFrames by Position

Extraction using the iloc attribute

To extract data from DataFrames by position, we use the iloc attribute, which provides purely integer-location based indexing for selection by position.

Considering a sample DataFrame:

import pandas as pd

# Making a sample DataFrame
int_values = [0, 1, 2, 3, 4]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
                  "float_col": float_values})

Now we use the iloc attribute to extract a single row by its position.

# Extracting from DataFrames by position '3'
df.iloc[3]
# Can also be written as: df.iloc[3, :]
# Output
int_col        3
text_col       D
float_col    1.5
Name: 3, dtype: object

For selecting entire rows by position:

# Extracting rows 0 and 2 from the DataFrame by position
df.iloc[[0,2], :]

# Output
   int_col text_col  float_col
0        0        A        0.0
2        2        C        1.0

For displaying selected rows and columns by position.

# Extracting from DataFrames by position
df.iloc[[0,2], 1:2]
# Output
  text_col
0        A
2        C

Extraction using Boolean Masks

When we used boolean masks with a Series, we had to provide a list/sequence of booleans exactly equal in length to the series. Now that we are working with DataFrames, we have to provide two lists of booleans: one matching the length of the row axis, the other matching the length of the column axis.

Example:


Implementing Boolean Mask for DataFrame extraction.

# Implementing Boolean Masks for Extraction from a DataFrame
df.iloc[
    [True if i%2==0 else False for i in range(5)],
    [True if i%2==0 else False for i in range(3)]
]
# Output
   int_col  float_col
0        0        0.0
2        2        1.0
4        4        2.0

To extract one specific data value from the DataFrame, we can pass both positions to iloc.

# Extracting the value at row position 3 and column position 1
df.iloc[3, 1]

# Output
'D'

Single Value Access with at and iat

We can extract values from a DataFrame using loc and iloc attributes. We also have at and iat which are only used to access single values from a DataFrame.

The at and iat attributes are two highly specialized and efficient alternatives: they are faster because they carry less overhead and handle only the single-value case.

Extraction using at attribute

The at attribute is used to access a single value for a row/ column label pair. It is similar to the loc attribute since both provide label-based lookups. at is only used to extract a single value in a DataFrame or Series.

Example:

Considering a sample DataFrame.

import pandas as pd

# Making a sample DataFrame
int_values = [0, 1, 2, 3, 4]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
                  "float_col": float_values})

Extract a single data value from the DataFrame using the at attribute.

# Extracting the data value corresponding to '4' and 'text_col'
df.at[4, 'text_col']

# Output
'E'

Extraction using iat attribute

The iat attribute is used to access a single value for a row/ column pair by an integer position. It is similar to the iloc attribute since both provide integer-based position lookups. iat is only used to extract a single value in a DataFrame or Series.

Example:


Extract a single data value from the DataFrame using the iat attribute.

# Extracting the data value at row position 4 and column position 1 ('text_col')
df.iat[4, 1]

# Output
'E'

Runtime Difference Between loc and at

To check the run time difference between the loc and at attributes, we use %timeit, an IPython magic function that measures the run time of a piece of code.

For loc attribute:

# Runtime of loc attribute using timeit magic function
%timeit df.loc[1,'text_col']

# Output
6.11 µs ± 25.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

For at attribute:

# Runtime of at attribute using timeit magic function
%timeit df.at[1,'text_col']

# Output
3.19 µs ± 28.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

As we see the runtime for the at attribute is much faster than the runtime for the loc attribute. Even though the differences are in micro-seconds, when used in larger datasets and more complex algorithms, this minute difference can add up to something substantial.

Finding Labels and Positions in a DataFrame

Whenever we work with position-based extraction approaches, we may find ourselves counting columns and rows to figure out the exact integer location of the data. This may especially be the case when we combine labels with positions.

The loc, iloc, at, and iat indexers all expect consistent inputs: either labels or integer positions, not a mix of the two.

Let us take an example where we need to extract data given one index as a label and the other as an integer position. What we need to figure out is how to get two labels or two positions when we start with one of each.

Getting Label from Position

This is the first approach where we extract the label from the integer position given.

Example:


Let us consider we need to extract the data corresponding to the Label ‘text_col‘ and position ‘1‘.

  • Firstly, we access the row label at position 1 using the .index attribute and assign it to a new variable ‘index_label‘.
index_label = df.index[1]
index_label

# Output
1
  • Now, we will put together the existing column label and the extracted row label and use the at attribute to complete the extraction.
column_label = 'text_col'
df.at[index_label, column_label]

# Output
'B'

Getting Integer Location from Label

This is the second approach where we extract the integer position from the label given.

Example:

Considering the same example as above, with the label as ‘text_col‘ and the integer position ‘1‘.

  • First, we isolate the columns using the DataFrame.columns attribute. To find the location, we call its get_loc() method and pass the label as a parameter.
column_loc = df.columns.get_loc('text_col')
column_loc

# Output
1
  • Now, we put together the existing index location and the extracted column location and use the iat attribute to complete the extraction.
index_loc = 1
df.iat[index_loc, column_loc]

# Output
'B'

Typecasting data types in a DataFrame

The astype() function is a very useful, built-in method of the Pandas library that casts the given DataFrame or series from one data type to another.

Creating a new DataFrame:

import pandas as pd

# Creating a new sample DataFrame
names = ['Cornstarch', 'Nuts', 'Eggplant', 'Cheese', 'Goji Berries', 'Crackers', 'Peppers', 'Grapes']
serving_size = [100, 100, 100, 100, 100, 100, 100, 100]
calories = [381, 691, 25, 300, 349, 407, 29, 57]
total_fat = ['0.1g', '72g', '0.2g', '24g', '0.4g', '7.2g', '0.4g', '0.5g']

df = pd.DataFrame({"names": names,"serving_size": serving_size, 
'calories': calories, 'total_fat': total_fat})

# Printing the DataFrame
print(df)

#Output:
          names  serving_size  calories total_fat
0    Cornstarch           100       381      0.1g
1          Nuts           100       691       72g
2      Eggplant           100        25      0.2g
3        Cheese           100       300       24g
4  Goji Berries           100       349      0.4g
5      Crackers           100       407      7.2g
6       Peppers           100        29      0.4g
7        Grapes           100        57      0.5g

Now, let’s try and convert our calories column to float using the astype() method.

# Convert the calories column to float using astype() method
df.calories.astype(float)

# Output
0    381.0
1    691.0
2     25.0
3    300.0
4    349.0
5    407.0
6     29.0
7     57.0
Name: calories, dtype: float64

To typecast multiple columns simultaneously, we can use the following syntax:

# Convert the serving_size and calories columns to float using astype() method
df.astype({'serving_size': float, 'calories': float})
The serving_size and calories columns are now stored as float64.

Now the question arises of how to convert the total_fat column to float. We cannot convert it directly: its values may look like floating-point numbers, but they are strings carrying a unit suffix. Hence, we first need to separate the units from the values and then apply the astype() function.

To separate the units from the total_fat column, we use the replace() function to replace ‘g’ with an empty string, after which the column can be typecast to float.

# Replacing 'g' by '' in the total_fat column
df.replace(to_replace= 'g', value= '', regex=True)
The ‘g’ suffix is removed from the total_fat column, but the values are still strings.
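One possible way to finish the conversion (an assumption, not shown above) is to strip the suffix on the column itself and then apply astype():

# A sketch, assuming the df defined in this section:
# strip the 'g' suffix from total_fat and cast the cleaned column to float
df['total_fat'] = df['total_fat'].str.replace('g', '', regex=False).astype(float)
print(df.dtypes)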

Renaming labels in a DataFrame

The rename() method helps with DataFrame transformation. Using this method, we can change or rename the labels of both rows and columns.

Considering a sample DataFrame.

import pandas as pd

# Making a sample DataFrame
int_values = [0, 1, 2, 3, 4]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
                  "float_col": float_values})

Now, to rename the rows and column labels, we can use the rename function as:

# Renaming the rows and column
df.rename(columns={"text_col": "string"}, 
index={0: "Apple", 1: "Banana", 2:"Orange", 3: "Mango", 4: "Cherry"})
# Output
        int_col string  float_col
Apple         0      A        0.0
Banana        1      B        0.5
Orange        2      C        1.0
Mango         3      D        1.5
Cherry        4      E        2.0

Another way of doing this is to pass a single dictionary to a parameter called mapper, together with the axis the mapping applies to.

# Renaming the rows and column
df.rename(mapper= {'int_col': 'integer_colums'}, axis=1)
# Output
   integer_colums text_col  float_col
0               0        A        0.0
1               1        B        0.5
2               2        C        1.0
3               3        D        1.5
4               4        E        2.0

The axis parameter specifies which axis to rename; it accepts the axis number or name as input.

  • 0 or ‘index’ for the row labels
  • 1 or ‘columns’ for the column labels

Excluding null values from a DataFrame

The dropna() method excludes all the null values from a Series or a DataFrame. Using this method in DataFrames comes with a few additional powers. With DataFrames we can specify the axis to drop the null values from.

We convert our sample DataFrame so that it contains some NaN values:

import numpy as np

# Converting the sample DataFrame to have some NaN values
df.loc[2, 'int_col'] = np.nan
df.loc[1, :] = np.nan

# df now looks like:
   int_col text_col  float_col
0      0.0        A        0.0
1      NaN      NaN        NaN
2      NaN        C        1.0
3      3.0        D        1.5
4      4.0        E        2.0

# To drop all rows containing NaN as a value
df.dropna()
# Can also be written as: df.dropna(how= 'any', axis=0)
# Output
   int_col text_col  float_col
0      0.0        A        0.0
3      3.0        D        1.5
4      4.0        E        2.0
# To drop rows containing all elements as Nan
df.dropna(how='all', axis=0)
# Output
   int_col text_col  float_col
0      0.0        A        0.0
2      NaN        C        1.0
3      3.0        D        1.5
4      4.0        E        2.0
# To drop rows with a minimum threshold of values
df.dropna(thresh=3, axis=0)
# This is the same as: df.dropna(how='any', axis=0)
# Output: identical to the df.dropna() result above

The dropna() Method with Subset

We use the subset parameter for dropna() method to restrict the function to only a handful of row or column labels that we choose. This will make pandas only look for NaNs in the specified rows or columns.

Using the sample DataFrame:

# Using dropna() on the float_col column
df.dropna(axis=0, how='any', subset=['float_col'])
# Output
   int_col text_col  float_col
0      0.0        A        0.0
2      NaN        C        1.0
3      3.0        D        1.5
4      4.0        E        2.0

NOTE: The subset parameter is always orthogonal to the axis provided. For example, when axis=1 (columns), the subset takes row labels.

# Using dropna() to drop missing values from 2 and 3 subset of the DataFrame
df.dropna(axis=1, how='any', subset=[2, 3])
# Output
  text_col  float_col
0        A        0.0
1      NaN        NaN
2        C        1.0
3        D        1.5
4        E        2.0

Filtering a DataFrame in 2D

When working with large datasets, we sometimes need to quickly slice and dice the labels we are interested in. To achieve this we can use the filter() method, the same method we used with Series.

Using this method in DataFrames comes with additional powers since we can manipulate rows, columns, or both.

Considering a sample DataFrame:

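A minimal sketch that rebuilds the nutrition DataFrame from the food data used in the typecasting section, with the 'g' suffix already stripped from total_fat and the food names used as the index (the exact values are assumed here):

import pandas as pd

# A stand-in for the nutrition DataFrame used in the examples below
nutrition = pd.DataFrame({
    'serving_size': [100, 100, 100, 100, 100, 100, 100, 100],
    'calories': [381, 691, 25, 300, 349, 407, 29, 57],
    'total_fat': [0.1, 72.0, 0.2, 24.0, 0.4, 7.2, 0.4, 0.5]},
    index=['Cornstarch', 'Nuts', 'Eggplant', 'Cheese',
           'Goji berries', 'Crackers', 'Peppers', 'Grapes'])
nutrition.index.name = 'name'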

Using filter() with like parameter:

# Using the filter() function to identify rows having "Goji" as a keyword 
nutrition.filter(like="Goji", axis=0)
Only the row for Goji berries is returned.

NOTE: The like parameter is case-sensitive.

Using filter() with regex:

The regex parameter uses regular expressions and can match patterns of partial words, whole words, or multiple words.

# To identify rows having "a" or "A" as a partial word or a keyword 
 nutrition.filter(regex ='[aA]')
Set Index Dataframe 28

If we want the function to not be case-sensitive, we can use the following approach:

# To identify rows having "e" or "E" as a partial word or a keyword 
 nutrition.filter(regex ='(?i)e')
Set Index Dataframe 30

Filtering along both axis simultaneously:

# To filter elements having "e" or "E" along both axes
nutrition.filter(regex='(?i)e', axis=0)\
    .filter(items=['serving_size', 'calories'], axis=1)

Only the rows whose name contains "e" or "E" are kept, and only their serving_size and calories columns are shown.

DataFrame Sorting

The sort_values() method is used to sort Series and DataFrames. With DataFrames, we must pass the by parameter to sort_values() to state which column(s) to sort by, since a DataFrame contains several columns.

Let us consider a sample DataFrame:


Now let us sort this DataFrame in ascending order with respect to the calorie content in a food item.

# To sort elements in ascending order by calories.
nutrition.sort_values(by=['calories'], ascending=True)

The rows are returned in increasing order of calories, from Eggplant (25) up to Nuts (691).

Similarly, if we want to sort the DataFrame by multiple columns, we can do it as:

# To sort elements by calories and total_fat in ascending and descending order respectively.
nutrition.sort_values(by=['calories', 'total_fat'], ascending=[True, False])

The rows are sorted by calories in ascending order, with ties broken by total_fat in descending order.

Using between() function with DataFrames

The between() function is used to check which of the elements of a given series lies between the specified range. It returns a series of boolean elements containing True wherever the corresponding series element is between the boundary of the left and right values.

Example:

# Finding the elements containing calories between 20 and 300
nutrition.calories.between(20,300)

# Output
name
Cornstarch      False
Nuts            False
Eggplant         True
Cheese           True
Goji berries    False
Crackers        False
Peppers          True
Grapes           True
Name: calories, dtype: bool

Now if we want to print all the elements lying between a given range, we can apply the following approach:

# Finding and printing the elements containing calories between 20 and 300
nutrition[nutrition.calories.between(20,300)]
Only the rows for Eggplant, Cheese, Peppers, and Grapes are returned.

The min, max and idx Functions

The min(), max(), and idxmin()/idxmax() methods are applicable to DataFrames along both axes; the row or column axis to operate on is specified through their parameters.

Taking the sample DataFrame as:


The min() Function

The min() method is used to return the minimum value of the elements over the requested axis.

Using the min() function without providing any parameters:

# Using min() function without any parameters
nutrition.min()

# Output
serving_size    100.0
calories         25.0
total_fat         0.1
dtype: float64

To find the minimum across each row (axis=1):

# Using min() function along the rows without any parameters
nutrition.min(axis=1)

# Output
name
Cornstarch       0.1
Nuts            72.0
Eggplant         0.2
Cheese          24.0
Goji berries     0.4
Crackers         7.2
Peppers          0.4
Grapes           0.5
dtype: float64

The max() Function

The max() method is used to return the maximum value of the elements over the requested axis.

Let us find the maximum amount of calories in a food item:

# Using max() function to find the maximum amount of calories
nutrition.calories.max()

# Output
691

The idxmax() Function

The idxmax() method is used to return the index of the first occurrence of maximum over the requested axis.

In the above example, we found the maximum amount of calories in a food item. The idxmax() function can be used to return the index of the food item with the maximum calories.

# Using idxmax() to find the index of the food item with the maximum amount of calories
nutrition.calories.idxmax()

# Output
'Nuts'

The nlargest and nsmallest function

The nlargest() Function

The nlargest() function returns the first n rows with the largest values in columns, in descending order. The columns which are not specified are also returned but are not used for ordering.

Let us understand it with an example on our sample DataFrame:

# Finding the first 8 rows with largest values of "total_fat"
nutrition.nlargest(8, columns='total_fat')
All eight rows are returned, ordered from the highest total_fat (Nuts) down to the lowest.

The nsmallest() Function

The nsmallest() function returns the first n rows with the smallest values in columns, in ascending order. The columns which are not specified are also returned but are not used for ordering.

Let us understand it with an example on our sample DataFrame:

# Finding the first 8 rows with the lowest values of "total_fat" and "calories"
nutrition.nsmallest(8, columns=['total_fat', 'calories'])

All eight rows are returned, ordered from the lowest total_fat upward, with calories used to break ties.