The DataFrame is a data structure provided by the Pandas library. A DataFrame is a 2-dimensional labeled array and can be compared to tabular data with rows and columns. The data inside a DataFrame can be heterogeneous (different data types).
Each column of a DataFrame normally represents an attribute and, when treated individually, is known as a Series. Each column can have a distinct data type and can be accessed by a label. Each row can be accessed by either an integer index or a label.
The size of a DataFrame is not fixed, and the data in a DataFrame can be changed (it is mutable). The DataFrame() function of Pandas is used to create DataFrame objects.
The DataFrame Function
The DataFrame() function of the Pandas library creates a 2-dimensional, size-mutable (changeable), heterogeneous tabular data structure. The DataFrame() function returns a DataFrame object which contains axis labels (labels for both rows and columns).
Syntax:
DataFrame(data, index, columns)
The data parameter is the input data of the DataFrame. The data can be a dictionary, a NumPy array, or another DataFrame. If data is a dictionary, it can contain Series, arrays, constants, dataclass instances, or list-like objects. The index and columns parameters are the index labels and column labels for the resulting DataFrame, respectively.
If no index and columns are specified, the default labels are integers in the range 0 to n-1. If data already contains column labels and column labels are explicitly given as well, this leads to column selection instead (i.e. only the column labels that exist in data and are also mentioned in the columns parameter are kept).
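The column-selection behavior described above can be sketched with a small example (the dictionary and labels here are made up for illustration):

```python
import pandas as pd

# A dictionary whose keys already act as column labels
data = {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}

# Passing columns=['a', 'c'] selects only those columns from data;
# 'b' is dropped rather than relabeled
df = pd.DataFrame(data, columns=['a', 'c'])
print(df)

# A label named in columns but absent from data becomes an all-NaN column
df2 = pd.DataFrame(data, columns=['a', 'z'])
print(df2)
```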
DataFrame Creation
The DataFrame can be created in several ways as follows:
DataFrame object from list of lists
Creating a DataFrame object from a list of lists containing strings, integers, and float values.
import pandas as pd
listoflist = [['tom', 10, 5], ['sam', 12, 5.3], ['david', 15, 5.8]]
list_dataframe = pd.DataFrame(listoflist, index=['person 1', 'person 2', 'person 3'],
columns = ['Name', 'Age', 'Height'])
print(list_dataframe)
#Output:
Name Age Height
person 1 tom 10 5.0
person 2 sam 12 5.3
person 3 david 15 5.8
DataFrame object from dictionary of lists
The keys of the dictionary will be taken as column labels for the DataFrame.
dictoflists = {'column 1': ['abc', 'def'], 'column 2': ['123', '456'],
'column 3': [23, 32], 'column 4': [5.6, 8.9]}
dictlist_dataframe = pd.DataFrame(dictoflists, columns = ['column 1', 'column 3'])
print(dictlist_dataframe)
#Output:
column 1 column 3
0 abc 23
1 def 32
The output shows that ‘column 1’ and ‘column 3’ are selected because they appear both in the dictionary keys and in the columns parameter of the DataFrame() function; the other columns are not included in the DataFrame object.
DataFrame object from list of dictionaries
listofdicts = [{'col a': ['abc', 12], 'col b': ['bcd', 34]},
               {'col a': ['cde', 24], 'col b': ['def', 68],
                'col c': [43, 15], 'col d': [3.6, 0.9]}]
dictlist_dataframe = pd.DataFrame(listofdicts)
print(dictlist_dataframe)
#Output:
col a col b col c col d
0 [abc, 12] [bcd, 34] NaN NaN
1 [cde, 24] [def, 68] [43, 15] [3.6, 0.9]
The output shows that when there are missing row values, the DataFrame() function treats them as NaN (Not a Number) values.
DataFrame object from zip
The built-in Python zip() function pairs the elements of the two input lists into tuples, which can then be collected in a list.
day = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
date = ['14-06-2021', '22-06-2021', '30-06-2021',
'08-07-2021', '16-07-2021', '24-07-2021', '01-08-2021']
listoftuple = list(zip(day, date))
zip_dataframe = pd.DataFrame(listoftuple, columns = ['Day', 'Date'])
print(zip_dataframe)
#Output:
Day Date
0 mon 14-06-2021
1 tue 22-06-2021
2 wed 30-06-2021
3 thu 08-07-2021
4 fri 16-07-2021
5 sat 24-07-2021
6 sun 01-08-2021
DataFrame object from numpy array
For this, we need to import the NumPy (Numerical Python) library and create a NumPy array using its array() function.
import numpy as np
two_d_arr = np.array([[1, 2, 3, 4], [3, 6, 9, 12], [2, 4, 6, 8]])
numpyarray_dataframe = pd.DataFrame(two_d_arr)
print(numpyarray_dataframe)
#Output:
0 1 2 3
0 1 2 3 4
1 3 6 9 12
2 2 4 6 8
DataFrame object from Dictionary of Series
A Series can be interpreted as a single column of a DataFrame, or a 1-dimensional labeled array. A Series object can be created using the Series() function of the Pandas library.
dictofseries = {'Series 1' : pd.Series([0.1, 0.2, 0.3], index=['a', 'b', 'c']),
'Series 2' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
series_dataframe = pd.DataFrame(dictofseries)
print(series_dataframe)
#Output:
Series 1 Series 2
a 0.1 1.0
b 0.2 2.0
c 0.3 3.0
d NaN 4.0
DataFrame object from datetime
The date_range() function creates a sequence of date-time values for a specified start date and number of periods.
# creating 5 date-time values starting at '2019-02-24', one minute apart
date_ranges = pd.date_range('2019-02-24', periods=5, freq='T')
dict_datetime = {'Date': date_ranges, 'Val': range(1, 2 * len(date_ranges), 2)}
datetime_dataframe = pd.DataFrame(dict_datetime)
print(datetime_dataframe)
#Output:
Date Val
0 2019-02-24 00:00:00 1
1 2019-02-24 00:01:00 3
2 2019-02-24 00:02:00 5
3 2019-02-24 00:03:00 7
4 2019-02-24 00:04:00 9
DataFrame object from MultiIndex
Creating a DataFrame from the MultiIndex of Pandas. The MultiIndex.from_product() function creates index values from the Cartesian product of the input lists.
listoflist = [['a', 'b', 'c', 'd'],['one','two']]
data = [[1, 2], [3, 4], [5, 6], [7, 8],
[0, 0.5], [1.5, 2.5], [3.5, 4.5], [5.5, 6.5]]
idx = pd.MultiIndex.from_product(listoflist)
multiindex_dataframe= pd.DataFrame(data, index = idx, columns = ['col 1', 'col2'])
print(multiindex_dataframe)
#Output:
col 1 col2
a one 1.0 2.0
two 3.0 4.0
b one 5.0 6.0
two 7.0 8.0
c one 0.0 0.5
two 1.5 2.5
d one 3.5 4.5
two 5.5 6.5
DataFrame object from a text file and a csv file
For this, we need to read data from the text and CSV files, which can be done through the Pandas read_csv() function. The file path with the file name is given as an argument to the read_csv() function.
txt_dataframe = pd.read_csv("C:/Users/risha/Documents/textdf_data.txt")
print(txt_dataframe)
#Output:
a b c d e f g h
0 1 2 1 4 3 6 5 9
1 0 1 0 3 2 7 6 5
2 8 2 3 4 5 6 7 2
csv_dataframe = pd.read_csv("C:/Users/risha/Documents/csv_data.csv")
print(csv_dataframe.head())
#Output:
Series reference ... Revised
0 PPIQ.SQU900000 ... 1184
1 PPIQ.SQU900001 ... 1181
[2 rows x 5 columns]
DataFrame object from excel file
For this, we need to read data from an Excel file through the Pandas read_excel() function. The file path with the file name is given as an argument to the read_excel() function.
excel_dataframe = pd.read_excel("C:/Users/risha/Documents/exceldf_data.xlsx")
print(excel_dataframe.head())
#Output:
Buying Maint Doors Persons Lug_boot Safety Overall
0 vhigh vhigh 2 2 small low unacc
1 vhigh vhigh 2 2 small med unacc
Summary of a DataFrame
The info() function gets a summary of the DataFrame that invokes it. It is useful while thoroughly analyzing a given data set.
DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None,
show_counts=None, null_counts=None)
The info() method displays information about a DataFrame, including the index dtype and columns, non-null counts, and memory usage. The method takes several parameters as input:
- verbose: bool, optional: The verbose parameter declares whether to print a full summary or not.
- buf: writable buffer, defaults to sys.stdout: The buf parameter decides where to send the output. By default, the output is printed to sys.stdout.
- max_cols: int, optional: The max_cols parameter defines the maximum number of columns to be printed. If the DataFrame has more columns than max_cols, the rest of the output is truncated.
- memory_usage: bool, str, optional: The memory_usage parameter specifies whether the total memory usage of the DataFrame columns is displayed. True: always shows the memory usage. False: never shows the memory usage. 'deep': equivalent to True with deep introspection; the memory usage is shown in human-readable units (base-2 representation).
- show_counts: bool, optional: The show_counts parameter defines whether to display non-null counts. True: always shows the counts. False: never shows the counts.
Example:
Making a sample DataFrame and printing it.
import pandas as pd
# Making a sample DataFrame
int_values = [1, 2, 3, 4, 5]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"integer_col": int_values, "text_col": text_values,
"float_col": float_values})
# Printing the DataFrame
df
Output:
# using the info method
df.info()
Output:
# using the info method with custom parameters
df.info(verbose=True, max_cols=2, memory_usage='deep')
Removing duplicate index in a DataFrame
We sometimes get a duplicated index column that looks identical to the existing index of the DataFrame. It is usually named "Unnamed: 0". A duplicate index does not add any value to the data set.
Methods to remove duplicate index
There are various methods to remove a duplicate index from the DataFrame:
The drop() method
The drop() method removes rows or columns by specifying label names and the corresponding axis, or by directly specifying the index or column names.
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
A Sample DataFrame
Example:
Removing the duplicate column from the DataFrame using the drop()
method.
# Dropping the column "Unnamed: 0"; axis=1 refers to the vertical (column) axis
DataFrame = DataFrame.drop('Unnamed: 0', axis=1)
The set_index() method
In this method, instead of removing the duplicate column, we can make it the index of the DataFrame.
DataFrame.set_index(keys, drop=True, append=False,
inplace=False, verify_integrity=False)
Example:
Removing the duplicate index from the DataFrame using set_index()
method
# Setting "Unnamed: 0" as the index of the DataFrame
DataFrame = DataFrame.set_index('Unnamed: 0')
The read_csv() method
Instead of removing the duplicate index later, we can define our index while importing the file, using the index_col parameter of the read_csv() method.
Example:
Removing the duplicate column from the DataFrame using the read_csv()
method.
# index_col=[0] makes the first column as the index of the DataFrame
DataFrame = pd.read_csv(dataurl, index_col=[0])
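Since dataurl above is a placeholder, here is a self-contained sketch of the same idea using an in-memory CSV (io.StringIO stands in for a file path; the column names are made up):

```python
import io
import pandas as pd

# Simulated CSV text whose first (unnamed) column is a previously saved index;
# reading from a real file path works the same way
csv_text = ",int_col,text_col\n0,10,A\n1,20,B\n2,30,C\n"

# index_col=[0] turns the first column into the index instead of
# letting pandas create an "Unnamed: 0" column
df = pd.read_csv(io.StringIO(csv_text), index_col=[0])
print(df.columns.tolist())
```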
Sample items from DataFrame
The sample() method returns a random sample of items from an axis of the DataFrame.
DataFrame.sample(n=None, frac=None, replace=False, weights=None,
random_state=None, axis=None, ignore_index=False)
- n: int, optional: The n parameter takes the number of items to be returned from the axis. It cannot be used together with frac. Its default value is 1 if frac=None.
- frac: float, optional: The frac parameter returns a random fraction of the data set instead of a fixed count. The frac parameter cannot be used with n.
- replace: bool, default False: The replace parameter either allows or disallows sampling the same row more than once. Sampling with replacement makes it possible to pick the exact same row multiple times.
- weights: str or ndarray-like, optional: The weights parameter assigns a weight or a priority to every sample. The greater the weight, the higher the chances of being sampled. The value of this parameter defaults to None, in which case every row has an equal probability of being chosen.
- random_state: int, array-like, BitGenerator, np.random.RandomState, optional: The random_state parameter accepts an integer, an array-like, or a random number generator itself, and uses it as the seed for a pseudo-random number generator.
- axis: 0 or 'index', 1 or 'columns', default None: The axis parameter selects which axis to sample; it accepts the axis number or name as input. 0 or 'index' samples rows; 1 or 'columns' samples columns.
- ignore_index: bool, default False: When set to True, the ignore_index parameter relabels the resulting index as 0, 1, ..., n-1.
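A minimal sketch of these parameters on a small made-up DataFrame (random_state keeps the draws reproducible):

```python
import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': list('vwxyz')})

# n=2: draw two random rows
two_rows = df.sample(n=2, random_state=42)
print(two_rows)

# frac=0.4: draw 40% of the rows (2 of 5) instead of a fixed count
fraction = df.sample(frac=0.4, random_state=0)
print(fraction)

# axis=1: sample columns instead of rows
one_col = df.sample(n=1, axis=1, random_state=1)
print(one_col)
```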
Rows and Columns of a DataFrame
DataFrames have two dimensions; visually, they are categorized as:
- 0 refers to the index labels, or rows
- 1 refers to the columns
To select a particular observation in the DataFrame, we need to specify two attributes: the row label and the corresponding column label. These two dimensions are known as the DataFrame axes.
We have a predefined attribute in our DataFrame which can be accessed as:
DataFrame.axes
Printing all the axes
Let us take a sample DataFrame and print its axes attribute.
import pandas as pd
# Making a sample DataFrame
int_values = [0, 1, 2, 3, 4]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
"float_col": float_values})
# Printing the axes attribute
df.axes
Printing a single axis
We can also access the row axes and column axes separately using the following command, having index values in square braces.
# Printing the first row axis
df.axes[0]
Printing one element from an axis
Now, to print a particular element from a DataFrame, we need to specify the index values of both row and column axes simultaneously.
# Printing the second element of the first row axis
df.axes[0][1]
# Output
1
Setting index of a DataFrame
We can manually set the index of a DataFrame using existing columns, with the help of the set_index() method. Using this method, we can set the DataFrame index using one or more existing columns or arrays (of the correct length).
The new index can either replace the existing index or expand on it.
Printing index of a DataFrame
Before understanding how to change the index of a DataFrame, we first discuss how to print the index of an existing DataFrame.
import pandas as pd
# Making a sample DataFrame
int_values = [0, 1, 2, 3, 4]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
"float_col": float_values})
# Printing the index of DataFrame df
df.index
Output:
Data type of a DataFrame index
By default, the index of a DataFrame is a RangeIndex, a memory-saving special case of Int64Index that represents a monotonic integer range, as seen in the Data Series section.
# Printing the data-type of the index of a DataFrame
type(df.index)
Output:
Changing the index of a DataFrame
Using RangeIndex() method
The RangeIndex() function implements an index with a monotonic integer range. Its syntax is as follows:
RangeIndex(start=None, stop=None, step=None)
Example:
import pandas as pd
# Making a sample DataFrame
int_values = [0, 1, 2, 3, 4]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
"float_col": float_values})
# Change the index using RangeIndex() Method
df.index = pd.RangeIndex(start=5, stop=10, step=1)
Now, let us check the data type of our newly created index using the type() function.
# Printing the data-type of the index of a DataFrame
type(df.index)
Using the set_index() method
This method can also set an existing column (such as a duplicate index column) as the index of the DataFrame, instead of removing it.
DataFrame.set_index(keys, drop=True, append=False,
inplace=False, verify_integrity=False)
Sample DataFrame:
Example:
Setting the index using various parameters of the set_index() function.
# Setting "Unnamed: 0" as the index of the DataFrame
df = df.set_index('Unnamed: 0')
# Setting "text_col" as the index in place; with inplace=True, set_index
# returns None, so the result must not be reassigned
df.set_index('text_col', inplace=True)
# Setting "float_col" as the index while keeping the column (drop=False)
df.set_index('float_col', drop=False)
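The point that the new index can either replace the existing index or expand on it can be sketched on a small made-up DataFrame (append=True produces a MultiIndex):

```python
import pandas as pd

df = pd.DataFrame({"int_col": [0, 1, 2],
                   "text_col": ['A', 'B', 'C'],
                   "float_col": [0.0, 0.5, 1.0]})

# Replace the default RangeIndex with text_col
by_text = df.set_index('text_col')
print(by_text.index.name)

# append=True keeps the existing index and adds int_col as a second level,
# producing a two-level MultiIndex
expanded = by_text.set_index('int_col', append=True)
print(expanded.index.nlevels)

# drop=False keeps the column in the frame as well as in the index
kept = df.set_index('float_col', drop=False)
print('float_col' in kept.columns)
```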
Extracting from DataFrames by Label
We can extract data using either the ‘.’ notation or the get() method, both of which work for DataFrames too. Because DataFrames have two dimensions, we get a wide range of options to slice and extract data beyond the two mentioned above.
Extraction using the loc attribute
This is the prototypical way to extract by label in Pandas. The loc attribute enables us to access a group of rows and columns by labels or a boolean array.
Considering a sample DataFrame:
import pandas as pd
# Making a sample DataFrame
int_values = [0, 1, 2, 3, 4]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
"float_col": float_values})
# Setting text_col as the index so the rows are labeled 'A' through 'E'
df = df.set_index('text_col')
We use the loc attribute to access the row labeled ‘D‘.
# Extracting from DataFrames by label 'D'
df.loc['D']
Now, let us check the data type of the slice extracted by the loc attribute.
# Determining the data type of the slice extracted by the loc attribute
type(df.loc['D'])
# Output
pandas.core.series.Series
This series represents all the column values for the row labeled ‘D‘.
If we want to extract a particular element from the DataFrame, we do it using square braces. We will extract the value of float_col for the row labeled ‘C‘.
# Extracting from DataFrames by label 'C' corresponding to 'float_col'
df.loc['C']['float_col']
# Can also be written as: df.loc['C','float_col']
# Output
1.0
Extracting from DataFrames by Position
Extraction using the iloc attribute
To extract data from DataFrames by position, we use the iloc attribute, which provides purely integer-location-based indexing for selection by position.
Considering a sample DataFrame:
import pandas as pd
# Making a sample DataFrame
int_values = [0, 1, 2, 3, 4]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
"float_col": float_values})
Now, we use the iloc attribute to select a row by its integer position.
# Extracting from DataFrames by position '3'
df.iloc[3]
# Can also be written as: df.iloc[3, :]
To select multiple rows by position (all columns):
# Extracting the rows at positions 0 and 2
df.iloc[[0,2], :]
For displaying selected rows and columns by position.
# Extracting from DataFrames by position
df.iloc[[0,2], 1:2]
Extraction using Boolean Masks
When we used boolean masks with Series, we had to provide a list or sequence of booleans exactly equal in length to the Series. Now that we are working with DataFrames, we can provide two lists of booleans: one exactly matching the length of the row axis, the other matching the length of the column axis.
Example:
Implementing Boolean Mask for DataFrame extraction.
# Implementing Boolean Masks for Extraction from a DataFrame
df.iloc[
[True if i%2==0 else False for i in range(5)],
[True if i%2==0 else False for i in range(3)]
]
To extract one specific data value from the DataFrame by position:
# Extracting the value at row position 3, column position 1
df.iloc[3, 1]
# Output
'D'
Single Value Access with at and iat
We can extract values from a DataFrame using the loc and iloc attributes. We also have at and iat, which are used only to access single values from a DataFrame.
The at and iat attributes are two highly specialized and efficient alternatives: they are faster because they carry less overhead and handle only the single-value case.
Extraction using at attribute
The at attribute is used to access a single value for a row/column label pair. It is similar to the loc attribute since both provide label-based lookups. at is only used to extract a single value from a DataFrame or Series.
Example:
Considering a sample DataFrame.
import pandas as pd
# Making a sample DataFrame
int_values = [0, 1, 2, 3, 4]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
"float_col": float_values})
Extract a single data value from the DataFrame using the at attribute.
# Extracting the data value corresponding to '4' and 'text_col'
df.at[4, 'text_col']
# Output
'E'
Extraction using iat attribute
The iat attribute is used to access a single value for a row/column pair by integer position. It is similar to the iloc attribute since both provide integer-based position lookups. iat is only used to extract a single value from a DataFrame or Series.
Example:
Extract a single data value from the DataFrame using the iat attribute.
# Extracting the data value at row position 4 and column position 1 ('text_col')
df.iat[4, 1]
# Output
'E'
Runtime Difference Between loc and at
To check the runtime difference between the loc and at attributes, we use %timeit, an IPython magic function that measures the runtime of a particular piece of code.
For loc attribute:
# Runtime of loc attribute using timeit magic function
%timeit df.loc[1,'text_col']
# Output
6.11 µs ± 25.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For at attribute:
# Runtime of at attribute using timeit magic function
%timeit df.at[1,'text_col']
# Output
3.19 µs ± 28.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As we can see, the at attribute runs much faster than the loc attribute. Even though the differences are in microseconds, in larger data sets and more complex algorithms this minute difference can add up to something substantial.
Finding Labels and Positions in a DataFrame
Whenever we work with position-based extraction approaches, we may find ourselves counting columns and rows to figure out the exact integer location of the data. This may especially be the case when we combine labels with positions.
The previously discussed indexers, loc, iloc, at, and iat, all expect consistent indices: they take either labels or integer positions, not a mix of both.
Let us take an example where we need to extract the data corresponding to one index given as a label and the other as an integer position. What we need to figure out is how to get two labels or two positions when we have one of each.
Getting Label from Position
This is the first approach where we extract the label from the integer position given.
Example:
Let us consider that we need to extract the data corresponding to the label ‘text_col‘ and position ‘1‘.
- First, we access the row label at position 1 using the .index attribute and assign it to a new variable ‘index_label‘.
index_label = df.index[1]
index_label
# Output
1
- Now, we put together the existing column label and the extracted row label and use the at attribute to complete the extraction.
column_label = 'text_col'
df.at[index_label, column_label]
# Output
'B'
Getting Integer Location from Label
This is the second approach where we extract the integer position from the label given.
Example:
Considering the same example as above, with the label ‘text_col‘ and the integer position ‘1‘.
- First, we isolate the columns using the DataFrame.columns attribute. To find the location, we use the get_loc() function and pass the label as a parameter.
column_loc = df.columns.get_loc('text_col')
column_loc
# Output
1
- Now, we put together the existing index location and the extracted column location and use the iat attribute to complete the extraction.
index_loc = 1
df.iat[index_loc, column_loc]
# Output
'B'
Typecasting data types in a DataFrame
The astype() function is a very useful built-in method of the Pandas library that casts a given DataFrame or Series from one data type to another.
Creating a new DataFrame:
import pandas as pd
# Creating a new sample DataFrame
names = ['Cornstarch', 'Nuts', 'Eggplant', 'Cheese', 'Goji Berries', 'Crackers', 'Peppers', 'Grapes']
serving_size = [100, 100, 100, 100, 100, 100, 100, 100]
calories = [381, 691, 25, 300, 349, 407, 29, 57]
total_fat = ['0.1g', '72g', '0.2g', '24g', '0.4g', '7.2g', '0.4g', '0.5g']
df = pd.DataFrame({"names": names,"serving_size": serving_size,
'calories': calories, 'total_fat': total_fat})
# Printing the DataFrame
print(df)
Now, let’s try and convert our calories column to float using the astype() method.
# Convert the calories column to float using astype() method
df.calories.astype(float)
# Output
0 381.0
1 691.0
2 25.0
3 300.0
4 349.0
5 407.0
6 29.0
7 57.0
Name: calories, dtype: float64
To typecast multiple columns simultaneously, we can use the following syntax:
# Convert the serving_size and calories columns to float using astype() method
df.astype({'serving_size': float, 'calories': float})
Now the question arises of how to convert the total_fat column to float. We cannot convert the total_fat column to float directly: it may look like a floating-point value, but it is a string object. Hence, we first need to separate the units from the values and then apply the astype() function.
To separate the units from the total_fat column, we use the replace() function to replace 'g' with '' (an empty string), and then typecast the column to float.
# Replacing 'g' by '' in the total_fat column
df.replace(to_replace= 'g', value= '', regex=True)
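The replace-then-typecast step can be sketched end to end on a small Series (the values are a subset of the sample data above):

```python
import pandas as pd

total_fat = pd.Series(['0.1g', '72g', '0.2g', '24g'])

# Strip the trailing 'g' with a regex replacement, then cast to float
as_float = total_fat.replace(to_replace='g', value='', regex=True).astype(float)
print(as_float.dtype)
print(as_float.tolist())
```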
Renaming labels in a DataFrame
The rename() method helps us with DataFrame transformation. Using this method, we can change or rename the labels for both rows and columns.
Considering a sample DataFrame.
import pandas as pd
# Making a sample DataFrame
int_values = [0, 1, 2, 3, 4]
text_values = ['A', 'B', 'C', 'D', 'E']
float_values = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
"float_col": float_values})
Now, to rename the rows and column labels, we can use the rename function as:
# Renaming the rows and column
df.rename(columns={"text_col": "string"},
index={0: "Apple", 1: "Banana", 2:"Orange", 3: "Mango", 4: "Cherry"})
Another way of doing this is to pass the mapping dictionary to a parameter called mapper, together with the axis it applies to, instead of using the columns or index parameters.
# Renaming the rows and column
df.rename(mapper= {'int_col': 'integer_colums'}, axis=1)
The axis parameter selects which axis the mapper applies to; it accepts the axis number or name as input:
- 0 or 'index' for the rows
- 1 or 'columns' for the columns
Excluding null values from a DataFrame
The dropna() method excludes all the null values from a Series or a DataFrame. Using this method on DataFrames comes with a few additional powers: we can specify the axis to drop the null values from.
We have converted our sample DataFrame into:
# Converting the sample DataFrame to have some NaN values
import numpy as np
df.loc[2, 'int_col'] = np.nan
df.loc[1, :] = np.nan
# To drop all rows containing NaN as a value
df.dropna()
# Can also be written as: df.dropna(how= 'any', axis=0)
# To drop rows containing all elements as Nan
df.dropna(how='all', axis=0)
# To drop rows with fewer than 3 non-null values
df.dropna(thresh=3, axis=0)
# For this three-column DataFrame, this is the same as: df.dropna(how='any', axis=0)
The dropna() Method with Subset
We use the subset parameter of the dropna() method to restrict the function to only a handful of row or column labels that we choose. This makes Pandas look for NaNs only in the specified rows or columns.
Using the sample DataFrame:
# Using dropna() on the float_col column
df.dropna(axis=0, how='any', subset=['float_col'])
NOTE: The subset labels are always orthogonal to the axis provided. For example, with axis=1 (columns), the subset takes row labels.
# Using dropna() to drop missing values from 2 and 3 subset of the DataFrame
df.dropna(axis=1, how='any', subset=[2, 3])
Filtering a DataFrame in 2D
When working with large data sets, we sometimes need to quickly slice and dice the labels we are interested in. To achieve this, we use the filter() method, which works the same way as it does for Series.
Using this method in DataFrames comes with additional powers since we can manipulate rows, columns, or both.
Considering a sample DataFrame (here we assume the nutrition DataFrame built in the typecasting section, with the names column set as its index):
Using filter() with the like parameter:
# Using the filter() function to identify rows having "Goji" as a keyword
nutrition.filter(like="Goji", axis=0)
NOTE: The like parameter is case-sensitive.
Using filter() with the regex parameter:
The regex parameter uses regular expressions and can match patterns of partial words, whole words, or multiple words.
# To identify rows having "a" or "A" as a partial word or a keyword
nutrition.filter(regex ='[aA]')
If we want the filter to be case-insensitive, we can use the following approach:
# To identify rows having "e" or "E" as a partial word or a keyword
nutrition.filter(regex ='(?i)e')
Filtering along both axis simultaneously:
# To filter elements having "e" or "E" along both axes
nutrition.filter(regex ='(?i)e', axis=0)\
.filter(items=['serving_size', 'calories'], axis=1)
DataFrame Sorting
The sort_values() method is used to sort Series and DataFrames. With DataFrames, when using the sort_values() method we must provide the by parameter to avoid ambiguity over what to sort by, since the DataFrame contains several columns.
Let us consider a sample DataFrame:
Now let us sort this DataFrame in ascending order with respect to the calorie content in a food item.
# To sort elements in ascending order by calories.
nutrition.sort_values(by=['calories'], ascending=True)
Similarly, if we want to sort the DataFrame by multiple columns, we can do it as:
# To sort elements by calories and total_fat in ascending and descending order respectively.
nutrition.sort_values(by=['calories', 'total_fat'], ascending=[True, False])
Using between() function with DataFrames
The between() function checks which elements of a given Series lie within a specified range. It returns a Series of boolean elements that are True wherever the corresponding element is between the left and right boundary values.
Example:
# Finding the elements containing calories between 20 and 300
nutrition.calories.between(20,300)
# Output
name
Cornstarch False
Nuts False
Eggplant True
Cheese True
Goji berries False
Crackers False
Peppers True
Grapes True
Name: calories, dtype: bool
Now if we want to print all the elements lying between a given range, we can apply the following approach:
# Finding and printing the elements containing calories between 20 and 300
nutrition[nutrition.calories.between(20,300)]
The min, max and idx Functions
The min(), max(), and idxmin()/idxmax() methods are applicable to DataFrames along both the row and column axes, which can be specified via parameters.
Taking the sample DataFrame as:
The min() Function
The min() method returns the minimum value of the elements over the requested axis.
Using the min() function without providing any parameters:
# Using min() function without any parameters
nutrition.min()
# Output
serving_size 100.0
calories 25.0
total_fat 0.1
dtype: float64
To compute the minimum across each row (horizontally), we pass axis=1 (this assumes the total_fat column has already been converted to float as shown earlier):
# Using min() function along the rows without any parameters
nutrition.min(axis=1)
# Output
name
Cornstarch 0.1
Nuts 72.0
Eggplant 0.2
Cheese 24.0
Goji berries 0.4
Crackers 7.2
Peppers 0.4
Grapes 0.5
dtype: float64
The max() Function
The max() method returns the maximum value of the elements over the requested axis.
Let us find the maximum amount of calories in a food item:
# Using max() function to find the maximum amount of calories
nutrition.calories.max()
# Output
691
The idxmax() Function
The idxmax() method returns the index of the first occurrence of the maximum over the requested axis.
In the above example, we found the maximum amount of calories in a food item. The idxmax() function can be used to return the index of the food item with the maximum calories.
# Using idxmax() function to find the index with maximum amount of calories
nutrition.calories.idxmax()
# Output
'Nuts'
The nlargest and nsmallest Functions
The nlargest() Function
The nlargest() function returns the first n rows with the largest values in the specified columns, in descending order. Columns that are not specified are also returned but are not used for ordering.
Let us understand it with an example on our sample DataFrame:
# Finding the first 8 rows with largest values of "total_fat"
nutrition.nlargest(8, columns='total_fat')
The nsmallest() Function
The nsmallest() function returns the first n rows with the smallest values in the specified columns, in ascending order. Columns that are not specified are also returned but are not used for ordering.
Let us understand it with an example on our sample DataFrame:
# Finding the first 8 rows with lowest of "total_fat" and "calories"
nutrition.nsmallest(8, columns=['total_fat', 'calories'])
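Since the full nutrition DataFrame is not reproduced here, the ordering behavior can be sketched on a small frame built from a subset of the sample values:

```python
import pandas as pd

nutrition = pd.DataFrame({'calories': [381, 691, 25, 300],
                          'total_fat': [0.1, 72.0, 0.2, 24.0]},
                         index=['Cornstarch', 'Nuts', 'Eggplant', 'Cheese'])

# Three rows with the highest calories, in descending order of calories
top = nutrition.nlargest(3, columns='calories')
print(top.index.tolist())

# Two rows with the lowest total_fat, in ascending order of total_fat
low = nutrition.nsmallest(2, columns='total_fat')
print(low.index.tolist())
```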