Advanced DataFrames

We will take a closer look at the advanced indexing using binary operators and multiple boolean conditions, exploring some computer science concepts which include bitwise operators and how modern computers represent numbers.

Sorting, indexing, and looking up values to reorder our records into customized logic. We will be discussing pruning operations related to duplicate and null values, and how to reshape our DataFrame.

Table of Contents

1. Indexing with Boolean Masks

2. More approaches to Boolean Masking

2.0.1. Using the isin Function

2.0.2. Using the between Function

2.0.3. Using the comparison functions

3. Binary Operators with Booleans

3.0.4. Binary OR operator

3.0.5. Binary AND operator

3.0.6. Binary XOR operator

3.0.7. The Two’s Complement

4. Combining conditions using operators

5. Conditions as Variables

6. 2D Indexing

6.0.8. The loc parameter method

6.0.9. The startswith method

6.0.10. The chaining method

7. Fancy Indexing with Lookup

7.0.11. Loc vs the Lookup Method

8. Sorting by Index and Column

9. Sorting vs Reordering a DataFrame

10. Identifying Duplicates in a DataFrame

10.0.12. Specifying subsets to check duplicates

10.0.13. Labeling elements as duplicates

11. Removing Duplicates in a DataFrame

12. Removing DataFrame Rows

13. Removing DataFrame Columns

13.0.14. Using the drop method

13.0.15. Using the pop method

13.0.16. Using reindexing

14. Null Values in a DataFrame

14.0.17. Calculating the total null values in a DataFrame

14.0.18. Finding out all the null records

15. Handling DataFrame Null Values

15.0.19. Filling or replacing the null values

15.0.20. Excluding or dropping the null values

16. Calculating Aggregates in a DataFrame

17. Same Shape Transforms

18. The apply Method in DataFrame

19. Element-wise Operations with applymap()

20. Setting DataFrame Values

20.0.21. Using label based indexing (loc)

20.0.22. Using position based indexing (iloc)

21. The SettingWithCopy Warning

22. View vs Copy

22.0.23. Views in a DataFrame

22.0.24. Copies in a DataFrame

23. Adding DataFrame Columns

23.0.25. Using the assignment operator

23.0.26. Using the insert method

23.0.27. Using the assign method

24. Adding Rows to DataFrame

24.1. Using the append method

24.1.28. Passing Series object as parameter

24.1.29. Passing list of Series objects as parameters

24.1.30. Passing DataFrame object as parameter

24.2. Using the loc/iloc indexers

Indexing with Boolean Masks

We have already discussed indexing with boolean masks for DataFrames before. The key concept while using a boolean mask is a 2 step process:

STEP 1: To generate a sequence of booleans.
STEP 2: Using this boolean sequence in [] or .loc[]

Considering a large sample DataFrame of famous soccer players:

Now let us consider, we need to find out the players who have a market value greater than 60 million.

STEP 1: Generating the sequence of booleans.

# Using the comparison operator we obtain a boolean series
df.market_value > 60

# Output
0       True
1      False
2      False
       ...  
463    False
464    False
Name: market_value, Length: 465, dtype: bool

We have now obtained a series of booleans. To complete this selection, we simply need to pass this boolean series into the DataFrame.

STEP 2: Using this boolean sequence in [] or .loc[]

# Passing the boolean series into the DataFrame
df[df.market_value > 60]

# To find out the total number of players fitting into the category
df[df.market_value > 60].shape

# Output
(5, 17)

More approaches to Boolean Masking

Using the isin Function

Now let us take another example, suppose we need to find all the defenders in our DataFrame. (We have our defender codes as LB, CB, and RB.)

# Checking whether the position is in the array of defender codes.
df.position.isin(['LB', 'CB', 'RB'])

# Output
0      False
1      False
2      False
       ...  
463    False
464    False
Name: position, Length: 465, dtype: bool

Now, we need to pass on this boolean sequence into our DataFrame.

# Passing the boolean series into the DataFrame
df[df.position.isin(['LB', 'CB', 'RB'])]

Using the between Function

Taking another example, let us consider we need to find out the players having a market value between 40 and 50 million. We will do this using the between() function.

# Using the comparison operator we obtain a boolean series.
df.market_value.between(40,50)

# Output
0      False
1       True
2      False
       ...  
463    False
464    False
Name: market_value, Length: 465, dtype: bool

Now we will pass this to the DataFrame to print all the players having a market value between 40 and 50.

# Passing the boolean series into the DataFrame
df[df.market_value.between(40,50, inclusive=False)]

Using the comparison functions

The Pandas library has functions for every comparison operator we use. The comparison operators and their corresponding Pandas methods are given in the following table.

Comparison Operators	Pandas Method
<	.lt()
<=	.le()
>	.gt()
>=	.ge()
==	.eq()

Taking an example, let us assume we need to find all the players having an age less than equal to 25.

# Checking whether the position is in the array of defender codes.
df.age.le(25)

# Output
0      False
1      False
2      False
       ...  
463     True
464    False
Name: age, Length: 465, dtype: bool

Passing the obtained boolean sequence into the DataFrame.

# Passing the boolean series into the DataFrame
df[df.age.le(25)]

Binary Operators with Booleans

Binary operators are just like the other operators like +, -, *, / etc. The only difference is that the binary operators work on the binary representation of the value, on the individual bits.

Two of the major operators are:

Binary OR – |
Binary AND – &

Binary OR operator

The truth table for the Binary OR operator is given as follows:

A	B	A\|B
0	0	0
0	1	1
1	0	1
1	1	1

Binary AND operator

The truth table for the Binary AND operator is given as follows:

A	B	A\|B
0	0	0
0	1	0
1	0	0
1	1	1

Let us create a Pandas boolean series, having alternative True and False.

# Creating a boolean series
t = pd.Series([True if i%2 ==0 else False for i in range(10)])
t

# Output
0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
9    False
dtype: bool

To try out the boolean operators on the series, we create another boolean series having all False.

# Creating another boolean series
f = pd.Series([False for i in range(10)])
f

# Output
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

Using the Binary OR operator on the two series:

# t OR f
t | f

# Output
0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
9    False
dtype: bool

Using the Binary AND operator:

# t AND f
t & f

# Output
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

There are several other binary operations, two of them we are going to discuss are given below:

XOR – The Exclusive OR
Complement

Binary XOR operator

The XOR operator, which stands for Exclusive OR. It gives True only when the inputs are different and False when the inputs are similar. The XOR operator in python is represented as ^ (caret symbol).

The truth table for the Binary XOR operator is given as follows:

A	B	A^B
0	0	0
0	1	1
1	0	1
1	1	0

# t XOR f
t ^ f

# Output
0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
9    False
dtype: bool

The Two’s Complement

The Two’s Complement is represented as ~ (tilde symbol) in python. This operator is very useful in Boolean Series, for finding the opposite of the series. All of these operators act on the Bits of a number.

Let us take an example to understand it further:

# Taking two's complement
~ True

# Output
-2

~ False

# Output
-1

This provides some surprising results because it is related to how computers represent numbers. It gives the binary complement the number. It inverts the bits representing the number, as shown in the figure below:

Combining conditions using operators

While performing multi-condition selections, we combine several boolean series using the binary operators, discussed above. Let us understand this using an example.

Suppose we need to find all the left-backs who are 25 or younger. We will do this using two conditions and combine them using the AND binary operator.

# Combining conditions using AND operator
df[(df.position == 'LB') & (df.age <= 25)]

NOTE: Always remember to wrap the two conditions, connected by the binary operator, in separate parenthesis.

Let us take another example, let us say we want to retrieve the players who are left-backs, 25 years or younger, having a market value above 10 million, and who are not from Arsenal or Tottenham clubs. Now we have four conditions to look for:

# Combining four conditions using AND operator
df[
    (df.position == 'LB') &
    (df.age <= 25) &
    (df.market_value >= 10) &
    ~(df.club.isin(('Tottenham', 'Arsenal')))
]

Conditions as Variables

The multi-line expression we came across in the above example is quite difficult to read. To avoid such cases we can always refactor our conditions into stand-alone variables. Especially when we need to compose an indexing logic that involves several conditions, it is always a good idea to refactor.

Let us take another example to understand using conditions as variables. Let’s consider we need to find all the Arsenal right-backs and Chelsea goalkeepers.

STEP 1: Let us create a variable as arsenal_player which stores the boolean series for all the players belonging to the Arsenal club.

# Creating the 'arsenal_player' variable
arsenal_player = df.club == 'Arsenal'

arsenal_player

# Output
0       True
1       True
2       True
       ...  
463    False
464    False
Name: club, Length: 465, dtype: bool

STEP 2: We will be creating another variable to store the boolean list of all the players who are on the position “RB” as right_back.

# Creating the 'right_back' variable
right_back = df.position == 'RB'

right_back

# Output
0      False
1      False
2      False
       ...  
463    False
464    False
Name: position, Length: 465, dtype: bool

STEP 3: For the Chelsea Goalkeepers part, we are going to combine both the conditions to one variable. Therefore, we make another variable as chelsea_and_GK which is for the players who are Goalkeepers from the Chelsea club.

# Creating the 'right_back' variable
right_back = df.position == 'RB'

right_back

# Output
0      False
1      False
2      False
       ...  
463    False
464    False
Name: position, Length: 465, dtype: bool

STEP 4: Now all we need to do is to index the DataFrame by passing the above conditions in square brackets or the loc indexer.

# Indexing the DataFrame using the conditions as variables
df.loc[arsenal_player & right_back | chelsea_and_GK]

2D Indexing

In the last section, we have discussed Boolean Indexing in the context of extracting rows. Our indexing has only operated on the index axis, the axis zero, which is the default behavior in Pandas.

When working with DataFrames, we are working with two dimensions, and therefore we certainly have an option of indexing along the other axis as well.

The loc parameter method

Let us take an example, consider we need to find the players from the Chelsea club who are 23 years or younger.

STEP 1: Let us begin my forming the conditions and storing them in a seperate variable chelsea_23under

# Creating the 'chelsea_23under' variable
chelsea_23under = (df.club == 'Chelsea') & (df.age.le(23))

chelsea_23under

# Output
0      False
1      False
2      False
       ...  
463    False
464    False
Name: position, Length: 465, dtype: bool

STEP 2: To do the selection, we simply pass the above variable to the square brackets in the loc condition.

# Printing all the chelsea_23under players.
df[chelsea_23under]

STEP 3: In order to index the retrieved data according to the required columns, we specify the columns in a list as a parameter to the loc attribute.

# Creating the 'chelsea_23under' variable with column parameters.
df.loc[chelsea_23under, ['position', 'market_value']]

This is the most intuitive way to go about the selection from the column axis, but it is not very flexible. We have another method to index the DataFrame according to the column axis.

The startswith method

Taking another example where we need to find out all the column names that begin with ‘p‘.

STEP 1: We use the DataFrame.columns attribute and use the startswith function, that will go through all the column labels and check whether they start with ‘p’ and returns a boolean list. Assign this list to a variable ‘p_cols‘.

# Creating the 'p_cols' variable
p_cols = df.columns.str.startswith('p')

p_cols

# Output
array([False, False, False,  True,  True, False,  True, False, False,
       False, False, False, False, False, False, False, False])

STEP 2: Now all we need to do is to pass this variable to the loc indexer as a parameter.

# Creating the 'chelsea_23under' variable with column parameters.
df.loc[chelsea_23under, p_cols]

The chaining method

The third method to index according to the columns is the chaining method, where we chain two square brackets.

# Chaining two square brackets for indexing.
df[chelsea_23under]['position']

# Output
110    CF
111    CB
112    LB
115    DM
Name: position, dtype: object

Here, the first square bracket selects the DataFrame slice, containing all the columns and the second square bracket selects a single column from that.

This may appear to be equivalent to the loc method but in reality, the bracket method chaining method will always be slower, since the method get_item gets called twice. Therefore it is recommended to avoid using the chaining method as it may affect the runtime of the program.

Fancy Indexing with Lookup

The fancy indexing involved passing multiple labels all at once. The fancy indexing is similar to the basic label-based indexing, but instead of using single labels, we specify a list or a tuple of labels.

We will be talking about a Pandas method which is dedicated to fancy indexing and item selection and is more effective and quicker. This method is called lookup.

The lookup method takes a specific or a list of lookup labels and column labels and then it returns the value or values associated with each pair of labels.

Let us take an example where we need to find the value corresponding to the index label 450 and the column label ‘age‘.

# Fancy indexing using the lookup method
df.lookup([450], ['age'])

# Output
array([30], dtype=int64)

Loc vs the Lookup Method

We will consider a basic label-based selection using the loc indexer with some fancy indexing since we will be passing a list of index labels as well as a tuple of column labels.

# Fancy indexing using the loc method
df.loc[[0, 132], ('name', 'market_value')]

It gives us back a 2×2 DataFrame slice. Now let us compare this with what we get from the lookup method.

# Fancy indexing using the lookup method
df.lookup([0, 132], ['name', 'market_value'])

# Output
array(['Alexis Sanchez', 6.0], dtype=object)

Using the lookup method, we do not get back a DataFrame. Instead, we have an array of two items containing only the values that correspond to the row and column labels that we specified above.

Thinking of the parameters as label coordinates, the specifier row, and specifier column, and for that pair of labels, we get back a specific value.

So in the above lookup function, we have done this twice:

Once with the 0 index label and the column ‘name’. Which gives us ‘Alexis Sanchez’, the first item in the array.
The second time with the index label as 132 and the column ‘market_value’ and since they intersect at the value 6.0 which becomes our second item.

In reality, the lookup method is most useful when we already have a collection of labels that we want to select to do the DataFrame selection.

Sorting by Index and Column

Sorting by a particular value is not the only way to sort the DataFrame. We occasionally also find the need to sort the DataFrame by index, usually for the indices we create.

Let us set the index of our existing DataFrame using the set_index method.

# Setting Index of the DataFrame
df.set_index('name', inplace = True)

# Printing the DataFrame
df

Now to order the Index according to the alphabetical order. In Pandas, it is very quick and easy to fix the unordered index using the sort_index method.

# Sorting the index of the DataFrame
df.sort_index(inplace = True)

# Printing the DataFrame
df

If we want the changes to be permanent in the DataFrame, we can set the inplace parameter to be True.

The sort_index method also helps us to sort our columns. Now, in addition to ordering our index axis, we also want to sort our column axis. We can do this using the sort_index method, all we have to do is to change the axis parameter, from the default of 0 to 1.

# Sorting the index of the DataFrame
df.sort_index(axis=1)

# Printing the DataFrame
df

Sorting vs Reordering a DataFrame

We have sorted the DataFrame by values, index, and columns. Both the sort_values and sort_index methods contain several parameters that enable us to customize the specifics of how we want to carry out the sorting.

We have two fundamental ways of sorting:

Ascending or in alphabetical order
Descending order or in non-alphabetical order

Now, if we want to reorder more precisely in a DataFrame, according to a specified order, we can use the reindex method.

Let us begin by isolating a chunk of our DataFrame, using the iloc indexer and store it in a variable as df_lite.

# Slicing the DataFrame
df_lite = df.iloc[:4, :4]

# Printing the Sliced DataFrame
df_lite

Let us define a format we want our data to be reflected in, let’s say:

row order: 2, 1, 3, 0
column order: age, name, position, club

We cannot do this using the sorting method, so have to use the reindexing method.

# Reindexing using the format defined
df_lite.reindex(index = [2, 1, 3, 0], columns = ['age', 'name', 'position', 'club'])

Even if we use the same command on the entire DataFrame, we will get the same result without any errors.

# Reindexing the entire DataFrame using the format defined
df.reindex(index = [2, 1, 3, 0], columns = ['age', 'name', 'position', 'club'])

Now, let us suppose we need to get all the columns and have them alphabetically ordered.

# Reindexing to get all columns alphabetically ordered.
df.reindex(index = [2, 1, 3, 0]).sort_index(axis=1)

Identifying Duplicates in a DataFrame

In real-world scenarios, we often have to work with imperfect DataFrames, having duplicate values. This is why tools like Pandas are so popular among data analysts and scientists for handling, manipulating, and organizing such data.

Our example DataFrame does contain some duplicate records too. Duplicate values would be when the records have the same values across all columns.

This might cause a lot of problems, let’s say we need to find an average or aggregate calculation. If we have the same redundant value showing up multiple times, the resulting calculation will come out incorrectly.

An easy way to identify duplicate records in Pandas is by using the duplicated method.

# Identifying duplicate records using the duplicated function.
df.duplicated()

# Output
0      False
1      False
2      False
       ...  
463    False
464    False
Length: 465, dtype: bool

This method returns a boolean series. Now using the boolean masking method, we will pass this boolean series to select the duplicated items from the DataFrame.

# Selecting all the duplicated elements from the DataFrame
df[df.duplicated()]

Specifying subsets to check duplicates

Sometimes we might fall into situations where we have duplicate values but a different value in a particular column that is making it look unique. In such cases, we need to have the ability to define precisely what constitutes a unique value.

In Pandas, we have an option to specify the subset of columns that should be considered to check the redundancy of the elements in a DataFrame. This can be achieved using the subset parameter.

Let us define that we need to characterize the unique elements by their club, age, position, and the market_value.

# Selecting all the duplicated elements using a subset
df.duplicated(subset = ['club', 'age', 'position', 'market_value'])

# Output
0      False
1      False
2      False
       ...  
463    False
464    False
Length: 465, dtype: bool


# Printing the duplicate elements
df.loc[df.duplicated(subset = ['club', 'age', 'position', 'market_value'])]

Labeling elements as duplicates

Now comes the question, how to find out which occurrence is original and which ones are the duplicate. Most often, also the default for Pandas, we treat the first occurrence of the element as original and every following similar index as the duplicate.

But this is not the only option available for us. If we order by a different parameter, the order of the appearance of the duplicates might change, which implies that this assumption has no intrinsic logic of its own.

To change this, we can use the keep parameter.

# Keeping the first occurrence of the duplicates
df.loc[df.duplicated(subset = ['club', 'age', 'position', 'market_value'], keep='first')]

The keep parameter takes in first, last, or False. The keep parameter by default is first, which means that Pandas considers the first value as the original and any duplicate value following it is duplicate.

Another option we have is last, which is exactly the opposite of the first. It means that in a set of repeating values, the last will be considered as the original value, and everything preceding it will be marked as a duplicate.

NOTE: The duplicated method parameters first and last does not change the shape of the DataFrame. We are only affecting the labeling of the DataFrame using this method.

There is also a third option which is boolean False. Setting the keep parameter to false indicates Pandas that all repeating values need to be treated as duplicates. Labeling keep to false does change the shape of the DataFrame since it treats all values to be duplicated.

Removing Duplicates in a DataFrame

We can remove duplicate values in a table using the drop_duplicates method, which returns the copy of the DataFrame where the duplicate values have been removed.

This method also has a keep parameter that takes in first, last, or False.

# Dropping duplicates keeping the first occurrence
df.drop_duplicates(keep = 'first')

Now after removing the duplicates if we want to extract a particular column and perform aggregate calculations, we use the following approach.

For example, we need to find the aggregate market value of all the soccer players in the table after removing the duplicates.

# Dropping duplicates and finding the mean of market_value
df.drop_duplicates(keep = 'first').market_value.mean()

# Output
11.026252723311545

Let us compare the original mean and the mean after dropping the duplicates.

# Finding the original mean of market_value
df.market_value.mean()

# Output
11.125649350649349

This means that both of the means differ by almost 0.1 which is 1%, which is not a huge change in this case but it could be a lot worse in the case of DataFrames having a large number of duplicates or duplicates having extreme values.

Removing DataFrame Rows

Another approach for removing duplicates can be with the help of removing DataFrame rows. Firstly, we would have to identify the records that we want to remove and exclude all or some of them separately.

In our example DataFrame, we have 3 duplicate values. Let us assume, we want to remove one of these values and keep the other two. In that case, the drop_duplicates method won’t work here.

We can simply rely on the drop method to remove the specified number of rows from our DataFrame.

# Dropping the row having index 13
df.drop(labels = 13, axis = 0)

# Can also be written as df.drop(index = 13)

The drop method returns a copy of the DataFrame where the specified row is removed. If we have multiple rows to be removed, we can simply pass a list of the indices of all the rows as a parameter.

Removing DataFrame Columns

Using the drop method

We can use the drop method to remove columns of the DataFrame as well. All we need to do is to change the axis parameter of the drop method to 1 or ‘columns‘.

Let us take an example to remove the columns ‘age‘ and ‘market_value‘ from our sample DataFrame.

# Dropping the columns having index 'age' and 'market_value'
df.drop(labels=['age', 'market_value'], axis=1)

# Can also be written as df.drop(columns=['age', 'market_value'])

Using the pop method

The pop function takes in the column label we need to remove from the DataFrame. There are 3 points of difference to note about using this method:

The pop method does not return the DataFrame with the column removed as in the drop method. Instead, it returns the removed column values in the form of a series.
The pop function works at a single column at once.
The pop function modifies the underlying DataFrame. Which means the function operates inplace.

Let us assume we need to permanently remove the ‘club‘ column from the DataFrame. We can use the following pop command to perform this function.

# Permanently dropping the 'club' column
df.pop('club')

# Output
0       Arsenal
1       Arsenal
2       Arsenal
         ...   
463    West+Ham
464    West+Ham
Name: club, Length: 465, dtype: object

Using reindexing

Another method of removing columns and rows is reindexing the DataFrame by leaving out several unwanted rows or columns. The reindex method crafts a new DataFrame which only reflects the labels we specify.

Let us consider we need to slice a DataFrame that does not include labels page_views, region, and position.

STEP 1: We define a new variable ‘unwanted_values‘ and assign a list of all the column or row values we want to exclude from the new DataFrame, which in this case are page_view, region, and position.

# Defining the unwanted variables.
unwanted_values = ['page_views', 'region', 'position']

STEP 2: Reindexing using the DataFrame with the difference of the original set of indices and the specified set, i.e. the unwanted_values.

# Reindexing using the Difference
df.reindex(columns = set(df.columns).difference(unwanted_values))

Null Values in a DataFrame

Null Values or the Null Markers, more commonly known as NaN refer to the gaps in the data of a DataFrame. The isna method is used to return a boolean series with True for all the elements containing a null value.

When using the isna method on a DataFrame, we get back a new DataFrame where each value is removed and replaced by a boolean indicating whether the value is NaN or not.

Calculating the total null values in a DataFrame

To quickly count the number of NaN in the DataFrame, we can use a very performant NumPy function count_nonzero.

# Function to count all the NaNs in a DataFrame.
np.count_nonzero(df.isna())

Finding out all the null records

To find out all the elements containing null values, first, we need to extract the values from the boolean DataFrame to create a boolean array of arrays.

STEP 1: To get an array of arrays we use the following isna command.

# Finding an array of boolean arrays
df.isna().values

# Output
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

STEP 2: Now we can pass on the above array of boolean arrays to the DataFrame to retrieve all the elements having null values.

# Finding the elements having null values
df[df.isna().values]

Handling DataFrame Null Values

Filling or replacing the null values

Null values usually indicate a gap in the data, this could be the information that was lost, not available, or incorrectly recorded. So, we can fill or replace these null values with something more meaningful and the fillna method is really useful to accomplish that.

The fillna method takes in the new value we need to replace the null values with, and returns a copy of the DataFrame.

# Replacing all the null values in the DataFrame
df.fillna('something meaningful').loc[[30, 192, 195]]

In reality, our needs might be a bit more sensitive to the context. For example, if we want any null in the market_value should at least be replaced with a numerical value. To be able to do this, we could use a dictionary syntax of the fillna method.

Therefore, instead of passing a string, we pass on a dictionary that has different values associated with different keys.

# Replacing all the null values in market_value and position.
df.fillna({
    'market_value': 100,
    'position': 'RM'
}).loc[[30, 192, 195]]

Excluding or dropping the null values

An entirely different approach to deal with null values is to simply exclude them. For this, we have the dropna method. The dropna method on a DataFrame returns a new DataFrame which excludes rows containing the null values. This means the entire DataFrame rows are removed if any of the values contain a single null.

# Dropping all the null values in market_value and position.
df.dropna(axis = 1, how = 'any').loc[[30, 192, 195]]

Calculating Aggregates in a DataFrame

The aggregate functions become useful when we wish to group our values along with several attributes. The agg method is the short form for the aggregate and it accumulates and aggregates the output of the function that we specify and collects the entire axis in a single value.

The agg function can be used to find the following:

Mean
Median
Product
Sum
Min
Max
Standard Deviation
Variance
Mean Absolute Deviation

Let us find out the aggregate mean function on the entire DataFrame.

# Finding the mean of all columns using the agg function.
df.agg('mean')

# Output
age              26.776344
position_cat      2.178495
market_value     11.125649
...
club_id          10.253763
big_club          0.309677
new_signing       0.144086
dtype: float64

We can also pass in a list of agg functions, which would further give us a DataFrame where the values corresponding to every agg function across all numeric columns are reflected.

For example, we want to find an aggregate of min, max, and mean for all the numeric columns, we use the following approach to retrieve a DataFrame.

df.select_dtypes(np.number).agg(['min', 'max', 'mean'])

The select_dtypes function helps us to select distinct data types from the DataFrame.

Same Shape Transforms

Pandas also has a dedicated method used for applying constant functions to the entire DataFrame. The transform method also guarantees that the original shape of the DataFrame doesn’t change during the transformations.

Let’s say we want to convert our market_value and fpl_value from one currency to another, from USD to Euros.

STEP 1: We need to have a conversion rate or an fx rate and isolate the slice of the DataFrame we want to perform the transformation on.

# usdeur conversion rate = 0.85

# We need to perform the transformation on the following slice
df.loc[: , ['market_value', 'fpl_value']]

STEP 2: Applying the transform function with an inline lambda function, the x represents the entire column, the market_value and the fpl_value are multiplied by their fx rate.

# Applying the transformation with an inline lambda function
df.loc[: , ['market_value', 'fpl_value']].transform(lambda x: x * 0.85)

# This can also be done without using the transform function
df.loc[: , ['market_value', 'fpl_value']] * 0.85

The apply Method in DataFrame

We use the apply method whenever we want to apply a given method to an entire DataFrame, a row or column at once.

For example, we wish to modify all our floating point columns and round them to the closest integer without actually converting them.

STEP 1: We can start by defining a function round_floats, which will take a single parameter. Inside the function we will check the type of the variable, if it is a float we return the rounded version of that series, and if it isn’t we simply return the series unchanged.

def rounded_floats(x):
    if x.dtype == np.float64:
        return round(x)
    
    return x

STEP 2: Applying rounded_floats function to all the float columns.

# Applying the rounded_floats function and using head method to print
df.select_dtypes(np.float64).apply(rounded_floats).head()

NOTE: The apply function is very similar to the transform method. The only difference is that the apply method is a bit more flexible since it supports aggregation as well as the same shape transform. The apply function is a combination of the transform method and the aggregate method.

The apply function is very advanced and flexible and supports two kinds of operations.

The ones that change the shape and dimensions of the underlying DataFrame, which would be the aggregate or the agg function.
The ones that apply inplace transformations, that would identically return the same shape as that of the original DataFrame, which would be the transform function.

Now if we wanted to aggregate our numerical columns, the typical way of doing that is by using the agg method. But now we know that the apply method can also act as the agg method.

We could simply apply the following command to find out the mean across all columns of the DataFrame:

# Finding the mean using the apply function
df.apply('mean')

# Output
age              26.776344
position_cat      2.178495
...
club_id          10.253763
big_club          0.309677
new_signing       0.144086
dtype: float64

Element-wise Operations with applymap()

The agg, transform, and apply methods operate on entire columns or rows at once. These take advantage of the concept of vectorized operations, which is a NumPy feature that allows us to realize huge performance gains by applying the given operation on a set of values all at once.

Occasionally we might be working on some function whose logic depends upon accepting and operating on an individual value, i.e. operating one item at a time. For such cases, Pandas offers the applymap function.

Let us take an example that we want to adjust our players’ market_value and the fpl_value for inflation (a general increase in the price). So we want to increase our market_value and fpl_value by 2%.

To make it more non-vectorized, let us assume we also want to know precisely when each 100th value was adjusted for inflation. So for every 100 values, we want the function to print out the timestamp for when the transformation took place.

STEP 1: Since the values have to be appreciated by 2%, we will have to multiply our existing values by 1.02 therefore, saving it in a variable as inflation.

inflation = 1.02

STEP 2: We will seperate the DataFrame slice we want to modify and store it in a seperate variable mini_df.

mini_df = df.loc[: , ['market_value', 'fpl_value'] ]

# Printing the DataFrame slice
print(mini_df)

STEP 3: Now we need to define our custom function as the log_and_transform which takes in a single parameter. We also set a counter to keep the track of every 100th element. Now increment the counter by 1, and if the counter is divisible by 100 we print the timestamp.

from datetime import datetime
counter = 0

def log_and_transform(x):
    global counter
    counter += 1
    if counter % 100 == 0:
        print(f"It's {datetime.now()} and I just adjusted the {counter}th value.")
    
    return x * inflation

STEP 4: Now applying this log_and_transform function to our DataFrame slice mini_df.

mini_df.applymap(log_and_transform)

The log_and_transform function is impossible to vectorize without losing functionality. It only fulfills its functions when the values are passed one at a time.

Setting DataFrame Values

Sometimes we do not want a large-scale operation to transform the entire row or column at once. We just want to spot-change our data, to modify a single specific value. We can easily do this using the Assignment Operator (=).

Let us try to understand with the help of an example, suppose we want to change Theo Walcott’s position from RW to CM in our DataFrame.

Using label based indexing (loc)

We can do this by locating the element using the loc indexer and then using the assignment operator to set it to a new value.

# Assigning a new value using loc indexer
df.loc[3, 'position'] = 'CM'

Using position based indexing (iloc)

Another way of doing this is through position-based indexing using the iloc indexer.

# Assigning a new value using iloc indexer
df.iloc[3, 3] = 'CM'

However, the at and iat indexers should be preferred for single-value assignments. The at and iat indexers support only one type of syntax and therefore have much less overhead in their function definitions. This means that at and iat indexers are extremely fast as compared to loc and iloc.

The SettingWithCopy Warning

Pandas may throw a warning when we try to update or set values in our DataFrames. This is encountered when we use a different approach to change values of elements, using the square bracket indexing method.

Now if we want to select a specific player’s page_views count, we can use another square bracket associated with his position in the DataFrame.

df['page_views']

# Output
0      4329
1      4395
2      1529
       ... 
463     412
464     214
Name: page_views, Length: 465, dtype: int64

Let us say we want to modify the page_views for the second index. Now that we’ve got the element to be updated, we can go ahead and assign it a new value.

df['page_views'][2] = 2001

However, we get the following warning – “A value is trying to be set on a copy of a slice from a DataFrame.” The reason why we are getting this warning is that Pandas cannot guarantee that we are working with the actual underlying DataFrame when we do indexing like this (Chained indexing).

When using chained assignment, Pandas cannot determine whether we are working with a copy of the data or a view of the underlying DataFrame. If we are working with a copy of the DataFrame, changes made to its elements will not be reflected in the actual data.

Changes may or may not be reflected, based on the structure of how the memory is stored. But this is not guaranteed. Hence, we get the mentioned warning.

Let us check whether the changes that we made above happened in the actual DataFrame or not.

df.head()

In this case, the changes we made can be seen in the actual underlying DataFrame. This indicates that we were working on its view, not a copy of the same.

Whenever performing an assignment when working with a copy of the DataFrame (like using the drop_dupicates() function), we will get the same warning message. Henceforth, no changes are applied to the original underlying data.

It is possible to turn off this warning manually by setting the chained assignment mode to none. The chained assignment mode in the options of Pandas is currently set to “warn”. We can set this to “None” to not get the warning message. However, it is considered good practice to consider the warning issued by Pandas.

pd.options.mode.chained_assignment = ‘None’

View vs Copy

Whenever we index, extract slices or apply a method to our DataFrame, we technically run into the question of whether we are working on the actual DataFrame or its copy.

Views in a DataFrame

A view is like a window into the underlying DataFrame. It gives us direct access to the data. We can update and modify the actual data when working with a view of the DataFrame.

Copies in a DataFrame

A copy, on the other hand, is a replica of the actual data. It may or be the same as the original DataFrame. Any changes made to a copy will not be reflected in the underlying DataFrame.

The problem arises when we have to determine by ourselves when we are working a view of the DataFrame or a copy. This can be determined if we make use of the following 2-point rule:

Pandas tends to give us copies of the DataFrame.
If we use loc/iloc or at/iat, we are guaranteed to get a view.

Let us consider an example where we use the loc indexer to modify the elements of the DataFrame and see whether or not the changes appear in the actual data.

# Modifying elements of the DataFrame
df.loc[0:3, ‘position’] = [‘CM’, ‘RW’, ‘CB’, ‘GK’]

df.head()

As we can see, the changes that we made in the view of the DataFrame were reflected in the underlying data.

In the above case, we used the iloc indexer for the DataFrame. Hence, we get a view of the DataFrame. However, if we use the same function on an attribute instead of the entire DataFrame, or some other function like drop_duplicates(), we will be breaking the rule.

In these cases, we will again receive the SettingWithCopy Warning from Pandas.

df['age'].iloc[1] = '12'

df.drop_duplicates().loc[3, 'position'] = 'CM'

Adding DataFrame Columns

We sometimes feel the need to expand our data set by adding new attributes. Let us discuss a few methods to do so.

Using the assignment operator

The easiest and the most common approach to adding new columns is by using the assignment operator on square brackets with a label that doesn’t yet exist.

Let us say, we want to add a new column MVtoFPL in our original DataFrame. So we will confirm that the column name MVtoFPL does not already exist by using the following command.

'MVtoFPL' in df

# Output
False

Now, we can simply use MVtoFPL in square brackets and set it to a default value.

# Introducing a new column with a default value
df['MVtoFPL'] = 1.0

# Checking if 'MVtoFPL' exists in the DataFrame
'MVtoFPL' in df

# Output
False

We can check towards the very end of the DataFrame, we see that the new attribute MVtoFPL is added.

Now, we need to update the value of this column to the ratio of market_value and fpl_value columns. The syntax is identical except now we’re setting it to a different value and not creating it.

# Setting the value of the new attribute
df['MVtoFPL'] = df['market_value'] / df['fpl_value']

df.head

Using the insert method

Another method for inserting new columns in the DataFrame is by using the insert function.

Now for this example, let us slice our DataFrame and store it in a variable df_mini.

# Slicing our original DataFrame
df_mini = df.iloc[:4, 1:5]

Let us assume we are working with this DataFrame and we need to add a new attribute as the name of the player. We will start by creating a series that contains the names of the players and store them as player_names.

# Creating a series of player names
player_names = pd.Series(['Bronson', 'Bradley', 'Ronald', 'Ronny'])

print(player_names)

# Output
0    Bronson
1    Bradley
2     Ronald
3      Ronny
dtype: object

Now, we want to insert the new set of values into our mini DataFrame df_mini using the insert function.

df_mini.insert(0, 'nicknames', player_names)

print(df_mini)

The insert method not only allows us to add a new column but also lets us specify exactly at which position we want our new column to be placed. We are also able to provide a name to our column, which in this case is nicknames.

Using the assign method

The assign method assigns new columns to an existing DataFrame and returns a new copy.

Let us create a new column as career_goals for the total goals scored by the player in his career.

df_mini.assign(career_goals = [12, 67, 179, 49])

NOTE: We can also add multiple methods using the assign method which we couldn’t use the insert method. We can add multiple columns using the following approach.

df_mini.assign(
        career_goals = [12, 67, 179, 49], 
        nationality = ['American', 'British', 'Turkish', 'Indian']
)

Adding Rows to DataFrame

We may come across situations where we need to expand our DataFrame by adding more rows to the same. There are various methods for achieving this.

Using the append method

The first method that we will go through is the append method. The append method works with Series, DataFrames, or a collection of them.

Passing Series object as parameter

Before appending the DataFrame, let us create the following Series. We’ll name the Series ‘4’ for now.

cristiano = pd.Series({
    'nicknames' : 'Cristiano',
    'age' : 32,
    'position' : 'RW',
    'club' : 'Juventus',
    'position_cat' : 1
}, name = 4)

We can confirm whether or not the object was created accordingly.

print(cristiano)

# Output
nicknames       Cristiano
age                    32
position               RW
club             Juventus
position_cat            1
Name: 4, dtype: object

Now that we have created our Series object, we can use the append function to add this object to the required DataFrame. In the following example, we will use the append function on the df_mini DataFrame and pass the object ‘cristiano’ as a parameter.

df_mini.append(cristiano)

Running this program will yield a copy of the DataFrame that includes the newly added Series object in the rows. The name of the Series we previously created becomes the index label for the appended row.

Passing list of Series objects as parameters

Furthermore, the append method also supports adding multiple Series objects at once to the DataFrame. For this, we need to create multiple Series objects and pass them as a list in the parameter for the function.

df_mini.append(player_1, player_2, player_3,…)

Passing DataFrame object as parameter

Instead of using the above way, we can create a DataFrame object. This object will hold the values of several players as lists simultaneously. Let us consider the following examples.

other_players = pd.DataFrame({
    'nicknames' : ['Gianluigi', 'Lionel'],
    'age' : [37, 32],
    'position' : ['GK', 'CF'],
    'club' : ['Juventus', 'Barcelona'],
    'position_cat' : [4,2]
}, index = [5, 6])

Henceforth, we can confirm whether or not the object was created.

print(other_players)

Now that we have created our DataFrame object, we can use the append function to add this object to df_mini. In the following example, we will use the append function on the df_mini DataFrame and pass the object ‘other_players’ as a parameter.

df_mini.append(other_players)

Using the loc/iloc indexers

This is another technique that we can use to add new rows to our existing DataFrame. We can accomplish this by simply assigning a value to a loc or iloc view.

In the following example, we will assign a value to a specific index in the loc indexer. The data to be added will be passed as a list.

df_mini.loc[9] = ['Ramos', 'PSG', 35, 'CB', 4]

We can confirm if the new row has been added or not by running the following command.

print(df_mini)