Multi-level Indexing in Pandas

We will be taking a closer look into the advanced Pandas indexing feature, the MultiIndex. We are going to build an understanding of modifying the axis, primarily the index, in order to support more than one level of labels. This allows us to reflect multi-dimensional datasets.

We will be working with hierarchical indices, techniques, and methods like slicing, sorting, swapping, shuffling, and pivoting index levels.

Introducing New DataFrames

We are going to explore a new dataset that contains daily stock price information on NASDAQ for the past five and a half years for five of the largest US technology companies, namely Apple, Facebook, Microsoft, Google, and Amazon.

Importing the DataFrame:

import numpy as np
import pandas as pd

tech = df = pd.read_csv("C:/Users/BHAVYA/Documents/tech_giants.csv")

tech.head()

To calculate for how many years the data has been recorded, we can use the following approach:

tech.year.value_counts()/5

# Output
2016    252.0
2014    252.0
2015    252.0
2017    251.0
2018    251.0
2019    163.0
Name: year, dtype: float64

Let us now set the index of the DataFrame from a range index to something more meaningful such as the date.

# Setting date as the Index
tech.set_index('date')

Table of Contents

Creating a Multi-Index

We use a multi-index when we need more than one field index for our DataFrame. A multi-index is also called a Hierarchical Index. Using a multi-index we create a hierarchy of indices within the data.

So, instead of date, we will pass in a list of strings. This will indicate to Pandas that we want all the column names to act as the index for our DataFrame.

tech.set_index(['date', 'name'], inplace = True)

The resultant DataFrame is a multi-index. It means that a single index object has more than one level or component to it. When we promoted the date and name columns as the index, they were removed from our DataFrame and they no longer show up as regular columns.

type(tech.index)

# Output
pandas.core.indexes.multi.MultiIndex

Multi-Index from read_csv Method

Previously, we had created a multi-index from our technology stocks dataset by first, reading in the data and then, using the set_index method to specify date and name as the two levels of our multi-index.

Now, we will be looking at an alternative way to arrive at the same DataFrame, but in a single step.

Let us use the read_csv method to read the data again, but this time we will rely on the index_col parameter in the read_csv method to indicate that we want a multi-index in the resulting DataFrame.

tech = pd.read_csv("C:/Users/Documents/tech_giants.csv", index_col = ['date', 'name'])

Another approach to creating a multi-index DataFrame would be to create a multi-index object separately and then pass it to the Pandas DataFrame constructor.

pd.DataFrame(data, index = MultiIndexObject)

Indexing Hierarchical DataFrames

Now that we have learned how to create multi-index DataFrames with various methods, we will practice how to extract data from the same.

To begin with, we will select data from our technology stocks DataFrame. In this DataFrame, the data attribute, combined with the name attribute makes up the label. The label uniquely identifies each element of the data.

tech.head(10)

For example, let us say we want to find out the opening price for Google on January 1^st, 2014. In other words, we will take the first day that we have in the DataFrame and retrieve the closing price of January 2^nd, 2014.

Using the loc indexer

STEP 1: Isolate the full dataset for that particular date, which is the outer label.

tech.loc['2014-01-02']

STEP 2: Extract the data for Google. This gives us a series containing all the values for Google at that date.

tech.loc['2014-01-02', 'GOOGL']

# Output
month                1
year              2014
day                  2
open            557.73
close           556.56
high            558.88
low             554.13
volume         1822719
volume_type     medium
Name: (2014-01-02, GOOGL), dtype: object

STEP 3: Append the close attribute in the given query to retrieve the closing price for the date. Hence, we get the closing price that was required.

tech.loc['2014-01-02', 'GOOGL'].close

# Output
556.56

Using the label in the loc indexer

Another method to accomplish the same result is to use the label that we have created. We can pass both the outer and inner labels of our DataFrame as a tuple to the loc indexer.

This will result in the seclusion of Google values at that particular date. Further, we will pass the close attribute for the desired output.

tech.loc[('2014-01-02', 'GOOGL'), 'close']

# Output
556.56

The iloc indexer does not consider the hierarchy of the data structure. This means that it won’t pay heave to whether the DataFrame has been created using multi-indexing or not. However, the loc indexer cares for the structure of the DataFrame.

Using the iloc indexer

If the same output needs to be generated using the iloc method, we can simply pass in the ordinal positions of the value to be retrieved as parameters.

tech.iloc[2,4]

# Output
556.56

Indexing Ranges and Slices

We looked at some ways to select values from a multi-index DataFrame, specifically by label or position, building on that knowledge, we will take a look at extracting sequences of slices of values.

Let us begin with the simple task of extracting multiple days from our tech stocks DataFrame. Using the loc indexer, if we need to select several dates, all we need to do is to pass in a list of labels containing those dates.

tech.loc[['2015-01-06', '2015-01-07']]

Pandas use the list of labels to extract from the outer level of our multi-index. If we simply want a subset of stock names we simply specify those in another list of index labels.

# Wrong Approach
tech.loc[['2015-01-06', '2015-01-07'], ['FB', 'AMZN']]

But we cannot pass on another list because Pandas throws an error “None of [Index([‘FB’, ‘AMZN’], dtype=’object’)] are in the [columns]” since the loc indexer perceives it as we are trying to index along the column dimension.

So instead, we need to wrap these two lists in a tuple containing two items, which is because our multi-index contains two levels.

tech.loc[(['2015-01-06', '2015-01-07'], ['FB', 'AMZN']), :]

Slices in a Multi-Index DataFrame

In regular DataFrames or Series, we can slice our data structures by specifying a range of values separated by a column.

tech.loc['2017-01-03':'2017-01-31', 'open':'low']

In this case, we are slicing across both dimensions of our dataset. We are selecting dates along with the outer level of our multi-index and we are also selecting a column slice from our column axis.

Now, if we want to get more specific and only isolate the Google data out of the five selections, we will be using the slice object.

To slice our DataFrame by a range of dates for only the Google stocks, we have to specify the date range in a stand-alone slice object, which is then wrapped in a tuple with our other index label, stock name.

tech.loc[(slice('2017-01-03', '2017-01-31'), 'GOOGL'), 'open':'low']

Now, let us assume, we need to find the open prices for FB and AMZN for all dates in our DataFrame.

tech.loc[(slice(None), ['FB', 'AMZN']), 'open']

# Output
date        name
2014-01-02  FB        54.86
2014-01-03  FB        55.00
                     ...   
2019-08-21  AMZN    1819.39
2019-08-22  AMZN    1828.00
2019-08-23  AMZN    1793.03
Name: open, Length: 2842, dtype: float64

Cross Sections With xs

So far, we have seen various approaches for indexing hierarchical DataFrames. All of them are quite powerful and flexible but not the most intuitive in terms of syntax.

The xs method, in terms of functionality, is a subset of label-based indexing. The syntax of this method is quite straightforward.

For example, let us say we want to get the cross-section of our DataFrame for the first trading day in the year 2019. This can be accomplished using xs in the following manner.

tech.xs('2019-01-02')

We have previously used the loc indexer to get the same result. However, now we want to select an element using the stock name instead. In this case, we will use the command mentioned below.

tech.xs('FB', level=1, drop_level=False)

By default, the xs method excludes the level upon which selection is based. Thus we have set the drop_level parameter to False to get the selected level in the result.

Next, we can also select from multiple levels of our index at once by passing the same-sized tuples to both the key and level parameters. In the following example, we have used the first and the second level to select from our multi-index.

tech.xs(('2019-01-02', 'FB'), level=(0,1))

The Anatomy of a Multi-Index Object

We will be discussing exactly how a multi-index object is structurally put together in Pandas which will be very useful while working with or manipulating hierarchical DataFrames.

tech.index

# Output
MultiIndex([('2014-01-02',    'FB'),
            ('2014-01-02',  'AAPL'),
            ...
            ('2019-08-23', 'GOOGL'),
            ('2019-08-23',  'AMZN'),
            ('2019-08-23',    'FB')],
           names=['date', 'name'], length=7105)

A multi-index is like a data structure in itself, it has its specific attributes, sequence of values, and methods. The multi-index object can support very complex structures and hierarchies in a self-contained way.

We can even take an object from one DataFrame and pass it on to another. Now let us take and discuss one component at a time.

Names

The names attribute returns the two labels that we are indexing by. In our case, it is name and date.

tech.index.names

# Output
FrozenList(['date', 'name'])

Levels

nlevels is a list that contains all the range of values for labels in our multi-index.

tech.index.nlevels

# Output
2

This should be consistent with the length of the levels list of indices, which is:

len(tech.index.levels)

# Output
2

The highest level in our index is considered to be the left-most, for example, the highest level in our DataFrame is the date.

Therefore, the outer-level is level 0, and inwards towards the actual columns the levels keep increasing, which is why name is level 1.

tech.index.levels[0]

# Output
Index(['2014-01-02', '2014-01-03', '2014-01-06', '2014-01-07', '2014-01-08',
       '2014-01-09', '2014-01-10', '2014-01-13', '2014-01-14', '2014-01-15',
       ...
       '2019-08-12', '2019-08-13', '2019-08-14', '2019-08-15', '2019-08-16',
       '2019-08-19', '2019-08-20', '2019-08-21', '2019-08-22', '2019-08-23'],
      dtype='object', name='date', length=1421)

This displays all the dates in our DataFrame.

tech.index.levels[1]

# Output
Index(['AAPL', 'AMZN', 'FB', 'GOOGL', 'MSFT'], dtype='object', name='name')

Levshape

If we want to look at the length of each of our levels, we can use the levshape attribute, which stands for level-shape.

tech.index.levshape

# Output
(1421, 5)

Now this means we have 1421 dates by 5 names.

Values

Multi-indices are not just freely floating sequences of labels, they represent a tight coupling of hierarchies of labels. An easy way to look at the label combinations across our index levels is to access the values attribute on the multi-index itself.

tech.index.values

# Output
array([('2014-01-02', 'FB'), ('2014-01-02', 'AAPL'),
       ('2014-01-02', 'GOOGL'), ..., ('2019-08-23', 'GOOGL'),
       ('2019-08-23', 'AMZN'), ('2019-08-23', 'FB')], dtype=object)

Here, we get a list of tuples each containing two items, a date, and a stock name. If we had a three-level multi-index, we would get the third value in the tuple as well.

The values attribute is the sequence of all possible associations across each level of our multi-index.

Adding a Level in a Multi-Index

We have been working with two levels till now, date and name. Now we will be adding a new level in our DataFrame, let us say the volume_type. The volume_type is set to:

high: If the volume is more than twice the average for that stock.
low: If it is less than twice the average for that stock.
medium: Otherwise.

There are a couple of ways to squeeze this categorical attribute within our multi-index.

Using the set_index method

Setting a new level using the set_index function is the most intuitive method.

tech.set_index('volume_type', append = True, inplace = True)

We are indicating in the command above, that we want the volume_type column to act as our index but not in a way that it replaces our existing index but rather, it appends it. The inplace parameter makes the changes stick to our original DataFrame.

The volume_type is now added to the third level of our multi-index.

tech.index.nlevels

# Output
3

# Isolating the 3rd Level
tech.index.levels[2]

# Output
Index(['high', 'low', 'medium'], dtype='object', name='volume_type')

tech.index.values

# Output
array([('2014-01-02', 'FB', 'medium'), ('2014-01-02', 'AAPL', 'low'),
       ('2014-01-02', 'GOOGL', 'medium'), ...,
       ('2019-08-23', 'GOOGL', 'medium'),
       ('2019-08-23', 'AMZN', 'medium'), ('2019-08-23', 'FB', 'medium')],
      dtype=object)

Since we introduced a new level, selecting by a label from the DataFrame will also look a bit different. Let us assume, we wish to select all the data from high volume trading days from January 2019.

We can still use the loc indexer but, now we need to specify a longer tuple.

tech.loc[(slice('2019-01-01', '2019-01-31'), slice(None), 'high'), :]

Taking another example, if we want to get all high-volume trading days. Using the cross-section method:

tech.xs(level=2, key = 'high')

Shuffling the Levels

We saw how to add additional levels to our multi-index, but if we are not satisfied with the order in which these levels appear within our multi-index.

Taking an example, what if we wanted to make the newly added level volume_type come immediately after the first level i.e. date but before name.

One way to do this is to read the data again and set the index in a different order. There is also a better and quicker approach that operates on our existing multi-index, the swaplevel method.

tech.swaplevel(2,1)

We call this method on a hierarchically indexed DataFrame and then we pass the two levels whose positions we want to be swapped or exchanged. This gives us a new multi-indexed DataFrame that has volume_type as the second level within the index.

The swaplevel method also works with label names instead of label positions. So, we can also say:

tech = tech.swaplevel('volume_type', 'name')

Unfortunately, this method does not have an inplace parameter, so if we want the changes to be applied to the underlying DataFrame we have to reassign our variable to point to new DataFrame returned by swaplevel.

In Pandas, there is a more powerful method that works with more than two levels at a time. What we have done so far is a simple positional swap of two levels, if we want to reorder more broadly the alternative is the reorder_levels method.

The reorder_levels method allows us to express a new order for the multi-index all in one go. We pass in a list of positions and we get a new DataFrame that follows that order.

tech.reorder_levels([2, 0, 1])

The reorder_levels method can be called on the hierarchically indexed DataFrame, as we did above, or on the multi-index directly. In that case, it returns a new multi-index with the labels ordered in the way they were specified.

tech.index.reorder_levels([2, 1, 0])

# Output
MultiIndex([('medium',    'FB', '2014-01-02'),
            (   'low',  'AAPL', '2014-01-02'),
            ('medium', 'GOOGL', '2014-01-02'),
            ...
            ('medium',  'AMZN', '2019-08-23'),
            ('medium',    'FB', '2019-08-23')],
           names=['volume_type', 'name', 'date'], length=7105)

Removing Multi-Index Levels

We have talked about adding new levels and even changing their order in our multi-index, now we will take a look at removing levels. Pandas recently introduced a method to perform the same, i.e. the droplevel method.

Let us say we want to exclude the volume_type level, i.e. position 1, we simply use:

tech.droplevel(1)

We get a new hierarchically indexed DataFrame from where the volume_type level has been removed. We notice one thing, the volume_type level is missing from both the levels columns and the general columns.

Unfortunately, there is no way to restore it as a regular column using the droplevel method. This is why we prefer using another method that is the reset_index.

The reset_index is a lot more powerful and flexible. Not only does it allow us to remove a given level, just like droplevel, but also restores it back to the DataFrame as a regular column.

tech.reset_index(level = 2)

This restoration is because reset_index has a drop parameter that controls this aspect of what we do with levels being removed. This drop parameter defaults to False, which is why we see it restored in the DataFrame. If we set this to true, the removed level is truly discarded just like the droplevel method.

Removing Multiple Levels

If we wish to remove several from our multi-index DataFrame at once, we could either remove those levels one at a time calling droplevel or reset_index functions separately, but fortunately both these levels accept lists of level names.

Let us say we need to remove both volume_type and name from the DataFrame, we can say:

Using drop_level Method

tech.droplevel(['volume_type', 'name'])

# Can also be used with positions
tech.droplevel([1, 2])

Using reset_index Method

tech.reset_index(level = ['volume_type', 'name'], drop = True)

Totally Resetting the Index

For resetting the existing index and getting back a basic range index, we simply call the method reset_index method with no parameters on the DataFrame itself.

tech.reset_index()

We get back the plain range index acting as our new index and all the previous levels that were a part of our multi-index got restored as regular columns in our DataFrame.

Sorting Multi-Indices

When we looked at slicing our multi-indexed DataFrame using labels, we combined multiple slice objects within a tuple. Let’s look at the following example:

tech.loc[(slice('2014-01-02', '2014-04-02'), slice(None), 'AAPL'), 'open':'low']

Here, we are selecting the open, close, high, and low prices for Apple for the first three months of trading in 2014 for all volume types. Now, what would happen if we wanted to select not just Apple but a slice going from Apple to Facebook.

If we use the following approach and wrap the names in a slice object, Pandas throws an “UnsortedIndexError: ‘MultiIndex slicing requires the index to be lexsorted“.

tech.loc[(slice('2014-01-02', '2014-04-02'), slice(None), 
          slice('AAPL', 'FB')), 'open':'low']

The reason for this is that it is not very meaningful to talk about slicing a longer sequence of values that is not sorted or does not have an intrinsic order. Now we need to sort our index using the sort_index method.

tech.sort_index(inplace = True)

This command sorts all levels and saves the result in place. Now if we go to our previous command that was throwing an error, and re-execute it, everything works fine.

tech.loc[(slice('2014-01-02', '2014-04-02'), slice(None), 
          slice('AAPL', 'FB')), 'open':'low']

Advantages of using a sorted Multi-Index

Improves retieval performance significantly.
Enables slicing syntax.
Overall a good practice while working with tabular data representations, including Pandas, Excel, SQL, etc.

The sort index method is the same that we’ve seen in practice, but for multi-index DataFrames specifically, we can fine-tune the sort using the level parameter.

For example, if we wanted the most recent dates to come on the top, i.e. we wish to sort only one level of our multi-index in descending order and the names in ascending order.

This can easily be done with the help of the sort_index method.

tech.sort_index(level = (0, 2), ascending = [True, False])

Multi-Index Methods

Apart from the methods we have discussed so far, it’s important to keep in mind that the multi-index object can also be manipulated as a stand-alone data structure. Henceforth, we will take a look at some methods that are only applicable to multi-index DataFrames.

To begin with, we will create a pointer called tidx. Then, we will call the index attribute of the DataFrame and assign its value to the newly created pointer. The index attribute gives us the multi-index.

tidx = tech.index

# Output
(MultiIndex([('2014-01-02',  'AAPL',    'low'),
             ('2014-01-02',  'AMZN', 'medium'),
             ...
             ('2019-08-23',    'FB', 'medium'),
             ('2019-08-23', 'GOOGL', 'medium'),
             ('2019-08-23',  'MSFT', 'medium')],
           names=['date', 'name', 'volume_type'], length=7105)

First of all, we will check whether this index is sorted. To do this we run the following command to confirm whether the index is sorted alphabetically:

tidx.is_lexsorted()

# Output
True

Now that we know that the index is sorted, we can proceed further. If we wish to modify the sort for a given level within our multi-index without touching the DataFrame itself, we can use the sortlevel method directly on the index.

tidx.sortlevel(0, ascending=False, sort_remaining=True)

# Output
(MultiIndex([('2019-08-23',  'MSFT', 'medium'),
             ('2019-08-23', 'GOOGL', 'medium'),
             ...
             ('2014-01-02',  'AMZN', 'medium'),
             ('2014-01-02',  'AAPL',    'low')],
            names=['date', 'name', 'volume_type'], length=7105),
 array([7100, 7102, 7104, ...,    0,    4,    1], dtype=int64))

As we can see from the mentioned example, the sortlevel method takes three parameters:

The level that needs to be modified.
Whether it should be sorted in ascending order or not.
Whether the remaining levels should be sorted or not.

Unlike the sortindex method which is applied to the complete DataFrame, the sortlevel method is applied to the multi-index. We can also sort multiple levels at once.

Consider the following example.

tidx.sortlevel((0, 1, 2), ascending=[True, True, False])

# Output
(MultiIndex([('2014-01-02',  'AAPL',    'low'),
             ('2014-01-02',  'AMZN', 'medium'),
             ...
             ('2019-08-23', 'GOOGL', 'medium'),
             ('2019-08-23',  'MSFT', 'medium')],
            names=['date', 'name', 'volume_type'], length=7105),
 array([   1,    4,    0, ..., 7104, 7102, 7100], dtype=int64))

Now, let us take a look at how to modify the aesthetics of the multi-index.

tech.head()

In the following example, we will use the set_names method on the multi-index directly. This method takes a list of labels that are equal in length to the number of labels in our multi-index.

tidx.set_names(['Trading  Date', 'Volume Category', 'Ticker'])

# Output
MultiIndex([('2014-01-02',    'FB', 'medium'),
            ('2014-01-02',  'AAPL',    'low'),
            ...
            ('2019-08-23',  'AMZN', 'medium'),
            ('2019-08-23',    'FB', 'medium')],
           names=['Trading  Date', 'Volume Category', 'Ticker'], length=7105)

Reshaping with stack Method

We will take a look at the stack method which is widely used alongside pivot tables, but for now, let us talk about it in a multi-index concept.

The stack method is used to take the column axis and pivot or rotate it into the inner-most level of the index.

So starting with our tech dataset, if we call stack on our multi-index DataFrame we end up with a series where the index now includes our column labels as a new level within our multi-index. Let us assign this to a new variable stacked for ease of access.

stacked = tech.stack()

# Output
date        volume_type  name        
2014-01-02  medium       FB    month            1.00
                               year          2014.00
                               day              2.00
                               open            54.86
                               close           54.71
                                            ...     
2019-08-23  medium       FB    open           180.84
                               close          177.75
                               high           183.13
                               low            176.66
                               volume    17331221.00
Length: 56840, dtype: float64

We went from DataFrame to Series which may seem like we lost a dimension but instead, it is a transfer of dimension from the column axis to a wider multi-index. We went from a three-level multi-indexed DataFrame with a one-level column axis to a four-level multi-index series containing a single sequence of values.

type(stacked)

# Output
pandas.core.series.Series

stacked.index.nlevels

# Output
4

Renaming the Levels

We notice the stacked variable has a gap which indicates that the level does not have a name, so let us quickly fix that using the set_names methods.

STEP 1: Let us first isolate the names.

stacked.index.names

# Output
FrozenList(['date', 'volume_type', 'name', None])

FrozenList is a data structure in Pandas that is immutable. Therefore, it is impossible for us to just change its content. The only way to change it is by creating a new instance object.

We see that the last attribute is None which is responsible for no name in our series. To fix this we will use set_names.

STEP 2: Let us assign the FrozenList to a variable names.

names = stacked.index.names

STEP 3: We will now set the names of the levels using the set_names function by modifying the names variable.

stacked.index.set_names([*names[:-1], 'Previously a Column Axis'], inplace = True)

# Output
date        volume_type  name  Previously a Column Axis
2014-01-02  medium       FB    month                              1.00
                               year                            2014.00
                               day                                2.00
                               open                              54.86
                               close                             54.71
                                                              ...     
2019-08-23  medium       FB    open                             180.84
                               close                            177.75
                               high                             183.13
                               low                              176.66
                               volume                      17331221.00
Length: 56840, dtype: float64

Reshaping with unstack Method

In the previous section, we have learned the use of the stack method which takes a column axis and chops it into our index labels. The unstack method does the exact opposite of that.

The unstack method takes the innermost level of a multi-index and moves it back to the column axis. Let us use the unstack method on our stacked dataset.

As we can see, the innermost level of the dataset used to be a column axis. We will use the unstack method to take this innermost level back to the column dimension. We can simply use the following command to achieve this.

stacked.unstack()

Hence, we have retrieved our previous DataFrame from the modified dataset. Furthermore, the function can be called multiple times.

stacked.unstack().unstack()

By running the above piece of code, we end up creating a multi-index column axis. We now have two levels within our column dimension. Thus, this DataFrame now has two-level multi-indices on both the index and the column dimension.

However, we have created gaps in our DataFrame which are represented by NaN due to the lack of adequate information. However, the fill_value parameter of the unstack method allows us to replace these gaps with the desired value.

Moreover, we can also specify exactly what level we want to operate on while using the mentioned methods. Let us consider an example where we will unstack the Volume Category level, instead of the innermost level.

stacked.unstack(level=1)

Another way to do this is by mentioning the name of the level to be unstacked.

stacked.unstack(level = 'name')

Hence, we receive the DataFrame having the input level moved to the column dimension.

Multi-Level Columns using transpose

For manually creating a multi-level column axis we need to combine the set_index with transpose method. We know:

set_index enables us to create a hierarchical index.
transpose swaps the DataFrame such that the columns becomes its rows and vice versa.

Let us say, we are interested in reshaping our tech DataFrame. Let us set the date and volume_type as our indices.

tech.set_index(['date', 'volume_type'])

Now we have our multi-index, at this point, we use transpose in order to flip this DataFrame over, to make our index become our column. We will end up with two multi-index columns.

Hence, we get a hierarchical index using set_index and transpose functions.