Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built on its own data structures. The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets.
In 2008, developer Wes McKinney began building Pandas because he needed high-performance, flexible tools for data analysis. Pandas provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
Pandas is a robust toolkit for analyzing, filtering, manipulating, aggregating, merging, pivoting, and cleaning data. Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.
Before Pandas, Python was used mainly for data munging and preparation and contributed little to data analysis itself. Pandas filled this gap. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze.
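The steps above can be sketched in a few lines. This is a minimal illustration, assuming a small made-up CSV held in memory (the column names city, units, and price are hypothetical). The "model" step is left out because it usually involves other libraries.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real file or database.
raw_csv = io.StringIO(
    "city,units,price\n"
    "Pune,10,2.5\n"
    "Delhi,,3.0\n"
    "Pune,4,2.0\n"
    "Delhi,6,3.5\n"
)

# Load: parse the CSV text into a DataFrame.
sales = pd.read_csv(raw_csv)

# Prepare: drop rows that contain missing values.
clean = sales.dropna().copy()

# Manipulate: derive a new column from existing ones.
clean["revenue"] = clean["units"] * clean["price"]

# Analyze: total revenue per city.
revenue_by_city = clean.groupby("city")["revenue"].sum()
print(revenue_by_city)
```

The same pattern applies to other sources; only the loading call changes.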
Features of Pandas
- Easy handling of missing data (represented as NaN, i.e., "not a number") in floating-point as well as non-floating-point data.
- Columns can be inserted into and deleted from DataFrame and higher-dimensional objects.
- Objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data in computations.
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data.
- Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects.
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
- Intuitive merging and joining data sets.
- Flexible reshaping and pivoting of data sets.
- Hierarchical labelling of axes (possible to have multiple labels per tick).
- Robust input/output tools for loading data from raw (unprocessed) files (CSV and delimited), Excel files, and databases, and for saving and loading data in the ultrafast HDF5 format.
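A few of the features listed above can be illustrated with a short sketch. The data, the column names, and the coach names are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Missing data: NaN marks the absent value; fillna replaces it.
scores = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])
filled = scores.fillna(0.0)

# Automatic alignment: values are matched by label, not by position.
left = pd.Series([1, 2], index=["a", "b"])
right = pd.Series([10, 20], index=["b", "c"])
aligned_sum = left + right  # only "b" is in both, so "a" and "c" become NaN

# Group by: split rows by team, then sum the points within each team.
games = pd.DataFrame({"team": ["x", "y", "x"], "points": [3, 1, 2]})
points_per_team = games.groupby("team")["points"].sum()

# Merge: join two tables on a shared key column.
coaches = pd.DataFrame({"team": ["x", "y"], "coach": ["Asha", "Ravi"]})
merged = games.merge(coaches, on="team")

print(filled)
print(aligned_sum)
print(points_per_team)
print(merged)
```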
How does the Pandas library work?
Pandas operations are fast. Many of the low-level algorithmic parts of the library have been extensively tuned in Cython (C extensions for Python). However, as with anything else, generalization usually sacrifices performance, so if your application needs only one feature, you may be able to build a faster, specialized tool.
Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries. Pandas is a dependency of the statsmodels Python module, making it an important part of the statistical computing ecosystem in Python. The library has also been used extensively in production in financial applications.
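The NumPy connection can be seen in a short sketch. The column names x and y are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# Export the values as a plain NumPy array.
arr = df.to_numpy()
print(type(arr), arr.shape)

# NumPy functions can be applied to a DataFrame directly;
# the result keeps the row and column labels.
roots = np.sqrt(df)
print(roots)
```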
When to use Pandas?
Pandas is well suited to handling the following kinds of data:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
- Any other form of observational / statistical data sets. The data need not be labelled at all to be placed into a Pandas data structure.
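As an illustration of one of the kinds of data listed above, here is a minimal sketch of a time series with irregularly spaced dates. The dates and readings are made up.

```python
import pandas as pd

# Irregularly spaced timestamps (not fixed-frequency).
dates = pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-10"])
readings = pd.Series([20.5, 21.0, 19.5], index=dates)

# Label-based slicing by date range; both endpoints are included.
first_week = readings.loc["2024-01-01":"2024-01-07"]
print(first_week)
```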
Environment Setup
To work with Pandas, we need to install both Python and Pandas. A Python distribution must be installed first, because the standard Python distribution doesn’t come bundled with the Pandas module. To install Pandas, use the following pip command:
pip install pandas
The pip command can be run in Command Prompt (Windows users) or in a terminal (Mac and Linux users).
If you are using the Anaconda distribution, Pandas and some other data analysis libraries come preinstalled. To update to the latest version of Pandas, use the following command, which can be used by conda users as well:
pip install pandas --upgrade
If the installation commands ran without errors, Pandas is installed and can be imported in code as:
import pandas
Here, import is the keyword used to import Python modules, libraries, or packages, and pandas is the library name. By convention, we use pd as an alias for pandas (the "as pd" part is optional when importing).
import pandas as pd
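One quick way to confirm that the import works is to print the library version. A minimal sketch, assuming Pandas is already installed:

```python
import pandas as pd

# Print the installed pandas version to confirm the import succeeded.
print(pd.__version__)
```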
If Pandas is not installed properly, you will see a message like one of the following.
Using the command line:
C:\Users\user>pip show pandas
Output:
WARNING: Package(s) not found: pandas
Using Python IDLE (comes with Python Distribution):
>>> import pandas as pd
Output:
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'
If you see either message, try running the installation commands again. If the issue persists, check that Python is installed without errors and has been added correctly to your system’s environment variables (Windows users).
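When troubleshooting, it can help to check which Python interpreter is running and whether it can locate Pandas. The following is a small diagnostic sketch that uses only the standard library, so it runs even when Pandas is missing.

```python
import importlib.util
import sys

# Which Python interpreter is running this script?
print("Interpreter:", sys.executable)

# find_spec returns None when the module cannot be located.
pandas_spec = importlib.util.find_spec("pandas")
if pandas_spec is None:
    print("pandas is not installed for this interpreter")
else:
    print("pandas found at:", pandas_spec.origin)
```

If the interpreter shown is not the one you expected, running "python -m pip install pandas" with that interpreter installs Pandas for it specifically.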
Quick Pandas code
In this short example, we will see how to create a Pandas DataFrame. A DataFrame in Pandas is essentially tabular data with rows and columns. Do not panic if these terms are unfamiliar, as we will go through each of them in depth in the coming tutorial.
# Import pandas into our code
import pandas as pd
# Create a Python list of lists
data = [['Ram', 10], ['Shyam', 9], ['Gopal', 12]]
# Create a pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
# Print the DataFrame
print(df)
Output:
    Name  Age
0    Ram   10
1  Shyam    9
2  Gopal   12
In the first line of our code, we import the Pandas library as pd, which means that instead of writing pandas we use pd as its alias. We then create an ordinary Python list of lists named data and initialize it with string and integer values.
After that, we call the DataFrame() constructor from Pandas (pd.DataFrame()). It takes the Python list data as an argument (input parameter), along with the column names, and returns a Pandas DataFrame.
The returned DataFrame is stored in the variable df. We print the contents of df at the end of the code.
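As a small extension of the example, the same table can also be built from a dictionary of columns and then inspected. This is a sketch; the filter threshold of 9 is an arbitrary choice for illustration.

```python
import pandas as pd

# Same data as above, written as a dict of column name -> list of values.
df = pd.DataFrame({"Name": ["Ram", "Shyam", "Gopal"], "Age": [10, 9, 12]})

# Shape is (number of rows, number of columns).
print(df.shape)

# Average of the Age column.
print(df["Age"].mean())

# Keep only the rows where Age is greater than 9.
older = df[df["Age"] > 9]
print(older)
```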