Pandas Syntax
Pandas Syntax
Getting Started
Importing Module
The module can be imported in the following way
import pandas
it can also be renamed when importing to shorten the name
import pandas as pd
Reading CSV
Pandas dataframe object supports multiple data types in each data cell and is specifically designed to work with tabular data. Typical tabular file is a CSV file where a delimiter separating each value is defined by some kind of character (usually a comma, since CSV stands for Comma Separated Values.)
Pandas data frame objects have a header and a row name unlike NumPy objects that are n-dimensional arrays of one particular type of data.
dataframe_obj = pandas.read_csv("file_name.csv")
Exploring the Data Frame
pandas_dataframe_obj.head(n)
will output first $n$ rows.
df.shape # outputs a 2-element tuple
df.shape[0] # number of rows
df.shape[1] # number of columns
The above code block will output the tuple representing the dimensions of the data frame, number of rows and columns respectfully.
df.columns
will output column names. However, df.rows
will produce an error.
Working with Rows
Selecting Rows
*.loc[]
There are a few methods to use when selecting rows with Pandas. The first we will lok at is using *.loc[]
which stands for location.
row = df.loc[value]
The value
parameter passed can be an integer or the row label and the resulting row
variable will be a single vector with headers (column names.)
Selecting multiple rows is just as straight forward:
matrix = df.loc[start:finish]
matrix = df.loc[0,3,10]
where start
and finish
specify the range.
*.iloc[]
Another method is using *.iloc[]
and stands for integer location and just like the explanation suggests can only use integers to specify row selection. Other than that works in exactly the same manner as *.loc[]
.
Bracket notation is used to select a slice of rows.
To select individual or a custom list of rows we need to use the *.iloc[]
method on the Dataframe object. With *.iloc[]
the following methods are acceptable:
- Integer
- List of integers
- Slice object
- Boolean array
Boolean
We can also select rows by subsetting with a vector of Boolean values:
bool_vals = (dataframe["column_name"] == value)
rows = dataframe[bool_vals]
The code above will create a vector of Boolean values for each element where dataframe["column_name"] == value
produces a True
or False
value. Further, rows = dataframe[bool_vals]
will select all rows where the value is True
.
Working with Columns
Column Names
Naming Columns
df.columns = ['col_name01', ... , 'col_nameN']
Renaming Columns
col_rename_dict = {old_name01: new_name01, ... , old_nameN: new_nameN}
df.rename(columns = col_rename_dict)
Selecting Columns
Selecting a column is even easier as the snippet below demonstrates:
dataframe_column = df["column_name"]
dataframe_column = df[["col_name01", "col_name03"]]
By passing a string with the right column name the dataframe will return a Series object. String can also be passed as an object (i.e. df[string]
)
Accessing Values
To generate a list of values from a particular column we can do the following:
list_of_values = dataframe["column_name"].values
Counting Values
It is sometimes helpful to find out what values are unique and how often they appear in the given column. For that we can use a special method Series.value_counts()
Mapping Values
We can use a dictionary to specify what values in the column we want to convert (map). This is most useful when Yes/No
type values need to be converted to Boolean True/False
. We can achieve this using the pd.Series.map()
method.
# Consider the following string
series = ['Yes', 'Yes', NaN, 'No']
# Define a dictionary
dict = {'Yes': True, 'No': False}
series = series.map(dict)
The output of series
will be [True, True, NaN, False]
.
Manipulating a Column
It is possible to define element-wise arithmetic operations on an entire column at once
resulting_column = dataframe["column_name"] / 1000
As a result of the above operation all elements of the column will be divided by the specified value
It is also possible to define operations on multiple columns:
resulting_column = dataframe["column01"] + dataframe["column02"]
Creating a New Column
It is convenient to create a new column in the data frame either by inserting new values or manipulating values from one of the existing columns and assigning it as a new column.
dataframe["new_column_name"] = value
Assuming that the value
variable is a vector.
Creating a new column from an existing one using the =
assignment operator:
new_column = dataframe["old_column_name"] / 1000
dataframe["new_column_name"] = new_column
Sorting the Dataframe
Sometimes it is necessary to sort values within the dataframe by a particular column. This can be done in the following way:
new_dataframe = dataframe.sort_values("column_name", ascending=True)
Specifying ascending=False
will sort the dataframe in descending order.
Resetting Index
After sorting the rows indexes are no longer in sequence. Sometimes it is useful to reset the index which can be done using the *.reset_index()
.
This command will retain the old index and add a new column with the new sequential index. If we do not need to retain the old index then drop=True
has to be specified as a parameter.
Custom Index
The dataframe object has a *.set_index()
method that allows us to pass in the name of the column we want pandas to use as the Dataframe index.
The *.set_index()
method has a few parameters that allow us to tweak this behavior:
inplace
: If set toTrue
, this parameter will set the index for the current dataframe, instead of returning a new dataframe.drop
: If set toFalse
, this parameter will keep the column we specified as the index, instead of dropping it.
Working with Missing Data
Finding NAs
To create a vector of True
and False
values of missing data in a particular column we can use the following method:
column = dataframe["column_name"]
vector_w_bool_vals = pandas.isnull(column)
We can further select the null values:
rows_w_null_vals = column[vector_w_bool_vals]
We can also subset the values and select the ones that are not null
rows_w_vals = column[vector_w_bool_vals == False]
Removing NAs
To remove any rows or columns containing NA
values we can use *.dropna()
method on a dataframe.
drop_nas = dataframe.dropna(axis=0)
Axis 0
will specify to drop rows and 1
will drop any columns that have NA
values as entries.
Filling NAs
We can fill in missing data in pandas using the pandas.DataFrame.fillna()
method. This method will replace any missing values in a dataframe with the values we specify.
Subsetting Dataframe
We can use the subset
keyword argument to the dropna()
method so that it only drops rows if there are NA values in certain columns. We will have to specify a list of column names.
new_df = df.dropna(subset=["col_nm_01", "col_nm_02",...])
Pivot Tables
Parameters
Pivot tables (popularized by Microsoft Excel) allow grouping and then applying a calculation. With Pandas we can create a Pivot Table using the following method:
pivot_table = dataframe.pivot_table(index="column_name", values="another_column", aggfunc=func_name)
For aggfunc
we can specify any function or a list of functions. The default parameter is numpy.mean
.
Similarly, for values
a list of column name strings can be passed to group similar calculations across several columns for a specified index.
For further information refer to the documentation.
Applying Functions to a Dataframe
Using a Built-In Function
We can compute the mean of every column using the pandas.DataFrame.mean()
method.
Using a Custom Function
We can pass a function to a dataframe object and have it apply the function through each column and perform an operation iteratively on each element of the column. The following code illustrates:
new_result = dataframe.apply(function_name)
The code above takes the function function_name
, passes it to the dataframe and has is apply the function to every element assigning the returning result to new_result
.
Using a Lambda Function
We can also apply a Lambda Function:
new_result = dataframe.apply(lambda x: np.std(x), axis=1)
A new Dataframe object will be returned where the lambda function was applied to each element of every column.
In this case axis=1
instructs *.apply()
to perform np.std()
over rows of the dataframe.
Pandas Data Structures
Series
Instanciating
Series is a Pandas object that is basically a collection of values. Series object is constructed as a one-dimensional NumPy ndarray. In order to instanciate a Series object we do the following:
# import the correct module
from pandas import Series
custom_series = Series(data=list_of_values, index=list_of_labels)
In the snippet above list_of_values
and list_of_labels
must have the same length and will produce a labelled series of values.
Accessing Values
When it comes to Series objects we can access values just like in a dictionary by providing a key series_obj['string_label']
or simply by using the bracket notation just like ordinary lists series_obj[n:m]
.
Re-indexing
Series objects can be re-indexed by calling *.reindex(index=sorted_list)
method on the object and passing a sorted list as an index.
- Sorting can be done by calling
*.sort()
on a list or passing to thesorted(unsorted_list)
method
Sorting
Additionally Pandas Series objects can be sorted using the built-in *.sort_index()
or *.sort_values()
methods that preserve data alignment.
Comparing and Filtering
We can use vectorized operations on Series objects to compare values or filter
series_greater_than_50 = series_obj[series_obj > value]
This will return a Series object that has values bigger than $50$ based on Boolean values that expression series_obj > value
creates.
We can also define more complex operations using bracket notation:
both_criteria = series_obj[(series_obj > n) & (series_obj < m)]
Dataframe & Series Plots
Scatter Plots
There is a way to plot Pandas dataframe directly without creating any Matplotlib figures and plots explicitly.
dataframe_obj.plot(x='col_name01', y='col_name02, kind='scatter')
The code above will return an Axes object which is essentially a subplot that we can further customize.
Histograms
We can also quickly visualize histograms of a specific column the following way:
dataframe_obj['col_name'].plot(kind='hist')
Another method to generate a histogram of a column showing it’s distribution is to call Series.hist()
method.
dataframe_obj['col_name'].hist(bins=k, range=(n,m))
Scatter Matrix Plots
A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously. A scatter matrix plot consists of n
by n
plots on a grid.
Pandas contains a function named scatter_matrix()
that generates the plots for us.
- This function is part of the
pandas.tools.plotting
module and needs to be imported separately.
from pandas.tools.plotting import scatter_matrix
scatter_matrix(df_obj[['col_01',...,'col_k']], figsize=(n,m))