Pandas Tutorial part II

Python Pandas - Descriptive Statistics

A large number of methods collectively compute descriptive statistics and other related operations on DataFrame. Most of these are aggregations like sum(), mean(), but some of them, like sumsum(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or integer

  • DataFrame − “index” (axis=0, default), “columns” (axis=1)

Let us create a DataFrame and use this object throughout this chapter for all the operations.

Example

import pandas as pd
import numpy as np 

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98 ,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d) 
print df

Its output is as follows 

#> Age Name   Rating
0  25  Tom    4.23
1  26  James  3.24 
2  25  Ricky  3.98
3  23  Vin    2.56
4  30  Steve  3.20
5  29  Smith  4.60
6  23  Jack   3.80
7  34  Lee    3.78
8  40  David  2.98
9  30  Gasper 4.80
10 51  Betina 4.10
11 46  Andres 3.65

 sum()

Returns the sum of the values for the requested axis. By default, axis is index (axis=0).

import pandas as pd
import numpy as np

#Create a Dictionary of series 
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d) 
print df.sum()

Its output is as follows 

Age 382
Name TomJamesRickyVin SteveSmithJackLeeDavidGasperBe...
Rating 44.92
dtype: object

Each individual column is added individually (Strings are appended).

axis=1

This syntax will give the output as shown below.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d) 
print df.sum(1)

Its output is as follows 

0  29.23
1  29.24
2  28.98
3  25.56
4  33.20 
5  33.60
6  26.80
7  37.78
8  42.98
9  34.80
10 55.10
11 49.65
dtype: float64

mean()

Re turns the average value

import pandas as pd
import numpy as np 

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin',' Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print df.mean() 

Its output is as follows 

Age 31.833333
Rating 3.743333
dtype: float64 

std()

Returns the Bressel standard deviation of the numerical columns.

import pandas as pd 
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3 .98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d) 
print df.std()

Its output is as follows 

 Age 9.232682
Rating 0.661628
dtype: float64 

Functions & Description

Let us now understand the functions under Descriptive Statistics in Python Pandas. The following table list down the important functions −

Function Description
count() Number of non-null observations
sum() Sum of values
mean() Mean of Values
median() Median of Values
mode() Mode of values
std() Standard Deviation of the Values
min() Minimum Value
max() Maximum Value
abs() Absolute Value
prod() Product of Values
cumsum() Cumulative Sum
 cumprod() Cumulative Product

Note − Since DataFrame is a Heterogeneous data structure. Generic operations don’t work with all functions. Functions like sum(), cumsum() work with both numeric and character (or) string data elements without any error. Though n practice, character aggregations are never used generally, these functions do not throw any exception. Functions like abs(), cumprod() throw exception when the DataFrame contains character or string data because such operations cannot be performed.

Summarizing Data

The describe() function computes a summary of statistics pertaining to the DataFrame columns.

import pandas as pd 
import numpy as np

#Create a Dictionary of series 
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print df.describe() 

Its output is as follows 

#>    Age       Rating 
count 12.000000 12.000000
mean  31.833333 3.743333
std   9.232682  0.661628
min   23.000000 2.560000
25%   25.000000 3.230000
50%   29.500000 3.790000
75%   35.500000 4.132500
max   51.000000 4.800000

 This function gives the mean, std and IQR values. And, function excludes the character columns and given summary about numeric columns. 'include' is the argument which is used to pass necessary information regarding what columns need to be considered for summarizing. Takes the list of values; by default, 'number'.

  • object − Summarizes String columns
  • number − Summarizes Numeric columns
  • all − Summarizes all columns together (Should not pass it as a list value)

Now, use the following statement in the program and check the output 

import pandas as pd 
import numpy as np

#Create a Dictionary of series 
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d) 
print df.describe(include=['object'])

Its output is as follows 

#>     Name
count  12 
unique 12
top    Ricky
freq   1

Now, use the following statement and check the output 

import pandas as pd
import numpy as np 

#Create a Dictionary of series 
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d) 
print df. describe(include='all')

Its output is as follows 

#>     Age       Name  Rating
count  12.000000 12    12.000000
unique NaN       12    NaN
top    NaN       Ricky NaN 
freq   NaN       1     NaN
mean   31.833333 NaN   3.743333
std    9.232682  NaN   0.661628
min    23.000000 NaN   2.560000
25%    25.000000 NaN   3.230000
50%    29.500000 NaN   3.790000
75%    35.500000 NaN   4.132500
max    51.000000 NaN   4.800000

Python Pandas - Reindexing

Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.

Multiple operations can be accomplished through indexing like 

  • Reorder the existing data to match a new set of labels.
  • Insert missing value (NA) markers in label locations where no data for the label existed.

Example

import pandas as pd
import numpy as np 

N=20 

df = pd.DataFrame({ 
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})

#reindex the DataFrame 
df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])

print df_reindexed 

Its output is as follows 

#>         A   C    B
0 2016-01-0 1 Low  NaN
2 2016-01-03 High NaN
5 2016-01-06 Low  NaN

 Reindex to Align with Other Objects 

 You may wish to take an object and reindex its axes to be labeled the same as another object. Consider the following example to understand the same.

Example

import pandas as pd
import numpy as np 

df1 = pd.DataFrame(np.random.randn (10,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])

df1 = df1.reindex_like(df2) 
print df1

Its output is as follows 

#>     col1      col2      col3
0 -2.467652 -1.211687 -0.391761
1 -0.287396  0.522350  0.562512
2 -0.255409 -0.483250  1.866258 
3 -1.150467 -0.646493 -0.222462
4  0.152768 -2.056643  1.877233
5 -1.155997  1.528719 -1.343719
6 -1.015606 -1.245936 -0.295275

Note − Here, the df1 DataFrame is altered and reindexed like df2. The column names should be matched or else NAN will be added for the entire column label.

Filling while ReIndexing

reindex() takes an optional parameter method which is a filling method with values as follows −

  • pad/ffill − Fill values forward
  • bfill/backfill − Fill values backward
  • nearest − Fill from the nearest index values

Example

import pandas as pd
import numpy as np 

df1 = pd.DataFrame(np.random.randn(6 ,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])

# Padding NAN's 
print df2.reindex_like(df1)

# Now Fill the NAN's with preceding Values
print ("Data Frame with Forward Fill:") 
print df2.reindex_like(df1,method='ffill')

Its output is as follows 

#>     col1      col2     col3 
0  1.311620 -0.707176 0.599863
1 -0.423455 -0.700265 1.133371
2       NaN       NaN      NaN
3       NaN       NaN      NaN
4       NaN       NaN      NaN
5       NaN       NaN      NaN

Data Frame with Forward Fill:
       col1      col2     col3
0  1.311620 -0.707176 0.599863 
1 -0.423455 -0.700265 1.133371
2 -0.423455 -0.700265 1.133371
3 -0.423455 -0.700265 1.133371
4 -0.423455 -0.700265 1.133371
5 -0.423455 -0.700265 1.133371

Note − The last four rows are padded. 

Limits on Filling while Reindexing

The limit argument provides additional control over filling while reindexing. Limit specifies the maximum count of consecutive matches. Let us consider the following example to understand the same 

Example

import pandas as pd 
import numpy as np

df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])

 # Padding NAN's
print df2.reindex_like(df1)

# Now Fill the NAN's with preceding Values
print ("Data Frame with Forward Fill limitin g to 1:")
print df2.reindex_like(df1,method='ffill',limit=1)

Its output is as follows 

#>     col1      col2      col3
0  0.247784  2.128727  0.702576
1 -0.055713 -0.021732 -0.174577
2       NaN       NaN       NaN
3       NaN       NaN       NaN 
4       NaN       NaN       NaN
5       NaN       NaN       NaN

Data Frame with Forward Fill limiting to 1:
       col1      col2      col3
0  0.247784  2.128727  0.702576
1 -0.055713 -0.021732 -0.174577
2 -0.055713 -0.021732 -0.174577
3       NaN       NaN       NaN 
4       NaN       NaN       NaN
5       NaN       NaN       NaN

Note − Observe, only the 7th row is filled by the preceding 6th row. Then, the rows are left as they are.

Renaming

The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

Let us consider the following example to understand this 

import pandas as pd
import numpy as  np

df1 = pd.DataFrame(np.ran dom.randn(6,3),columns=['col1','col2','col3'])
print df1

  print ("After renaming the rows and columns:")
print df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},
index = {0 : 'apple', 1 : 'banana', 2 : 'durian'}) 

Its output is as follows 

#>     col1      col2      col3
0  0.486791  0.105759  1.540122
1 -0.990237  1.007885 -0.217896
2 -0.483855 -1.645027 -1.194113
3 -0.122316  0.566277 -0.366028
4 -0.231524 -0.721172 -0.112007 
5  0.438810  0.000225  0.435479

After renaming the rows and columns:
              c1        c2      col3
apple   0.486791  0.105759  1.540122
banana -0.990237  1.007885 -0.21 7896
durian -0.483855 -1.645027 -1.194113
3      -0.122316  0.566277 -0.366028
4      -0.231524 -0.721172 -0.112007
5       0.438810  0.000225  0.435479

The rename() method provides an inplace named parameter, which by default is False and copies the underlying data. Pass inplace=True to rename the data in place.

Python Pandas - Sorting

There are two kinds of sorting available in Pandas. They are 

  • By label
  • By Actual Value

Let us consider an example with an output.

import pandas as pd
import numpy as np 

unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],colu
mns=['col2','col1'])
print unsorted_df 

Its output is as follows 

#>     col2      col1
1 -2.063177  0.537527
4  0.142932 -0.684884
6  0.012667 -0.389340
2 -0.548797  1.848743
3 -1.044160  0.837381
5  0.385605  1.300185
9  1.031425 -1.002967
8 -0.407374 -0.435142 
0  2.237453 -1.067139
7 -1.445831 -1.701035

In unsorted_df, the labels and the values are unsorted. Let us see how these can be sorted.

By Label

Using the sort_index() method, by passing the axis arguments and the order of sorting, DataFrame can be sorted. By default, sorting is done on row labels in ascending order.

import pandas as pd
import numpy as np 

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],colu
   mns = ['col2','col1']) 

sorted_df=unsorted_df.sort_index() 
print sorted_df

Its output is as follows 

#>     col2      col1
0  0.208464  0.627037
1  0.641004  0.331352
2 -0.038067 -0.464730
3 -0.638456 -0.021466
4  0.014646 -0.737438 
5 -0.290761 -1.669827
6 -0.797303 -0.018737
7  0.525753  1.628921
8 -0.567031  0.775951
9  0.060724 -0.322425

Order of Sorting

By passing the Boolean value to ascending parameter, the order of the sorting can be controlled. Let us consider the following example to understand the same.

 import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],colu
   mns = ['col2','col1']) 

sorted_df = unsorted_df.sort_index(ascending=False)
print sorted_df 

Its output is as follows 

#>     col2      col1
9  0.825697  0.374463
8 -1.699509  0.510373
7 -0.581378  0.622958
6 -0.202951  0.954300
5 -1.289321 -1.551250
4  1.302561  0.851385
3 -0.157915 -0.388659
2 -1.222295  0.166609 
1  0.584890 -0.291048
0  0.668444 -0.061294

Sort the Columns

By passing the axis argument with a value 0 or 1, the sorting can be done on the column labels. By default, axis=0, sort by row. Let us consider the following example to understand the same.

import pandas as pd
import numpy as np 

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],colu
   mns = ['col2','col1'])

sorted_df=unsorted_df.sort_index(axis=1)

print sorted_df 

Its output is as follows 

#>     col1      col2
1 -0.291048  0.584890
4  0.851385  1.302561
6  0.954300 -0.202951
2  0.166609 -1.222295
3 -0.388659 -0.157915
5 -1.551250 -1.289321
9  0.374463  0.825697 
8  0.510373 -1.699509
0 -0.061294  0.668444
7  0.622958 -0.581378

By Value

Like index sorting, sort_values() is the method for sorting by values. It accepts a 'by' argument which will use the column name of the DataFrame with which the values are to be sorted.

import pandas as pd 
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1') 

print sorted_df 

Its output is as follows 

#>col1 col2
1    1    3
2    1    2
3    1    4
0    2    1 

Observe, col1 values are sorted and the respective col2 value and row index will alter along with col1. Thus, they look unsorted.

'by' argument takes a list of column values.

import pandas as pd 
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
   sorted_df = unsorted_df.sort_values(by=['col1','col2']) 

print sorted_df 

Its output is as follows 

#>col1 col2
2    1    2
1    1    3
3    1    4 
0    2    1

Sorting Algorithm

sort_values() provides a provision to choose the algorithm from mergesort, heapsort and quicksort. Mergesort is the only stable algorithm.

import pandas as pd 
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='c ol1' ,kind='mergesort')

 print sorted_df

Its output is as follows 

#>col1 col2
1    1    3
2    1    2 
3    1    4
0    2    1

Python Pandas - Working with Text Data

In this chapter, we will discuss the string operations with our basic Series/Index. In the subsequent chapters, we will learn how to apply these string functions on the DataFrame. Pandas provides a set of string functions which make it easy to operate on string data. Most importantly, these functions ignore (or exclude) missing/NaN values. Almost, all of these methods work with Python string functions. So, convert the Series Object to String Object and then perform the operation.

Let us now see how each operation performs.

Function Description
lower() Converts strings in the Series/Index to lower case.
upper() Converts strings in the Series/Index to upper case.
len() Computes String length().
strip() Helps strip whitespace(including newline) from each string in the Series/index from both the sides.
split(' ') Splits each string with the given pattern.
cat(sep=' ') Concatenates the series/index elements with given separator.
get_dummies() Returns the DataFrame with One-Hot Encoded values.
contains(pattern) Returns count of appearance of pattern in each element.
replace(a,b) Replaces the value a with the value b.
 repeat(value) Repeats each element with specified number of times.
 count(pattern) Returns count of appearance of pattern in each element.
startswith(pattern) Returns true if the element in the Series/Index starts with the pattern.
endswith(pattern) Returns true if the element in the Series/Index ends with the pattern.
find(pattern) Returns the first position of the first occurrence of the pattern.
 findall(pattern) Returns a list of all occurrence of the pattern.
swapcase Swaps the case lower/upper.
islower() Checks whether all characters in each string in the Series/Index in lower case or not. Returns Boolean
 isupper() Checks whether all characters in each string in the Series/Index in upper case or not. Returns Boolean.
isnumeric() Checks whether all characters in each string in the Series/Index are numeric. Returns Boolean.

Let us now create a Series and see how all the above functions work.

import pandas as pd
import numpy as np 

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan,  '1234','SteveSmith'])

print s 

Its output is as follows 

0 Tom
1 William Rick
2 John
3 Alber@t
4 NaN
5 1234 
6 Steve Smith
dtype: object

lower() 

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick' , 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

print s.str.lower() 

Its output is as follows 

0 tom
1 william rick
2 john
3 alber@t
4 NaN
5 1234
6 steve smith 
dtype: object

upper()

import pandas as pd 
import numpy as np

s = pd.Series(['Tom', 'William Rick', ' John', 'Alber@t', np.nan, '1234','SteveSmith'])

print s.str.upper() 

Its output is as follows 

0 TOM
1 WILLIAM RICK
2 JOHN
3 ALBER@T
4 NaN
5 1234
6 STEVE SMITH
dtype: object 

len()

import pandas as pd
import numpy as np 

s = pd.Series(['Tom', 'William R ick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
print s.str.len()

Its output is as follows 

0 3.0
1 12.0
2 4.0
3 7.0
4 NaN 
5 4.0
6 10.0
dtype: float64

strip()

import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print s 
print ("After Stripping:")
print s.str.strip()

Its output is as follows 

0 Tom
1 William Rick
2 John 
3 Alber@t
dtype: object

 After Stripping:
0 Tom
1 William Rick
2 John
3 Alber@t
dtype: object

split(pattern)

import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print s 
print ("Split Pattern:")
print s.str.split(' ')

Its output is as follows 

0 Tom
1 William Rick 
2 John
3 Alber@t
dtype: object

Split Pattern:
0 [Tom, , , , , , , , , , ]
1 [, , , , , William, Rick]
2 [John]
3 [Alber@t]
dtype: object 

cat(sep=pattern)

import pandas as pd
import numpy as np 

 s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print s.str.cat(sep='_') 

Its output is as follows 

Tom _ William Rick_John_Alber@t 

get_dummies()

import pandas as pd
import numpy as np 

s = pd.Series(['Tom ', ' Willia m Rick', 'John', 'Alber@t'])

print s.str.get_dummies() 

Its output is as follows 

#>William Rick Alber@t John Tom
0            0       0    0   1
1            1       0    0   0 
2            0       0    1   0
3            0       1    0   0

contains ()

import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print s.str.contains(' ') 

Its output is as follows 

0 True
1 True
2 False
3 False
dtype: bool 

replace(a,b)

import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print s
print ("After replacing @ with $:")
print s.str.replace('@','$') 

Its output is as follows 

0 Tom
1 William Rick
2 John
3 Alber@t
dtype: object 

After replacing @ with $:
0 Tom
1 William Rick
2 John 
3 Alber$t
dtype: object

repeat(value)

import pandas as pd 

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t']) 

print s.str.repeat(2) 

Its output is as follows 

0 Tom Tom
1 William Rick William Rick
2 JohnJohn
3 Alber@tAlber@t
dtype: object 

count(pattern)

import pandas as pd
 
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print ("The number of 'm's in each string:")
print s.str.count('m') 

Its output is as follows 

 The number of 'm's in each string:
0 1
1 1
2 0
3 0 

startswith(pattern)

import pandas as pd 

s = pd.Series(['Tom ', ' William Rick', 'John' , 'Alber@t'])

print ("Strings that start with 'T':")
print s.str. startswith ('T') 

Its output is as follows 

0 True
1 False
2 False
3 False
dtype: bool 

endswith(pattern)

import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print ("Strings that end w ith 't':")
print s.str.endswith('t')

Its output is as follows 

Strings that end with 't':
0 False
1 False
2 False 
3 True
dtype: bool

find(pattern)

 import pandas as pd

s = pd.Series(['Tom ', ' William Ri ck', 'John', 'Alber@t'])

 print s.str.find('e')

Its output is as follows 

0 -1
1 -1
2 -1
3 3
dtype: int64 

"-1" indicates that there no such pattern available in the element.

findall(pattern)

import pandas as pd 

s = pd.Series(['Tom ', ' William Rick ', 'John', 'Alber@t'])

print s.str.findall('e') 

Its output is as follows 

0 [ ]
1 [ ]
2 [ ] 
3 [e]
dtype: object

Null list([ ]) indicates that there is no such pattern available in the element.

swapcase()

import pandas as pd 

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
print s.str.swapcase() 

Its output is as follows 

0 tOM 
1 wILLIAM rICK
2 jOHN
3 aLBER@T
dtype: object

islower()

 import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
print s.str.islower() 

Its output is as follows 

0 False
1 False
2 False 
3 False
dtype: bool

isupper()

import pandas as pd 

 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

print s.str.isupper() 

Its output is as follows 

0 False
1 False
2 False
3 False 
dtype: bool

isnumeric()

import pandas as pd 

s = pd.Series(['Tom', 'William Rick', ' John', 'Alber@t'])

print s.str.isnumeric() 

Its output is as follows 

0 False 
1 False
2 False
3 False
dtype: bool

Python Pandas - Indexing and Selecting Data

In this chapter, we will discuss how to slice and dice the date and generally get the subset of pandas object.

The Python and NumPy indexing operators "[ ]" and attribute operator "." provide quick and easy access to Pandas data structures across a wide range of use cases. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods explained in this chapter.

Pandas now supports three types of Multi-axes indexing; the three types are mentioned in the following table −

Indexing Description
.loc() Label based
.iloc() Integer based
.ix() Both Label and Integer based

.loc()

Pandas provide various methods to have purely label based indexing. When slicing, the start bound is also included. Integers are valid labels, but they refer to the label and not the position.

.loc() has multiple access methods like 

  • A single scalar label
  • A list of labels
  • A slice object
  • A Boolean array

loc takes two single/list/range operator separated by ','. The first one indicates the row and the second one indicates columns.

Example 1

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np 

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], colum ns = ['A', 'B', 'C', 'D'])

#select all rows for a specific column
print df.loc[:,'A'] 

Its output is as follows 

a  0.391548
b -0.070649
c -0.317212
d -2.162406 
e  2.202797
f  0.613709
g  1.050559
h  1.122680
Name: A, dtype: float64

Example 2

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np 

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], col umns = ['A', 'B', 'C', 'D'])

# Select all rows for multiple  columns, say list[]
print df.loc[:,['A','C'] ]

Its output is as follows 

#>        A         C 
a  0.391548  0.745623
b -0.070649  1.620406
c -0.317212  1.448365
d -2.162406 -0.873557
e  2.202797  0.528067
f  0.613709  0.286414
g  1.050559  0.216526
h  1.122680 -1.621420

Example 3

# import the pandas library and aliasing as pd
import pandas as pd 
import numpy as np

df = pd.DataFrame(np.ran dom.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select few rows for multiple columns,  say list[]
print df.loc[ ['a','b','f','h'],['A','C'] ]

Its output is as follows 

#>        A         C
a  0.391548  0.745623
b -0.070649  1.620406 
f  0.613709  0.286414
h  1.122680 -1.621420

Example 4

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np 

df = pd.DataFrame(np.random.randn(8 , 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select range of rows for all columns
print df.loc['a':'h'] 

Its output is as follows 

#>        A         B         C         D
a  0.391548 -0.224297  0.745623  0.054301
b -0.070649 -0.880130  1.620406  1.419743
c -0.317212 -1.929698  1.448365  0.616899
d -2.162406  0.614256 -0.873557  1.093958
e  2.202797 -2.315915  0.528067  0.612482
f  0.613709 -0.157674  0.286414 -0.500517
g  1.050559 -2.272099  0.216526  0.928449
h  1.122680  0.324368 -1.621420 -0.741470

Example 5

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np 

df = pd.DataFrame(np.random.randn(8, 4), 
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# for getting values with a boolean array
print df.loc['a']>0 

Its output is as follows 

A False
B True
C False
D False
Name: a, dtype: bool 

.iloc()

Pandas provide various methods in order to get purely integer based indexing. Like python and numpy, these are 0-based indexing.

The various access methods are as follows 

  • An Integer
  • A list of integers
  • A range of values

Example 1

# import the pandas library and aliasing as pd
import pandas as pd 
import numpy as np

df = pd.DataFrame(np.ra ndom.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# select all rows for a specific column
print df.iloc[:4] 

Its output is as follows 

#>        A         B         C         D
0  0.699435  0.256239 -1.270702 -0.645195
1 -0.685354  0.890791 -0.813012  0.631615 
2 -0.783192 -0.531378  0.025070  0.230806
3  0.539042 -1.284314  0.826977 -0.026251

Example 2

import pandas as pd
import numpy as np 

df = pd.DataFrame(np.random.rand n(8, 4), columns = ['A', 'B', 'C', 'D'] )

# Integer slicing
print df.iloc[:4]
print df.iloc[1:5, 2:4] 

Its output is as follows 

#>        A         B         C         D 
0  0.699435  0.256239 -1.270702 -0.645195
1 -0.685354  0.890791 -0.813012  0.631615
2 -0.783192 -0.531378  0.025070  0.230806
3  0.539042 -1.284314  0.826977 -0.026251
 
          C         D
1 -0.813012  0.631615
2  0.025070  0.230806
3  0.826977 -0.026251
4  1.423332  1.130568 

Example 3

import pandas as pd
import numpy as np 

df = pd.DataFrame(np.random.randn(8,  4), columns = ['A', 'B', 'C', 'D'])

# Slicing through list of values
print df.iloc[ [1, 3, 5], [1, 3] ]
print df.iloc[1:3, :]
print df.iloc[:,1:3] 

Its output is as follows 

#>        B         D 
1  0.890791  0.631615
3 -1.284314 -0.026251
5 -0.512888 -0.518930

          A         B         C        D
 1 -0.685354  0.890791 -0.813012 0.631615
2 -0.783192 -0.531378  0.025070 0.230806

          B         C
0  0.256239 -1.270702
1  0.890791 -0.813012
2 -0.531378  0.025070
3 -1.284314  0.826977
4 -0.460729  1.423332
5 -0.512888  0.581409 
6 -1.204853  0.098060
7 -0.947857  0.641358

.ix()

Besides pure label based and integer based, Pandas provides a hybrid method for selections and subsetting the object using the .ix() operator.

Example 1

import pandas as pd 
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4 ), columns = ['A', 'B', 'C', 'D'])

# Integer slicing 
print df.ix[:4]

Its output is as follows 

#>        A         B         C         D
0  0.699435  0.256239 -1.270702 -0.645195 
1 -0.685354  0.890791 -0.813012  0.631615
2 -0.783192 -0.531378  0.025070  0.230806
3  0.539042 -1.284314  0.826977 -0.026251

Example 2

import pandas as pd
import numpy as np 

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# Index slicing 
print df.ix[:,'A']

Its output is as follows 

0 0.699435 
1 -0.685354
2 -0.783192
3 0.539042
4 -1.044209
5 -1.415411
6 1.062095
7 0.994204
Name: A, dtype: float64

Use of Notations

Getting values from the Pandas object with Multi-axes indexing uses the following notation 

Object Indexers Return Type
Series s.loc[indexer] Scalar value
DataFrame df.loc[row_index,col_index] Series object
Panel p.loc[item_index,major_index, minor_index] p.loc[item_index,major_index, minor_index]

Note − .iloc() & .ix() applies the same indexing options and Return value.

Let us now see how each operation can be performed on the DataFrame object. We will use the basic indexing operator '[ ]' −

Example 1

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print df['A'] 

Its output is as follows 

0 -0.478893
1 0.391931
2 0.336825
3 -1.055102
4 -0.165218
5 -0.328641 
6 0.567721
7 -0.759399
Name: A, dtype: float64

Note − We can pass a list of values to [ ] to select those columns.

Example 2

import pandas as pd
import numpy as np 
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

print df[ ['A','B'] ] 

Its output is as follows 

#>        A         B
0 -0.478893 -0.606311
1  0.391931 -0.949025
2  0.336825  0.093717
3 -1.055102 -0.012944
4 -0.165218  1.550310 
5 -0.328641 -0.226363
6  0.567721 -0.312585
7 -0.759399 -0.372696

Example 3

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print df[2:2] 

Its output is as follows 

Columns: [A, B, C, D]
Index: [  ]

Attribute Access

Columns can be selected using the attribute operator '.'.

Example

import pandas as pd
import numpy  as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

 print df.A

Its output is as follows 

0 -0.478893
1 0.391931
2 0.336825
3 -1.055102
4 -0.165218
5 -0.328641
6 0.567721 
7 -0.759399
Name: A, dtype: float64
Page structure