Pandas Tutorial part II
Python Pandas - Descriptive Statistics
A large number of methods collectively compute descriptive statistics and other related operations on DataFrame. Most of these are aggregations like sum(), mean(), but some of them, like sumsum(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or integer
- DataFrame − “index” (axis=0, default), “columns” (axis=1)
Let us create a DataFrame and use this object throughout this chapter for all the operations.
Example
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98 ,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df = pd.DataFrame(d)
print df
Its output is as follows
#> Age Name Rating
0 25 Tom 4.23
1 26 James 3.24
2 25 Ricky 3.98
3 23 Vin 2.56
4 30 Steve 3.20
5 29 Smith 4.60
6 23 Jack 3.80
7 34 Lee 3.78
8 40 David 2.98
9 30 Gasper 4.80
10 51 Betina 4.10
11 46 Andres 3.65
sum()
Returns the sum of the values for the requested axis. By default, axis is index (axis=0).
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df = pd.DataFrame(d)
print df.sum()
Its output is as follows
Age 382
Name TomJamesRickyVin
SteveSmithJackLeeDavidGasperBe...
Rating 44.92
dtype: object
Each individual column is added individually (Strings are appended).
axis=1
This syntax will give the output as shown below.
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df = pd.DataFrame(d)
print df.sum(1)
Its output is as follows
0 29.23
1 29.24
2 28.98
3 25.56
4 33.20
5 33.60
6 26.80
7 37.78
8 42.98
9 34.80
10 55.10
11 49.65
dtype: float64
mean()
Re
turns the average value
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin',' Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df = pd.DataFrame(d)
print df.mean()
Its output is as follows
Age 31.833333
Rating 3.743333
dtype: float64
std()
Returns the Bressel standard deviation of the numerical columns.
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3 .98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df = pd.DataFrame(d)
print df.std()
Its output is as follows
Age 9.232682 Rating 0.661628 dtype: float64
Functions & Description
Let us now understand the functions under Descriptive Statistics in Python Pandas. The following table list down the important functions −
Function | Description |
---|---|
count() | Number of non-null observations |
sum() | Sum of values |
mean() | Mean of Values |
median() | Median of Values |
mode() | Mode of values |
std() | Standard Deviation of the Values |
min() | Minimum Value |
max() | Maximum Value |
abs() | Absolute Value |
prod() | Product of Values |
cumsum() | Cumulative Sum |
cumprod() | Cumulative Product |
Note − Since DataFrame is a Heterogeneous data structure. Generic operations don’t work with all functions. Functions like sum(), cumsum() work with both numeric and character (or) string data elements without any error. Though n practice, character aggregations are never used generally, these functions do not throw any exception. Functions like abs(), cumprod() throw exception when the DataFrame contains character or string data because such operations cannot be performed.
Summarizing Data
The describe() function computes a summary of statistics pertaining to the DataFrame columns.
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df = pd.DataFrame(d)
print df.describe()
Its output is as follows
#> Age Rating
count 12.000000 12.000000
mean 31.833333 3.743333
std 9.232682 0.661628
min 23.000000 2.560000
25% 25.000000 3.230000
50% 29.500000 3.790000
75% 35.500000 4.132500
max 51.000000 4.800000
This function gives the mean, std and IQR values. And, function excludes the character columns and given summary about numeric columns. 'include' is the argument which is used to pass necessary information regarding what columns need to be considered for summarizing. Takes the list of values; by default, 'number'.
- object − Summarizes String columns
- number − Summarizes Numeric columns
- all − Summarizes all columns together (Should not pass it as a list value)
Now, use the following statement in the program and check the output
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df = pd.DataFrame(d)
print df.describe(include=['object'])
Its output is as follows
#> Name
count 12
unique 12
top Ricky
freq 1
Now, use the following statement and check the output
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df = pd.DataFrame(d)
print df. describe(include='all')
Its output is as follows
#> Age Name Rating
count 12.000000 12 12.000000
unique NaN 12 NaN
top NaN Ricky NaN
freq NaN 1 NaN
mean 31.833333 NaN 3.743333
std 9.232682 NaN 0.661628
min 23.000000 NaN 2.560000
25% 25.000000 NaN 3.230000
50% 29.500000 NaN 3.790000
75% 35.500000 NaN 4.132500
max 51.000000 NaN 4.800000Python Pandas - Reindexing
Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.
Multiple operations can be accomplished through indexing like
- Reorder the existing data to match a new set of labels.
- Insert missing value (NA) markers in label locations where no data for the label existed.
Example
import pandas as pd
import numpy as np
N=20
df = pd.DataFrame({
'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
'x': np.linspace(0,stop=N-1,num=N),
'y': np.random.rand(N),
'C': np.random.choice(['Low','Medium','High'],N).tolist(),
'D': np.random.normal(100, 10, size=(N)).tolist()
})
#reindex the DataFrame
df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])
print df_reindexed
Its output is as follows
#> A C B
0 2016-01-0
1 Low NaN
2 2016-01-03 High NaN
5 2016-01-06 Low NaN
Reindex to Align with Other Objects
You may wish to take an object and reindex its axes to be labeled the same as another object. Consider the following example to understand the same.
Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn (10,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])
df1 = df1.reindex_like(df2)
print df1
Its output is as follows
#> col1 col2 col3
0 -2.467652 -1.211687 -0.391761
1 -0.287396 0.522350 0.562512
2 -0.255409 -0.483250 1.866258
3 -1.150467 -0.646493 -0.222462
4 0.152768 -2.056643 1.877233
5 -1.155997 1.528719 -1.343719
6 -1.015606 -1.245936 -0.295275
Note − Here, the df1 DataFrame is altered and reindexed like df2. The column names should be matched or else NAN will be added for the entire column label.
Filling while ReIndexing
reindex() takes an optional parameter method which is a filling method with values as follows −
- pad/ffill − Fill values forward
- bfill/backfill − Fill values backward
- nearest − Fill from the nearest index values
Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6 ,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
# Padding NAN's
print df2.reindex_like(df1)
# Now Fill the NAN's with preceding Values
print ("Data Frame with Forward Fill:")
print df2.reindex_like(df1,method='ffill')
Its output is as follows
#> col1 col2 col30 1.311620 -0.707176 0.599863 1 -0.423455 -0.700265 1.133371 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN 5 NaN NaN NaN Data Frame with Forward Fill: col1 col2 col3 0 1.311620 -0.707176 0.599863 1 -0.423455 -0.700265 1.133371 2 -0.423455 -0.700265 1.133371 3 -0.423455 -0.700265 1.133371 4 -0.423455 -0.700265 1.133371 5 -0.423455 -0.700265 1.133371
Note − The last four rows are padded.
Limits on Filling while Reindexing
The limit argument provides additional control over filling while reindexing. Limit specifies the maximum count of consecutive matches. Let us consider the following example to understand the same
Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
# Padding NAN's
print df2.reindex_like(df1)
# Now Fill the NAN's with preceding Values
print ("Data Frame with Forward Fill limitin g to 1:")
print df2.reindex_like(df1,method='ffill',limit=1)
Its output is as follows
#> col1 col2 col3 0 0.247784 2.128727 0.702576 1 -0.055713 -0.021732 -0.174577 2 NaN NaN NaN 3 NaN NaN NaN4 NaN NaN NaN 5 NaN NaN NaN Data Frame with Forward Fill limiting to 1: col1 col2 col3 0 0.247784 2.128727 0.702576 1 -0.055713 -0.021732 -0.174577 2 -0.055713 -0.021732 -0.174577 3 NaN NaN NaN 4 NaN NaN NaN 5 NaN NaN NaN
Note − Observe, only the 7th row is filled by the preceding 6th row. Then, the rows are left as they are.
Renaming
The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.
Let us consider the following example to understand this
import pandas as pd
import numpy as
np
df1 = pd.DataFrame(np.ran dom.randn(6,3),columns=['col1','col2','col3'])
print df1
print ("After renaming the rows and columns:")
print df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},
index = {0 : 'apple', 1 : 'banana', 2 : 'durian'})
Its output is as follows
#> col1 col2 col3 0 0.486791 0.105759 1.540122 1 -0.990237 1.007885 -0.217896 2 -0.483855 -1.645027 -1.194113 3 -0.122316 0.566277 -0.366028 4 -0.231524 -0.721172 -0.1120075 0.438810 0.000225 0.435479 After renaming the rows and columns: c1 c2 col3 apple 0.486791 0.105759 1.540122 banana -0.990237 1.007885 -0.21 7896 durian -0.483855 -1.645027 -1.194113 3 -0.122316 0.566277 -0.366028 4 -0.231524 -0.721172 -0.112007 5 0.438810 0.000225 0.435479
The rename() method provides an inplace named parameter, which by default is False and copies the underlying data. Pass inplace=True to rename the data in place.
Python Pandas - Sorting
There are two kinds of sorting available in Pandas. They are
- By label
- By Actual Value
Let us consider an example with an output.
import pandas as pd
import numpy as np
unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],colu
mns=['col2','col1'])
print unsorted_df
Its output is as follows
#> col2 col1
1 -2.063177 0.537527
4 0.142932 -0.684884
6 0.012667 -0.389340
2 -0.548797 1.848743
3 -1.044160 0.837381
5 0.385605 1.300185
9 1.031425 -1.002967
8 -0.407374 -0.435142
0 2.237453 -1.067139
7 -1.445831 -1.701035
In unsorted_df, the labels and the values are unsorted. Let us see how these can be sorted.
By Label
Using the sort_index() method, by passing the axis arguments and the order of sorting, DataFrame can be sorted. By default, sorting is done on row labels in ascending order.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],colu
mns = ['col2','col1'])
sorted_df=unsorted_df.sort_index()
print sorted_df
Its output is as follows
#> col2 col1
0 0.208464 0.627037
1 0.641004 0.331352
2 -0.038067 -0.464730
3 -0.638456 -0.021466
4 0.014646 -0.737438
5 -0.290761 -1.669827
6 -0.797303 -0.018737
7 0.525753 1.628921
8 -0.567031 0.775951
9 0.060724 -0.322425
Order of Sorting
By passing the Boolean value to ascending parameter, the order of the sorting can be controlled. Let us consider the following example to understand the same.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],colu
mns = ['col2','col1'])
sorted_df = unsorted_df.sort_index(ascending=False)
print sorted_df
Its output is as follows
#> col2 col1
9 0.825697 0.374463
8 -1.699509 0.510373
7 -0.581378 0.622958
6 -0.202951 0.954300
5 -1.289321 -1.551250
4 1.302561 0.851385
3 -0.157915 -0.388659
2 -1.222295 0.166609
1 0.584890 -0.291048
0 0.668444 -0.061294
Sort the Columns
By passing the axis argument with a value 0 or 1, the sorting can be done on the column labels. By default, axis=0, sort by row. Let us consider the following example to understand the same.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],colu
mns = ['col2','col1'])
sorted_df=unsorted_df.sort_index(axis=1)
print sorted_df
Its output is as follows
#> col1 col2
1 -0.291048 0.584890
4 0.851385 1.302561
6 0.954300 -0.202951
2 0.166609 -1.222295
3 -0.388659 -0.157915
5 -1.551250 -1.289321
9 0.374463 0.825697
8 0.510373 -1.699509
0 -0.061294 0.668444
7 0.622958 -0.581378
By Value
Like index sorting, sort_values() is the method for sorting by values. It accepts a 'by' argument which will use the column name of the DataFrame with which the values are to be sorted.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1')
print sorted_df
Its output is as follows
#>col1 col2
1 1 3
2 1 2
3 1 4
0 2 1
Observe, col1 values are sorted and the respective col2 value and row index will alter along with col1. Thus, they look unsorted.
'by' argument takes a list of column values.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by=['col1','col2'])
print sorted_df
Its output is as follows
#>col1 col2
2 1 2
1 1 3
3 1 4
0 2 1
Sorting Algorithm
sort_values() provides a provision to choose the algorithm from mergesort, heapsort and quicksort. Mergesort is the only stable algorithm.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='c ol1' ,kind='mergesort')
print sorted_df
Its output is as follows
#>col1 col2
1 1 3
2 1 2
3 1 4
0 2 1Python Pandas - Working with Text Data
In this chapter, we will discuss the string operations with our basic Series/Index. In the subsequent chapters, we will learn how to apply these string functions on the DataFrame. Pandas provides a set of string functions which make it easy to operate on string data. Most importantly, these functions ignore (or exclude) missing/NaN values. Almost, all of these methods work with Python string functions. So, convert the Series Object to String Object and then perform the operation.
Let us now see how each operation performs.
Function | Description |
---|---|
lower() | Converts strings in the Series/Index to lower case. |
upper() | Converts strings in the Series/Index to upper case. |
len() | Computes String length(). |
strip() | Helps strip whitespace(including newline) from each string in the Series/index from both the sides. |
split(' ') | Splits each string with the given pattern. |
cat(sep=' ') | Concatenates the series/index elements with given separator. |
get_dummies() | Returns the DataFrame with One-Hot Encoded values. |
contains(pattern) | Returns count of appearance of pattern in each element. |
replace(a,b) | Replaces the value a with the value b. |
repeat(value) | Repeats each element with specified number of times. |
count(pattern) | Returns count of appearance of pattern in each element. |
startswith(pattern) | Returns true if the element in the Series/Index starts with the pattern. |
endswith(pattern) | Returns true if the element in the Series/Index ends with the pattern. |
find(pattern) | Returns the first position of the first occurrence of the pattern. |
findall(pattern) | Returns a list of all occurrence of the pattern. |
swapcase | Swaps the case lower/upper. |
islower() | Checks whether all characters in each string in the Series/Index in lower case or not. Returns Boolean |
isupper() | Checks whether all characters in each string in the Series/Index in upper case or not. Returns Boolean. |
isnumeric() | Checks whether all characters in each string in the Series/Index are numeric. Returns Boolean. |
Let us now create a Series and see how all the above functions work.
import pandas as pd
import numpy as np
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
print s
Its output is as follows
0 Tom
1 William Rick
2 John
3 Alber@t
4 NaN
5 1234
6 Steve Smith
dtype: object
lower()
import pandas as pd
import numpy as np
s = pd.Series(['Tom', 'William Rick'
, 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
print s.str.lower()
Its output is as follows
0 tom
1 william rick
2 john
3 alber@t
4 NaN
5 1234
6 steve smith
dtype: object
upper()
import pandas as pd
import numpy as np
s = pd.Series(['Tom', 'William Rick', ' John', 'Alber@t', np.nan, '1234','SteveSmith'])
print s.str.upper()
Its output is as follows
0 TOM
1 WILLIAM RICK
2 JOHN
3 ALBER@T
4 NaN
5 1234
6 STEVE SMITH
dtype: object
len()
import pandas as pd
import numpy as np
s = pd.Series(['Tom', 'William R ick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
print s.str.len()
Its output is as follows
0 3.0
1 12.0
2 4.0
3 7.0
4 NaN
5 4.0
6 10.0
dtype: float64
strip()
import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print s
print ("After Stripping:")
print s.str.strip()
Its output is as follows
0 Tom 1 William Rick 2 John3 Alber@t dtype: object After Stripping: 0 Tom 1 William Rick 2 John 3 Alber@t dtype: object
split(pattern)
import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print s
print ("Split Pattern:")
print s.str.split(' ')
Its output is as follows
0 Tom 1 William Rick2 John 3 Alber@t dtype: object Split Pattern: 0 [Tom, , , , , , , , , , ] 1 [, , , , , William, Rick] 2 [John] 3 [Alber@t] dtype: object
cat(sep=pattern)
import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print s.str.cat(sep='_')
Its output is as follows
Tom _ William Rick_John_Alber@t
get_dummies()
import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' Willia m Rick', 'John', 'Alber@t'])
print s.str.get_dummies()
Its output is as follows
#>William Rick Alber@t John Tom
0 0 0 0 1
1 1 0 0 0
2 0 0 1 0
3 0 1 0 0
contains ()
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print s.str.contains(' ')
Its output is as follows
0 True
1 True
2 False
3 False
dtype: bool
replace(a,b)
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print s
print ("After replacing @ with $:")
print s.str.replace('@','$')
Its output is as follows
0 Tom 1 William Rick 2 John 3 Alber@t dtype: objectAfter replacing @ with $: 0 Tom 1 William Rick 2 John 3 Alber$t dtype: object
repeat(value)
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print s.str.repeat(2)
Its output is as follows
0 Tom Tom
1 William Rick William Rick
2 JohnJohn
3 Alber@tAlber@t
dtype: object
count(pattern)
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print ("The number of 'm's in each string:")
print s.str.count('m')
Its output is as follows
The number of 'm's in each string: 0 1 1 1 2 0 3 0
startswith(pattern)
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John' , 'Alber@t'])
print ("Strings that start with 'T':")
print s.str. startswith ('T')
Its output is as follows
0 True
1 False
2 False
3 False
dtype: bool
endswith(pattern)
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print ("Strings that end w
ith 't':")
print s.str.endswith('t')
Its output is as follows
Strings that end with 't':
0 False
1 False
2 False
3 True
dtype: bool
find(pattern)
import pandas as pd
s = pd.Series(['Tom ', ' William Ri
ck', 'John', 'Alber@t'])
print s.str.find('e')
Its output is as follows
0 -1
1 -1
2 -1
3 3
dtype: int64
"-1" indicates that there no such pattern available in the element.
findall(pattern)
import pandas as pd
s = pd.Series(['Tom ', ' William Rick ', 'John', 'Alber@t'])
print s.str.findall('e')
Its output is as follows
0 [ ]
1 [ ]
2 [ ]
3 [e]
dtype: object
Null list([ ]) indicates that there is no such pattern available in the element.
swapcase()
import pandas as pd
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
print s.str.swapcase()
Its output is as follows
0 tOM
1 wILLIAM rICK
2 jOHN
3 aLBER@T
dtype: object
islower()
import pandas as pd
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
print s.str.islower()
Its output is as follows
0 False
1 False
2 False
3 False
dtype: bool
isupper()
import pandas as pd
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
print s.str.isupper()
Its output is as follows
0 False
1 False
2 False
3 False
dtype: bool
isnumeric()
import pandas as pd
s = pd.Series(['Tom', 'William Rick', ' John', 'Alber@t'])
print s.str.isnumeric()
Its output is as follows
0 False
1 False
2 False
3 False
dtype: boolPython Pandas - Indexing and Selecting Data
In this chapter, we will discuss how to slice and dice the date and generally get the subset of pandas object.
The Python and NumPy indexing operators "[ ]" and attribute operator "." provide quick and easy access to Pandas data structures across a wide range of use cases. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods explained in this chapter.
Pandas now supports three types of Multi-axes indexing; the three types are mentioned in the following table −
Indexing | Description |
---|---|
.loc() | Label based |
.iloc() | Integer based |
.ix() | Both Label and Integer based |
.loc()
Pandas provide various methods to have purely label based indexing. When slicing, the start bound is also included. Integers are valid labels, but they refer to the label and not the position.
.loc() has multiple access methods like
- A single scalar label
- A list of labels
- A slice object
- A Boolean array
loc takes two single/list/range operator separated by ','. The first one indicates the row and the second one indicates columns.
Example 1
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], colum ns = ['A', 'B', 'C', 'D'])
#select all rows for a specific column
print df.loc[:,'A']
Its output is as follows
a 0.391548
b -0.070649
c -0.317212
d -2.162406
e 2.202797
f 0.613709
g 1.050559
h 1.122680
Name: A, dtype: float64
Example 2
# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], col umns = ['A', 'B', 'C', 'D'])
# Select all rows for multiple columns, say list[]
print df.loc[:,['A','C'] ]
Its output is as follows
#> A C
a 0.391548 0.745623
b -0.070649 1.620406
c -0.317212 1.448365
d -2.162406 -0.873557
e 2.202797 0.528067
f 0.613709 0.286414
g 1.050559 0.216526
h 1.122680 -1.621420
Example 3
# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.ran dom.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# Select few rows for multiple columns, say list[]
print df.loc[ ['a','b','f','h'],['A','C'] ]
Its output is as follows
#> A C
a 0.391548 0.745623
b -0.070649 1.620406
f 0.613709 0.286414
h 1.122680 -1.621420
Example 4
# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8 , 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# Select range of rows for all columns
print df.loc['a':'h']
Its output is as follows
#> A B C D a 0.391548 -0.224297 0.745623 0.054301 b -0.070649 -0.880130 1.620406 1.419743 c -0.317212 -1.929698 1.448365 0.616899 d -2.162406 0.614256 -0.873557 1.093958 e 2.202797 -2.315915 0.528067 0.612482 f 0.613709 -0.157674 0.286414 -0.500517 g 1.050559 -2.272099 0.216526 0.928449 h 1.122680 0.324368 -1.621420 -0.741470
Example 5
# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# for getting values with a boolean array
print df.loc['a']>0
Its output is as follows
A False
B True
C False
D False
Name: a, dtype: bool
.iloc()
Pandas provide various methods in order to get purely integer based indexing. Like python and numpy, these are 0-based indexing.
The various access methods are as follows
- An Integer
- A list of integers
- A range of values
Example 1
# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.ra ndom.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# select all rows for a specific column
print df.iloc[:4]
Its output is as follows
#> A B C D
0 0.699435 0.256239 -1.270702 -0.645195
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
3 0.539042 -1.284314 0.826977 -0.026251
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand n(8, 4), columns = ['A', 'B', 'C', 'D'] )
# Integer slicing
print df.iloc[:4]
print df.iloc[1:5, 2:4]
Its output is as follows
#> A B C D0 0.699435 0.256239 -1.270702 -0.645195 1 -0.685354 0.890791 -0.813012 0.631615 2 -0.783192 -0.531378 0.025070 0.230806 3 0.539042 -1.284314 0.826977 -0.026251 C D 1 -0.813012 0.631615 2 0.025070 0.230806 3 0.826977 -0.026251 4 1.423332 1.130568
Example 3
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# Slicing through list of values
print df.iloc[ [1, 3, 5], [1, 3] ]
print df.iloc[1:3, :]
print df.iloc[:,1:3]
Its output is as follows
#> B D1 0.890791 0.631615 3 -1.284314 -0.026251 5 -0.512888 -0.518930 A B C D 1 -0.685354 0.890791 -0.813012 0.631615 2 -0.783192 -0.531378 0.025070 0.230806 B C 0 0.256239 -1.270702 1 0.890791 -0.813012 2 -0.531378 0.025070 3 -1.284314 0.826977 4 -0.460729 1.423332 5 -0.512888 0.581409 6 -1.204853 0.098060 7 -0.947857 0.641358
.ix()
Besides pure label based and integer based, Pandas provides a hybrid method for selections and subsetting the object using the .ix() operator.
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4 ), columns = ['A', 'B', 'C', 'D'])
# Integer slicing
print df.ix[:4]
Its output is as follows
#> A B C D
0 0.699435 0.256239 -1.270702 -0.645195
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
3 0.539042 -1.284314 0.826977 -0.026251
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# Index slicing
print df.ix[:,'A']
Its output is as follows
0 0.699435
1 -0.685354
2 -0.783192
3 0.539042
4 -1.044209
5 -1.415411
6 1.062095
7 0.994204
Name: A, dtype: float64
Use of Notations
Getting values from the Pandas object with Multi-axes indexing uses the following notation
Object | Indexers | Return Type |
---|---|---|
Series | s.loc[indexer] | Scalar value |
DataFrame | df.loc[row_index,col_index] | Series object |
Panel | p.loc[item_index,major_index, minor_index] | p.loc[item_index,major_index, minor_index] |
Note − .iloc() & .ix() applies the same indexing options and Return value.
Let us now see how each operation can be performed on the DataFrame object. We will use the basic indexing operator '[ ]' −
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print df['A']
Its output is as follows
0 -0.478893
1 0.391931
2 0.336825
3 -1.055102
4 -0.165218
5 -0.328641
6 0.567721
7 -0.759399
Name: A, dtype: float64
Note − We can pass a list of values to [ ] to select those columns.
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print df[ ['A','B'] ]
Its output is as follows
#> A B
0 -0.478893 -0.606311
1 0.391931 -0.949025
2 0.336825 0.093717
3 -1.055102 -0.012944
4 -0.165218 1.550310
5 -0.328641 -0.226363
6 0.567721 -0.312585
7 -0.759399 -0.372696
Example 3
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print df[2:2]
Its output is as follows
Columns: [A, B, C, D]
Index: [
]
Attribute Access
Columns can be selected using the attribute operator '.'.
Example
import pandas as pd
import numpy
as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print df.A
Its output is as follows
0 -0.478893
1 0.391931
2 0.336825
3 -1.055102
4 -0.165218
5 -0.328641
6 0.567721
7 -0.759399
Name: A, dtype: float64