Pandas Dataframes/Series {#Pandas-Dataframes/Series}

20 min

Exercises (10 min)

A DataFrame is a collection of Series: the DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.

Pandas is built on top of the Numpy library, which in practice means that most of the methods defined for Numpy Arrays apply to Pandas Series/DataFrames.
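As a quick illustration of that NumPy heritage, here is a minimal sketch using a throw-away Series (not the workshop data):

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
print(s.mean())        # 2.0 -- a NumPy-style reduction
print((s * 10).sum())  # 60.0 -- element-wise arithmetic followed by a reduction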

What makes Pandas so attractive is its powerful interface for accessing individual records of the table, its proper handling of missing values, and its relational-database operations between DataFrames.

Selecting values (iloc[...,...]) {#Selecting-values-(iloc[...,...])}

To access a value at the position [i, j] of a DataFrame, we have two options, depending on what i means in this context. Remember that a DataFrame provides an index as a way to identify the rows of the table; a row, then, has a position inside the table as well as a label, which uniquely identifies its entry in the DataFrame.

DataFrame.iloc[..., ...] selects values by numerical index, analogously to a 2D version of character selection in strings:

dataframe.iloc[rows, columns]

In [4]:

import pandas as pd
data = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_europe.csv', index_col='country')
#data
print(data.iloc[0:3, 0])

#With labels 
#print(data.loc["Albania", "gdpPercap_1952"])

#All columns (just like usual slicing)

#print(data.loc["Albania", :])

country
Albania    1601.056136
Austria    6137.076492
Belgium    8343.105127
Name: gdpPercap_1952, dtype: float64

Use DataFrame.loc[..., ...] to select values by their (entry) label.

  • Can specify location by row (and column) name, analogously to a 2D version of dictionary keys.

In [72]:

data = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_europe.csv', index_col='country')
print(data.loc["Albania", "gdpPercap_1952"])

1601.056136

Use : on its own to mean all columns or all rows.

  • Just like Python’s usual slicing notation.

In [9]:

print(data.loc["Italy",:])

gdpPercap_1952     4931.404155
gdpPercap_1957     6248.656232
gdpPercap_1962     8243.582340
gdpPercap_1967    10022.401310
gdpPercap_1972    12269.273780
gdpPercap_1977    14255.984750
gdpPercap_1982    16537.483500
gdpPercap_1987    19207.234820
gdpPercap_1992    22013.644860
gdpPercap_1997    24675.024460
gdpPercap_2002    27968.098170
gdpPercap_2007    28569.719700
Name: Italy, dtype: float64

In [ ]:

print(data.loc["Albania", :])

Select multiple columns or rows using DataFrame.loc and a named slice.

In [5]:

print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])

             gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country                                                    
Italy           8243.582340    10022.401310    12269.273780
Montenegro      4649.593785     5907.850937     7778.414017
Netherlands    12790.849560    15363.251360    18794.745670
Norway         13450.401510    16361.876470    18965.055510
Poland          5338.752143     6557.152776     8006.506993

In the above code, we discover that slicing using loc is inclusive at both ends, which differs from slicing using iloc, where slicing indicates everything up to but not including the final index.
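A quick way to see the difference is to compare the shapes of equivalent slices (a minimal sketch, assuming data is the gapminder Europe DataFrame loaded above):

# Both slices select the first three countries and the first three years,
# but loc includes BOTH end labels while iloc stops just before the end position.
print(data.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'].shape)  # (3, 3)
print(data.iloc[0:3, 0:3].shape)                                               # (3, 3) -- positions 0, 1, 2 only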

Result of slicing can be used in further operations. {#Result-of-slicing-can-be-used-in-further-operations.}

  • Usually don’t just print a slice.
  • All the statistical operators that work on entire dataframes work the same way on slices.
  • E.g., calculate max of a slice.

In [10]:

print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())

gdpPercap_1962    13450.40151
gdpPercap_1967    16361.87647
gdpPercap_1972    18965.05551
dtype: float64

In [11]:

# Calculate minimum of slice

print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())

gdpPercap_1962    4649.593785
gdpPercap_1967    5907.850937
gdpPercap_1972    7778.414017
dtype: float64

Use comparisons to select data based on value.

  • Comparison is applied element by element.

  • Returns a similarly-shaped dataframe of True and False.

In [15]:

subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
#print('Subset of data:\n', subset)

# Which values were greater than 10000 ?
print('\nWhere are values large?\n', subset > 10000)

#Select values or NaN using a Boolean mask.
mask = subset > 10000
print(subset[mask])

Where are values large?
              gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country                                                    
Italy                 False            True            True
Montenegro            False           False           False
Netherlands            True            True            True
Norway                 True            True            True
Poland                False           False           False
             gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country                                                    
Italy                   NaN     10022.40131     12269.27378
Montenegro              NaN             NaN             NaN
Netherlands     12790.84956     15363.25136     18794.74567
Norway          13450.40151     16361.87647     18965.05551
Poland                  NaN             NaN             NaN

Get the value where the mask is true, and NaN (Not a Number) where it is false. Useful because NaNs are ignored by operations like max, min, average, etc.

  • A frame full of Booleans is sometimes called a mask because of how it can be used.

In [9]:

mask = subset > 10000
print(subset[mask])

             gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country                                                    
Italy                   NaN     10022.40131     12269.27378
Montenegro              NaN             NaN             NaN
Netherlands     12790.84956     15363.25136     18794.74567
Norway          13450.40151     16361.87647     18965.05551
Poland                  NaN             NaN             NaN
  • Get the value where the mask is true, and NaN (Not a Number) where it is false.
  • Useful because NaNs are ignored by operations like max, min, average, etc.
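For example, summary statistics on the masked subset simply skip the NaNs (a minimal sketch, assuming subset and mask are defined as above):

masked = subset[mask]
print(masked.mean())      # per-column mean of the values that passed the test
print(masked.describe())  # 'count' shows how many non-NaN values each column kept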

Group By: split-apply-combine {#Group-By:-split-apply-combine}

Pandas vectorizing methods and grouping operations are features that provide users with much flexibility to analyse their data.

  1. We may have a first glance by splitting the countries into two groups for each of the years surveyed: those that presented a GDP higher than the European average and those with a lower GDP.
  2. We then estimate a wealth score based on the historical (from 1962 to 2007) values, where we count how many times a country has been part of the higher-GDP group.

In [21]:

mask_higher = data > data.mean()

wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns)
wealth_score

Out[21]:

country
Albania                   0.000000
Austria                   1.000000
Belgium                   1.000000
Bosnia and Herzegovina    0.000000
Bulgaria                  0.000000
Croatia                   0.000000
Czech Republic            0.500000
Denmark                   1.000000
Finland                   1.000000
France                    1.000000
Germany                   1.000000
Greece                    0.333333
Hungary                   0.000000
Iceland                   1.000000
Ireland                   0.333333
Italy                     0.500000
Montenegro                0.000000
Netherlands               1.000000
Norway                    1.000000
Poland                    0.000000
Portugal                  0.000000
Romania                   0.000000
Serbia                    0.000000
Slovak Republic           0.000000
Slovenia                  0.333333
Spain                     0.333333
Sweden                    1.000000
Switzerland               1.000000
Turkey                    0.000000
United Kingdom            1.000000
dtype: float64

Note: axis (default 0): {0 or ‘index’, 1 or ‘columns’}. With 0 or ‘index’ the function is applied to each column; with 1 or ‘columns’ it is applied to each row.
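A minimal sketch of the difference (assuming data is still the gapminder Europe DataFrame):

print(data.sum(axis=0).head())  # axis=0 / 'index': collapse the rows -> one total per column (per year)
print(data.sum(axis=1).head())  # axis=1 / 'columns': collapse the columns -> one total per row (per country)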

Finally, for each group in the wealth_score table, we sum their (financial) contribution across the years surveyed:

In [22]:

data.groupby(wealth_score).sum()

Out[22]:

          gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967
0.000000    36916.854200    46110.918793    56850.065437    71324.848786
0.333333    16790.046878    20942.456800    25744.935321    33567.667670
0.500000    11807.544405    14505.000150    18380.449470    21421.846200
1.000000   104317.277560   127332.008735   149989.154201   178000.350040

          gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987
0.000000    88569.346898   104459.358438   113553.768507   119649.599409
0.333333    45277.839976    53860.456750    59679.634020    64436.912960
0.500000    25377.727380    29056.145370    31914.712050    35517.678220
1.000000   215162.343140   241143.412730   263388.781960   296825.131210

          gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  gdpPercap_2007
0.000000    92380.047256   103772.937598   118590.929863   149577.357928
0.333333    67918.093220    80876.051580   102086.795210   122803.729520
0.500000    36310.666080    40723.538700    45564.308390    51403.028210
1.000000   315238.235970   346930.926170   385109.939210   427850.333420

Exercises {#Exercises}

  1. Assume Pandas has been imported into your notebook and the Gapminder GDP data for Europe has been loaded:
import pandas as pd

df = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

Write an expression to find the Per Capita GDP of Serbia in 2007.

In [26]:

df = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_europe.csv', index_col='country')
print(df.loc['Serbia','gdpPercap_2007'])

9786.534714

In [27]:

df.loc["Serbia"][-1]

Out[27]:

9786.534714
  2. Explain in simple terms what idxmin and idxmax do in the short program below. When would you use these methods?
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.idxmin())
print(data.idxmax())

In [28]:

print(data.idxmin())

gdpPercap_1952    Bosnia and Herzegovina
gdpPercap_1957    Bosnia and Herzegovina
gdpPercap_1962    Bosnia and Herzegovina
gdpPercap_1967    Bosnia and Herzegovina
gdpPercap_1972    Bosnia and Herzegovina
gdpPercap_1977    Bosnia and Herzegovina
gdpPercap_1982                   Albania
gdpPercap_1987                   Albania
gdpPercap_1992                   Albania
gdpPercap_1997                   Albania
gdpPercap_2002                   Albania
gdpPercap_2007                   Albania
dtype: object

Key Points {#Key-Points}

  • Use DataFrame.iloc[..., ...] to select values by integer location.

  • Use : on its own to mean all columns or all rows.

  • Select multiple columns or rows using DataFrame.loc and a named slice.

  • Result of slicing can be used in further operations.

  • Use comparisons to select data based on value.

  • Select values or NaN using a Boolean mask.

Data prep with Pandas {#Data-prep-with-Pandas}

20 min

In [29]:

import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)

In [3]:

df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split()) 
df

Out[3]:

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

  • Create new columns

In [41]:

# Create a new column from existing columns
df['newColumn'] = df['X'] + df['Z']
df

Out[41]:

          W         X         Y         Z  newColumn
A  2.706850  0.628133  0.907969  0.503826   1.131958
B  0.651118 -0.319318 -0.848077  0.605965   0.286647
C -2.018168  0.740122  0.528813 -0.589001   0.151122
D  0.188695 -0.758872 -0.933237  0.955057   0.196184
E  0.190794  1.978757  2.605967  0.683509   2.662266

  • Reorder columns in a data frame

In [5]:

df = df[['newColumn', 'W', 'X', 'Y', 'Z']]
df

Out[5]:

   newColumn         W         X         Y         Z
A   1.131958  2.706850  0.628133  0.907969  0.503826
B   0.286647  0.651118 -0.319318 -0.848077  0.605965
C   0.151122 -2.018168  0.740122  0.528813 -0.589001
D   0.196184  0.188695 -0.758872 -0.933237  0.955057
E   2.662266  0.190794  1.978757  2.605967  0.683509

Group by {#Group-by}

The groupby method allows you to group the rows of a data frame and apply a function to each group.

In [65]:

#Let's create a DF
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
print(df)

#Group by company

by_comp = df.groupby("Company")
#by_comp

# Try some functions
by_comp.mean()
by_comp.count()
by_comp.describe()
by_comp.describe().transpose()

  Company   Person  Sales
0    GOOG      Sam    200
1    GOOG  Charlie    120
2    MSFT      Amy    340
3    MSFT  Vanessa    124
4      FB     Carl    243
5      FB    Sarah    350

Out[65]:

Company               FB        GOOG        MSFT
Sales count     2.000000    2.000000    2.000000
      mean    296.500000  160.000000  232.000000
      std      75.660426   56.568542  152.735065
      min     243.000000  120.000000  124.000000
      25%     269.750000  140.000000  178.000000
      50%     296.500000  160.000000  232.000000
      75%     323.250000  180.000000  286.000000
      max     350.000000  200.000000  340.000000

We can also merge data from different DataFrames.

It's very useful when we need a variable that comes from a different file.

The how argument controls the type of join: 'left', 'right', 'outer', or 'inner'.


In [56]:

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'C': ['C0', 'C1', 'C2', 'C3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
print(right)
## Merge
pd.merge(left, right, how='outer', on=['key'])

    A   B   C key
0  A0  B0  C0  K0
1  A1  B1  C1  K1
2  A2  B2  C2  K2
3  A3  B3  C3  K3
    C   D key
0  C0  D0  K0
1  C1  D1  K1
2  C2  D2  K2
3  C3  D3  K3

Out[56]:

    A   B C_x key C_y   D
0  A0  B0  C0  K0  C0  D0
1  A1  B1  C1  K1  C1  D1
2  A2  B2  C2  K2  C2  D2
3  A3  B3  C3  K3  C3  D3
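To compare the join types, here is a minimal sketch reusing the left and right frames defined above. In this small example every key matches, so all four give the same rows; the difference shows up as soon as some keys appear in only one frame:

print(pd.merge(left, right, how='inner', on='key'))  # keep only keys present in both frames
print(pd.merge(left, right, how='left', on='key'))   # keep every key from left; missing right-hand values become NaN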

Join (union) {#Join-(union)}

In [58]:

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2'])

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

In [59]:

 left.join(right)

Out[59]:

     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2

In [60]:

right.join(left)

Out[60]:

     C   D    A    B
K0  C0  D0   A0   B0
K2  C2  D2   A2   B2
K3  C3  D3  NaN  NaN

In [61]:

left.join(right, how='outer')

Out[61]:

      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3

Some additional operations you can use with a pandas data frame

  • unique: returns unique values in a series.
  • nunique: returns the number of distinct observations over requested axis.
  • value_counts: returns an object containing counts of unique values in sorted order.

In [62]:

df['Company'].unique()

Out[62]:

array(['GOOG', 'MSFT', 'FB'], dtype=object)

In [63]:

df['Company'].nunique()

Out[63]:

3

In [20]:

df['Company'].value_counts()

Out[20]:

FB      2
GOOG    2
MSFT    2
Name: Company, dtype: int64

There are some other very useful tricks you can do with pandas data frames, such as profiling a dataframe. df.profile_report() (from the pandas-profiling package) is a simple and easy way to go further into knowing your data.

In [0]:

#Install
#pip install pandas-profiling

In [0]:

from google.colab import files  # Google Colab only
uploaded = files.upload()

In [0]:

import pandas as pd
import pandas_profiling
import io

data = pd.read_csv(io.BytesIO(uploaded['gapminder_

In [0]:

print(data.iloc[:,1:3])

Note: There are many other methods we can use to explore the data; pandas profiling makes exploration of a data set quicker and more effective.

Check this out!

In [0]:

pandas_profiling.ProfileReport(data.iloc[:,0:6])

Some other useful tools to work with data frames {#Some-other-useful-tools-to-work-with-data-frames}

When you are working with large data frames you might want to know if there are missing values and how many are there.

  • .isna() will create a table with booleans.
    • True if a value is NaN

In [67]:

df.isna().head()

Out[67]:

   Company  Person  Sales
0    False   False  False
1    False   False  False
2    False   False  False
3    False   False  False
4    False   False  False

You can count how many NaN values you have per variable:

In [68]:

df.isna().sum()

Out[68]:

Company    0
Person     0
Sales      0
dtype: int64

In [69]:

df1 = df.copy()

You can discard these values

In [71]:

df.dropna(axis=0)  # drop rows that contain NaN values
df.dropna(axis=1)  # drop columns that contain NaN values

Out[71]:

  Company   Person  Sales
0    GOOG      Sam    200
1    GOOG  Charlie    120
2    MSFT      Amy    340
3    MSFT  Vanessa    124
4      FB     Carl    243
5      FB    Sarah    350

Standardize and resize data directly in the dataframe {#Standardize-and-resize-data-directly-in-the-dataframe}

Here we could do it manually (if you like doing things that way), but we can also use methods that already exist.

For example ScikitLearn provides:

  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib

The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

In general, learning algorithms benefit from standardization of the data set. If some outliers are present in the set, robust scalers or transformers are more appropriate.

In [73]:

from sklearn import preprocessing
#Save columns names
names = data.iloc[:,2:8].columns
#Create scaler 
scaler = preprocessing.MinMaxScaler() #StandardScaler() #MaxAbsScaler

#Transform your data frame (numeric variables )
data1 = data.iloc[:,2:8]
data1 = scaler.fit_transform(data1) 
data1 = pd.DataFrame(data1, columns=names) 
print(data1.head())
print(data.iloc[:,2:8].head())

   gdpPercap_1962  gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  \
0        0.032220        0.028270        0.018626        0.000193   
1        0.482925        0.512761        0.567146        0.691612   
2        0.495771        0.527883        0.567578        0.664689   
3        0.000000        0.000000        0.000000        0.000000   
4        0.135922        0.163734        0.153579        0.174119

   gdpPercap_1982  gdpPercap_1987  
0        0.000000        0.000000  
1        0.725414        0.717533  
2        0.700492        0.675728  
3        0.020016        0.020688  
4        0.185462        0.161892  
                        gdpPercap_1962  gdpPercap_1967  gdpPercap_1972  \
country                                                                  
Albania                    2312.888958     2760.196931     3313.422188   
Austria                   10750.721110    12834.602400    16661.625600   
Belgium                   10991.206760    13149.041190    16672.143560   
Bosnia and Herzegovina     1709.683679     2172.352423     2860.169750   
Bulgaria                   4254.337839     5577.002800     6597.494398

                        gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  
country                                                                 
Albania                    3533.003910     3630.880722     3738.932735  
Austria                   19749.422300    21597.083620    23687.826070  
Belgium                   19117.974480    20979.845890    22525.563080  
Bosnia and Herzegovina     3528.481305     4126.613157     4314.114757  
Bulgaria                   7612.240438     8224.191647     8239.854824

Exercise {#Exercise}

With the file gapminder_all.csv try to:

  1. Filter only those countries located in Latin America.
  2. Select the columns corresponding to the gdpPercap and the population
  3. Explore the data frame using 3 different methods.
  4. Show how many countries had a gdpPercap higher than the mean in 1977.
  5. Check if there are some missing values (NaN) in the data

Lists {#Lists}

15 min

Exercises (10 min)

A list stores many values in a single structure.

  • Doing calculations with a hundred variables called pressure_001, pressure_002, etc., would be at least as slow as doing them by hand.

  • Use a list to store many values together.

    • Contained within square brackets [...].
    • Values separated by commas ,.

  • Use len to find out how many values are in a list.

In [74]:

pressures = [0.273, 0.275, 0.277, 0.275, 0.276]
print('pressures:', pressures)
print('length:', len(pressures))

pressures: [0.273, 0.275, 0.277, 0.275, 0.276]
length: 5

Use an item’s index to fetch it from a list.

In [22]:

print('zeroth item of pressures:', pressures[0])

zeroth item of pressures: 0.273

Lists’ values can be replaced by assigning to them.

In [23]:

pressures[0] = 0.265
print('pressures is now:', pressures)

pressures is now: [0.265, 0.275, 0.277, 0.275, 0.276]

Use list_name.append to add items to the end of a list.

In [75]:

primes = [2, 3, 5]
print('primes is initially:', primes)
primes.append(7)
#primes.append(9)
#print('primes has become:', primes)

primes is initially: [2, 3, 5]

Use del to remove items from a list entirely.

In [76]:

primes = [2, 3, 5, 7, 9]
print('primes before removing last item:', primes)
del primes[4]
print('primes after removing last item:', primes)

primes before removing last item: [2, 3, 5, 7, 9]
primes after removing last item: [2, 3, 5, 7]

The empty list contains no values.

  • Use [ ] on its own to represent a list that doesn’t contain any values.

Lists may contain values of different types.

In [26]:

goals = [1, 'Create lists.', 2, 'Extract items from lists.', 3, 'Modify lists.']

Character strings can be indexed like lists.

In [27]:

element = 'carbon'
print('zeroth character:', element[0])
print('third character:', element[3])

zeroth character: c
third character: b

Character strings are immutable.

  • Cannot change the characters in a string after it has been created.
    • Immutable: can’t be changed after creation.
    • In contrast, lists are mutable: they can be modified in place.
  • Python considers the string to be a single value with parts, not a collection of values.

In [28]:

element[0] = 'C'

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-6dc46761ce07> in <module>()
----> 1 element[0] = 'C'

TypeError: 'str' object does not support item assignment

Exercises {#Exercises}

Given this:

print('string to list:', list('tin'))
print('list to string:', ''.join(['g', 'o', 'l', 'd']))
  1. What does list('some string') do?
  2. What does '-'.join(['x', 'y', 'z']) generate?

In [77]:

print(list('YaQueremosComer'))

['Y', 'a', 'Q', 'u', 'e', 'r', 'e', 'm', 'o', 's', 'C', 'o', 'm', 'e', 'r']

What does the following program print?

element = 'helium'
print(element[-1])
  1. How does Python interpret a negative index?
  2. If a list or string has N elements, what is the most negative index that can safely be used with it, and what location does that index represent?
  3. If values is a list, what does del values[-1] do?
  4. How can you display all elements but the last one without changing values? (Hint: you will need to combine slicing and negative indexing.)

What does the following program print?

element = 'fluorine'
print(element[::2])
print(element[::-1])
  1. If we write a slice as low:high:stride, what does stride do?
  2. What expression would select all of the even-numbered items from a collection?

Key Points {#Key-Points}

  • A list stores many values in a single structure.

  • Use an item’s index to fetch it from a list.

  • Lists’ values can be replaced by assigning to them.

  • Appending items to a list lengthens it.

  • Use del to remove items from a list entirely.

  • The empty list contains no values.

  • Lists may contain values of different types.

  • Character strings can be indexed like lists.

  • Character strings are immutable.

  • Indexing beyond the end of the collection is an error.

For Loops {#For-Loops}

10 min

Exercises (15 min)

A for loop executes commands once for each value in a collection.

“for each thing in this group, do these operations”

In [78]:

for number in [2, 3, 5]:
    print(number)

2
3
5
  • This for loop is equivalent to:

In [30]:

print(2)
print(3)
print(5)

2
3
5

A for loop is made up of a collection, loop variable and a body.

Parts of a for loop

  • The collection, [2, 3, 5], is what the loop is being run on.
  • The body, print(number), specifies what to do for each value in the collection.
  • The loop variable, number, is what changes for each iteration of the loop.

    • The “current thing”.
    • Python uses indentation rather than {} or begin/end to show nesting.
  • Use range to iterate over a sequence of numbers.

The first line of the for loop must end with a colon, and the body must be indented.

  • The colon at the end of the first line signals the start of a block of statements.
  • Python uses indentation rather than {} or begin/end to show nesting.
    • Any consistent indentation is legal, but almost everyone uses four spaces.

In [80]:

for number in [2, 3, 5]:
    print(number)

2
3
5

Indentation is always meaningful in Python.

In [81]:

firstName = "Jon"
  lastName = "Smith"

  File "<ipython-input-81-6966a7c3a64d>", line 2
    lastName = "Smith"
    ^
IndentationError: unexpected indent

Loop variables can be called anything.

  • As with all variables, loop variables are:
    • Created on demand.
    • Meaningless: their names can be anything at all.

In [33]:

for kitten in [2, 3, 5]:
    print(kitten)

2
3
5

The body of a loop can contain many statements.

  • But no loop should be more than a few lines long.
  • Hard for human beings to keep larger chunks of code in mind.

In [82]:

primes = [2, 3, 5]
for p in primes:
    squared = p ** 2
    cubed = p ** 3
    print(p, squared, cubed)

2 4 8
3 9 27
5 25 125

Use range to iterate over a sequence of numbers.

  • The built-in function range produces a sequence of numbers. Not a list: the numbers are produced on demand to make looping over large ranges more efficient.
  • range(N) is the numbers 0..N-1
    • Exactly the legal indices of a list or character string of length N

In [83]:

print('a range is not a list: range(0, 3)')
for number in range(0, 3):
    print(number)

a range is not a list: range(0, 3)
0
1
2

The Accumulator pattern turns many values into one.

  • Initialize an accumulator variable to zero, the empty string, or the empty list.

In [86]:

total = 0
for number in range(10):
    total = total + (number + 1)
    print(total)

1
3
6
10
15
21
28
36
45
55

Exercises {#Exercises}

Create a table showing the numbers of the lines that are executed when this program runs, and the values of the variables after each line is executed.

total = 0
for char in "tin":
    total = total + 1

Fill in the blanks in the program below so that it prints “nit” (the reverse of the original character string “tin”).

original = "tin"
result = ____
for char in original:
    result = ____
print(result)

In [0]:

original = "tin"
result = ""
for char in original:
    result = char + result
    print(result)

t
it
nit

Fill in the blanks in each of the programs below to produce the indicated result.

In [0]:

# Total length of the strings in the list: ["red", "green", "blue"] => 12
total = 0
for word in ["red", "green", "blue"]:
    ____ = ____ + len(word)
print(total)

In [87]:

# List of word lengths: ["red", "green", "blue"] => [3, 5, 4]
lengths = []
for word in ["red", "green", "blue"]:
    lengths.append(len(word))
print(lengths)

[3, 5, 4]

In [0]:

# Concatenate all words: ["red", "green", "blue"] => "redgreenblue"
words = ["red", "green", "blue"]
result = ____
for ____ in ____:
    ____
print(result)

In [0]:

# Create acronym: ["red", "green", "blue"] => "RGB"
# write the whole thing

Find the error in the following code:

students = ['Ana', 'Juan', 'Susan']
for m in students:
print(m)

Cumulative sum. Reorder and properly indent the lines of code below so that they print a list with the cumulative sum of data. The result should be [1, 3, 5, 10].

In [0]:

cumulative.append(sum)
for number in data:
cumulative = []
sum += number
sum = 0
print(cumulative)
data = [1,2,2,5]

Key Points {#Key-Points}

  • A for loop executes commands once for each value in a collection.

  • A for loop is made up of a collection, a loop variable, and a body.

  • The first line of the for loop must end with a colon, and the body must be indented.

  • Indentation is always meaningful in Python.

  • Loop variables can be called anything (but it is strongly advised to have a meaningful name to the looping variable).

  • The body of a loop can contain many statements.

  • Use range to iterate over a sequence of numbers.

  • The Accumulator pattern turns many values into one.

Looping Over Data Sets {#Looping-Over-Data-Sets}

5 min

Exercises (10 min)

Use a for loop to process files given a list of their names.

In [88]:

import pandas as pd
for filename in ['/home/mcubero/dataSanJose19/data/gapminder_gdp_africa.csv', '/home/mcubero/dataSanJose19/data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())

/home/mcubero/dataSanJose19/data/gapminder_gdp_africa.csv gdpPercap_1952    298.846212
gdpPercap_1957    335.997115
gdpPercap_1962    355.203227
gdpPercap_1967    412.977514
gdpPercap_1972    464.099504
gdpPercap_1977    502.319733
gdpPercap_1982    462.211415
gdpPercap_1987    389.876185
gdpPercap_1992    410.896824
gdpPercap_1997    312.188423
gdpPercap_2002    241.165877
gdpPercap_2007    277.551859
dtype: float64
/home/mcubero/dataSanJose19/data/gapminder_gdp_asia.csv gdpPercap_1952    331.0
gdpPercap_1957    350.0
gdpPercap_1962    388.0
gdpPercap_1967    349.0
gdpPercap_1972    357.0
gdpPercap_1977    371.0
gdpPercap_1982    424.0
gdpPercap_1987    385.0
gdpPercap_1992    347.0
gdpPercap_1997    415.0
gdpPercap_2002    611.0
gdpPercap_2007    944.0
dtype: float64

Use glob.glob to find sets of files whose names match a pattern.

  • In Unix, the term “globbing” means “matching a set of files with a pattern”.

  • '*' meaning “match zero or more characters”

  • Python contains the glob library to provide pattern matching functionality.

In [90]:

import glob
print('all csv files in data directory:', glob.glob('/home/mcubero/dataSanJose19/data/*.csv'))

all csv files in data directory: ['/home/mcubero/dataSanJose19/data/gapminder_all.csv', '/home/mcubero/dataSanJose19/data/gapminder_gdp_africa.csv', '/home/mcubero/dataSanJose19/data/gapminder_gdp_americas.csv', '/home/mcubero/dataSanJose19/data/gapminder_gdp_asia.csv', '/home/mcubero/dataSanJose19/data/gapminder_gdp_europe.csv', '/home/mcubero/dataSanJose19/data/gapminder_gdp_oceania.csv', '/home/mcubero/dataSanJose19/data/processed.csv']

In [91]:

print('all PDB files:', glob.glob('*.pdb'))

all PDB files: []

Use glob and for to process batches of files.

In [92]:

for filename in glob.glob('/home/mcubero/dataSanJose19/data/gapminder_*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())

/home/mcubero/dataSanJose19/data/gapminder_all.csv 298.8462121
/home/mcubero/dataSanJose19/data/gapminder_gdp_africa.csv 298.8462121
/home/mcubero/dataSanJose19/data/gapminder_gdp_americas.csv 1397.7171369999999
/home/mcubero/dataSanJose19/data/gapminder_gdp_asia.csv 331.0
/home/mcubero/dataSanJose19/data/gapminder_gdp_europe.csv 973.5331947999999
/home/mcubero/dataSanJose19/data/gapminder_gdp_oceania.csv 10039.595640000001

Exercises {#Exercises}

Which of these files is not matched by the expression glob.glob('data/*as*.csv')?

  1. data/gapminder_gdp_africa.csv
  2. data/gapminder_gdp_americas.csv
  3. data/gapminder_gdp_asia.csv
  4. 1 and 2 are not matched.

Key Points {#Key-Points}

  • Use a for loop to process files given a list of their names.

  • Use glob.glob to find sets of files whose names match a pattern.

  • Use glob and for to process batches of files.

STRETCHING TIME {#STRETCHING-TIME!}

Writing functions {#Writing-functions}

15 min

Exercises (20 min)

  • Break programs down into functions to make them easier to understand.

    • Human beings can only keep a few items in working memory at a time.
    • Encapsulate complexity so that we can treat it as a single “thing”.
  • Write one time, use many times.

To define a function use def then the name of the function like this:

def say_hi(parameter1, parameter2): 
  print('Hello')

Remember, defining a function does not run it, you must call the function to execute it.

In [93]:

def print_date(year, month, day):
    joined = str(year) + '/' + str(month) + '/' + str(day)
    print(joined)

print_date(1871, 3, 19)

1871/3/19

In [43]:

print_date(month=3, day=19, year=1871)

1871/3/19
  • Use return ... to give a value back to the caller.
  • May occur anywhere in the function.

In [94]:

def average(values):
    if len(values) == 0:
        return None
    return sum(values) / len(values)

Remember: every function returns something

  • A function that doesn’t explicitly return a value automatically returns None.
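A minimal sketch of that behaviour:

def greet(name):
    print('Hello', name)   # prints something, but has no return statement

result = greet('Ada')      # prints: Hello Ada
print(result)              # prints: None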

Exercises {#Exercises}

What does the following program print?

def report(pressure):
    print('pressure is', pressure)

print('calling', report, 22.5)

In [96]:

def report(pressure):
    print('pressure is', pressure)

print('calling', report(22.5))

pressure is 22.5
calling None

Fill in the blanks to create a function that takes a single filename as an argument, loads the data in the file named by the argument, and returns the minimum value in that data.

import pandas as pd

def min_in_data(____):
    data = ____
    return ____

In [98]:

import pandas as pd

def min_in_data(data):
    data = pd.read_csv(data)
    return data.min()
min_in_data('/home/mcubero/dataSanJose19/data/gapminder_gdp_africa.csv')

Out[98]:

country           Algeria
gdpPercap_1952    298.846
gdpPercap_1957    335.997
gdpPercap_1962    355.203
gdpPercap_1967    412.978
gdpPercap_1972      464.1
gdpPercap_1977     502.32
gdpPercap_1982    462.211
gdpPercap_1987    389.876
gdpPercap_1992    410.897
gdpPercap_1997    312.188
gdpPercap_2002    241.166
gdpPercap_2007    277.552
dtype: object

The code below will run on a label-printer for chicken eggs. A digital scale will report a chicken egg mass (in grams) to the computer and then the computer will print a label.

Please re-write the code so that the if-block is folded into a function.

import random
for i in range(10):

    # simulating the mass of a chicken egg
    # the (random) mass will be 70 +/- 20 grams
    mass=70+20.0*(2.0*random.random()-1.0)

    print(mass)

    #egg sizing machinery prints a label
    if(mass>=85):
       print("jumbo")
    elif(mass>=70):
       print("large")
    elif(mass<70 and mass>=55):
       print("medium")
    else:
       print("small")

Assume that the following code has been executed:

In [46]:

import pandas as pd

df = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_asia.csv', index_col=0)
japan = df.loc['Japan']
japan

Out[46]:

gdpPercap_1952     3216.956347
gdpPercap_1957     4317.694365
gdpPercap_1962     6576.649461
gdpPercap_1967     9847.788607
gdpPercap_1972    14778.786360
gdpPercap_1977    16610.377010
gdpPercap_1982    19384.105710
gdpPercap_1987    22375.941890
gdpPercap_1992    26824.895110
gdpPercap_1997    28816.584990
gdpPercap_2002    28604.591900
gdpPercap_2007    31656.068060
Name: Japan, dtype: float64

1. Complete the statements below to obtain the average GDP for Japan across the years reported for the 1980s.

year = 1983
gdp_decade = 'gdpPercap_' + str(year // ____)
avg = (japan.loc[gdp_decade + ___] + japan.loc[gdp_decade + ___]) / 2

2. Abstract the code above into a single function.

def avg_gdp_in_decade(country, continent, year):
    df = pd.read_csv('data/gapminder_gdp_'+___+'.csv',delimiter=',',index_col=0)
    ____
    ____
    ____
    return avg
  3. How would you generalize this function if you did not know beforehand which specific years occurred as columns in the data? For instance, what if we also had data from years ending in 1 and 9 for each decade? (Hint: use the columns to filter out the ones that correspond to the decade, instead of enumerating them in the code.)

Key Points {#Key-Points}

  • Break programs down into functions to make them easier to understand.

  • Define a function using def with a name, parameters, and a block of code.

  • Defining a function does not run it.

  • Arguments in call are matched to parameters in definition.

  • Functions may return a result to their caller using return.

Variable Scope {#Variable-Scope}

10 min

Exercise (10 min)

The scope of a variable is the part of a program that can ‘see’ that variable.

  • There are only so many sensible names for variables.
  • People using functions shouldn’t have to worry about what variable names the author of the function used.
  • People writing functions shouldn’t have to worry about what variable names the function’s caller uses.
  • The part of a program in which a variable is visible is called its scope.

In [99]:

pressure = 103.9

def adjust(t):
    temperature = t * 1.43 / pressure
    return temperature
  • pressure is a global variable.
    • Defined outside any particular function.
    • Visible everywhere.
  • t and temperature are local variables in adjust.
    • Defined in the function.
    • Not visible in the main program.
    • Remember: a function parameter is a variable that is automatically assigned a value when the function is called.

In [100]:

print('adjusted:', adjust(0.9))
print('temperature after call:', temperature)

adjusted: 0.01238691049085659

----------------------------------------------------------------------
NameError                            Traceback (most recent call last)
<ipython-input-100-e73c01f89950> in <module>()
      1 print('adjusted:', adjust(0.9))
----> 2 print('temperature after call:', temperature)

NameError: name 'temperature' is not defined

Exercises {#Exercises}

Trace the values of all variables in this program as it is executed. (Use ‘—’ as the value of variables before and after they exist.)

limit = 100

def clip(value):
    return min(max(0.0, value), limit)

value = -22.5
print(clip(value))

Read the traceback below, and identify the following:

  1. How many levels does the traceback have?
  2. What is the file name where the error occurred?
  3. What is the function name where the error occurred?
  4. On which line number in this function did the error occur?
  5. What is the type of error?
  6. What is the error message?
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-e4c4cbafeeb5> in <module>()
      1 import errors_02
----> 2 errors_02.print_friday_message()

/Users/ghopper/thesis/code/errors_02.py in print_friday_message()
     13
     14 def print_friday_message():
---> 15     print_message("Friday")

/Users/ghopper/thesis/code/errors_02.py in print_message(day)
      9         "sunday": "Aw, the weekend is almost over."
     10     }
---> 11     print(messages[day])
     12
     13

KeyError: 'Friday'

Key Points {#Key-Points}

  • The scope of a variable is the part of a program that can ‘see’ that variable.

Conditionals {#Conditionals}

15 min

Exercise (15 min)

Use if statements to control whether or not a block of code is executed.

  • An if statement (more properly called a conditional statement) controls whether some block of code is executed or not.
  • Structure is similar to a for statement:
    • First line opens with if and ends with a colon
    • Body containing one or more statements is indented (usually by 4 spaces)

In [52]:

mass = 2.07

if mass > 3.0:
    print (mass, 'is large')

In [102]:

masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')

3.54 is large
2.07 is small
9.22 is large
1.86 is small
1.71 is small

In [104]:

thing1 = [3.54, 2.07, 9.22]
# list comparison is lexicographic: the first three elements are equal and
# masses has extra elements, so masses > thing1 evaluates to True here
if masses > thing1:
    print (masses, 'is large')

[3.54, 2.07, 9.22, 1.86, 1.71] is large

Conditionals are often used inside loops.

  • Not much point using a conditional when we know the value (as above).
  • But useful when we have a collection to process.

In [54]:

masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')

3.54 is large
9.22 is large

Use else to execute a block of code when an if condition is not true.

  • else can be used following an if.
  • Allows us to specify an alternative to execute when the if branch isn’t taken.

In [55]:

masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')

3.54 is large
2.07 is small
9.22 is large
1.86 is small
1.71 is small

Use elif to specify additional tests.

  • May want to provide several alternative choices, each with its own test.
  • Use elif (short for “else if”) and a condition to specify these.
  • Always associated with an if.
  • Must come before the else (which is the “catch all”).

  • Complete the next conditional

In [56]:

masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in ____:
    if m > 9.0:
        print(__, 'is HUGE')
    elif m > 3.0:
        print(m, 'is large')
    ___:
        print(m, 'is small')

  File "<ipython-input-56-97e8fa260561>", line 7
    ___:
        ^
SyntaxError: invalid syntax

Conditions are tested once, in order.

  • Python steps through the branches of the conditional in order, testing each in turn.
  • So ordering matters.

In [0]:

grade = 85
if grade >= 70:
    print('grade is C')
elif grade >= 80:
    print('grade is B')
elif grade >= 90:
    print('grade is A')

grade is C
  • Does not automatically go back and re-evaluate if values change.

In [0]:

velocity = 10.0
if velocity > 20.0:
    print('moving too fast')
else:
    print('adjusting velocity')
    velocity = 50.0

Often use conditionals in a loop to “evolve” the values of variables.

In [105]:

velocity = 10.0
for i in range(5): # execute the loop 5 times
    print(i, ':', velocity)
    if velocity > 20.0:
        print('moving too fast')
        velocity = velocity - 5.0
    else:
        print('moving too slow')
        velocity = velocity + 10.0
print('final velocity:', velocity)

0 : 10.0
moving too slow
1 : 20.0
moving too slow
2 : 30.0
moving too fast
3 : 25.0
moving too fast
4 : 20.0
moving too slow
final velocity: 30.0

Conditionals are useful to check for errors!

Often, you want some combination of things to be true. You can combine relations within a conditional using and and or. Continuing the example above, suppose you have

In [0]:

mass     = [ 3.54,  2.07,  9.22,  1.86,  1.71]
velocity = [10.00, 20.00, 30.00, 25.00, 20.00]

i = 0
for i in range(5):
    if mass[i] > 5 and velocity[i] > 20:
        print("Fast heavy object.  Duck!")
    elif mass[i] > 2 and mass[i] <= 5 and velocity[i] <= 20:
        print("Normal traffic")
    elif mass[i] <= 2 and velocity[i] <= 20:
        print("Slow light object.  Ignore it")
    else:
        print("Whoa!  Something is up with the data.  Check it")

Normal traffic
Normal traffic
Fast heavy object.  Duck!
Whoa!  Something is up with the data.  Check it
Slow light object.  Ignore it

Just like with arithmetic, you can and should use parentheses whenever there is possible ambiguity. A good general rule is to always use parentheses when mixing and and or in the same condition. That is, instead of:

In [0]:

if mass[i] <= 2 or mass[i] >= 5 and velocity[i] > 20:

write one of these:

In [0]:

if (mass[i] <= 2 or mass[i] >= 5) and velocity[i] > 20:
if mass[i] <= 2 or (mass[i] >= 5 and velocity[i] > 20):

so it is perfectly clear to a reader (and to Python) what you really mean.
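A minimal sketch showing that the two groupings really are different (the values here are made up for illustration):

mass_i, velocity_i = 1.0, 10.0   # a light, slow object

print((mass_i <= 2 or mass_i >= 5) and velocity_i > 20)   # False: the speed test must also pass
print(mass_i <= 2 or (mass_i >= 5 and velocity_i > 20))   # True: being light is enough on its own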

Exercise {#Exercise}

What does this program print?

pressure = 71.9
if pressure > 50.0:
    pressure = 25.0
elif pressure <= 50.0:
    pressure = 0.0
print(pressure)

In [106]:

pressure = 71.9
if pressure > 50.0:
    pressure = 25.0
elif pressure <= 50.0:
    pressure = 0.0
print(pressure)

25.0

Trimming Values Fill in the blanks so that this program creates a new list containing zeroes where the original list’s values were negative and ones where the original list’s values were positive.

In [107]:

original = [-1.5, 0.2, 0.4, 0.0, -1.3, 0.4]
result = []
for value in original:
    if value < 0.0:
        result.append(0)
    else:
        result.append(1)
print(result)

[0, 1, 1, 1, 0, 1]
  • Modify this program so that it only processes files with fewer than 50 records.

In [108]:

import glob
import pandas as pd
for filename in glob.glob('/home/mcubero/dataSanJose19/data/*.csv'):
    contents = pd.read_csv(filename)
    if len(contents) < 50:
        print(filename, len(contents))

/home/mcubero/dataSanJose19/data/gapminder_gdp_americas.csv 25
/home/mcubero/dataSanJose19/data/gapminder_gdp_asia.csv 33
/home/mcubero/dataSanJose19/data/gapminder_gdp_europe.csv 30
/home/mcubero/dataSanJose19/data/gapminder_gdp_oceania.csv 2
/home/mcubero/dataSanJose19/data/processed.csv 2

Modify this program so that it finds the largest and smallest values in the list no matter what the range of values originally is.

values = [...some test data...]
smallest, largest = None, None
for v in values:
    if ____:
        smallest, largest = v, v
    ____:
        smallest = min(____, v)
        largest = max(____, v)
print(smallest, largest)

What are the advantages and disadvantages of using this method to find the range of the data?

  • Using functions with conditionals in Pandas

Functions will often contain conditionals. Here is a short example that will indicate which quartile the argument is in based on hand-coded values for the quartile cut points.

In [9]:

def calculate_life_quartile(exp):
    if exp < 58.41:
        # This observation is in the first quartile
        return 1
    elif exp >= 58.41 and exp < 67.05:
        # This observation is in the second quartile
       return 2
    elif exp >= 67.05 and exp < 71.70:
        # This observation is in the third quartile
       return 3
    elif exp >= 71.70:
        # This observation is in the fourth quartile
       return 4
    else:
        # This observation has bad data
       return None

calculate_life_quartile(62.5)

Out[9]:

2

That function would typically be used within a for loop, but Pandas has a different, more efficient way of doing the same thing, and that is by applying a function to a dataframe or a portion of a dataframe. Here is an example, using the definition above.

In [59]:

data = pd.read_csv('/home/mcubero/dataSanJose19/data/all-Americas.csv')
data
#data['life_qrtl'] = data['lifeExp'].apply(calculate_life_quartile)

Out[59]:

   continent              country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  gdpPercap_2007
0   Americas  Argentina              5911.315053     6856.856212     7133.166023     8052.953021     9443.038526    10079.026740     8997.897412     9139.671389     9308.418710    10967.281950     8797.640716    12779.379640
1   Americas  Bolivia                2677.326347     2127.686326     2180.972546     2586.886053     2980.331339     3548.097832     3156.510452     2753.691490     2961.699694     3326.143191     3413.262690     3822.137084
2   Americas  Brazil                 2108.944355     2487.365989     3336.585802     3429.864357     4985.711467     6660.118654     7030.835878     7807.095818     6950.283021     7957.980824     8131.212843     9065.800825
3   Americas  Canada                11367.161120    12489.950060    13462.485550    16076.588030    18970.570860    22090.883060    22898.792140    26626.515030    26342.884260    28954.925890    33328.965070    36319.235010
4   Americas  Chile                  3939.978789     4315.622723     4519.094331     5106.654313     5494.024437     4756.763836     5095.665738     5547.063754     7596.125964    10118.053180    10778.783850    13171.638850
5   Americas  Colombia               2144.115096     2323.805581     2492.351109     2678.729839     3264.660041     3815.807870     4397.575659     4903.219100     5444.648617     6117.361746     5755.259962     7006.580419
6   Americas  Costa Rica             2627.009471     2990.010802     3460.937025     4161.727834     5118.146939     5926.876967     5262.734751     5629.915318     6160.416317     6677.045314     7723.447195     9645.061420
7   Americas  Cuba                   5586.538780     6092.174359     5180.755910     5690.268015     5305.445256     6380.494966     7316.918107     7532.924763     5592.843963     5431.990415     6340.646683     8948.102923
8   Americas  Dominican Republic     1397.717137     1544.402995     1662.137359     1653.723003     2189.874499     2681.988900     2861.092386     2899.842175     3044.214214     3614.101285     4563.808154     6025.374752
9   Americas  Ecuador                3522.110717     3780.546651     4086.114078     4579.074215     5280.994710     6679.623260     7213.791267     6481.776993     7103.702595     7429.455877     5773.044512     6873.262326
10  Americas  El Salvador            3048.302900     3421.523218     3776.803627     4358.595393     4520.246008     5138.922374     4098.344175     4140.442097     4444.231700     5154.825496     5351.568666     5728.353514
11  Americas  Guatemala              2428.237769     2617.155967     2750.364446     3242.531147     4031.408271     4879.992748     4820.494790     4246.485974     4439.450840     4684.313807     4858.347495     5186.050003
12  Americas  Haiti                  1840.366939     1726.887882     1796.589032     1452.057666     1654.456946     1874.298931     2011.159549     1823.015995     1456.309517     1341.726931     1270.364932     1201.637154
13  Americas  Honduras               2194.926204     2220.487682     2291.156835     2538.269358     2529.842345     3203.208066     3121.760794     3023.096699     3081.694603     3160.454906     3099.728660     3548.330846
14  Americas  Jamaica                2898.530881     4756.525781     5246.107524     6124.703451     7433.889293     6650.195573     6068.051350     6351.237495     7404.923685     7121.924704     6994.774861     7320.880262
15  Americas  Mexico                 3478.125529     4131.546641     4581.609385     5754.733883     6809.406690     7674.929108     9611.147541     8688.156003     9472.384295     9767.297530    10742.440530    11977.574960
16  Americas  Nicaragua              3112.363948     3457.415947     3634.364406     4643.393534     4688.593267     5486.371089     3470.338156     2955.984375     2170.151724     2253.023004     2474.548819     2749.320965
17  Americas  Panama                 2480.380334     2961.800905     3536.540301     4421.009084     5364.249663     5351.912144     7009.601598     7034.779161     6618.743050     7113.692252     7356.031934     9809.185636
18  Americas  Paraguay               1952.308701     2046.154706     2148.027146     2299.376311     2523.337977     3248.373311     4258.503604     3998.875695     4196.411078     4247.400261     3783.674243     4172.838464
19  Americas  Peru                   3758.523437     4245.256698     4957.037982     5788.093330     5937.827283     6281.290855     6434.501797     6360.943444     4446.380924     5838.347657     5909.020073     7408.905561
20  Americas  Puerto Rico            3081.959785     3907.156189     5108.344630     6929.277714     9123.041742     9770.524921    10330.989150    12281.341910    14641.587110    16999.433300    18855.606180    19328.709010
21  Americas  Trinidad and Tobago    3023.271928     4100.393400     4997.523971     5621.368472     6619.551419     7899.554209     9119.528607     7388.597823     7370.990932     8792.573126    11460.600230    18008.509240
22  Americas  United States         13990.482080    14847.127120    16173.145860    19530.365570    21806.035940    24072.632130    25009.559140    29884.350410    32003.932240    35767.433030    39097.099550    42951.653090
23  Americas  Uruguay                5716.766744     6150.772969     5603.357717     5444.619620     5703.408898     6504.339663     6920.223051     7452.398969     8137.004775     9230.240708     7727.002004    10611.462990
24  Americas  Venezuela              7689.799761     9802.466526     8422.974165     9541.474188    10505.259660    13143.950950    11152.410110     9883.584648    10733.926310    10165.495180     8605.047831    11415.805690
There is a lot in that second line, so let's take it piece by piece. On the right side of the = we start with data['lifeExp'], which is the column of the dataframe called data labeled lifeExp. We use apply() to do what it says: apply calculate_life_quartile to the value of this column for every row in the dataframe.
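A minimal sketch of the same pattern on a small hand-made Series (so it does not depend on which columns the CSV file actually contains):

life_exp = pd.Series([55.0, 62.5, 70.1, 78.3], index=['A', 'B', 'C', 'D'])
print(life_exp.apply(calculate_life_quartile))
# A    1
# B    2
# C    3
# D    4
# dtype: int64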

Key Points {#Key-Points}

  • Use if statements to control whether or not a block of code is executed.

  • Conditionals are often used inside loops.

  • Use else to execute a block of code when an if condition is not true.

  • Use elif to specify additional tests.

  • Conditions are tested once, in order.

  • Create a table showing variables’ values to trace a program’s execution.

Plotting {#Plotting}

20 min

Exercises (15 min)

We are going to use matplotlib.

matplotlib is the most widely used scientific plotting library in Python.

  • Commonly use a sub-library called matplotlib.pyplot.
  • The Jupyter Notebook will render plots inline if we ask it to using a “magic” command.

In [61]:

#%matplotlib inline
import matplotlib.pyplot as plt
  • Simple plots are then (fairly) simple to create

In [62]:

time = [0, 1, 2, 3]
position = [0, 100, 200, 300]

plt.plot(time, position)
plt.xlabel('Time (hr)')
plt.ylabel('Position (km)')

Out[62]:

Text(0,0.5,'Position (km)')

Plot data directly from a Pandas dataframe {#Plot-data-directly-from-a-Pandas-dataframe}

  • We can also plot Pandas dataframes.
  • This implicitly uses matplotlib.pyplot.
  • Before plotting, we convert the column headings from a string to integer data type, since they represent numerical values

In [64]:

import pandas as pd

data = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_oceania.csv', index_col='country')

# Extract year from last 4 characters of each column name
years = data.columns.str.strip('gdpPercap_')
# Convert year values to integers, saving results back to dataframe
data.columns = years.astype(int)

data.loc['Australia'].plot()

Out[64]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fdc600bc588>

Select and transform data, then plot it {#Select-and-transform-data,-then-plot-it}

  • By default, DataFrame.plot plots with the rows as the X axis.
  • We can transpose the data in order to plot multiple series.

In [65]:

data.T.plot()
plt.ylabel('GDP per capita')

Out[65]:

Text(0,0.5,'GDP per capita')

Many styles of plot are available. {#Many-styles-of-plot-are-available.}

  • For example, do a bar plot using a fancier style.

In [66]:

plt.style.use('ggplot')
data.T.plot(kind='bar')
plt.ylabel('GDP per capita')

Out[66]:

Text(0,0.5,'GDP per capita')

Data can also be plotted by calling the matplotlib plot function directly. {#Data-can-also-be-plotted-by-calling-the-matplotlib-plot-function-directly.}

  • The command is plt.plot(x, y)
  • The color / format of markers can also be specified as an optional argument: e.g. 'b-' is a blue line, 'g--' is a green dashed line.

Get Australia data from dataframe

In [67]:

years = data.columns
gdp_australia = data.loc['Australia']

plt.plot(years, gdp_australia, 'g--')

Out[67]:

[<matplotlib.lines.Line2D at 0x7fdc5fece550>]

Can plot many sets of data together.

In [68]:

# Select two countries' worth of data.
gdp_australia = data.loc['Australia']
gdp_nz = data.loc['New Zealand']

# Plot with differently-colored markers.
plt.plot(years, gdp_australia, 'b-', label='Australia')
plt.plot(years, gdp_nz, 'g-', label='New Zealand')

# Create legend.
plt.legend(loc='upper left')
plt.xlabel('Year')
plt.ylabel('GDP per capita ($)')

Out[68]:

Text(0,0.5,'GDP per capita ($)')

Add a legend {#Add-a-legend}

Often when plotting multiple datasets on the same figure it is desirable to have a legend describing the data.

This can be done in matplotlib in two stages:

  • Provide a label for each dataset in the figure:

In [69]:

plt.plot(years, gdp_australia, label='Australia')
plt.plot(years, gdp_nz, label='New Zealand')

Out[69]:

[<matplotlib.lines.Line2D at 0x7fdc5ff24b70>]

  • Instruct matplotlib to create the legend.

In [70]:

plt.legend()

No handles with labels found to put in legend.

Out[70]:

<matplotlib.legend.Legend at 0x7fdc5fde26d8>

(The warning above appears because plt.legend() was run in a separate cell, after the figure from the previous cell had already been displayed, so there were no labeled lines left to put in the legend; in practice, call plt.legend() in the same cell as the plotting commands.)

By default matplotlib will attempt to place the legend in a suitable position. If you would rather specify a position, this can be done with the loc= argument, e.g. to place the legend in the upper left corner of the plot, specify loc='upper left'.

  • Plot a scatter plot correlating the GDP of Australia and New Zealand
  • Use either plt.scatter or DataFrame.plot.scatter

In [71]:

plt.scatter(gdp_australia, gdp_nz)

Out[71]:

<matplotlib.collections.PathCollection at 0x7fdc5fd5f6d8>

In [72]:

data.T.plot.scatter(x = 'Australia', y = 'New Zealand')

Out[72]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fdc5fd99860>

Exercises {#Exercises}

Fill in the blanks below to plot the minimum GDP per capita over time for all the countries in Europe. Modify it again to plot the maximum GDP per capita over time for Europe.

In [73]:

data_europe = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_europe.csv', index_col='country')
data_europe.____.plot(label='min')
data_europe.____
plt.legend(loc='best')
plt.xticks(rotation=90)

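One possible way to fill in the blanks, assuming the intent is to plot the column-wise minimum and maximum over all European countries:

In [ ]:

data_europe = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_europe.csv', index_col='country')
data_europe.min().plot(label='min')   # smallest GDP per capita in each year
data_europe.max().plot(label='max')   # largest GDP per capita in each year
plt.legend(loc='best')
plt.xticks(rotation=90)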

Modify the example in the notes to create a scatter plot showing the relationship between the minimum and maximum GDP per capita among the countries in Asia for each year in the data set. What relationship do you see (if any)?

In [0]:

data_asia = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_asia.csv', index_col='country')
data_asia.describe().T.plot(kind='scatter', x='min', y='max')

You might note that the variability in the maximum is much higher than that of the minimum. Take a look at the maximum values and at the countries where the maxima and minima occur (idxmax and idxmin):

In [0]:

data_asia = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_asia.csv', index_col='country')
data_asia.max().plot()
print(data_asia.idxmax())
print(data_asia.idxmin())

Saving your plot to a file {#Saving-your-plot-to-a-file}

If you are satisfied with the plot you see you may want to save it to a file, perhaps to include it in a publication. There is a function in the matplotlib.pyplot module that accomplishes this: savefig. Calling this function, e.g. with

In [0]:

plt.savefig('my_figure.png')

will save the current figure to the file my_figure.png. The file format will automatically be deduced from the file name extension (other formats are pdf, ps, eps and svg).

Note that functions in plt refer to a global figure variable and after a figure has been displayed to the screen (e.g. with plt.show) matplotlib will make this variable refer to a new empty figure. Therefore, make sure you call plt.savefig before the plot is displayed to the screen, otherwise you may find a file with an empty plot.
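
For instance, a minimal sketch reusing the time and position lists from earlier: draw the plot, save it, and only then display it.

In [ ]:

plt.plot(time, position)
plt.xlabel('Time (hr)')
plt.ylabel('Position (km)')
plt.savefig('my_figure.png')   # save while the figure still contains the plot
plt.show()                     # displaying it afterwards is then safe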

When using dataframes, data is often generated and plotted to screen in one line, and plt.savefig seems not to be a possible approach. One possibility to save the figure to file is then to

  • save a reference to the current figure in a local variable (with plt.gcf) right after the data has been plotted
  • call the savefig method from that variable.

In [0]:

data.plot(kind='bar')
fig = plt.gcf()  # get a reference to the current figure
fig.savefig('my_figure.png')

Making your plots accessible {#Making-your-plots-accessible}

Whenever you are generating plots to go into a paper or a presentation, there are a few things you can do to make sure that everyone can understand your plots; a short sketch combining these tips appears after the list below.

  • Always make sure your text is large enough to read. Use the fontsize parameter in xlabel, ylabel, title, and legend, and tick_params with labelsize to increase the text size of the numbers on your axes.
  • Similarly, you should make your graph elements easy to see. Use s to increase the size of your scatterplot markers and linewidth to increase the sizes of your plot lines.
  • Using color (and nothing else) to distinguish between different plot elements will make your plots unreadable to anyone who is colorblind, or who happens to have a black-and-white office printer. For lines, the linestyle parameter lets you use different types of lines. For scatterplots, marker lets you change the shape of your points. If you’re unsure about your colors, you can use Coblis or Color Oracle to simulate what your plots would look like to those with colorblindness.
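
A minimal sketch combining these tips, reusing the years, gdp_australia and gdp_nz data from earlier (the particular sizes are only illustrative):

In [ ]:

plt.plot(years, gdp_australia, label='Australia',
         linestyle='-', linewidth=3)    # solid line, made thicker
plt.plot(years, gdp_nz, label='New Zealand',
         linestyle='--', linewidth=3)   # dashed line, so color is not the only cue
plt.xlabel('Year', fontsize=14)
plt.ylabel('GDP per capita ($)', fontsize=14)
plt.tick_params(labelsize=12)           # larger tick labels on both axes
plt.legend(fontsize=12)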

Key Points {#Key-Points}

  • matplotlib is the most widely used scientific plotting library in Python.

  • Plot data directly from a Pandas dataframe.

  • Select and transform data, then plot it.

  • Many styles of plot are available: see the Python Graph Gallery for more options.

  • Can plot many sets of data together.

Programming Style {#Programming-Style}

15 minutes

Exercises (15 min)

Coding style {#Coding-style}

Coding style helps us to understand the code better. It helps to maintain and change the code. Python relies strongly on coding style, as we may notice from the indentation we apply to lines to define different blocks of code. Python proposes a standard style through one of its first Python Enhancement Proposals (PEP), PEP8, and highlights the importance of readability in the Zen of Python.

Keep in mind:

  • document your code
  • use clear, meaningful variable names
  • use white-space, not tabs, to indent lines

Follow standard Python style in your code.

  • PEP8: a style guide for Python that discusses topics such as how you should name variables, how you should use indentation in your code, how you should structure your import statements, etc. Adhering to PEP8 makes it easier for other Python developers to read and understand your code, and to understand what their contributions should look like. The PEP8 application and Python library can check your code for compliance with PEP8.
  • The Google style guide on Python supports the use of PEP8 and extends the coding style with more specific guidance on how to structure Python code, which may also be interesting to follow. A small before-and-after sketch follows this list.
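
A small, purely illustrative sketch of what following PEP8 looks like; the function and names here are made up for this example:

In [ ]:

# Harder to read: unclear names, no whitespace, no docstring
def f(x,y):return(x*9/5)+32

# Easier to read, following PEP8: a descriptive name, whitespace around
# operators, standard indentation, and a docstring
def fahrenheit_from_celsius(temp_celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return temp_celsius * 9 / 5 + 32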

Use assertions to check for internal errors. {#Use-assertions-to-check-for-internal-errors.}

Assertions are a simple, but powerful method for making sure that the context in which your code is executing is as you expect.

In [109]:

def calc_bulk_density(mass, volume):
    '''Return dry bulk density = powder mass / powder volume.'''
    assert volume > 0
    return mass / volume

In [110]:

calc_bulk_density(60, -50)

----------------------------------------------------------------------
AssertionError                       Traceback (most recent call last)
<ipython-input-110-b0873c16a0ba> in <module>()
----> 1 calc_bulk_density(60, -50)

<ipython-input-109-fa5af01ee7ed> in calc_bulk_density(mass, volume)
      1 def calc_bulk_density(mass, volume):
      2     '''Return dry bulk density = powder mass / powder volume.'''
----> 3     assert volume > 0
      4     return mass / volume

AssertionError:

If the assertion is False, the Python interpreter raises an AssertionError runtime exception. The source code for the expression that failed will be displayed as part of the error message. To ignore assertions in your code run the interpreter with the ‘-O’ (optimize) switch. Assertions should contain only simple checks and never change the state of the program. For example, an assertion should never contain an assignment.
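
Assertions can also include a message that is shown when they fail, which often makes the error easier to diagnose. A small variation on the function above (the message text here is just an example):

In [ ]:

def calc_bulk_density(mass, volume):
    '''Return dry bulk density = powder mass / powder volume.'''
    # The text after the comma becomes part of the AssertionError message.
    assert volume > 0, 'volume must be positive'
    return mass / volume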

Use docstrings to provide online help. {#Use-docstrings-to-provide-online-help.}

  • If the first thing in a function is a character string that is not assigned to a variable, Python attaches it to the function as the online help.
  • Called a docstring (short for "documentation string").

In [111]:

def average(values):
    "Return average of values, or None if no values are supplied."

    if len(values) == 0:
        return None
    return sum(values) / len(values)

help(average)

Help on function average in module __main__:

average(values)
    Return average of values, or None if no values are supplied.

Also, you can comment your code using multiline strings. These start with three quote characters (either single or double) and end with three matching characters; an example appears after the next cell, which prints the Zen of Python mentioned above.

In [112]:

import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

In [77]:

"""This string spans
multiple lines.

Blank lines are allowed."""

Out[77]:

'This string spans\nmultiple lines.\n\nBlank lines are allowed.'

Exercises {#Exercises}

Highlight the lines in the code below that will be available as online help. Are there lines that should be made available, but won’t be? Will any lines produce a syntax error or a runtime error?

"Find maximum edit distance between multiple sequences."
# This finds the maximum distance between all sequences.

def overall_max(sequences):
    '''Determine overall maximum edit distance.'''

    highest = 0
    for left in sequences:
        for right in sequences:
            '''Avoid checking sequence against itself.'''
            if left != right:
                this = edit_distance(left, right)
                highest = max(highest, this)

    # Report.
    return highest

Turn the comment on the following function into a docstring and check that help displays it properly.

In [0]:

def middle(a, b, c):
    # Return the middle value of three.
    # Assumes the values can actually be compared.
    values = [a, b, c]
    values.sort()
    return values[1]
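
One possible rewrite, moving the comments into a docstring so that help(middle) displays them:

In [ ]:

def middle(a, b, c):
    '''Return the middle value of three.
    Assumes the values can actually be compared.'''
    values = [a, b, c]
    values.sort()
    return values[1]

help(middle)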

Clean up this code!

  1. Read this short program and try to predict what it does.
  2. Run it: how accurate was your prediction?
  3. Refactor the program to make it more readable. Remember to run it after each change to ensure its behavior hasn’t changed.
  4. Compare your rewrite with your neighbor’s. What did you do the same? What did you do differently, and why?

n = 10
s = 'et cetera'
print(s)
i = 0
while i < n:
    # print('at', j)
    new = ''
    for j in range(len(s)):
        left = j-1
        right = (j+1)%len(s)
        if s[left]==s[right]: new += '-'
        else: new += '*'
    s=''.join(new)
    print(s)
    i += 1
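
One possible refactoring, wrapping the loop in a documented function; the name mark_neighbours is made up for this sketch, and your own cleanup may reasonably look quite different:

In [ ]:

def mark_neighbours(text, iterations):
    '''Repeatedly replace each character with "-" if its left and right
    neighbours are equal and "*" otherwise, printing each new string.
    The string is treated as circular (the last character neighbours the first).'''
    print(text)
    current = text
    for _ in range(iterations):
        new = ''
        for j in range(len(current)):
            left = j - 1                    # -1 wraps around to the last character
            right = (j + 1) % len(current)  # wrap past the end
            new += '-' if current[left] == current[right] else '*'
        print(new)
        current = new

mark_neighbours('et cetera', 10)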

Key Points {#Key-Points}

  • Follow standard Python style in your code.

  • Use docstrings to provide online help.