A DataFrame is a collection of Series. The DataFrame is how Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.
Pandas is built on top of the NumPy library, which in practice means that most of the methods defined for NumPy arrays also apply to Pandas Series and DataFrames.
What makes Pandas so attractive is its powerful interface for accessing individual records of a table, its proper handling of missing values, and its relational-database-style operations between DataFrames.
To access the value at position [i, j] of a DataFrame, we have two options, depending on what i means. Remember that a DataFrame provides an index to identify the rows of the table; a row therefore has both a position inside the table and a label that uniquely identifies its entry in the DataFrame.
Use DataFrame.iloc[..., ...] to select values by integer position, analogous to a 2D version of character selection in strings:
dataframe.iloc[rows, columns]
import pandas as pd
data = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_europe.csv', index_col='country')
#data
print(data.iloc[0:3, 0])
#With labels
#print(data.loc["Albania", "gdpPercap_1952"])
#All columns (just like usual slicing)
#print(data.loc["Albania", :])
Use DataFrame.loc[..., ...] to select values by their (entry) label.
data = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_europe.csv', index_col='country')
print(data.loc["Albania", "gdpPercap_1952"])
Use : on its own to mean all columns or all rows.
print(data.loc["Italy",:])
print(data.loc["Albania", :])
Select multiple columns or rows using DataFrame.loc and a named slice.
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])
In the code above, we see that slicing with loc is inclusive at both ends, which differs from slicing with iloc, where a slice covers everything up to but not including the final index.
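The difference between the two slicing rules can be checked on a small table. This is a minimal sketch; the DataFrame `toy`, its labels, and its values are made up for illustration and are not part of the gapminder data.

```python
import pandas as pd

# Hypothetical toy table; labels and values are invented for this sketch.
toy = pd.DataFrame(
    {"y1952": [1.0, 2.0, 3.0, 4.0], "y1957": [5.0, 6.0, 7.0, 8.0]},
    index=["A", "B", "C", "D"],
)

by_label = toy.loc["A":"C", "y1952"]  # label slice: the end label "C" is included
by_position = toy.iloc[0:2, 0]        # position slice: stops before position 2

print(list(by_label.index))
print(list(by_position.index))
```

The label slice returns three rows while the position slice returns two, even though both start at the first row.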
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())
# Calculate minimum of slice
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())
Use comparisons to select data based on value.
Comparison is applied element by element.
Returns a similarly-shaped dataframe of True and False.
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
#print('Subset of data:\n', subset)
# Which values were greater than 10000 ?
print('\nWhere are values large?\n', subset > 10000)
Select values or NaN using a Boolean mask.
Get the value where the mask is true, and NaN (Not a Number) where it is false. Useful because NaNs are ignored by operations like max, min, average, etc.
mask = subset > 10000
print(subset[mask])
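To see how masked values interact with summary operations, here is a small sketch on invented data (the DataFrame `toy` is an assumption, not the gapminder subset). Values where the mask is False become NaN, and max skips them.

```python
import pandas as pd

# Hypothetical 2x2 table used only for this illustration.
toy = pd.DataFrame({"a": [5, 20], "b": [30, 1]}, index=["x", "y"])

mask = toy > 10
masked = toy[mask]          # keeps values where mask is True, NaN elsewhere
column_max = masked.max()   # NaN entries are ignored by default

print(masked)
print(column_max)
```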
Pandas vectorizing methods and grouping operations are features that provide users much flexibility to analyse their data.
mask_higher = data > data.mean()
wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns)
wealth_score
Note: the axis argument (default 0) accepts 0 or 'index' to apply the function to each column, and 1 or 'columns' to apply it to each row.
Finally, for each group in the wealth_score table, we sum their (financial) contribution across the years surveyed:
data.groupby(wealth_score).sum()
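The same three steps (compare with the column means, score each row, group by the score) can be traced by hand on a tiny table. This is only a sketch; `toy`, its column names, and its values are made up.

```python
import pandas as pd

# Hypothetical table: rows P and Q are below the column means, row R is above.
toy = pd.DataFrame(
    {"gdp_1": [1.0, 2.0, 9.0], "gdp_2": [2.0, 1.0, 10.0]},
    index=["P", "Q", "R"],
)

above_mean = toy > toy.mean()                      # element-wise comparison per column
score = above_mean.sum(axis=1) / len(toy.columns)  # fraction of columns above the mean, per row
grouped = toy.groupby(score).sum()                 # one output row per distinct score

print(score)
print(grouped)
```

Rows P and Q share the score 0.0, so their values are added together, while R forms its own group with score 1.0.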
import pandas as pd
df = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
Write an expression to find the Per Capita GDP of Serbia in 2007.
df = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_europe.csv', index_col='country')
print(df.loc['Serbia','gdpPercap_2007'])
df.loc["Serbia"].iloc[-1]  # last value by position, i.e. the final column of the row
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.idxmin())
print(data.idxmax())
print(data.idxmin())
Use DataFrame.iloc[..., ...] to select values by integer location.
Use : on its own to mean all columns or all rows.
Select multiple columns or rows using DataFrame.loc and a named slice.
Result of slicing can be used in further operations.
Use comparisons to select data based on value.
Select values or NaN using a Boolean mask.
20 min
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df
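# Add a row labelled 'K'; use .loc for this, since df['K', :] is not valid row indexing
df.loc['K', :] = df.iloc[1, :] + df.iloc[1, :]
df
# iloc can only overwrite existing positions (0 to 5 here); position 6 would raise IndexError
df.iloc[5, :] = df.iloc[1, :] + df.iloc[1, :]
# Assumed definition: the column must exist before it can be selected when reordering
df['newColumn'] = df['W'] + df['X']
df = df[['newColumn', 'W', 'X', 'Y', 'Z']]
df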
The groupby method lets you group the rows of a DataFrame and apply a function to each group.
#Let's create a DF
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
print(df)
#Group by company
by_comp = df.groupby("Company")
#by_comp
# Try some functions
by_comp.mean(numeric_only=True)  # skip the non-numeric 'Person' column
by_comp.count()
by_comp.describe()
by_comp.describe().transpose()
We can also merge data from different dataframes.
It's very useful when we need a variable from a different file.
The how argument accepts 'left', 'right', 'outer', or 'inner'.
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
print(right)
## Merge
pd.merge(left, right, how='outer', on=['key'])
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
left.join(right)
right.join(left)
left.join(right, how='outer')
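To compare the four values of how side by side, the sketch below merges two small tables whose keys only partly overlap. The tables are made up for illustration.

```python
import pandas as pd

# Hypothetical tables: K0 and K2 appear in both, K1 only on the left, K3 only on the right.
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K0", "K2", "K3"], "D": ["D0", "D2", "D3"]})

rows = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, how=how, on="key")
    rows[how] = len(merged)
    print(how, list(merged["key"]))
```

An inner merge keeps only keys present in both tables, left and right keep every key of one side, and outer keeps the union, filling the gaps with NaN.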
Some additional operations you can use with a pandas data frame
df['Company'].unique()
df['Company'].nunique()
df['Company'].value_counts()
There are other very useful tricks you can do with Pandas DataFrames, such as profiling a DataFrame.
Profiling with df.profile_report() is a simple and easy way to go further into knowing your data.
Some other tips and tricks
#Install
#pip install pandas-profiling
from google.colab import files  # upload helper; assumes this cell runs in a Google Colab session
uploaded = files.upload()
import pandas as pd
import pandas_profiling
import io
data = pd.read_csv(io.BytesIO(uploaded['gapminder_
print(data.iloc[:,1:3])
pandas_profiling.ProfileReport(data.iloc[:,0:6])
When you are working with large DataFrames, you might want to know whether there are missing values and how many there are.
df.isna().head()
You can count how many NaN values you have per variable:
df.isna().sum()
df1 = df.copy()
You can discard these values
df.dropna(axis=0) #for rows
df.dropna(axis= 1) #for columns
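A small sketch of these steps on invented data (the DataFrame `toy` is an assumption) shows what each call returns. Note that dropna returns a new DataFrame and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value in column "a".
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

missing_per_column = toy.isna().sum()  # number of NaN values in each column
rows_kept = toy.dropna(axis=0)         # drop every row that contains a NaN
cols_kept = toy.dropna(axis=1)         # drop every column that contains a NaN

print(missing_per_column)
print(rows_kept.shape, cols_kept.shape)
```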
We can rescale the data manually (if you like doing things that way), but we can also use existing methods.
For example, scikit-learn provides the following:
The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.
In general, learning algorithms benefit from standardization of the data set. If some outliers are present in the set, robust scalers or transformers are more appropriate.
from sklearn import preprocessing
#Save columns names
names = data.iloc[:,2:8].columns
#Create scaler
scaler = preprocessing.MinMaxScaler() #StandardScaler() #MaxAbsScaler
#Transform your data frame (numeric variables )
data1 = data.iloc[:,2:8]
data1 = scaler.fit_transform(data1)
data1 = pd.DataFrame(data1, columns=names)
print(data1.head())
print(data.iloc[:,2:8].head())
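For comparison, the manual route mentioned earlier can be sketched with plain Pandas arithmetic. This is a minimal illustration of min-max rescaling on made-up data (`toy` is an assumption); it is not meant to reproduce every detail of the scikit-learn scaler.

```python
import pandas as pd

# Hypothetical numeric table used only for this sketch.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 40.0]})

# Subtract each column's minimum and divide by its range, so every column spans 0 to 1.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

A column whose values are all equal has a range of zero, so this manual version would produce NaN for it.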
With the file gapminder_all.csv
try to:
A list stores many values in a single structure.
Doing calculations with a hundred variables called pressure_001, pressure_002, etc., would be at least as slow as doing them by hand.
Use a list to store many values together.
pressures = [0.273, 0.275, 0.277, 0.275, 0.276]
print('pressures:', pressures)
print('length:', len(pressures))
Use an item’s index to fetch it from a list.
print('zeroth item of pressures:', pressures[0])
Lists’ values can be replaced by assigning to them.
pressures[0] = 0.265
print('pressures is now:', pressures)
Use list_name.append to add items to the end of a list.
primes = [2, 3, 5]
print('primes is initially:', primes)
primes.append(7)
primes.append(9)
print('primes has become:', primes)
Use del to remove items from a list entirely.
primes = [2, 3, 5, 7, 9]
print('primes before removing last item:', primes)
del primes[4]
print('primes after removing last item:', primes)
The empty list contains no values.
Lists may contain values of different types.
goals = [1, 'Create lists.', 2, 'Extract items from lists.', 3, 'Modify lists.']
Character strings can be indexed like lists.
element = 'carbon'
print('zeroth character:', element[0])
print('third character:', element[3])
Character strings are immutable.
element[0] = 'C'  # raises TypeError: 'str' object does not support item assignment
Given this:
print('string to list:', list('tin'))
print('list to string:', ''.join(['g', 'o', 'l', 'd']))
print(list('YaQueremosComer'))
What does the following program print?
element = 'helium'
print(element[-1])
What does the following program print?
element = 'fluorine'
print(element[::2])
print(element[::-1])
A list stores many values in a single structure.
Use an item’s index to fetch it from a list.
Lists’ values can be replaced by assigning to them.
Appending items to a list lengthens it.
Use del to remove items from a list entirely.
The empty list contains no values.
Lists may contain values of different types.
Character strings can be indexed like lists.
Character strings are immutable.
Indexing beyond the end of the collection is an error.
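The last point can be checked without stopping the notebook by catching the error. This is a small sketch; the list is a shortened, made-up version of the pressures example.

```python
pressures = [0.273, 0.275, 0.277]

try:
    pressures[3]  # valid positions are 0, 1 and 2
except IndexError as error:
    message = str(error)
    print('IndexError:', message)

last = pressures[-1]  # negative indices count from the end instead
print('last item:', last)
```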
A for loop executes commands once for each value in a collection.
“for each thing in this group, do these operations”
for number in [2, 3, 5]:
    print(number)
print(2)
print(3)
print(5)
A for loop is made up of a collection, loop variable and a body.
Parts of a for loop
The loop variable, number, is what changes for each iteration of the loop.
Python uses indentation rather than {} or begin/end to show nesting.
The first line of the for loop must end with a colon, and the body must be indented.
for number in [2, 3, 5]:
    print(number)
Indentation is always meaningful in Python.
firstName = "Jon"
  lastName = "Smith"  # IndentationError: unexpected indent
Loop variables can be called anything.
for kitten in [2, 3, 5]:
    print(kitten)
The body of a loop can contain many statements.
primes = [2, 3, 5]
for p in primes:
    squared = p ** 2
    cubed = p ** 3
    print(p, squared, cubed)
Use range to iterate over a sequence of numbers.
print('a range is not a list: range(0, 3)')
for number in range(0, 3):
    print(number)
The Accumulator pattern turns many values into one.
total = 0
for number in range(10):
    total = total + (number + 1)
print(total)
Create a table showing the numbers of the lines that are executed when this program runs, and the values of the variables after each line is executed.
total = 0
for char in "tin":
    total = total + 1
Fill in the blanks in the program below so that it prints “nit” (the reverse of the original character string “tin”).
original = "tin"
result = ____
for char in original:
    result = ____
print(result)
original = "tin"
result = ""
for char in original:
    result = char + result
print(result)
Fill in the blanks in each of the programs below to produce the indicated result.
# Total length of the strings in the list: ["red", "green", "blue"] => 12
total = 0
for word in ["red", "green", "blue"]:
    ____ = ____ + len(word)
print(total)
# List of word lengths: ["red", "green", "blue"] => [3, 5, 4]
lengths = []
for word in ["red", "green", "blue"]:
    lengths.append(len(word))
print(lengths)
# Concatenate all words: ["red", "green", "blue"] => "redgreenblue"
words = ["red", "green", "blue"]
result = ____
for ____ in ____:
    ____
print(result)
# Create acronym: ["red", "green", "blue"] => "RGB"
# write the whole thing
Find the error in the following code:
students = ['Ana', 'Juan', 'Susan']
for m in students:
print(m)
Cumulative sum. Reorder and properly indent the lines of code below so that they print a list with the cumulative sum of data. The result should be [1, 3, 5, 10].
cumulative.append(sum)
for number in data:
cumulative = []
sum += number
sum = 0
print(cumulative)
data = [1,2,2,5]
A for loop executes commands once for each value in a collection.
A for loop is made up of a collection, a loop variable, and a body.
The first line of the for loop must end with a colon, and the body must be indented.
Indentation is always meaningful in Python.
Loop variables can be called anything (but it is strongly advised to have a meaningful name to the looping variable).
The body of a loop can contain many statements.
Use range to iterate over a sequence of numbers.
The Accumulator pattern turns many values into one.
Use a for loop to process files given a list of their names.
import pandas as pd
for filename in ['/home/mcubero/dataSanJose19/data/gapminder_gdp_africa.csv', '/home/mcubero/dataSanJose19/data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())
Use glob.glob to find sets of files whose names match a pattern.
'*' meaning “match zero or more characters”
Python contains the glob library to provide pattern matching functionality.
import glob
print('all csv files in data directory:', glob.glob('/home/mcubero/dataSanJose19/data/*.csv'))
print('all PDB files:', glob.glob('*.pdb'))
Use glob and for to process batches of files.
for filename in glob.glob('/home/mcubero/dataSanJose19/data/gapminder_*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())
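Because the paths above only exist on one machine, here is a self-contained sketch of the same glob-and-loop pattern. It writes three small CSV files into a temporary folder first; the file names, the column name `value`, and the numbers are all invented for the illustration.

```python
import glob
import os
import tempfile

import pandas as pd

minima = {}
with tempfile.TemporaryDirectory() as folder:
    # Create hypothetical input files; only the demo_*.csv ones should match the pattern.
    for name, values in [("demo_a.csv", [3, 1, 2]), ("demo_b.csv", [9, 7, 8]), ("other.csv", [0])]:
        pd.DataFrame({"value": values}).to_csv(os.path.join(folder, name), index=False)

    # glob returns matches in arbitrary order, so sort them for a stable result.
    for filename in sorted(glob.glob(os.path.join(folder, "demo_*.csv"))):
        data = pd.read_csv(filename)
        minima[os.path.basename(filename)] = int(data["value"].min())

print(minima)
```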
Which of these files is not matched by the expression glob.glob('data/*as*.csv')?
Use a for loop to process files given a list of their names.
Use glob.glob to find sets of files whose names match a pattern.
Use glob and for to process batches of files.
Break programs down into functions to make them easier to understand.
Encapsulate complexity so that we can treat it as a single “thing”.
To define a function, use def followed by the name of the function, like this:
def say_hi(parameter1, parameter2):
    print('Hello')
Remember, defining a function does not run it; you must call the function to execute it.
def print_date(year, month, day):
    joined = str(year) + '/' + str(month) + '/' + str(day)
    print(joined)
print_date(1871, 3, 19)
print_date(month=3, day=19, year=1871)
def average(values):
    if len(values) == 0:
        return None
    return sum(values) / len(values)
Remember: every function returns something. A function that does not explicitly return a value automatically returns None.
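A quick way to see this is to capture the result of a function that only prints. The function name `print_sum` is made up for this sketch.

```python
def print_sum(a, b):
    # Prints the sum but has no return statement.
    print(a + b)

result = print_sum(1, 2)
print('result is', result)
```

The call prints the sum, and the variable result ends up holding None.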
What does the following program print?
def report(pressure):
    print('pressure is', pressure)
print('calling', report, 22.5)
def report(pressure):
    print('pressure is', pressure)
print('calling', report(22.5))
Fill in the blanks to create a function that takes a single filename as an argument, loads the data in the file named by the argument, and returns the minimum value in that data.
import pandas as pd
def min_in_data(____):
    data = ____
    return ____
import pandas as pd
def min_in_data(filename):
    # Load the file named by the argument and return the minimum of each column.
    data = pd.read_csv(filename)
    return data.min()
min_in_data('/home/mcubero/dataSanJose19/data/gapminder_gdp_africa.csv')
The code below will run on a label-printer for chicken eggs. A digital scale will report a chicken egg mass (in grams) to the computer and then the computer will print a label.
Please re-write the code so that the if-block is folded into a function.
import random
for i in range(10):
    # simulating the mass of a chicken egg
    # the (random) mass will be 70 +/- 20 grams
    mass=70+20.0*(2.0*random.random()-1.0)
    print(mass)
    #egg sizing machinery prints a label
    if(mass>=85):
        print("jumbo")
    elif(mass>=70):
        print("large")
    elif(mass<70 and mass>=55):
        print("medium")
    else:
        print("small")
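One possible way to fold the if-block into a function is sketched below. The function name `get_egg_label` is an assumption, and other solutions are equally valid. Since each elif is only reached when the earlier tests failed, the `mass<70` check is not needed.

```python
import random

def get_egg_label(mass):
    """Return the size label for an egg mass in grams."""
    if mass >= 85:
        return "jumbo"
    elif mass >= 70:
        return "large"
    elif mass >= 55:
        return "medium"
    else:
        return "small"

for i in range(10):
    # simulated mass: 70 +/- 20 grams
    mass = 70 + 20.0 * (2.0 * random.random() - 1.0)
    print(mass, get_egg_label(mass))
```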
Assume that the following code has been executed:
import pandas as pd
df = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_asia.csv', index_col=0)
japan = df.loc['Japan']
japan
1. Complete the statements below to obtain the average GDP for Japan across the years reported for the 1980s.
year = 1983
gdp_decade = 'gdpPercap_' + str(year // ____)
avg = (japan.loc[gdp_decade + ___] + japan.loc[gdp_decade + ___]) / 2
2. Abstract the code above into a single function.
def avg_gdp_in_decade(country, continent, year):
    df = pd.read_csv('data/gapminder_gdp_'+___+'.csv',delimiter=',',index_col=0)
    ____
    ____
    ____
    return avg
Break programs down into functions to make them easier to understand.
Define a function using def with a name, parameters, and a block of code.
Defining a function does not run it.
Arguments in call are matched to parameters in definition.
Functions may return a result to their caller using return.
The scope of a variable is the part of a program that can ‘see’ that variable.
pressure = 103.9
def adjust(t):
    temperature = t * 1.43 / pressure
    return temperature
print('adjusted:', adjust(0.9))
print('temperature after call:', temperature)  # raises NameError: temperature exists only inside adjust
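The sketch below repeats the example but catches the error, so the scope rule can be observed without interrupting the session. The flag `caught` is a helper introduced only for this illustration.

```python
pressure = 103.9  # global: visible inside and outside the function

def adjust(t):
    temperature = t * 1.43 / pressure  # local: exists only while adjust runs
    return temperature

adjusted = adjust(0.9)
print('adjusted:', adjusted)

caught = False
try:
    print('temperature after call:', temperature)
except NameError as error:
    caught = True
    print('NameError:', error)
```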
Trace the values of all variables in this program as it is executed. (Use ‘—’ as the value of variables before and after they exist.)
limit = 100
def clip(value):
    return min(max(0.0, value), limit)
value = -22.5
print(clip(value))
Read the traceback below, and identify the following:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-2-e4c4cbafeeb5> in <module>()
1 import errors_02
----> 2 errors_02.print_friday_message()
/Users/ghopper/thesis/code/errors_02.py in print_friday_message()
13
14 def print_friday_message():
---> 15 print_message("Friday")
/Users/ghopper/thesis/code/errors_02.py in print_message(day)
9 "sunday": "Aw, the weekend is almost over."
10 }
---> 11 print(messages[day])
12
13
KeyError: 'Friday'
Use if statements to control whether or not a block of code is executed.
mass = 2.07
if mass > 3.0:
    print(mass, 'is large')
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')
thing1 = [3.54, 2.07, 9.22]
if masses > thing1:
    print(masses, 'is large')
Conditionals are often used inside loops.
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
Use else to execute a block of code when an if condition is not true.
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')
Use elif to specify additional tests.
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in ____:
    if m > 9.0:
        print(__, 'is HUGE')
    elif m > 3.0:
        print(m, 'is large')
    ___:
        print(m, 'is small')
Conditions are tested once, in order.
grade = 85
if grade >= 70:
    print('grade is C')
elif grade >= 80:
    print('grade is B')
elif grade >= 90:
    print('grade is A')
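The snippet above reports a C for a grade of 85, because the first condition that is true wins and the later ones are never tested. A sketch of one fix is to test the most restrictive condition first. The function name `letter_grade` is introduced only for this illustration.

```python
def letter_grade(grade):
    # Conditions are tested in order, so start with the highest threshold.
    if grade >= 90:
        return 'A'
    elif grade >= 80:
        return 'B'
    elif grade >= 70:
        return 'C'
    return None  # below every threshold

print('grade is', letter_grade(85))
```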
velocity = 10.0
if velocity > 20.0:
    print('moving too fast')
else:
    print('adjusting velocity')
    velocity = 50.0
Often use conditionals in a loop to “evolve” the values of variables.
velocity = 10.0
for i in range(5): # execute the loop 5 times
    print(i, ':', velocity)
    if velocity > 20.0:
        print('moving too fast')
        velocity = velocity - 5.0
    else:
        print('moving too slow')
        velocity = velocity + 10.0
print('final velocity:', velocity)
Conditionals are useful to check for errors!
Often, you want some combination of things to be true. You can combine relations within a conditional using and and or. Continuing the example above, suppose you have
mass = [ 3.54, 2.07, 9.22, 1.86, 1.71]
velocity = [10.00, 20.00, 30.00, 25.00, 20.00]
i = 0
for i in range(5):
    if mass[i] > 5 and velocity[i] > 20:
        print("Fast heavy object. Duck!")
    elif mass[i] > 2 and mass[i] <= 5 and velocity[i] <= 20:
        print("Normal traffic")
    elif mass[i] <= 2 and velocity[i] <= 20:
        print("Slow light object. Ignore it")
    else:
        print("Whoa! Something is up with the data. Check it")
Just like with arithmetic, you can and should use parentheses whenever there is possible ambiguity. A good general rule is to always use parentheses when mixing and and or in the same condition. That is, instead of:
if mass[i] <= 2 or mass[i] >= 5 and velocity[i] > 20:
write one of these:
if (mass[i] <= 2 or mass[i] >= 5) and velocity[i] > 20:
if mass[i] <= 2 or (mass[i] >= 5 and velocity[i] > 20):
so it is perfectly clear to a reader (and to Python) what you really mean.
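Python evaluates and before or, so the version without parentheses behaves like the second parenthesized form. The sketch below uses one made-up pair of values for which the two groupings disagree.

```python
# Hypothetical single object: light and slow.
mass, velocity = 1.5, 10.0

unparenthesized = mass <= 2 or mass >= 5 and velocity > 20
or_first = (mass <= 2 or mass >= 5) and velocity > 20
and_first = mass <= 2 or (mass >= 5 and velocity > 20)

print(unparenthesized, or_first, and_first)
```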
What does this program print?
pressure = 71.9
if pressure > 50.0:
    pressure = 25.0
elif pressure <= 50.0:
    pressure = 0.0
print(pressure)
Trimming Values Fill in the blanks so that this program creates a new list containing zeroes where the original list’s values were negative and ones where the original list’s values were positive.
original = [-1.5, 0.2, 0.4, 0.0, -1.3, 0.4]
result = []
for value in original:
    if value < 0.0:
        result.append(0)
    else:
        result.append(1)
print(result)
import glob
import pandas as pd
for filename in glob.glob('/home/mcubero/dataSanJose19/data/*.csv'):
    contents = pd.read_csv(filename)
    if len(contents) < 50:
        print(filename, len(contents))
Modify this program so that it finds the largest and smallest values in the list no matter what the range of values originally is.
values = [...some test data...]
smallest, largest = None, None
for v in values:
    if ____:
        smallest, largest = v, v
    ____:
        smallest = min(____, v)
        largest = max(____, v)
print(smallest, largest)
What are the advantages and disadvantages of using this method to find the range of the data?
Functions will often contain conditionals. Here is a short example that will indicate which quartile the argument is in based on hand-coded values for the quartile cut points.
def calculate_life_quartile(exp):
    if exp < 58.41:
        # This observation is in the first quartile
        return 1
    elif exp >= 58.41 and exp < 67.05:
        # This observation is in the second quartile
        return 2
    elif exp >= 67.05 and exp < 71.70:
        # This observation is in the third quartile
        return 3
    elif exp >= 71.70:
        # This observation is in the fourth quartile
        return 4
    else:
        # This observation has bad data
        return None
calculate_life_quartile(62.5)
That function would typically be used within a for loop, but Pandas has a different, more efficient way of doing the same thing, and that is by applying a function to a dataframe or a portion of a dataframe. Here is an example, using the definition above.
data = pd.read_csv('/home/mcubero/dataSanJose19/data/all-Americas.csv')
data
#data['life_qrtl'] = data['lifeExp'].apply(calculate_life_quartile)
There is a lot in that last line, so let’s take it piece by piece. On the right side of the = we start with data['lifeExp'], which is the column labeled lifeExp in the dataframe called data. We use apply() to do what it says: apply calculate_life_quartile to the value of this column for every row in the dataframe.
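Since the commented-out line depends on a file that may not be available, here is a self-contained sketch of the same apply pattern. The DataFrame `toy` and its values are made up, and the function is a condensed version of the one defined above with the same cut points.

```python
import pandas as pd

def calculate_life_quartile(exp):
    # Same hand-coded cut points as above, written without the redundant lower bounds.
    if exp < 58.41:
        return 1
    elif exp < 67.05:
        return 2
    elif exp < 71.70:
        return 3
    elif exp >= 71.70:
        return 4
    else:
        return None  # reached only when every comparison is False, e.g. for NaN

# Hypothetical life expectancies, one per quartile.
toy = pd.DataFrame({'lifeExp': [50.0, 62.5, 70.0, 80.0]})
toy['life_qrtl'] = toy['lifeExp'].apply(calculate_life_quartile)

print(toy)
```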
Use if statements to control whether or not a block of code is executed.
Conditionals are often used inside loops.
Use else to execute a block of code when an if condition is not true.
Use elif to specify additional tests.
Conditions are tested once, in order.
Create a table showing variables’ values to trace a program’s execution.
20 min Exercises (15 min)
We are going to use matplotlib, the most widely used scientific plotting library in Python.
#%matplotlib inline
import matplotlib.pyplot as plt
time = [0, 1, 2, 3]
position = [0, 100, 200, 300]
plt.plot(time, position)
plt.xlabel('Time (hr)')
plt.ylabel('Position (km)')
import pandas as pd
data = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_oceania.csv', index_col='country')
# Extract the year by removing the 'gdpPercap_' prefix from each column name
# (str.strip removes a set of characters from both ends, not a prefix)
years = data.columns.str.replace('gdpPercap_', '')
# Convert year values to integers, saving results back to dataframe
data.columns = years.astype(int)
data.loc['Australia'].plot()
data.T.plot()
plt.ylabel('GDP per capita')
plt.style.use('ggplot')
data.T.plot(kind='bar')
plt.ylabel('GDP per capita')
Get Australia data from dataframe
years = data.columns
gdp_australia = data.loc['Australia']
plt.plot(years, gdp_australia, 'g--')
Can plot many sets of data together.
# Select two countries' worth of data.
gdp_australia = data.loc['Australia']
gdp_nz = data.loc['New Zealand']
# Plot with differently-colored markers.
plt.plot(years, gdp_australia, 'b-', label='Australia')
plt.plot(years, gdp_nz, 'g-', label='New Zealand')
# Create legend.
plt.legend(loc='upper left')
plt.xlabel('Year')
plt.ylabel('GDP per capita ($)')
Often when plotting multiple datasets on the same figure it is desirable to have a legend describing the data.
This can be done in matplotlib in two stages: provide a label for each dataset when plotting it, then call plt.legend() to create the legend.
plt.plot(years, gdp_australia, label='Australia')
plt.plot(years, gdp_nz, label='New Zealand')
plt.legend()
By default, matplotlib will attempt to place the legend in a suitable position. If you would rather specify a position, use the loc= argument; e.g. to place the legend in the upper left corner of the plot, specify loc='upper left'.
plt.scatter(gdp_australia, gdp_nz)
data.T.plot.scatter(x = 'Australia', y = 'New Zealand')
Fill in the blanks below to plot the minimum GDP per capita over time for all the countries in Europe. Modify it again to plot the maximum GDP per capita over time for Europe.
data_europe = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_europe.csv', index_col='country')
data_europe.____.plot(label='min')
data_europe.____
plt.legend(loc='best')
plt.xticks(rotation=90)
Modify the example in the notes to create a scatter plot showing the relationship between the minimum and maximum GDP per capita among the countries in Asia for each year in the data set. What relationship do you see (if any)?
data_asia = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_asia.csv', index_col='country')
data_asia.describe().T.plot(kind='scatter', x='min', y='max')
You might note that the variability in the maximum is much higher than that of the minimum. Take a look at the maximum and the max indexes:
data_asia = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_asia.csv', index_col='country')
data_asia.max().plot()
print(data_asia.idxmax())
print(data_asia.idxmin())
If you are satisfied with the plot you see you may want to save it to a file, perhaps to include it in a publication. There is a function in the matplotlib.pyplot module that accomplishes this: savefig. Calling this function, e.g. with
plt.savefig('my_figure.png')
will save the current figure to the file my_figure.png. The file format will automatically be deduced from the file name extension (other formats are pdf, ps, eps and svg).
Note that functions in plt refer to a global figure variable and after a figure has been displayed to the screen (e.g. with plt.show) matplotlib will make this variable refer to a new empty figure. Therefore, make sure you call plt.savefig before the plot is displayed to the screen, otherwise you may find a file with an empty plot.
When using dataframes, data is often generated and plotted to screen in one line, and plt.savefig seems not to be a possible approach. One possibility to save the figure to file is then to
ax = data.plot(kind='bar')  # the plot call returns the Axes it drew on
fig = ax.get_figure()  # get the figure that owns those axes
fig.savefig('my_figure.png')
Whenever you are generating plots to go into a paper or a presentation, there are a few things you can do to make sure that everyone can understand your plots.
matplotlib is the most widely used scientific plotting library in Python.
Plot data directly from a Pandas dataframe.
Select and transform data, then plot it.
Many styles of plot are available: see the Python Graph Gallery for more options.
Can plot many sets of data together.
Coding style helps us to understand the code better, and makes it easier to maintain and change. Python relies strongly on coding style, as we can see from the indentation we apply to lines to define different blocks of code. Python proposes a standard style through one of its first Python Enhancement Proposals (PEP), PEP8, and highlights the importance of readability in the Zen of Python.
Keep in mind:
Follow standard Python style in your code.
Assertions are a simple, but powerful method for making sure that the context in which your code is executing is as you expect.
def calc_bulk_density(mass, volume):
    '''Return dry bulk density = powder mass / powder volume.'''
    assert volume > 0
    return mass / volume
calc_bulk_density(60, -50)
If the assertion is False, the Python interpreter raises an AssertionError runtime exception. The source code for the expression that failed will be displayed as part of the error message. To ignore assertions in your code run the interpreter with the ‘-O’ (optimize) switch. Assertions should contain only simple checks and never change the state of the program. For example, an assertion should never contain an assignment.
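An assertion can also carry a message, which appears in the AssertionError when the check fails. The sketch below adds one to the function above and catches the error so the message can be inspected. The message text and the variable `failure` are assumptions made for this illustration.

```python
def calc_bulk_density(mass, volume):
    '''Return dry bulk density = powder mass / powder volume.'''
    assert volume > 0, 'volume must be positive'
    return mass / volume

failure = None
try:
    calc_bulk_density(60, -50)
except AssertionError as error:
    failure = str(error)
    print('AssertionError:', failure)

print(calc_bulk_density(60, 30))
```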
def average(values):
    "Return average of values, or None if no values are supplied."
    if len(values) == 0:
        return None
    return sum(values) / len(values)
help(average)
You can also document your code using multiline strings. These start and end with three matching quote characters (either single or double).
import this
"""This string spans
multiple lines.
Blank lines are allowed."""
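When such a string is the first statement inside a function, Python stores it as the function's docstring, which is what help displays. The function `celsius_to_kelvin` below is made up for this sketch.

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from Celsius to Kelvin.

    The conversion adds 273.15 to the input.
    """
    return celsius + 273.15

# The docstring is stored on the function object itself.
first_line = celsius_to_kelvin.__doc__.splitlines()[0]
print(first_line)
```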
Highlight the lines in the code below that will be available as online help. Are there lines that should be made available, but won’t be? Will any lines produce a syntax error or a runtime error?
"Find maximum edit distance between multiple sequences."
# This finds the maximum distance between all sequences.
def overall_max(sequences):
    '''Determine overall maximum edit distance.'''
    highest = 0
    for left in sequences:
        for right in sequences:
            '''Avoid checking sequence against itself.'''
            if left != right:
                this = edit_distance(left, right)
                highest = max(highest, this)
    # Report.
    return highest
Turn the comment on the following function into a docstring and check that help displays it properly.
def middle(a, b, c):
    # Return the middle value of three.
    # Assumes the values can actually be compared.
    values = [a, b, c]
    values.sort()
    return values[1]
Clean up this code!
n = 10
s = 'et cetera'
print(s)
i = 0
while i < n:
    # print('at', j)
    new = ''
    for j in range(len(s)):
        left = j-1
        right = (j+1)%len(s)
        if s[left]==s[right]: new += '-'
        else: new += '*'
    s=''.join(new)
    print(s)
    i += 1
Follow standard Python style in your code.
Use docstrings to provide online help.