Session 1

Plotting and programming with Python 3

Open a Jupyter Notebook

(10 minutes)

http://swcarpentry.github.io/python-novice-gapminder/01-run-quit/index.html

Once you have your Notebook open let`s start with the session.

Heo

Variables and Assignment

(15 min)

Exercises (10min)

Variables are names for determined values.

Note:

  • To assing the value to a variable use = like this:

    variable = value

  • The variable is created when a value is assingned.

For example let's type in our names and ages:

In [2]:
age = 21
name = 'Mariana'
ejemplo= 5+5
print(age, name, ejemplo)
21 Mariana 10
In [4]:
#1variable = 1
variable1 = 1

Variable names

  • Only letters, digits and underscore _ (for long names)

  • Do not start with a digit

  • It is possible to start with an underscore like __real_age, but this has a special meaning.

(We won't do that yet)

Display values with:

print()

This built-in function prints values as text

Examples:

In [5]:
print('My name is', name, ', I am ', age)
My name is Mariana , I am  21

Please remember that variables must be created before they are used!

Otherwise Python reports an error

In [8]:
cat_name = "Salem"
print(cat_name)
Salem

Be aware that it is the order of execution of cells that is important in a Jupyter notebook, not the order in which they appear.

Python will remember all the code that was run previously, including any variables you have defined, irrespective of the order in the notebook. Therefore if you define variables lower down the notebook and then (re)run cells further up, those defined further down will still be present.

Variables can be reused

In [9]:
age = age + 1 
print('Next year I will be', age)
Next year I will be 22

Also, you can index to get a single character from a string.

  • 'AB' != 'BA' -> ordering matters, because we can treat the string as a list of characters.

  • Python uses zero-based indexing.

  • Use square brackets to get a character in a specific position.
atom_name = 'helium'
print(atom_name[0])
In [11]:
atom_name = 'helium'
print(atom_name)
print(atom_name[0])
helium
h

Slicing to get a substring!

  • Quick concepts

    • Substring:A part of a string. (1 to n number of characters)
    • Element: An item on a list. In a string, the elements are the characters.
    • Slice: part of a string (or any list-like thing)

How to slice strings:

  • [start:stop], where start is replaced with the index of the first element we want and stop is replaced with the index of the element just after the last element we want.

  • Taking a slice does not change the contents of the original string. Instead, the slice is a copy of part of the original string.

In [13]:
atom_name = 'sodium'
print(atom_name[0:3])

# Use len to find the lenght of a string
print(len('helium'))
sod
6

Final considerations:

  • Python is case-sensitive upper- and lower-case letters are different. Name != name are different variables.

  • Use meaningful variable names

flabadab = 42
ewr_422_yY = 'Ahmed'
print(ewr_422_yY, 'is', flabadab, 'years old')
In [15]:
hola = 1

HOLA= 2
print(hola, HOLA)
1 2

Excercises

  • What is the final value of position in the program below? (Try to predict the value without running the program, then check your prediction.)
initial = 'left'
position = initial
initial = 'right'
In [18]:
initial = 'left'
position = initial
initial = 'right'
print(initial, position)
right left
  • If you assign a = 123, what happens if you try to get the second digit of a via a[1]?
In [19]:
a = 123
a[1]
----------------------------------------------------------------------
TypeError                            Traceback (most recent call last)
<ipython-input-19-d77b277d0f9d> in <module>()
      1 a = 123
----> 2 a[1]

TypeError: 'int' object is not subscriptable
  • What does the following program print?
atom_name = 'carbon'
print('atom_name[1:3] is:', atom_name[1:3])
In [20]:
atom_name = 'carbon'
print('atom_name[1:3] is:', atom_name[1:3])
atom_name[1:3] is: ar

1.What does thing[low:high] do?

2.What does thing[low:] (without a value after the colon) do?

3.What does thing[:high] (without a value before the colon) do?

4.What does thing[:] (just a colon) do?

5.What does thing[number:some-negative-number] do?

6.What happens when you choose a high value which is out of range? (i.e., try atom_name[0:15])

KEY POINTS - Variable and Assignment

Use variables to store values.

Use print to display values.

Variables persist between cells.

Variables must be created before they are used.

Variables can be used in calculations.

Use an index to get a single character from a string.

Use a slice to get a substring.

Use the built-in function len to find the length of a string.

Python is case-sensitive.

Use meaningful variable names.

Data Types and Type Conversion

(10 min)

Exercises (10min)

Every value has a type.

Existing types are:

  • Integer (int): positive or negative whole numbers
  • Floating point numbers (float): real numbers
  • Character string (str): text
    • Written in either single or double quotes (but they have to match)

Use the built-in function type to find out what type a value or variable has

print(type(variable))
In [22]:
print(type(52))

fitness = 'average'
print(type(fitness))
<class 'float'>
<class 'str'>

Operations or methods for a given type

Types control what operations (or methodds) can be performed on a given value

  • A value’s type determines what the program can do to it.
In [23]:
#This works
print(5 - 3)
2
In [24]:
#This works?
print('hello' - 'h')
----------------------------------------------------------------------
TypeError                            Traceback (most recent call last)
<ipython-input-24-c60e4b1b64d0> in <module>()
      1 #This works?
----> 2 print('hello' - 'h')

TypeError: unsupported operand type(s) for -: 'str' and 'str'
  • Operators that can be used on strings

    • +
    • *

    "Adding" characters strings concatenates them.

In [26]:
 full_name = 'Ahmed'+ 'Walsh'
print(full_name)
AhmedWalsh

Multiplying a character string by an integer N creates a new string that consists of that character string repeated N times.

In [27]:
separator = '=' * 10
print(separator)
==========
  • Try function len on strings and numbers
In [28]:
#Text
print(len(full_name))
10
In [29]:
#Number
print(len(256))
----------------------------------------------------------------------
TypeError                            Traceback (most recent call last)
<ipython-input-29-00ccdb9afd37> in <module>()
      1 #Number
----> 2 print(len(256))

TypeError: object of type 'int' has no len()

Excercises

  1. What type of value is 3.4? How can you find out?
  1. What type of value is 3.25 + 4?
  1. What type of value (integer, floating point number, or character string) would you use to represent each of the following? Try to come up with more than one good answer for each problem. For example, in # 1, when would counting days with a floating point variable make more sense than using an integer?

Number of days since the start of the year.

Time elapsed from the start of the year until now in days.

Serial number of a piece of lab equipment.

A lab specimen’s age Current population of a city.

Average population of a city over time.

  1. Which of the following will return the floating point number 2.0? Note: there may be more than one right answer.
first = 1.0
second = "1"
third = "1.1"

1.first + float(second)

2.float(second) + float(third)

3.first + int(third)

4.first + int(float(third))

5.int(first) + int(float(third))

  1. 2.0 * second

Key Points - Data Types and Type Conversion

  • Every value has a type.

  • Use the built-in function type to find the type of a value.

  • Types control what operations can be done on values.

  • Strings can be added and multiplied.

  • Strings have a length (but numbers don’t).

  • Must convert numbers to strings or vice versa when operating on them.

  • Can mix integers and floats freely in operations.

  • Variables only change value when something is assigned to them.

STRETCHING TIME!

Numpy

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array object

  • sophisticated (broadcasting) functions

  • tools for integrating C/C++ and Fortran code

  • useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

We can create random numers easily

In [34]:
import numpy as np
In [30]:
from numpy import random

r = random.randint(1, 35)
print(r)
26

A very useful data structure, very similar to lists with a few exceptions such as:

  • The number of elements in the array cannot be changed. (This means we cant .append() or remove)
  • All element in the arrays must be the type of variables
  • Once the array is created we can't change the data type.

Advantages over lists:

  • Arrays can be n-dimensional (vector, matrix, tensor)
  • Supports arithmetic algebraic
  • Arrays work faster than lists
In [32]:
mylist = [1,2,3,4 ]
mylist
Out[32]:
[1, 2, 3, 4]
In [35]:
 np.array(mylist)
Out[35]:
array([1, 2, 3, 4])
In [36]:
mymatrix = [[1,2,3], [4,5,6], [7,8,9]]
mymatrix
Out[36]:
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
In [37]:
m = np.array(mymatrix)
m
Out[37]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
In [38]:
 type(m)
Out[38]:
numpy.ndarray

There are many ways to create an array, let's try some:

Let's asume a is any numpy array.

Function Description
a.shape Returns a tuple with the numer of elements per dimension
a.ndim Number of dimension
a.size Number of elements in an array
a.dtype Data type of the elements in the array
a.T Transposes the array
a.flat Collapses the array in 1 dimension
a.copy() Returns a copy of the array
a.fill() Fills the array with a determined value
a.reshape() Returns an array with the same data but in the shape we indicate
a.resize() Changes the shape of the array, but this does not creates a copy of the original array
a.sort() Reorders the array
In [43]:
# Try some!
print(m.dtype)
print(m)
int64
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Slicing, but in arrays

It's more flexible. Slicings follows this structure:

beginning:end:step

In [52]:
a = np.linspace(0, 10, 11)
#blank spaces mean "everthing" 

# all 
print(a[1:11])
[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
In [0]:
 #  3 to 8
print(a[3:8])

#  1 to 9 with steps of size 2 (odd)
print(a[1:11:2])
[3. 4. 5. 6. 7.]
[1. 3. 5. 7. 9.]
In [54]:
# conditionals 
print(a[a > 4])
[ 5.  6.  7.  8.  9. 10.]

Exercises

Try it yourself

  1. Create a non squared matrix, and print the dimensions (.shape)
  2. Use .reshape to change the shape of the array
  3. Try . size is equal to np.prod(a.shape)

Built-in Functions and Help

(10 min)

Exercises (10 min)

Arguments in a function

  • A function may take zero or more arguments

  • An argument is a value passed into a function

For example some of the functions we used so far have arguments

  • len takes exactly one argument
len(_x_)
  • int, str and float create a new value from an existing one

  • print takes zero or more arguments

    • With no arguments prints a blank line
    • Must use the parethesis to use the function
print('before')
print()
print('after')

Commonly-used built-in functions include max, min and round

  • max()
  • min()
  • round()

You can also combine some functions

print(max(1, 2, 3))
print(min('a', 'A', '0'))

But these functions may only work for certain (combination of) arguments.

For example:

  • max and min must be given at least one argument.

    • “Largest of the empty set” is a meaningless question.
  • And they must be given things that can meaningfully be compared.

In [59]:
print(max(1,4,6,9,100000))
print(min('a', 'A', 0))
100000
----------------------------------------------------------------------
TypeError                            Traceback (most recent call last)
<ipython-input-59-4a37fdd5cb05> in <module>()
      1 print(max(1,4,6,9,100000))
----> 2 print(min('a', 'A', 0))

TypeError: '<' not supported between instances of 'int' and 'str'

Functions may have default values for some arguments

  • round will round off a floating-point number.
  • By default, rounds to zero decimal places.
In [61]:
round(3.712)

# We can specify the number of decimal places we want. Let's try:

round(3.712, 2 )
Out[61]:
3.71

You can use the built-in function help to get help for a function.

  • Every built-in has online documentation.
In [62]:
help(round)
Help on built-in function round in module builtins:

round(...)
    round(number[, ndigits]) -> number
    
    Round a number to a given precision in decimal digits (default 0 digits).
    This returns an int when called with one argument, otherwise the
    same type as the number. ndigits may be negative.

In [63]:
help(max)
Help on built-in function max in module builtins:

max(...)
    max(iterable, *[, default=obj, key=func]) -> value
    max(arg1, arg2, *args, *[, key=func]) -> value
    
    With a single iterable argument, return its biggest item. The
    default keyword-only argument specifies an object to return if
    the provided iterable is empty.
    With two or more arguments, return the largest argument.

Python reports a syntax error when it can’t understand the source of a program.

# Forgot to close the quote marks around the string.
name = 'Feng

Try it!

In [66]:
name = 'Feng'
print(name)
Feng
  • The message indicates a problem on first line of the input (“line 1”).
    • In this case the “ipython-input” section of the file name tells us that we are working with input into IPython, the Python interpreter used by the Jupyter Notebook.
  • The -6- part of the filename indicates that the error occurred in cell 6 of our Notebook.
  • Next is the problematic line of code, indicating the problem with a ^ pointer.

Python reports a runtime error when something goes wrong while a program is executing.

In [68]:
age = 53
remaining = 100 - age # mis-spelled 'age'
print(remaining)
47
  • Fix syntax errors by reading the source and runtime errors by tracing execution.

Getting help in Jupyter

There are 2 ways to get help in a Jupyter Notebook

  1. Place the cursor anywhere in the function invocation (i.e., the function name or its parameters), hold down shift, and press tab.

  2. Or type a function name with a question mark after it.

Every function returns something

  • Every function call produces some result.
  • If the function doesn’t have a useful result to return, it usually returns the special value None.
In [69]:
max?
In [70]:
result = print('example')
print('result of print is', result)
example
result of print is None

Excercises

What happens when...

  1. Explain in simple terms the order of operations in the following program: when does the addition happen, when does the subtraction happen, when is each function called, etc.
  2. What is the final value of radiance?
In [71]:
radiance = 1.0
radiance = max(2.1, 2.0 + min(radiance, 1.1 * radiance - 0.5))
# The value is ... 2.6
print(radiance)
2.6

Spot the Difference!

  1. Predict what each of the print statements in the program below will print.
  2. Does max(len(rich), poor) run or produce an error message? If it runs, does its result make any sense?
easy_string = "abc"
print(max(easy_string))
rich = "gold"
poor = "tin"
print(max(rich, poor))
print(max(len(rich), len(poor)))

Why not?

Why don’t max and min return None when they are given no arguments?

KEY POINTS

  • Use comments to add documentation to programs.

  • A function may take zero or more arguments.

  • Commonly-used built-in functions include max, min, and round.

  • Functions may only work for certain (combinations of) arguments.

  • Functions may have default values for some arguments.

  • Use the built-in function help to get help for a function.

  • The Jupyter Notebook has two ways to get help.

  • Every function returns something.

  • Python reports a syntax error when it can’t understand the source of a program.

  • Python reports a runtime error when something goes wrong while a program is executing.

  • Fix syntax errors by reading the source code, and runtime errors by tracing the program’s execution.

Libraries

10 min

Most of the power of a programming language is in its libraries.

  • Library = collection of files (modules) that functions for use by other programs.

  • The Python standard library is an extensive suite of modules that comes with Python itself.

Other libraries available in PyPI (the Python Package Index).

A program must import a library module before using it.

  • Use import to load a library module into a program’s memory.
  • Then refer to things from the module as module_name.thing_name. Python uses . to mean “part of”.
In [72]:
import math

print('pi is', math.pi)
print('cos(pi) is', math.cos(math.pi))
pi is 3.141592653589793
cos(pi) is -1.0

We can also use help() to learn about the content of a library module, just like we do with functions!

In [73]:
help(math)
Help on module math:

NAME
    math

MODULE REFERENCE
    https://docs.python.org/3.6/library/math
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module is always available.  It provides access to the
    mathematical functions defined by the C standard.

FUNCTIONS
    acos(...)
        acos(x)
        
        Return the arc cosine (measured in radians) of x.
    
    acosh(...)
        acosh(x)
        
        Return the inverse hyperbolic cosine of x.
    
    asin(...)
        asin(x)
        
        Return the arc sine (measured in radians) of x.
    
    asinh(...)
        asinh(x)
        
        Return the inverse hyperbolic sine of x.
    
    atan(...)
        atan(x)
        
        Return the arc tangent (measured in radians) of x.
    
    atan2(...)
        atan2(y, x)
        
        Return the arc tangent (measured in radians) of y/x.
        Unlike atan(y/x), the signs of both x and y are considered.
    
    atanh(...)
        atanh(x)
        
        Return the inverse hyperbolic tangent of x.
    
    ceil(...)
        ceil(x)
        
        Return the ceiling of x as an Integral.
        This is the smallest integer >= x.
    
    copysign(...)
        copysign(x, y)
        
        Return a float with the magnitude (absolute value) of x but the sign 
        of y. On platforms that support signed zeros, copysign(1.0, -0.0) 
        returns -1.0.
    
    cos(...)
        cos(x)
        
        Return the cosine of x (measured in radians).
    
    cosh(...)
        cosh(x)
        
        Return the hyperbolic cosine of x.
    
    degrees(...)
        degrees(x)
        
        Convert angle x from radians to degrees.
    
    erf(...)
        erf(x)
        
        Error function at x.
    
    erfc(...)
        erfc(x)
        
        Complementary error function at x.
    
    exp(...)
        exp(x)
        
        Return e raised to the power of x.
    
    expm1(...)
        expm1(x)
        
        Return exp(x)-1.
        This function avoids the loss of precision involved in the direct evaluation of exp(x)-1 for small x.
    
    fabs(...)
        fabs(x)
        
        Return the absolute value of the float x.
    
    factorial(...)
        factorial(x) -> Integral
        
        Find x!. Raise a ValueError if x is negative or non-integral.
    
    floor(...)
        floor(x)
        
        Return the floor of x as an Integral.
        This is the largest integer <= x.
    
    fmod(...)
        fmod(x, y)
        
        Return fmod(x, y), according to platform C.  x % y may differ.
    
    frexp(...)
        frexp(x)
        
        Return the mantissa and exponent of x, as pair (m, e).
        m is a float and e is an int, such that x = m * 2.**e.
        If x is 0, m and e are both 0.  Else 0.5 <= abs(m) < 1.0.
    
    fsum(...)
        fsum(iterable)
        
        Return an accurate floating point sum of values in the iterable.
        Assumes IEEE-754 floating point arithmetic.
    
    gamma(...)
        gamma(x)
        
        Gamma function at x.
    
    gcd(...)
        gcd(x, y) -> int
        greatest common divisor of x and y
    
    hypot(...)
        hypot(x, y)
        
        Return the Euclidean distance, sqrt(x*x + y*y).
    
    isclose(...)
        isclose(a, b, *, rel_tol=1e-09, abs_tol=0.0) -> bool
        
        Determine whether two floating point numbers are close in value.
        
           rel_tol
               maximum difference for being considered "close", relative to the
               magnitude of the input values
            abs_tol
               maximum difference for being considered "close", regardless of the
               magnitude of the input values
        
        Return True if a is close in value to b, and False otherwise.
        
        For the values to be considered close, the difference between them
        must be smaller than at least one of the tolerances.
        
        -inf, inf and NaN behave similarly to the IEEE 754 Standard.  That
        is, NaN is not close to anything, even itself.  inf and -inf are
        only close to themselves.
    
    isfinite(...)
        isfinite(x) -> bool
        
        Return True if x is neither an infinity nor a NaN, and False otherwise.
    
    isinf(...)
        isinf(x) -> bool
        
        Return True if x is a positive or negative infinity, and False otherwise.
    
    isnan(...)
        isnan(x) -> bool
        
        Return True if x is a NaN (not a number), and False otherwise.
    
    ldexp(...)
        ldexp(x, i)
        
        Return x * (2**i).
    
    lgamma(...)
        lgamma(x)
        
        Natural logarithm of absolute value of Gamma function at x.
    
    log(...)
        log(x[, base])
        
        Return the logarithm of x to the given base.
        If the base not specified, returns the natural logarithm (base e) of x.
    
    log10(...)
        log10(x)
        
        Return the base 10 logarithm of x.
    
    log1p(...)
        log1p(x)
        
        Return the natural logarithm of 1+x (base e).
        The result is computed in a way which is accurate for x near zero.
    
    log2(...)
        log2(x)
        
        Return the base 2 logarithm of x.
    
    modf(...)
        modf(x)
        
        Return the fractional and integer parts of x.  Both results carry the sign
        of x and are floats.
    
    pow(...)
        pow(x, y)
        
        Return x**y (x to the power of y).
    
    radians(...)
        radians(x)
        
        Convert angle x from degrees to radians.
    
    sin(...)
        sin(x)
        
        Return the sine of x (measured in radians).
    
    sinh(...)
        sinh(x)
        
        Return the hyperbolic sine of x.
    
    sqrt(...)
        sqrt(x)
        
        Return the square root of x.
    
    tan(...)
        tan(x)
        
        Return the tangent of x (measured in radians).
    
    tanh(...)
        tanh(x)
        
        Return the hyperbolic tangent of x.
    
    trunc(...)
        trunc(x:Real) -> Integral
        
        Truncates x to the nearest Integral toward 0. Uses the __trunc__ magic method.

DATA
    e = 2.718281828459045
    inf = inf
    nan = nan
    pi = 3.141592653589793
    tau = 6.283185307179586

FILE
    /usr/local/lib/python3.6/lib-dynload/math.cpython-36m-x86_64-linux-gnu.so


Import specific items from a library module to shorten programs.

  • Use from ... import ... to load only specific items from a library module.
  • Then refer to them directly without library name as prefix.
In [74]:
from math import cos, pi

print('cos(pi) is', cos(pi))
cos(pi) is -1.0

Create an alias for a library module when importing it to shorten programs.

  • Use import ... as ... to give a library a short alias while importing it.
  • Then refer to items in the library using that shortened name.

Exercises

Exploring the Math Module

  • What function from the math module can you use to calculate a square root without using sqrt?
  • Since the library contains this function, why does sqrt exist?

Locating the Right Module

You want to select a random character from a string:

bases = 'ACTTGCTTGAC'
  1. Which standard library module could help you?
  2. Which function would you select from that module? Are there alternatives?
  3. Try to write a program that uses the function.
In [78]:
from random import randrange
bases = 'ACTTGCTTGAC'
print(bases[randrange(len(bases))])
A

Jigsaw Puzzle (Parson’s Problem)

Rearrange the following statements so that a random DNA base is printed and its index in the string. Not all statements may be needed. Feel free to use/add intermediate variables.

In [ ]:
bases="ACTTGCTTGAC"
import math
import random
___ = random.randrange(n_bases)
___ = len(bases)
print("random base ", bases[___], "base index", ___)

Importing with aliases

  1. Fill in the blanks so that the program below prints 90.0.
  2. Rewrite the program so that it uses import without as.
  3. Which form do you find easier to read?
In [80]:
import math as m
angle = m.degrees(m.pi / 2)
print(angle)
90.0

Importing Specific Items

  1. Fill in the blanks so that the program below prints 90.0.
  2. Do you find this version easier to read than preceding ones?
  3. Why wouldn’t programmers always use this form of import?
In [ ]:
____ math import ____, ____
angle = degrees(pi / 2)
print(angle)

Reading Error Messages

  1. Read the code below and try to identify what the errors are without running it.
  2. Run the code, and read the error message. What type of error is it?
In [79]:
from math import log
log(0)
----------------------------------------------------------------------
ValueError                           Traceback (most recent call last)
<ipython-input-79-d72e1d780bab> in <module>()
      1 from math import log
----> 2 log(0)

ValueError: math domain error

Keypoints

  • Most of the power of a programming language is in its libraries.
  • A program must import a library module in order to use it.
  • Use help to learn about the contents of a library module.
  • Import specific items from a library to shorten programs.
  • Create an alias for a library when importing it to shorten programs.

Data Frames

Reading tabular data into DataFrame

(15 min)

Exercises (10 min)

Pandas

A widely know library for statistics, particularly on tabular data.

  • Borrows many features from R´s dataframes.

Dataframe: A 2-dimensional table whose columns have names and potentially have different data types.

In [81]:
#Let's import pandas
import pandas as pd

data = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_oceania.csv')
print(data)
       country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
0    Australia     10039.59564     10949.64959     12217.22686   
1  New Zealand     10556.57566     12247.39532     13175.67800   

   gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  \
0     14526.12465     16788.62948     18334.19751     19477.00928   
1     14463.91893     16046.03728     16233.71770     17632.41040   

   gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  \
0     21888.88903     23424.76683     26997.93657     30687.75473   
1     19007.19129     18363.32494     21050.41377     23189.80135   

   gdpPercap_2007  
0     34435.36744  
1     25185.00911  

File not found

Our lessons store their data files in a data sub-directory, which is why the path to the file is data/gapminder_gdp_oceania.csv. If you forget to include data/, or if you include it but your copy of the file is somewhere else, you will get a runtime error that ends with a line like this:

ERROR
OSError: File b'gapminder_gdp_oceania.csv' does not exist

Use index_col to specify that a column’s values should be used as row headings.

  • Row headings are numbers (0 and 1 in this case).
  • Really want to index by country.
  • Pass the name of the column to read_csv as its index_col parameter to do this.
In [84]:
data = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_oceania.csv', index_col='country')
data
Out[84]:
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
country
Australia 10039.59564 10949.64959 12217.22686 14526.12465 16788.62948 18334.19751 19477.00928 21888.88903 23424.76683 26997.93657 30687.75473 34435.36744
New Zealand 10556.57566 12247.39532 13175.67800 14463.91893 16046.03728 16233.71770 17632.41040 19007.19129 18363.32494 21050.41377 23189.80135 25185.00911

Use DataFrame.info to explore a little the dataframe

In [85]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Australia to New Zealand
Data columns (total 12 columns):
gdpPercap_1952    2 non-null float64
gdpPercap_1957    2 non-null float64
gdpPercap_1962    2 non-null float64
gdpPercap_1967    2 non-null float64
gdpPercap_1972    2 non-null float64
gdpPercap_1977    2 non-null float64
gdpPercap_1982    2 non-null float64
gdpPercap_1987    2 non-null float64
gdpPercap_1992    2 non-null float64
gdpPercap_1997    2 non-null float64
gdpPercap_2002    2 non-null float64
gdpPercap_2007    2 non-null float64
dtypes: float64(12)
memory usage: 208.0+ bytes

What we know?

  • data is a DataFrame
  • It has two rows named 'Australia' and 'New Zealand'
  • There are Twelve columns, each of which has two actual 64-bit floating point values.
  • No null values, aka no missing observations.
  • Uses 208.0 bytes of memory

The DataFrame.columns variable stores information about the dataframe's columns

  • Note that this is data, not a method.
    • Like math.pi.
    • So do not use () to try to call it.
  • Called a member variable, or just member.

data is the dataframe, not a method, don't use () to try to call it.

In [86]:
print(data.columns)
Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
       'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
       'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
      dtype='object')

Use DataFrame.T to transpose a dataframe

  • Sometimes want to treat columns as rows and vice versa.
  • Transpose (written .T) doesn’t copy the data, just changes the program’s view of it.
  • Like columns, it is a member variable.
In [87]:
print(data.T)
data.T
country           Australia  New Zealand
gdpPercap_1952  10039.59564  10556.57566
gdpPercap_1957  10949.64959  12247.39532
gdpPercap_1962  12217.22686  13175.67800
gdpPercap_1967  14526.12465  14463.91893
gdpPercap_1972  16788.62948  16046.03728
gdpPercap_1977  18334.19751  16233.71770
gdpPercap_1982  19477.00928  17632.41040
gdpPercap_1987  21888.88903  19007.19129
gdpPercap_1992  23424.76683  18363.32494
gdpPercap_1997  26997.93657  21050.41377
gdpPercap_2002  30687.75473  23189.80135
gdpPercap_2007  34435.36744  25185.00911
Out[87]:
country Australia New Zealand
gdpPercap_1952 10039.59564 10556.57566
gdpPercap_1957 10949.64959 12247.39532
gdpPercap_1962 12217.22686 13175.67800
gdpPercap_1967 14526.12465 14463.91893
gdpPercap_1972 16788.62948 16046.03728
gdpPercap_1977 18334.19751 16233.71770
gdpPercap_1982 19477.00928 17632.41040
gdpPercap_1987 21888.88903 19007.19129
gdpPercap_1992 23424.76683 18363.32494
gdpPercap_1997 26997.93657 21050.41377
gdpPercap_2002 30687.75473 23189.80135
gdpPercap_2007 34435.36744 25185.00911

Use DataFrame.describe to get summary statistics about data

DataFrame.describe() gets the summary statistics of only the columns that have numerical data. All other columns are ignored, unless you use the argument include='all'

In [89]:
data.describe()
Out[89]:
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
count 2.000000 2.000000 2.000000 2.000000 2.00000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000
mean 10298.085650 11598.522455 12696.452430 14495.021790 16417.33338 17283.957605 18554.709840 20448.040160 20894.045885 24024.175170 26938.778040 29810.188275
std 365.560078 917.644806 677.727301 43.986086 525.09198 1485.263517 1304.328377 2037.668013 3578.979883 4205.533703 5301.853680 6540.991104
min 10039.595640 10949.649590 12217.226860 14463.918930 16046.03728 16233.717700 17632.410400 19007.191290 18363.324940 21050.413770 23189.801350 25185.009110
25% 10168.840645 11274.086022 12456.839645 14479.470360 16231.68533 16758.837652 18093.560120 19727.615725 19628.685413 22537.294470 25064.289695 27497.598692
50% 10298.085650 11598.522455 12696.452430 14495.021790 16417.33338 17283.957605 18554.709840 20448.040160 20894.045885 24024.175170 26938.778040 29810.188275
75% 10427.330655 11922.958888 12936.065215 14510.573220 16602.98143 17809.077557 19015.859560 21168.464595 22159.406358 25511.055870 28813.266385 32122.777857
max 10556.575660 12247.395320 13175.678000 14526.124650 16788.62948 18334.197510 19477.009280 21888.889030 23424.766830 26997.936570 30687.754730 34435.367440

Not particularly useful with just two records, but very helpful when there are thousands

Exercises

Read the data in gapminder_gdp_americas.csv (should be in the same directory as gapminder_gdp_oceania.csv)

Tip

check the parameters to define the index.

Inspect the data.

Use the function help(americas.head) and help(americas.tail) the answer:

  1. What method callwill display the first three rows of this data?
  2. What method call will display the last three columns of this data?

Reading files in other directories

The data for your current project is stored in a file called microbes.csv, which is located in a folder called field_data. You are doing analysis in a notebook called analysis.ipynb in a sibling folder called thesis:

your_home_directory
+-- field_data/
|   +-- microbes.csv
+-- thesis/
    +-- analysis.ipynb

What value(s) should you pass to read_csv to read microbes.csv in analysis.ipynb?

Writing Data

As well as the read_csv function for reading data from a file, Pandas provides a to_csv function to write dataframes to files. Applying what you’ve learned about reading from files, write one of your dataframes to a file called processed.csv. You can use help to get information on how to use to_csv.

In [96]:
data2 = data.copy()
#pd.to_csv
#data2.to_csv?
data2.to_csv('/home/mcubero/dataSanJose19/data/processed.csv')

Key Points

  • Use the Pandas library to get basic statistics out of tabular data.

  • Use index_col to specify that a column’s values should be used as row headings.

  • Use DataFrame.info to find out more about a dataframe.

  • The DataFrame.columns variable stores information about the dataframe’s columns.

  • Use DataFrame.T to transpose a dataframe.

  • Use DataFrame.describe to get summary statistics about data.

STRETCHING TIME!

Recap

Feedback & questions