Session 1¶ {#Session-1}
Plotting and programming with Python 3¶ {#Plotting-and-programming-with-Python-3}
Open a Jupyter Notebook¶ {#Open-a-Jupyter-Notebook}
(10 minutes)
http://swcarpentry.github.io/python-novice-gapminder/01-run-quit/index.html
Once you have your Notebook open let`s start with the session.
Heo
Variables and Assignment¶ {#Variables-and-Assignment}
(15 min)
Exercises (10min)
Variables are names for determined values.
Note:
-
To assing the value to a variable use = like this:
variable = value
-
The variable is created when a value is assingned.
For example let's type in our names and ages:
In [2]:
age = 21
name = 'Mariana'
ejemplo= 5+5
print(age, name, ejemplo)
21 Mariana 10
In [4]:
#1variable = 1
variable1 = 1
Variable names¶ {#Variable-names}
-
Only letters, digits and underscore _ (for long names)
-
Do not start with a digit
-
It is possible to start with an underscore like
__real_age
, but this has a special meaning.
(We won't do that yet)
Display values with:¶ {#Display-values-with:}
print()
This built-in function prints values as text
Examples:
In [5]:
print('My name is', name, ', I am ', age)
My name is Mariana , I am 21
Please remember that variables must be created before they are used {#Please-remember-that-variables-must-be-created-before-they-are-used!}
Otherwise Python reports an error
In [8]:
cat_name = "Salem"
print(cat_name)
Salem
Be aware that it is the order of execution of cells that is important in a Jupyter notebook, not the order in which they appear.
Python will remember all the code that was run previously, including any variables you have defined, irrespective of the order in the notebook. Therefore if you define variables lower down the notebook and then (re)run cells further up, those defined further down will still be present.
Variables can be reused¶ {#Variables-can-be-reused}
In [9]:
age = age + 1
print('Next year I will be', age)
Next year I will be 22
Also, you can index to get a single character from a string.
-
'AB' != 'BA' -> ordering matters, because we can treat the string as a list of characters.
-
Python uses zero-based indexing.
-
Use square brackets to get a character in a specific position.
atom_name = 'helium'
print(atom_name[0])
In [11]:
atom_name = 'helium'
print(atom_name)
print(atom_name[0])
helium
h
Slicing to get a substring {#Slicing-to-get-a-substring!}
-
Quick concepts
- Substring:A part of a string. (1 to n number of characters)
- Element: An item on a list. In a string, the elements are the characters.
- Slice: part of a string (or any list-like thing)
How to slice strings:
-
[start:stop], where start is replaced with the index of the first element we want and stop is replaced with the index of the element just after the last element we want.
-
Taking a slice does not change the contents of the original string. Instead, the slice is a copy of part of the original string.
In [13]:
atom_name = 'sodium'
print(atom_name[0:3])
# Use len to find the lenght of a string
print(len('helium'))
sod
6
Final considerations:¶ {#Final-considerations:}
-
Python is case-sensitive upper- and lower-case letters are different. Name != name are different variables.
-
Use meaningful variable names
flabadab = 42
ewr_422_yY = 'Ahmed'
print(ewr_422_yY, 'is', flabadab, 'years old')
In [15]:
hola = 1
HOLA= 2
print(hola, HOLA)
1 2
Excercises¶ {#Excercises}
- What is the final value of position in the program below? (Try to predict the value without running the program, then check your prediction.)
initial = 'left'
position = initial
initial = 'right'
In [18]:
initial = 'left'
position = initial
initial = 'right'
print(initial, position)
right left
- If you assign
a = 123
, what happens if you try to get the second digit of a viaa[1]
?
In [19]:
a = 123
a[1]
----------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-19-d77b277d0f9d> in <module>()
1 a = 123
----> 2 a[1]
TypeError: 'int' object is not subscriptable
- What does the following program print?
atom_name = 'carbon'
print('atom_name[1:3] is:', atom_name[1:3])
In [20]:
atom_name = 'carbon'
print('atom_name[1:3] is:', atom_name[1:3])
atom_name[1:3] is: ar
1.What does thing[low:high] do?
2.What does thing[low:] (without a value after the colon) do?
3.What does thing[:high] (without a value before the colon) do?
4.What does thing[:] (just a colon) do?
5.What does thing[number:some-negative-number] do?
6.What happens when you choose a high value which is out of range? (i.e., try atom_name[0:15])
KEY POINTS - Variable and Assignment¶ {#KEY-POINTS---Variable-and-Assignment}
Use variables to store values.
Use print to display values.
Variables persist between cells.
Variables must be created before they are used.
Variables can be used in calculations.
Use an index to get a single character from a string.
Use a slice to get a substring.
Use the built-in function len to find the length of a string.
Python is case-sensitive.
Use meaningful variable names.
Data Types and Type Conversion¶ {#Data-Types-and-Type-Conversion}
(10 min)
Exercises (10min)
Every value has a type.¶ {#Every-value-has-a-type.}
Existing types are:
- Integer
(int)
: positive or negative whole numbers - Floating point numbers
(float)
: real numbers - Character string
(str)
: text- Written in either single or double quotes (but they have to match)
Use the built-in function type
to find out what type a value or
variable has
print(type(variable))
In [22]:
print(type(52))
fitness = 'average'
print(type(fitness))
<class 'float'>
<class 'str'>
Operations or methods for a given type¶ {#Operations-or-methods-for-a-given-type}
Types control what operations (or methodds) can be performed on a given value
- A value’s type determines what the program can do to it.
In [23]:
#This works
print(5 - 3)
2
In [24]:
#This works?
print('hello' - 'h')
----------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-24-c60e4b1b64d0> in <module>()
1 #This works?
----> 2 print('hello' - 'h')
TypeError: unsupported operand type(s) for -: 'str' and 'str'
-
Operators that can be used on strings
- +
- *
"Adding" characters strings concatenates them.
In [26]:
full_name = 'Ahmed'+ 'Walsh'
print(full_name)
AhmedWalsh
Multiplying a character string by an integer N creates a new string that consists of that character string repeated N times.
In [27]:
separator = '=' * 10
print(separator)
==========
- Try function len on strings and numbers
In [28]:
#Text
print(len(full_name))
10
In [29]:
#Number
print(len(256))
----------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-29-00ccdb9afd37> in <module>()
1 #Number
----> 2 print(len(256))
TypeError: object of type 'int' has no len()
Excercises¶ {#Excercises}
-
What type of value is 3.4? How can you find out?
-
What type of value is 3.25 + 4?
-
What type of value (integer, floating point number, or character string) would you use to represent each of the following? Try to come up with more than one good answer for each problem. For example, in # 1, when would counting days with a floating point variable make more sense than using an integer?
Number of days since the start of the year.
Time elapsed from the start of the year until now in days.
Serial number of a piece of lab equipment.
A lab specimen’s age Current population of a city.
Average population of a city over time.
- Which of the following will return the floating point number
2.0
? Note: there may be more than one right answer.
first = 1.0
second = "1"
third = "1.1"
1.first + float(second)
2.float(second) + float(third)
3.first + int(third)
4.first + int(float(third))
5.int(first) + int(float(third))
- 2.0 * second
Key Points - Data Types and Type Conversion¶ {#Key-Points---Data-Types-and-Type-Conversion}
-
Every value has a type.
-
Use the built-in function type to find the type of a value.
-
Types control what operations can be done on values.
-
Strings can be added and multiplied.
-
Strings have a length (but numbers don’t).
-
Must convert numbers to strings or vice versa when operating on them.
-
Can mix integers and floats freely in operations.
-
Variables only change value when something is assigned to them.
STRETCHING TIME {#STRETCHING-TIME!}
Numpy¶ {#Numpy}
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
-
a powerful N-dimensional array object
-
sophisticated (broadcasting) functions
-
tools for integrating C/C++ and Fortran code
-
useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
We can create random numers easily
In [34]:
import numpy as np
In [30]:
from numpy import random
r = random.randint(1, 35)
print(r)
26
A very useful data structure, very similar to lists with a few exceptions such as:
- The number of elements in the array cannot be changed. (This means we cant .append() or remove)
- All element in the arrays must be the type of variables
- Once the array is created we can't change the data type.
Advantages over lists:
- Arrays can be n-dimensional (vector, matrix, tensor)
- Supports arithmetic algebraic
- Arrays work faster than lists
In [32]:
mylist = [1,2,3,4 ]
mylist
Out[32]:
[1, 2, 3, 4]
In [35]:
np.array(mylist)
Out[35]:
array([1, 2, 3, 4])
In [36]:
mymatrix = [[1,2,3], [4,5,6], [7,8,9]]
mymatrix
Out[36]:
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
In [37]:
m = np.array(mymatrix)
m
Out[37]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
In [38]:
type(m)
Out[38]:
numpy.ndarray
There are many ways to create an array, let's try some:
Let's asume a
is any numpy array.
Function Description
a.shape
Returns a tuple with the numer of elements per dimension
a.ndim
Number of dimension
a.size
Number of elements in an array
a.dtype
Data type of the elements in the array
a.T
Transposes the array
a.flat
Collapses the array in 1 dimension
a.copy()
Returns a copy of the array
a.fill()
Fills the array with a determined value
a.reshape()
Returns an array with the same data but in the shape we indicate
a.resize()
Changes the shape of the array, but this does not creates a copy of the original array
a.sort()
Reorders the array
In [43]:
# Try some!
print(m.dtype)
print(m)
int64
[[1 2 3]
[4 5 6]
[7 8 9]]
Slicing, but in arrays¶ {#Slicing,-but-in-arrays}
It's more flexible. Slicings follows this structure:
beginning:end:step
In [52]:
a = np.linspace(0, 10, 11)
#blank spaces mean "everthing"
# all
print(a[1:11])
[ 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.]
In [0]:
# 3 to 8
print(a[3:8])
# 1 to 9 with steps of size 2 (odd)
print(a[1:11:2])
[3. 4. 5. 6. 7.]
[1. 3. 5. 7. 9.]
In [54]:
# conditionals
print(a[a > 4])
[ 5. 6. 7. 8. 9. 10.]
Exercises¶ {#Exercises}
Try it yourself
- Create a non squared matrix, and print the dimensions (.shape)
- Use .reshape to change the shape of the array
- Try . size is equal to np.prod(a.shape)
Built-in Functions and Help¶ {#Built-in-Functions-and-Help}
(10 min)
Exercises (10 min)
Arguments in a function
-
A function may take zero or more arguments
-
An argument is a value passed into a function
For example some of the functions we used so far have arguments
len
takes exactly one argument
len(_x_)
-
int, str and float create a new value from an existing one
-
print takes zero or more arguments
- With no arguments prints a blank line
- Must use the parethesis to use the function
print('before')
print()
print('after')
Commonly-used built-in functions include max
, min
and round
¶ {#Commonly-used-built-in-functions-include-max,-min-and-round}
max()
min()
round()
You can also combine some functions
print(max(1, 2, 3))
print(min('a', 'A', '0'))
But these functions may only work for certain (combination of) arguments.
For example:
-
max and min must be given at least one argument.
- “Largest of the empty set” is a meaningless question.
- And they must be given things that can meaningfully be compared.
In [59]:
print(max(1,4,6,9,100000))
print(min('a', 'A', 0))
100000
----------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-59-4a37fdd5cb05> in <module>()
1 print(max(1,4,6,9,100000))
----> 2 print(min('a', 'A', 0))
TypeError: '<' not supported between instances of 'int' and 'str'
Functions may have default values for some arguments¶ {#Functions-may-have-default-values-for-some-arguments}
round
will round off a floating-point number.- By default, rounds to zero decimal places.
In [61]:
round(3.712)
# We can specify the number of decimal places we want. Let's try:
round(3.712, 2 )
Out[61]:
3.71
You can use the built-in function help
to get help for a function.¶ {#You-can-use-the-built-in-function-help-to-get-help-for-a-function.}
- Every built-in has online documentation.
In [62]:
help(round)
Help on built-in function round in module builtins:
round(...)
round(number[, ndigits]) -> number
Round a number to a given precision in decimal digits (default 0 digits).
This returns an int when called with one argument, otherwise the
same type as the number. ndigits may be negative.
In [63]:
help(max)
Help on built-in function max in module builtins:
max(...)
max(iterable, *[, default=obj, key=func]) -> value
max(arg1, arg2, *args, *[, key=func]) -> value
With a single iterable argument, return its biggest item. The
default keyword-only argument specifies an object to return if
the provided iterable is empty.
With two or more arguments, return the largest argument.
Python reports a syntax error when it can’t understand the source of a program.¶ {#Python-reports-a-syntax-error-when-it-can’t-understand-the-source-of-a-program.}
# Forgot to close the quote marks around the string.
name = 'Feng
Try it!
In [66]:
name = 'Feng'
print(name)
Feng
- The message indicates a problem on first line of the input (“line
1”).
- In this case the “ipython-input” section of the file name tells us that we are working with input into IPython, the Python interpreter used by the Jupyter Notebook.
- The -6- part of the filename indicates that the error occurred in cell 6 of our Notebook.
- Next is the problematic line of code, indicating the problem with a \^ pointer.
Python reports a runtime error when something goes wrong while a program is executing.¶ {#Python-reports-a-runtime-error-when-something-goes-wrong-while-a-program-is-executing.}
In [68]:
age = 53
remaining = 100 - age # mis-spelled 'age'
print(remaining)
47
- Fix syntax errors by reading the source and runtime errors by tracing execution.
Getting help in Jupyter¶ {#Getting-help-in-Jupyter}
There are 2 ways to get help in a Jupyter Notebook
-
Place the cursor anywhere in the function invocation (i.e., the function name or its parameters), hold down
shift
, and presstab
. -
Or type a function name with a question mark after it.
Every function returns something
- Every function call produces some result.
- If the function doesn’t have a useful result to return, it usually
returns the special value
None
.
In [69]:
max?
In [70]:
result = print('example')
print('result of print is', result)
example
result of print is None
Excercises¶ {#Excercises}
What happens when...¶ {#What-happens-when...}
- Explain in simple terms the order of operations in the following program: when does the addition happen, when does the subtraction happen, when is each function called, etc.
- What is the final value of radiance?
In [71]:
radiance = 1.0
radiance = max(2.1, 2.0 + min(radiance, 1.1 * radiance - 0.5))
# The value is ... 2.6
print(radiance)
2.6
Spot the Difference {#Spot-the-Difference!}
- Predict what each of the print statements in the program below will print.
- Does max(len(rich), poor) run or produce an error message? If it runs, does its result make any sense?
easy_string = "abc"
print(max(easy_string))
rich = "gold"
poor = "tin"
print(max(rich, poor))
print(max(len(rich), len(poor)))
Why not?¶ {#Why-not?}
Why don’t max and min return None when they are given no arguments?
KEY POINTS¶ {#KEY-POINTS}
-
Use comments to add documentation to programs.
-
A function may take zero or more arguments.
-
Commonly-used built-in functions include max, min, and round.
-
Functions may only work for certain (combinations of) arguments.
-
Functions may have default values for some arguments.
-
Use the built-in function help to get help for a function.
-
The Jupyter Notebook has two ways to get help.
-
Every function returns something.
-
Python reports a syntax error when it can’t understand the source of a program.
-
Python reports a runtime error when something goes wrong while a program is executing.
-
Fix syntax errors by reading the source code, and runtime errors by tracing the program’s execution.
Libraries¶ {#Libraries}
10 min
Most of the power of a programming language is in its libraries.
-
Library = collection of files (modules) that functions for use by other programs.
-
The Python standard library is an extensive suite of modules that comes with Python itself.
Other libraries available in PyPI (the Python Package Index).
A program must import a library module before using it.
- Use import to load a library module into a program’s memory.
- Then refer to things from the module as module_name.thing_name. Python uses . to mean “part of”.
In [72]:
import math
print('pi is', math.pi)
print('cos(pi) is', math.cos(math.pi))
pi is 3.141592653589793
cos(pi) is -1.0
We can also use help() to learn about the content of a library module, just like we do with functions!
In [73]:
help(math)
Help on module math:
NAME
math
MODULE REFERENCE
https://docs.python.org/3.6/library/math
The following documentation is automatically generated from the Python
source files. It may be incomplete, incorrect or include features that
are considered implementation detail and may vary between Python
implementations. When in doubt, consult the module reference at the
location listed above.
DESCRIPTION
This module is always available. It provides access to the
mathematical functions defined by the C standard.
FUNCTIONS
acos(...)
acos(x)
Return the arc cosine (measured in radians) of x.
acosh(...)
acosh(x)
Return the inverse hyperbolic cosine of x.
asin(...)
asin(x)
Return the arc sine (measured in radians) of x.
asinh(...)
asinh(x)
Return the inverse hyperbolic sine of x.
atan(...)
atan(x)
Return the arc tangent (measured in radians) of x.
atan2(...)
atan2(y, x)
Return the arc tangent (measured in radians) of y/x.
Unlike atan(y/x), the signs of both x and y are considered.
atanh(...)
atanh(x)
Return the inverse hyperbolic tangent of x.
ceil(...)
ceil(x)
Return the ceiling of x as an Integral.
This is the smallest integer >= x.
copysign(...)
copysign(x, y)
Return a float with the magnitude (absolute value) of x but the sign
of y. On platforms that support signed zeros, copysign(1.0, -0.0)
returns -1.0.
cos(...)
cos(x)
Return the cosine of x (measured in radians).
cosh(...)
cosh(x)
Return the hyperbolic cosine of x.
degrees(...)
degrees(x)
Convert angle x from radians to degrees.
erf(...)
erf(x)
Error function at x.
erfc(...)
erfc(x)
Complementary error function at x.
exp(...)
exp(x)
Return e raised to the power of x.
expm1(...)
expm1(x)
Return exp(x)-1.
This function avoids the loss of precision involved in the direct evaluation of exp(x)-1 for small x.
fabs(...)
fabs(x)
Return the absolute value of the float x.
factorial(...)
factorial(x) -> Integral
Find x!. Raise a ValueError if x is negative or non-integral.
floor(...)
floor(x)
Return the floor of x as an Integral.
This is the largest integer <= x.
fmod(...)
fmod(x, y)
Return fmod(x, y), according to platform C. x % y may differ.
frexp(...)
frexp(x)
Return the mantissa and exponent of x, as pair (m, e).
m is a float and e is an int, such that x = m * 2.**e.
If x is 0, m and e are both 0. Else 0.5 <= abs(m) < 1.0.
fsum(...)
fsum(iterable)
Return an accurate floating point sum of values in the iterable.
Assumes IEEE-754 floating point arithmetic.
gamma(...)
gamma(x)
Gamma function at x.
gcd(...)
gcd(x, y) -> int
greatest common divisor of x and y
hypot(...)
hypot(x, y)
Return the Euclidean distance, sqrt(x*x + y*y).
isclose(...)
isclose(a, b, *, rel_tol=1e-09, abs_tol=0.0) -> bool
Determine whether two floating point numbers are close in value.
rel_tol
maximum difference for being considered "close", relative to the
magnitude of the input values
abs_tol
maximum difference for being considered "close", regardless of the
magnitude of the input values
Return True if a is close in value to b, and False otherwise.
For the values to be considered close, the difference between them
must be smaller than at least one of the tolerances.
-inf, inf and NaN behave similarly to the IEEE 754 Standard. That
is, NaN is not close to anything, even itself. inf and -inf are
only close to themselves.
isfinite(...)
isfinite(x) -> bool
Return True if x is neither an infinity nor a NaN, and False otherwise.
isinf(...)
isinf(x) -> bool
Return True if x is a positive or negative infinity, and False otherwise.
isnan(...)
isnan(x) -> bool
Return True if x is a NaN (not a number), and False otherwise.
ldexp(...)
ldexp(x, i)
Return x * (2**i).
lgamma(...)
lgamma(x)
Natural logarithm of absolute value of Gamma function at x.
log(...)
log(x[, base])
Return the logarithm of x to the given base.
If the base not specified, returns the natural logarithm (base e) of x.
log10(...)
log10(x)
Return the base 10 logarithm of x.
log1p(...)
log1p(x)
Return the natural logarithm of 1+x (base e).
The result is computed in a way which is accurate for x near zero.
log2(...)
log2(x)
Return the base 2 logarithm of x.
modf(...)
modf(x)
Return the fractional and integer parts of x. Both results carry the sign
of x and are floats.
pow(...)
pow(x, y)
Return x**y (x to the power of y).
radians(...)
radians(x)
Convert angle x from degrees to radians.
sin(...)
sin(x)
Return the sine of x (measured in radians).
sinh(...)
sinh(x)
Return the hyperbolic sine of x.
sqrt(...)
sqrt(x)
Return the square root of x.
tan(...)
tan(x)
Return the tangent of x (measured in radians).
tanh(...)
tanh(x)
Return the hyperbolic tangent of x.
trunc(...)
trunc(x:Real) -> Integral
Truncates x to the nearest Integral toward 0. Uses the __trunc__ magic method.
DATA
e = 2.718281828459045
inf = inf
nan = nan
pi = 3.141592653589793
tau = 6.283185307179586
FILE
/usr/local/lib/python3.6/lib-dynload/math.cpython-36m-x86_64-linux-gnu.so
Import specific items from a library module to shorten programs.
- Use from ... import ... to load only specific items from a library module.
- Then refer to them directly without library name as prefix.
In [74]:
from math import cos, pi
print('cos(pi) is', cos(pi))
cos(pi) is -1.0
Create an alias for a library module when importing it to shorten programs.
- Use import ... as ... to give a library a short alias while importing it.
- Then refer to items in the library using that shortened name.
Exercises¶ {#Exercises}
Exploring the Math Module¶ {#Exploring-the-Math-Module}
- What function from the math module can you use to calculate a square root without using sqrt?
- Since the library contains this function, why does sqrt exist?
Locating the Right Module¶ {#Locating-the-Right-Module}
You want to select a random character from a string:
bases = 'ACTTGCTTGAC'
- Which standard library module could help you?
- Which function would you select from that module? Are there alternatives?
- Try to write a program that uses the function.
In [78]:
from random import randrange
bases = 'ACTTGCTTGAC'
print(bases[randrange(len(bases))])
A
Jigsaw Puzzle (Parson’s Problem)¶ {#Jigsaw-Puzzle-(Parson’s-Problem)}
Rearrange the following statements so that a random DNA base is printed and its index in the string. Not all statements may be needed. Feel free to use/add intermediate variables.
In [ ]:
bases="ACTTGCTTGAC"
import math
import random
___ = random.randrange(n_bases)
___ = len(bases)
print("random base ", bases[___], "base index", ___)
Importing with aliases¶ {#Importing-with-aliases}
- Fill in the blanks so that the program below prints 90.0.
- Rewrite the program so that it uses import without as.
- Which form do you find easier to read?
In [80]:
import math as m
angle = m.degrees(m.pi / 2)
print(angle)
90.0
Importing Specific Items¶ {#Importing-Specific-Items}
- Fill in the blanks so that the program below prints 90.0.
- Do you find this version easier to read than preceding ones?
- Why wouldn’t programmers always use this form of import?
In [ ]:
____ math import ____, ____
angle = degrees(pi / 2)
print(angle)
Reading Error Messages¶ {#Reading-Error-Messages}
- Read the code below and try to identify what the errors are without running it.
- Run the code, and read the error message. What type of error is it?
In [79]:
from math import log
log(0)
----------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-79-d72e1d780bab> in <module>()
1 from math import log
----> 2 log(0)
ValueError: math domain error
Keypoints¶ {#Keypoints}
- Most of the power of a programming language is in its libraries.
- A program must import a library module in order to use it.
- Use help to learn about the contents of a library module.
- Import specific items from a library to shorten programs.
- Create an alias for a library when importing it to shorten programs.
Data Frames¶ {#Data-Frames}
Reading tabular data into DataFrame¶ {#Reading-tabular-data-into-DataFrame}
(15 min)
Exercises (10 min)
Pandas¶ {#Pandas}
A widely know library for statistics, particularly on tabular data.
- Borrows many features from R´s dataframes.
Dataframe: A 2-dimensional table whose columns have names and potentially have different data types.
In [81]:
#Let's import pandas
import pandas as pd
data = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_oceania.csv')
print(data)
country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 \
0 Australia 10039.59564 10949.64959 12217.22686
1 New Zealand 10556.57566 12247.39532 13175.67800
gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 \
0 14526.12465 16788.62948 18334.19751 19477.00928
1 14463.91893 16046.03728 16233.71770 17632.41040
gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 \
0 21888.88903 23424.76683 26997.93657 30687.75473
1 19007.19129 18363.32494 21050.41377 23189.80135
gdpPercap_2007
0 34435.36744
1 25185.00911
File not found¶ {#File-not-found}
Our lessons store their data files in a data sub-directory, which is why the path to the file is data/gapminder_gdp_oceania.csv. If you forget to include data/, or if you include it but your copy of the file is somewhere else, you will get a runtime error that ends with a line like this:
ERROR
OSError: File b'gapminder_gdp_oceania.csv' does not exist
Use index_col to specify that a column’s values should be used as row headings.¶ {#Use-index_col-to-specify-that-a-column’s-values-should-be-used-as-row-headings.}
- Row headings are numbers (0 and 1 in this case).
- Really want to index by country.
- Pass the name of the column to read_csv as its index_col parameter to do this.
In [84]:
data = pd.read_csv('/home/mcubero/dataSanJose19/data/gapminder_gdp_oceania.csv', index_col='country')
data
Out[84]:
gdpPercap_1952
gdpPercap_1957
gdpPercap_1962
gdpPercap_1967
gdpPercap_1972
gdpPercap_1977
gdpPercap_1982
gdpPercap_1987
gdpPercap_1992
gdpPercap_1997
gdpPercap_2002
gdpPercap_2007
country
Australia
10039.59564
10949.64959
12217.22686
14526.12465
16788.62948
18334.19751
19477.00928
21888.88903
23424.76683
26997.93657
30687.75473
34435.36744
New Zealand
10556.57566
12247.39532
13175.67800
14463.91893
16046.03728
16233.71770
17632.41040
19007.19129
18363.32494
21050.41377
23189.80135
25185.00911
Use DataFrame.info to explore a little the dataframe
In [85]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Australia to New Zealand
Data columns (total 12 columns):
gdpPercap_1952 2 non-null float64
gdpPercap_1957 2 non-null float64
gdpPercap_1962 2 non-null float64
gdpPercap_1967 2 non-null float64
gdpPercap_1972 2 non-null float64
gdpPercap_1977 2 non-null float64
gdpPercap_1982 2 non-null float64
gdpPercap_1987 2 non-null float64
gdpPercap_1992 2 non-null float64
gdpPercap_1997 2 non-null float64
gdpPercap_2002 2 non-null float64
gdpPercap_2007 2 non-null float64
dtypes: float64(12)
memory usage: 208.0+ bytes
What we know?
- data is a DataFrame
- It has two rows named 'Australia' and 'New Zealand'
- There are Twelve columns, each of which has two actual 64-bit floating point values.
- No null values, aka no missing observations.
- Uses 208.0 bytes of memory
The DataFrame.columns variable stores information about the dataframe's columns¶ {#The-DataFrame.columns-variable-stores-information-about-the-dataframe's-columns}
- Note that this is data, not a method.
- Like math.pi.
- So do not use () to try to call it.
- Called a member variable, or just member.
data is the dataframe, not a method, don't use () to try to call it.
In [86]:
print(data.columns)
Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
dtype='object')
Use DataFrame.T to transpose a dataframe
- Sometimes want to treat columns as rows and vice versa.
- Transpose (written .T) doesn’t copy the data, just changes the program’s view of it.
- Like columns, it is a member variable.
In [87]:
print(data.T)
data.T
country Australia New Zealand
gdpPercap_1952 10039.59564 10556.57566
gdpPercap_1957 10949.64959 12247.39532
gdpPercap_1962 12217.22686 13175.67800
gdpPercap_1967 14526.12465 14463.91893
gdpPercap_1972 16788.62948 16046.03728
gdpPercap_1977 18334.19751 16233.71770
gdpPercap_1982 19477.00928 17632.41040
gdpPercap_1987 21888.88903 19007.19129
gdpPercap_1992 23424.76683 18363.32494
gdpPercap_1997 26997.93657 21050.41377
gdpPercap_2002 30687.75473 23189.80135
gdpPercap_2007 34435.36744 25185.00911
Out[87]:
country
Australia
New Zealand
gdpPercap_1952
10039.59564
10556.57566
gdpPercap_1957
10949.64959
12247.39532
gdpPercap_1962
12217.22686
13175.67800
gdpPercap_1967
14526.12465
14463.91893
gdpPercap_1972
16788.62948
16046.03728
gdpPercap_1977
18334.19751
16233.71770
gdpPercap_1982
19477.00928
17632.41040
gdpPercap_1987
21888.88903
19007.19129
gdpPercap_1992
23424.76683
18363.32494
gdpPercap_1997
26997.93657
21050.41377
gdpPercap_2002
30687.75473
23189.80135
gdpPercap_2007
34435.36744
25185.00911
Use DataFrame.describe to get summary statistics about data¶ {#Use-DataFrame.describe-to-get-summary-statistics-about-data}
DataFrame.describe() gets the summary statistics of only the columns that have numerical data. All other columns are ignored, unless you use the argument include='all'
In [89]:
data.describe()
Out[89]:
gdpPercap_1952
gdpPercap_1957
gdpPercap_1962
gdpPercap_1967
gdpPercap_1972
gdpPercap_1977
gdpPercap_1982
gdpPercap_1987
gdpPercap_1992
gdpPercap_1997
gdpPercap_2002
gdpPercap_2007
count
2.000000
2.000000
2.000000
2.000000
2.00000
2.000000
2.000000
2.000000
2.000000
2.000000
2.000000
2.000000
mean
10298.085650
11598.522455
12696.452430
14495.021790
16417.33338
17283.957605
18554.709840
20448.040160
20894.045885
24024.175170
26938.778040
29810.188275
std
365.560078
917.644806
677.727301
43.986086
525.09198
1485.263517
1304.328377
2037.668013
3578.979883
4205.533703
5301.853680
6540.991104
min
10039.595640
10949.649590
12217.226860
14463.918930
16046.03728
16233.717700
17632.410400
19007.191290
18363.324940
21050.413770
23189.801350
25185.009110
25%
10168.840645
11274.086022
12456.839645
14479.470360
16231.68533
16758.837652
18093.560120
19727.615725
19628.685413
22537.294470
25064.289695
27497.598692
50%
10298.085650
11598.522455
12696.452430
14495.021790
16417.33338
17283.957605
18554.709840
20448.040160
20894.045885
24024.175170
26938.778040
29810.188275
75%
10427.330655
11922.958888
12936.065215
14510.573220
16602.98143
17809.077557
19015.859560
21168.464595
22159.406358
25511.055870
28813.266385
32122.777857
max
10556.575660
12247.395320
13175.678000
14526.124650
16788.62948
18334.197510
19477.009280
21888.889030
23424.766830
26997.936570
30687.754730
34435.367440
Not particularly useful with just two records, but very helpful when there are thousands¶ {#Not-particularly-useful-with-just-two-records,-but-very-helpful-when-there-are-thousands}
Exercises¶ {#Exercises}
Read the data in gapminder_gdp_americas.csv
(should be in the same
directory as gapminder_gdp_oceania.csv
)
Tip¶ {#Tip}
check the parameters to define the index.
Inspect the data.
Use the function help(americas.head) and help(americas.tail) the answer:
- What method callwill display the first three rows of this data?
- What method call will display the last three columns of this data?
Reading files in other directories¶ {#Reading-files-in-other-directories}
The data for your current project is stored in a file called microbes.csv, which is located in a folder called field_data. You are doing analysis in a notebook called analysis.ipynb in a sibling folder called thesis:
your_home_directory
+-- field_data/
| +-- microbes.csv
+-- thesis/
+-- analysis.ipynb
What value(s) should you pass to read_csv to read microbes.csv in analysis.ipynb?
Writing Data¶ {#Writing-Data}
As well as the read_csv function for reading data from a file, Pandas provides a to_csv function to write dataframes to files. Applying what you’ve learned about reading from files, write one of your dataframes to a file called processed.csv. You can use help to get information on how to use to_csv.
In [96]:
data2 = data.copy()
#pd.to_csv
#data2.to_csv?
data2.to_csv('/home/mcubero/dataSanJose19/data/processed.csv')
Key Points¶ {#Key-Points}
-
Use the Pandas library to get basic statistics out of tabular data.
-
Use index_col to specify that a column’s values should be used as row headings.
-
Use DataFrame.info to find out more about a dataframe.
-
The DataFrame.columns variable stores information about the dataframe’s columns.
-
Use DataFrame.T to transpose a dataframe.
-
Use DataFrame.describe to get summary statistics about data.