In [ ]:

#General practice
#Wrap-up
#Feedback

Practice¶ {#Practice}

Using the file "Salaries.csv" in the data folder, do the following:

Load the data.
Check the structure of the file.
Check the type of each variable in the file.
- Remember the method .info()
Select the numeric variables into a separate dataframe (remember to select by column).
Check whether there are any missing values.
If there are missing values, correct them.
Make a quick plot of one of the variables. Be creative!
Rescale the data using preprocessing.MinMaxScaler().

In [7]:

# Load the data
import pandas as pd
data = pd.read_csv('/home/mcubero/dataSanJose19/data/Salaries.csv')
# Check the structure of the file
data.head()

Out[7]:

   Unnamed: 0      rank discipline  yrs.since.phd  yrs.service   sex  salary
0           1      Prof          B             19           18  Male  139750
1           2      Prof          B             20           16  Male  173200
2           3  AsstProf          B              4            3  Male   79750
3           4      Prof          B             45           39  Male  115000
4           5      Prof          B             40           41  Male  141500

In [6]:

# Check the type of variables in the file
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 7 columns):
Unnamed: 0       397 non-null int64
rank             397 non-null object
discipline       397 non-null object
yrs.since.phd    397 non-null int64
yrs.service      397 non-null int64
sex              397 non-null object
salary           397 non-null int64
dtypes: int64(4), object(3)
memory usage: 21.8+ KB

In [11]:

# Select the numeric variables into a separate dataframe, by column position
# (columns 3, 4 and 6 are yrs.since.phd, yrs.service and salary)
num = data.iloc[:, [3, 4, 6]]
num.head()

Out[11]:

   yrs.since.phd  yrs.service  salary
0             19           18  139750
1             20           16  173200
2              4            3   79750
3             45           39  115000
4             40           41  141500
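Selecting by position works here, but it breaks silently if the column order changes. As an alternative, the numeric columns can be picked by dtype. The sketch below is only an illustration: the `sample` frame is a hypothetical stand-in built from the first rows shown above, not the real file, and it assumes the "Unnamed: 0" column is just a row counter that should be dropped.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of Salaries.csv shown above
sample = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "rank": ["Prof", "Prof", "AsstProf"],
    "discipline": ["B", "B", "B"],
    "yrs.since.phd": [19, 20, 4],
    "yrs.service": [18, 16, 3],
    "sex": ["Male", "Male", "Male"],
    "salary": [139750, 173200, 79750],
})

# Keep the numeric columns by dtype, then drop the leftover row counter
num_sample = sample.select_dtypes(include="number").drop(columns=["Unnamed: 0"])
print(list(num_sample.columns))
```

With the real data, the same two calls on `data` should give the same three columns as the `iloc` selection.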

In [13]:

# Check if there are some missing values
num.isna().sum()

Out[13]:

yrs.since.phd    0
yrs.service      0
salary           0
dtype: int64
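There are no missing values in these columns, so the "correct them" step has nothing to do here. For reference, one common way to handle gaps is to fill each column with its median. The sketch below uses a small hypothetical frame (`toy`) with gaps added on purpose. Median filling is only one option, and it is an assumption of this example rather than a method prescribed by the exercise.

```python
import pandas as pd

# Hypothetical data with gaps; the real Salaries.csv has none
toy = pd.DataFrame({
    "yrs.service": [18.0, float("nan"), 3.0, 39.0],
    "salary": [139750.0, 173200.0, float("nan"), 115000.0],
})

# Count the missing values per column before filling
print(toy.isna().sum())

# Fill each column's gaps with that column's median
filled = toy.fillna(toy.median())
print(filled.isna().sum())
```

Dropping incomplete rows with `toy.dropna()` is the other usual choice. Which one is appropriate depends on how much data is missing and why.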

In [14]:

# Rescale the data using preprocessing.MinMaxScaler()
from sklearn import preprocessing

# Save the column names (fit_transform returns a plain NumPy array)
names = num.columns

# Create the scaler (StandardScaler() or MaxAbsScaler() are alternatives)
scaler = preprocessing.MinMaxScaler()

# Fit the scaler and transform the numeric variables
data1 = scaler.fit_transform(num)
data1 = pd.DataFrame(data1, columns=names)

# Compare the rescaled values with the originals
print(data1.head())
print(num.head())

   yrs.since.phd  yrs.service    salary
0       0.327273     0.300000  0.471668
1       0.345455     0.266667  0.664192
2       0.054545     0.050000  0.126335
3       0.800000     0.650000  0.329218
4       0.709091     0.683333  0.481740
   yrs.since.phd  yrs.service  salary
0             19           18  139750
1             20           16  173200
2              4            3   79750
3             45           39  115000
4             40           41  141500
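To see what the scaler does, it can help to apply the min-max formula by hand: each value becomes (x - min) / (max - min), so the column minimum maps to 0 and the maximum to 1. The sketch below is a hypothetical check on the five salary values printed above only. Because its minimum and maximum come from those five rows rather than all 397, the numbers will differ from the scaled output above.

```python
import pandas as pd

# Hypothetical sample: the first five salary values shown above
salary = pd.Series([139750, 173200, 79750, 115000, 141500], name="salary")

# Min-max rescaling: (x - min) / (max - min), so values fall in [0, 1]
scaled = (salary - salary.min()) / (salary.max() - salary.min())
print(scaled.round(3).tolist())
```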

Wrap-up¶ {#Wrap-up}

20 min

Python supports a large and diverse community across academia and industry.¶ {#Python-supports-a-large-and-diverse-community-across-academia-and-industry.}

NumPy is the core array library for numerical computing in Python.

The Python 3 documentation covers the core language and the standard library.
PyCon is the largest annual conference for the Python community.
SciPy is a rich collection of scientific utilities. It is also the name of a series of annual conferences.
Jupyter is the home of Project Jupyter.
Pandas is the home of the Pandas data library.
Stack Overflow’s general Python section can be helpful, as well as the sections on NumPy, SciPy, and Pandas.

KEY POINTS¶ {#KEY-POINTS}

Python supports a large and diverse community across academia and industry.

Feedback¶ {#Feedback}

THANK YOU {#THANK-YOU!}