Data Analytics - Week 1
Intention
While studying machine learning for four weeks, I realized that I had only worked with datasets that were already processed, so I never had to do much preparation myself. In reality, far more datasets are unprocessed, and I will have to clean them on my own. So I decided to learn data analytics, so that I can take raw data and process it into a form machine learning models can use.
Basic Python code I'll be using this week
Data Structures (variables, lists, dictionaries, sets)
# Variables
x = 1
y = 2
z = x + y
print(z) # prints 3
# Lists
list1 = [1, 2, 3, 4, 5, 6, 7, 8]
list1.append(9)
print(list1[0]) # prints 1
# Dictionaries
dict1 = {'one': 'single', 'two': 'double'}
dict1['three'] = 'triple' # adds a new key-value pair
# Sets
set1 = set([1, 2, 3, 4, 5, 4, 3, 2, 1])
set1 # {1, 2, 3, 4, 5} (duplicates are dropped)
set2 = set([4, 5, 6, 7, 8, 9])
set1 & set2 # intersection: {4, 5}
set1 | set2 # union: {1, 2, 3, 4, 5, 6, 7, 8, 9}
Conditionals and loops (if, elif, else, for)
# if statements
age = 20
if age >= 20:
    print('adult')
elif age < 10:
    print('very young')
else:
    print('teenager')
# for loops
fruits = ['apple', 'peach', 'banana']
for fruit in fruits:
    print(fruit)
Functions
def add(a, b): # named add so it doesn't shadow Python's built-in sum
    return a + b
add(3, 4) # 7
def print_name(name):
    print('hello ' + name)
print_name('eric') # prints hello eric
Pandas
Why Pandas?
Pandas is a data analysis library for Python. It is fast, and a few short lines of code can do what would take many manual steps in a spreadsheet. You could use Excel instead, but it takes much longer, and when you make a mistake it is hard to find and fix right away.
Using Pandas
import pandas as pd
Loading data
chicken07 = pd.read_csv('./data/chicken_07.csv')
chicken07.tail(5) # shows the last five rows of chicken07
chicken07.describe() # summary statistics for the numeric columns
count 2.637900e+04 26379.000000
mean 2.019072e+07 12.346109
std 8.869258e+00 14.961707
min 2.019070e+07 5.000000
25% 2.019071e+07 5.000000
50% 2.019072e+07 5.000000
75% 2.019072e+07 14.000000
max 2.019073e+07 279.000000
Extracting specific data
chicken07['age'] # selects the column named 'age'
set(chicken07['age']) # keeps only the unique values
set(chicken07['age']), len(set(chicken07['age'])) # the unique values and how many there are
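Pandas also has built-in methods for the same job. Here is a minimal sketch, assuming a small made-up table; the `age` values below are hypothetical and are not taken from the real chicken data.

```python
import pandas as pd

# Hypothetical stand-in for chicken07; the real file has many more rows.
sample = pd.DataFrame({"age": ["20s", "30s", "20s", "40s", "30s"]})

unique_ages = sample["age"].unique()    # distinct values, in order of first appearance
n_ages = sample["age"].nunique()        # number of distinct values
counts = sample["age"].value_counts()   # how many rows each value has

print(unique_ages)
print(n_ages)
print(counts)
```

Unlike `set(...)`, `unique()` keeps the values in the order they first appear, and `value_counts()` also tells you how often each one occurs.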
Combining data
chicken08 = pd.read_csv('./data/chicken_08.csv')
chicken09 = pd.read_csv('./data/chicken_09.csv')
chicken_data = pd.concat([chicken07, chicken08, chicken09])
The index of the combined data looks odd: it does not run from 0 up to around 60,000, the total number of rows across all the files. pd.concat keeps each file's original index instead of continuing from where the previous one ended. So chicken07's rows run from 0 to about 20,000, and then chicken08's index starts from 0 again.
Fixing index
chicken_data = chicken_data.reset_index(drop = True)
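To see the index problem and the fix on something small, here is a sketch using two tiny made-up tables. The names `july` and `august` and the `calls` column are hypothetical stand-ins for the monthly files.

```python
import pandas as pd

# Two tiny hypothetical tables standing in for the monthly CSV files.
july = pd.DataFrame({"calls": [5, 7, 9]})
august = pd.DataFrame({"calls": [6, 8]})

combined = pd.concat([july, august])
print(list(combined.index))   # [0, 1, 2, 0, 1] -> the index restarts at 0

fixed = combined.reset_index(drop=True)
print(list(fixed.index))      # [0, 1, 2, 3, 4] -> one continuous index

# Alternative: let concat renumber the rows while combining.
also_fixed = pd.concat([july, august], ignore_index=True)
print(list(also_fixed.index)) # [0, 1, 2, 3, 4]
```

`drop=True` matters: without it, `reset_index` keeps the old index as an extra column named `index`.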
Visualization, Matplotlib
Matplotlib is a Python library for visualizing data quickly, in a form we can customize.
Importing Matplotlib
import pandas as pd
import matplotlib.pyplot as plt
We're using matplotlib's pyplot module, imported under the short name plt.
Drawing a graph using matplotlib
Grouping data
# replace the two quoted placeholders with real column names
total = chicken_data.groupby('column to group by')['column to add up'].sum()
plt.figure(figsize=(8,5))
plt.bar(total.index, total) # x axis, y axis
plt.title('title')
plt.show()
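Here is a runnable version of the pattern above, as a sketch on made-up data. The table `orders` and its columns `day` and `calls` are hypothetical. The sketch selects the non-interactive Agg backend so it runs without a display; that is an assumption about the environment, and in a notebook you would drop that line and finish with `plt.show()`.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; 'day' and 'calls' are placeholder column names.
orders = pd.DataFrame({
    "day": ["mon", "tue", "mon", "wed", "tue", "mon"],
    "calls": [5, 7, 3, 4, 1, 2],
})

# One total per day: the group labels become the index of the result.
totals = orders.groupby("day")["calls"].sum()

fig = plt.figure(figsize=(8, 5))
bars = plt.bar(totals.index, totals)  # x axis: group labels, y axis: totals
plt.title("Calls per day")
plt.xlabel("day")
plt.ylabel("calls")
```

The result of `groupby(...).sum()` is a Series whose index holds the group labels, which is why `totals.index` can be used directly as the x axis.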
Changing fonts and font families
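plt.rcParams['font.size'] # the current default font size; assign a number to change it, e.g. plt.rcParams['font.size'] = 12
plt.rcParams['font.family'] # the current default font family; assign a font name to change it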
Chaining
# the same steps as before, written one at a time
grouped = chicken_data.groupby('column to group by')
call_data = grouped['column to add up']
total = call_data.sum()
sorted_sum = total.sort_values(ascending=True)
plt.figure(figsize=(8,5))
plt.bar(sorted_sum.index, sorted_sum)
plt.title('title')
plt.show()
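The same steps can also be written as one chained expression, where each method is called on the result of the previous one. Here is a minimal sketch, assuming a made-up table with hypothetical columns `day` and `calls`.

```python
import pandas as pd

# Hypothetical data; 'day' and 'calls' are placeholder column names.
orders = pd.DataFrame({
    "day": ["mon", "tue", "mon", "wed", "tue", "mon"],
    "calls": [5, 7, 3, 4, 1, 2],
})

# group -> pick the column -> add up -> sort, all in one expression
sorted_totals = (
    orders.groupby("day")["calls"]
    .sum()
    .sort_values(ascending=True)
)
print(sorted_totals)
```

Wrapping the chain in parentheses lets each step sit on its own line, so it stays as readable as the step-by-step version.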
Reindexing (Important!)
weeks = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
# reindex(weeks) puts the grouped totals in weekday order
total = chicken_data.groupby('column to group by')['column to add up'].sum().reindex(weeks)
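Reindexing matters because `groupby` sorts its labels alphabetically, so weekdays come out as fri, mon, sat and so on. Here is a small sketch on made-up data showing the order before and after; the table and its `day` and `calls` columns are hypothetical.

```python
import pandas as pd

weeks = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

# Hypothetical data; 'day' and 'calls' are placeholder column names.
orders = pd.DataFrame({
    "day": ["wed", "mon", "sun", "fri", "mon", "tue", "sat", "thu"],
    "calls": [4, 5, 9, 6, 3, 7, 8, 2],
})

alphabetical = orders.groupby("day")["calls"].sum()
print(list(alphabetical.index))   # ['fri', 'mon', 'sat', 'sun', 'thu', 'tue', 'wed']

in_week_order = alphabetical.reindex(weeks)
print(list(in_week_order.index))  # ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
```

One thing to watch: if a label in `weeks` does not appear in the data, `reindex` still creates a row for it and fills it with NaN.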
Additional code
plt.xlabel('xlabel') # label for the x axis
plt.xticks(rotation=45) # rotate the x axis tick labels by 45 degrees
Double bar graphs
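# chicken_data and pizza_data here stand for grouped totals (one value per label), not the raw tables
# both calls draw on the same axes, so bars that share a label are drawn on top of each other
plt.bar(chicken_data.index, chicken_data)
plt.bar(pizza_data.index, pizza_data)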
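To put the two sets of bars side by side, one common approach is to plot against numeric positions and shift each set by half a bar width. Here is a sketch under stated assumptions: `chicken_totals` and `pizza_totals` are made-up grouped totals, not real data, and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical grouped totals, one value per day.
days = ["mon", "tue", "wed"]
chicken_totals = pd.Series([10, 8, 4], index=days)
pizza_totals = pd.Series([6, 9, 5], index=days)

x = np.arange(len(days))  # numeric positions 0, 1, 2 for the labels
width = 0.4               # width of each bar

fig = plt.figure(figsize=(8, 5))
# Shift one set left and the other right so the bars sit next to each other.
chicken_bars = plt.bar(x - width / 2, chicken_totals, width=width, label="chicken")
pizza_bars = plt.bar(x + width / 2, pizza_totals, width=width, label="pizza")
plt.xticks(x, days)       # put the day names back on the x axis
plt.legend()
plt.title("Calls per day")
```

The `label` arguments together with `plt.legend()` show which color belongs to which dataset.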