Intention

After four weeks of studying machine learning, I realized I had only worked with datasets that were already processed, so I never had to do much preparation myself. In practice, much more of the data I'll meet will be unprocessed, and I'll have to work on it first. That's why I've decided to learn data analytics: to be able to take raw data and process it into a form that machine learning models can use.

Basic Python code I'll be using this week

Data Structures (variables, lists, dictionaries, sets)

# Variables
x = 1
y = 2
z = x + y
print(z) # prints 3

# Lists
list1 = [1, 2, 3, 4, 5, 6, 7, 8]
list1.append(9)
print(list1[0]) # prints 1

# Dictionaries
dict1 = {'one': 'single', 'two': 'double'}
dict1['three'] = 'triple' # adds a new key-value pair

# Sets
set1 = set([1, 2, 3, 4, 5, 4, 3, 2, 1])
set1 # {1, 2, 3, 4, 5}
set2 = set([4, 5, 6, 7, 8, 9])
set1 & set2 # intersection: {4, 5}
set1 | set2 # union: {1, 2, 3, 4, 5, 6, 7, 8, 9}

Conditionals and loops (if, elif, else, for)

# if statements
age = 20
if age >= 20:
    print('adult')
elif age < 10:
    print('very young')
else:
    print('teenager')

# for loops
fruits = ['apple', 'peach', 'banana']
for fruit in fruits:
    print(fruit)

Functions

# named add rather than sum, so it doesn't shadow Python's built-in sum()
def add(a, b):
    return a + b
add(3, 4) # 7

def print_name(name):
    print('hello ' + name)
print_name('eric') # prints hello eric

Pandas

Why Pandas?

Pandas is a data analysis library for Python. It is fast, and its code is concise. You could do the same work in Excel, but it would take much longer, and a mistake is hard to fix right away. With pandas, every step is code that you can correct and rerun.

Using Pandas

import pandas as pd

Loading data

chicken07 = pd.read_csv('./data/chicken_07.csv')

chicken07.tail(5) # shows the last five rows of chicken07

chicken07.describe()

count  2.637900e+04  26379.000000
mean   2.019072e+07     12.346109
std    8.869258e+00     14.961707
min    2.019070e+07      5.000000
25%    2.019071e+07      5.000000
50%    2.019072e+07      5.000000
75%    2.019072e+07     14.000000
max    2.019073e+07    279.000000

Extracting specific data

chicken07['age'] # age is a column's name

set(chicken07['age']) # the unique values (duplicates removed)

set(chicken07['age']), len(set(chicken07['age'])) # the unique values and how many there are
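Pandas also has built-in methods for the same job. Here is a minimal sketch, assuming a small made-up DataFrame in place of chicken07 (the values are invented for illustration and are not from the real file): unique() returns the distinct values and nunique() counts them.

```python
import pandas as pd

# Made-up stand-in for chicken07; these values are for illustration only
toy = pd.DataFrame({'age': ['20s', '30s', '20s', '40s', '30s']})

unique_ages = toy['age'].unique()  # distinct values, in order of first appearance
n_ages = toy['age'].nunique()      # number of distinct values

print(list(unique_ages)) # ['20s', '30s', '40s']
print(n_ages)            # 3
```

Unlike set(), unique() keeps the values in the order they first appear in the column.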

Combining data

chicken08 = pd.read_csv('./data/chicken_08.csv')
chicken09 = pd.read_csv('./data/chicken_09.csv')

chicken_data = pd.concat([chicken07, chicken08, chicken09])

The combined data's index looks odd: instead of running from 0 up to around 60,000 (the total number of rows across the three files), it restarts for each file. Combining the data doesn't automatically continue the numbering from the previous file, because each one keeps its own index. So chicken07's rows run from 0 to about 20,000, and then chicken08's index starts from 0 again.

Fixing index

chicken_data = chicken_data.reset_index(drop=True) # drop=True discards the old index instead of keeping it as a column
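To see the restarting index on a small scale, here is a minimal sketch using two tiny made-up DataFrames (july, august, and the calls column are hypothetical stand-ins, not the real files). It also shows ignore_index=True, which renumbers the rows during the concat itself.

```python
import pandas as pd

# Two tiny made-up DataFrames standing in for the monthly files
july = pd.DataFrame({'calls': [5, 7]})
august = pd.DataFrame({'calls': [9, 11]})

combined = pd.concat([july, august])
print(list(combined.index)) # [0, 1, 0, 1]  (the index restarts)

fixed = combined.reset_index(drop=True)
print(list(fixed.index)) # [0, 1, 2, 3]

# ignore_index=True renumbers the rows as part of the concat
same = pd.concat([july, august], ignore_index=True)
print(list(same.index)) # [0, 1, 2, 3]
```

Either approach gives the same result. reset_index is handy when the data has already been combined.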

Visualization, Matplotlib

Matplotlib is a Python library for visualizing data quickly, in a form we can customize.

Importing Matplotlib

import pandas as pd
import matplotlib.pyplot as plt

We're using the pyplot module from matplotlib, imported under the short alias plt.

Drawing a graph using matplotlib

Grouping data

# double quotes are needed here because the placeholder text contains an apostrophe
total = chicken_data.groupby("what we'll group by")["what we'll collect"].sum()

plt.figure(figsize=(8,5))
plt.bar(total.index, total) # x axis, y axis
plt.title('title')
plt.show()
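As a concrete illustration of what the grouped result looks like, here is a minimal sketch on made-up rows (the day and calls column names are hypothetical). The group labels become the index, which is what goes on the x axis, and the sums become the values, which are the bar heights.

```python
import pandas as pd

# Made-up rows; 'day' and 'calls' are hypothetical column names
toy = pd.DataFrame({
    'day':   ['mon', 'tue', 'mon', 'tue', 'wed'],
    'calls': [5, 7, 10, 3, 8],
})

total = toy.groupby('day')['calls'].sum()

print(total.index.tolist()) # ['mon', 'tue', 'wed']  (goes on the x axis)
print(total.tolist())       # [15, 10, 8]            (the bar heights)
```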

Changing fonts and font families

plt.rcParams['font.size'] = 12 # example value
plt.rcParams['font.family'] = 'sans-serif' # example value

Chaining

grouped = chicken_data.groupby("what we'll group by")
call_data = grouped["what we'll collect"]
total = call_data.sum()
sorted_sum = total.sort_values(ascending=True)

plt.figure(figsize=(8,5))
plt.bar(sorted_sum.index, sorted_sum)
plt.title('title')
plt.show()
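The separate steps can also be written as one chained expression. Here is a minimal sketch on made-up rows (the day and calls columns are hypothetical) showing that both forms give the same result.

```python
import pandas as pd

# Made-up rows; 'day' and 'calls' are hypothetical column names
toy = pd.DataFrame({
    'day':   ['mon', 'tue', 'mon', 'tue', 'wed'],
    'calls': [5, 7, 10, 3, 8],
})

# Step by step
grouped = toy.groupby('day')
call_data = grouped['calls']
total = call_data.sum()
sorted_sum = total.sort_values(ascending=True)

# The same steps chained into one expression
chained = toy.groupby('day')['calls'].sum().sort_values(ascending=True)

print(chained.index.tolist()) # ['wed', 'tue', 'mon']
print(chained.tolist())       # [8, 10, 15]
```

The step-by-step form is easier to inspect while learning. The chained form is shorter once each step is familiar.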

Reindexing (Important!)

weeks = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
total = chicken_data.groupby("what we'll group by")["what we'll collect"].sum().reindex(weeks)
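Reindexing matters because groupby sorts the group labels, so weekday names come out in alphabetical order rather than calendar order. Here is a minimal sketch on made-up rows (the day and calls columns are hypothetical). Labels in the new index that have no data become NaN unless a fill_value is given.

```python
import pandas as pd

# Made-up rows; 'day' and 'calls' are hypothetical column names
toy = pd.DataFrame({
    'day':   ['wed', 'mon', 'tue', 'mon', 'sun'],
    'calls': [8, 5, 7, 10, 4],
})

total = toy.groupby('day')['calls'].sum()
print(total.index.tolist()) # ['mon', 'sun', 'tue', 'wed']  (alphabetical)

weeks = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
ordered = total.reindex(weeks)
print(ordered.index.tolist()) # ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']

# Days with no rows in the data become NaN; fill_value=0 turns them into zeros
ordered_filled = total.reindex(weeks, fill_value=0)
print(ordered_filled.tolist()) # [15, 7, 8, 0, 0, 0, 4]
```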

Additional code

plt.xlabel('xlabel')
plt.xticks(rotation=45) # rotates the x-axis tick labels by 45 degrees

Double bar graphs

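# assumes chicken_data and pizza_data are already grouped Series, e.g. the result of groupby(...).sum()
plt.bar(chicken_data.index, chicken_data)
plt.bar(pizza_data.index, pizza_data) # drawn on the same axes as the first call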
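When two plt.bar calls share the same x positions, the second set of bars is drawn over the first. One common way to place them side by side is to shift each set by half a bar width. This is a minimal sketch, assuming two made-up Series of grouped totals (chicken_sum and pizza_sum are hypothetical names and values).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up grouped totals; both Series are hypothetical
days = ['mon', 'tue', 'wed']
chicken_sum = pd.Series([15, 10, 8], index=days)
pizza_sum = pd.Series([12, 9, 14], index=days)

x = np.arange(len(days))  # one numeric position per day
width = 0.4               # width of each bar

plt.figure(figsize=(8, 5))
# shift each set of bars half a width left or right of the tick
chicken_bars = plt.bar(x - width / 2, chicken_sum, width, label='chicken')
pizza_bars = plt.bar(x + width / 2, pizza_sum, width, label='pizza')
plt.xticks(x, days)  # put the day names back on the x axis
plt.legend()
```

Calling plt.show() afterwards displays the figure.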

Data Analytics - Week 1