# Intention

After four weeks of studying machine learning, I realized that I had only worked with datasets that were already processed, so there wasn't much preparation left for me to do. In reality, most datasets are unprocessed and need work before they can be used. So I've decided to learn data analytics, to be able to take raw data and process it into a form that machine learning models can use.

# Basic Python code I'll be using this week

Data Structures (variables, lists, dictionaries, sets)

``````
# Variables
x = 1
y = 2
z = x + y
print(z)  # prints 3

# Lists
list1 = [1, 2, 3, 4, 5, 6, 7, 8]
list1.append(9)   # adds 9 to the end of the list
print(list1[0])   # prints 1 (indexing starts at 0)

# Dictionaries
dict1 = {'one': 'single', 'two': 'double'}
dict1['three'] = 'triple'  # adds a new key-value pair

# Sets (duplicates are dropped)
set1 = set([1, 2, 3, 4, 5, 4, 3, 2, 1])
print(set1)         # {1, 2, 3, 4, 5}
set2 = set([4, 5, 6, 7, 8, 9])
print(set1 & set2)  # intersection: {4, 5}
print(set1 | set2)  # union: {1, 2, 3, 4, 5, 6, 7, 8, 9}
``````

Conditionals and loops (if, elif, else, for)

``````
# if statements
age = 20
if age >= 20:
    print('adult')
elif age < 10:
    print('very young')
else:
    print('teenager')

# for loops
fruits = ['apple', 'peach', 'banana']
for fruit in fruits:
    print(fruit)  # prints each fruit on its own line
``````

Functions

``````
# Named add so it doesn't shadow Python's built-in sum()
def add(a, b):
    return a + b

add(3, 4)  # returns 7

def print_name(name):
    print('hello ' + name)

print_name('eric')  # prints hello eric
``````

# Pandas

Why Pandas?

Pandas is a data analysis library for Python. It is fast and concise. You could do the same work in Excel, but it would take much longer, and a mistake made along the way isn't easy to find and fix right away.

# Using Pandas

``````
import pandas as pd
``````

``````
chicken07 = pd.read_csv('./data/chicken_07.csv')

chicken07.tail(5)  # shows the last five rows of chicken07

chicken07.describe()  # summary statistics for each numeric column
``````

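| statistic | first numeric column | second numeric column |
|---|---|---|
| count | 2.637900e+04 | 26379.000000 |
| mean | 2.019072e+07 | 12.346109 |
| std | 8.869258e+00 | 14.961707 |
| min | 2.019070e+07 | 5.000000 |
| 25% | 2.019071e+07 | 5.000000 |
| 50% | 2.019072e+07 | 5.000000 |
| 75% | 2.019072e+07 | 14.000000 |
| max | 2.019073e+07 | 279.000000 |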
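
To try `tail()` and `describe()` without the original CSV, here is a minimal sketch on a tiny made-up table. The column names `date` and `calls` and all the values are assumptions for illustration, not the real chicken data.

```python
import pandas as pd

# Made-up stand-in for the CSV; column names and values are assumptions.
df = pd.DataFrame({
    "date": [20190701, 20190701, 20190702, 20190703],
    "calls": [5, 14, 5, 20],
})

last_two = df.tail(2)   # the last two rows, with their original index labels
stats = df.describe()   # count, mean, std, min, quartiles and max per numeric column

print(last_two)
print(stats.loc[["count", "mean", "min", "max"], "calls"])
```

In this sketch `date` is stored as an integer, so `describe()` summarizes it like any other number. That may be why the first column of the real output shows values around 2.019e+07.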

Extracting specific data

``````
chicken07['age']  # selects one column; 'age' is the column's name

set(chicken07['age'])  # the distinct values, with repetitions removed

set(chicken07['age']), len(set(chicken07['age']))  # the distinct values and how many there are
``````
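
pandas also has built-in methods that do the same job as `set(...)` and `len(set(...))`. A minimal sketch on made-up data (the `age` values below are invented for illustration):

```python
import pandas as pd

# Invented sample; the real 'age' column may hold different values.
df = pd.DataFrame({"age": ["20s", "30s", "20s", "40s", "30s", "20s"]})

unique_ages = df["age"].unique()    # distinct values, in order of first appearance
n_ages = df["age"].nunique()        # how many distinct values there are
counts = df["age"].value_counts()   # how often each value occurs

print(list(unique_ages))
print(n_ages)
print(counts)
```

`value_counts()` is a handy extra here, since it shows not just which values exist but how many rows each one has.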

Combining data

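``````
chicken08 = pd.read_csv('./data/chicken_08.csv')
chicken09 = pd.read_csv('./data/chicken_09.csv')  # assumed path, following the pattern of the other two files

chicken_data = pd.concat([chicken07, chicken08, chicken09])  # stacks the rows of all three
``````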

The index of the combined data looks odd: it doesn't run from 0 up to around 60,000, the total number of rows. `pd.concat` keeps each DataFrame's original index, so the numbering doesn't continue from where the previous data ended. chicken07's index goes from 0 to about 20,000, and then chicken08's index starts from 0 again.

Fixing index

``````
# drop=True discards the old index instead of keeping it as a new column
chicken_data = chicken_data.reset_index(drop=True)
``````
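
The restarting index and the fix can be reproduced on two tiny made-up tables. This is only a sketch: the names `july` and `august` and the `calls` values are invented.

```python
import pandas as pd

# Two small stand-ins for the monthly files (invented values).
july = pd.DataFrame({"calls": [5, 14, 5]})
august = pd.DataFrame({"calls": [7, 9]})

combined = pd.concat([july, august])
print(list(combined.index))   # [0, 1, 2, 0, 1] -> the index restarts

fixed = combined.reset_index(drop=True)
print(list(fixed.index))      # [0, 1, 2, 3, 4]

# pd.concat can also renumber in one step with ignore_index=True
same = pd.concat([july, august], ignore_index=True)
print(list(same.index))       # [0, 1, 2, 3, 4]
```

Either approach gives a clean 0 to n-1 index; `ignore_index=True` just saves the separate `reset_index` call.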

# Visualization, Matplotlib

Matplotlib is a Python library for visualizing data quickly, with plots that we can customize.

Importing Matplotlib

``````
import pandas as pd
import matplotlib.pyplot as plt
``````

We're using matplotlib's `pyplot` module, imported under the conventional short name `plt`.

# Drawing a graph using matplotlib

Grouping data

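``````
# Placeholders: put real column names in place of the two strings
totals = chicken_data.groupby("what we'll group by")["what we'll collect"].sum()

plt.figure(figsize=(8, 5))
plt.bar(totals.index, totals)  # x axis: the groups, y axis: their sums
plt.title('title')
plt.show()
``````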
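
Here is a runnable sketch of the same grouping and bar chart on made-up data. The column names `weekday` and `calls` and the values are assumptions, and the `Agg` backend line is only there so the code also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; leave this line out in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample data; the real columns may be named differently.
df = pd.DataFrame({
    "weekday": ["mon", "tue", "mon", "wed", "tue", "mon"],
    "calls": [5, 14, 7, 30, 6, 10],
})

# One total per weekday; the group labels become the index.
totals = df.groupby("weekday")["calls"].sum()
print(totals.to_dict())  # {'mon': 22, 'tue': 20, 'wed': 30}

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(totals.index, totals.values)  # x: group labels, y: summed calls
ax.set_title("Total calls per weekday")
# plt.show() would display the chart in an interactive session
```

The result of `groupby(...).sum()` is a Series, which is why its `index` can be used directly for the x axis.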

Changing fonts and font families

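``````
# Reading plt.rcParams[...] shows the current setting; assigning to it changes the setting
plt.rcParams['font.size'] = 12              # example value
plt.rcParams['font.family'] = 'sans-serif'  # example value; any font family installed on your system works
``````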

Chaining

``````
# The same groupby as before, written one step at a time
grouped = chicken_data.groupby("what we'll group by")
call_data = grouped["what we'll collect"]
totals = call_data.sum()
sorted_totals = totals.sort_values(ascending=True)  # smallest total first
``````

``````
plt.figure(figsize=(8, 5))
plt.bar(sorted_totals.index, sorted_totals)
plt.title('title')
plt.show()
``````
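
The step-by-step version and a single chained expression give the same result. A small sketch on made-up data (the column names `weekday` and `calls` are assumptions):

```python
import pandas as pd

# Invented sample data.
df = pd.DataFrame({
    "weekday": ["mon", "tue", "mon", "wed", "tue", "mon"],
    "calls": [5, 14, 7, 30, 6, 10],
})

# Step by step, one intermediate variable per operation
grouped = df.groupby("weekday")
call_data = grouped["calls"]
totals = call_data.sum()
sorted_steps = totals.sort_values(ascending=True)

# Chained: each method is called on the result of the previous one
sorted_chained = df.groupby("weekday")["calls"].sum().sort_values(ascending=True)

print(sorted_chained.to_dict())  # {'tue': 20, 'mon': 22, 'wed': 30}
```

The chained form is shorter, while the step-by-step form makes it easier to inspect each intermediate result when something looks wrong.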

Reindexing (Important!)

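``````
weeks = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
# groupby sorts the groups alphabetically; reindex puts them into the order given in weeks.
# The labels in weeks must match the group values exactly, otherwise those entries become NaN.
weekday_totals = chicken_data.groupby("what we'll group by")["what we'll collect"].sum().reindex(weeks)
``````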
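
To see why reindexing matters, here is a minimal sketch on made-up data. The column names `weekday` and `calls` and the values are invented; the `fill_value=0` variant is an optional extra, not something the original data necessarily needs.

```python
import pandas as pd

# Invented sample data with no rows for thu, fri or sat.
df = pd.DataFrame({
    "weekday": ["wed", "mon", "tue", "mon", "sun"],
    "calls": [30, 5, 14, 7, 9],
})
weeks = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

totals = df.groupby("weekday")["calls"].sum()
print(list(totals.index))    # ['mon', 'sun', 'tue', 'wed'] -> alphabetical order

ordered = totals.reindex(weeks)
print(list(ordered.index))   # calendar order, Monday to Sunday
print(ordered.isna().sum())  # 3 -> days without data become NaN

# Optional: put 0 instead of NaN for days that have no rows
filled = totals.reindex(weeks, fill_value=0)
print(filled.tolist())       # [12, 14, 30, 0, 0, 0, 9]
```

Without `reindex`, a bar chart of `totals` would show the days in alphabetical order, which is hard to read for weekdays.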

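``````
weekday_totals = chicken_data.groupby("what we'll group by")["what we'll collect"].sum().reindex(weeks)
plt.bar(weekday_totals.index, weekday_totals)  # bars follow the order in weeks
plt.xlabel('xlabel')
plt.show()
``````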