Data Analytics - Week 3

·

3 min read

Review

four steps of data analysis

Call -> Examine -> Process -> Visualize

import pandas as pd
import matplotlib.pyplot as plt

enroll = pd.read_csv('./data/enrolleds_detail.csv')

enroll_detail = enroll.groupby('lecture_id')['user_id'].count()

plt.rcParams['font.family'] = 'Malgun Gothic'

enroll_detail.sort_values(ascending=False)

lectures = pd.read_csv('./data/lectures.csv')

lecture_count = pd.DataFrame(enroll_detail)

lecture_count = lecture_count.reset_index()

lecture_count = lecture_count.rename(columns={'user_id':'count'})

lectures = lectures.set_index('lecture_id')

full_lecture = lecture_count.join(lectures, on='lecture_id')

plt.figure(figsize=(22, 5))
plt.bar(full_lecture['title'], full_lecture['count'])
plt.xticks(rotation=90)
plt.show()

That is the shortcut to what I have learned in the past two weeks. It may not look like much, but I think being able to draw a graph using python is phenomenal.

WordCloud

image.png

Wordcloud is one of the most vivid visualizations of data using python. However, some words may appear very small and hard to perceive at times.

Downloading wordcloud

conda install -c conda-forge wordcloud

Importing all necessary modules

import numpy as np
from PIL import Image
from wordcloud import WordCloud
import matplotlib.pyplot as plt

Using functions to compile files

result = ""
for num in range(1, 15):
    index = '{:02}'.format(num) # 2 digit, replace with 0 if none
    filename = "Sequence_"+index+".txt"
    text = open('./data/'+filename, 'r', encoding='utf-8-sig')
    result += text.read().replace("\n", " ")

Regular expression to remove special characters

import re

pattern = '[^\w\s]' # It's just a promise to do this to remove special characters

text = re.sub(pattern=pattern, repl='', string=result)
# pattern is to remove special characters
# repl is to replace the removed characters
# string is the input

Font Families

I'm a Korean, so in order to use Korean in matplotlib I have to check which font families are available by using the code below:

import matplotlib.font_manager as fm
for f in fm.fontManager.ttflist:
    if 'Gothic' in f.name:
        print(f.fname)

This would print the paths to the fonts I want. Copy and paste one of them into font_path and run the code below:

font_path = 'C:\Windows\Fonts\malgunsl.ttf'

wc = WordCloud(font_path=font_path, background_color="white")
wc.generate(text)

plt.figure(figsize=(50,50))
plt.axis("off")
plt.imshow(wc)
plt.show()

However, this only creates a wordcloud in a shape of a square. in order to make it so it belongs in a shape you want, you need to:

# Generate a word cloud image
mask = np.array(Image.open('./data/sparta.png'))
wc = WordCloud(font_path=font_path, background_color="white", mask=mask)
wc.generate(text)

f = plt.figure(figsize=(50,50))
f.add_subplot(1,2, 1)
plt.imshow(mask, cmap=plt.cm.gray)
plt.title('Original Stencil', size=40)
plt.axis("off")
f.add_subplot(1,2, 2)
plt.imshow(wc, interpolation='bilinear')
plt.title('Sparta Cloud', size=40)
plt.axis("off")
plt.show()

Saving WordCloud to file

mask = np.array(Image.open('./data/sparta.png'))
wc = WordCloud(font_path=font_path, background_color="white", mask=mask)
wc.generate(text)

f = plt.figure(figsize=(50,50))
plt.imshow(wc, interpolation='bilinear')
plt.title('my word cloud', size = 40)
plt.axis("off")
plt.show()
f.savefig('./data/myWordCloud.png')

Properly formatting data

# formatting data
format = '%Y-%m-%dT%H:%M:%S.%f' # year, month, date, hour, minute, seconds
sparta_data['done_date_time']=pd.to_datetime(sparta_data['done_date'], format=format)

# Categorizing by day name
sparta_data['day']=sparta_data['done_date_time'].dt.day_name()

weekdata = sparta_data.groupby('day')['user_id'].count()

# Putting it in right order
weeks = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekdata = weekdata.agg(weeks)

# Extracting time from data
sparta_data['done_date_time_hour'] = sparta_data['done_date_time'].dt.hour

hour_data = sparta_data.groupby('done_date_time_hour')['user_id'].count()

hour_data = hour_data.sort_index()

# Use of np.arange

np.arange(0, 23) # makes a range between 0 and 23

# Creating pivot tables

pivot_table=pd.pivot_table(sparta_data, values='user_id', aggfunc='count', index=['day'], columns=['done_date_time_hour']).agg(weeks)

# Creating hitmap

plt.figure(figsize=(14, 5))
plt.pcolor(pivot_table)
plt.title('hitmap')
plt.xlabel('time')
plt.ylabel('day')
plt.xticks(np.arange(0.5, len(pivot_table.columns), 1), pivot_table.columns)
plt.yticks(np.arange(0.5, len(pivot_table.index), 1), pivot_table.index)
plt.colorbar()
plt.show()

image.png

Review of what I've learned this week

WordClouds

lyrics=open('./data/mysong.txt', 'r', encoding='utf-8-sig')

result = lyrics.read().replace('\n', ' ')

mask = np.array(Image.open('./data/cartoon-rock-musician.jpg'))
font_path='C:\Windows\Fonts\malgunsl.ttf'
wc=WordCloud(font_path=font_path, background_color="white", mask=mask)
wc.generate(result)

plt.figure(figsize=(50, 50))
plt.axis("off")
plt.imshow(wc)

Data Analyzation

monday = sparta_data[sparta_data['day']=='Monday']
tuesday = sparta_data[sparta_data['day']=='Tuesday']

monday = monday.groupby('done_date_time_hour').count()['user_id']
tuesday = tuesday.groupby('done_date_time_hour').count()['user_id']

plt.figure(figsize=(10, 5))
plt.plot(monday.index, monday)
plt.xticks(np.arange(0, 23))
plt.plot(tuesday.index, tuesday)
plt.xticks(np.arange(0, 23))
plt.show()

image.png

Thoughts

I really am being sluggish these days, especially with overloaded schedules. I haven't even written thoughts in my previous blogs for a while. I am learning, but due to sporadic check-ins, I'm forgetting some of the codes and losing the skills that I used to have. I think I should get back to my routine of being productive.