Machine Learning - Week 3
What is Deep Learning?
When artificial intelligence was first developed, with its basic linear regressions, scientists realized that some problems could not be solved with simple linear equations alone. Then logistic regression was developed, and with its sigmoid function scientists seemed satisfied with what they had achieved. Years later, however, they developed a whole new form of machine learning: deep learning. It repeats linear regression many times, with logistic (nonlinear) functions layered between the linear steps. It is called deep learning because it stacks these linear and logistic layers deeply. It is one of the most actively studied fields in artificial intelligence and is considered one of the strongest approaches in machine learning.
Other terms for Deep Learning
Deep Neural Networks
Multilayer Perceptron (MLP)
The XOR problem
The very problem that created deep learning
Early machine learning was demonstrated on two logic gates: the OR and AND gates. Such problems were proved solvable with the equation y = w_0 + w_1x_1 + w_2x_2, called the perceptron.
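As an illustration (the weight values below are one standard choice, not from the notes), a single perceptron with a threshold can implement AND and OR:
def perceptron(x1, x2, w0, w1, w2):
    # Fires (outputs 1) when w0 + w1*x1 + w2*x2 crosses zero
    return int(w0 + w1 * x1 + w2 * x2 > 0)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2,
          'AND:', perceptron(x1, x2, w0=-1.5, w1=1, w2=1),
          'OR:', perceptron(x1, x2, w0=-0.5, w1=1, w2=1))
No single choice of weights makes the same equation output XOR, which is exactly the problem described next.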
This was such a revolutionary development that in 1958 the New York Times announced that people would be able to create a thinking machine using the perceptron.
The problem was the XOR problem. A single perceptron is a linear model, and the XOR function cannot be represented with it because its outputs are not linearly separable.
This is where the MLP came from: scientists stacked perceptrons in multiple layers to increase the expressive power of the model.
Backpropagation
When the book Perceptrons by Marvin Minsky (with Seymour Papert) argued that training MLPs was infeasible, AI scientists backed off a bit and were hesitant to research further. This changed in 1974, thanks to Paul Werbos.
His idea was that one could use w (weight) and b (bias) to produce a desired output, and if the actual output of the MLP differed from it, one should adjust the w and b values. The best way was to measure the error and reduce it step by step using a suitable loss function.
AI scientists were rather indifferent to this algorithm until 1986, when Dr. Hinton (with Rumelhart and Williams) independently announced the same method.
With it, the XOR problem became feasible to solve, and that was the foundation of the backpropagation algorithm.
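A minimal sketch of this update idea on a single sigmoid neuron (a toy illustration of mine, not code from the notes): compute the error, then nudge w and b against the gradient.
import numpy as np

x, y_true = 1.0, 1.0              # one training example
w, b, learning_rate = 0.1, 0.0, 0.5
for _ in range(100):
    y_pred = 1 / (1 + np.exp(-(w * x + b)))  # sigmoid neuron
    error = y_pred - y_true                  # gradient of squared error (up to a factor)
    grad = error * y_pred * (1 - y_pred)     # chain rule through the sigmoid
    w -= learning_rate * grad * x            # adjust the weight against the gradient
    b -= learning_rate * grad                # adjust the bias the same way
print(w, b, y_pred)                          # y_pred drifts toward y_true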
How to construct Deep Neural Networks
Deep neural networks are composed of three kinds of layers: input, output, and hidden.
The number of perceptrons per hidden layer should increase up to a certain point and then decrease toward the output layer. It is also important to place activation functions in the right spots; usually, one is placed right after each hidden layer.
Width and Depth of the Network
The depth refers to how many hidden layers a network has. To increase the depth, you simply add hidden layers. To increase the width, you add perceptrons to the hidden layers.
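As a sketch of both ideas (the layer sizes here are arbitrary choices of mine, not from the notes), here is a Keras model whose width grows and then shrinks, with an activation right after every hidden layer:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, activation='relu', input_shape=(2,)),  # hidden layer 1 (width 64)
    Dense(128, activation='relu'),                   # hidden layer 2 (width grows)
    Dense(64, activation='relu'),                    # hidden layer 3 (width shrinks)
    Dense(1, activation='sigmoid'),                  # output layer
])
# Adding more Dense(...) lines increases the depth;
# raising the unit counts increases the width.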
Batch size, Epochs
Batch and Iterations
Let's say we have 10,000,000 training samples. Loading them all into memory and having the machine learn from all of them at once would be practically impossible; you would probably have to buy special memory hardware for it, which would be expensive.
So in machine learning we divide them up to relieve the pressure on the memory and the machine. The unit you divide the dataset into is called a batch; for example, if we split the 10,000,000 samples into groups of 10,000, the batch size is 10,000. After dividing, we have 1,000 batches, and we repeat the learning step once per batch. These 1,000 repetitions are called iterations.
Epochs
Usually, in machine learning, we repeat the learning process over the same dataset many times, like practicing for a test. One pass through the whole dataset is one epoch. An epoch is independent of the batch size, so for the example above, running all 1,000 iterations once amounts to one epoch.
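In Keras these quantities map directly onto the arguments of model.fit; a small sketch restating the numbers above (the commented fit call assumes an already compiled model named model):
n_samples = 10_000_000
batch_size = 10_000
iterations_per_epoch = n_samples // batch_size
print(iterations_per_epoch)  # 1,000 iterations make up one epoch

# With a compiled Keras model, the same idea looks like:
# model.fit(x_train, y_train, batch_size=10_000, epochs=5)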
Activation Functions
Imagine a network as a set of neurons. Each neuron (perceptron) is connected to the others. For a neuron to send a message to another neuron, its signal has to exceed a certain amount of electrical energy; the line that marks this limit is called the threshold. That is exactly the notion behind activation functions: for a perceptron's signal to proceed to the next perceptron, it has to meet the requirements of the activation function. Activation functions have to be non-linear, like the sigmoid function we used last week. The sigmoid outputs near 0 when x is smaller than -6, deactivating the neuron, and near 1 when x is greater than 6, activating it. Many functions are used as activations; common ones include sigmoid, tanh, ReLU, and softmax.
Usually, people use the ReLU function: it learns faster (its gradient does not vanish for positive inputs), it is cheap to compute, and its graph is simple. In machine learning, people swap activation functions to maximize a model's performance; this process is part of model tuning.
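For reference, a quick NumPy sketch of two of these activations (my own illustration):
import numpy as np

def sigmoid(x):
    # Squashes any input into (0, 1)
    return 1 / (1 + np.exp(-x))

def relu(x):
    # Passes positive inputs through, zeroes out the rest
    return np.maximum(0, x)

print(sigmoid(-6), sigmoid(6))  # ~0.0025 and ~0.9975
print(relu(-3.0), relu(3.0))    # 0.0 and 3.0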
Overfitting, Underfitting
There is a circumstance where the training loss keeps decreasing but the validation loss starts increasing.
This is called overfitting. It happens when the model's complexity is too high compared to the difficulty of the problem, so the model memorizes the training data instead of generalizing. Reversed, when the model's complexity is too low compared to the difficulty of the problem, it is called underfitting. So we have to seek a model whose complexity matches the problem. Ways to address these issues are data augmentation and dropout.
Data augmentation
To solve overfitting, data augmentation is one of the best approaches. It's simple: add more varied data to the dataset. More variety decreases the risk of overfitting. However, data isn't lying around everywhere; it is very hard to find the specific data you want. So data augmentation transforms the current data to create new samples, for example by flipping, rotating, shifting, or zooming images.
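A sketch using Keras's ImageDataGenerator (the parameter values, and the name x_train_images for image data shaped (n, 28, 28, 1), are assumptions of mine):
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=10,       # randomly rotate up to 10 degrees
    width_shift_range=0.1,   # randomly shift horizontally by up to 10%
    height_shift_range=0.1,  # randomly shift vertically by up to 10%
    zoom_range=0.1,          # randomly zoom in or out by up to 10%
)
# Feeds endlessly varied copies of the images to the model:
# model.fit(datagen.flow(x_train_images, y_train, batch_size=32), epochs=20)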
Dropout
Dropout simply disconnects, at random, some of the connections between perceptrons during training. It reduces the effective complexity of the model, thus decreasing the risk of overfitting.
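In Keras, dropout is a layer you slot between Dense layers; a minimal sketch (the 0.5 rate and layer sizes are common defaults I chose):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(128, activation='relu', input_shape=(784,)),
    Dropout(0.5),  # randomly silences half of the units at each training step
    Dense(10, activation='softmax'),
])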
Ensemble
An ensemble is one of the easiest methods to try if you have enough computing power. You train multiple deep learning models and let them vote to determine the result. You can implement a majority vote, or you can add a final layer that decides the final result from the individual outputs.
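A minimal sketch of the averaging/voting idea, assuming model_a, model_b, and model_c are already-trained Keras classifiers (the names are placeholders of mine):
import numpy as np

# Average the predicted probabilities of several trained models,
# then pick the class with the highest mean probability.
preds = np.mean(
    [model_a.predict(x_test), model_b.predict(x_test), model_c.predict(x_test)],
    axis=0,
)
final_classes = np.argmax(preds, axis=1)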
Learning rate decay (Learning rate schedules)
You use learning rate decay when you want to settle precisely into a minimum: decreasing the learning rate over time keeps the optimizer from overshooting and lets it find exactly where the minimum is.
Using Keras, you use either the tf.keras.callbacks.LearningRateScheduler() or the tf.keras.callbacks.ReduceLROnPlateau() callback.
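A hedged usage sketch of ReduceLROnPlateau (the factor and patience values are arbitrary examples of mine):
import tensorflow as tf

# Halve the learning rate whenever validation loss stops improving
# for 3 epochs in a row.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=3,
)
# model.fit(x_train, y_train, validation_data=(x_test, y_test),
#           epochs=20, callbacks=[reduce_lr])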
Solving the XOR problem with Google Colab
XOR dataset setup
import numpy as np

x_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)  # XOR inputs
y_data = np.array([[0], [1], [1], [0]], dtype=np.float32)              # XOR outputs
XOR with logistic regression (to confirm it does not work)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

model = Sequential([
    Dense(1, activation='sigmoid')  # a single logistic-regression neuron
])
model.compile(loss='binary_crossentropy', optimizer=SGD(learning_rate=0.1))
model.fit(x_data, y_data, epochs=1000, verbose=0)
y_pred = model.predict(x_data)
print(y_pred)
XOR deep learning
model = Sequential([
Dense(8, activation='relu'), # hidden layer
Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer=SGD(learning_rate=0.1))
model.fit(x_data, y_data, epochs=1000, verbose=0)
y_pred = model.predict(x_data)
print(y_pred) # is closer to [0], [1], [1], [0]
Keras Functional API
import numpy as np
from tensorflow.keras.models import Sequential, Model # instead of Sequential we use Model
from tensorflow.keras.layers import Dense, Input # bring in the Input layer in addition to Dense
from tensorflow.keras.optimizers import Adam, SGD
input = Input(shape=(2,)) # define input
hidden = Dense(8, activation='relu')(input)
output = Dense(1, activation='sigmoid')(hidden)
model = Model(inputs=input, outputs=output)
model.compile(loss='binary_crossentropy', optimizer=SGD(learning_rate=0.1))
model.summary()
model.fit(x_data, y_data, epochs=1000, verbose=0)
y_pred = model.predict(x_data)
print(y_pred)
Deep Learning Training
We're using Kaggle for datasets, so we need to log in.
import os
os.environ['KAGGLE_USERNAME'] = 'username' # your Kaggle username
os.environ['KAGGLE_KEY'] = 'key' # your Kaggle API key
This time, the data are images: we're letting the computer learn which sign language gesture corresponds to which letter.
!kaggle datasets download -d datamunge/sign-language-mnist
!unzip sign-language-mnist.zip
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam, SGD
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
Loading dataset
Defining train set
train_df = pd.read_csv('sign_mnist_train.csv')
train_df.head()
Defining test set
test_df = pd.read_csv('sign_mnist_test.csv')
test_df.head()
The images are grayscale; the computer figures out the general shape by finding which pixels are filled and which are not.
Distribution of label
plt.figure(figsize=(16, 10))
sns.countplot(x=train_df['label'])
plt.show()
Preprocessing - dividing input and output
train_df = train_df.astype(np.float32)
x_train = train_df.drop(columns=['label'], axis=1).values
y_train = train_df[['label']].values
test_df = test_df.astype(np.float32)
x_test = test_df.drop(columns=['label'], axis=1).values
y_test = test_df[['label']].values
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
Previewing Data
index = 1
plt.title(str(y_train[index]))
plt.imshow(x_train[index].reshape((28, 28)), cmap='gray')
plt.show()
One-hot Encoding
encoder = OneHotEncoder()
y_train = encoder.fit_transform(y_train).toarray()  # fit on the training labels
y_test = encoder.transform(y_test).toarray()        # reuse the same fit on the test labels
print(y_train.shape)
Normalization
The pixel values range from 0 to 255 (unsigned 8-bit integers), so we divide by 255 to scale them between 0 and 1.
x_train = x_train / 255.
x_test = x_test / 255.
Network
input = Input(shape=(784,))
hidden = Dense(1024, activation='relu')(input)
hidden = Dense(512, activation='relu')(hidden)
hidden = Dense(256, activation='relu')(hidden)
output = Dense(24, activation='softmax')(hidden)
model = Model(inputs=input, outputs=output)
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['acc'])
model.summary()
Learning
history = model.fit(
x_train,
y_train,
validation_data=(x_test, y_test),
epochs=20
)
The graph of the result
Loss
plt.figure(figsize=(16, 10))
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.legend()
plt.show()
Accuracy
plt.figure(figsize=(16, 10))
plt.plot(history.history['acc'], label='train')
plt.plot(history.history['val_acc'], label='validation')
plt.legend()
plt.show()
All in one
import os
os.environ['KAGGLE_USERNAME'] = 'username' # username
os.environ['KAGGLE_KEY'] = 'key' # key
!kaggle datasets download -d oddrationale/mnist-in-csv
!unzip mnist-in-csv.zip
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam, SGD
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
train_df = pd.read_csv('mnist_train.csv')
test_df = pd.read_csv('mnist_test.csv')
plt.figure(figsize=(16, 10))
sns.countplot(x=train_df['label'])
plt.show()
train_df = train_df.astype(np.float32)
x_train = train_df.drop(columns=['label'], axis=1).values
y_train = train_df[['label']].values
test_df = test_df.astype(np.float32)
x_test = test_df.drop(columns=['label'], axis=1).values
y_test = test_df[['label']].values
index = 1
plt.title(str(y_train[index]))
plt.imshow(x_train[index].reshape((28, 28)), cmap='gray')
plt.show()
encoder = OneHotEncoder()
y_train = encoder.fit_transform(y_train).toarray()  # fit on the training labels
y_test = encoder.transform(y_test).toarray()        # reuse the same fit on the test labels

x_train = x_train / 255.  # scale pixel values to [0, 1]
x_test = x_test / 255.

input = Input(shape=(784,))
hidden = Dense(1024, activation='relu')(input)
hidden = Dense(512, activation='relu')(hidden)
hidden = Dense(256, activation='relu')(hidden)
output = Dense(10, activation='softmax')(hidden)  # 10 classes for digit MNIST

model = Model(inputs=input, outputs=output)
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['acc'])
model.summary()
history = model.fit(
x_train,
y_train,
validation_data=(x_test, y_test),
epochs=20
)