MNIST Handwritten Digit Recognition in Python

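The previous post derived the backpropagation (BP) algorithm for neural networks, and in Andrew Ng's course I implemented recognition of the MNIST handwritten digit data set in Matlab. This time I decided to do it again in Python with the scikit-learn library (in other words, by calling a package).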

Getting and Processing the Data

A quick Google search turns up the MNIST website, where you can download four files:

train-images-idx3-ubyte.gz: training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)

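After extracting them, I found that Windows had taken it upon itself to change the second - into a . (for example, train-labels-idx1-ubyte became train-labels.idx1-ubyte), so the file extension turned into idx1-ubyte. That is not a big problem. The real question is how to turn files in this format into the set of feature vectors and labels we want.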
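One way to sidestep the renaming issue is to skip manual extraction and read the .gz archives directly. The following is a minimal sketch using the standard-library gzip module; the helper name read_idx_bytes is hypothetical and not part of the original code.

```python
import gzip


def read_idx_bytes(path):
    """Return the raw bytes of an IDX file, whether or not it is gzipped.

    Hypothetical helper for illustration: it decides by file extension only.
    """
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rb') as f:
        return f.read()
```

The returned bytes can be handed to struct.unpack_from in the same way as the result of open(...).read().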

TRAINING SET LABEL FILE (train-labels-idx1-ubyte):

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label

The labels values are 0 to 9.

TRAINING SET IMAGE FILE (train-images-idx3-ubyte):

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel

Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
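To make the layout concrete, here is a small sketch that builds a tiny label file in memory following the label-file table and reads its header back with struct. The three-label payload is made up for illustration. In the format string, '>' selects big-endian (MSB first) byte order and 'i' a 32-bit integer.

```python
import struct

# Build a fake label file in memory: magic number 2049, 3 items, then 3 labels.
# The payload (5, 0, 4) is made-up illustration data, not read from MNIST.
fake_label_file = struct.pack('>ii', 2049, 3) + bytes([5, 0, 4])

# '>' = big-endian (MSB first), 'i' = 32-bit signed integer.
magic_number, num_items = struct.unpack_from('>ii', fake_label_file, 0)

# The labels start right after the 8-byte header, one unsigned byte each.
header_size = struct.calcsize('>ii')
labels = struct.unpack_from('>%dB' % num_items, fake_label_file, header_size)

print(magic_number, num_items, labels)  # prints: 2049 3 (5, 0, 4)
```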

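The website helpfully describes the file format: both the image file and the label file start with a header. Even so, I had no idea how to handle these ubyte files (PS: my Python really is weak). In the end I found a way online to parse the binary data with the struct module: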

import struct

import numpy as np


def decode_idx3_ubyte(idx3_ubyte_file):
    """
    Generic parser for idx3 files.
    :param idx3_ubyte_file: path to the idx3 file
    :return: the data set
    """
    # Read the binary data
    with open(idx3_ubyte_file, 'rb') as f:
        bin_data = f.read()

    # Parse the header: magic number, number of images, image height, image width
    offset = 0
    fmt_header = '>iiii'
    magic_number, num_images, num_rows, num_cols = struct.unpack_from(fmt_header, bin_data, offset)
    print('magic number: %d, images: %d, image size: %d*%d' % (magic_number, num_images, num_rows, num_cols))

    # Parse the images
    image_size = num_rows * num_cols
    offset += struct.calcsize(fmt_header)
    fmt_image = '>' + str(image_size) + 'B'
    images = np.empty((num_images, num_rows, num_cols))
    for i in range(num_images):
        if (i + 1) % 10000 == 0:
            print('parsed %d images' % (i + 1))
        images[i] = np.array(struct.unpack_from(fmt_image, bin_data, offset)).reshape((num_rows, num_cols))
        offset += struct.calcsize(fmt_image)
    return images

def decode_idx1_ubyte(idx1_ubyte_file):
    """
    Generic parser for idx1 files.
    :param idx1_ubyte_file: path to the idx1 file
    :return: the labels, one one-hot row of length 10 per item
    """
    # Read the binary data
    with open(idx1_ubyte_file, 'rb') as f:
        bin_data = f.read()

    # Parse the header: magic number and number of labels
    offset = 0
    fmt_header = '>ii'
    magic_number, num_images = struct.unpack_from(fmt_header, bin_data, offset)
    print('magic number: %d, labels: %d' % (magic_number, num_images))

    # Parse the labels
    offset += struct.calcsize(fmt_header)
    fmt_image = '>B'
    labels = np.zeros((num_images, 10), dtype='int8')
    for i in range(num_images):
        if (i + 1) % 10000 == 0:
            print('parsed %d labels' % (i + 1))
        digit = int(struct.unpack_from(fmt_image, bin_data, offset)[0])
        labels[i][digit] = 1  # one-hot: set the position of the digit to 1
        offset += struct.calcsize(fmt_image)
    return labels

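The code first reads the header from the binary stream to determine the number of images and their width and height, then calls struct.unpack_from(fmt, buffer, offset) to read the pixels of each image. The whole data set is stored as a three-dimensional ndarray. For the labels I modified the original code slightly, expanding each label value (0~9) into a 1x10 array (a one-hot vector), which is convenient for training the neural network later.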
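The per-image loop can also be replaced by a single NumPy call. The sketch below is an alternative, not the code used in this post: it assumes the whole file is already in memory as bytes, and the function names parse_idx3 and one_hot are hypothetical.

```python
import struct

import numpy as np


def parse_idx3(bin_data):
    """Parse IDX3 image bytes into an array of shape (n, rows, cols)."""
    fmt_header = '>iiii'
    _, num_images, num_rows, num_cols = struct.unpack_from(fmt_header, bin_data, 0)
    # Everything after the 16-byte header is one unsigned byte per pixel.
    pixels = np.frombuffer(bin_data, dtype=np.uint8,
                           offset=struct.calcsize(fmt_header))
    return pixels.reshape(num_images, num_rows, num_cols)


def one_hot(digits, num_classes=10):
    """Expand integer labels (0..num_classes-1) into one-hot rows."""
    return np.eye(num_classes, dtype='int8')[digits]


# Tiny made-up example: two 2x2 "images" and two labels.
fake = struct.pack('>iiii', 2051, 2, 2, 2) + bytes(range(8))
images = parse_idx3(fake)
print(images.shape)     # (2, 2, 2)
print(one_hot([3, 0]))  # two rows, with a 1 in column 3 and in column 0
```

Note that np.frombuffer returns a read-only uint8 view of the bytes, whereas the loop version fills a float array; call .astype(float) on the result if a writable float array is needed.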

Training a Neural Network with scikit-learn

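Digit recognition is really just a multi-class classification problem, which a neural network handles well. In Andrew Ng's course I hand-wrote part of the Matlab code to recognize the digits. Today, with Python, I get to be a package caller and finish the job the easy (lazy) way.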

import time

import numpy as np
from sklearn.neural_network import MLPClassifier


def mnistUsingNN(train_dataSet, train_labels, test_dataSet, test_labels):
    # Create a classifier with the chosen parameters
    clf = MLPClassifier(hidden_layer_sizes=(100, 50, 25),
                        activation='logistic', solver='adam',
                        learning_rate_init=0.0001, max_iter=2000)
    print('Start training the model')
    start = time.time()
    clf.fit(train_dataSet, train_labels)
    print('Training finished, time: ' + str(time.time() - start))

    res = clf.predict(test_dataSet)  # predict on the test set
    error_num = 0  # number of wrong predictions
    num = len(test_dataSet)  # size of the test set
    for i in range(num):  # go through the predictions
        # Comparing two length-10 arrays gives a boolean array:
        # False where they differ, True where they agree.
        # A correct prediction agrees in all 10 positions, a wrong one in fewer.
        if np.sum(res[i] == test_labels[i]) < 10:
            error_num += 1
    print("Total num:", num, " Wrong num:",
          error_num, "  CorrectRate:", (1 - error_num / float(num)) * 100, '%')
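The error-counting loop in mnistUsingNN compares each predicted one-hot row with the true row. As a sketch, the same figure can be computed without a Python loop. The arrays below are made-up stand-ins for res and test_labels (with 3 classes instead of 10), and exact_match_accuracy is a hypothetical helper name.

```python
import numpy as np


def exact_match_accuracy(pred, truth):
    """Fraction of rows in which every entry of pred equals truth."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    return float(np.mean(np.all(pred == truth, axis=1)))


# Made-up data for three samples whose true classes are 1, 0, and 2.
truth = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]])
pred = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 1]])  # second row is all zeros
print(exact_match_accuracy(pred, truth))  # 2 of the 3 rows match
```

As far as I understand, passing labels as a 10-column indicator matrix makes scikit-learn treat the task as multilabel, so a predicted row may contain no 1 at all or more than one. Such rows count as errors here, just as they do in the loop.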

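In mnistUsingNN I set three hidden layers with (100, 50, 25) neurons respectively, whereas the default is a single layer of 100 neurons. The maximum number of iterations is set to 2000, which is basically enough for training to converge, and learning_rate_init is set to 0.0001.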

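As for parameter choices such as the number of hidden layers, more layers generally gave better results in my runs: accuracy was about 92% with one layer, 93% with two, and 95% with three. Training time grows along with it: three layers took 213 s on my MacBook Pro and about 280 s on my Hasee laptop. The learning rate also has a fairly large effect on the final result: with learning_rate_init at 0.001 the accuracy was only 91%, and lowering it to 0.0001 brought it to 95%.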
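A comparison like this is easy to script. The sketch below is only an illustration under assumptions: it uses scikit-learn's small built-in digits data set (8x8 images, loaded with load_digits) instead of MNIST so that it runs in seconds, and max_iter=300 and random_state=0 are arbitrary choices. The numbers it prints will therefore not match the ones quoted in this post.

```python
import time

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small stand-in data set (not MNIST): 1797 images of 8x8 pixels.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

results = {}
for layers in [(100,), (100, 50), (100, 50, 25)]:
    clf = MLPClassifier(hidden_layer_sizes=layers, activation='logistic',
                        solver='adam', learning_rate_init=0.001,
                        max_iter=300, random_state=0)
    start = time.time()
    clf.fit(X_train, y_train)
    elapsed = time.time() - start
    # score() returns the mean accuracy on the held-out split.
    results[layers] = (clf.score(X_test, y_test), elapsed)
    print(layers, 'accuracy: %.3f, time: %.1fs' % results[layers])
```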

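One thing to note: the train_dataSet we read in is three-dimensional, so before passing it in we need to reshape it to 60000x784, so that each image becomes one row vector of 784 pixels.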


train_dataSet = train_images.reshape(60000, 784)
test_dataSet = test_images.reshape(10000, 784)
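The sizes 60000 and 10000 do not have to be hard-coded. A minimal sketch, using a small made-up array in place of the real images:

```python
import numpy as np

# Made-up stand-in for train_images: 5 images of 28x28 pixels.
images = np.zeros((5, 28, 28))

# Keep the first axis (one row per image) and let NumPy infer 28*28 = 784.
flat = images.reshape(len(images), -1)
print(flat.shape)  # (5, 784)
```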

Summary

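Calling a package is certainly pleasant and easy, but calling packages and tuning parameters alone only reaches fairly high accuracy in simple scenarios like this one. In more complex cases, when the accuracy falls short, you need to analyze the process in detail and find out where the problem lies (overfitting or underfitting, for example). That is when visualization tools such as learning curves help with the analysis. These skills matter more, and they are exactly what you cannot learn just by calling packages.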
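As one possible starting point, scikit-learn provides learning_curve in sklearn.model_selection. The sketch below is an illustration under assumptions rather than part of the original experiment: it again uses the small load_digits data set, a single hidden layer of 32 neurons, and 3-fold cross-validation, all chosen only to keep the run short.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)

# Train on 20%, 60%, and 100% of each training fold and score both sides.
sizes, train_scores, val_scores = learning_curve(
    clf, X, y, train_sizes=[0.2, 0.6, 1.0], cv=3)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print('%4d samples: train %.3f, validation %.3f' % (n, tr, va))
```

A large gap between the training and validation scores suggests overfitting, while two low scores that sit close together suggest underfitting.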

Complete Code

# encoding: utf-8

import numpy as np
import struct
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn import neighbors
import time

# Training set image file
train_images_idx3_ubyte_file = './train-images-idx3-ubyte'
# Training set label file
train_labels_idx1_ubyte_file = './train-labels-idx1-ubyte'

# Test set image file
test_images_idx3_ubyte_file = './t10k-images-idx3-ubyte'
# Test set label file
test_labels_idx1_ubyte_file = './t10k-labels-idx1-ubyte'

def decode_idx3_ubyte(idx3_ubyte_file):
    """
    Generic parser for idx3 files.
    :param idx3_ubyte_file: path to the idx3 file
    :return: the data set
    """
    # Read the binary data
    with open(idx3_ubyte_file, 'rb') as f:
        bin_data = f.read()

    # Parse the header: magic number, number of images, image height, image width
    offset = 0
    fmt_header = '>iiii'
    magic_number, num_images, num_rows, num_cols = struct.unpack_from(fmt_header, bin_data, offset)
    print('magic number: %d, images: %d, image size: %d*%d' % (magic_number, num_images, num_rows, num_cols))

    # Parse the images
    image_size = num_rows * num_cols
    offset += struct.calcsize(fmt_header)
    fmt_image = '>' + str(image_size) + 'B'
    images = np.empty((num_images, num_rows, num_cols))
    for i in range(num_images):
        if (i + 1) % 10000 == 0:
            print('parsed %d images' % (i + 1))
        images[i] = np.array(struct.unpack_from(fmt_image, bin_data, offset)).reshape((num_rows, num_cols))
        offset += struct.calcsize(fmt_image)
    return images

def decode_idx1_ubyte(idx1_ubyte_file):
    """
    Generic parser for idx1 files.
    :param idx1_ubyte_file: path to the idx1 file
    :return: the labels, one one-hot row of length 10 per item
    """
    # Read the binary data
    with open(idx1_ubyte_file, 'rb') as f:
        bin_data = f.read()

    # Parse the header: magic number and number of labels
    offset = 0
    fmt_header = '>ii'
    magic_number, num_images = struct.unpack_from(fmt_header, bin_data, offset)
    print('magic number: %d, labels: %d' % (magic_number, num_images))

    # Parse the labels
    offset += struct.calcsize(fmt_header)
    fmt_image = '>B'
    labels = np.zeros((num_images, 10), dtype='int8')
    for i in range(num_images):
        if (i + 1) % 10000 == 0:
            print('parsed %d labels' % (i + 1))
        digit = int(struct.unpack_from(fmt_image, bin_data, offset)[0])
        labels[i][digit] = 1  # one-hot: set the position of the digit to 1
        offset += struct.calcsize(fmt_image)
    return labels

def load_train_images(idx_ubyte_file=train_images_idx3_ubyte_file):
    
    return decode_idx3_ubyte(idx_ubyte_file)

def load_train_labels(idx_ubyte_file=train_labels_idx1_ubyte_file):

    return decode_idx1_ubyte(idx_ubyte_file)

def load_test_images(idx_ubyte_file=test_images_idx3_ubyte_file):
    """
    TEST SET IMAGE FILE (t10k-images-idx3-ubyte):
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000803(2051) magic number
    0004     32 bit integer  10000            number of images
    0008     32 bit integer  28               number of rows
    0012     32 bit integer  28               number of columns
    0016     unsigned byte   ??               pixel
    0017     unsigned byte   ??               pixel
    ........
    xxxx     unsigned byte   ??               pixel
    Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).

    :param idx_ubyte_file: path to the idx file
    :return: np.array of shape n*row*col, where n is the number of images
    """
    return decode_idx3_ubyte(idx_ubyte_file)

def load_test_labels(idx_ubyte_file=test_labels_idx1_ubyte_file):
    """
    TEST SET LABEL FILE (t10k-labels-idx1-ubyte):
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000801(2049) magic number (MSB first)
    0004     32 bit integer  10000            number of items
    0008     unsigned byte   ??               label
    0009     unsigned byte   ??               label
    ........
    xxxx     unsigned byte   ??               label
    The labels values are 0 to 9.

    :param idx_ubyte_file: path to the idx file
    :return: np.array of shape n*10 (one-hot labels), where n is the number of images
    """
    return decode_idx1_ubyte(idx_ubyte_file)

# KNN implementation
def mnistUsingKNN(train_dataSet, train_labels, test_dataSet, test_labels):

    knn = neighbors.KNeighborsClassifier(algorithm='kd_tree', n_neighbors=10)
    print('Start training the model')
    start = time.time()
    knn.fit(train_dataSet, train_labels)
    print('Training finished, time: ' + str(time.time() - start))

    res = knn.predict(test_dataSet)  # predict on the test set
    error_num = 0  # number of wrong predictions
    num = len(test_dataSet)  # size of the test set
    print(num)
    for i in range(num):  # go through the predictions
        # Comparing two length-10 arrays gives a boolean array:
        # False where they differ, True where they agree.
        # A correct prediction agrees in all 10 positions, a wrong one in fewer.
        if np.sum(res[i] == test_labels[i]) < 10:
            error_num += 1
    print("Total num:", num, " Wrong num:",
          error_num, "  CorrectRate:", (1 - error_num / float(num)) * 100, "%")

def mnistUsingNN(train_dataSet, train_labels, test_dataSet, test_labels):

    # Create a classifier with the chosen parameters
    clf = MLPClassifier(hidden_layer_sizes=(100, 50, 25),
                        activation='logistic', solver='adam',
                        learning_rate_init=0.0001, max_iter=2000)
    print('Start training the model')
    start = time.time()
    clf.fit(train_dataSet, train_labels)
    print('Training finished, time: ' + str(time.time() - start))

    res = clf.predict(test_dataSet)  # predict on the test set

    error_num = 0  # number of wrong predictions
    num = len(test_dataSet)  # size of the test set

    for i in range(num):  # go through the predictions
        # Comparing two length-10 arrays gives a boolean array:
        # False where they differ, True where they agree.
        # A correct prediction agrees in all 10 positions, a wrong one in fewer.
        if np.sum(res[i] == test_labels[i]) < 10:
            error_num += 1
    print("Total num:", num, " Wrong num:",
          error_num, "  CorrectRate:", (1 - error_num / float(num)) * 100, '%')

if __name__ == '__main__':

    train_images = load_train_images()
    train_labels = load_train_labels()
    test_images = load_test_images()
    test_labels = load_test_labels()

    train_dataSet = train_images.reshape(60000, 784)
    test_dataSet = test_images.reshape(10000, 784)

    mnistUsingNN(train_dataSet, train_labels, test_dataSet, test_labels)
    # mnistUsingKNN(train_dataSet, train_labels, test_dataSet,test_labels)


Lei Li
