Python: Generating Test Datasets for Machine Learning

Author: zhaozj

Added: 2020-10-20 16:14:19


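The first thing that comes to mind when we think about machine learning is the dataset. Many datasets are available on sites such as Kaggle, but it is sometimes useful to generate your own. Doing so gives you direct control over the size, shape, and distribution of the data you use to train and test a machine learning model.

In this article, we generate random datasets using the NumPy library in Python.

Libraries needed:

-> NumPy: sudo pip install numpy

-> Pandas: sudo pip install pandas

-> Matplotlib: sudo pip install matplotlib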

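Normal distribution:

In probability theory, the normal (or Gaussian) distribution is a very common continuous probability distribution that is symmetric about its mean: values near the mean occur more frequently than values far from it. In statistics, normal distributions are often used to represent real-valued random variables.

The normal distribution is the most common type of distribution in statistical analysis. It has two parameters: the mean and the standard deviation. (The standard normal distribution is the special case with mean 0 and standard deviation 1.) The mean is the central tendency of the distribution. The standard deviation is a measure of variability: it sets the width of the distribution and represents the typical distance between an observation and the mean. Many natural phenomena are approximately normally distributed, for example height, blood pressure, measurement error, and IQ scores.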
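As a quick check of the two parameters described above, the sketch below draws samples with NumPy and compares the empirical mean and standard deviation with the values passed in. The sample size, the seed, and the variable names are assumptions chosen for illustration, not values from this article.

```python
import numpy as np

# Assumed values, for illustration only.
mu, sigma, n = 0.5, 0.1, 100_000

# A seeded generator makes the run repeatable.
rng = np.random.default_rng(42)
samples = rng.normal(mu, sigma, n)

# Empirical estimates of the two parameters.
sample_mean = samples.mean()
sample_std = samples.std()

# Fraction of samples within one standard deviation of the mean
# (roughly 68% for a normal distribution).
within_one_sigma = np.mean(np.abs(samples - mu) < sigma)

print(f"mean = {sample_mean:.3f}, std = {sample_std:.3f}")
print(f"within one sigma = {within_one_sigma:.3f}")
```

With a sample this large, the estimates should land close to the chosen mu and sigma, and the last figure should be close to 0.68.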

[Figure: graph of the normal distribution]

Example:


# importing libraries
import numpy as np
import matplotlib.pyplot as plt

# parameters of the normal distribution:
# the mean and the standard deviation
mu = 0.5
sigma = 0.1

# Seed NumPy's random number generator so that every run
# produces the same values. Without a seed, NumPy takes
# fresh entropy from the operating system on each run.
np.random.seed(0)

# define the x co-ordinates
X = np.random.normal(mu, sigma, (395, 1))

# define the y co-ordinates
Y = np.random.normal(mu * 2, sigma * 3, (395, 1))

# plot a graph
plt.scatter(X, Y, color='g')
plt.show()

Output:
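The effect of the seed can be checked directly. The sketch below is one way to do it, using NumPy's `default_rng` generator rather than `np.random.seed`, so the numbers it produces differ from those of the example above. The helper name `make_points` and its defaults are hypothetical.

```python
import numpy as np

def make_points(seed, n=395, mu=0.5, sigma=0.1):
    """Return an (n, 2) array of x/y points from two normal distributions."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, n)
    y = rng.normal(mu * 2, sigma * 3, n)
    return np.column_stack([x, y])

a = make_points(seed=0)
b = make_points(seed=0)
c = make_points(seed=1)

# The same seed reproduces the same data; a different seed does not.
same_seed_equal = np.array_equal(a, b)
different_seed_equal = np.array_equal(a, c)
print(same_seed_equal, different_seed_equal)
```

Fixing the seed is what makes a generated test dataset repeatable across runs and machines.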

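Let us look at a fuller example.

We will generate a dataset with four feature columns and a fifth column that holds the output label, an integer from 0 to 3. A dataset of this shape can be used to train classifiers such as logistic regression, neural networks, and support vector machines.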


# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# defining the columns using normal distribution;
# abs() folds negative samples so every feature is non-negative
# column 1
point1 = abs(np.random.normal(1, 12, 100))
# column 2
point2 = abs(np.random.normal(2, 8, 100))
# column 3
point3 = abs(np.random.normal(3, 2, 100))
# column 4
point4 = abs(np.random.normal(10, 15, 100))

# x contains the features of our dataset;
# the points are concatenated horizontally
# using numpy to form a feature matrix.
x = np.c_[point1, point2, point3, point4]

# the output labels vary from 0-3
y = np.random.randint(0, 4, 100)

# defining a pandas data frame to hold
# the data for later use
data = pd.DataFrame()

# defining the columns of the dataset
data['col1'] = point1
data['col2'] = point2
data['col3'] = point3
data['col4'] = point4
# the 5th column holds the output label
data['label'] = y

# plotting the various features (x)
# against the labels (y).
plt.subplot(2, 2, 1)
plt.title('col1')
plt.scatter(y, point1, color='r')

plt.subplot(2, 2, 2)
plt.title('col2')
plt.scatter(y, point2, color='g')

plt.subplot(2, 2, 3)
plt.title('col3')
plt.scatter(y, point3, color='b')

plt.subplot(2, 2, 4)
plt.title('col4')
plt.scatter(y, point4, color='y')

# keep subplot titles from overlapping
plt.tight_layout()

# saving the graph
plt.savefig('data_visualization.jpg')

# displaying the graph
plt.show()

Output:
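In the example above the labels are drawn independently of the features, so a classifier trained on that data has no real signal to learn. One possible variation, sketched below under assumed parameters, draws the label first and then shifts each feature's mean by the label. The function name `make_labeled_dataset`, the shift size, and the per-column means and standard deviations are all hypothetical.

```python
import numpy as np

def make_labeled_dataset(n_rows=100, n_classes=4, shift=5.0, seed=0):
    """Return (features, labels) where feature means depend on the label."""
    rng = np.random.default_rng(seed)
    # Integer labels in the range 0..n_classes-1.
    labels = rng.integers(0, n_classes, n_rows)
    # Assumed base (mean, std) for each of the four columns.
    params = [(1, 2), (2, 2), (3, 2), (10, 2)]
    columns = [
        rng.normal(mean + shift * labels, std)  # mean moves with the label
        for mean, std in params
    ]
    features = np.column_stack(columns)
    return features, labels

features, labels = make_labeled_dataset()

# Average of the first feature within each class.
class_means = [features[labels == k, 0].mean() for k in range(4)]
print(features.shape, labels.shape)
print(class_means)
```

Because each class shifts the feature means by a fixed amount, the per-class averages rise with the label, which gives a classifier a pattern to pick up.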
