Kaggle EDA - PUBG

Updated: October 09, 2019

PUBG - Introduction

To practice EDA, I followed along with the PUBG finish placement prediction competition on Kaggle.

All code is based on the following kernel: PUGB - overall EDA & TOP 10% players

Content:

1-Database description
2-Exploratory Analysis
- a-Match types
- b-Kills and damage dealt
- c-Maximum distances
- d-Driving vs. Walking
- e-Weapons acquired
- f-Correlation map
3-Analysis of TOP10% of players

1-Database description [^](#1)

First, load the basic libraries.

import numpy as np                    #linear algebra
import pandas as pd                   #database manipulation
import matplotlib.pyplot as plt       #plotting libraries
import seaborn as sns                 #nice graphs and plots
import warnings                       #libraries to deal with warnings
warnings.filterwarnings("ignore")

Load the training data.

train = pd.read_csv('./pubg-finish-placement-prediction/train_V2.csv')

Let's look at some basic information about the dataset.

train.head()

	Id	groupId	matchId	assists	damageDealt	killPlace	...	rideDistance	swimDistance	walkDistance	weaponsAcquired	winPoints	winPlacePerc
0	7f96b2f878858a	4d4b580de459be	a10357fd1a4a91	0	0.00	60	...	0.0000	0.00	244.80	1	1466	0.4444
1	eef90569b9d03c	684d5656442f9e	aeb375fc57110c	0	91.47	57	...	0.0045	11.04	1434.00	5	0	0.6400
2	1eaf90ac73de72	6a4a42c3245a74	110163d8bb94ae	1	68.00	47	...	0.0000	0.00	161.80	2	0	0.7755
3	4616d365dd2853	a930a9c79cd721	f1f1f4ef412d7e	0	32.90	75	...	0.0000	0.00	202.70	3	0	0.1667
4	315c96c26c9aac	de04010b3458dd	6dc8ff871e21e6	0	100.00	45	...	0.0000	0.00	49.75	2	0	0.1875

5 rows × 29 columns

train.shape

(4446966, 29)

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4446966 entries, 0 to 4446965
Data columns (total 29 columns):
Id                 object
groupId            object
matchId            object
assists            int64
boosts             int64
damageDealt        float64
DBNOs              int64
headshotKills      int64
heals              int64
killPlace          int64
killPoints         int64
kills              int64
killStreaks        int64
longestKill        float64
matchDuration      int64
matchType          object
maxPlace           int64
numGroups          int64
rankPoints         int64
revives            int64
rideDistance       float64
roadKills          int64
swimDistance       float64
teamKills          int64
vehicleDestroys    int64
walkDistance       float64
weaponsAcquired    int64
winPoints          int64
winPlacePerc       float64
dtypes: float64(6), int64(19), object(4)
memory usage: 983.9+ MB

29 columns
4,446,966 observations
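The `info()` output above reports close to 1 GB of memory, mostly from `int64` and `float64` columns. A minimal sketch of one common way to shrink such a frame is shown here. The helper name `downcast_numeric` and the toy frame are assumptions for illustration, not part of the original kernel, and `float32` does lose some precision.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose numeric columns use the smallest dtype pandas picks."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Toy stand-in for `train` (hypothetical values, same column names)
toy = pd.DataFrame({
    "kills": np.array([0, 1, 3, 72], dtype="int64"),
    "walkDistance": np.array([244.8, 1434.0, 161.8, 0.0], dtype="float64"),
    "matchType": ["solo", "duo", "squad", "solo"],
})

small = downcast_numeric(toy)
before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(before, "->", after, "bytes")
```

On the real data the same call would be `train = downcast_numeric(train)`. Object columns such as `matchType` are left untouched.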

The columns are described as follows:

groupId - Players team ID
matchId - Match ID
assists - Number of assisted kills. The kill itself is credited to another teammate.
boosts - Number of boost items used by a player. These are for example: energy drink, painkillers, adrenaline syringe.
damageDealt - Damage dealt to the enemy
DBNOs - Down But Not Out - when you lose all your HP but you’re not killed yet. All you can do is crawl.
headshotKills - Number of enemies killed with a headshot
heals - Number of healing items used by a player. These are for example: bandages, first-aid kits
killPlace - Ranking in a match based on kills.
killPoints - Ranking in a match based on kills points.
kills - Number of enemy players killed.
killStreaks - Max number of enemy players killed in a short amount of time.
longestKill - Longest distance between player and killed enemy.
matchDuration - Duration of a match in seconds.
matchType - Type of match. There are three main modes: Solo, Duo or Squad. In this dataset however we have many more categories.
maxPlace - The worst placement recorded in the match.
numGroups - Number of groups (teams) in the match.
revives - Number of times this player revived teammates.
rideDistance - Total distance traveled in vehicles measured in meters.
roadKills - Number of kills from a car, bike, boat, etc.
swimDistance - Total distance traveled by swimming (in meters).
teamKills - Number of teammate kills (due to friendly fire).
vehicleDestroys - Number of vehicles destroyed.
walkDistance - Total distance traveled on foot (in meters).
weaponsAcquired - Number of weapons picked up.
winPoints - Ranking in a match based on won matches.

The target column is:

winPlacePerc - Normalised placement (rank). The 1st place is 1 and the last one is 0.

Let's look at basic statistics for each column. They help us picture the parameters, filter outliers, and get a feel for ranges and scales.

train.describe()

	assists	boosts	damageDealt	DBNOs	headshotKills	heals	killPlace	killPoints	kills	killStreaks	...	revives	rideDistance	roadKills	swimDistance	teamKills	vehicleDestroys	walkDistance	weaponsAcquired	winPoints	winPlacePerc
count	4.446966e+06	4.446966e+06	4.446966e+06	4.446966e+06	4.446966e+06	4.446966e+06	4.446966e+06	4.446966e+06	4.446966e+06	4.446966e+06	...	4.446966e+06	4.446966e+06	4.446966e+06	4.446966e+06	4.446966e+06	4.446966e+06	4.446966e+06	4.446966e+06	4.446966e+06	4.446965e+06
mean	2.338149e-01	1.106908e+00	1.307171e+02	6.578755e-01	2.268196e-01	1.370147e+00	4.759935e+01	5.050060e+02	9.247833e-01	5.439551e-01	...	1.646590e-01	6.061157e+02	3.496091e-03	4.509322e+00	2.386841e-02	7.918208e-03	1.154218e+03	3.660488e+00	6.064601e+02	4.728216e-01
std	5.885731e-01	1.715794e+00	1.707806e+02	1.145743e+00	6.021553e-01	2.679982e+00	2.746294e+01	6.275049e+02	1.558445e+00	7.109721e-01	...	4.721671e-01	1.498344e+03	7.337297e-02	3.050220e+01	1.673935e-01	9.261157e-02	1.183497e+03	2.456544e+00	7.397004e+02	3.074050e-01
min	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	1.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	...	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00
25%	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	2.400000e+01	0.000000e+00	0.000000e+00	0.000000e+00	...	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	1.551000e+02	2.000000e+00	0.000000e+00	2.000000e-01
50%	0.000000e+00	0.000000e+00	8.424000e+01	0.000000e+00	0.000000e+00	0.000000e+00	4.700000e+01	0.000000e+00	0.000000e+00	0.000000e+00	...	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	6.856000e+02	3.000000e+00	0.000000e+00	4.583000e-01
75%	0.000000e+00	2.000000e+00	1.860000e+02	1.000000e+00	0.000000e+00	2.000000e+00	7.100000e+01	1.172000e+03	1.000000e+00	1.000000e+00	...	0.000000e+00	1.909750e-01	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	1.976000e+03	5.000000e+00	1.495000e+03	7.407000e-01
max	2.200000e+01	3.300000e+01	6.616000e+03	5.300000e+01	6.400000e+01	8.000000e+01	1.010000e+02	2.170000e+03	7.200000e+01	2.000000e+01	...	3.900000e+01	4.071000e+04	1.800000e+01	3.823000e+03	1.200000e+01	5.000000e+00	2.578000e+04	2.360000e+02	2.013000e+03	1.000000e+00

8 rows × 25 columns

Let's check for missing values.

train.isna().sum()

Id                 0
groupId            0
matchId            0
assists            0
boosts             0
damageDealt        0
DBNOs              0
headshotKills      0
heals              0
killPlace          0
killPoints         0
kills              0
killStreaks        0
longestKill        0
matchDuration      0
matchType          0
maxPlace           0
numGroups          0
rankPoints         0
revives            0
rideDistance       0
roadKills          0
swimDistance       0
teamKills          0
vehicleDestroys    0
walkDistance       0
weaponsAcquired    0
winPoints          0
winPlacePerc       1
dtype: int64

There is exactly one missing value, and it is in the target column.

train[train.winPlacePerc.isna()]

	Id	groupId	matchId	assists	boosts	damageDealt	DBNOs	headshotKills	heals	killPlace	...	revives	rideDistance	roadKills	swimDistance	teamKills	vehicleDestroys	walkDistance	weaponsAcquired	winPoints	winPlacePerc
2744604	f70c74418bb064	12dfbede33f92b	224a123c53e008	0	0	0.0	0	0	0	1	...	0	0.0	0	0.0	0	0	0.0	0	0	NaN

1 rows × 29 columns
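The post only inspects this row. If one wanted to drop it before modelling, a minimal sketch with `DataFrame.dropna` could look like the following. The toy frame is hypothetical and only mimics the situation above (one `NaN` in `winPlacePerc`).

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`: the last row has no target value
toy = pd.DataFrame({
    "Id": ["a", "b", "c"],
    "kills": [0, 2, 0],
    "winPlacePerc": [0.4444, 1.0, np.nan],
})

# Drop only the rows whose target is missing; other columns are not checked
clean = toy.dropna(subset=["winPlacePerc"]).reset_index(drop=True)
print(len(toy), "->", len(clean))
```

Restricting `subset` to the target keeps rows that might have gaps in other columns, which matters little here because no other column has missing values.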

2-Exploratory Data Analysis [^](#2)

a) Match types [^](#3)

no_matches = train.loc[:,"matchId"].nunique()
print("{} matches are stored in the dataset.".format(no_matches))

47965 matches are stored in the dataset.

m_types = train.loc[:,"matchType"].value_counts().to_frame().reset_index()
m_types.columns = ["Type","Count"]
m_types

	Type	Count
0	squad-fpp	1756186
1	duo-fpp	996691
2	squad	626526
3	solo-fpp	536762
4	duo	313591
5	solo	181943
6	normal-squad-fpp	17174
7	crashfpp	6287
8	normal-duo-fpp	5489
9	flaretpp	2505
10	normal-solo-fpp	1682
11	flarefpp	718
12	normal-squad	516
13	crashtpp	371
14	normal-solo	326
15	normal-duo	199

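PUBG has three main game modes: Solo, Duo, and Squad.

Modes are also split by camera perspective.

FPP - first-person perspective
TPP - third-person perspective
Normal - perspective can be switched during the game

However, I don't know what the flare- and crash- types mean. Domain knowledge really is essential.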

plt.figure(figsize=(15,8))
ticks = m_types.Type.values
ax = sns.barplot(x="Type", y="Count", data=m_types)
ax.set_xticklabels(ticks, rotation=45, fontsize=14)
ax.set_title("Match types")
plt.show()

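[Figure: bar chart of record counts per match type]

The chart shows that squad and duo are the most popular. Now let's aggregate the types into the three main categories.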

# Index the value_counts() Series directly, so the lookup does not depend on
# the column name that to_frame() assigns (it differs between pandas versions)
m_types2 = train.loc[:,"matchType"].value_counts()
aggregated_squads = m_types2.loc[["squad-fpp","squad","normal-squad-fpp","normal-squad"]].sum()
aggregated_duos = m_types2.loc[["duo-fpp","duo","normal-duo-fpp","normal-duo"]].sum()
aggregated_solo = m_types2.loc[["solo-fpp","solo","normal-solo-fpp","normal-solo"]].sum()
aggregated_mt = pd.DataFrame([aggregated_squads,aggregated_duos,aggregated_solo], index=["squad","duo","solo"], columns =["count"])
aggregated_mt

	count
squad	2400402
duo	1315970
solo	720713

aggregated_mt.plot.pie(y='count', legend=True, autopct='%.1f');

[Figure: pie chart of squad, duo, and solo shares]

This shows that more than 54% of the player records come from squad mode.
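The same aggregation can be sketched with a small mapping function instead of hand-written label lists. The function name `main_mode` and the `"other"` bucket for the flare/crash types are assumptions added here; the counts are copied from the match type table above.

```python
import pandas as pd

def main_mode(match_type: str) -> str:
    """Map a raw matchType label to solo/duo/squad, or 'other' (flare/crash)."""
    for mode in ("solo", "duo", "squad"):
        if mode in match_type:
            return mode
    return "other"

# Counts per matchType, taken from the table in this section
counts = pd.Series({
    "squad-fpp": 1756186, "duo-fpp": 996691, "squad": 626526,
    "solo-fpp": 536762, "duo": 313591, "solo": 181943,
    "normal-squad-fpp": 17174, "crashfpp": 6287, "normal-duo-fpp": 5489,
    "flaretpp": 2505, "normal-solo-fpp": 1682, "flarefpp": 718,
    "normal-squad": 516, "crashtpp": 371, "normal-solo": 326,
    "normal-duo": 199,
})

# groupby with a function applies it to each index label
aggregated = counts.groupby(main_mode).sum()

# Share among the three main modes only, as in the pie chart
main_only = aggregated.drop("other")
share = (main_only / main_only.sum() * 100).round(1)
print(aggregated)
print(share)
```

On the full frame the equivalent would be `train["matchType"].map(main_mode).value_counts()`. Keeping an explicit `"other"` bucket makes it visible that the flare and crash records are left out of the pie chart.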

b) Kills and damage dealt [^](#4)

train.plot(x="kills",y="damageDealt", kind="scatter", figsize = (15,10))
plt.show()

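[Figure: scatter plot of damageDealt against kills]

There is a clear correlation between kills and damage dealt. There are also some outliers: more than 60 kills is far above what most players reach.

The kill masters are: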

train[train['kills']>60]

	Id	groupId	matchId	assists	boosts	damageDealt	headshotKills	heals	killPlace	...	walkDistance	weaponsAcquired	winPlacePerc
334400	810f2379261545	7f3e493ee71534	f900de1ec39fa5	20	0	6616.0	13	5	1	...	1036.0	60	1.0
1248348	80ac0bbf58bfaf	1e54ab4540a337	08e4c9e6c033e2	5	0	6375.0	21	4	1	...	1740.0	23	1.0
3431247	06308c988bf0c2	4c4ee1e9eb8b5e	6680c7c3d17d48	7	4	5990.0	64	10	1	...	728.1	35	1.0

3 rows × 29 columns

Let's look at the headshot statistics. Players with no headshot kills are filtered out.

headshots = train[train['headshotKills']>0]
plt.figure(figsize=(15,5))
sns.countplot(x=headshots['headshotKills'].sort_values())  # pass the data as x= explicitly; newer seaborn versions expect keyword arguments
print("Maximum number of headshots that the player scored: " + str(train["headshotKills"].max()))

Maximum number of headshots that the player scored: 64

[Figure: count plot of headshotKills]

DBNO - Down But Not Out. These are the DBNO counts recorded by players.

plt.figure(figsize=(15,5))
sns.countplot(x=train[train['DBNOs']>0]['DBNOs'])  # pass the data as x= explicitly; newer seaborn versions expect keyword arguments
print("Mean number of DBNOs that the player scored: " + str(train["DBNOs"].mean()))

Mean number of DBNOs that the player scored: 0.6578755043326169

[Figure: count plot of DBNOs]

Is there a correlation between DBNOs and kills?

train.plot.scatter(x='DBNOs', y='kills', figsize=(15,10));

[Figure: scatter plot of kills against DBNOs]

DBNOs and kills are correlated.
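The scatter plot gives a visual impression only. A rough sketch of how the relationship could be quantified is given here, using Pearson's coefficient and a rank-based (Spearman-style) coefficient computed as Pearson on ranks. The toy values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy values; on the real data use train["DBNOs"] and train["kills"]
toy = pd.DataFrame({
    "DBNOs": [0, 0, 1, 2, 3, 5],
    "kills": [0, 1, 1, 2, 4, 6],
})

# Linear association
pearson = toy["DBNOs"].corr(toy["kills"])

# Monotonic association: Pearson correlation of the average ranks
spearman = toy["DBNOs"].rank().corr(toy["kills"].rank())

print("Pearson:  {:.3f}".format(pearson))
print("Spearman: {:.3f}".format(spearman))
```

The rank-based version is less sensitive to extreme rows such as the 60+ kill outliers seen earlier.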

c) Maximum distances [^](#5)

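The range is filtered to reasonable kill distances. Below is an example of aiming at 100 m and 200 m.

[Image: example of aiming at targets 100 m and 200 m away]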

dist = train[train['longestKill']<200]
plt.rcParams['axes.axisbelow'] = True
dist.hist('longestKill', bins=20, figsize = (15,10))
plt.show()

[Figure: histogram of longestKill below 200 m]

print("Average longest kill distance a player achieves is {:.1f}m, 95% of them not more than {:.1f}m and the maximum distance is {:.1f}m." .format(train['longestKill'].mean(),train['longestKill'].quantile(0.95),train['longestKill'].max()))

Average longest kill distance a player achieves is 23.0m, 95% of them not more than 126.1m and the maximum distance is 1094.0m.

A 1094 m kill looks unrealistic, but it is possible with an 8x scope, a static target, a good position, and some luck.
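The histogram above uses a fixed 200 m cutoff. As an alternative sketch, the cutoff could be derived from a quantile so that it adapts to the data. The values below are hypothetical and only loosely echo the numbers printed above.

```python
import pandas as pd

# Hypothetical longestKill values; on the real data use train["longestKill"]
longest_kill = pd.Series([0.0, 0.0, 12.5, 30.0, 58.0, 126.1, 1094.0])

# Keep everything up to the 95th percentile (linear interpolation by default)
cutoff = longest_kill.quantile(0.95)
filtered = longest_kill[longest_kill <= cutoff]

print("cutoff: {:.2f} m".format(cutoff))
print(len(longest_kill), "->", len(filtered), "rows")
```

With so few toy rows the interpolated cutoff lands between the two largest values, so only the extreme shot is removed.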


d) Driving vs. Walking [^](#6)

Let's look at players who never walked or never drove.

walk0 = train["walkDistance"] == 0
ride0 = train["rideDistance"] == 0
swim0 = train["swimDistance"] == 0
print("{} players didn't walk at all, {} players didn't drive and {} didn't swim." .format(walk0.sum(),ride0.sum(),swim0.sum()))

99603 players didn't walk at all, 3309429 players didn't drive and 4157694 didn't swim.

You have to walk to play the game, so did the players who never walked not play at all?

walk0_rows = train[walk0]
print("Average place of non-walking players is {:.3f}, minimum is {} and the best is {}, 95% of players have a score below {}." 
      .format(walk0_rows["winPlacePerc"].mean(), walk0_rows["winPlacePerc"].min(), walk0_rows["winPlacePerc"].max(),walk0_rows["winPlacePerc"].quantile(0.95)))
walk0_rows.hist('winPlacePerc', bins=40, figsize = (15,7))

Average place of non-walking players is 0.044, minimum is 0.0 and the best is 1.0, 95% of players have a score below 0.25.

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001DFACCE4550>]],
      dtype=object)

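[Figure: histogram of winPlacePerc for non-walking players]

Most players who never walked finished last. A few of them, however, won the chicken dinner. That is clearly foul play. Let's find the suspicious players.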

train[(train['winPlacePerc']== 1) & (train['walkDistance'] == 0)].head()

	Id	groupId	matchId	boosts	damageDealt	killPlace	...	weaponsAcquired	winPlacePerc
3702	3fc123559fc935	5cef1df7ee3551	01aead02bb8901	0	0.0000	1	...	3	1.0
8790	106afdb574db25	4b0ae4659e9936	cf0cb51c829eb5	0	0.0000	2	...	1	1.0
9264	0351565a7058e9	3663a93a319725	3659fe3694262a	0	0.3218	1	...	9	1.0
18426	e6d6f94558dd2f	22818b9a9a6159	486200c5613f14	1	0.0000	2	...	6	1.0
19054	d0683f5d780f09	faebf5c484de4a	ec9a90395ed8c0	0	99.0000	1	...	9	1.0

5 rows × 29 columns

suspects = train.query('winPlacePerc ==1 & walkDistance ==0').head()
suspects.head()

	Id	groupId	matchId	boosts	damageDealt	killPlace	...	weaponsAcquired	winPlacePerc
3702	3fc123559fc935	5cef1df7ee3551	01aead02bb8901	0	0.0000	1	...	3	1.0
8790	106afdb574db25	4b0ae4659e9936	cf0cb51c829eb5	0	0.0000	2	...	1	1.0
9264	0351565a7058e9	3663a93a319725	3659fe3694262a	0	0.3218	1	...	9	1.0
18426	e6d6f94558dd2f	22818b9a9a6159	486200c5613f14	1	0.0000	2	...	6	1.0
19054	d0683f5d780f09	faebf5c484de4a	ec9a90395ed8c0	0	99.0000	1	...	9	1.0

5 rows × 29 columns

print("Maximum ride distance for suspected entries is {:.3f} meters, and swim distance is {:.1f} meters." .format(suspects["rideDistance"].max(), suspects["swimDistance"].max()))

Maximum ride distance for suspected entries is 0.000 meters, and swim distance is 0.0 meters.

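Interestingly, every travel distance is 0. Note that `suspects` holds only the first five matching rows because of the `.head()` call, so this check covers just those rows.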
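To check all rows rather than a handful, a boolean mask over the whole frame is one option. The rule used here (no movement of any kind, yet at least one kill or a first place) is a hypothetical heuristic chosen for illustration, not a rule from the original kernel, and the toy rows are made up.

```python
import pandas as pd

# Hypothetical toy rows; on the real data apply the same mask to `train`
toy = pd.DataFrame({
    "walkDistance": [0.0, 0.0, 244.8, 0.0],
    "rideDistance": [0.0, 0.0, 0.0, 1200.0],
    "swimDistance": [0.0, 0.0, 0.0, 0.0],
    "kills": [0, 3, 1, 2],
    "winPlacePerc": [0.0, 1.0, 0.4444, 0.9],
})

distance_cols = ["walkDistance", "rideDistance", "swimDistance"]
total_distance = toy[distance_cols].sum(axis=1)

# Assumed rule: the player never moved, but still scored kills or won
never_moved = total_distance == 0
suspicious = never_moved & ((toy["kills"] > 0) | (toy["winPlacePerc"] == 1))

flagged = toy[suspicious]
print(flagged)
print("flagged rows:", int(suspicious.sum()))
```

Row 0 never moved but also did nothing, so it looks like an idle player rather than a suspect. Row 3 did not walk but drove, so it is not flagged either.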

ride = train.query('rideDistance >0 & rideDistance <10000')
walk = train.query('walkDistance >0 & walkDistance <4000')
ride.hist('rideDistance', bins=40, figsize = (15,10))
walk.hist('walkDistance', bins=40, figsize = (15,10))
plt.show()

[Figure: histograms of rideDistance and walkDistance]

Let's add up all travel distances and look at the distribution.

travel_dist = train["walkDistance"] + train["rideDistance"] + train["swimDistance"]
travel_dist = travel_dist[travel_dist<5000]
travel_dist.hist(bins=40, figsize = (15,10))

<matplotlib.axes._subplots.AxesSubplot at 0x1e026a1ea58>

[Figure: histogram of total travel distance below 5000 m]

e) Weapons acquired [^](#7)

print("Average number of acquired weapons is {:.3f}, minimum is {} and the maximum {}, 99% of players acquired no more than {} weapons." 
      .format(train["weaponsAcquired"].mean(), train["weaponsAcquired"].min(), train["weaponsAcquired"].max(), train["weaponsAcquired"].quantile(0.99)))
train.hist('weaponsAcquired', figsize = (20,10),range=(0, 10), align="left", rwidth=0.9)

Average number of acquired weapons is 3.660, minimum is 0 and the maximum 236, 99% of players acquired no more than 10.0 weapons.

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001E0182E1630>]],
      dtype=object)

[Figure: histogram of weaponsAcquired]

f) Correlation map [^](#8)

plt.figure(figsize=(20,15))
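sns.heatmap(train.select_dtypes('number').corr(), annot=True)  # restrict to numeric columns; recent pandas raises on the object columns (Id, matchType, ...)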

<matplotlib.axes._subplots.AxesSubplot at 0x1e0244f56a0>

[Figure: correlation heatmap]

ax = sns.clustermap(train.select_dtypes('number').corr(), annot=True, linewidths=.6, fmt= '.2f', figsize=(20, 15))  # numeric columns only
plt.show()

[Figure: clustered correlation map]
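A full heatmap is hard to read with this many columns. One way to focus it, sketched here, is to pull out only the correlations with the target and sort them. The data is synthetic (generated so that `walkDistance` drives the target) and serves only to make the snippet runnable. It says nothing about the real coefficients.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `train`; the relationship is built in on purpose
rng = np.random.default_rng(0)
n = 500
walk = rng.uniform(0, 4000, n)
toy = pd.DataFrame({
    "walkDistance": walk,
    "teamKills": rng.integers(0, 2, n),
    "matchType": rng.choice(["solo", "duo", "squad"], n),
    "winPlacePerc": np.clip(walk / 4000 + rng.normal(0, 0.1, n), 0, 1),
})

# Numeric columns only, then the target's column of the correlation matrix
corr_with_target = (
    toy.select_dtypes("number")
       .corr()["winPlacePerc"]
       .drop("winPlacePerc")
       .sort_values(ascending=False)
)
print(corr_with_target)
```

On the real frame the same chain applied to `train` gives a ranked list that is easier to scan than the annotated heatmap.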

3-Analysis of TOP 10% of players [^](#9)

top10 = train[train["winPlacePerc"]>0.9]
print("TOP 10% overview\n")
print("Average number of kills: {:.1f}\nMinimum: {}\nThe best: {}\n95% of players within: {} kills." 
      .format(top10["kills"].mean(), top10["kills"].min(), top10["kills"].max(),top10["kills"].quantile(0.95)))

top10.plot(x="kills", y="damageDealt", kind="scatter", figsize = (15,10))

TOP 10% overview

Average number of kills: 2.6
Minimum: 0
The best: 72
95% of players within: 8.0 kills.

<matplotlib.axes._subplots.AxesSubplot at 0x1e037457278>

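[Figure: scatter plot of damageDealt against kills for the top 10%]

Let's look at the travel distances and compare them with all players.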

fig, ax1 = plt.subplots(figsize = (15,10))
walk.hist('walkDistance', bins=40, figsize = (15,10), ax = ax1)
walk10 = top10[top10['walkDistance']<5000]
walk10.hist('walkDistance', bins=40, figsize = (15,10), ax = ax1)

print("Average walking distance: " + str(top10['walkDistance'].mean()))

Average walking distance: 2813.5134925205784

[Figure: walkDistance histograms for all players and for the top 10%]

fig, ax1 = plt.subplots(figsize = (15,10))
ride.hist('rideDistance', bins=40, figsize = (15,10), ax = ax1)
ride10 = top10.query('rideDistance >0 & rideDistance <10000')
ride10.hist('rideDistance', bins=40, figsize = (15,10), ax = ax1)
print("Average riding distance: " + str(top10['rideDistance'].mean()))

Average riding distance: 1392.0857815081788

[Figure: rideDistance histograms for all players and for the top 10%]

What is the longest kill distance?

print("On average the best 10% of players have the longest kill at {:.3f} meters, and the best score is {:.1f} meters." .format(top10["longestKill"].mean(), top10["longestKill"].max()))

On average the best 10% of players have the longest kill at 75.048 meters, and the best score is 1094.0 meters.
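The top 10% figures are easier to judge next to the same statistics for all players. A small sketch of such a side-by-side table follows. The rows are invented toy values, and the `ratio` column is an addition for illustration.

```python
import pandas as pd

# Hypothetical toy rows; on the real data use `train` and the same filter
toy = pd.DataFrame({
    "kills": [0, 1, 0, 5, 2, 8],
    "walkDistance": [50.0, 700.0, 150.0, 3000.0, 1200.0, 2800.0],
    "winPlacePerc": [0.05, 0.45, 0.20, 1.00, 0.60, 0.95],
})

top10 = toy[toy["winPlacePerc"] > 0.9]
cols = ["kills", "walkDistance"]

# One row per statistic, one column per group
comparison = pd.DataFrame({
    "all": toy[cols].mean(),
    "top10": top10[cols].mean(),
})
comparison["ratio"] = comparison["top10"] / comparison["all"]
print(comparison)
```

Adding more columns to `cols` (for example `rideDistance` or `longestKill`) extends the table without further changes.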

Let's look at the correlations between variables.

ax = sns.clustermap(top10.select_dtypes('number').corr(), annot=True, linewidths=.5, fmt= '.2f', figsize=(20, 15))  # numeric columns only
plt.show()

[Figure: clustered correlation map for the top 10%]

남궁찬
