[Pandas 기초] EDA, 데이터 탐색하기(요약정보, 기초통계, 시각화)

업데이트: August 02, 2019

1. 데이터 내용 미리보기 : `head()`, `tail()`

head()는 데이터의 앞단, tail()은 뒷단을 볼 수 있다.
괄호()안에 숫자를 입력해 해당 숫자만큼의 행을 볼 수 있고, 기본값은 6row까지다.

import pandas as pd

df = pd.read_csv("~/auto-mpg.csv",header=None)
df.head()

	0	1	2	3	4	5	6	7	8
0	18.0	8	307.0	130.0	3504.0	12.0	70	1	chevrolet chevelle malibu
1	15.0	8	350.0	165.0	3693.0	11.5	70	1	buick skylark 320
2	18.0	8	318.0	150.0	3436.0	11.0	70	1	plymouth satellite
3	16.0	8	304.0	150.0	3433.0	12.0	70	1	amc rebel sst
4	17.0	8	302.0	140.0	3449.0	10.5	70	1	ford torino

print(df.columns)

[Output]
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype='int64')

열이름을 지정해주자
물론 앞에서 배운 read_csv()옵션으로 names = []를 줘서 바꿔 줄수도 있다.

df.columns = ['mpg', 'cyclinders','displacement','horsepower','weight',
             'accerleration','model year','origin','name']
df.head()

	mpg	cyclinders	displacement	horsepower	weight	accerleration	model year	origin	name
0	18.0	8	307.0	130.0	3504.0	12.0	70	1	chevrolet chevelle malibu
1	15.0	8	350.0	165.0	3693.0	11.5	70	1	buick skylark 320
2	18.0	8	318.0	150.0	3436.0	11.0	70	1	plymouth satellite
3	16.0	8	304.0	150.0	3433.0	12.0	70	1	amc rebel sst
4	17.0	8	302.0	140.0	3449.0	10.5	70	1	ford torino

df.tail()

	mpg	cyclinders	displacement	horsepower	weight	accerleration	model year	origin	name
393	27.0	4	140.0	86.00	2790.0	15.6	82	1	ford mustang gl
394	44.0	4	97.0	52.00	2130.0	24.6	82	2	vw pickup
395	32.0	4	135.0	84.00	2295.0	11.6	82	1	dodge rampage
396	28.0	4	120.0	79.00	2625.0	18.6	82	1	ford ranger
397	31.0	4	119.0	82.00	2720.0	19.4	82	1	chevy s-10

2. 데이터 요약정보 확인하기

2-1. `shape`

데이터 프레임의 모든 기본 정보(타입, 행열크기, 타입, 메모리 등)를 보여준다.

df.info()

[Output]
 	<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg              398 non-null float64
cyclinders       398 non-null int64
displacement     398 non-null float64
horsepower       398 non-null object
weight           398 non-null float64
accerleration    398 non-null float64
model year       398 non-null int64
origin           398 non-null int64
name             398 non-null object
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB
None

2-2. `info()`

데이터의 행과 열을 보여준다.

df.shape

[Output]
(398, 9)

2-3. `dtypes`

데이터 변수(열)들의 타입을 보여준다.

df.dtypes

[Output]
mpg              float64
cyclinders         int64
displacement     float64
horsepower        object
weight           float64
accerleration    float64
model year         int64
origin             int64
name              object
dtype: object

시리즈(mpg 열)의 자료형만 확인할수도 있다.

df.mpg.dtypes

[Output]
float64

3. 기초 통계정보 확인하기

3-1. 기초통계량 : `describe()`

describe()는 데이터의 기초통계량을 제공한다.
총 데이터 수(count), 평균(mean), 표준편차(std), 분위수(25,50,75%), 최대최소(max,min)
여기서 50%가 중앙값(median)인건 당연하다.

df.describe()

[Output]
              mpg  cyclinders  displacement       weight  accerleration  \
count  398.000000  398.000000    398.000000   398.000000     398.000000   
mean    23.514573    5.454774    193.425879  2970.424623      15.568090   
std      7.815984    1.701004    104.269838   846.841774       2.757689   
min      9.000000    3.000000     68.000000  1613.000000       8.000000   
25%     17.500000    4.000000    104.250000  2223.750000      13.825000   
50%     23.000000    4.000000    148.500000  2803.500000      15.500000   
75%     29.000000    8.000000    262.000000  3608.000000      17.175000   
max     46.600000    8.000000    455.000000  5140.000000      24.800000   

       model year      origin  
count  398.000000  398.000000  
mean    76.010050    1.572864  
std      3.697627    0.802055  
min     70.000000    1.000000  
25%     73.000000    1.000000  
50%     76.000000    1.000000  
75%     79.000000    2.000000  
max     82.000000    3.000000  

추가로 입력인수로 include='all'을 넣어주면 문자열 데이터가 있는 열에 대한 추가정보를 제공한다.
정확히는 고유값 개수(unique), 최빈값(top), 빈도수(freq)이다.
숫자형 열에 대해서는 NaN인 것을 알 수 있다.

df.describe(include='all')

[Output]
               mpg  cyclinders  displacement horsepower       weight  \
count   398.000000  398.000000    398.000000        398   398.000000   
unique         NaN         NaN           NaN         94          NaN   
top            NaN         NaN           NaN      150.0          NaN   
freq           NaN         NaN           NaN         22          NaN   
mean     23.514573    5.454774    193.425879        NaN  2970.424623   
std       7.815984    1.701004    104.269838        NaN   846.841774   
min       9.000000    3.000000     68.000000        NaN  1613.000000   
25%      17.500000    4.000000    104.250000        NaN  2223.750000   
50%      23.000000    4.000000    148.500000        NaN  2803.500000   
75%      29.000000    8.000000    262.000000        NaN  3608.000000   
max      46.600000    8.000000    455.000000        NaN  5140.000000   

        accerleration  model year      origin        name  
count      398.000000  398.000000  398.000000         398  
unique            NaN         NaN         NaN         305  
top               NaN         NaN         NaN  ford pinto  
freq              NaN         NaN         NaN           6  
mean        15.568090   76.010050    1.572864         NaN  
std          2.757689    3.697627    0.802055         NaN  
min          8.000000   70.000000    1.000000         NaN  
25%         13.825000   73.000000    1.000000         NaN  
50%         15.500000   76.000000    1.000000         NaN  
75%         17.175000   79.000000    2.000000         NaN  
max         24.800000   82.000000    3.000000         NaN  

3-2. 데이터 개수 확인 : `count()`

count()는 데이터의 열마다 개수를 시리즈 형태로 반환한다.

df.count()

[Output]
mpg              398
cyclinders       398
displacement     398
horsepower       398
weight           398
accerleration    398
model year       398
origin           398
name             398
dtype: int64

print(type(df.count()))

[Output]
<class 'pandas.core.series.Series'>

3-3. 각 열의 고유값 개수 : `value_counts()`

이 함수는 시리즈 자료형에 적용하는 함수로, 특정 열의 고유값 개수를 다시 시리즈형태로 반환한다.

df['origin'].value_counts()

[Output]
1    249
3     79
2     70
Name: origin, dtype: int64

의미는 ‘1’이 249개, ‘3’이 79개, ‘70’이 70개가 있다는 소리다.

3-4 기초통계량 직접 계산하기 : `mean()`,`median()`,`max()`, `min()`, `std()`

아까 describe()함수로 한번에 보여줄 수도 있고, 특정 의도로 어떤 기초통계 값만 뽑아서 사용하고 싶을때가 있다.

이 연산메소드들은 데이터프레임, 시리즈(특정열)에 다 적용할 수 있다.
`

print(df.mean()) # 데이터프레임에 적용
print('\n')
print(df['mpg'].mean()) # 시리즈에 적용
print('\n')
print(df[['mpg','weight']].mean()) # 2개의 시리즈, 즉 데이터프레임에 적용

[Output]
mpg                23.514573
cyclinders          5.454774
displacement      193.425879
weight           2970.424623
accerleration      15.568090
model year         76.010050
origin              1.572864
dtype: float64

23.514572864321615

mpg         23.514573
weight    2970.424623
dtype: float64

3-5. 상관계수(문자열은 제외) : `corr()`

상관관계는 두 변수간의 상관성에 대한 정보로 -1~1사이의 값을 갖는다.
이는 선형성가정 등을 고려해야할 필요가 있지만, EDA과정에서 한번 볼 필요는 있다.
corr()함수는 문자열을 제외한 변수간 매트릭스를 생성하고 각 쌍들의 상관계수를 반환한다.

df.corr()

[Output]
                    mpg  cyclinders  displacement    weight  accerleration  \
mpg            1.000000   -0.775396     -0.804203 -0.831741       0.420289   
cyclinders    -0.775396    1.000000      0.950721  0.896017      -0.505419   
displacement  -0.804203    0.950721      1.000000  0.932824      -0.543684   
weight        -0.831741    0.896017      0.932824  1.000000      -0.417457   
accerleration  0.420289   -0.505419     -0.543684 -0.417457       1.000000   
model year     0.579267   -0.348746     -0.370164 -0.306564       0.288137   
origin         0.563450   -0.562543     -0.609409 -0.581024       0.205873   

               model year    origin  
mpg              0.579267  0.563450  
cyclinders      -0.348746 -0.562543  
displacement    -0.370164 -0.609409  
weight          -0.306564 -0.581024  
accerleration    0.288137  0.205873  
model year       1.000000  0.180662  
origin           0.180662  1.000000  

특정 두 변수만 지정해서 상관계수를 구할수도 있다.

print('\n')
print(df[['mpg','weight']].corr())

[Output]
             mpg    weight
mpg     1.000000 -0.831741
weight -0.831741  1.000000

4. 판다스 내장 그래프 도구 활용

판다스에 내장된 시각화 도구를 이용하여 데이터를 살펴보자.

4-1. 선 그래프 : `plot()`

df = pd.read_excel("D:/Python/Pandas_ML/source/part3/남북한발전전력량.xlsx")
df.head(6)

	전력량 (억㎾h)	발전 전력별	1990	1991	1992	1993	1994	1995	1996	1997	...	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016
0	남한	합계	1077	1186	1310	1444	1650	1847	2055	2244	...	4031	4224	4336	4747	4969	5096	5171	5220	5281	5404
1	NaN	수력	64	51	49	60	41	55	52	54	...	50	56	56	65	78	77	84	78	58	66
2	NaN	화력	484	573	696	803	1022	1122	1264	1420	...	2551	2658	2802	3196	3343	3430	3581	3427	3402	3523
3	NaN	원자력	529	563	565	581	587	670	739	771	...	1429	1510	1478	1486	1547	1503	1388	1564	1648	1620
4	NaN	신재생	-	-	-	-	-	-	-	-	...	-	-	-	-	-	86	118	151	173	195
5	북한	합계	277	263	247	221	231	230	213	193	...	236	255	235	237	211	215	221	216	190	239

6 rows × 29 columns

남북한발전전력량 데이터에서 남한과 북한의 합계 자료먄 인덱싱하고, 인덱스명을 부여해주었다.
그리고 컬렴명을 숫자형으로 map함수를 이용해 숫자형으로 변환했다.
map함수는 추후에 포스팅할 것이다.

df_ns = df.iloc[[0,5], 2:]
df_ns.index = ['South','North']
df_ns.columns = df_ns.columns.map(int)
print(df_ns)

[Output]
       1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  ...   2007  \
South  1077  1186  1310  1444  1650  1847  2055  2244  2153  2393  ...   4031   
North   277   263   247   221   231   230   213   193   170   186  ...    236   

       2008  2009  2010  2011  2012  2013  2014  2015  2016  
South  4224  4336  4747  4969  5096  5171  5220  5281  5404  
North   255   235   237   211   215   221   216   190   239  

[2 rows x 27 columns]

간단히 plot()함수를 이용해 선그래프를 그려보았다.

df_ns.plot()

png

뭔가 그림이 이상하니, 전치(transpose)해서 다시 그려보자.

tdf_ns = df_ns.T
print(tdf_ns.head())

     South North
1077   277
1186   263
1310   247
1444   221
1650   231

tdf_ns.plot()

png

발전 전력량은 한국은 계속 상승하고 있지만, 북한은 거의 정체되어 있음을 알 수 있다.

4-2. 막대그래프 : `plot(kind='bar')`

tdf_ns.plot(kind='bar')

png

4-3. 히스토그램 : `plot(kind='hist')`

tdf_ns.plot(kind='hist')

png

히스토그램은 변수의 의미가 아닌, 단지 값들에 대한 빈도수를 의미하므로 이 데이터에는 어울리지 않는 듯하다.

4-4. 산점도 : `plot(kind='scatter')`

df = pd.read_csv("D:/Python/Pandas_ML/source/part3/auto-mpg.csv", header=None)
df.columns = ['mpg', 'cyclinders','displacement','horsepower','weight',
             'accerleration','model year','origin','name']
print(df.head())

[Output]
    mpg  cyclinders  displacement horsepower  weight  accerleration  \
18.0           8         307.0      130.0  3504.0           12.0   
15.0           8         350.0      165.0  3693.0           11.5   
18.0           8         318.0      150.0  3436.0           11.0   
16.0           8         304.0      150.0  3433.0           12.0   
17.0           8         302.0      140.0  3449.0           10.5   

   model year  origin                       name  
        70       1  chevrolet chevelle malibu  
        70       1          buick skylark 320  
        70       1         plymouth satellite  
        70       1              amc rebel sst  
        70       1                ford torino  

산점도는 x축과 y축의 변수를 지정하고 kind = 'scatter'옵션을 추가해 주면 된다.

df.plot(x='weight',y='mpg', kind = 'scatter')

png

4-5. 박스플롯(box-plot) : `plot(kind='box')`

df = pd.read_csv("D:/Python/Pandas_ML/source/part3/auto-mpg.csv", header=None)
df.columns = ['mpg', 'cyclinders','displacement','horsepower','weight',
             'accerleration','model year','origin','name']
print(df.head())

[Output]
    mpg  cyclinders  displacement horsepower  weight  accerleration  \
18.0           8         307.0      130.0  3504.0           12.0   
15.0           8         350.0      165.0  3693.0           11.5   
18.0           8         318.0      150.0  3436.0           11.0   
16.0           8         304.0      150.0  3433.0           12.0   
17.0           8         302.0      140.0  3449.0           10.5   

   model year  origin                       name  
        70       1  chevrolet chevelle malibu  
        70       1          buick skylark 320  
        70       1         plymouth satellite  
        70       1              amc rebel sst  
        70       1                ford torino  

df[['mpg','cyclinders']].plot(kind='box')

png

박스플롯은 이상치 제외 최소값, 1분위수, 중앙값, 3분위수, 이상치 제외 최대값, 이상치 를 제공한다.
자세한 설명은 여기 블로그에 설명이 잘 되어있어 참고하면 좋을 것 같다.

Reference

도서 [파이썬 머신러닝 판다스 데이터 분석]을 공부하며 작성하였습니다.

Twitter Facebook LinkedIn

Yonggeun Shin

[Pandas 기초] EDA, 데이터 탐색하기(요약정보, 기초통계, 시각화)

1. 데이터 내용 미리보기 : `head()`, `tail()`

2. 데이터 요약정보 확인하기

2-1. `shape`

2-2. `info()`

2-3. `dtypes`

3. 기초 통계정보 확인하기

3-1. 기초통계량 : `describe()`

3-2. 데이터 개수 확인 : `count()`

3-3. 각 열의 고유값 개수 : `value_counts()`

3-4 기초통계량 직접 계산하기 : `mean()`,`median()`,`max()`, `min()`, `std()`

3-5. 상관계수(문자열은 제외) : `corr()`

4. 판다스 내장 그래프 도구 활용

4-1. 선 그래프 : `plot()`

4-2. 막대그래프 : `plot(kind='bar')`

4-3. 히스토그램 : `plot(kind='hist')`

4-4. 산점도 : `plot(kind='scatter')`

4-5. 박스플롯(box-plot) : `plot(kind='box')`

Reference

공유하기

댓글남기기

참고

[서평] 한가지에 집중하라!, The One Thing - 게리 켈러, 제이 파파산

[Hive] Hive CREATE TABLE 다양한 옵션 정리

[Hive] Hive Table 다루기 (DESCRIBE, CREATE, LOAD, INSERT, DROP, VIEW, CTAS)

[Linux] Linux 기본 명령어 정리

Yonggeun Shin

1. 데이터 내용 미리보기 : head(), tail()

2. 데이터 요약정보 확인하기

2-1. shape

2-2. info()

2-3. dtypes

3. 기초 통계정보 확인하기

3-1. 기초통계량 : describe()

3-2. 데이터 개수 확인 : count()

3-3. 각 열의 고유값 개수 : value_counts()

3-4 기초통계량 직접 계산하기 : mean(),median(),max(), min(), std()

3-5. 상관계수(문자열은 제외) : corr()

4. 판다스 내장 그래프 도구 활용

4-1. 선 그래프 : plot()

4-2. 막대그래프 : plot(kind='bar')

4-3. 히스토그램 : plot(kind='hist')

4-4. 산점도 : plot(kind='scatter')

4-5. 박스플롯(box-plot) : plot(kind='box')

Reference

공유하기

댓글남기기

참고

[서평] 한가지에 집중하라!, The One Thing - 게리 켈러, 제이 파파산

[Hive] Hive CREATE TABLE 다양한 옵션 정리

[Hive] Hive Table 다루기 (DESCRIBE, CREATE, LOAD, INSERT, DROP, VIEW, CTAS)

[Linux] Linux 기본 명령어 정리

1. 데이터 내용 미리보기 : `head()`, `tail()`

2-1. `shape`

2-2. `info()`

2-3. `dtypes`

3-1. 기초통계량 : `describe()`

3-2. 데이터 개수 확인 : `count()`

3-3. 각 열의 고유값 개수 : `value_counts()`

3-4 기초통계량 직접 계산하기 : `mean()`,`median()`,`max()`, `min()`, `std()`

3-5. 상관계수(문자열은 제외) : `corr()`

4-1. 선 그래프 : `plot()`

4-2. 막대그래프 : `plot(kind='bar')`

4-3. 히스토그램 : `plot(kind='hist')`

4-4. 산점도 : `plot(kind='scatter')`

4-5. 박스플롯(box-plot) : `plot(kind='box')`