Data Analysis / Data Types

Nominal data

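1) Simple 'named' or 'labeled' data with no inherent order.

- Unordered categorical data

2) When filling null values, the mean, median, and deviation are meaningless for this kind of data; the mode (most frequent category) is the usual choice.

3) Often referred to simply as categorical data.

e.g. gender, place of residence

4) When there are only two categories, it is also called 'binary' data.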
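
Point 2 can be sketched in pandas as follows. This is a minimal illustration, not a prescribed method: the `city` column and its values are made up, and filling with the mode is only one common option.

```python
import pandas as pd

# Hypothetical nominal column with one missing value
df = pd.DataFrame({"city": ["Seoul", "Busan", None, "Seoul", "Incheon"]})

# A mean or median cannot be computed on labels, so use the mode instead
most_frequent = df["city"].mode()[0]
df["city_filled"] = df["city"].fillna(most_frequent)

# One-hot encoding keeps the categories unordered for later modeling
dummies = pd.get_dummies(df["city_filled"], prefix="city")
print(df)
print(dummies.columns.tolist())
```

One-hot encoding is used here instead of integer codes because integer codes would suggest an order that nominal data does not have.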

Ordinal data

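1) Data with a meaningful order.

- Ordered categorical data

e.g. frequency, satisfaction, happiness, pain level

2) Converting string-typed values to numbers requires care: the assigned numbers must preserve the order, and the gaps between them carry no meaning.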
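
The caution in point 2 can be illustrated with a small sketch. The labels `low`, `medium`, and `high` are hypothetical; the point is only that the order has to be declared explicitly, because default alphabetical ordering would give high < low < medium.

```python
import pandas as pd

# Hypothetical satisfaction survey answers stored as strings
answers = pd.Series(["low", "high", "medium", "low", "high"])

# Declare the order explicitly instead of relying on alphabetical order
order = ["low", "medium", "high"]
satisfaction = pd.Categorical(answers, categories=order, ordered=True)

# Integer codes follow the declared order: low=0, medium=1, high=2
codes = pd.Series(satisfaction.codes)
print(codes.tolist())

# The median is meaningful for ordinal data
# (odd sample size here, so it maps back to an actual label)
median_label = order[int(codes.median())]
print(median_label)
```

The codes only rank the answers. A step from 0 to 1 is not guaranteed to be "as large" as a step from 1 to 2, so a mean of these codes should be read with caution.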

Interval data

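1) Ordered data in which the differences between values are meaningful.

- distance between two entities

2) The mean, median, and mode can all be used.

3) There is no fixed starting point, i.e. no true zero.

4) Supports regression analysis, descriptive statistics, and similar methods.

e.g. time-based data, shoe size, etc.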
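
Why the missing true zero matters can be shown with a short sketch. Temperature in Celsius is used as the interval variable here; it is an assumption added for illustration and is not one of the examples listed above.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit; both are interval scales."""
    return celsius * 9 / 5 + 32

a_c, b_c, c_c = 10.0, 20.0, 40.0
a_f, b_f, c_f = c_to_f(a_c), c_to_f(b_c), c_to_f(c_c)

# Ratios of raw values change with the unit, so "twice as hot" is meaningless
ratio_c = b_c / a_c
ratio_f = b_f / a_f

# Ratios of differences do not depend on the unit, so distances are meaningful
diff_ratio_c = (c_c - b_c) / (b_c - a_c)
diff_ratio_f = (c_f - b_f) / (b_f - a_f)

print(ratio_c, ratio_f)
print(diff_ratio_c, diff_ratio_f)
```

The first pair of numbers differs between the two units, while the second pair is identical, which is what "distance between two entities" means in practice.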

Ratio data

1) Unlike interval data, it has an origin (a fixed starting point; true zero).

e.g. age, weight
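
Because a true zero exists, statements such as "twice as heavy" hold in any unit. A brief sketch with made-up weights and an approximate conversion factor:

```python
KG_TO_LB = 2.20462  # approximate conversion factor

# Hypothetical weights in kilograms
weights_kg = [40.0, 80.0]
weights_lb = [w * KG_TO_LB for w in weights_kg]

# The ratio between the two values is the same in both units
ratio_kg = weights_kg[1] / weights_kg[0]
ratio_lb = weights_lb[1] / weights_lb[0]
print(ratio_kg, ratio_lb)
```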

Source: https://www.questionpro.com/blog/nominal-ordinal-interval-ratio/

Discrete & Continuous

1) Interval and ratio data are each either discrete or continuous.

Qualitative (Categorical)
- Nominal
- Ordinal

Quantitative (Numerical)
- Discrete: non-continuous data that takes only distinct, exact values
- Continuous: data that can take any value within a range, e.g. height in cm
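
A rough way to tell the two apart in practice is to check whether every value is a whole number. The helper below is a hypothetical heuristic, not a rule from the source: a continuous measurement that was rounded to integers would also pass the check.

```python
def looks_discrete(values):
    """Heuristic: treat a numeric column as discrete if every value is a whole number."""
    return all(float(v).is_integer() for v in values)

# Hypothetical samples
children_count = [0, 2, 1, 3]          # counts: discrete
height_cm = [172.5, 160.2, 181.0]      # measurements: continuous

print(looks_discrete(children_count))
print(looks_discrete(height_cm))
```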

Usage

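import numpy as np
import pandas as pd

# `train` is assumed to be an already-loaded pandas DataFrame (e.g. via pd.read_csv)
data = []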

for f in train.columns:
    # Defining the role
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'
         
    # Defining the level
    if 'bin' in f or f == 'target':
        level = 'binary'
    elif 'cat' in f or f == 'id':
        level = 'nominal'
    elif train[f].dtype == np.float64:
        level = 'interval'
    elif train[f].dtype == np.int64:
        level = 'ordinal'
    else:
        # Fallback so `level` is always defined (e.g. for object or int32 columns)
        level = 'unknown'
        
    # Initialize keep to True for all variables except for id
    keep = True
    if f == 'id':
        keep = False
    
    # Defining the data type 
    dtype = train[f].dtype
    
    # Creating a Dict that contains all the metadata for the variable
    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep,
        'dtype': dtype
    }
    data.append(f_dict)
    
meta = pd.DataFrame(data, columns=['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace=True)

# Summary statistics for the interval variables that are kept
v = meta[(meta.level == 'interval') & (meta.keep)].index
train[v].describe()
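
Once the metadata table exists, it can be queried in other ways as well. The sketch below is self-contained and uses a tiny stand-in `meta` table with made-up variable names, so it does not depend on `train`. It counts variables per role and level and selects the kept nominal variables.

```python
import pandas as pd

# Tiny stand-in for the metadata table (variable names are hypothetical)
meta = pd.DataFrame(
    {
        "varname": ["id", "target", "f1_bin", "f2_cat", "f3", "f4"],
        "role": ["id", "target", "input", "input", "input", "input"],
        "level": ["nominal", "binary", "binary", "nominal", "interval", "ordinal"],
        "keep": [False, True, True, True, True, True],
    }
).set_index("varname")

# Count variables per role and level
counts = meta.groupby(["role", "level"]).size().reset_index(name="count")
print(counts)

# Select the kept nominal variables, e.g. as candidates for one-hot encoding
nominal_vars = meta[(meta["level"] == "nominal") & (meta["keep"])].index.tolist()
print(nominal_vars)
```

Keeping the measurement level in one table means each preprocessing step can pick its columns by level instead of hard-coding column names.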

