[AI #18] Hands-on introduction to AI/deep learning: applying morphological analysis and classifying text (spam, etc.)

Projects / AI, 2017. 8. 19. 16:47


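This post is a hands-on introduction to artificial intelligence.

It is not about building algorithms; it is about how to use algorithms that already exist.

You do not need to know how to build a car in order to drive one. Knowing helps, of course, but the two are separate fields.

This post covers morphological analysis of Korean text and text classification built on it.

The contents are as follows.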

=========================================================================================

1. Installing the programs

- Based on Anaconda

- conda install openjdk

- conda install python

- pip install konlpy

- If a JVM-related error occurs after installing the above, also install the following

. jdk-8u144-windows-x64.exe ==> http://docs.oracle.com/cd/E19182-01/820-7851/inst_cli_jdk_javahome_t/index.html

- Installing and running Jupyter Notebook

. pip install jupyter

. Run: type "jupyter notebook" at the Anaconda Prompt ==> the notebook opens ( http://localhost:8888/tree# )

. Troubleshooting: if nothing happens when you enter code in a cell and run it, switch the web browser to Chrome.

. To enable other options, uncomment the relevant entries in the IPython config file, found in the folder below

- C:\Users\dhp\.ipython\profile_default

2. Korean morphological analysis [ # 170819 10 toji_count ]

- Counts how often each word appears

- Has many applications, such as telling novel genres or types of writing apart

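3. Apply morphological analysis to the text of the novel "Toji" and print it, then print the words closest to a given word => [ # 170820 1 word2vec-toji ]

- The two lines below run fine in Jupyter but do not work when run from this editor (PyCharm); this still needs checking. One possible cause: Jupyter echoes the value of the last expression in a cell, while a script shows nothing unless the result is passed to print().

model1 = word2vec.Word2Vec.load("toji.model")

model1.most_similar(positive=["땅"])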

4. Classifying text with a Bayesian filter ==> [ #bayes170820 ] [#170820 3bayes_test]

- For example, filtering advertising spam

5. Text classification with an MLP (multilayer perceptron) ==> [#170826 1 mlp2-seq] [# 170826 2 mlp3-classify]

- For the files (data.json, data-mini.json, word-dic.json), see "http://wikibook.co.kr/python-machine-learning/"

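6. Next Step

- Analyzing sentence similarity with n-grams (the listing uses Levenshtein distance) ==> [#170826 3 lev-distance]

- Generating sentences with Markov chains and LSTM ==> 170826 5 markov , 170826 4 lstm-text-gen

- Building a chatbot => 170826 6 chatbot

- Images and deep learning

. Detecting similar images

. Classifying Caltech 101 images with a CNN

. Recognizing faces with OpenCV

. Image OCR: recognizing consecutive characters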

7. References

=========================================================================================
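The installation steps in section 1 can be checked with a short script before running the listings. This is only a sketch: the helper name check_environment is made up for illustration, and finding java on the PATH is a rough stand-in for a working JVM, since KoNLPy may locate Java by other means.

```python
import importlib.util
import shutil


def check_environment(modules=("konlpy", "jupyter")):
    """Report which packages are importable and whether java is on the PATH."""
    # find_spec returns None for a missing top-level package instead of raising
    report = {name: importlib.util.find_spec(name) is not None for name in modules}
    # Rough proxy only: a JVM may be installed without being on the PATH
    report["java"] = shutil.which("java") is not None
    return report


for name, found in check_environment().items():
    print(name, "found" if found else "missing")
```

If konlpy is reported as missing, repeating "pip install konlpy" inside the same Anaconda environment is the first thing to try.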

[ # 170819 10 toji_count ]

import codecs
from bs4 import BeautifulSoup
from konlpy.tag import Twitter  # newer KoNLPy releases call this class Okt

# Open the file with UTF-16 encoding and extract the text --- (※1)
fp = codecs.open("170819 8.txt", "r", encoding="utf-16")
soup = BeautifulSoup(fp, "html.parser")
body = soup.select_one("text > body")
text = body.getText()

# Process the text one line at a time --- (※2)
twitter = Twitter()
word_dic = {}
lines = text.split("\r\n")
for line in lines:
    malist = twitter.pos(line)
    for word in malist:
        if word[1] == "Noun":  # keep nouns only --- (※3)
            if not (word[0] in word_dic):
                word_dic[word[0]] = 0
            word_dic[word[0]] += 1  # count the occurrence

# Print the most frequently used nouns --- (※4)
keys = sorted(word_dic.items(), key=lambda x: x[1], reverse=True)
for word, count in keys[:50]:
    print("{0}({1}) ".format(word, count), end="")
print()
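The counting loop in the listing above can also be written with collections.Counter. The sketch below assumes the input is already tagged as (word, tag) pairs, the same shape that pos() returns; the sample pairs are made up for illustration and are not taken from the novel.

```python
from collections import Counter


def count_nouns(tagged_lines):
    """Count nouns across lines, where each line is a list of (word, tag) pairs."""
    counter = Counter()
    for tagged in tagged_lines:
        counter.update(word for word, tag in tagged if tag == "Noun")
    return counter


# Hypothetical tagger output standing in for twitter.pos(line)
sample = [
    [("땅", "Noun"), ("을", "Josa"), ("사다", "Verb")],
    [("땅", "Noun"), ("과", "Josa"), ("집", "Noun")],
]
counts = count_nouns(sample)
print(counts.most_common(2))  # [('땅', 2), ('집', 1)]
```

Counter.most_common(n) replaces the manual sort, so the top-50 printout reduces to a loop over counts.most_common(50).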

[ # 170820 1 word2vec-toji ]

"""
The two lines below run fine in Jupyter but do not work when run from this editor (PyCharm).
Needs checking. One possible cause: Jupyter echoes the last expression of a cell,
while a script shows nothing unless the result is passed to print().
model1 = word2vec.Word2Vec.load("toji.model")
model1.most_similar(positive=["땅"])
"""
import codecs
from bs4 import BeautifulSoup
from konlpy.tag import Twitter  # newer KoNLPy releases call this class Okt
from gensim.models import word2vec

# Open the file with UTF-16 encoding and extract the text --- (※1)
fp = codecs.open("170819 8.txt", "r", encoding="utf-16")
soup = BeautifulSoup(fp, "html.parser")
body = soup.select_one("text > body")
text = body.getText()

# Process the text one line at a time --- (※2)
twitter = Twitter()
results = []
lines = text.split("\r\n")
for line in lines:
    # Morphological analysis --- (※3)
    # Use the base form of each word
    malist = twitter.pos(line, norm=True, stem=True)
    r = []
    for word in malist:
        # Skip endings, particles, and punctuation
        if not word[1] in ["Josa", "Eomi", "Punctuation"]:
            r.append(word[0])
    rl = (" ".join(r)).strip()
    results.append(rl)
    print(rl)

# Write the result to a file --- (※4)
wakati_file = 'toji.wakati'
with open(wakati_file, 'w', encoding='utf-8') as fp:
    fp.write("\n".join(results))

# Build the Word2Vec model --- (※5)
data = word2vec.LineSentence(wakati_file)
# Note: gensim 4.x renamed the size parameter to vector_size
model = word2vec.Word2Vec(data,
    size=200, window=10, hs=1, min_count=2, sg=1)
model.save("toji.model")
print("ok")

model1 = word2vec.Word2Vec.load("toji.model")
# Query through model.wv and wrap in print() so a plain script shows the result
print(model1.wv.most_similar(positive=["땅"]))
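The similarity query at the end of the listing ranks words by the cosine similarity of their vectors. The sketch below shows that ranking on hand-written toy vectors; the vectors, the English words, and this standalone most_similar function are illustrative assumptions and have nothing to do with the trained toji.model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=2):
    """Rank every other word by cosine similarity to the given word."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 3-dimensional vectors, invented for illustration
toy_vectors = {
    "land": [0.9, 0.1, 0.0],
    "field": [0.8, 0.2, 0.1],
    "house": [0.1, 0.9, 0.2],
}
for other, score in most_similar(toy_vectors, "land"):
    print(other, round(score, 3))
```

A real model works the same way in principle, only with 200-dimensional vectors learned from the text instead of hand-picked numbers.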

[ #bayes170820 ]

"Classifying text with a Bayesian filter"
import math, sys
from konlpy.tag import Twitter  # newer KoNLPy releases call this class Okt

class BayesianFilter:
    """ Bayesian filter """
    def __init__(self):
        self.words = set()       # every word seen so far
        self.word_dict = {}      # word counts per category
        self.category_dict = {}  # number of texts per category

    # Morphological analysis --- (※1)
    def split(self, text):
        results = []
        twitter = Twitter()
        # Use the base form of each word
        malist = twitter.pos(text, norm=True, stem=True)
        for word in malist:
            # Skip endings, particles, and punctuation
            if not word[1] in ["Josa", "Eomi", "Punctuation"]:
                results.append(word[0])
        return results

    # Count occurrences of words and categories --- (※2)
    def inc_word(self, word, category):
        # Add the word to the category
        if not category in self.word_dict:
            self.word_dict[category] = {}
        if not word in self.word_dict[category]:
            self.word_dict[category][word] = 0
        self.word_dict[category][word] += 1
        self.words.add(word)

    def inc_category(self, category):
        # Count the category
        if not category in self.category_dict:
            self.category_dict[category] = 0
        self.category_dict[category] += 1

    # Learn from a text --- (※3)
    def fit(self, text, category):
        """ Learn from a text """
        word_list = self.split(text)
        for word in word_list:
            self.inc_word(word, category)
        self.inc_category(category)

    # Score a list of words --- (※4)
    def score(self, words, category):
        score = math.log(self.category_prob(category))
        for word in words:
            score += math.log(self.word_prob(word, category))
        return score

    # Predict --- (※5)
    def predict(self, text):
        best_category = None
        max_score = -sys.maxsize
        words = self.split(text)
        score_list = []
        for category in self.category_dict.keys():
            score = self.score(words, category)
            score_list.append((category, score))
            if score > max_score:
                max_score = score
                best_category = category
        return best_category, score_list

    # Get how often a word appears within a category
    def get_word_count(self, word, category):
        if word in self.word_dict[category]:
            return self.word_dict[category][word]
        else:
            return 0

    # Compute the category probability
    def category_prob(self, category):
        sum_categories = sum(self.category_dict.values())
        category_v = self.category_dict[category]
        return category_v / sum_categories

    # Compute the word's rate of occurrence within a category --- (※6)
    def word_prob(self, word, category):
        n = self.get_word_count(word, category) + 1  # add 1 so unseen words never give log(0) --- (※6a)
        d = sum(self.word_dict[category].values()) + len(self.words)
        return n / d
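The +1 in word_prob (※6a) is add-one smoothing: a word never seen in a category still gets a small non-zero probability, so math.log never receives 0. The sketch below reproduces that formula on its own; the function name smoothed_prob and the English toy counts are assumptions made for illustration.

```python
import math


def smoothed_prob(word, counts, vocab_size):
    """(count + 1) / (total words in category + vocabulary size)."""
    total = sum(counts.values())
    return (counts.get(word, 0) + 1) / (total + vocab_size)


# Toy counts for one category; vocab_size counts distinct words over all categories
ad_counts = {"sale": 3, "free": 1}
vocab_size = 5

p_seen = smoothed_prob("sale", ad_counts, vocab_size)       # (3 + 1) / (4 + 5)
p_unseen = smoothed_prob("meeting", ad_counts, vocab_size)  # (0 + 1) / (4 + 5)
print(p_seen, p_unseen)
print(math.log(p_unseen))  # finite, because the probability is not zero
```

Without the +1 the unseen word would have probability 0, and a single such word would make the whole category score undefined.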

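[#170820 3bayes_test]

from bayes170820 import BayesianFilter
bf = BayesianFilter()

# Train on sample texts (category "광고" = advertisement, "중요" = important)
bf.fit("파격 세일 - 오늘까지만 30% 할인", "광고")    # huge sale, 30% off today only
bf.fit("쿠폰 선물 & 무료 배송", "광고")              # coupon gift & free shipping
bf.fit("현데계 백화점 세일", "광고")                 # department store sale
bf.fit("봄과 함께 찾아온 따뜻한 신제품 소식", "광고")  # news of new products for spring
bf.fit("인기 제품 기간 한정 세일", "광고")            # limited-time sale on popular items
bf.fit("오늘 일정 확인", "중요")                     # check today's schedule
bf.fit("프로젝트 진행 상황 보고","중요")              # project status report
bf.fit("회의 일정이 등록되었습니다.","중요")          # a meeting has been scheduled
bf.fit("오늘 일정이 없습니다.","중요")               # nothing scheduled today

# Predict ("clearance discount, free shipping")
pre, scorelist = bf.predict("재고 정리 할인, 무료 배송")
print("Result =", pre)
print(scorelist)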
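The scores in scorelist are sums of log probabilities, so they are negative and awkward to compare by eye. One way to read them, sketched below, is to rescale them into shares that sum to 1. The function name normalize_log_scores and the sample numbers are made up for illustration; they are not output from the run above.

```python
import math


def normalize_log_scores(score_list):
    """Turn (category, log score) pairs into shares that sum to 1."""
    # Subtract the maximum first so exp() does not underflow on very negative scores
    max_score = max(score for _, score in score_list)
    exps = [(category, math.exp(score - max_score)) for category, score in score_list]
    total = sum(value for _, value in exps)
    return {category: value / total for category, value in exps}


# Hypothetical log scores in the same shape as scorelist
sample_scores = [("ad", -10.0), ("important", -12.0)]
shares = normalize_log_scores(sample_scores)
for category, share in shares.items():
    print(category, round(share, 3))
```

The category with the highest log score always keeps the largest share, so the prediction itself does not change.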

[#170826 1 mlp2-seq]

import os, glob, json

root_dir = "./newstext"
dic_file = root_dir + "/word-dic.json"
data_file = root_dir + "/data.json"
data_file_min = root_dir + "/data-mini.json"

# Split the text into words and convert them to IDs ---(※1)
word_dic = { "_MAX": 0 }
def text_to_ids(text):
    text = text.strip()
    words = text.split(" ")
    result = []
    for n in words:
        n = n.strip()
        if n == "": continue
        if not n in word_dic:
            # New word: assign the next free ID
            wid = word_dic[n] = word_dic["_MAX"]
            word_dic["_MAX"] += 1
            print(wid, n)
        else:
            wid = word_dic[n]
        result.append(wid)
    print(result)
    return result

# Read a file and return its list of word IDs ---(※2)
def file_to_ids(fname):
    with open(fname, "r") as f:
        text = f.read()
        return text_to_ids(text)

# Register every word in the dictionary --- (※3)
def register_dic():
    files = glob.glob(root_dir+"/*/*.wakati", recursive=True)
    for i in files:
        file_to_ids(i)

# Count the words inside a file --- (※4)
def count_file_freq(fname):
    cnt = [0 for n in range(word_dic["_MAX"])]
    with open(fname,"r") as f:
        text = f.read().strip()
        ids = text_to_ids(text)
        for wid in ids:
            cnt[wid] += 1
    return cnt

# Read the files of each category --- (※5)
def count_freq(limit = 0):
    X = []
    Y = []
    max_words = word_dic["_MAX"]
    cat_names = []
    for cat in os.listdir(root_dir):
        cat_dir = root_dir + "/" + cat
        if not os.path.isdir(cat_dir): continue
        cat_idx = len(cat_names)
        cat_names.append(cat)
        files = glob.glob(cat_dir+"/*.wakati")
        i = 0
        for path in files:
            print(path)
            cnt = count_file_freq(path)
            X.append(cnt)
            Y.append(cat_idx)
            if limit > 0:
                if i > limit: break
                i += 1
    return X,Y

# Build the word dictionary --- (※5)
if os.path.exists(dic_file):
    word_dic = json.load(open(dic_file))
else:
    register_dic()
    json.dump(word_dic, open(dic_file,"w"))

# Write the vectors to files --- (※6)
# Build a small data set for testing
X, Y = count_freq(20)
json.dump({"X": X, "Y": Y}, open(data_file_min,"w"))

# Build the data set from all the data
X, Y = count_freq()
json.dump({"X": X, "Y": Y}, open(data_file,"w"))
print("ok")
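What the listing above produces for each file is a count vector: one slot per word ID, holding how often that word appears in the file. The sketch below shows the same idea on two short English strings; build_vocab, to_count_vector, and the sample documents are assumptions made for illustration, not part of the original script.

```python
def build_vocab(texts):
    """Assign each distinct word the next free integer ID."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def to_count_vector(text, vocab):
    """Return a fixed-length list with one count per vocabulary word."""
    vec = [0] * len(vocab)
    for word in text.split():
        if word in vocab:
            vec[vocab[word]] += 1
    return vec


# Two toy documents, already split on spaces like a .wakati file
docs = ["sale today sale", "meeting today"]
vocab = build_vocab(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]
print(vocab)    # {'sale': 0, 'today': 1, 'meeting': 2}
print(vectors)  # [[2, 1, 0], [0, 1, 1]]
```

Every vector has the same length as the vocabulary, which is why the classifier in the next listing needs max_words to match the size of word-dic.json.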

[# 170826 2 mlp3-classify]

# Note: keras.wrappers.scikit_learn and keras.utils.np_utils belong to the Keras
# releases of the time; they may not be available in newer Keras versions.
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
from sklearn import model_selection, metrics
import json

max_words = 56681  # number of input words: see word-dic.json
nb_classes = 9     # 9 categories
batch_size = 64
nb_epoch = 20

# Build the MLP model --- (※1)
def build_model():
    model = Sequential()
    model.add(Dense(512, input_shape=(max_words,)))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(nb_classes))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy',
        optimizer='adam',
        metrics=['accuracy'])
    return model

# Load the data --- (※2)
data = json.load(open("./newstext/data-mini.json"))
#data = json.load(open("./newstext/data.json"))
X = data["X"]  # word-count vectors representing the texts
Y = data["Y"]  # category labels

# Train --- (※3)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
Y_train = np_utils.to_categorical(Y_train, nb_classes)
model = KerasClassifier(
    build_fn=build_model,
    epochs=nb_epoch,  # Keras 2 calls this parameter epochs (nb_epoch was the Keras 1 name)
    batch_size=batch_size)
model.fit(X_train, Y_train)
print(len(X_train),len(Y_train))

# Predict --- (※4)
y = model.predict(X_test)
ac_score = metrics.accuracy_score(Y_test, y)
cl_report = metrics.classification_report(Y_test, y)
print("Accuracy =", ac_score)
print("Report =\n", cl_report)
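In the listing above, the training labels go through to_categorical while Y_test stays as plain integers. The reason is that the softmax layer is trained against one-hot rows, whereas predict returns class indices that can be compared with integer labels directly. The sketch below shows both conversions in plain Python; to_one_hot and argmax are illustrative helpers, not Keras functions.

```python
def to_one_hot(labels, num_classes):
    """Convert integer labels into one-hot rows, e.g. 2 -> [0, 0, 1]."""
    return [[1 if i == label else 0 for i in range(num_classes)]
            for label in labels]


def argmax(row):
    """Return the index of the largest value, i.e. the predicted class."""
    return max(range(len(row)), key=lambda i: row[i])


labels = [0, 2, 1]
one_hot = to_one_hot(labels, 3)
print(one_hot)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]

# A made-up softmax output row: class 1 has the highest probability
print(argmax([0.1, 0.7, 0.2]))  # 1
```

Applying argmax to each one-hot row gives back the original integer labels, which is the form accuracy_score expects on both sides.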

[#170826 3 lev-distance]

# Computing the Levenshtein distance
def calc_distance(a, b):
    ''' Compute the Levenshtein distance between a and b '''
    if a == b: return 0
    a_len = len(a)
    b_len = len(b)
    if a == "": return b_len
    if b == "": return a_len
    # Prepare a 2-D table of size (a_len+1, b_len+1) --- (※1)
    matrix = [[] for i in range(a_len+1)]
    for i in range(a_len+1):  # initialize with zeros
        matrix[i] = [0 for j in range(b_len+1)]
    # Set the initial values for the first column and first row
    for i in range(a_len+1):
        matrix[i][0] = i
    for j in range(b_len+1):
        matrix[0][j] = j
    # Fill in the table --- (※2)
    for i in range(1, a_len+1):
        ac = a[i-1]
        for j in range(1, b_len+1):
            bc = b[j-1]
            cost = 0 if (ac == bc) else 1
            matrix[i][j] = min([
                matrix[i-1][j] + 1,      # delete a character from a
                matrix[i][j-1] + 1,      # insert a character into a
                matrix[i-1][j-1] + cost  # substitute a character
            ])
    return matrix[a_len][b_len]

# Distance between "가나다라" and "가마바라" --- (※3)
print(calc_distance("가나다라","가마바라"))

# Example run: sort station-like names by distance from the first one
samples = ["신촌역","신천군","신천역","신발","마곡역"]
base = samples[0]
r = sorted(samples, key = lambda n: calc_distance(base, n))
for n in r:
    print(calc_distance(base, n), n)
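The outline lists this step as n-gram similarity, while the listing above covers only the Levenshtein distance. As a companion, here is a minimal character n-gram sketch: it measures the share of one string's n-grams that also occur in the other. The function names and the sample words are assumptions made for illustration, and other definitions of n-gram similarity exist.

```python
def ngram(text, n=2):
    """Split text into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_similarity(a, b, n=2):
    """Share of a's n-grams that also appear in b (0.0 to 1.0)."""
    a_grams = ngram(a, n)
    b_grams = set(ngram(b, n))
    if not a_grams:
        return 0.0
    shared = sum(1 for gram in a_grams if gram in b_grams)
    return shared / len(a_grams)


print(ngram("night"))                      # ['ni', 'ig', 'gh', 'ht']
print(ngram_similarity("night", "nacht"))  # 0.25, only 'ht' is shared
```

Unlike the edit distance, this score ignores where in the string the shared pieces occur, which makes it more tolerant of reordered phrases in longer sentences.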

[References]

https://www.docker.com/products/docker-toolbox ==> how to install Docker

https://www.youtube.com/playlist?list=PLBXuLgInP-5m_vn9ycXHRl7hlsd1huqmS ==> video lectures

http://wikibook.co.kr/python-machine-learning/ ==> source code

https://www.data.go.kr/main.do

http://konlpy-ko.readthedocs.io/ko/v0.4.3/

https://ithub.korean.go.kr/user/total/database/corpusView.do

License: Attribution, NonCommercial, NoDerivatives
