[Artificial Intelligence #13] q_net_frozenlake / cartpole
This post is about implementing artificial intelligence (Deep Reinforcement Learning).
The post is organized as follows.
================================================
Summary
- Implement Q-learning with a neural network (a Q-network) instead of a Q-table (the Q-table approach needs exponentially more memory as the state space grows).
  ==> The Q-table approach is impractical for real-life problems, so a neural-network approach is needed (a rough size comparison is sketched right after this outline).
1. q_net_frozenlake
   - Replace the Q-table with a network.
2. 07_3_dqn_2015_cartpole
   - Q-network issues
   1) With too little data the accuracy is poor; training on only two samples can produce a completely different line.
      . Go deep (use a deep network).
      . Experience replay: after each action, store the state, action, etc. in a buffer, then sample from it at random (evenly) for training.
   2) The target moves (the same network produces both prediction and target, so changing the prediction also shifts the target) => like moving the bullseye right after shooting the arrow.
      . Build one more network (train the main network, and copy its weights into the target network periodically before training).
3. Next Step
   ==> Implement Q-learning with a neural network.
4. References
=================================================
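The memory argument above can be made concrete with a rough back-of-the-envelope sketch (my own illustration, not part of the original lecture code): a 4x4 FrozenLake Q-table is tiny, but a table becomes hopeless once the state is continuous or high-dimensional, which is why a Q-network that generalizes across states is used instead.

# Rough illustration (assumes 4 actions everywhere) of why Q-tables do not scale.
import numpy as np

n_actions = 4
frozenlake_states = 16                      # 4x4 grid
q_table = np.zeros([frozenlake_states, n_actions])
print(q_table.size)                         # 64 entries -- trivially small

# A 10x10x10x10 discretization of CartPole's 4-dimensional continuous state:
cartpole_states = 10 ** 4
print(cartpole_states * n_actions)          # 40,000 entries, and the grid is still coarse

# With images (e.g. 84x84 binary pixels) the number of distinct states is 2 ** (84 * 84),
# far beyond any table -- a network that generalizes across states is required.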
[ 06_q_net_frozenlake ]
# 06_q_net_frozenlake
'''
This code is based on
https://github.com/hunkim/DeepRL-Agents
'''
import gym
import numpy as np
import matplotlib.pyplot as plt
import time
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # default value = 0 From http://stackoverflow.com/questions/35911252/disable-tensorflow-debugging-information
import tensorflow as tf
env = gym.make('FrozenLake-v0')
# Input and output size based on the Env
input_size = env.observation_space.n
output_size = env.action_space.n
learning_rate = 0.1
# These lines establish the feed-forward part of the network used to choose actions
X = tf.placeholder(shape=[1, input_size], dtype=tf.float32) # state input
W = tf.Variable(tf.random_uniform([input_size, output_size], 0, 0.01)) # weight
Qpred = tf.matmul(X, W) # Out Q prediction
Y = tf.placeholder(shape=[1, output_size], dtype=tf.float32) # Y label
loss = tf.reduce_sum(tf.square(Y-Qpred))
train = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)
# Set Q-learning parameters
dis = .99
num_episodes = 2000
# create lists to contain total rewards and steps per episode
rList = []
def one_hot(x):
    # Encode the discrete state as a 1 x 16 one-hot vector
    return np.identity(16)[x:x + 1]

start_time = time.time()
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        # Reset environment and get first new observation
        s = env.reset()
        e = 1. / ((i / 50) + 10)
        rAll = 0
        done = False
        local_loss = []

        # The Q-network learning algorithm
        while not done:
            # Choose an action greedily (with a chance of random action)
            # from the Q-network
            Qs = sess.run(Qpred, feed_dict={X: one_hot(s)})
            if np.random.rand(1) < e:
                a = env.action_space.sample()
            else:
                a = np.argmax(Qs)

            # Get new state and reward from environment
            s1, reward, done, _ = env.step(a)
            if done:
                # Update Q, and no Qs1, since it's a terminal state
                Qs[0, a] = reward
            else:
                # Obtain the Qs1 values by feeding the new state through our network
                Qs1 = sess.run(Qpred, feed_dict={X: one_hot(s1)})
                # Update Q
                Qs[0, a] = reward + dis * np.max(Qs1)

            # Train our network using target (Y) and predicted Q (Qpred) values
            sess.run(train, feed_dict={X: one_hot(s), Y: Qs})

            rAll += reward
            s = s1
        rList.append(rAll)

print("--- %s seconds ---" % (time.time() - start_time))
print("Success rate: " + str(sum(rList) / num_episodes))
# plt.bar(range(len(rList)), rList, color="blue")
plt.bar(range(len(rList)), rList, color='b', alpha=0.4)
plt.show()
[ 07_3_dqn_2015_cartpole ]
# 07_3_dqn_2015_cartpole
"""
This code is based on
https://github.com/hunkim/DeepRL-Agents
CF https://github.com/golbin/TensorFlow-Tutorials
https://github.com/dennybritz/reinforcement-learning/blob/master/DQN/dqn.py

Q-network issues
1. With too little data the accuracy is poor; training on only two samples can produce a completely different line.
   - Go deep (use a deep network).
   - Experience replay: after each action, store the state, action, etc. in a buffer, then sample from it at random (evenly) for training.
2. The target moves (the same network produces prediction and target, so changing the prediction also shifts the target)
   => like moving the bullseye right after shooting the arrow.
   - Build one more (target) network.
"""
import numpy as np
import tensorflow as tf
import random
from collections import deque
import dqn  # dqn.py defines the DQN class used below (see the sketch after this listing)
import gym
from gym import wrappers
env = gym.make('CartPole-v0')
# Constants defining our neural network
input_size = env.observation_space.shape[0]
output_size = env.action_space.n
dis = 0.9
REPLAY_MEMORY = 50000
def replay_train(mainDQN, targetDQN, train_batch):
    x_stack = np.empty(0).reshape(0, input_size)
    y_stack = np.empty(0).reshape(0, output_size)

    # Get stored information from the buffer
    for state, action, reward, next_state, done in train_batch:
        Q = mainDQN.predict(state)

        # terminal?
        if done:
            Q[0, action] = reward
        else:
            # get target from target DQN (Q')
            Q[0, action] = reward + dis * np.max(targetDQN.predict(next_state))

        y_stack = np.vstack([y_stack, Q])
        x_stack = np.vstack([x_stack, state])

    # Train our network using target and predicted Q values on each episode
    return mainDQN.update(x_stack, y_stack)
def ddqn_replay_train(mainDQN, targetDQN, train_batch):
    """Double DQN implementation

    :param mainDQN: main DQN
    :param targetDQN: target DQN
    :param train_batch: minibatch for train
    :return: loss
    """
    x_stack = np.empty(0).reshape(0, mainDQN.input_size)
    y_stack = np.empty(0).reshape(0, mainDQN.output_size)

    # Get stored information from the buffer
    for state, action, reward, next_state, done in train_batch:
        Q = mainDQN.predict(state)

        # terminal?
        if done:
            Q[0, action] = reward
        else:
            # Double DQN: y = r + gamma * targetDQN(s')[a] where
            # a = argmax(mainDQN(s'))
            Q[0, action] = reward + dis * targetDQN.predict(next_state)[0, np.argmax(mainDQN.predict(next_state))]

        y_stack = np.vstack([y_stack, Q])
        x_stack = np.vstack([x_stack, state])

    # Train our network using target and predicted Q values on each episode
    return mainDQN.update(x_stack, y_stack)
def get_copy_var_ops(*, dest_scope_name="target", src_scope_name="main"):
    # Copy variables from src_scope to dest_scope
    op_holder = []

    src_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=src_scope_name)
    dest_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=dest_scope_name)

    for src_var, dest_var in zip(src_vars, dest_vars):
        op_holder.append(dest_var.assign(src_var.value()))

    return op_holder
def bot_play(mainDQN, env=env):
    # See our trained network in action
    state = env.reset()
    reward_sum = 0
    while True:
        env.render()
        action = np.argmax(mainDQN.predict(state))
        state, reward, done, _ = env.step(action)
        reward_sum += reward
        if done:
            print("Total score: {}".format(reward_sum))
            break
def main():
    max_episodes = 5000
    # store the previous observations in replay memory
    replay_buffer = deque()

    with tf.Session() as sess:
        mainDQN = dqn.DQN(sess, input_size, output_size, name="main")
        targetDQN = dqn.DQN(sess, input_size, output_size, name="target")
        tf.global_variables_initializer().run()

        # initial copy q_net -> target_net
        copy_ops = get_copy_var_ops(dest_scope_name="target", src_scope_name="main")
        sess.run(copy_ops)

        for episode in range(max_episodes):
            e = 1. / ((episode / 10) + 1)
            done = False
            step_count = 0
            state = env.reset()

            while not done:
                if np.random.rand(1) < e:
                    action = env.action_space.sample()
                else:
                    # Choose an action greedily from the Q-network
                    action = np.argmax(mainDQN.predict(state))

                # Get new state and reward from environment
                next_state, reward, done, _ = env.step(action)
                if done:  # Penalty
                    reward = -100

                # Save the experience to our buffer
                replay_buffer.append((state, action, reward, next_state, done))
                if len(replay_buffer) > REPLAY_MEMORY:
                    replay_buffer.popleft()

                state = next_state
                step_count += 1
                if step_count > 10000:  # Good enough. Let's move on
                    break

            print("Episode: {} steps: {}".format(episode, step_count))
            if step_count > 10000:
                pass
                # stop at 10,000 steps (to avoid an endless loop)
                # break

            if episode % 10 == 1:  # train every 10 episodes
                # Get a random batch of experiences
                for _ in range(50):
                    minibatch = random.sample(replay_buffer, 10)
                    loss, _ = ddqn_replay_train(mainDQN, targetDQN, minibatch)
                print("Loss: ", loss)
                # copy q_net -> target_net
                sess.run(copy_ops)

        # See our trained bot in action
        env2 = wrappers.Monitor(env, 'gym-results', force=True)
        for i in range(200):
            bot_play(mainDQN, env=env2)
        env2.close()
        # gym.upload("gym-results", api_key="sk_VT2wPcSSOylnlPORltmQ")


if __name__ == "__main__":
    main()
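The script above imports a DQN helper class from dqn.py (part of the lecture repository) that is not reproduced in this post. As a reference, below is a minimal sketch of what such a class needs to provide so that mainDQN.predict / mainDQN.update and the scope-based weight copy work; the hidden-layer size, learning rate, and variable names here are my own assumptions, not the original file.

# Minimal sketch of a dqn.py providing the DQN class assumed above (not the original file).
import numpy as np
import tensorflow as tf


class DQN:
    def __init__(self, session, input_size, output_size, name="main"):
        self.session = session
        self.input_size = input_size
        self.output_size = output_size
        self.net_name = name
        self._build_network()

    def _build_network(self, h_size=10, l_rate=1e-1):
        with tf.variable_scope(self.net_name):
            self._X = tf.placeholder(tf.float32, [None, self.input_size], name="input_x")
            # one small hidden layer is enough for CartPole
            W1 = tf.get_variable("W1", shape=[self.input_size, h_size],
                                 initializer=tf.contrib.layers.xavier_initializer())
            layer1 = tf.tanh(tf.matmul(self._X, W1))
            W2 = tf.get_variable("W2", shape=[h_size, self.output_size],
                                 initializer=tf.contrib.layers.xavier_initializer())
            self._Qpred = tf.matmul(layer1, W2)

        self._Y = tf.placeholder(tf.float32, [None, self.output_size])
        self._loss = tf.reduce_mean(tf.square(self._Y - self._Qpred))
        # restrict the optimizer to this network's own variables
        my_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.net_name)
        self._train = tf.train.AdamOptimizer(learning_rate=l_rate).minimize(
            self._loss, var_list=my_vars)

    def predict(self, state):
        # Reshape the raw observation into a batch of size 1
        x = np.reshape(state, [1, self.input_size])
        return self.session.run(self._Qpred, feed_dict={self._X: x})

    def update(self, x_stack, y_stack):
        # Returns [loss_value, train_op_result], matching the "loss, _ = ..." unpacking above
        return self.session.run([self._loss, self._train],
                                feed_dict={self._X: x_stack, self._Y: y_stack})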
[ References ]
https://www.inflearn.com/course/기본적인-머신러닝-딥러닝-강좌
https://github.com/hunkim/deeplearningzerotoall
https://www.tensorflow.org/api_docs/python/tf/layers
https://www.inflearn.com/course/reinforcement-learning/
[Artificial Intelligence #12] Q-Learning / OpenAI gym / frozenlake
This post is about implementing artificial intelligence (Deep Reinforcement Learning).
The post is organized as follows.
================================================
Summary
- The game is to find a way across a frozen lake without falling into a hole.
- The ice is slippery. If you rely entirely on the directions of a guide, the wind and other uncertainties
  of the environment can still make you slip. Relying on the guide only partially and weighting your own
  judgment more improves the success rate (roughly 1.5% ==> 66%).
- The real world is also often unpredictable because of the surrounding environment; this approach applies to such cases.
1. Install the programs needed to run OpenAI gym games
2. # 01_play_frozenlake_det_windows
   => Move in the direction of the arrow key you press.
   => Run the file from a terminal (so it can receive key input).
   => Go to the folder containing frozenlake_det_windows.py and run it with the python command.
   => The game ends when you fall into a hole or reach the goal.
3. # 03_0_q_table_frozenlake_det
   => At first the agent moves to random spots; afterwards it moves toward the largest Q value, finding a path that avoids the holes.
   => Note (random argmax): if values are tied, pick any of them; if one is larger, go there.
4. # 03_2_q_table_frozenlake_det
   ==> If a random number is below e, move to a random spot; otherwise move toward the larger reward (exploit & exploration).
   ==> Like moving to a new neighborhood: at first you try restaurants at random, and once you know the area you mostly go to the good ones.
   ==> Adding noise to the Q values is another option; it partially keeps the existing data (the e-greedy case above ignores it)
       and amounts to sometimes choosing the next-best action. (A condensed side-by-side of the two approaches appears right after this outline.)
   ==> discount: rewards received later are multiplied by the discount factor (0.99 in the code), lowering their weight.
       This is how the shortest path is found.
5. play_frozenlake_windows
   ==> Keyboard input is not recognized well; needs improvement later.
   ==> On the slippery ice the agent does not move exactly as the keyboard commands; this models not relying entirely on Q's advice.
   ==> It reproduces an environment closer to the real world.
6. # 05_0_q_table_frozenlake
   ==> In the slippery environment ('FrozenLake-v0') you should not follow Q's advice blindly, because the environment can make you slip.
   ==> On the ice, relying entirely on Q as before succeeds about 1.55% of the time; relying on it only partially, about 66%.
7. # 05_q_table_frozenlake
   ==> In the slippery environment ('FrozenLake-v0') you should not follow Q's advice blindly, because the environment can make you slip;
       therefore only a fraction of Q's advice should be blended in:
   ==> Q[state, action] = (1-learning_rate) * Q[state, action] \
       + learning_rate*(reward + dis * np.max(Q[new_state, :]))
   ==> Accuracy improves considerably (1.55% ==> 66%).
       . On the ice: fully trusting Q gives about 1.55%, partially trusting it gives about 66%.
8. Next Step
   ==> Implement Q-learning with a neural network.
9. References
=================================================
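As a condensed side-by-side of the two exploration tricks described in items 4 and 7 above (my own summary of the action-selection lines used in the full listings below, not a separate lab file):

# Side-by-side of the two exploration strategies used in the FrozenLake listings below.
import gym
import numpy as np

env = gym.make('FrozenLake-v0')
Q = np.zeros([env.observation_space.n, env.action_space.n])
state = env.reset()
i = 0  # episode index

# (a) e-greedy: with probability e ignore Q entirely and act at random
e = 1. / ((i // 100) + 1)
if np.random.rand(1) < e:
    action = env.action_space.sample()
else:
    action = np.argmax(Q[state, :])

# (b) random noise: always consult Q, but add decaying noise so that
#     early on near-best (next-best) actions also get explored
action = np.argmax(Q[state, :] + np.random.randn(1, env.action_space.n) / (i + 1))
print(action)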
[ 1. Install the programs needed for OpenAI gym games ]
- Install guide: https://gym.openai.com/docs
- step 1
  . Run the anaconda3 prompt.
- step 2
  . git clone https://github.com/openai/gym ==> download gym
  . cd gym ==> move into the gym folder
  . pip3 install -e . ==> minimal install; use pip3; this installs the downloaded gym onto the PC
- step 3 ==> select the matching python (each package may live under a different python install path, so the right python has to be found and linked)
  . Change the interpreter of the python editor (PyCharm settings)
  . Before: the python in the tensorflow folder
  . After: c:\\user\dhp\appdata3\python.exe
- step 4 ==> add packages ==> if the tensorflow package is needed, it can be installed from PyCharm
  . Press the "+" button at the top right to get the list of installable packages and select tensorflow there.
  . With this, both gym and tensorflow are installed for the python at c:\\user\dhp\appdata3\python.exe.
  . Any packages needed later can be added the same way.
- How to add packages to the python inside the tensorflow environment still needs checking; gym is installed there but does not work properly yet.
- To confirm the installation succeeded, run the code below in PyCharm.
# cartpole test
"""cartpole test
"""
import gym

env = gym.make('CartPole-v0')
env.reset()
for _ in range(100):
    env.render()
    env.step(env.action_space.sample())  # take a random action

for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t + 1))
            break
[ # 01_play_frozenlake_det_windows ]
# 01_play_frozenlake_det_windows
"""
화일 실행은 터미널에서 한다(키 인을 받기위함).
frozenlake_det_windows.py 가 있는 폴더로 가서, python 명령어 실행
홀에 빠지거나,끝으로 가면 게임이 종료된다.
"""
import gym
from gym.envs.registration import register
from colorama import init
from kbhit import KBHit
init(autoreset=True) # Reset the terminal mode to display ansi color
register(
id='FrozenLake-v3',
entry_point='gym.envs.toy_text:FrozenLakeEnv',
kwargs={'map_name' : '4x4', 'is_slippery': False}
)
env = gym.make('FrozenLake-v3') # is_slippery False
env.render() # Show the initial board
key = KBHit()
while True:
action = key.getarrow();
if action not in [0, 1, 2, 3]:
print("Game aborted!")
break
state, reward, done, info = env.step(action)
env.render()
print("State: ", state, "Action: ", action, "Reward: ", reward, "Info: ", info)
if done:
print("Finished with reward", reward)
break
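kbhit.py (imported above) comes from the lecture repository and is not reproduced in this post. As a rough, Windows-only stand-in, the class needs a getarrow() method returning the gym FrozenLake action codes LEFT=0, DOWN=1, RIGHT=2, UP=3; the key-code mapping below is my own simplification, not the original file.

# Minimal Windows-only stand-in for kbhit.KBHit (an assumption, not the original file).
import msvcrt


class KBHit:
    def getarrow(self):
        """Block until a key is pressed; map arrow keys to FrozenLake actions."""
        ch = msvcrt.getch()
        if ch in (b'\x00', b'\xe0'):        # arrow keys send a two-byte sequence
            code = msvcrt.getch()
            mapping = {b'K': 0,             # left  -> LEFT  = 0
                       b'P': 1,             # down  -> DOWN  = 1
                       b'M': 2,             # right -> RIGHT = 2
                       b'H': 3}             # up    -> UP    = 3
            return mapping.get(code, -1)
        return -1                           # any other key aborts the game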
[ # 03_0_q_table_frozenlake_det ]
# 03_0_q_table_frozenlake_det
"""
random argmax: if values are tied, pick any of them; if one is larger, go there
"""
import gym
import numpy as np
import matplotlib.pyplot as plt
from gym.envs.registration import register
import random as pr
def rargmax(vector):  # https://gist.github.com/stober/1943451
    """ Argmax that chooses randomly among eligible maximum indices. """
    m = np.amax(vector)
    indices = np.nonzero(vector == m)[0]
    return pr.choice(indices)

register(
    id='FrozenLake-v3',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '4x4', 'is_slippery': False}
)
env = gym.make('FrozenLake-v3')

# Initialize table with all zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])  # 16 x 4 table

# Set learning parameters
num_episodes = 2000

# create lists to contain total rewards and steps per episode
rList = []
for i in range(num_episodes):
    # Reset environment and get first new observation
    state = env.reset()
    rAll = 0
    done = False

    # The Q-Table learning algorithm
    while not done:
        # random argmax: if values are tied, pick any of them; otherwise go to the largest
        action = rargmax(Q[state, :])

        # Get new state and reward from environment
        new_state, reward, done, _ = env.step(action)

        # Update Q-Table with new knowledge
        Q[state, action] = reward + np.max(Q[new_state, :])

        rAll += reward
        state = new_state
    rList.append(rAll)
print("Success rate: " + str(sum(rList) / num_episodes))
print("Final Q-Table Values")
print("LEFT DOWN RIGHT UP")
print(Q)
plt.bar(range(len(rList)), rList, color="blue")
#plt.bar(range(len(rList)), rList, color='b', alpha=0.4)
plt.show()
[# 03_2_q_table_frozenlake_det]
# 03_2_q_table_frozenlake_det
"""
==> e보다 작으면 임의의 장소로 가고,그렇치 않으면, 큰 리워드방향으로 이동한다(explot &exporation)
==> 이사간 동네에서 초기에는 랜덤하게 식당을 다니고, 파악이 다되면 맞집위주로 찾아간다게
==> 노이즈 값을 주어서, 기존의 data를 일부 반영하는 방법도 있음 ( 상기 e의 경우는 기존 data를 무시하됨)
. 차선책을 선택하는 방법임
==> discount(0.9) 나중에 받을 리워드는 0.9를 곱해서 비중을 낮춘다.
최단거리를 찾는 방법임
"""
import gym
import numpy as np
import matplotlib.pyplot as plt
from gym.envs.registration import register
register(
id='FrozenLake-v3',
entry_point='gym.envs.toy_text:FrozenLakeEnv',
kwargs={'map_name' : '4x4', 'is_slippery': False}
)
env = gym.make('FrozenLake-v3')
# Initialize table with all zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])
# Set learning parameters
dis = .99
num_episodes = 2000
# create lists to contain total rewards and steps per episode
rList = []
for i in range(num_episodes):
    # Reset environment and get first new observation
    state = env.reset()
    rAll = 0
    done = False
    e = 1. / ((i // 100) + 1)  # Python2 & 3
    # e shrinks as training progresses

    # The Q-Table learning algorithm
    while not done:
        # Choose an action by e-greedy
        if np.random.rand(1) < e:
            # below e: move to a random spot
            # (too much random wandering can lower accuracy)
            action = env.action_space.sample()
        else:
            # otherwise: move toward the larger reward
            action = np.argmax(Q[state, :])

        # Get new state and reward from environment
        new_state, reward, done, _ = env.step(action)

        # Update Q-Table with new knowledge using the discount rate
        Q[state, action] = reward + dis * np.max(Q[new_state, :])
        # later rewards are multiplied by the discount factor to lower their weight; this finds the shortest path

        rAll += reward
        state = new_state
    rList.append(rAll)
print("Success rate: " + str(sum(rList) / num_episodes))
print("Final Q-Table Values")
print("LEFT DOWN RIGHT UP")
print(Q)
#plt.bar(range(len(rList)), rList, color="blue")
plt.bar(range(len(rList)), rList, color='b', alpha=0.4)
plt.show()
[# play_frozenlake_windows]
# play_frozenlake_windows
"""keyboard 인식이 잘 않됨, 추후 보완필요
빙판길로, 키보드 조작대로 움직이지 않고, Q 선생의 말에 전적으로 의존하지 않는 상황을 의미함"""
import gym
from gym.envs.registration import register
from colorama import init
from kbhit import KBHit
init(autoreset=True) # Reset the terminal mode to display ansi color
env = gym.make('FrozenLake-v0') # is_slippery True
env.render() # Show the initial board
key = KBHit()
while True:
    action = key.getarrow()
    if action not in [0, 1, 2, 3]:
        print("Game aborted!")
        break

    state, reward, done, info = env.step(action)
    env.render()
    print("State: ", state, "Action: ", action, "Reward: ", reward, "Info: ", info)

    if done:
        print("Finished with reward", reward)
        break
[#05_0_q_table_frozenlake]
#05_0_q_table_frozenlake
"""
미끄러운 환경 ('FrozenLake-v0' ) 에서는 Q 형님의 조언을 그대로 따르면 않된다, 주변환경에 의해 미끄질수 있기때문임
. 빙판에서, 기존과 같이 Q 형님의 말에 전적으로 의존할 경우 1.55% , 일부 의존할경우 66%
"""
import gym
import numpy as np
import matplotlib.pyplot as plt
from gym.envs.registration import register
import random as pr
env = gym.make('FrozenLake-v0')
# Initialize table with all zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])
# Set learning parameters
learning_rate = .85
dis = .99
num_episodes = 2000
# create lists to contain total rewards and steps per episode
rList = []
for i in range(num_episodes):
    # Reset environment and get first new observation
    state = env.reset()
    rAll = 0
    done = False

    # The Q-Table learning algorithm
    while not done:
        # Add decaying random noise to the Q values (explore next-best actions early on)
        action = np.argmax(Q[state, :] + np.random.randn(1, env.action_space.n) / (i + 1))

        # Get new state and reward from environment
        new_state, reward, done, _ = env.step(action)

        # Update Q-Table with new knowledge (no blending: Q's advice is followed fully)
        Q[state, action] = reward + dis * np.max(Q[new_state, :])

        state = new_state
        rAll += reward
    rList.append(rAll)
print("Success rate: " + str(sum(rList) / num_episodes))
print("Final Q-Table Values")
print("LEFT DOWN RIGHT UP")
print(Q)
plt.bar(range(len(rList)), rList, color="blue")
#plt.bar(range(len(rList)), rList, color='b', alpha=0.4)
plt.show()
[# 05_q_table_frozenlake]
# 05_q_table_frozenlake
"""
미끄러운 환경 ('FrozenLake-v0' ) 에서는 Q 형님의 조언을 그대로 따르면 않된다, 주변환경에 의해 미끄질수 있기때문임
따라서 Q 형님의 말을 조금만 반영할 필요가 있음
==> Q[state, action] = (1-learning_rate) * Q[state, action] \
+ learning_rate*(reward + dis * np.max(Q[new_state, :]))
==> 정확도가 어느정도 상승함 (1.55% ==> 66% )
. 빙판에서, Q 형님의 말에 전적으로 의존할 경우 1.55% , 일부 의존할경우 66%
"""
import gym
import numpy as np
import matplotlib.pyplot as plt
from gym.envs.registration import register
import random as pr
register(
id='FrozenLake-v3',
entry_point='gym.envs.toy_text:FrozenLakeEnv',
kwargs={'map_name' : '4x4', 'is_slippery': False}
)
#env = gym.make('FrozenLake-v3')
env = gym.make('FrozenLake-v0')
# Initialize table with all zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])
# Set learning parameters
learning_rate = .85
dis = .99
num_episodes = 2000
# create lists to contain total rewards and steps per episode
rList = []
for i in range(num_episodes):
    # Reset environment and get first new observation
    state = env.reset()
    rAll = 0
    done = False

    # The Q-Table learning algorithm
    while not done:
        # Add decaying random noise to the Q values
        # (e-greedy ignores the existing values when exploring; noise keeps them in play,
        #  i.e. it is a way of choosing the next-best option)
        action = np.argmax(Q[state, :] + np.random.randn(1, env.action_space.n) / (i + 1))

        # Get new state and reward from environment
        new_state, reward, done, _ = env.step(action)

        # Blend in only a fraction of Q's advice
        # Update Q-Table with new knowledge using the learning rate
        Q[state, action] = (1 - learning_rate) * Q[state, action] \
            + learning_rate * (reward + dis * np.max(Q[new_state, :]))

        rAll += reward
        state = new_state
    rList.append(rAll)
print("Success rate: " + str(sum(rList) / num_episodes))
print("Final Q-Table Values")
print("LEFT DOWN RIGHT UP")
print(Q)
#plt.bar(range(len(rList)), rList, color="blue")
plt.bar(range(len(rList)), rList, color='b', alpha=0.4)
plt.show()
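To make the blended update concrete, here is a one-step numeric illustration of my own (not part of the lab code):

# One numeric step of the blended update (illustration only)
learning_rate = 0.85
old_q = 0.2                      # current Q[state, action]
target = 1.0                     # reward + dis * np.max(Q[new_state, :])
new_q = (1 - learning_rate) * old_q + learning_rate * target
print(new_q)                     # 0.88 -- the old value still contributes 15%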
[ References ]
https://www.inflearn.com/course/기본적인-머신러닝-딥러닝-강좌
https://github.com/hunkim/deeplearningzerotoall
https://www.tensorflow.org/api_docs/python/tf/layers
https://www.inflearn.com/course/reinforcement-learning/
[Artificial Intelligence #11] hello-rnn / char-seq-rnn / char-seq-softmax-only / rnn_long_char
This post is about implementing artificial intelligence.
The post is organized as follows.
================================================
1. # lab-12-1-hello-rnn
   Applied when the output of the previous step affects the output of the next step
   - words, related-search suggestions, etc.
2. # lab-12-2-char-seq-rnn
   - RNN applied ==> high accuracy
     . 49 loss: 0.000650434 Prediction: if you want you
     . target y: if you want you
3. # lab-12-3-char-seq-softmax-only
   RNN not applied ==> accuracy is poor
     2999 loss: 0.277323 Prediction: yf you yant you
     target y: if you want you
4. # lab-12-4-rnn_long_char
   error: from __future__ import print_function ==> could not be run here, so it is commented out
   Stacking several layers with MultiRNNCell raises the accuracy
   softmax => reshape is performed
6. Code to explore next (additional)
   ==> lab-12-5-rnn_stock_prediction
       lab-13-1-mnist_using_scope
       lab-13-2-mnist_tensorboard
       lab-13-3-mnist_save_restore
7. References
=================================================
[ #lab-12-1-hello-rnn ]
#lab-12-1-hello-rnn
"""
전 단계의 출력이 다음단계의 출력에 영향을 주는 경우 적용함
- 단어, 연관검색등..
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
# Lab 12 RNN
import tensorflow as tf
import numpy as np
tf.set_random_seed(777) # reproducibility
idx2char = ['h', 'i', 'e', 'l', 'o']
# Teach hello: hihell -> ihello
x_data = [[0, 1, 0, 2, 3, 3]] # hihell
x_one_hot = [[[1, 0, 0, 0, 0], # h 0
[0, 1, 0, 0, 0], # i 1
[1, 0, 0, 0, 0], # h 0
[0, 0, 1, 0, 0], # e 2
[0, 0, 0, 1, 0], # l 3
[0, 0, 0, 1, 0]]] # l 3
y_data = [[1, 0, 2, 3, 3, 4]] # ihello
num_classes = 5
input_dim = 5 # one-hot size
hidden_size = 5 # output from the LSTM. 5 to directly predict one-hot
batch_size = 1 # one sentence
sequence_length = 6 # |ihello| == 6
learning_rate = 0.1
X = tf.placeholder(
tf.float32, [None, sequence_length, input_dim]) # X one-hot
Y = tf.placeholder(tf.int32, [None, sequence_length]) # Y label
cell = tf.contrib.rnn.BasicLSTMCell(num_units=hidden_size, state_is_tuple=True)
initial_state = cell.zero_state(batch_size, tf.float32)
outputs, _states = tf.nn.dynamic_rnn(
cell, X, initial_state=initial_state, dtype=tf.float32)
# FC layer
X_for_fc = tf.reshape(outputs, [-1, hidden_size])
# fc_w = tf.get_variable("fc_w", [hidden_size, num_classes])
# fc_b = tf.get_variable("fc_b", [num_classes])
# outputs = tf.matmul(X_for_fc, fc_w) + fc_b
outputs = tf.contrib.layers.fully_connected(
inputs=X_for_fc, num_outputs=num_classes, activation_fn=None)
# reshape out for sequence_loss
outputs = tf.reshape(outputs, [batch_size, sequence_length, num_classes])
weights = tf.ones([batch_size, sequence_length])
sequence_loss = tf.contrib.seq2seq.sequence_loss(
logits=outputs, targets=Y, weights=weights)
loss = tf.reduce_mean(sequence_loss)
train = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
prediction = tf.argmax(outputs, axis=2)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(50):
        l, _ = sess.run([loss, train], feed_dict={X: x_one_hot, Y: y_data})
        result = sess.run(prediction, feed_dict={X: x_one_hot})
        print(i, "loss:", l, "prediction: ", result, "true Y: ", y_data)

        # print char using dic
        result_str = [idx2char[c] for c in np.squeeze(result)]
        print("\tPrediction str: ", ''.join(result_str))
'''
0 loss: 1.71584 prediction: [[2 2 2 3 3 2]] true Y: [[1, 0, 2, 3, 3, 4]]
Prediction str: eeelle
1 loss: 1.56447 prediction: [[3 3 3 3 3 3]] true Y: [[1, 0, 2, 3, 3, 4]]
Prediction str: llllll
2 loss: 1.46284 prediction: [[3 3 3 3 3 3]] true Y: [[1, 0, 2, 3, 3, 4]]
Prediction str: llllll
3 loss: 1.38073 prediction: [[3 3 3 3 3 3]] true Y: [[1, 0, 2, 3, 3, 4]]
Prediction str: llllll
4 loss: 1.30603 prediction: [[3 3 3 3 3 3]] true Y: [[1, 0, 2, 3, 3, 4]]
Prediction str: llllll
5 loss: 1.21498 prediction: [[3 3 3 3 3 3]] true Y: [[1, 0, 2, 3, 3, 4]]
Prediction str: llllll
6 loss: 1.1029 prediction: [[3 0 3 3 3 4]] true Y: [[1, 0, 2, 3, 3, 4]]
Prediction str: lhlllo
7 loss: 0.982386 prediction: [[1 0 3 3 3 4]] true Y: [[1, 0, 2, 3, 3, 4]]
Prediction str: ihlllo
8 loss: 0.871259 prediction: [[1 0 3 3 3 4]] true Y: [[1, 0, 2, 3, 3, 4]]
Prediction str: ihlllo
9 loss: 0.774338 prediction: [[1 0 2 3 3 4]] true Y: [[1, 0, 2, 3, 3, 4]]
Prediction str: ihello
10 loss: 0.676005 prediction: [[1 0 2 3 3 4]] true Y: [[1, 0, 2, 3, 3, 4]]
Prediction str: ihello
...
'''
[# lab-12-2-char-seq-rnn ]
# lab-12-2-char-seq-rnn
"""
rnn 적용 ==> 정확도 높음
- 49 loss: 0.000650434 Prediction: if you want you
- y값 if you want you
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
# Lab 12 Character Sequence RNN
import tensorflow as tf
import numpy as np
tf.set_random_seed(777) # reproducibility
sample = " if you want you"
idx2char = list(set(sample)) # index -> char
char2idx = {c: i for i, c in enumerate(idx2char)} # char -> index
# hyper parameters
dic_size = len(char2idx) # RNN input size (one hot size)
hidden_size = len(char2idx) # RNN output size
num_classes = len(char2idx) # final output size (RNN or softmax, etc.)
batch_size = 1 # one sample data, one batch
sequence_length = len(sample) - 1 # number of lstm rollings (unit #)
learning_rate = 0.1
sample_idx = [char2idx[c] for c in sample] # char to index
x_data = [sample_idx[:-1]] # X data sample (0 ~ n-1) hello: hell
y_data = [sample_idx[1:]] # Y label sample (1 ~ n) hello: ello
X = tf.placeholder(tf.int32, [None, sequence_length]) # X data
Y = tf.placeholder(tf.int32, [None, sequence_length]) # Y label
x_one_hot = tf.one_hot(X, num_classes) # one hot: 1 -> 0 1 0 0 0 0 0 0 0 0
cell = tf.contrib.rnn.BasicLSTMCell(
num_units=hidden_size, state_is_tuple=True)
initial_state = cell.zero_state(batch_size, tf.float32)
outputs, _states = tf.nn.dynamic_rnn(
cell, x_one_hot, initial_state=initial_state, dtype=tf.float32)
# FC layer
X_for_fc = tf.reshape(outputs, [-1, hidden_size])
outputs = tf.contrib.layers.fully_connected(X_for_fc, num_classes, activation_fn=None)
# reshape out for sequence_loss
outputs = tf.reshape(outputs, [batch_size, sequence_length, num_classes])
weights = tf.ones([batch_size, sequence_length])
sequence_loss = tf.contrib.seq2seq.sequence_loss(
logits=outputs, targets=Y, weights=weights)
loss = tf.reduce_mean(sequence_loss)
train = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
prediction = tf.argmax(outputs, axis=2)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(50):
        l, _ = sess.run([loss, train], feed_dict={X: x_data, Y: y_data})
        result = sess.run(prediction, feed_dict={X: x_data})

        # print char using dic
        result_str = [idx2char[c] for c in np.squeeze(result)]
        print(i, "loss:", l, "Prediction:", ''.join(result_str))
'''
0 loss: 2.35377 Prediction: uuuuuuuuuuuuuuu
1 loss: 2.21383 Prediction: yy you y you
2 loss: 2.04317 Prediction: yy yoo ou
3 loss: 1.85869 Prediction: yy ou uou
4 loss: 1.65096 Prediction: yy you a you
5 loss: 1.40243 Prediction: yy you yan you
6 loss: 1.12986 Prediction: yy you wann you
7 loss: 0.907699 Prediction: yy you want you
8 loss: 0.687401 Prediction: yf you want you
9 loss: 0.508868 Prediction: yf you want you
10 loss: 0.379423 Prediction: yf you want you
11 loss: 0.282956 Prediction: if you want you
12 loss: 0.208561 Prediction: if you want you
...
'''
[ # lab-12-3-char-seq-softmax-only ]
# lab-12-3-char-seq-softmax-only
"""
RNN not applied ==> accuracy is poor
- 2999 loss: 0.277323 Prediction: yf you yant you
- target y: if you want you
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
# Lab 12 Character Sequence Softmax only
import tensorflow as tf
import numpy as np
tf.set_random_seed(777) # reproducibility
sample = " if you want you"
idx2char = list(set(sample)) # index -> char
char2idx = {c: i for i, c in enumerate(idx2char)} # char -> index
# hyper parameters
dic_size = len(char2idx) # RNN input size (one hot size)
rnn_hidden_size = len(char2idx) # RNN output size
num_classes = len(char2idx) # final output size (RNN or softmax, etc.)
batch_size = 1 # one sample data, one batch
sequence_length = len(sample) - 1 # number of lstm rollings (unit #)
learning_rate = 0.1
sample_idx = [char2idx[c] for c in sample] # char to index
x_data = [sample_idx[:-1]] # X data sample (0 ~ n-1) hello: hell
y_data = [sample_idx[1:]] # Y label sample (1 ~ n) hello: ello
X = tf.placeholder(tf.int32, [None, sequence_length]) # X data
Y = tf.placeholder(tf.int32, [None, sequence_length]) # Y label
# flatten the data (ignore batches for now). No effect if the batch size is 1
X_one_hot = tf.one_hot(X, num_classes) # one hot: 1 -> 0 1 0 0 0 0 0 0 0 0
X_for_softmax = tf.reshape(X_one_hot, [-1, rnn_hidden_size])
# softmax layer (rnn_hidden_size -> num_classes)
softmax_w = tf.get_variable("softmax_w", [rnn_hidden_size, num_classes])
softmax_b = tf.get_variable("softmax_b", [num_classes])
outputs = tf.matmul(X_for_softmax, softmax_w) + softmax_b
# expend the data (revive the batches)
outputs = tf.reshape(outputs, [batch_size, sequence_length, num_classes])
weights = tf.ones([batch_size, sequence_length])
# Compute sequence cost/loss
sequence_loss = tf.contrib.seq2seq.sequence_loss(
logits=outputs, targets=Y, weights=weights)
loss = tf.reduce_mean(sequence_loss) # mean all sequence loss
train = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
prediction = tf.argmax(outputs, axis=2)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(3000):
        l, _ = sess.run([loss, train], feed_dict={X: x_data, Y: y_data})
        result = sess.run(prediction, feed_dict={X: x_data})

        # print char using dic
        result_str = [idx2char[c] for c in np.squeeze(result)]
        print(i, "loss:", l, "Prediction:", ''.join(result_str))
'''
0 loss: 2.29513 Prediction: yu yny y y oyny
1 loss: 2.10156 Prediction: yu ynu y y oynu
2 loss: 1.92344 Prediction: yu you y u you
..
2997 loss: 0.277323 Prediction: yf you yant you
2998 loss: 0.277323 Prediction: yf you yant you
2999 loss: 0.277323 Prediction: yf you yant you
'''
[# lab-12-4-rnn_long_char]
# lab-12-4-rnn_long_char
"""
error : from __future__ import print_function ==> 실행불가로 주석처리함
# MultiRNNCell 로 여러단을 만들면 , 정확도가 높아짐
# softmax =>reshape 수행
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
# from __future__ import print_function
import tensorflow as tf
import numpy as np
from tensorflow.contrib import rnn
tf.set_random_seed(777) # reproducibility
sentence = ("if you want to build a ship, don't drum up people together to "
"collect wood and don't assign them tasks and work, but rather "
"teach them to long for the endless immensity of the sea.")
char_set = list(set(sentence))
char_dic = {w: i for i, w in enumerate(char_set)}
data_dim = len(char_set)
hidden_size = len(char_set)
num_classes = len(char_set)
sequence_length = 10 # Any arbitrary number
learning_rate = 0.1
dataX = []
dataY = []
for i in range(0, len(sentence) - sequence_length):
    x_str = sentence[i:i + sequence_length]
    y_str = sentence[i + 1: i + sequence_length + 1]
    print(i, x_str, '->', y_str)

    x = [char_dic[c] for c in x_str]  # x str to index
    y = [char_dic[c] for c in y_str]  # y str to index

    dataX.append(x)
    dataY.append(y)
batch_size = len(dataX)
X = tf.placeholder(tf.int32, [None, sequence_length])
Y = tf.placeholder(tf.int32, [None, sequence_length])
# One-hot encoding
X_one_hot = tf.one_hot(X, num_classes)
print(X_one_hot) # check out the shape
# Make a lstm cell with hidden_size (each unit output vector size)
def lstm_cell():
    cell = rnn.BasicLSTMCell(hidden_size, state_is_tuple=True)
    return cell

multi_cells = rnn.MultiRNNCell([lstm_cell() for _ in range(2)], state_is_tuple=True)
# Stacking several layers with MultiRNNCell as above raises the accuracy

# outputs: unfolding size x hidden size, state = hidden size
outputs, _states = tf.nn.dynamic_rnn(multi_cells, X_one_hot, dtype=tf.float32)

# softmax => reshape is performed
# FC layer
X_for_fc = tf.reshape(outputs, [-1, hidden_size])
outputs = tf.contrib.layers.fully_connected(X_for_fc, num_classes, activation_fn=None)
# reshape out for sequence_loss
outputs = tf.reshape(outputs, [batch_size, sequence_length, num_classes])
# All weights are 1 (equal weights)
weights = tf.ones([batch_size, sequence_length])
sequence_loss = tf.contrib.seq2seq.sequence_loss(
logits=outputs, targets=Y, weights=weights)
mean_loss = tf.reduce_mean(sequence_loss)
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(mean_loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(500):
    _, l, results = sess.run(
        [train_op, mean_loss, outputs], feed_dict={X: dataX, Y: dataY})
    for j, result in enumerate(results):
        index = np.argmax(result, axis=1)
        print(i, j, ''.join([char_set[t] for t in index]), l)

# Let's print the last char of each result to check it works
results = sess.run(outputs, feed_dict={X: dataX})
for j, result in enumerate(results):
    index = np.argmax(result, axis=1)
    if j == 0:  # print all for the first result to make a sentence
        print(''.join([char_set[t] for t in index]), end='')
    else:
        print(char_set[index[-1]], end='')
'''
0 167 tttttttttt 3.23111
0 168 tttttttttt 3.23111
0 169 tttttttttt 3.23111
…
499 167 of the se 0.229616
499 168 tf the sea 0.229616
499 169 the sea. 0.229616
g you want to build a ship, don't drum up people together to collect wood and don't assign them tasks and work, but rather teach them to long for the endless immensity of the sea.
'''
[# lab-12-5-rnn_stock_prediction]
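lab-12-5-rnn_stock_prediction is listed above under "code to explore next" but its code is not included in this post. As a rough sketch of the idea in the same TF1 style as the labs above (my own outline, with synthetic data standing in for the stock CSV used in the original lab): a window of the last 7 time steps is fed to an LSTM, and a fully connected layer regresses the next step's closing value.

# Rough sketch of a many-to-one LSTM regressor (synthetic data, not the original lab file).
import numpy as np
import tensorflow as tf

tf.set_random_seed(777)

seq_length = 7      # use the last 7 time steps
data_dim = 5        # e.g. open / high / low / volume / close
hidden_dim = 10
output_dim = 1      # predict the next closing value
learning_rate = 0.01

# Synthetic "price" series as a stand-in for real stock data
series = np.cumsum(np.random.randn(200, data_dim), axis=0).astype(np.float32)

dataX, dataY = [], []
for i in range(len(series) - seq_length):
    dataX.append(series[i:i + seq_length])        # 7 steps of features
    dataY.append([series[i + seq_length, -1]])    # next step's last column

X = tf.placeholder(tf.float32, [None, seq_length, data_dim])
Y = tf.placeholder(tf.float32, [None, 1])

cell = tf.contrib.rnn.BasicLSTMCell(num_units=hidden_dim, state_is_tuple=True)
outputs, _states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
# use only the last output of the sequence for the regression head
Y_pred = tf.contrib.layers.fully_connected(outputs[:, -1], output_dim, activation_fn=None)

loss = tf.reduce_sum(tf.square(Y_pred - Y))
train = tf.train.AdamOptimizer(learning_rate).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(500):
        l, _ = sess.run([loss, train], feed_dict={X: dataX, Y: dataY})
    print("final loss:", l)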
[ References ]
https://www.inflearn.com/course/기본적인-머신러닝-딥러닝-강좌
https://github.com/hunkim/deeplearningzerotoall
https://www.tensorflow.org/api_docs/python/tf/layers