TechTogetWorld

This post is about implementing artificial intelligence (Deep Reinforcement Learning).


The post is organized as follows.


================================================

Summary

 - Implement Q-Learning with a neural network (a Q-network) instead of a Q-table; a Q-table needs exponentially more memory as the state space grows.

  ==> The Q-table approach is impractical for real-world problems, so a network-based approach is needed. For example, FrozenLake only needs a 16 x 4 table, but a state made of raw game pixels would need far more entries than could ever be stored.


1. q_net_frozenlake

 - Convert the Q-table into a network


2. 07_3_dqn_2015_cartpole

  - Q-network issues

   1) Too little data, so accuracy is poor; training on just a couple of samples can produce a completely different fit each time.

      . Go deep (use more layers)

      . Experience replay: after each action, store the state, action, reward, etc. in a buffer, then sample from it randomly (evenly) for training.

   2) The target moves (the same network is used, so changing the prediction also changes the target) => like moving the bullseye the moment the arrow is fired.

      . Create a second (target) network; the main network is trained, and its weights are periodically copied into the target network. (A short sketch of both fixes follows this summary.)


3. Next Step

  ==> Implement Q-Learning using a neural network (Neural Network)


4. References

=================================================
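Before the full code, the two fixes listed in the summary above (experience replay and a separate target network) can be shown in a minimal, runnable sketch. The TinyQNet class and its method names are illustrative stand-ins only, not the code of this post; the actual implementations below use TensorFlow.

import random
from collections import deque

import numpy as np

# A deliberately tiny stand-in "network" (a linear model Q(s) = s @ W), just to make
# this sketch runnable; all names here are illustrative.
class TinyQNet:
    def __init__(self, input_size, output_size, lr=0.1):
        self.W = np.random.uniform(0, 0.01, (input_size, output_size))
        self.lr = lr

    def predict(self, state):
        # state: shape (1, input_size); returns Q values of shape (1, output_size)
        return state @ self.W

    def update(self, state, target):
        # one gradient-descent step on the squared error between target and prediction
        self.W += self.lr * state.T @ (target - self.predict(state))

    def copy_from(self, other):
        # target-network sync: overwrite our weights with the main network's weights
        self.W = other.W.copy()

dis = 0.99
replay_buffer = deque(maxlen=50000)    # 1) experience replay buffer

def train_from_replay(mainQ, targetQ, batch_size=32):
    # 2) sample randomly (evenly) so correlated consecutive transitions are broken up
    batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
    for state, action, reward, next_state, done in batch:
        target = mainQ.predict(state)
        if done:
            target[0, action] = reward
        else:
            # the label comes from the frozen target network, so the "bullseye"
            # does not move while the main network is being trained
            target[0, action] = reward + dis * np.max(targetQ.predict(next_state))
        mainQ.update(state, target)

# During training: store every (state, action, reward, next_state, done) in
# replay_buffer, call train_from_replay() every few episodes, and periodically
# sync the networks with targetQ.copy_from(mainQ).

In the real 2015 DQN code below, the periodic copy is done by get_copy_var_ops(), which assigns the main network's TensorFlow variables to the target network's variables.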




[ 06_q_net_frozenlake ]



'''
06_q_net_frozenlake


This code is based on

https://github.com/hunkim/DeepRL-Agents

'''

import gym

import numpy as np

import matplotlib.pyplot as plt

import time

import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'    # default value = 0  From http://stackoverflow.com/questions/35911252/disable-tensorflow-debugging-information


import tensorflow as tf

env = gym.make('FrozenLake-v0')


# Input and output size based on the Env

input_size = env.observation_space.n

output_size = env.action_space.n

learning_rate = 0.1


# These lines establish the feed-forward part of the network used to choose actions

X = tf.placeholder(shape=[1, input_size], dtype=tf.float32)              # state input

W = tf.Variable(tf.random_uniform([input_size, output_size], 0, 0.01))   # weight


Qpred = tf.matmul(X, W)     # Out Q prediction

Y = tf.placeholder(shape=[1, output_size], dtype=tf.float32)    # Y label


loss = tf.reduce_sum(tf.square(Y-Qpred))

train = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)


# Set Q-learning parameters

dis = .99

num_episodes = 2000


# create lists to contain total rewards and steps per episode

rList = []


def one_hot(x):
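    # Convert state index x into a 1 x 16 one-hot row vector (FrozenLake has 16 states).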

    return np.identity(16)[x:x+1]


start_time = time.time()


init = tf.global_variables_initializer()

with tf.Session() as sess:

    sess.run(init)

    for i in range(num_episodes):

        # Reset environment and get first new observation

        s = env.reset()

        e = 1. / ((i / 50) + 10)   # e-greedy exploration rate; it decays as episode index i grows

        rAll = 0

        done = False

        local_loss = []


        # The Q-Table learning algorithm

        while not done:

            # Choose an action greedily (with a chance of random action)

            # from the Q-network

            Qs = sess.run(Qpred, feed_dict={X: one_hot(s)})
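            # Qs has shape (1, output_size): the current Q-value estimate for each action in state s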

            if np.random.rand(1) < e:

                a = env.action_space.sample()

            else:

                a = np.argmax(Qs)


            # Get new state and reward from environment

            s1, reward, done, _ = env.step(a)

            if done:

                # Update Q, and no Qs+1, since it's a terminal state

                Qs[0, a] = reward

            else:

                # Obtain the Q' values by feeding the new state through our network

                Qs1 = sess.run(Qpred, feed_dict={X: one_hot(s1)})

                # Update Q

                Qs[0, a] = reward + dis*np.max(Qs1)


            # Train our network using target (Y) and predicted Q (Qpred) values

            sess.run(train, feed_dict={X: one_hot(s), Y: Qs})


            rAll += reward

            s = s1


        rList.append(rAll)


print("--- %s seconds ---" % (time.time() - start_time))


print("Success rate: " + str(sum(rList) / num_episodes))

#plt.bar(range(len(rList)), rList, color="blue")

plt.bar(range(len(rList)), rList, color='b', alpha=0.4)

plt.show()



[07_3_dqn_2015_cartpole]


"""
07_3_dqn_2015_cartpole


This code is based on

https://github.com/hunkim/DeepRL-Agents


CF https://github.com/golbin/TensorFlow-Tutorials

https://github.com/dennybritz/reinforcement-learning/blob/master/DQN/dqn.py


Q-network issues

 1. Too little data, so accuracy is poor; training on just a couple of samples can produce a completely different fit each time.

   - Go deep (use more layers)

   - Experience replay: after each action, store the state, action, reward, etc. in a buffer, then sample from it randomly (evenly) for training.

 2. The target moves (the same network is used, so changing the prediction also changes the target) => like moving the bullseye the moment the arrow is fired.

    - Create a second (target) network.

"""


import numpy as np

import tensorflow as tf

import random

from collections import deque

from dqn import dqn


import gym

from gym import wrappers


env = gym.make('CartPole-v0')


# Constants defining our neural network

input_size = env.observation_space.shape[0]

output_size = env.action_space.n


dis = 0.9

REPLAY_MEMORY = 50000


def replay_train(mainDQN, targetDQN, train_batch):

    x_stack = np.empty(0).reshape(0, input_size)

    y_stack = np.empty(0).reshape(0, output_size)


    # Get stored information from the buffer

    for state, action, reward, next_state, done in train_batch:

        Q = mainDQN.predict(state)


        # terminal?

        if done:

            Q[0, action] = reward

        else:

            # get target from target DQN (Q')

            Q[0, action] = reward + dis * np.max(targetDQN.predict(next_state))


        y_stack = np.vstack([y_stack, Q])

        x_stack = np.vstack( [x_stack, state])


    # Train our network using target and predicted Q values on each episode

    return mainDQN.update(x_stack, y_stack)


def ddqn_replay_train(mainDQN, targetDQN, train_batch):


    """Double DQN implementation.

    :param mainDQN: main DQN
    :param targetDQN: target DQN
    :param train_batch: minibatch for training
    :return: loss
    """



    x_stack = np.empty(0).reshape(0, mainDQN.input_size)

    y_stack = np.empty(0).reshape(0, mainDQN.output_size)


    # Get stored information from the buffer

    for state, action, reward, next_state, done in train_batch:

        Q = mainDQN.predict(state)


        # terminal?

        if done:

            Q[0, action] = reward

        else:

            # Double DQN: y = r + gamma * targetDQN(s')[a] where

            # a = argmax(mainDQN(s'))

            Q[0, action] = reward + dis * targetDQN.predict(next_state)[0, np.argmax(mainDQN.predict(next_state))]


        y_stack = np.vstack([y_stack, Q])

        x_stack = np.vstack([x_stack, state])


    # Train our network using target and predicted Q values on each episode

    return mainDQN.update(x_stack, y_stack)


def get_copy_var_ops(*, dest_scope_name="target", src_scope_name="main"):


    # Copy variables src_scope to dest_scope

    op_holder = []


    src_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=src_scope_name)

    dest_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=dest_scope_name)


    for src_var, dest_var in zip(src_vars, dest_vars):

        op_holder.append(dest_var.assign(src_var.value()))


    return op_holder


def bot_play(mainDQN, env=env):

    # See our trained network in action

    state = env.reset()

    reward_sum = 0

    while True:

        env.render()

        action = np.argmax(mainDQN.predict(state))

        state, reward, done, _ = env.step(action)

        reward_sum += reward

        if done:

            print("Total score: {}".format(reward_sum))

            break


def main():

    max_episodes = 5000

    # store the previous observations in replay memory

    replay_buffer = deque()


    with tf.Session() as sess:

        mainDQN = dqn.DQN(sess, input_size, output_size, name="main")

        targetDQN = dqn.DQN(sess, input_size, output_size, name="target")

        tf.global_variables_initializer().run()


        #initial copy q_net -> target_net

        copy_ops = get_copy_var_ops(dest_scope_name="target", src_scope_name="main")

        sess.run(copy_ops)


        for episode in range(max_episodes):

            e = 1. / ((episode / 10) + 1)

            done = False

            step_count = 0

            state = env.reset()


            while not done:

                if np.random.rand(1) < e:

                    action = env.action_space.sample()

                else:

                    # Choose an action greedily from the Q-network

                    action = np.argmax(mainDQN.predict(state))


                # Get new state and reward from environment

                next_state, reward, done, _ = env.step(action)

                if done: # Penalty

                    reward = -100


                # Save the experience to our buffer

                replay_buffer.append((state, action, reward, next_state, done))

                if len(replay_buffer) > REPLAY_MEMORY:

                      replay_buffer.popleft()


                state = next_state

                step_count += 1

                if step_count > 10000:   # Good enough. Let's move on

                    break


            print("Episode: {} steps: {}".format(episode, step_count))

            if step_count > 10000:

                pass

                ## stop at 10,000 steps (to prevent an infinite loop)

                # break


            if episode % 10 == 1: # train every 10 episodes

                # Get a random batch of experiences

                for _ in range(50):

                    minibatch = random.sample(replay_buffer, 10)

                    loss, _ = ddqn_replay_train(mainDQN, targetDQN, minibatch)


                print("Loss: ", loss)

                # copy q_net -> target_net

                sess.run(copy_ops)


        # See our trained bot in action

        env2 = wrappers.Monitor(env, 'gym-results', force=True)


        for i in range(200):

            bot_play(mainDQN, env=env2)


        env2.close()

        # gym.upload("gym-results", api_key="sk_VT2wPcSSOylnlPORltmQ")


if __name__ == "__main__":

    main()



[ References ]


  https://www.inflearn.com/course/기본적인-머신러닝-딥러닝-강좌

  https://github.com/hunkim/deeplearningzerotoall

  https://www.tensorflow.org/api_docs/python/tf/layers

  https://www.inflearn.com/course/reinforcement-learning/



This post is about implementing artificial intelligence (Deep Reinforcement Learning).


The post is organized as follows.


================================================

Summary

 - FrozenLake is a game where you find your way across a frozen lake without falling into a hole.

 - The ice is slippery. If you rely entirely on a guide's directions, the wind and other factors (an uncertain environment)

   can still make you slip. Relying on the guide only partially, and weighting your own judgment more, raises the success rate (from about 1.5% to about 66%).

 - The real world is also often hard to predict because of the surrounding environment; this approach can be applied in such cases.


1. Installing the software for OpenAI Gym games


2. # 01_play_frozenlake_det_windows

 => Move in the direction of the arrow key you press.

 => Run the file from a terminal (so it can receive key input).

 => Go to the folder containing frozenlake_det_windows.py and run it with the python command.

 => The game ends when you fall into a hole or reach the goal.


3. # 03_0_q_table_frozenlake_det

  => At first it moves to random places; after that it moves toward the largest Q value, finding a path that avoids the holes.

  => Note (random argmax): if the values are tied, pick any of them at random; if one is larger, go there.

 

4. # 03_2_q_table_frozenlake_det

   ==> If a random number is below e, move to a random place; otherwise move toward the larger reward (exploration & exploitation, i.e. e-greedy).

  ==> Like moving to a new neighborhood: at first you try restaurants at random, and once you know the area you mostly go to the good ones.

  ==> Another option is to add noise, which still partially reflects the existing data (the e case above ignores the existing data entirely).

      . This is a way of also giving second-best choices a chance.

  ==> discount (0.9): rewards received later are multiplied by 0.9 to reduce their weight.

       This is how the shortest path is found. (A short sketch of both exploration styles follows this summary.)


5. play_frozenlake_windows

 ==> Keyboard input is not recognized reliably; needs improvement later.

 ==> On the icy path the agent does not move exactly as the keys command, which represents a situation where you cannot rely entirely on teacher Q's advice.

 ==> This implements an environment that is closer to the real world.


6. # 05_0_q_table_frozenlake

  ==> In the slippery environment ('FrozenLake-v0') you cannot follow teacher Q's advice blindly, because the environment can make you slip.

  ==> On ice, relying entirely on Q's advice as before succeeds about 1.55% of the time; relying on it only partially succeeds about 66% of the time.


7. # 05_q_table_frozenlake

  ==> In the slippery environment ('FrozenLake-v0') you cannot follow teacher Q's advice blindly, because the environment can make you slip;

     therefore Q's advice should be blended in only partially:

 ==> Q[state, action] = (1-learning_rate) * Q[state, action] \

   + learning_rate*(reward + dis * np.max(Q[new_state, :]))

 ==> Accuracy improves considerably (1.55% ==> 66%). A worked example of this blended update appears in the sketch after this summary.

   . On ice, relying entirely on Q's advice gives about 1.55%; relying on it only partially gives about 66%.


8. Next Step

  ==> Implement Q-Learning using a neural network (Neural Network)


9. References

=================================================
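For items 4 and 7 above, the two exploration styles (e-greedy vs. added noise) and the update that only partially trusts Q can be condensed into a short NumPy-only sketch with made-up numbers. The function names here are illustrative; the full Gym versions appear in the sections below.

import numpy as np

Q = np.zeros([16, 4])            # toy Q-table: 16 states x 4 actions (FrozenLake-sized)
learning_rate, dis = 0.85, 0.99

def action_e_greedy(state, i):
    # e-greedy: with probability e take a completely random action (ignores current Q values)
    e = 1. / ((i // 100) + 1)
    if np.random.rand(1) < e:
        return np.random.randint(4)
    return int(np.argmax(Q[state, :]))

def action_with_noise(state, i):
    # added noise: the existing Q values are still partially reflected, so
    # second-best actions also get a chance; the noise decays as i grows
    return int(np.argmax(Q[state, :] + np.random.randn(1, 4) / (i + 1)))

def blended_update(state, action, reward, new_state):
    # trust "teacher Q" only partially: keep (1 - learning_rate) of the old value
    Q[state, action] = (1 - learning_rate) * Q[state, action] \
        + learning_rate * (reward + dis * np.max(Q[new_state, :]))

# Worked example with made-up numbers: if Q[s, a] = 0.5, reward = 0 and the best
# value in the next state is 1.0, the new value is
#   0.15 * 0.5 + 0.85 * (0 + 0.99 * 1.0) = 0.9165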




[ 1. Installing the software for OpenAI Gym games ]


- Installation guide: https://gym.openai.com/docs

- step 1

  . Open the Anaconda3 prompt.

- step 2

  . git clone https://github.com/openai/gym  ==> download gym

  . cd gym ==> move into the gym folder

  . pip3 install -e . ==> minimal install; use pip3. This installs the downloaded gym package on your PC.

- step 3 ==> Select the matching python (each package may be installed under a different python path, so you have to find and link the right one).

  . Change the interpreter in the Python editor (PyCharm settings).

  . Before: the python under the tensorflow folder

  . After: c:\\user\dhp\appdata3\python.exe



- step 4 ==> Add packages ==> if the tensorflow package is needed, it can be installed from within PyCharm.

  . Click the " + " button at the top right to see the list of installable packages, then select and install tensorflow.

  . As a result, both gym and tensorflow are installed for the python at c:\\user\dhp\appdata3\python.exe.

  . Install any other packages you need along the way.

- Still need to check how to add packages to the python inside the tensorflow environment; gym is installed there but is not working well.



- To check that the installation succeeded, enter the code below in PyCharm and run it.

  # CartPole test



"""CartPole test

"""

import gym

env = gym.make('CartPole-v0')

env.reset()

for _ in range(100):

    env.render()

    env.step(env.action_space.sample()) # take a random action


for i_episode in range(20):

    observation = env.reset()

    for t in range(100):

        env.render()

        print(observation)

        action = env.action_space.sample()

        observation, reward, done, info = env.step(action)

        if done:

            print("Episode finished after {} timesteps".format(t+1))

            break




[ # 01_play_frozenlake_det_windows ]


# 01_play_frozenlake_det_windows

"""

Run the file from a terminal (so it can receive key input).

Go to the folder containing frozenlake_det_windows.py and run it with the python command.

The game ends when you fall into a hole or reach the goal.

"""

import gym

from gym.envs.registration import register

from colorama import init

from kbhit import KBHit


init(autoreset=True)    # Reset the terminal mode to display ansi color


register(

    id='FrozenLake-v3',

    entry_point='gym.envs.toy_text:FrozenLakeEnv',

    kwargs={'map_name' : '4x4', 'is_slippery': False}

)


env = gym.make('FrozenLake-v3')        # is_slippery False

env.render()                             # Show the initial board


key = KBHit()


while True:


    action = key.getarrow()

    if action not in [0, 1, 2, 3]:

        print("Game aborted!")

        break


    state, reward, done, info = env.step(action)

    env.render()

    print("State: ", state, "Action: ", action, "Reward: ", reward, "Info: ", info)


    if done:

        print("Finished with reward", reward)

        break



[# 03_0_q_table_frozenlake_det ]


# 03_0_q_table_frozenlake_det

"""

 # random argmax: if the values are tied, pick any of them at random; if one is larger, go there.


"""


import gym

import numpy as np

import matplotlib.pyplot as plt

from gym.envs.registration import register

import random as pr


def rargmax(vector):    # https://gist.github.com/stober/1943451

    """ Argmax that chooses randomly among eligible maximum idices. """

    m = np.amax(vector)

    indices = np.nonzero(vector == m)[0]

    return pr.choice(indices)


register(

    id='FrozenLake-v3',

    entry_point='gym.envs.toy_text:FrozenLakeEnv',

    kwargs={'map_name' : '4x4', 'is_slippery': False}

)

env = gym.make('FrozenLake-v3')


# Initialize table with all zeros

Q = np.zeros([env.observation_space.n, env.action_space.n])  # a 16 x 4 table (states x actions)

# Set learning parameters

num_episodes = 2000


# create lists to contain total rewards and steps per episode

rList = []

for i in range(num_episodes):

    # Reset environment and get first new observation

    state = env.reset()

    rAll = 0

    done = False


    # The Q-Table learning algorithm

    while not done:

        action = rargmax(Q[state, :])  # random argmax: if tied, pick any; if one is larger, go there


        # Get new state and reward from environment

        new_state, reward, done, _ = env.step(action)


        # Update Q-Table with new knowledge using learning rate

        Q[state, action] = reward + np.max(Q[new_state, :])


        rAll += reward

        state = new_state

    rList.append(rAll)


print("Success rate: " + str(sum(rList) / num_episodes))

print("Final Q-Table Values")

print("LEFT DOWN RIGHT UP")

print(Q)


plt.bar(range(len(rList)), rList, color="blue")

#plt.bar(range(len(rList)), rList, color='b', alpha=0.4)

plt.show()



[# 03_2_q_table_frozenlake_det]


# 03_2_q_table_frozenlake_det

"""

   ==> If a random number is below e, move to a random place; otherwise move toward the larger reward (exploration & exploitation, i.e. e-greedy).

  ==> Like moving to a new neighborhood: at first you try restaurants at random, and once you know the area you mostly go to the good ones.

  ==> Another option is to add noise, which still partially reflects the existing data (the e case above ignores the existing data entirely).

      . This is a way of also giving second-best choices a chance.

  ==> discount (0.9): rewards received later are multiplied by 0.9 to reduce their weight.

       This is how the shortest path is found.



"""


import gym

import numpy as np

import matplotlib.pyplot as plt

from gym.envs.registration import register


register(

    id='FrozenLake-v3',

    entry_point='gym.envs.toy_text:FrozenLakeEnv',

    kwargs={'map_name' : '4x4', 'is_slippery': False}

)

env = gym.make('FrozenLake-v3')


# Initialize table with all zeros

Q = np.zeros([env.observation_space.n, env.action_space.n])

# Set learning parameters

dis = .99

num_episodes = 2000


# create lists to contain total rewards and steps per episode

rList = []

for i in range(num_episodes):

    # Reset environment and get first new observation

    state = env.reset()

    rAll = 0

    done = False


    e = 1. / ((i // 100) + 1)  # Python2 & 3

    # e becomes smaller in later episodes


    # The Q-Table learning algorithm

    while not done:

        # Choose an action by e-greedy

        if np.random.rand(1) < e:

            # if below e, move to a random place

            # if it moves in random directions too often, accuracy can drop

            action = env.action_space.sample()

        else:

            # otherwise, move toward the larger reward

            action = np.argmax(Q[state, :])


        # Get new state and reward from environment

        new_state, reward, done, _ = env.step(action)


        # Update Q-Table with new knowledge using decay rate

        Q[state, action] = reward + dis * np.max(Q[new_state, :])


        # rewards received later are multiplied by the discount factor to reduce their weight; this is how the shortest path is found


        rAll += reward

        state = new_state

    rList.append(rAll)


print("Success rate: " + str(sum(rList) / num_episodes))

print("Final Q-Table Values")

print("LEFT DOWN RIGHT UP")

print(Q)

#plt.bar(range(len(rList)), rList, color="blue")

plt.bar(range(len(rList)), rList, color='b', alpha=0.4)

plt.show()



[# play_frozenlake_windows]


# play_frozenlake_windows


"""keyboard 인식이 잘 않됨, 추후 보완필요

빙판길로, 키보드 조작대로 움직이지 않고, Q 선생의 말에 전적으로 의존하지 않는 상황을 의미함"""


import gym

from gym.envs.registration import register

from colorama import init

from kbhit import KBHit


init(autoreset=True)    # Reset the terminal mode to display ansi color


env = gym.make('FrozenLake-v0')       # is_slippery True

env.render()                            # Show the initial board


key = KBHit()

while True:


    action = key.getarrow()

    if action not in [0, 1, 2, 3]:

        print("Game aborted!")

        break


    state, reward, done, info = env.step(action)

    env.render()

    print("State: ", state, "Action: ", action, "Reward: ", reward, "Info: ", info)


    if done:

        print("Finished with reward", reward)

        break


[#05_0_q_table_frozenlake]


#05_0_q_table_frozenlake

"""

In the slippery environment ('FrozenLake-v0') you cannot follow teacher Q's advice blindly, because the environment can make you slip.

. On ice, relying entirely on Q's advice as before succeeds about 1.55% of the time; relying on it only partially succeeds about 66% of the time.

"""


import gym

import numpy as np

import matplotlib.pyplot as plt

from gym.envs.registration import register

import random as pr


env = gym.make('FrozenLake-v0')


# Initialize table with all zeros

Q = np.zeros([env.observation_space.n, env.action_space.n])


# Set learning parameters

learning_rate = .85

dis = .99

num_episodes = 2000


# create lists to contain total rewards and steps per episode

rList = []

for i in range(num_episodes):

    # Reset environment and get first new observation

    state = env.reset()

    rAll = 0

    done = False


    # The Q-Table learning algorithm

    while not done:

        action = np.argmax(Q[state, :] + np.random.randn(1, env.action_space.n) / (i + 1))
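        # decaying random noise is added to the Q values for exploration; its influence shrinks as i grows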

        # Get new state and reward from environment

        new_state, reward, done, _ = env.step(action)


        # Update Q-Table with new knowledge using learning rate

        Q[state, action] = reward + dis * np.max(Q[new_state, :])

        state = new_state


        rAll += reward


    rList.append(rAll)


print("Success rate: " + str(sum(rList) / num_episodes))

print("Final Q-Table Values")

print("LEFT DOWN RIGHT UP")

print(Q)

plt.bar(range(len(rList)), rList, color="blue")

#plt.bar(range(len(rList)), rList, color='b', alpha=0.4)

plt.show()



[# 05_q_table_frozenlake]

# 05_q_table_frozenlake

"""

In the slippery environment ('FrozenLake-v0') you cannot follow teacher Q's advice blindly, because the environment can make you slip,

so Q's advice should be blended in only partially:

 ==> Q[state, action] = (1-learning_rate) * Q[state, action] \

   + learning_rate*(reward + dis * np.max(Q[new_state, :]))

 ==> Accuracy improves considerably (1.55% ==> 66%).

   . On ice, relying entirely on Q's advice gives about 1.55%; relying on it only partially gives about 66%.

"""


import gym

import numpy as np

import matplotlib.pyplot as plt

from gym.envs.registration import register

import random as pr


register(

    id='FrozenLake-v3',

    entry_point='gym.envs.toy_text:FrozenLakeEnv',

    kwargs={'map_name' : '4x4', 'is_slippery': False}

)


#env = gym.make('FrozenLake-v3')

env = gym.make('FrozenLake-v0')


# Initialize table with all zeros

Q = np.zeros([env.observation_space.n, env.action_space.n])


# Set learning parameters

learning_rate = .85

dis = .99

num_episodes = 2000


# create lists to contain total rewards and steps per episode

rList = []

for i in range(num_episodes):

    # Reset environment and get first new observation

    state = env.reset()

    rAll = 0

    done = False


    # The Q-Table learning algorithm

    while not done:

        action = np.argmax(Q[state, :] + np.random.randn(1, env.action_space.n) / (i + 1))


        # Add noise ==> using e starts from a purely random choice, while adding noise still reflects the existing Q values, i.e. second-best choices also get a chance


        # Get new state and reward from environment

        new_state, reward, done, _ = env.step(action)

        # blend in teacher Q's advice only partially

        # Update Q-Table with new knowledge using learning rate

        Q[state, action] = (1-learning_rate) * Q[state, action] \

           + learning_rate*(reward + dis * np.max(Q[new_state, :]))


        rAll += reward

        state = new_state


    rList.append(rAll)


print("Success rate: " + str(sum(rList) / num_episodes))

print("Final Q-Table Values")

print("LEFT DOWN RIGHT UP")

print(Q)

#plt.bar(range(len(rList)), rList, color="blue")

plt.bar(range(len(rList)), rList, color='b', alpha=0.4)

plt.show()



[References]


  https://www.inflearn.com/course/기본적인-머신러닝-딥러닝-강좌

  https://github.com/hunkim/deeplearningzerotoall

  https://www.tensorflow.org/api_docs/python/tf/layers

  https://www.inflearn.com/course/reinforcement-learning/



This post is about implementing artificial intelligence.


The post is organized as follows.


================================================

1. # lab-12-1-hello-rnn

 Applies when the output of the previous step affects the output of the next step.

  - words, related searches, etc.


2. # lab-12-2-char-seq-rnn

  - RNN applied ==> high accuracy

    . 49 loss: 0.000650434 Prediction: if you want you

    . target y: if you want you


3. # lab-12-3-char-seq-softmax-only

  RNN not applied (softmax only) ==> accuracy is poor

  2999 loss: 0.277323 Prediction: yf you yant you

  target y: if you want you


4. # lab-12-4-rnn_long_char

  error: from __future__ import print_function ==> commented out because it would not run

  Stacking several layers with MultiRNNCell improves accuracy.

  softmax => reshape: the RNN outputs are flattened to [-1, hidden_size] so one FC/softmax layer is applied to every time step, then reshaped back for the sequence loss.


5. # lab-12-5-rnn_stock_prediction

  Predict tomorrow's stock price by training on the previous 7 days of data.

  The graph is currently not being displayed.


6. Further code to explore (additional)

  ==>lab-12-5-rnn_stock_prediction

      lab-13-1-mnist_using_scope

      lab-13-2-mnist_tensorboard

      lab-13-3-mnist_save_restore


7. References

=================================================




[ #lab-12-1-hello-rnn ]


#lab-12-1-hello-rnn

"""

Applies when the output of the previous step affects the output of the next step.

- words, related searches, etc.


"""

import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# Lab 12 RNN

import tensorflow as tf

import numpy as np

tf.set_random_seed(777)  # reproducibility


idx2char = ['h', 'i', 'e', 'l', 'o']

# Teach hello: hihell -> ihello

x_data = [[0, 1, 0, 2, 3, 3]]   # hihell

x_one_hot = [[[1, 0, 0, 0, 0],   # h 0

              [0, 1, 0, 0, 0],   # i 1

              [1, 0, 0, 0, 0],   # h 0

              [0, 0, 1, 0, 0],   # e 2

              [0, 0, 0, 1, 0],   # l 3

              [0, 0, 0, 1, 0]]]  # l 3


y_data = [[1, 0, 2, 3, 3, 4]]    # ihello


num_classes = 5

input_dim = 5  # one-hot size

hidden_size = 5  # output from the LSTM. 5 to directly predict one-hot

batch_size = 1   # one sentence

sequence_length = 6  # |ihello| == 6

learning_rate = 0.1


X = tf.placeholder(

    tf.float32, [None, sequence_length, input_dim])  # X one-hot

Y = tf.placeholder(tf.int32, [None, sequence_length])  # Y label


cell = tf.contrib.rnn.BasicLSTMCell(num_units=hidden_size, state_is_tuple=True)

initial_state = cell.zero_state(batch_size, tf.float32)

outputs, _states = tf.nn.dynamic_rnn(

    cell, X, initial_state=initial_state, dtype=tf.float32)


# FC layer

X_for_fc = tf.reshape(outputs, [-1, hidden_size])

# fc_w = tf.get_variable("fc_w", [hidden_size, num_classes])

# fc_b = tf.get_variable("fc_b", [num_classes])

# outputs = tf.matmul(X_for_fc, fc_w) + fc_b

outputs = tf.contrib.layers.fully_connected(

    inputs=X_for_fc, num_outputs=num_classes, activation_fn=None)


# reshape out for sequence_loss

outputs = tf.reshape(outputs, [batch_size, sequence_length, num_classes])


weights = tf.ones([batch_size, sequence_length])

sequence_loss = tf.contrib.seq2seq.sequence_loss(

    logits=outputs, targets=Y, weights=weights)

loss = tf.reduce_mean(sequence_loss)

train = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)


prediction = tf.argmax(outputs, axis=2)


with tf.Session() as sess:

    sess.run(tf.global_variables_initializer())

    for i in range(50):

        l, _ = sess.run([loss, train], feed_dict={X: x_one_hot, Y: y_data})

        result = sess.run(prediction, feed_dict={X: x_one_hot})

        print(i, "loss:", l, "prediction: ", result, "true Y: ", y_data)


        # print char using dic

        result_str = [idx2char[c] for c in np.squeeze(result)]

        print("\tPrediction str: ", ''.join(result_str))


'''

0 loss: 1.71584 prediction:  [[2 2 2 3 3 2]] true Y:  [[1, 0, 2, 3, 3, 4]]

Prediction str:  eeelle

1 loss: 1.56447 prediction:  [[3 3 3 3 3 3]] true Y:  [[1, 0, 2, 3, 3, 4]]

Prediction str:  llllll

2 loss: 1.46284 prediction:  [[3 3 3 3 3 3]] true Y:  [[1, 0, 2, 3, 3, 4]]

Prediction str:  llllll

3 loss: 1.38073 prediction:  [[3 3 3 3 3 3]] true Y:  [[1, 0, 2, 3, 3, 4]]

Prediction str:  llllll

4 loss: 1.30603 prediction:  [[3 3 3 3 3 3]] true Y:  [[1, 0, 2, 3, 3, 4]]

Prediction str:  llllll

5 loss: 1.21498 prediction:  [[3 3 3 3 3 3]] true Y:  [[1, 0, 2, 3, 3, 4]]

Prediction str:  llllll

6 loss: 1.1029 prediction:  [[3 0 3 3 3 4]] true Y:  [[1, 0, 2, 3, 3, 4]]

Prediction str:  lhlllo

7 loss: 0.982386 prediction:  [[1 0 3 3 3 4]] true Y:  [[1, 0, 2, 3, 3, 4]]

Prediction str:  ihlllo

8 loss: 0.871259 prediction:  [[1 0 3 3 3 4]] true Y:  [[1, 0, 2, 3, 3, 4]]

Prediction str:  ihlllo

9 loss: 0.774338 prediction:  [[1 0 2 3 3 4]] true Y:  [[1, 0, 2, 3, 3, 4]]

Prediction str:  ihello

10 loss: 0.676005 prediction:  [[1 0 2 3 3 4]] true Y:  [[1, 0, 2, 3, 3, 4]]

Prediction str:  ihello


...


'''


[# lab-12-2-char-seq-rnn ]


# lab-12-2-char-seq-rnn

"""

RNN applied ==> high accuracy

 - 49 loss: 0.000650434 Prediction: if you want you

 - target y: if you want you

"""

import os


os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# Lab 12 Character Sequence RNN

import tensorflow as tf

import numpy as np

tf.set_random_seed(777)  # reproducibility


sample = " if you want you"

idx2char = list(set(sample))  # index -> char

char2idx = {c: i for i, c in enumerate(idx2char)}  # char -> index


# hyper parameters

dic_size = len(char2idx)  # RNN input size (one hot size)

hidden_size = len(char2idx)  # RNN output size

num_classes = len(char2idx)  # final output size (RNN or softmax, etc.)

batch_size = 1  # one sample data, one batch

sequence_length = len(sample) - 1  # number of lstm rollings (unit #)

learning_rate = 0.1


sample_idx = [char2idx[c] for c in sample]  # char to index

x_data = [sample_idx[:-1]]  # X data sample (0 ~ n-1) hello: hell

y_data = [sample_idx[1:]]   # Y label sample (1 ~ n) hello: ello


X = tf.placeholder(tf.int32, [None, sequence_length])  # X data

Y = tf.placeholder(tf.int32, [None, sequence_length])  # Y label


x_one_hot = tf.one_hot(X, num_classes)  # one hot: 1 -> 0 1 0 0 0 0 0 0 0 0

cell = tf.contrib.rnn.BasicLSTMCell(

    num_units=hidden_size, state_is_tuple=True)

initial_state = cell.zero_state(batch_size, tf.float32)

outputs, _states = tf.nn.dynamic_rnn(

    cell, x_one_hot, initial_state=initial_state, dtype=tf.float32)


# FC layer

X_for_fc = tf.reshape(outputs, [-1, hidden_size])

outputs = tf.contrib.layers.fully_connected(X_for_fc, num_classes, activation_fn=None)


# reshape out for sequence_loss

outputs = tf.reshape(outputs, [batch_size, sequence_length, num_classes])


weights = tf.ones([batch_size, sequence_length])

sequence_loss = tf.contrib.seq2seq.sequence_loss(

    logits=outputs, targets=Y, weights=weights)

loss = tf.reduce_mean(sequence_loss)

train = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)


prediction = tf.argmax(outputs, axis=2)


with tf.Session() as sess:

    sess.run(tf.global_variables_initializer())

    for i in range(50):

        l, _ = sess.run([loss, train], feed_dict={X: x_data, Y: y_data})

        result = sess.run(prediction, feed_dict={X: x_data})


        # print char using dic

        result_str = [idx2char[c] for c in np.squeeze(result)]


        print(i, "loss:", l, "Prediction:", ''.join(result_str))



'''

0 loss: 2.35377 Prediction: uuuuuuuuuuuuuuu

1 loss: 2.21383 Prediction: yy you y    you

2 loss: 2.04317 Prediction: yy yoo       ou

3 loss: 1.85869 Prediction: yy  ou      uou

4 loss: 1.65096 Prediction: yy you  a   you

5 loss: 1.40243 Prediction: yy you yan  you

6 loss: 1.12986 Prediction: yy you wann you

7 loss: 0.907699 Prediction: yy you want you

8 loss: 0.687401 Prediction: yf you want you

9 loss: 0.508868 Prediction: yf you want you

10 loss: 0.379423 Prediction: yf you want you

11 loss: 0.282956 Prediction: if you want you

12 loss: 0.208561 Prediction: if you want you


...


'''


[#lab-12-3-char-seq-softmax-only]

#lab-12-3-char-seq-softmax-only

"""

RNN not applied (softmax only) ==> accuracy is poor

 - 2999 loss: 0.277323 Prediction: yf you yant you

 - target y: if you want you

"""


import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# Lab 12 Character Sequence Softmax only

import tensorflow as tf

import numpy as np

tf.set_random_seed(777)  # reproducibility


sample = " if you want you"

idx2char = list(set(sample))  # index -> char

char2idx = {c: i for i, c in enumerate(idx2char)}  # char -> index


# hyper parameters

dic_size = len(char2idx)  # RNN input size (one hot size)

rnn_hidden_size = len(char2idx)  # RNN output size

num_classes = len(char2idx)  # final output size (RNN or softmax, etc.)

batch_size = 1  # one sample data, one batch

sequence_length = len(sample) - 1  # number of lstm rollings (unit #)

learning_rate = 0.1


sample_idx = [char2idx[c] for c in sample]  # char to index

x_data = [sample_idx[:-1]]  # X data sample (0 ~ n-1) hello: hell

y_data = [sample_idx[1:]]   # Y label sample (1 ~ n) hello: ello


X = tf.placeholder(tf.int32, [None, sequence_length])  # X data

Y = tf.placeholder(tf.int32, [None, sequence_length])  # Y label


# flatten the data (ignore batches for now). No effect if the batch size is 1

X_one_hot = tf.one_hot(X, num_classes)  # one hot: 1 -> 0 1 0 0 0 0 0 0 0 0

X_for_softmax = tf.reshape(X_one_hot, [-1, rnn_hidden_size])


# softmax layer (rnn_hidden_size -> num_classes)

softmax_w = tf.get_variable("softmax_w", [rnn_hidden_size, num_classes])

softmax_b = tf.get_variable("softmax_b", [num_classes])

outputs = tf.matmul(X_for_softmax, softmax_w) + softmax_b


# expand the data (restore the batch dimension)

outputs = tf.reshape(outputs, [batch_size, sequence_length, num_classes])

weights = tf.ones([batch_size, sequence_length])


# Compute sequence cost/loss

sequence_loss = tf.contrib.seq2seq.sequence_loss(

    logits=outputs, targets=Y, weights=weights)

loss = tf.reduce_mean(sequence_loss)  # mean all sequence loss

train = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)


prediction = tf.argmax(outputs, axis=2)


with tf.Session() as sess:

    sess.run(tf.global_variables_initializer())

    for i in range(3000):

        l, _ = sess.run([loss, train], feed_dict={X: x_data, Y: y_data})

        result = sess.run(prediction, feed_dict={X: x_data})


        # print char using dic

        result_str = [idx2char[c] for c in np.squeeze(result)]

        print(i, "loss:", l, "Prediction:", ''.join(result_str))


'''

0 loss: 2.29513 Prediction: yu yny y y oyny

1 loss: 2.10156 Prediction: yu ynu y y oynu

2 loss: 1.92344 Prediction: yu you y u  you


..


2997 loss: 0.277323 Prediction: yf you yant you

2998 loss: 0.277323 Prediction: yf you yant you

2999 loss: 0.277323 Prediction: yf you yant you

'''


[# lab-12-4-rnn_long_char]


# lab-12-4-rnn_long_char

"""

error: from __future__ import print_function ==> commented out because it would not run

# Stacking several layers with MultiRNNCell improves accuracy.

# softmax => reshape is performed

"""


import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# from __future__ import print_function


import tensorflow as tf

import numpy as np

from tensorflow.contrib import rnn


tf.set_random_seed(777)  # reproducibility


sentence = ("if you want to build a ship, don't drum up people together to "

            "collect wood and don't assign them tasks and work, but rather "

            "teach them to long for the endless immensity of the sea.")


char_set = list(set(sentence))

char_dic = {w: i for i, w in enumerate(char_set)}


data_dim = len(char_set)

hidden_size = len(char_set)

num_classes = len(char_set)

sequence_length = 10  # Any arbitrary number

learning_rate = 0.1


dataX = []

dataY = []

for i in range(0, len(sentence) - sequence_length):

    x_str = sentence[i:i + sequence_length]

    y_str = sentence[i + 1: i + sequence_length + 1]

    print(i, x_str, '->', y_str)


    x = [char_dic[c] for c in x_str]  # x str to index

    y = [char_dic[c] for c in y_str]  # y str to index


    dataX.append(x)

    dataY.append(y)


batch_size = len(dataX)


X = tf.placeholder(tf.int32, [None, sequence_length])

Y = tf.placeholder(tf.int32, [None, sequence_length])


# One-hot encoding

X_one_hot = tf.one_hot(X, num_classes)

print(X_one_hot)  # check out the shape



# Make a lstm cell with hidden_size (each unit output vector size)

def lstm_cell():

    cell = rnn.BasicLSTMCell(hidden_size, state_is_tuple=True)

    return cell


multi_cells = rnn.MultiRNNCell([lstm_cell() for _ in range(2)], state_is_tuple=True)

# As above, stacking several layers with MultiRNNCell improves accuracy.


# outputs: unfolding size x hidden size, state = hidden size

outputs, _states = tf.nn.dynamic_rnn(multi_cells, X_one_hot, dtype=tf.float32)


# softmax => reshape: flatten the RNN outputs so one FC layer applies to every time step


# FC layer

X_for_fc = tf.reshape(outputs, [-1, hidden_size])

outputs = tf.contrib.layers.fully_connected(X_for_fc, num_classes, activation_fn=None)


# reshape out for sequence_loss

outputs = tf.reshape(outputs, [batch_size, sequence_length, num_classes])


# All weights are 1 (equal weights)

weights = tf.ones([batch_size, sequence_length])


sequence_loss = tf.contrib.seq2seq.sequence_loss(

    logits=outputs, targets=Y, weights=weights)

mean_loss = tf.reduce_mean(sequence_loss)

train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(mean_loss)


sess = tf.Session()

sess.run(tf.global_variables_initializer())


for i in range(500):

    _, l, results = sess.run(

        [train_op, mean_loss, outputs], feed_dict={X: dataX, Y: dataY})

    for j, result in enumerate(results):

        index = np.argmax(result, axis=1)

        print(i, j, ''.join([char_set[t] for t in index]), l)


# Let's print the last char of each result to check it works

results = sess.run(outputs, feed_dict={X: dataX})

for j, result in enumerate(results):

    index = np.argmax(result, axis=1)

    if j == 0:  # print all for the first result to make a sentence

        print(''.join([char_set[t] for t in index]), end='')

    else:

        print(char_set[index[-1]], end='')


'''

0 167 tttttttttt 3.23111

0 168 tttttttttt 3.23111

0 169 tttttttttt 3.23111

499 167  of the se 0.229616

499 168 tf the sea 0.229616

499 169   the sea. 0.229616


g you want to build a ship, don't drum up people together to collect wood and don't assign them tasks and work, but rather teach them to long for the endless immensity of the sea.


'''


[# lab-12-5-rnn_stock_prediction]


# lab-12-5-rnn_stock_prediction
"""
Predict tomorrow's stock price by training on the previous 7 days of data.
The graph is currently not being displayed.
"""

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
'''
This script shows how to predict stock prices using a basic RNN
'''
import tensorflow as tf
import numpy as np
import matplotlib
import os

tf.set_random_seed(777)  # reproducibility

if "DISPLAY" not in os.environ:
    # remove Travis CI Error
    matplotlib.use('Agg')

import matplotlib.pyplot as plt

def MinMaxScaler(data):
    ''' Min Max Normalization

    Parameters
    ----------
    data : numpy.ndarray
        input data to be normalized
        shape: [Batch size, dimension]

    Returns
    ----------
    data : numpy.ndarray
        normalized data
        shape: [Batch size, dimension]

    References
    ----------
    .. [1] http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

    '''
    numerator = data - np.min(data, 0)
    denominator = np.max(data, 0) - np.min(data, 0)
    # noise term prevents the zero division
    return numerator / (denominator + 1e-7)


# train Parameters
seq_length = 7
data_dim = 5
hidden_dim = 10
output_dim = 1
learning_rate = 0.01
iterations = 500

# Open, High, Low, Volume, Close
xy = np.loadtxt('data-02-stock_daily.csv', delimiter=',')
xy = xy[::-1]  # reverse order (chronologically ordered)
xy = MinMaxScaler(xy)
x = xy
y = xy[:, [-1]]  # Close as label

# build a dataset
dataX = []
dataY = []
for i in range(0, len(y) - seq_length):
    _x = x[i:i + seq_length]
    _y = y[i + seq_length]  # Next close price
    print(_x, "->", _y)
    dataX.append(_x)
    dataY.append(_y)

# train/test split
train_size = int(len(dataY) * 0.7)
test_size = len(dataY) - train_size
trainX, testX = np.array(dataX[0:train_size]), np.array(
    dataX[train_size:len(dataX)])
trainY, testY = np.array(dataY[0:train_size]), np.array(
    dataY[train_size:len(dataY)])

# input place holders
X = tf.placeholder(tf.float32, [None, seq_length, data_dim])
Y = tf.placeholder(tf.float32, [None, 1])

# build a LSTM network
cell = tf.contrib.rnn.BasicLSTMCell(
    num_units=hidden_dim, state_is_tuple=True, activation=tf.tanh)
outputs, _states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
Y_pred = tf.contrib.layers.fully_connected(
    outputs[:, -1], output_dim, activation_fn=None)  # We use the last cell's output

# cost/loss
loss = tf.reduce_sum(tf.square(Y_pred - Y))  # sum of the squares
# optimizer
optimizer = tf.train.AdamOptimizer(learning_rate)
train = optimizer.minimize(loss)

# RMSE
targets = tf.placeholder(tf.float32, [None, 1])
predictions = tf.placeholder(tf.float32, [None, 1])
rmse = tf.sqrt(tf.reduce_mean(tf.square(targets - predictions)))

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)

    # Training step
    for i in range(iterations):
        _, step_loss = sess.run([train, loss], feed_dict={
                                X: trainX, Y: trainY})
        print("[step: {}] loss: {}".format(i, step_loss))

    # Test step
    test_predict = sess.run(Y_pred, feed_dict={X: testX})
    rmse_val = sess.run(rmse, feed_dict={
                    targets: testY, predictions: test_predict})
    print("RMSE: {}".format(rmse_val))

    # Plot predictions
    plt.plot(testY)
    plt.plot(test_predict)
    plt.xlabel("Time Period")
    plt.ylabel("Stock Price")
    plt.show()


[References]


  https://www.inflearn.com/course/기본적인-머신러닝-딥러닝-강좌

  https://github.com/hunkim/deeplearningzerotoall

  https://www.tensorflow.org/api_docs/python/tf/layers