VOYAGER Minecraft: Paper Reading and Source Code Analysis

Basic Information:

  • Title: VOYAGER: An Open-Ended Embodied Agent with Large Language Models
  • Keywords: lifelong learning, embodied agent, large language model, open-ended exploration, skill library
  • URLs: Paper: https://arxiv.org/abs/2305.16291v1

Main Ideas

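  • This paper introduces VOYAGER, an agent with in-context lifelong learning ability that explores Minecraft, acquires skills, and makes novel discoveries without human intervention. VOYAGER is driven by large language models (LLMs) and has three key components: an automatic curriculum, an ever-growing skill library of executable code, and an iterative prompting mechanism that incorporates environment feedback. The automatic curriculum aims to maximize exploration by proposing progressively harder tasks based on the agent's current skill level and the world state. The skill library is a growing repository of executable code that stores complex behaviors and retrieves them for reuse in new situations, while the iterative prompting mechanism refines the generated programs using feedback from the Minecraft simulation and the code interpreter. Empirically, VOYAGER shows strong in-context lifelong learning ability and plays Minecraft with exceptional proficiency: compared with the prior SOTA, it obtains 3.3× more unique items, travels 2.3× longer distances, and unlocks key tech tree milestones up to 15.3× faster. VOYAGER can also use its learned skill library in a new Minecraft world to solve tasks from scratch, which traditional reinforcement learning methods find hard to do.

  • Automatic curriculum: with "explore as much as possible" as the overarching goal, GPT-4 keeps generating new tasks based on the current exploration progress and the agent's state.

  • Skill library: every program that successfully solves a task is stored as a JS script, together with an embedding of its text description. When a later task resembles a stored description, the matching skill is retrieved and reused directly.

  • Iterative prompting mechanism:

    • Environment feedback: executing the program yields the current observation of the game and the error messages from the previous round of generated code.
    • The feedback is assembled into a new prompt and sent back to GPT-4, which can then spot unreasonable operations and syntax errors in the program; this prepares the next round of code refinement.
    • The loop repeats until the self-verification mechanism confirms that the current task is complete. The generated skill is then committed to the library and the next task is requested from the automatic curriculum.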
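
To make the skill library idea concrete, here is a minimal sketch of storing skills next to an embedding of their description and retrieving the closest one for a new task. It is only an illustration: `ToySkillLibrary`, `embed`, and `cosine` are hypothetical names, and the bag-of-words "embedding" is a toy stand-in for the real embedding model and vector store.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToySkillLibrary:
    """Hypothetical skill store: name -> (description embedding, JS source)."""

    def __init__(self):
        self.skills = {}

    def add_skill(self, name, description, js_code):
        self.skills[name] = (embed(description), js_code)

    def retrieve(self, query, k=1):
        """Return the names of the k skills whose descriptions best match the query."""
        query_vector = embed(query)
        ranked = sorted(
            self.skills,
            key=lambda name: cosine(query_vector, self.skills[name][0]),
            reverse=True,
        )
        return ranked[:k]


library = ToySkillLibrary()
library.add_skill(
    "mineWoodLog",
    "mine a wood log from a nearby tree",
    "async function mineWoodLog(bot) { /* ... */ }",
)
library.add_skill(
    "craftChest",
    "craft a chest from eight planks",
    "async function craftChest(bot) { /* ... */ }",
)
print(library.retrieve("craft 1 chest"))
```

Keying retrieval on the description rather than on the code itself means a new task phrased in natural language can be matched without parsing any JavaScript.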

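  • The project code can be divided into 4 parts:

    • Python: Voyager's core logic
    • JS: under voyager/control_primitives, a set of JS scripts that implement the basic built-in skills
    • Node.js: under voyager/env/mineflayer, starts an HTTP server and defines a set of HTTP APIs
    • prompts: under voyager/prompts, holds all the prompts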
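
As a rough illustration of how the Python side might hand JS code to the Node.js HTTP server, the sketch below only builds a request and does not send it. The `/step` path, the port 3000, the JSON field names, and the helper name `build_step_request` are all assumptions made for this example, not confirmed details of the project.

```python
import json
import urllib.request


def build_step_request(code, programs, server="http://127.0.0.1:3000"):
    """Build (but do not send) a POST request carrying JS code to execute.

    The "/step" path, the port, and the JSON field names are assumptions
    made for illustration only.
    """
    payload = json.dumps({"code": code, "programs": programs}).encode("utf-8")
    return urllib.request.Request(
        url=f"{server}/step",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_step_request("await mineWoodLog(bot);", programs="")
print(request.get_method(), request.full_url)
# Sending it would look like: urllib.request.urlopen(request, timeout=600)
```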
# voyager.py: core part

# Four agents: CurriculumAgent, CriticAgent, SkillManager, ActionAgent
# reset() and step() are wrapped by rollout(), which runs the execution loop.
# learn() is the main entry point; after initialization it runs rollout() in a loop.
# rollout() also maintains the context and the task.
#...........
# Curriculum loop in learn()
while True:
    if self.recorder.iteration > self.max_iterations:
        print("Iteration limit reached")
        break
    # This step updates the context
    task, context = self.curriculum_agent.propose_next_task(
        events=self.last_events,
        chest_observation=self.action_agent.render_chest_observation(),
        max_retries=5,
    )
    print(
        f"\033[35mStarting task {task} for at most {self.action_agent_task_max_retries} times\033[0m"
    )
    try:
        messages, reward, done, info = self.rollout(
            task=task,
            context=context,
            reset_env=reset_env,
        )
    except Exception as e:
        time.sleep(3)  # wait for mineflayer to exit
        info = {
            "task": task,
            "success": False,
        }
        # reset bot status here
        self.last_events = self.env.reset(
            options={
                "mode": "hard",
                "wait_ticks": self.env_wait_ticks,
                "inventory": self.last_events[-1][1]["inventory"],
                "equipment": self.last_events[-1][1]["status"]["equipment"],
                "position": self.last_events[-1][1]["status"]["position"],
            }
        )
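
The error-recovery branch above restores the bot from the last recorded event. A small standalone sketch of that step, assuming `last_events` is a list of `(event_type, event_dict)` pairs (which is how the excerpt indexes it with `[-1][1]`); the helper name and the sample data are made up for illustration:

```python
def build_recovery_options(last_events, wait_ticks):
    """Build options for a "hard" reset that restores the last known bot state.

    Assumes `last_events` is a list of (event_type, event_dict) pairs.
    """
    _, last_event = last_events[-1]
    return {
        "mode": "hard",
        "wait_ticks": wait_ticks,
        "inventory": last_event["inventory"],
        "equipment": last_event["status"]["equipment"],
        "position": last_event["status"]["position"],
    }


# Made-up sample data, shaped like the events used in the excerpt
sample_events = [
    (
        "observe",
        {
            "inventory": {"oak_log": 3},
            "status": {
                "equipment": [None] * 6,
                "position": {"x": 1, "y": 64, "z": -5},
            },
        },
    ),
]
options = build_recovery_options(sample_events, wait_ticks=20)
print(options["inventory"], options["position"])
```

Carrying inventory, equipment, and position across the reset means a crash in one rollout does not throw away the progress the agent has already made.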
def rollout(self, *, task, context, reset_env=True):
    """
    - Takes a concrete task and its context, then runs the core logic

    - In detail: GPT analyzes the task on its own -> generates JS code ->
      calls the mineflayer server to execute the JS code -> reports back new_skill

    """
    self.reset(task=task, context=context, reset_env=reset_env)
    # =======================================================================

    while True:
        #
        # GPT analyzes the task + generates JS code + runs it + gives feedback (judges the result)
        #   - 1. Call GPT to analyze the task and generate JS code
        #   - 2. Preprocess the JS code (python + javascript + babel)
        #   - 3. Send an HTTP request to the locally started mineflayer service to execute the JS code remotely
        #   - 4. GPT judges the execution result and decides whether the task is complete
        #   - 5. Return the newly learned skill, new_skill
        #
        messages, reward, done, info = self.step()
        if done:
            break
    return messages, reward, done, info
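
The retry structure of this loop can be sketched without Minecraft or an LLM. In the example below, `fake_step` is a hypothetical stand-in for the whole "generate, execute, critique" step, and `max_retries` is an illustrative parameter name; the only point is how the critique from one attempt feeds the next one until success or until the budget runs out.

```python
def rollout(step_fn, max_retries=4):
    """Call step_fn until it reports success or the retry budget runs out."""
    attempts = 0
    success = False
    critique = ""
    while not success and attempts < max_retries:
        # Each attempt receives the critique produced by the previous one
        success, critique = step_fn(critique)
        attempts += 1
    return success, attempts


def make_fake_step(succeed_on):
    """Hypothetical stand-in for: generate code -> execute -> critique."""
    calls = {"n": 0}

    def fake_step(previous_critique):
        calls["n"] += 1
        if calls["n"] >= succeed_on:
            return True, ""
        return False, f"attempt {calls['n']} failed; refine the code"

    return fake_step


print(rollout(make_fake_step(succeed_on=3)))   # (True, 3)
print(rollout(make_fake_step(succeed_on=10)))  # (False, 4)
```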
def step(self):
    """
    Run one iteration: query the LLM, execute the generated code,
    critique the result, and build the prompt for the next iteration.
    """
    if self.action_agent_rollout_num_iter < 0:
        raise ValueError("Agent must be reset before stepping")

    # =======================================================================
    # Ask the action agent's LLM for a reply to the current messages
    ai_message = self.action_agent.llm(self.messages)
    print(f"\033[34m****Action Agent ai message****\n{ai_message.content}\033[0m")
    self.conversations.append(
        (self.messages[0].content, self.messages[1].content, ai_message.content)
    )
    # =======================================================================
    # Parse the reply: this imports the JS modules and calls the JS libs
    parsed_result = self.action_agent.process_ai_message(message=ai_message)
    success = False
    if isinstance(parsed_result, dict):
        code = parsed_result["program_code"] + "\n" + parsed_result["exec_code"]
        # =======================================================================
        # Send an HTTP request to the locally started mineflayer service
        # to execute the JS code remotely
        events = self.env.step(
            code,
            programs=self.skill_manager.programs,  # JS code of the stored skills
        )

        # =======================================================================

        self.recorder.record(events, self.task)
        self.action_agent.update_chest_memory(events[-1][1]["nearbyChests"])

        # Call OpenAI (GPT) and use its answer to decide whether the task is complete
        success, critique = self.critic_agent.check_task_success(
            events=events,
            task=self.task,
            context=self.context,
            chest_observation=self.action_agent.render_chest_observation(),
            max_retries=5,
        )

        # =======================================================================

        if self.reset_placed_if_failed and not success:
            # revert all the placing event in the last step
            blocks = []
            positions = []
            for event_type, event in events:
                if event_type == "onSave" and event["onSave"].endswith("_placed"):
                    block = event["onSave"].split("_placed")[0]
                    position = event["status"]["position"]
                    blocks.append(block)
                    positions.append(position)

            # Send an HTTP request to the mineflayer service to give the
            # placed items back to the bot
            new_events = self.env.step(
                f"await givePlacedItemBack(bot, {U.json_dumps(blocks)}, {U.json_dumps(positions)})",
                programs=self.skill_manager.programs,  # JS code of the stored skills
            )
            events[-1][1]["inventory"] = new_events[-1][1]["inventory"]
            events[-1][1]["voxels"] = new_events[-1][1]["voxels"]

        # =======================================================================
        # Query the vector database and try to reuse existing skills
        new_skills = self.skill_manager.retrieve_skills(
            query=self.context
            + "\n\n"
            + self.action_agent.summarize_chatlog(events)
        )

        # =======================================================================
        # Build the next prompt; GPT writes the code that implements the control logic
        system_message = self.action_agent.render_system_message(skills=new_skills)
        human_message = self.action_agent.render_human_message(
            events=events,
            code=parsed_result["program_code"],
            task=self.task,
            context=self.context,
            critique=critique,
        )
        self.last_events = copy.deepcopy(events)
        self.messages = [system_message, human_message]
    else:
        # Parsing failed: parsed_result is an error string, so retry with the same messages
        assert isinstance(parsed_result, str)
        self.recorder.record([], self.task)
        print(f"\033[34m{parsed_result} Trying again!\033[0m")

    # =======================================================================
    assert len(self.messages) == 2
    self.action_agent_rollout_num_iter += 1
    # Stop on success or once the retry budget for this task is used up
    done = (
        self.action_agent_rollout_num_iter >= self.action_agent_task_max_retries
        or success
    )
    info = {
        "success": success,
        "conversations": self.conversations,
    }
    # =======================================================================

    if success:
        assert (
            "program_code" in parsed_result and "program_name" in parsed_result
        ), "program and program_name must be returned when success"
        info["program_code"] = parsed_result["program_code"]
        info["program_name"] = parsed_result["program_name"]
    else:
        print(
            f"\033[32m****Action Agent human message****\n{self.messages[-1].content}\033[0m"
        )
    return self.messages, 0, done, info
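
The "revert placed blocks" branch of step() is easy to try out in isolation. The sketch below reuses the same filtering logic on made-up events; the helper name `collect_placed_blocks` and the sample data are hypothetical, and the event shape (`(event_type, event_dict)` pairs with an `onSave` field ending in `_placed`) is taken from the excerpt.

```python
def collect_placed_blocks(events):
    """Collect the names and positions of blocks placed during the last step."""
    blocks, positions = [], []
    for event_type, event in events:
        if event_type == "onSave" and event["onSave"].endswith("_placed"):
            # "crafting_table_placed" -> "crafting_table"
            blocks.append(event["onSave"].split("_placed")[0])
            positions.append(event["status"]["position"])
    return blocks, positions


# Made-up sample events
sample_events = [
    ("onChat", {"onChat": "Placed crafting_table"}),
    (
        "onSave",
        {
            "onSave": "crafting_table_placed",
            "status": {"position": {"x": 10, "y": 64, "z": 3}},
        },
    ),
    ("observe", {"status": {"position": {"x": 10, "y": 64, "z": 4}}}),
]
blocks, positions = collect_placed_blocks(sample_events)
print(blocks, positions)
```

Giving the blocks back after a failed attempt keeps the inventory consistent, so the next retry starts from the same resources as the failed one.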
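def render_system_message(self, skills=[]):
    # Load action_template, concatenate it with the default base skills and
    # action_response_format, add the skills passed in, and build the system message
    # that asks GPT to return the generated JS code plus a description of that code.
    ...

def render_human_message():
    # Collect the current environment information: the player's basic status, the biome,
    # nearby entities, the current task, the context, and the critique.
    ...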
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
def propose_next_task(self, *, events, chest_observation, max_retries=5):
    # Preset first task: chop wood
    if self.progress == 0 and self.mode == "auto":
        task = "Mine 1 wood log"
        context = "You can mine one of oak, birch, spruce, jungle, acacia, dark oak, or mangrove logs."
        return task, context

    # hard code task when inventory is almost full
    inventoryUsed = events[-1][1]["status"]["inventoryUsed"]
    # When 33 or more inventory slots are occupied, deposit the surplus in a nearby chest;
    # if there is no chest, create a task to place or craft one
    if inventoryUsed >= 33:
        if chest_observation != "Chests: None\n\n":
            chests = chest_observation[8:-2].split("\n")
            for chest in chests:
                content = chest.split(":")[1]
                if content == " Unknown items inside" or content == " Empty":
                    position = chest.split(":")[0]
                    task = f"Deposit useless items into the chest at {position}"
                    context = (
                        f"Your inventory have {inventoryUsed} occupied slots before depositing. "
                        "After depositing, your inventory should only have 20 occupied slots. "
                        "You should deposit useless items such as andesite, dirt, cobblestone, etc. "
                        "Also, you can deposit low-level tools, "
                        "For example, if you have a stone pickaxe, you can deposit a wooden pickaxe. "
                        "Make sure the list of useless items are in your inventory "
                        "(do not list items already in the chest), "
                        "You can use bot.inventoryUsed() to check how many inventory slots are used."
                    )
                    return task, context
        if "chest" in events[-1][1]["inventory"]:
            task = "Place a chest"
            context = (
                f"You have a chest in inventory, place it around you. "
                f"If chests is not None, or nearby blocks contains chest, this task is success."
            )
        else:
            task = "Craft 1 chest"
            context = "Craft 1 chest with 8 planks of any kind of wood."
        return task, context
    # Load the prompts
    messages = [
        # System prompt for creating the curriculum: it covers the player's current
        # environment and personal information, plus the required answer format
        self.render_system_message(),
        # Build the user message
        self.render_human_message(
            events=events, chest_observation=chest_observation
        ),
    ]

    if self.mode == "auto":
        return self.propose_next_ai_task(messages=messages, max_retries=max_retries)
    elif self.mode == "manual":
        return self.propose_next_manual_task()
    else:
        raise ValueError(f"Invalid curriculum agent mode: {self.mode}")
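
The chest-parsing part of this function depends on the exact text layout of `chest_observation`. The sketch below isolates that parsing. The layout used here (a `Chests:` header line, one `position: content` line per chest, and two trailing newlines) is inferred from the string slicing in the excerpt, and the helper name `find_depositable_chest` is hypothetical.

```python
def find_depositable_chest(chest_observation):
    """Return the position of the first empty or unexplored chest, else None.

    The observation layout is an assumption inferred from the slicing
    `chest_observation[8:-2]` in the excerpt.
    """
    if chest_observation == "Chests: None\n\n":
        return None
    # Strip the "Chests:\n" header (8 characters) and the trailing "\n\n"
    body = chest_observation[len("Chests:\n"):-len("\n\n")]
    for line in body.split("\n"):
        position, _, content = line.partition(":")
        if content in (" Unknown items inside", " Empty"):
            return position
    return None


observation = "Chests:\n(1, 2, 3): {'dirt': 4}\n(4, 5, 6): Empty\n\n"
print(find_depositable_chest(observation))
print(find_depositable_chest("Chests: None\n\n"))
```

Using `partition(":")` instead of `split(":")[1]` keeps the full content string intact even when the chest content itself contains colons.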
# critic
def human_check_task_success(self):  # a human confirms the task result
    confirmed = False
    success = False
    critique = ""
    while not confirmed:
        success = input("Success? (y/n)")
        success = success.lower() == "y"
        critique = input("Enter your critique:")
        print(f"Success: {success}\nCritique: {critique}")
        confirmed = input("Confirm? (y/n)") in ["y", ""]
    return success, critique

def ai_check_task_success(self, messages, max_retries=5):  # GPT confirms the task result
    # Without this guard the recursive retry below would never terminate
    if max_retries == 0:
        print("\033[31mFailed to parse Critic Agent response.\033[0m")
        return False, ""

    # Call the LLM (OpenAI), take its answer, and parse the keywords in it
    # to decide whether the task succeeded
    critic = self.llm(messages).content

    # =======================================================================

    print(f"\033[31m****Critic Agent ai message****\n{critic}\033[0m")
    try:
        response = fix_and_parse_json(critic)
        assert response["success"] in [True, False]
        if "critique" not in response:
            response["critique"] = ""
        return response["success"], response["critique"]
    except Exception as e:
        print(f"\033[31mError parsing critic response: {e} Trying again!\033[0m")
        # By default, retry up to 5 times on failure
        return self.ai_check_task_success(
            messages=messages,
            max_retries=max_retries - 1,
        )
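
The same "parse the critic's JSON, retry on failure" idea can be written as a bounded loop. This is a sketch under assumptions: `llm` here is a hypothetical zero-argument callable returning raw text, and plain `json.loads` stands in for the more forgiving `fix_and_parse_json` used in the excerpt.

```python
import json


def check_task_success(llm, max_retries=5):
    """Ask the critic up to max_retries times; return (success, critique)."""
    for _ in range(max_retries):
        raw = llm()
        try:
            response = json.loads(raw)
            if not isinstance(response["success"], bool):
                raise ValueError("'success' must be a boolean")
            return response["success"], response.get("critique", "")
        except (ValueError, KeyError, TypeError) as error:
            # json.JSONDecodeError is a subclass of ValueError
            print(f"Error parsing critic response: {error}. Trying again!")
    # Retry budget exhausted: treat the task as not verified
    return False, ""


# Hypothetical critic that answers badly once, then returns valid JSON
replies = iter(["not json", '{"success": true, "critique": "ok"}'])
print(check_task_success(lambda: next(replies)))
```

A loop with an explicit budget makes the termination condition visible at a glance, whereas the recursive form depends on a separate guard for `max_retries`.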