A Terminal Is All AI Needs

Aram Panasenco - May 15 - Dev Community

The number of tools and functions that aim to enhance the abilities of language models (LMs) is growing rapidly. For example, the popular LM framework LangChain grew its tool catalog from three to seventy-seven in the last 15 months. However, this approach of building tools for every little thing may be misguided and ultimately counterproductive. Instead, providing AI with direct access to a terminal, where it can use the many command line tools already created, and even create its own tools, will lead to more powerful, flexible, and future-proof systems.

Theory

Rich Sutton's short essay "The Bitter Lesson" strikes at the heart of this issue. On the surface, the essay is about choosing general methods over human-designed algorithms. But if that's all it was, it'd be just a lesson. The bitter part is that we humans want to feel special. We desperately need to believe that our knowledge and our contributions matter. So we saddle the AI with our knowledge, come up with processes for it to follow, and fill the gaps with our own code - and it all works, for a while. However, the next wave of innovation always comes and washes this sand castle away. We don't remember the clever algorithms modeling human vocal cords in speech recognition, we don't remember the clever algorithms searching for generalized cylinders in computer vision, and so neither will we remember most of LangChain's current seventy-seven tools.

The Princeton Language & Intelligence lab recently released SWE-Agent, shocking the world by raising GPT-4's resolution rate on real-world GitHub Python issues from 1.4% to 12%. What will be remembered from their achievement is not all the clever work in optimizing GPT-4 (remember the Bitter Lesson), but the introduction of the idea of the Agent-Computer Interface, and the focus on improving this interface. The Princeton researchers took the very important step of asking how we can improve the 'user experience' of an AI agent. Here are some features they implemented:

  • File search with 50 results at a time.
  • File viewer that displays 100 lines at a time.
  • Context management to remind the agent what it's working on.

These user experience challenges are not exclusive to AI. Humans are also more productive when we can use search and pagination instead of having to read and remember thousands of lines of text. That's why there's already a solution to most of these challenges, the king of all software: the terminal. I'd estimate that 80% of the tasks that need to be done on a computer can be done in the terminal, and the terminal can be used to develop solutions for the remaining 20%.

In fact, the Princeton researchers explicitly called out this possibility in their paper, but dismissed it as insufficiently user-friendly for LMs. They write:

Consider a text editor like Vim which relies on cursor-based line navigation and editing. Carrying out any operation leads to long chains of granular and inefficient interactions. Furthermore, humans can ignore unexpected inputs, such as accidentally outputting a binary file or thousands of lines returned from a ‘grep’ search. LMs are sensitive to these types of inputs, which can be distracting and take up a lot of the limited context available to the model. On the other end, commands that succeed silently confuse LMs. We observe that LMs will often expend extra actions to verify that a file was removed or an edit was applied if no automatic confirmation is given. We show that LMs are substantially more effective when using interfaces built with their needs and limitations in mind.

These are important points, but I don't agree that they have any bearing on the viability of just letting an LM use the terminal:

  1. A terminal window has fixed dimensions, so an LM that interacts with a terminal won't be forced to process thousands of lines of unexpected output - only whatever it can see in that window. (A small sketch after this list shows how this works with tmux.)
  2. Having to carry out lots of granular interactions and execute silent commands that don't return output may confuse the current generation of LMs, but this seems a limitation of just the current generation, not a universal limitation of all LMs. Per the Bitter Lesson, it's counterproductive to build tools while focusing on current limitations, as those tools then get an expiration date to their usefulness.
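To make the first point concrete, here's a minimal sketch of how a bounded terminal view works with tmux via the libtmux library (the same library the implementation below depends on). The session name and the command are purely illustrative:

import libtmux

# Start a detached tmux session that will act as the LM's terminal.
server = libtmux.Server()
session = server.new_session(session_name="lm-terminal", kill_session=True, attach=False)
pane = session.attached_pane

# Run a command that may produce thousands of lines of output.
pane.send_keys("grep -r 'def ' /usr/lib/python3", enter=True)

# capture_pane() returns only the lines currently visible in the pane,
# so the LM's context is bounded by the window size, not by the output size.
print("\n".join(pane.capture_pane()))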

Implementation

As the Princeton paper notes, the current generation of LMs isn't capable of using the terminal to perform complex tasks reliably.

However, you can try it out for yourself by installing langchain-community from my fork (I opened a pull request into the official LangChain repo but it's currently blocked due to security concerns).

You'll have to install tmux first.

If you're using pip:

pip install libtmux git+https://github.com/panasenco/langchain.git@terminal-window-tool#subdirectory=libs/community

Alternatively, if you're using Poetry, add this to the dependencies section of your pyproject.toml:

langchain_community = {git = "https://github.com/panasenco/langchain.git", branch="terminal-window-tool", subdirectory="libs/community"}
libtmux = "^0.37.0"

Then follow the instructions in the documentation notebook. Here's an excerpt that shows how an LM can interact with the terminal:

from langchain_community.tools import (
    TerminalLiteralInputTool,   # types literal text into the terminal
    TerminalSpecialInputTool,   # presses special keys such as Enter
    TerminalBottomCaptureTool,  # reads the bottom of the terminal window
)
from langchain_openai import ChatOpenAI

# Bind the terminal tools to the model so it can request them via tool calls.
lm = ChatOpenAI(model="gpt-4o-2024-05-13").bind_tools(
    [TerminalLiteralInputTool(), TerminalSpecialInputTool(), TerminalBottomCaptureTool()]
)

msg = lm.invoke("What top 3 processes consuming the most memory?")
msg.tool_calls
[{'name': 'terminal_literal_input',
  'args': {'__arg1': 'ps aux --sort=-%mem | head -n 4'},
  'id': 'call_xS2CDWlgZFkslqe7lM2QBoBz'},
 {'name': 'terminal_special_input',
  'args': {'__arg1': 'Enter'},
  'id': 'call_p6b4MVlPZ5FdC2aWsk2F4dEo'}]

In the above example we can see that GPT-4o knows how to translate the problem statement into a shell command, and knows to press the Enter key after entering the command. It can't do much more than this without a lot of hand-holding yet, but I'm sure we'll have a model for which navigating the terminal won't be a challenge by the end of 2024.
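To actually run what the model asked for, those tool calls can be dispatched back to the tools. The sketch below is my guess at that wiring, based on the single-string '__arg1' convention visible in the output above; the fork's exact calling convention may differ:

# Map the tool names the model used back to tool instances.
tools_by_name = {
    "terminal_literal_input": TerminalLiteralInputTool(),
    "terminal_special_input": TerminalSpecialInputTool(),
}

# Execute each call in order: type the command, then press Enter.
for call in msg.tool_calls:
    tools_by_name[call["name"]].run(call["args"]["__arg1"])

# The LM could then call TerminalBottomCaptureTool to read the visible output
# and compose its final answer.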

Security

AIs are prone to many categories of unsafe behavior, as highlighted by the paper Concrete Problems in AI Safety. The more power and freedom we give AI agents, the more likely they are to behave in unexpected and unwanted ways, and I can't think of a single application that gives its users as much power and freedom as the terminal.

The minimum that needs to be done to satisfy the concerns of safety and security in granting LMs terminal access depends on the generation of LMs we're talking about.

  • GPT-4 and equivalent LMs merely need local containerization. Just use a development container when working on your app. Development containers work in VS Code and GitHub Codespaces, and are a best practice in general to let others easily collaborate. Working inside a development container will ensure that oopsies like rm -rf / do minimal damage to the parent system. These LMs don't yet seem capable of using the terminal to intentionally break out of the container.
  • The next generation of LMs will need a greater degree of isolation. Cloud providers like Amazon already allow arbitrary users access to their systems without knowing whether the user is a hacker. The startup E2B brings the same secure containerization technology Amazon uses to the AI space. Treating these LMs as if they were potentially malicious human-level hackers and placing similar restrictions on them as cloud providers do on human users should be sufficient to contain the threat.
  • The following generation of LMs will probably need to be treated as superhuman hackers. These LMs should probably not be given access to any tools at all, not just the terminal, at least until the Superalignment team figures something out.

Conclusion

The Bitter Lesson teaches us that AI researchers over the years have tried desperately to retain a feeling of human control and a sense of value of human knowledge, only for those illusions to be shattered over and over. The only things of lasting value we human engineers seem to have to offer to AI are general methods that scale as the AI's abilities grow. I predict the "more dangerous" tools that also offer AI agents more flexibility and power will begin to replace the more limited ones. In the end, only the terminal tool will be necessary.

What about you? Can you use these tools to find a way to make current-generation LMs interact with the terminal reliably, contrary to what the Princeton paper claims? What do you think about the safety of letting LMs use a terminal? Let me know in the comments!
