Why I made TabbyAPI

Brian Dashore - Jul 13 - Dev Community

Cover Photo by Jason Leung on Unsplash

This is the first of probably many “why I made” posts, in which I explain my reasoning for building a piece of software and dive deeper into future improvements.

Today, I’m going to focus on my most popular project, TabbyAPI. TabbyAPI is a Python-based FastAPI server that lets users interact with Large Language Models (LLMs) using the ExllamaV2 library, and it adheres to the OpenAI API specification.
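Adhering to the OpenAI API specification means existing OpenAI-compatible clients can talk to Tabby directly. Here's a minimal sketch, assuming a local instance; the port, API key, and model name below are placeholders, not real values:

```python
# Minimal sketch: pointing the official openai Python client at a local
# OpenAI-compatible server. The base_url, api_key, and model are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",  # assumed local TabbyAPI instance
    api_key="your-tabby-api-key",         # placeholder key
)

response = client.chat.completions.create(
    model="my-exl2-model",  # hypothetical loaded model
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```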

If you’re not sure what any of those words mean, you’re not in the AI space. But that’s okay! This article is meant to explain my experiences without throwing the entire kitchen sink of AI terms at you.

The start

Let me go back to November 2023. AI was booming, companies were releasing models left and right, and the hype train seemed to have no end. It sounds like I’m describing ancient history, but back then, every day felt like a whole month of innovation.

Amid the onslaught of these new technologies, I was focused on running them with my paltry 3090 Ti. Yes, paltry is the correct word, since 24GB of VRAM in a graphics card is entry level for running most AI models. At the time, running quantized versions of models was the norm. Quantization is analogous to compression; it allows users to run these massive models on consumer GPUs.

The format I grew to love was exl2, a format focused on speed, optimization, and squeezing as much as possible onto a graphics card, where tokens generated at the speed of sound. So this format is great! What’s the issue?

The issue is running the model. Exl2 is part of the ExllamaV2 library, but to run a model, a user needs an API server. The only option out there was text-generation-webui (TGW), a program that bundled every loader out there into a Gradio web UI. Gradio is a common “building-block” UI framework for Python development and is often used for AI applications. This setup was good for a while, until it wasn’t.

Essentially, the main reason for creating Tabby was annoyance. I got tired of the amount of work involved just to load one model, not to mention the overhead from Gradio and the sheer number of dependencies in TGW. I respect the developer, and while TGW is good for people who want an all-in-one solution, it was not good for me.

The plan…

Photo by Glenn Carstens-Peters on Unsplash

is simple. Create an API server that can sit on my computer and doesn’t require a ton of bloat to run. That sounds easy, but could I actually do it? I don’t have much experience in AI model theory, but I do have a lot of experience creating backend servers and understanding API design.

Therefore, I needed someone to help, but who? Enter turboderp, the person behind ExllamaV2. He knows pretty much everything about how models work, since he made the library, which pairs well with my API knowledge. In addition, another interested person named Splice joined due to his experience with Python. Together, the three of us started TabbyAPI.

But was the plan really that simple? Well, kind of. While I had the people for the job, my experience with API servers in Python was basically zero. I ended up using a web framework called FastAPI, which made my life much easier. It’s also very popular in the Python community and well documented.
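To give a flavor of what that code looks like, here is a minimal sketch of a completion-style FastAPI endpoint. This is illustrative, not Tabby's actual code; the route and request fields are assumptions:

```python
# Minimal FastAPI sketch of a completion-style endpoint (illustrative only).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

@app.post("/v1/completions")
async def create_completion(request: CompletionRequest):
    # A real server would pass the prompt to a model backend here.
    generated_text = f"(echo) {request.prompt}"
    return {"choices": [{"text": generated_text}]}
```

Run it with `uvicorn main:app` and FastAPI gives you request validation, JSON serialization, and interactive docs essentially for free.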

After using FastAPI for a few days, I was hooked on writing Python webserver code. The documentation is very good, there are many examples online, and the developers are receptive to feedback. Overall, the community is welcoming, and I’d love to use Python for networking more often.

After a few weeks, I felt that everything was ready for a public deploy and decided to release everything in the best way I know: YOLO and push everything to GitHub.

Issues and more issues

When releasing an open source project to the world, expect issues… a lot of issues. People always have use cases that your software doesn’t fit. Since Tabby is a backend server, many of those cases popped up. For this post, I’ll only mention a few things that were difficult to deal with at first.

A large pain point was that I released Tabby in the middle of the RAG hype cycle. RAG stands for “Retrieval Augmented Generation”: using external documents in addition to the LLM’s knowledge when generating a response. The problem was that these new techniques (such as function calling) need completely different API endpoints and ways of accomplishing tasks.

On top of that, there is little to no documentation on how these features actually work on the backend. To this day, I haven’t implemented OpenAI’s tool calling because I have no idea how it works. This lack of documentation is sadly common in the AI world, and it stifles developers’ ability to implement features in their projects without a lot of information gathering beforehand.

Another issue that lasted for several months was multi-user generation. It turns out that handling concurrent requests on a server isn’t an easy problem for a developer to tackle. FastAPI supports this type of workload, but Tabby was written with synchronous code. That meant I had to learn asynchronous programming in Python (which is not easy by a long shot).

The worst part is that AI developers do not like asynchronous Python, while networking servers embrace it. What this means is that I had to learn how to communicate between asynchronous and synchronous libraries using threading. This was an even deeper dive into Python’s threading issues and why the asynchronous model exists in the first place. I’ll go over all of this in another blog post, but hopefully this conveys the amount of learning I had to do over 2–3 months while battling these issues.
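To illustrate the kind of bridging involved, here is a minimal sketch (not Tabby's actual code) of calling a blocking, synchronous generation function from an async handler so the event loop stays free; `generate_sync` is a hypothetical stand-in for a blocking inference call:

```python
# Sketch: bridging async server code and a synchronous (blocking) library.
# generate_sync is a hypothetical stand-in for a blocking inference call.
import asyncio
import time

from fastapi import FastAPI

app = FastAPI()

def generate_sync(prompt: str) -> str:
    time.sleep(1)  # simulate a slow, blocking model call
    return f"response to: {prompt}"

@app.post("/generate")
async def generate(prompt: str):
    # asyncio.to_thread runs the blocking function in a worker thread,
    # so the event loop can keep serving other requests concurrently.
    text = await asyncio.to_thread(generate_sync, prompt)
    return {"text": text}
```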

Eventually, turbo and I worked together to create a better generator in the ExllamaV2 library, which stripped away all the multi-user woes and the weird bugs from threading libraries. After nine months, it’s safe to say that Tabby is finally a stable program for running models.

Burnout

Photo by Annie Spratt on Unsplash

In all my time developing software, I’ve never had a burnout period. That’s hard to believe, since burnout is common in the software world, but for the past six years there has always been something I wanted to code. Coding is my favorite pastime and helps me escape the stresses of the day.

However, Tabby and the AI community in general changed things. At the start, I made a lot of friends who shared a common interest in exploring the booming field of AI. My community used to hop on voice calls pretty much every day to share projects and ideas about what’s new in the space. It made development fun and enjoyable, since I got to hang out with like-minded people and trade new ideas.

Unfortunately, those voice calls started having fewer people and happened less often. I was also under a lot of stress finishing up my first year of medical school. Online, this was a huge period of loneliness for me, and developing Tabby felt like a burden on top of my med student life. Eventually, these events culminated in a large ball of frustration and tiredness. To solve it, I decided to take an indefinite break from AI.

During my break, I stepped away from Tabby and spent more time enjoying my summer vacation. I worked on some older iOS app projects and spent time with my family. Nowadays, I’m getting back into developing Tabby. The voice calls I used to partake in probably won’t happen for a long while due to the fading of AI hype. It’s a tough pill to swallow, but I’ve found different motivations for continuing development.

Lessons I learned

Tabby was the first LLM project I ever made. It somehow became a popular name within the community, and I was thrown into the deep end of project management. With that in mind, here are a few lessons I took away from the experience.

Know who you want to cater to: Anyone can use an open source project. For Tabby, I prioritize features that benefit the project’s ease of use, my friends, and myself. By sticking to this philosophy, I can manage my schedule and know which features to work on.

Understand your limits: Burnout isn’t fun. Don’t do what I did and run yourself down because a user has an issue for the umpteenth time. If feelings of frustration, anger, or boredom ever show up, take a break. It’s good to relax once in a while.

Don’t bend over backwards for everyone: An idea may look good when it’s first presented, but people don’t realize that the developer has to maintain that feature afterwards. If it’s a pain to keep up and not used much, the feature won’t be maintained and will become tech debt. Remember that random strangers on the internet always have ideas. It’s up to you or your team to decide which ones deserve your brainpower.

Create something you love and enjoy: Developers often lose enjoyment in a project because maintenance can be troublesome and time-consuming. This is especially true if the developer no longer actively uses the project. Figure out what your motivation is, and if it changes, that’s okay.

I’ll probably elaborate on these in another article, since this can be its own topic, but I feel that working on Tabby has given me more insight into how I want my projects to work, and it has expanded my knowledge of the open source community.

What the future holds

I’m thankful to all the people who contribute and give suggestions daily to improve both TabbyAPI and ExllamaV2. Everyone helps refine the programs to work better for general use. I’m one person, and every bit of help takes a lot off my plate.

For the foreseeable future, I’m going to cut back on how much I’m working on Tabby. The project is still going strong and many are committing to improving it, but my mental health is more important and taking breaks will help with that.

Thanks for reading this retrospective. If you want to learn more about me and what I do, please visit kingbri.dev.


theroyallab / tabbyAPI

An OAI compatible exllamav2 API that's both lightweight and fast

TabbyAPI


Developer-facing API documentation

Support on Ko-Fi

Important

In addition to the README, please read the Wiki page for information about getting started!

Note

Need help? Join the Discord Server and get the Tabby role. Please be nice when asking questions.

A FastAPI-based application that allows for generating text using an LLM (large language model) with the ExllamaV2 backend

Disclaimer

This project is marked as rolling release. There may be bugs and changes down the line. Please be aware that you may need to reinstall dependencies as the project evolves.

TabbyAPI is a hobby project made for a small number of users. It is not meant to run on production servers. For that, please look at other backends that support those workloads.

Getting Started

Important

This README is not for getting started. Please read the Wiki.

Read the Wiki for more information. It contains user-facing documentation for installation, configuration, sampling, API usage, and so much more.
