Basic RAG app with Spring AI, Docker and Ollama

Bruno Oliveira - Aug 1

Large language models are becoming smaller and better over time, and today models like Llama 3.1, which posts benchmark scores competitive with GPT-3.5 Turbo, can easily be run locally on your own computer.

This opens up endless opportunities to build cool stuff on top of this cutting-edge innovation, and if you bundle together a neat stack with Docker, Ollama and Spring AI, you have all you need to architect production-grade RAG systems locally.

Technology Stack

We will build a basic RAG app with:

  • Streamlit as a basic frontend;
  • Spring AI version 1.0.0-M1;
  • Spring Boot 3.3.0;
  • Java 21;
  • Docker + Compose, to run a pgvector-enabled Postgres as a RAG database in a container, and Ollama to serve local models;
  • Claude.ai as a pair-programming assistant to bounce ideas off;

Note that I've added Claude.ai as part of the technology stack because I believe that, these days, if you're not using genAI tools to help you program and sketch out ideas, you will quickly fall behind the curve.
My take on using genAI as a coding assistant is simple: you need to know what you're doing to extract value from it, but if you have a good mental model (which you can only build by doing, learning and reading on your own for many years; there is no replacement for seeking deep understanding yourself), you can move much faster than you otherwise would.

The app

The application will function as a chatbot that allows you to select different LLMs and upload documents, so you can “chat” with them.

RAG UI

When the document(s) are uploaded, you can then “query” them via text messages that can be “matched” with contents of said documents.

How does it really work?

But how does this work, really?
The answer might be a lot less exciting than you think: it uses plain, decades-old concepts from the fields of databases and information retrieval; there is nothing fancy here.

The basic idea is like taking an open-book exam and being allowed to take notes with you:

  • You have a question you need an answer to;
  • You have access to a giant knowledge base of information (your entire book);
  • Your notes represent a distilled version of the knowledge in the book (this is called an embedding in RAG slang);
  • You then perform a similarity search on your notes: map your question to your notes, and together with your own knowledge produce an answer;

The place where this analogy falls short is in something our brain does implicitly but that a RAG system has to do very explicitly: the “question” and your “notes” need to be in a similar format so they can be compared with each other. In other words, you compare a “distilled” version of your question with your “distilled” knowledge, represented by your notes.

Then, using classic similarity measures such as cosine similarity, the dot product between vectors, or even plain old full-text search (and others I don't know about), you can find results that are close enough to what you need to retrieve.
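
To make the comparison step concrete, here is a tiny, self-contained Java sketch of cosine similarity between two embedding vectors. The numbers are made up for illustration; in the actual app this computation happens inside Postgres/pgvector rather than in Java:

// Toy example: cosine similarity between a "distilled" question and a "distilled" note.
// In the real app, pgvector computes this inside the database.
public final class CosineSimilarityDemo {

    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] questionEmbedding = {0.12, 0.98, 0.33}; // embedding of the question
        double[] noteEmbedding = {0.10, 0.95, 0.40};     // embedding of a document chunk
        // Values close to 1.0 mean "very similar"; values close to 0 mean "unrelated".
        System.out.println(cosineSimilarity(questionEmbedding, noteEmbedding));
    }
}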

The flow is complete when these “very close matches”, after being retrieved from your RAG database, are sent to an LLM as “context”, so that it can use that context to provide you with better answers.

The reason this is so effective is that by providing an LLM with this additional information "retrieved" from your own personal or private documents, you are supplementing the model with extra information that it wouldn't otherwise have, which greatly improves its accuracy.

Diving into the code

Since LLMs emerged as the new kid on the block, most popular ecosystems have jumped in head first, offering support for working with LLMs and RAG databases directly. We will use Spring Boot together with Spring AI, and Postgres as our “knowledge base”, by installing the pgvector extension, which adds vector-search capabilities to Postgres.

There are many aspects and moving parts that need to work together, and understanding the foundational concepts is important, so let's go step by step.

First, there was a RAG database

We start with the database:

  db:
    image: ankane/pgvector:latest
    environment:
      - POSTGRES_DB=ragdb
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    networks:
      - app-network

This is a Docker service, part of a larger docker-compose file that we will build incrementally as we go along, which sets up a Postgres database with support for the pgvector extension. It also uses a dedicated init.sql file to initialize the database by creating the necessary tables:

CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Create the documents table
CREATE TABLE IF NOT EXISTS documents (
    id UUID DEFAULT uuid_generate_v4() PRIMARY KEY,
    document_name TEXT UNIQUE NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    metadata JSONB
);

-- Create the vector_store table with reference to documents
CREATE TABLE IF NOT EXISTS vector_store (
    id UUID DEFAULT uuid_generate_v4() PRIMARY KEY,
    document_id UUID NOT NULL,
    content TEXT,
    metadata JSONB,
    embedding vector(1024),
    chunk_index INTEGER,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
);

-- Create indexes
CREATE INDEX idx_documents_document_name ON documents(document_name);
CREATE INDEX idx_vector_store_document_id ON vector_store(document_id);
CREATE INDEX idx_vector_store_embedding ON vector_store USING HNSW (embedding vector_cosine_ops);

A lot is happening here, so let's distill the most important concepts:

CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

These three lines are not strictly needed here, because the Docker image already ships with the extensions pre-installed, but they are included for completeness: they are what enables RAG support on top of a standard Postgres distribution.

vector adds support for vector similarity search; hstore is a data type for storing sets of key/value pairs within a single PostgreSQL value, which can come in handy for mapping, for example, documents to their corresponding metadata values, or even chunks.
And lastly, uuid-ossp provides utility functions to generate UUIDs using several standard algorithms.

Then, we have the table creation:

-- Create the documents table
CREATE TABLE IF NOT EXISTS documents (
    id UUID DEFAULT uuid_generate_v4() PRIMARY KEY,
    document_name TEXT UNIQUE NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    metadata JSONB
);

-- Create the vector_store table with reference to documents
CREATE TABLE IF NOT EXISTS vector_store (
    id UUID DEFAULT uuid_generate_v4() PRIMARY KEY,
    document_id UUID NOT NULL,
    content TEXT,
    metadata JSONB,
    embedding vector(1024),
    chunk_index INTEGER,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
);

This requires some additional explanation: we create a relationship between a document and its embeddings, while adding metadata to both entities. A document can have many embeddings, so we will see a @ManyToOne relationship on the document_id column of the vector_store table when we define the Hibernate entities in Java.

I've decided to create a separate documents table besides the vector_store one because it's useful to have a dedicated place to manage the relationship between a document and its many embeddings. It's also crucial because, by enforcing a uniqueness constraint on document_name in the documents table and referencing its id from the vector_store table (which holds the embeddings of document chunks, or potentially full documents), we gain referential integrity: if we try to upload the same document twice, we won't generate repeated embeddings that bloat the DB unnecessarily. Obviously, for documents with the exact same name whose contents have changed, this would cause problems, but this is still just testing the waters and seeing how it all ties together.
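
In application code, that uniqueness constraint can also be checked up front to avoid wasted embedding calls. A hypothetical guard (the existsByDocumentName derived query and the ingestIfNew method are my own illustration, not code from this project):

// Hypothetical guard: skip re-embedding a document whose name we already stored.
public void ingestIfNew(String documentName, String content) {
    if (documentRepository.existsByDocumentName(documentName)) {
        log.info("Document '{}' already ingested, skipping embedding generation", documentName);
        return;
    }
    embeddingService.generateAndStoreEmbedding(content, documentName);
}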

Finally, indexes come into play:

-- Create indexes
CREATE INDEX idx_documents_document_name ON documents(document_name);
CREATE INDEX idx_vector_store_document_id ON vector_store(document_id);
CREATE INDEX idx_vector_store_embedding ON vector_store USING HNSW (embedding vector_cosine_ops);

Indexes on embeddings are a whole different beast: we need to find ways to efficiently index data inside potentially dense vectors, which brings efficiency, size and speed challenges that regular column types don't have. This post is the best I've read about the topic so far.
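
To give a feel for what the HNSW index actually accelerates, here is a hedged JdbcTemplate sketch of the kind of nearest-neighbour query that runs against vector_store. The <=> operator is pgvector's cosine-distance operator; the exact SQL that Spring AI generates may differ:

// Illustrative top-K similarity query; the HNSW index makes this ORDER BY fast.
List<String> topMatches(JdbcTemplate jdbcTemplate, double[] queryEmbedding, int topK) {
    // pgvector accepts vector literals in the form "[0.1, 0.2, ...]".
    String vectorLiteral = java.util.Arrays.toString(queryEmbedding);
    return jdbcTemplate.queryForList(
            "SELECT content FROM vector_store ORDER BY embedding <=> ?::vector LIMIT ?",
            String.class,
            vectorLiteral,
            topK);
}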

Then, PGVectorStore enters the chat

We have looked at the DB structure for our documents and embeddings; now it's time to look at how Spring AI helps us leverage it.

The new Spring AI project in the Spring ecosystem offers support for vector databases, including configuring Postgres as a vector store.

On startup, PgVectorStore will attempt to install the required database extensions and to create the required vector_store table, with an index, if they do not already exist.

By default, as described in the Spring AI pgvector documentation, the created vector_store table will have a dimension of 1536.
This is a common dimension for embedding vectors, used for example by OpenAI for its small embedding models, so it is, in a sense, a sensible default.

However, we might want either smaller or larger embedding dimensions (pgvector can index vectors of up to 2000 dimensions), for instance because we are using a smaller embedding model such as voyage-2 from Voyage AI (the embedding provider recommended by Anthropic), which produces 1024-dimensional vectors.

Because of this, we will soon see that we need to pass some special environment variables to our Spring Boot app, to get more fine-grained control over how the default vector_store table behind PgVectorStore is created and over which embedding model our application uses.

We can manually configure PGVectorStore by exposing some Beans in a configuration class:

@Bean(name = "ragDataSource")
public DataSource ragDataSource() {
    // Plain JDBC DataSource pointing at the pgvector-enabled Postgres instance;
    // the url/username/password values are injected from configuration.
    return DataSourceBuilder.create()
            .url(datasourceUrl)
            .username(dataSourceUsername)
            .password(dataSourcePassword)
            .driverClassName("org.postgresql.Driver")
            .build();
}

@Bean(name = "ragDB")
public JdbcTemplate jdbcTemplate() {
    return new JdbcTemplate(ragDataSource());
}

@Bean
public VectorStore vectorStore(
        @Qualifier("ragDB") JdbcTemplate jdbcTemplate,
        @Qualifier("ollamaEmbeddingModel") EmbeddingModel embeddingModel) {
    // PgVectorStore uses the JdbcTemplate to talk to Postgres and the embedding
    // model to turn text into vectors before storing or searching them.
    return new PgVectorStore(jdbcTemplate, embeddingModel);
}
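With these beans in place, the VectorStore API lets us write and query embeddings without touching SQL. A small usage sketch (the document text and metadata here are invented for illustration):

// Store a chunk: Spring AI calls the embedding model and persists the vector for us.
vectorStore.add(List.of(new Document(
        "Spring AI supports pgvector as a vector store.",
        Map.of("category", "PUBLIC"))));

// Retrieve the chunks most similar to a question.
List<Document> results = vectorStore.similaritySearch(
        SearchRequest.query("Which vector stores does Spring AI support?").withTopK(3));
results.forEach(doc -> System.out.println(doc.getContent()));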

We will now focus our attention on configuring our Spring Boot app via a docker-compose file to tie this together.

Docker-compose-based Spring Boot configuration

Here is our Spring Boot app configuration in a docker-compose file:

  backend:
    build: .
    ports:
      - "8080:8080"
    environment:
      - SPRING_DATASOURCE_URL=jdbc:postgresql://db:5432/ragdb
      - SPRING_DATASOURCE_USERNAME=postgres
      - SPRING_DATASOURCE_PASSWORD=password
      - SPRING_AI_OLLAMA_BASE_URL=http://ollama-llm:11434/
      - SPRING_AI_OLLAMA_CHAT_OPTIONS_MODEL=tinydolphin
      - SPRING_AI_OLLAMA_CHAT_OPTIONS_ALTERNATIVE_MODEL=tinyllama
      - SPRING_AI_OLLAMA_CHAT_OPTIONS_ALTERNATIVE_SECOND_MODEL=llama3.1
      - SPRING_AI_OLLAMA_EMBEDDING_OPTIONS_MODEL=mxbai-embed-large
      - SPRING_AI_VECTORSTORE_PGVECTOR_REMOVE_EXISTING_VECTOR_STORE_TABLE=true
      - SPRING_AI_VECTORSTORE_PGVECTOR_INDEX_TYPE=HNSW
      - SPRING_AI_VECTORSTORE_PGVECTOR_DISTANCE_TYPE=COSINE_DISTANCE
      - SPRING_AI_VECTORSTORE_PGVECTOR_DIMENSIONS=1024
    depends_on:
      - prepare-models
      - db
    volumes:
      - ollama_data:/root/.ollama
    networks:
      - app-network

This is the docker-compose service for our Java Spring Boot app, which gets built fully locally. The picture is not 100% complete yet, as we still need to look at how we configure Ollama via docker-compose, but we will get there.

The most critical environment variables are the first five and the last five. The others are extra, added to allow runtime selection of different LLMs (a possible per-request model switch is sketched just below).
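
The alternative-model variables are custom properties that the app reads itself, not standard Spring AI keys. One way to switch models per request, sketched here under the assumption that we pass Spring AI's OllamaOptions along with the prompt (selectedModel would come from the frontend):

// Hedged sketch: override the default Ollama chat model for a single request.
// selectedModel is e.g. "tinydolphin", "tinyllama" or "llama3.1".
Prompt prompt = new Prompt(
        "Summarize the uploaded document.",
        OllamaOptions.create().withModel(selectedModel));
chatModel.stream(prompt).subscribe(response ->
        System.out.print(response.getResult().getOutput().getContent()));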

We define data source credentials to ensure that our application knows about and can connect to our Postgres database, which will serve as the RAG database.

Then we configure the URL of the API that is "hosting" the LLM. Usually these large language models are served via APIs, and client code simply connects to them in order to use them. We need a host where the inference API is running, and we need to specify the model we want to use for the inference itself:

 - SPRING_AI_OLLAMA_BASE_URL=http://ollama-llm:11434/
 - SPRING_AI_OLLAMA_CHAT_OPTIONS_MODEL=tinydolphin

Since we are building a RAG application, we need an "embedding model" in addition to the inference API. Remember the "notes" for the open-book exam? If we want to do RAG, we need a way to "create the notes". This is done by specifying an embedding model.

- SPRING_AI_OLLAMA_EMBEDDING_OPTIONS_MODEL=mxbai-embed-large

I used mxbai-embed-large, an open-source embedding model from Mixedbread that is available through the Ollama project.

Usually, current providers of LLM services, such as OpenAI, Anthropic or Amazon Bedrock, offer an inference API, backed by a large language model, and an embeddings API, backed by a separate model that only produces embeddings. Ollama materializes the same split.
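
In Spring AI these two concerns surface as two separate abstractions: ChatModel for completions and EmbeddingModel for embeddings. A minimal sketch using both (the input text is just an example; in this Spring AI milestone, embed returns a List<Double>):

// Completion ("inference") call: the LLM answers in natural language.
String answer = chatModel.call(new Prompt("What is retrieval-augmented generation?"))
        .getResult().getOutput().getContent();

// Embedding call: the same kind of text becomes a dense vector we can store and compare.
List<Double> vector = embeddingModel.embed("What is retrieval-augmented generation?");
System.out.println("Answer: " + answer + " | embedding dimensions: " + vector.size());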

Finally, we configure the PgVectorStore abstraction that Spring AI offers on top of a pgvector-enabled Postgres installation, tweaking its defaults so that everything matches:

      - SPRING_AI_VECTORSTORE_PGVECTOR_REMOVE_EXISTING_VECTOR_STORE_TABLE=true
      - SPRING_AI_VECTORSTORE_PGVECTOR_INDEX_TYPE=HNSW
      - SPRING_AI_VECTORSTORE_PGVECTOR_DISTANCE_TYPE=COSINE_DISTANCE
      - SPRING_AI_VECTORSTORE_PGVECTOR_DIMENSIONS=1024

The key aspects: we set remove-existing-vector-store-table to true so that we don't keep the vector_store table that the autowired PgVectorStore would otherwise create for us with its defaults. Remember, the default embedding dimension there is 1536, and it has to match the 1024 dimensions of the embedding model we are using, which is what the last variable configures.
We also define the index type and the distance metric to use when performing similarity searches.

Ollama Docker containers to pull the models so they are locally available

Here is the setup I used to pull the models (both the large language models and the embedding model) so they get downloaded to my local machine:

  ollama-llm:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    networks:
      - app-network

  prepare-models:
    image: ollama/ollama:latest
    depends_on:
      - ollama-llm
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=http://ollama-llm:11434
    networks:
      - app-network
    entrypoint: >
      sh -c "
        echo 'Waiting for Ollama server to start...' &&
        sleep 10 &&
        echo 'Pulling tinydolphin...' &&
        ollama pull tinydolphin &&
        echo 'Pulling tinyllama...' &&
        ollama pull tinyllama &&
        echo 'Pulling llama3.1...' &&
        ollama pull llama3.1 &&
        echo 'Pulling embedding model...' &&
        ollama pull mxbai-embed-large &&
        echo 'Model preparation complete.'"

I use a shared volume between these two services to make two things happen. In a preparatory step, the prepare-models service pulls the LLMs I want to use, plus the embedding model, and they get stored in the ollama_data volume mounted at /root/.ollama.
Then, at "runtime", when the Ollama models are actually used as configured above in the Spring Boot app configuration, they are available because the serving Ollama container looks into the same volume where the models were downloaded, and things work as expected, just as if the models were hosted somewhere as remote API endpoints for embedding and completion generation.

Java code to tie it all together

Since we have created dedicated tables besides the standard one managed by PgVectorStore, we want to leverage them fully, so we create dedicated Hibernate entities to query them:

@Entity
@Table(name = "documents")
@Data
@Builder
@AllArgsConstructor
public class DocumentEntity {
    @Id
    @GeneratedValue(generator = "uuid2")
    @GenericGenerator(name = "uuid2", strategy = "uuid2")
    @Column(columnDefinition = "UUID")
    private UUID id;

    @Column(name = "document_name", unique = true, nullable = false)
    private String documentName;

    @Column(name = "created_at")
    private Instant createdAt;

    @Column(name = "updated_at")
    private Instant updatedAt;

    @JdbcTypeCode(SqlTypes.JSON)
    @Column(columnDefinition = "jsonb")
    private Map<String, Object> metadata;

    public DocumentEntity() {}

    @PrePersist
    protected void onCreate() {
        createdAt = Instant.now();
        updatedAt = Instant.now();
    }

    @PreUpdate
    protected void onUpdate() {
        updatedAt = Instant.now();
    }
}

and:

@Entity
@Table(name = "vector_store")
@Data
@Builder
@AllArgsConstructor
public class VectorStoreEntity {
    @Id
    @GeneratedValue(generator = "uuid2")
    @GenericGenerator(name = "uuid2", strategy = "uuid2")
    @Column(columnDefinition = "UUID")
    private UUID id;

    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "document_id", nullable = false)
    private DocumentEntity document;

    @Column(name = "content")
    private String content;

    @JdbcTypeCode(SqlTypes.JSON)
    @Column(columnDefinition = "jsonb")
    private Map<String, Object> metadata;

    @Column(name = "embedding", columnDefinition = "vector(1024)")
    private double[] embedding;

    @Column(name = "chunk_index")
    private Integer chunkIndex;

    @Column(name = "created_at")
    private Instant createdAt;

    @Column(name = "updated_at")
    private Instant updatedAt;

    public VectorStoreEntity() {}

    @PrePersist
    protected void onCreate() {
        createdAt = Instant.now();
        updatedAt = Instant.now();
    }

    @PreUpdate
    protected void onUpdate() {
        updatedAt = Instant.now();
    }
}
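The services below also rely on two Spring Data repositories that aren't shown explicitly. Here is a minimal sketch of what they might look like (the derived-query name matches the call used later in RagService):

public interface DocumentRepository extends JpaRepository<DocumentEntity, UUID> {
}

public interface VectorStoreRepository extends JpaRepository<VectorStoreEntity, UUID> {
    // Derived query used by RagService to fetch all chunks of a given document.
    List<VectorStoreEntity> findVectorStoreEntitiesByDocument_DocumentName(String documentName);
}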

Then, we can create a dedicated embedding service that uses our table structure to store the embeddings:

@Slf4j
@Service
@AllArgsConstructor
@ConditionalOnProperty("spring.datasource.url")
public class EmbeddingService {

    private VectorStoreRepository vectorStoreRepository;
    private DocumentRepository documentRepository;

    @Qualifier("ollamaEmbeddingModel")
    private EmbeddingModel embeddingModel;

    public void generateAndStoreEmbedding(String input, String originalFilename) {
        var document = DocumentEntity.builder()
                .documentName(originalFilename)
                .metadata(Map.of("category", "PUBLIC"))
                .build();
        var savedDocument = documentRepository.save(document);

        vectorStoreRepository.saveAll(List.of(VectorStoreEntity.builder()
                // Link the embedding row to its parent document; document_id is NOT NULL,
                // so we set the association rather than reusing the document's id.
                .document(savedDocument)
                .content(input)
                .embedding(
                        embeddingModel.embed(input).stream().mapToDouble(v -> v).toArray())
                .metadata(document.getMetadata())
                .build()));

        log.info("Stored embedding for input: " + input);
    }
}

And, with this in place, we can now create a dedicated RAG service with which we can perform document similarity search:

@Service
@AllArgsConstructor
@ConditionalOnProperty("spring.datasource.url")
public class RagService {

    private ChatModel chatModel;

    private VectorStore vectorStore;

    private final Map<String, DocumentProcessor> documentProcessors;

    @Getter
    private final Set<String> uploadedFileNames = new HashSet<>();

    private final VectorStoreRepository vectorStoreRepository;

    public List<VectorStoreEntity> findAll(String name) {
        return vectorStoreRepository.findVectorStoreEntitiesByDocument_DocumentName(name);
    }

    public void processFiles(List<MultipartFile> files) {
        for (MultipartFile file : files) {
            // Take the extension after the last dot, so names like "report.v2.pdf" resolve to "pdf".
            var originalFilename = file.getOriginalFilename();
            var extension = originalFilename.substring(originalFilename.lastIndexOf('.') + 1);
            DocumentProcessor documentProcessor = documentProcessors.get(extension);
            documentProcessor.processDocument(file);
            uploadedFileNames.add(file.getOriginalFilename());
        }
    }

    public Flux<ChatResponse> query(String question) {
        var similarDocuments = vectorStore.similaritySearch(SearchRequest.query(question)
                .withTopK(5)
                .withFilterExpression(new Filter.Expression(
                        Filter.ExpressionType.EQ, new Filter.Key("category"), new Filter.Value("PUBLIC"))));
        System.out.println("Retrieved " + similarDocuments.size() + " similar docs.");
        String context = similarDocuments.stream().map(Document::getContent).collect(Collectors.joining("\n"));

        String prompt = "Context: " + context + "\n\nQuestion: " + question + "\n\nAnswer:";

        System.out.println("The prompt is " + prompt);
        return chatModel.stream(new Prompt(prompt));
    }
}
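To expose this to the Streamlit frontend, a thin controller can stream the Flux straight to the client. A hedged sketch; the endpoint paths and parameter names are my own choices, not taken from the original project:

@RestController
@RequestMapping("/api/rag")
@AllArgsConstructor
public class RagController {

    private final RagService ragService;

    // Upload one or more documents to be processed and embedded.
    @PostMapping("/documents")
    public void upload(@RequestParam("files") List<MultipartFile> files) {
        ragService.processFiles(files);
    }

    // Stream the model's answer back as Server-Sent Events.
    @GetMapping(value = "/query", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<String> query(@RequestParam String question) {
        return ragService.query(question)
                .map(response -> response.getResult().getOutput().getContent());
    }
}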

With this in place, we are now equipped with a solid foundation to extend our app with whatever we want.

Some food for thought:

  • Adding dedicated metadata at the document level and chunk level to enrich the LLM prompt even further;
  • Always remember to program against interfaces where possible: the DocumentProcessor above is an interface behind which a dedicated document processor is selected at runtime, so that distinct documents get distinct pre-processing before their embeddings are generated (a possible shape for this interface is sketched right after this list);
  • We can extend this with a "post-processing" step for the embeddings, like adding metadata if desired, etc;
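
On that second point, here is one possible shape for the DocumentProcessor abstraction. It's an assumption of mine rather than the project's actual code, with each implementation registered under the file extension it handles so the map injected into RagService can look it up by key:

public interface DocumentProcessor {
    void processDocument(MultipartFile file);
}

// Registered under the bean name "pdf", so documentProcessors.get("pdf") resolves it.
@Component("pdf")
class PdfDocumentProcessor implements DocumentProcessor {

    private final EmbeddingService embeddingService;

    PdfDocumentProcessor(EmbeddingService embeddingService) {
        this.embeddingService = embeddingService;
    }

    @Override
    public void processDocument(MultipartFile file) {
        // Extract text from the PDF with a library of choice, optionally split it into
        // chunks, then delegate to the embedding service.
        String extractedText = "..."; // placeholder for the real extraction step
        embeddingService.generateAndStoreEmbedding(extractedText, file.getOriginalFilename());
    }
}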

Finally, here's the full docker-compose file, which includes the Streamlit frontend too:

version: '3.8'

services:
  ollama-llm:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    networks:
      - app-network

  db:
    image: ankane/pgvector:latest
    environment:
      - POSTGRES_DB=ragdb
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    networks:
      - app-network

  prepare-models:
    image: ollama/ollama:latest
    depends_on:
      - ollama-llm
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=http://ollama-llm:11434
    networks:
      - app-network
    entrypoint: >
      sh -c "
        echo 'Waiting for Ollama server to start...' &&
        sleep 10 &&
        echo 'Pulling tinydolphin...' &&
        ollama pull tinydolphin &&
        echo 'Pulling tinyllama...' &&
        ollama pull tinyllama &&
        echo 'Pulling llama3.1...' &&
        ollama pull llama3.1 &&
        echo 'Pulling embedding model...' &&
        ollama pull mxbai-embed-large &&
        echo 'Model preparation complete.'"

  backend:
    build: .
    ports:
      - "8080:8080"
    environment:
      - SPRING_DATASOURCE_URL=jdbc:postgresql://db:5432/ragdb
      - SPRING_DATASOURCE_USERNAME=postgres
      - SPRING_DATASOURCE_PASSWORD=password
      - SPRING_AI_OLLAMA_BASE_URL=http://ollama-llm:11434/
      - SPRING_AI_OLLAMA_CHAT_OPTIONS_MODEL=tinydolphin
      - SPRING_AI_OLLAMA_CHAT_OPTIONS_ALTERNATIVE_MODEL=tinyllama
      - SPRING_AI_OLLAMA_CHAT_OPTIONS_ALTERNATIVE_SECOND_MODEL=llama3.1
      - SPRING_AI_OLLAMA_EMBEDDING_OPTIONS_MODEL=mxbai-embed-large
      - SPRING_AI_VECTORSTORE_PGVECTOR_REMOVE_EXISTING_VECTOR_STORE_TABLE=true
      - SPRING_AI_VECTORSTORE_PGVECTOR_INDEX_TYPE=HNSW
      - SPRING_AI_VECTORSTORE_PGVECTOR_DISTANCE_TYPE=COSINE_DISTANCE
      - SPRING_AI_VECTORSTORE_PGVECTOR_DIMENSIONS=1024
    depends_on:
      - prepare-models
      - db
    volumes:
      - ollama_data:/root/.ollama
    networks:
      - app-network

  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile
    ports:
      - "8501:8501"
    environment:
      - BACKEND_URL=http://backend:8080
      - BACKEND_AUTH_USERNAME=admin
      - BACKEND_AUTH_PASSWORD=cst=stom
    depends_on:
      - backend
    networks:
      - app-network

networks:
  app-network:
    driver: bridge

volumes:
  postgres_data:
  ollama_data:

Conclusion

With Spring AI we can leverage the latest advancements in the LLM and AI research fields and, together with concepts from databases, information retrieval and data representation, build full-fledged local RAG apps that are modular and easy to extend!
