Playing with Generative AI with WildFly

This blog post explains how to use the WildFly AI Galleon feature pack to write generative AI applications.

Please note that this feature pack is currently a proof of concept and is not ready for production.

The examples provided here are based on the release org.wildfly:wildfly-ai-galleon-pack:0.1.0.

What’s in this feature pack?

This feature pack is a work in progress, so we will only cover what is available as of today. It relies on smallrye-llm to configure LangChain4J and expose it to your application, and it lets you inject the basic elements needed to make creating a Retrieval-Augmented Generation (RAG) application easy, by configuring and providing all the required pieces for such an application.

What is RAG?

As its name indicates, RAG is about retrieving relevant pieces of information from your data and using them to enrich the prompt before sending the whole query (initial prompt + relevant data) to the LLM. The goal is to provide enough context to the LLM so that it relies on those pieces of information and does not hallucinate.
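To make this concrete, here is a minimal, library-free sketch of the enrichment step; the helper name, the prompt layout and the sample snippets are purely illustrative and are not what the feature pack actually produces:

import java.util.List;

public class NaiveRag {

    // Build the enriched prompt: the user question plus the retrieved snippets.
    static String enrich(String userQuestion, List<String> retrievedSnippets) {
        return "Answer the question using only the context below.\n"
                + "Question: " + userQuestion + "\n"
                + "Context:\n" + String.join("\n---\n", retrievedSnippets);
    }

    public static void main(String[] args) {
        String prompt = enrich("How do I enable the AI subsystem?",
                List.of("The AI subsystem is provided by the wildfly-ai-galleon-pack.",
                        "Layers add the required modules and management resources."));
        System.out.println(prompt); // this is what would be sent to the LLM
    }
}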

The core elements of RAG

As you can see in the diagram below, you need a few elements to provide RAG functionality:

  • some components to get relevant pieces of information. We will focus on using a 'semantic' search approach, a.k.a. vector search, so we will need:

    • an embedding model to create embeddings (a.k.a. semantic vectors) from the user query, thus creating an embedding query.

    • an embedding store containing all the embeddings and the associated segments of data, against which the embedding query will be run.

  • a client to the LLM to which the enriched query will be sent.

[Diagram: the core elements of RAG]

RAG with LangChain4j

The current feature pack provides a basic integration with the LangChain4J library, which means that, for now, we are exposing the LangChain4J API directly to the application. It provides:

  • dev.langchain4j.model.embedding.EmbeddingModel

  • dev.langchain4j.store.embedding.EmbeddingStore

  • dev.langchain4j.rag.content.retriever.ContentRetriever

  • dev.langchain4j.model.chat.ChatLanguageModel

You can configure instances of these objects via the WildFly management API and have them injected via CDI into your application.
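For instance, a bean can inject a chat model and a content retriever by name. This is a minimal sketch: the names "mychat" and "myretriever" are assumptions matching the sample CLI commands later in this post, and must match the resource names defined in the subsystem:

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.rag.content.retriever.ContentRetriever;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import jakarta.inject.Named;

@ApplicationScoped
public class AiBean {

    // Injects the ChatLanguageModel registered as "mychat" in the ai subsystem.
    @Inject
    @Named("mychat")
    ChatLanguageModel chatModel;

    // Injects the ContentRetriever registered as "myretriever" in the ai subsystem.
    @Inject
    @Named("myretriever")
    ContentRetriever retriever;

    public String ask(String question) {
        return chatModel.generate(question);
    }
}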

We also partially support LangChain4J AI Services using RegisterAIService from smallrye-llm. Please note that this will change in the future as that project is quite new.
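As a rough illustration only, an AI service could look like the interface below. The RegisterAIService package, its attributes and the way the target chat model is selected are assumptions to be verified against the smallrye-llm release you actually use, since that API is still moving:

import dev.langchain4j.service.SystemMessage;
// Assumption: the exact package and attributes of RegisterAIService depend on the
// smallrye-llm artifact you use; check them against that release.
import io.smallrye.llm.spi.RegisterAIService;

// Hypothetical AI service: smallrye-llm builds the LangChain4J AI Service behind
// this interface and exposes it as a CDI bean.
@RegisterAIService
public interface WildFlyAssistant {

    @SystemMessage("You are a WildFly expert.")
    String answer(String question);
}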

The feature pack

The feature pack is composed of a subsystem and a CDI extension. It provides several layers depending on what you are trying to achieve. The list is not exhaustive and may vary in the future.

The layers

This feature pack provides several layers to choose from when provisioning your server:

  • Support for chat models to interact with an LLM via ChatLanguageModel:

    • mistral-ai-chat-model: provides integration with Mistral AI

    • ollama-chat-model: provides integration with Ollama

    • openai-chat-model: provides integration with OpenAI (the current configuration targets Groq)

  • Support for embedding models via EmbeddingModel:

    • in-memory-embedding-model-all-minilm-l6-v2

    • in-memory-embedding-model-all-minilm-l6-v2-q

    • in-memory-embedding-model-bge-small-en

    • in-memory-embedding-model-bge-small-en-q

    • in-memory-embedding-model-bge-small-en-v15

    • in-memory-embedding-model-bge-small-en-v15-q

    • in-memory-embedding-model-e5-small-v2

    • in-memory-embedding-model-e5-small-v2-q

    • ollama-embedding-model: using Ollama to compute embeddings.

  • Support for embedding stores via EmbeddingStore:

    • in-memory-embedding-store: provides integration with an in-memory embedding store (for demo purposes only).

    • weaviate-embedding-store: provides integration with Weaviate.

  • Support for content retrievers via ContentRetriever for RAG:

    • default-embedding-content-retriever: default content retriever using an in-memory-embedding-store and in-memory-embedding-model-all-minilm-l6-v2 as the embedding model.

    • web-search-engines: provides support for the Google and Tavily search engines.

The layers will provide the required management resources, operations and modules.

Provisioning the feature pack

You provision the feature pack with the wildfly-maven-plugin, for example:

<!-- The WildFly plugin deploys your war to a local WildFly container -->
<plugin>
    <groupId>org.wildfly.plugins</groupId>
    <artifactId>wildfly-maven-plugin</artifactId>
    <version>${version.wildfly.maven.plugin}</version>
    <configuration>
        <feature-packs>
            <feature-pack>
                <location>org.wildfly:wildfly-galleon-pack:${version.wildfly.bom}</location>
            </feature-pack>
            <feature-pack>
                <location>org.wildfly:wildfly-ai-galleon-pack:0.1.0</location>
            </feature-pack>
        </feature-packs>
        <layers>
            <layer>cloud-server</layer>
            <layer>ollama-chat-model</layer>
            <layer>default-embedding-content-retriever</layer>
            <!-- default-embedding-content-retriever provides the following layers -->
            <!--
                <layer>in-memory-embedding-model-all-minilm-l6-v2</layer>
                <layer>in-memory-embedding-store</layer>
            -->
            <!-- Existing layers that can be used instead-->
            <!--
                <layer>ollama-embedding-model</layer>
                <layer>openai-chat-model</layer>
                <layer>mistral-ai-chat-model</layer>
                <layer>weaviate-embedding-store</layer>
                <layer>web-search-engines</layer>
            -->
        </layers>
        <name>ROOT.war</name>
        <extraServerContentDirs>
            <extraServerContentDir>extra-content</extraServerContentDir>
        </extraServerContentDirs>
        <packagingScripts>
            <packaging-script>
                <scripts>
                    <script>./src/scripts/configure_llm.cli</script>
                </scripts>
            </packaging-script>
        </packagingScripts>
    </configuration>
    <executions>
        <execution>
            <goals>
                <goal>package</goal>
            </goals>
        </execution>
    </executions>
</plugin>

In our example we are provisioning everything via layers. The script configure_llm.cli provides sample commands to further configure the subsystem manually. Please note that not all modules might be provisioned, so you need to add the corresponding layers to the pom.xml.

### Embedding Models
# Adding the SentenceTransformers all-MiniLM-L6-v2 EmbeddingModel that runs within the server JVM.
#/subsystem=ai/embedding-model=myembedding:add(module=dev.langchain4j.embeddings.all-minilm-l6-v2, embedding-class=dev.langchain4j.model.embedding.onnx.allminilml6v2.AllMiniLmL6V2EmbeddingModel)


# Adding an Ollama EmbeddingModel connecting to http://192.168.1.11:11434 using the model llama3:8b.
#/subsystem=ai/ollama-embedding-model=test:add(base-url="http://192.168.1.11:11434", model-name="llama3:8b")

### Chat Language Models
# Adding an OpenAI REST ChatLanguageModel connecting to Groq using the model llama3-8b-8192.
#/subsystem=ai/openai-chat-model=mychat:add(base-url="https://api.groq.com/openai/v1", api-key="${env.GROQ_API_KEY}", log-requests="true", log-responses="true", model-name="llama3-8b-8192")

### Mistral
#/subsystem=ai/mistral-ai-chat-model=test:add(api-key="${env.MISTRAL_API_KEY}", base-url="https://api.mistral.ai/v1", log-requests="true", log-responses="true", model-name="mistral-small-latest")


# Adding an Ollama ChatLanguageModel connecting to http://127.0.0.1:11434 using the model llama3:8b.
#/subsystem=ai/ollama-chat-model=mychat:add(model-name="llama3.1:8b", base-url="http://127.0.0.1:11434", log-requests="true", log-responses="true", temperature="0.9")
#/subsystem=ai/ollama-chat-model=mychat:add(model-name="mistral", base-url="http://127.0.0.1:11434", log-requests="true", log-responses="true", temperature="0.9")
#/subsystem=ai/openai-chat-model=mychat:add(base-url="https://api.groq.com/openai/v1", api-key="${env.GROQ_API_KEY}",model-name="llama3:8b")


### Embedding Stores

# Adding Weaviate as an embedding store
# podman run --rm -p 8090:8080 -p 50051:50051  -e AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED="true" -v $SOME_PATH/volumes/weaviate/_data:/data --name=weaviate cr.weaviate.io/semitechnologies/weaviate:1.24.10
#/socket-binding-group=standard-sockets/remote-destination-outbound-socket-binding=weaviate:add(host=localhost, port=8090)
#/subsystem=ai/weaviate-embedding-store=mystore:add(socket-binding=weaviate, ssl-enabled=false, object-class=Simple, metadata=[url,language,parent_url,file_name,file_path,title,subtitle])
#/subsystem=logging/logger=io.weaviate.client.client:add(level=TRACE)

# Adding an in-memory embedding store loading from a JSON file
#/subsystem=ai/in-memory-embedding-store=mystore:add(file=/home/ehugonne/dev/AI/crawler/crawler/wildfly-admin-embeddings.json)


### Content retrievers

# Adding a content retriever using embeddings
#/subsystem=ai/embedding-store-content-retriever=myretriever:add(embedding-model=myembedding,embedding-store=mystore, max-results=2, min-score=0.7)

# Adding a content retriever using Tavily search engine
#/subsystem=ai/web-search-content-retriever=myretriever:add(tavily={api-key=${env.TAVILY_API_KEY}, base-url=https://api.tavily.com, connect-timeout=20000, exclude-domains=[example.org], include-domains=[example.com], include-answer=true})

Putting it all together: The WebChat example

To put it all together, we are going to execute a sample RAG application with a web interface. It will use embeddings that were previously computed from the WildFly documentation. If you want to check out the code of the application that was used to create embeddings out of the WildFly documentation and store the results, you can look at https://github.com/ehsavoie/crawler. It can be used to fill either a JSON file or a Weaviate embedding store.

The embeddings were computed using the All-MiniLM-L6-v2 EmbeddingModel, so we need to use the same model in our RAG application. This embedding model will be used to compute the embedding of the user query, and the application will then search for the nearest contents in the in-memory embedding store. The content retriever will retrieve those contents and append them to the user query to create the prompt that will be sent, via the ChatLanguageModel, to the LLM we are connected to.
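If you want to see what the retriever actually finds before the prompt is built, a small helper like the one below can be used; this is a minimal sketch that simply takes the ContentRetriever injected from the subsystem and dumps the retrieved segments:

import dev.langchain4j.rag.content.Content;
import dev.langchain4j.rag.content.retriever.ContentRetriever;
import dev.langchain4j.rag.query.Query;
import java.util.List;

public class RetrieverCheck {

    // Prints the segments the retriever considers relevant for a given question.
    static void dump(ContentRetriever retriever, String question) {
        List<Content> contents = retriever.retrieve(Query.from(question));
        for (Content content : contents) {
            System.out.println(content.textSegment().text());
        }
    }
}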

Running Ollama locally

podman run -d --rm --name ollama --replace --pull=always -p 11434:11434 -v ollama:/root/.ollama --stop-signal=SIGKILL docker.io/ollama/ollama

Execute the following command to select the expected model (type /bye to quit the ollama prompt):

podman exec -it ollama ollama run llama3.1:8b

Configuring the server

We will use the following layers:

  • default-embedding-content-retriever

  • ollama-chat-model

So the pom.xml should look like this:

<!-- The WildFly plugin deploys your war to a local WildFly container -->
<plugin>
    <groupId>org.wildfly.plugins</groupId>
    <artifactId>wildfly-maven-plugin</artifactId>
    <version>${version.wildfly.maven.plugin}</version>
    <configuration>
        <feature-packs>
            <feature-pack>
                <location>org.wildfly:wildfly-galleon-pack:${version.wildfly.bom}</location>
            </feature-pack>
            <feature-pack>
                <location>org.wildfly:wildfly-ai-galleon-pack:0.1.0</location>
            </feature-pack>
        </feature-packs>
        <layers>
            <layer>cloud-server</layer>
            <layer>default-embedding-content-retriever</layer>
            <layer>ollama-chat-model</layer>
        </layers>
        <name>ROOT.war</name>
        <packagingScripts>
            <packaging-script>
                <scripts>
                    <script>./src/main/resources/scripts/configure_llm.cli</script>
                </scripts>
            </packaging-script>
        </packagingScripts>
    </configuration>
    <executions>
        <execution>
            <goals>
                <goal>package</goal>
            </goals>
        </execution>
    </executions>
</plugin>

The code

The code itself is quite straightforward:

@ServerEndpoint(value = "/websocket/chatbot",
        configurator = org.wildfly.ai.websocket.CustomConfigurator.class)
public class RagChatBot {

    @Inject
    @Named(value = "ollama")
    ChatLanguageModel chatModel;

    @Inject
    @Named(value = "embedding-store-retriever")
    ContentRetriever retriever;

    private static final String PROMPT_TEMPLATE = "You are a WildFly expert who understands well how to administrate the WildFly server and its components\n"
            + "Objective: answer the user question delimited by  ---\n"
            + "\n"
            + "---\n"
            + "{{userMessage}}\n"
            + "---"
            + "\n Here is some data to help you:\n"
            + "{{contents}}";

    @OnMessage
    public String sayHello(String question, Session session) throws IOException {
        ChatMemory chatMemory = MessageWindowChatMemory.builder()
                .id(session.getUserProperties().get("httpSessionId"))
                .maxMessages(3)
                .build();
        ConversationalRetrievalChain chain = ConversationalRetrievalChain.builder()
                .chatLanguageModel(chatModel)
                .chatMemory(chatMemory)
                .retrievalAugmentor(createBasicRag())
                .build();
        String result = chain.execute(question).replace("\n", "<br/>");
        return result;
    }

    private RetrievalAugmentor createBasicRag() {
        return DefaultRetrievalAugmentor.builder()
                .contentRetriever(retriever)
                .contentInjector(DefaultContentInjector.builder()
                        .promptTemplate(PromptTemplate.from(PROMPT_TEMPLATE))
                        .build())
                .queryRouter(new DefaultQueryRouter(retriever))
                .build();
    }
}

The dev.langchain4j.rag.content.retriever.ContentRetriever called embedding-store-retriever defined in the subsystem is injected into our WebSocket endpoint using the @Named annotation.

It is used to instantiate a dev.langchain4j.rag.RetrievalAugmentor which is in charge of retrieving the contents and enriching the prompt with them.

In the same way, the dev.langchain4j.model.chat.ChatLanguageModel called ollama defined in the subsystem is injected into our WebSocket endpoint using the @Named annotation.

With those two elements, a dev.langchain4j.chain.ConversationalRetrievalChain is created and used to interact with the LLM and send back the answer to the client using the WebSocket.

The final code of the application is a bit more complex than what is exposed here as it tries to keep some context between user queries.

Building and Running the application

First you need to clone it from https://github.com/ehsavoie/webchat and select the branch 0.1.x.

In a console execute the following commands:

git clone https://github.com/ehsavoie/webchat.git
cd webchat
git checkout 0.1.x
mvn clean install
./target/server/bin/standalone.sh

Now that the server is started, you can access the application. You need to open the WebSocket connection and then you can ask your questions.

For example, you can ask "How do you configure a connection factory to a remote Artemis server?".
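If you prefer testing the endpoint without the web UI, you can connect with a small Jakarta WebSocket client. This is a minimal sketch, assuming the default local address ws://localhost:8080/websocket/chatbot and a WebSocket client implementation (such as Tyrus) on the classpath; note that the endpoint's CustomConfigurator may expect handshake information that only the bundled web UI provides:

import jakarta.websocket.ClientEndpoint;
import jakarta.websocket.ContainerProvider;
import jakarta.websocket.OnMessage;
import jakarta.websocket.Session;
import jakarta.websocket.WebSocketContainer;
import java.net.URI;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

@ClientEndpoint
public class ChatBotClient {

    private static final CountDownLatch ANSWER = new CountDownLatch(1);

    @OnMessage
    public void onMessage(String answer) {
        System.out.println("Answer: " + answer);
        ANSWER.countDown();
    }

    public static void main(String[] args) throws Exception {
        WebSocketContainer container = ContainerProvider.getWebSocketContainer();
        try (Session session = container.connectToServer(ChatBotClient.class,
                URI.create("ws://localhost:8080/websocket/chatbot"))) {
            session.getBasicRemote().sendText(
                    "How do you configure a connection factory to a remote Artemis server?");
            ANSWER.await(2, TimeUnit.MINUTES); // wait for the LLM answer
        }
    }
}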

If you look at the server log file, you will see the effective prompt sent to the LLM as well as its answer.

Going further

This feature pack is currently just a proof of concept and is quite limited. If you look at the code, you can easily tweak it to try several scenarios, such as using a Weaviate embedding store with metadata, or using OpenAI or Groq instead of Ollama.

All the source code of the feature pack is available on GitHub, and it can be improved by extending it with the components that may be of use to you.