Iqbal´s DLQ Help

AI CLI — Building a Production Newsletter Engine with Java, Spring Boot & LangChain4j

TL;DR

AI CLI is a Java 21 terminal application built on Spring Boot, PicoCLI, and LangChain4j that transforms files using LLMs. It powers three fully automated newsletters — researched, written, validated, and delivered entirely by AI agents, on a schedule, with zero human intervention.

This article walks you through the stack, the architecture, the production lessons, and the live output.

Why Build This?

You probably subscribe to a handful of newsletters. Some of them are great. Most are cluttered, irrelevant, and clearly optimized for someone else's agenda — not yours.

That was the starting point: what if the newsletter was built for you, by you, covering exactly what you care about?

AI CLI was born out of a simple idea — a personal information feed that is 100% in your control. No vendor lock-in. No algorithmic feed deciding what you should see. No attention-harvesting social media platform reducing your world to engagement bait.

Instead, you define exactly what you want through the power of natural language. You write prompt files in Markdown. You choose the LLM, the search engine, the data sources, and the output format. You own the code, the infrastructure, and the data.

With AI CLI, you can spin up a highly specialized newsletter in a matter of hours and run it indefinitely on a schedule via CI/CD — for example:

  • Morocco Run Radar: all verified running events near Casablanca for the next 60 days

  • IT Events Casablanca: tech meetups and conferences for developers in the city

  • Assistant Professor Jobs: academic job openings matching specific criteria

Each of these newsletters is a configuration, not a product. The engine is the same. The prompts are different.

And unlike any SaaS newsletter builder, you can swap the LLM, the embedding model, the search engine, or the delivery platform at any time — because the architecture was designed around pluggability, not dependency.

This is also the follow-up and natural evolution from JBang Meets Spring Boot & LangChain4j and Easy RAG — Using Embeddings in LangChain4j, where we built the foundations of chaining LLM calls and embedding data for context. This time, we went to production.

The Stack

| Technology | Version | Role |
|---|---|---|
| Java | 21 | Runtime: Records, Virtual Threads, modern Stream/Optional APIs |
| Spring Boot | 3.4.11 | Application framework and dependency injection |
| PicoCLI | 4.7.7 | Command-line interface (type-safe argument parsing) |
| LangChain4j | 1.11.0 | AI/LLM orchestration (the main course) |
| Playwright + Stealth4j | 1.58 / 1.1.2 | Browser automation for JS-rendered page crawling |
| MailerSend SDK | 1.4.1 | Email delivery |
| Commonmark | 0.27.1 | Markdown → HTML rendering (with GFM tables) |
| DuckDB | | In-process vector store for RAG |

The Beauty of Spring Boot + PicoCLI

Before talking about LangChain4j, let's appreciate the backbone: Spring Boot and PicoCLI working together.

PicoCLI gives us a declarative, type-safe CLI layer where every argument (--chat-model, --tools, --embedding-model, --search-engine) is parsed, validated, and converted before it reaches business logic.

Spring Boot then takes those parsed values and engineers the perfect ephemeral context for each execution. This is the key design insight: the Spring context is different for every run, because the beans loaded depend entirely on what the user requested via CLI flags.

Here is how the ChatModel bean is resolved at startup:

    @Configuration
    public class AIChatModelConfig {

        @Bean
        ChatModel chatModel(ApplicationArguments aa,
                            ProviderProperties providerProperties,
                            Environment environment,
                            List<ChatModelFactory> factories) {
            ContextUtils.ParsingMainCommand parsingCommand = ContextUtils
                .parseIntoArgs(new ContextUtils.ParsingMainCommand(), aa, environment);
            String finalModelName = parsingCommand.getMainArgs().getChatModel();

            return factories.stream()
                .filter(factory -> factory.supports(finalModelName, providerProperties))
                .findFirst()
                .map(factory -> factory.create(finalModelName, providerProperties))
                .orElseThrow(() -> new IllegalArgumentException(
                    "Unsupported chat model: " + finalModelName));
        }
    }

The same factory pattern is replicated for embedding models, vector stores, search engines, and scoring models. This means swapping from OpenAI GPT-5 to a local Ollama model is nothing more than changing a CLI flag — no code change, no reconfiguration, no rebuild. The factory resolves the right bean, and Spring wires it.
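The supports()/create() contract can be sketched in plain Java without Spring. This is only an illustration: the `ModelFactory` interface, the name-prefix matching rules, and the returned strings are assumptions made for the example, not the project's actual classes.

```java
import java.util.List;

// Hypothetical stand-in for the project's ChatModelFactory contract.
interface ModelFactory {
    boolean supports(String modelName);
    String create(String modelName);
}

class FactoryResolutionDemo {

    // Assumed matching rules; the real factories also consult provider properties.
    static final List<ModelFactory> FACTORIES = List.of(
        new ModelFactory() {
            public boolean supports(String name) { return name.startsWith("gpt-"); }
            public String create(String name) { return "openai:" + name; }
        },
        new ModelFactory() {
            public boolean supports(String name) { return name.startsWith("qwen"); }
            public String create(String name) { return "ollama:" + name; }
        });

    // The first factory that claims the model name wins; unknown names fail fast.
    static String resolve(String modelName) {
        return FACTORIES.stream()
            .filter(f -> f.supports(modelName))
            .findFirst()
            .map(f -> f.create(modelName))
            .orElseThrow(() -> new IllegalArgumentException(
                "Unsupported chat model: " + modelName));
    }

    public static void main(String[] args) {
        System.out.println(resolve("gpt-5"));
        System.out.println(resolve("qwen3:1.7b"));
    }
}
```

Adding a provider in this shape means adding one more factory to the list; the resolution code itself does not change.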

Here is the full factory resolution flow across all 5 component types:

Diagram: each CLI argument (parsed by PicoCLI) is handed to a configuration class, which asks every registered factory supports()? and produces the resolved bean.

  • --chat-model → AIChatModelConfig → OpenAiChatModelFactory, GeminiChatModelFactory, DeepSeekChatModelFactory, GroqChatModelFactory, OllamaChatModelFactory → ChatModel

  • --embedding-model → AIEmbeddingModelConfig → OpenAiEmbeddingModelFactory, GeminiEmbeddingModelFactory, VoyageAiEmbeddingModelFactory, OllamaEmbeddingModelFactory, LocalEmbeddingModelFactory → EmbeddingModel

  • --embedding-store → AIEmbeddingStoreConfig → DuckDbEmbeddingStoreFactory, LuceneEmbeddingStoreFactory, QdrantEmbeddingStoreFactory → EmbeddingStore

  • --search-engine → AISearchEngineConfig → TavilySearchEngineFactory, GoogleSearchEngineFactory, DuckDuckGoSearchEngineFactory, StubSearchEngineFactory → WebSearchEngine

  • --scoring-model → AIScoringModelConfig → VoyageAiScoringModelFactory, OnnxScoringModelFactory → ScoringModel

Similarly, every tool (search, crawler, RAG, validation) is a Spring @Component guarded by a custom @Conditional annotation. If you don't ask for search, the WebSearchTool bean is never instantiated. This keeps the runtime lean and the configuration explicit.

--tools=search,rag,content_crawler,json_schema_validate

This single flag composes a different Spring context every time. You get the flexibility and versatility of a plugin architecture without writing a single plugin loader — because Spring's dependency injection is the plugin loader.
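To make the flag-to-context idea concrete, here is a small sketch of turning the `--tools=` argument into a typed set that conditions can query. The `Tool` enum below is an illustrative subset and the parsing rules are assumptions; the project's own ContextUtils is not shown in this article.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

class ToolFlagDemo {

    // Illustrative subset of the tool names mentioned in the article.
    enum Tool { SEARCH, RAG, CONTENT_CRAWLER, JSON_SCHEMA_VALIDATE }

    // Turns "--tools=search,rag" into a typed set; any other argument yields an empty set.
    // Unknown tool names make Tool.valueOf throw IllegalArgumentException.
    static Set<Tool> parse(String arg) {
        String prefix = "--tools=";
        EnumSet<Tool> tools = EnumSet.noneOf(Tool.class);
        if (arg == null || !arg.startsWith(prefix)) {
            return tools;
        }
        Arrays.stream(arg.substring(prefix.length()).split(","))
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .map(s -> Tool.valueOf(s.toUpperCase(Locale.ROOT)))
            .forEach(tools::add);
        return tools;
    }

    public static void main(String[] args) {
        Set<Tool> tools = parse("--tools=search,rag,content_crawler,json_schema_validate");
        // A condition class then reduces to a membership check on this set.
        System.out.println("search enabled: " + tools.contains(Tool.SEARCH));
    }
}
```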

LangChain4j — The Main Course

LangChain4j is the orchestration layer that connects your Java code to the LLM world. The project uses 16 LangChain4j modules covering:

  • Chat models (OpenAI, Gemini, DeepSeek, Groq, Ollama)

  • Embedding models (OpenAI, Gemini, Voyage AI, Ollama, local ONNX)

  • Vector stores (DuckDB, Lucene, Qdrant)

  • Search engines (Tavily, Google Custom Search, DuckDuckGo)

  • RAG infrastructure (document splitting, content retrieval, query routing, reranking)

  • Agentic workflows (UntypedAgent, AgenticScope)

It's an impressive library that lets you go from "call an LLM" to "build a multi-agent pipeline with RAG, tool calling, and structured output" — all in pure Java.

Architecture Overview

The application follows a layered architecture, where each layer has a clear responsibility:

Diagram: the layers and their members, from top to bottom.

  • CLI Layer (PicoCLI): MainCommand (global options), TransformCommand, TransformAgenticCommand, ForwardCommand

  • Configuration Layer (produces beans): AIChatModelConfig, AIEmbeddingModelConfig, AIEmbeddingStoreConfig, AIScoringModelConfig, AISearchEngineConfig

  • Service Layer: TransformService, TransformAgenticService, ForwardService

  • Pipeline Builders: AssistantService, AgenticAssistantService, SequentialAgenticWorkflowService

  • Tool Layer (LLM-callable): WebSearchTool, SocialMediaSearchTool, ContentCrawlerTool, TimeTool, JsonSchemaValidationTool, HtmlValidationTool, MarkdownValidationTool, ApifyInstagramScraperTool, ApifyFacebookScraperTool, DistanceTool, ToolRateLimiter

  • RAG System: IngestionService, ContentRetrievalService, ConfigurableDocumentSplitter, MetadataEnrichedTransformer

  • Output Dispatch: OutputServiceProvider, NewFileOutputService, ReplaceFileOutputService, MailOutputService, ButtondownOutputService

  • External Services: LLM Providers (OpenAI, Gemini, DeepSeek, Groq, Ollama), Embedding Providers (Voyage AI, OpenAI, Gemini, Ollama, ONNX), Vector Stores (DuckDB, Lucene, Qdrant), Search Engines (Tavily, Google, DuckDuckGo), MailerSend, Buttondown API, Apify Actors, Google Web Risk, Nominatim Geocoding, Playwright Browser

Four key design principles power this architecture:

  1. Factory Pattern for AI Components — Every LLM component is resolved via a *Factory interface matched against CLI arguments.

  2. Spring @Conditional for Tool Activation — Each tool bean is conditionally instantiated based on the --tools flag.

  3. Optional<> Constructor Injection — Services accept optional dependencies (Optional<IngestionService>, Optional<WebSearchTool>) so that missing beans don't break the wiring.

  4. Stage-Scoped Tools — Each prompt file can declare its own tools in YAML front matter. Tools are resolved per-stage, not globally.
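Principle 3 is easy to show in isolation. Spring resolves an `Optional<T>` constructor parameter to an empty Optional when no matching bean exists, so the consuming class can be written as below. The `Ingestion` and `SearchService` classes are hypothetical stand-ins invented for this sketch, not the project's services.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class OptionalInjectionDemo {

    // Hypothetical stand-in for an optional collaborator such as IngestionService.
    static class Ingestion {
        final List<String> ingested = new ArrayList<>();
        void ingest(String doc) { ingested.add(doc); }
    }

    // The service works with or without the optional dependency.
    static class SearchService {
        private final Optional<Ingestion> ingestion;

        SearchService(Optional<Ingestion> ingestion) {
            this.ingestion = ingestion;
        }

        String search(String query) {
            String result = "result for " + query;
            // The side effect only happens when the collaborator is present.
            ingestion.ifPresent(i -> i.ingest(result));
            return result;
        }
    }

    public static void main(String[] args) {
        Ingestion store = new Ingestion();
        new SearchService(Optional.of(store)).search("marathon");
        new SearchService(Optional.empty()).search("trail");
        System.out.println(store.ingested);
    }
}
```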

From Multi-Stage Pipelines to Agentic Workflows

This section tells the evolution story — from simple chaining to production-grade agent orchestration.

Level 1: Simple Multi-Stage Prompt Pipeline

The simplest building block is the multi-stage prompt pipeline. You define a directory of numbered Markdown files, each containing a prompt.

Here is a visual overview of the pipeline flow:

Sequence diagram. Participants: User, AssistantService, Stage 1 - Researcher, Stage 2 - Writer, Tool Layer, RAG Store, Output Service. Messages, in order:

  1. Input + Prompt Directory

  2. Prompt 1 + tools config

  3. search, crawl, validate

  4. Auto-ingest search results

  5. Tool results

  6. Stage 1 output (JSON)

  7. Stage 1 output as input

  8. Stage 2 output (Markdown)

  9. Final result

  10. Email / File / API

A transformation directory looks like this:

    transformations/moroccan_runners/
    ├── 1-research.md        # Stage 1: research via web search
    └── 2-presentation.md    # Stage 2: format into newsletter

Each file can optionally declare tools in YAML front matter:

    ---
    tools: [search, search_social_media, content_crawler, rag, rerank, now, json_schema_validate, apify_instagram_scraper, distance]
    ---
    # Prompt: Morocco Running Events Researcher (Phase 1)

    You are an expert researcher tasked with finding running events in Morocco...
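For readers who want to see what "declaring tools in front matter" boils down to, here is a deliberately minimal, hand-rolled reader. It assumes the tools are written as a single-line inline list (`tools: [a, b]`) and ignores every other key; a real implementation would more likely use a YAML parser, and nothing here reflects the project's actual loader.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FrontMatterDemo {

    // Extracts the tool names from a leading "--- ... ---" block; returns an empty list if absent.
    static List<String> parseTools(String promptFile) {
        String[] lines = promptFile.split("\n");
        if (lines.length == 0 || !lines[0].trim().equals("---")) {
            return List.of();
        }
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.equals("---")) {
                break; // end of front matter, stop before the prompt body
            }
            if (line.startsWith("tools:")) {
                String value = line.substring("tools:".length()).trim();
                // Strip the surrounding [ ] of the inline list.
                value = value.replaceAll("^\\[|\\]$", "");
                return Arrays.stream(value.split(","))
                    .map(String::trim)
                    .filter(s -> !s.isEmpty())
                    .collect(Collectors.toList());
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        String prompt = "---\ntools: [search, rag, now]\n---\n# Prompt: Researcher\nYou are an expert...";
        System.out.println(parseTools(prompt));
    }
}
```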

At runtime, these files are loaded, parsed, and chained together using Function::andThen:

    public Function<String, String> build(AssistantRequest ar) {
        return ar.prompts().stream()
            .map(pd -> build(pd, ar))                        // Build a GenericAssistant per stage
            .map(this::safeAssistantStep)                    // Wrap with null-safety
            .reduce(Function.identity(), Function::andThen); // Chain stages
    }

Each stage gets its own isolated InMemoryChatMemoryStore — this is critical for defeating context pollution, where the tool-call noise from Stage 1 leaks into Stage 2 and causes hallucinations:

    InMemoryChatMemoryStore isolatedMemoryStore = new InMemoryChatMemoryStore();
    ChatMemory chatMemory = MessageWindowChatMemory.builder()
        .maxMessages(1000)
        .chatMemoryStore(isolatedMemoryStore)
        .build();

The output of Stage 1 becomes the input of Stage 2 — clean, focused, no residual tool-call artifacts.
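The chaining itself needs nothing beyond java.util.function. The sketch below reproduces the reduce-with-andThen idiom using plain lambdas in place of LLM-backed stages; the `safe` wrapper is an assumed interpretation of the null-safety step (a null result is passed on as an empty string), since the project's safeAssistantStep is not shown.

```java
import java.util.List;
import java.util.function.Function;

class StageChainDemo {

    // Wraps a stage so a null result is passed on as an empty string (assumed behaviour).
    static Function<String, String> safe(Function<String, String> stage) {
        return input -> {
            String out = stage.apply(input);
            return out == null ? "" : out;
        };
    }

    // Folds the stages left to right: the output of stage N is the input of stage N+1.
    static Function<String, String> chain(List<Function<String, String>> stages) {
        return stages.stream()
            .map(StageChainDemo::safe)
            .reduce(Function.identity(), Function::andThen);
    }

    public static void main(String[] args) {
        Function<String, String> research = in -> "{\"events\": \"" + in + "\"}";
        Function<String, String> present = json -> "# Newsletter\n" + json;
        System.out.println(chain(List.of(research, present)).apply("Casablanca 10K"));
    }
}
```

Starting the fold from Function.identity() means an empty prompt directory yields a pipeline that simply returns its input.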

Level 2: Agentic Workflows with transform-agentic

The multi-stage pipeline was powerful but had limitations: all stages shared the same flat Function<String, String> interface. There was no structured memory, no explicit input/output contracts, and no way for Stage 3 to directly read Stage 1's output without it being piped through Stage 2.

Enter transform-agentic — the evolution from chaining functions to orchestrating LangChain4j UntypedAgents via an AgenticScope (a shared state dictionary).

Each stage is now defined by a richer frontmatter contract:

    ---
    name: editor-in-chief
    description: Produces the editorial brief for downstream agents.
    tools: [now, json_schema_validate]
    input_keys: [input]
    output_key: brief
    ---
    You are the editor-in-chief for "Morocco Run Radar"...
    ---
    Use {{input}} as the raw operator request and convert it into the editorial brief.

Key differences from the simple pipeline:

  • Explicit input_keys and output_key — each agent declares what it reads and what it writes

  • Separate system message and user message sections (separated by ---)

  • Named agents — each stage has a human-readable name for logging and debugging

  • State dictionary (AgenticScope) — agents communicate via a shared map, not piping strings

Here is how agents are built and composed:

    // Build each agent from its frontmatter definition
    var builder = AgenticServices.agentBuilder()
        .chatModel(chatModel)
        .chatMemory(chatMemory) // isolated per stage
        .name(prompt.name())
        .description(prompt.description())
        .systemMessage(prompt.systemMessage())
        .userMessage(prompt.userMessage())
        .inputs(prompt.inputKeys().stream()
            .map(key -> new AgentArgument(String.class, key))
            .toArray(AgentArgument[]::new))
        .outputKey(prompt.outputKey())
        .tools(stageTools)
        .maxSequentialToolsInvocations(hardLimit);

    // Compose all agents into a sequential workflow
    return AgenticServices.sequenceBuilder()
        .name("transform-agentic-sequence")
        .subAgents(promptAgents.toArray())
        .outputKey(request.transformation().lastOutputKey())
        .build();

The real-world Moroccan Runners agentic pipeline is a 3-agent sequence:

| Agent | Role | Input Key | Output Key | Tools |
|---|---|---|---|---|
| Editor-in-Chief | Produces the editorial brief | input | brief | now, json_schema_validate |
| Search Specialist | Researches and verifies events | brief | research | search, crawler, apify, distance, rag, rerank |
| Output Specialist | Formats the newsletter | research | newsletter | markdown_validate, markdown_security |

The state flows through the scope as shown in this diagram:

Diagram: how state flows from the prompt files through the agents to delivery.

  • Prompt-as-Code (Markdown files): 1-editor.md (tools: now, json_schema_validate; input → brief), 2-search-specialist.md (tools: search, crawler, rag, rerank, apify; brief → research), 3-output-specialist.md (tools: markdown_validate, markdown_security; research → newsletter)

  • AgenticAssistantService (per-agent factory): Agent 1: Editor-in-Chief (isolated ChatMemory; tools: now, json_validate; no RAG), Agent 2: Search Specialist (isolated ChatMemory; tools: search, crawler, apify, social_media, distance; RAG + Reranker), Agent 3: Output Specialist (isolated ChatMemory; tools: markdown_validate; Security Scan)

  • SequentialAgenticWorkflowService: sequenceBuilder().subAgents(editor, researcher, writer).outputKey('newsletter'), started with invokeWithAgenticScope

  • AgenticScope (state dictionary): input: 'Find Morocco running events', brief: '{editorial JSON...}', research: '{verified events JSON...}', newsletter: '# Morocco Run Radar ...'. The editor reads 'input', the researcher reads 'brief', the writer reads 'research'.

  • Post-Processing: MarkdownSecurityValidation (Google Web Risk API), then OutputServiceProvider (MailerSend / Buttondown / File)

Each agent reads from specific keys and writes to its designated output key. The state dictionary accumulates context across the pipeline — Stage 3 can read Stage 1's output directly without it being piped through Stage 2.

This is a significant leap from Function::andThen. Each agent operates with full awareness of its role, its inputs, and its outputs — and the state dictionary provides structured memory across the pipeline.
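The state-dictionary idea can be modelled with a plain map. The sketch below is not the LangChain4j AgenticScope API; the `Agent` class and `run` method are hypothetical names used only to show agents reading declared input keys and writing one output key, including a later agent reading a key written two steps earlier.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class ScopeDemo {

    // Hypothetical agent: reads the declared keys from the scope and writes one output key.
    static class Agent {
        final List<String> inputKeys;
        final String outputKey;
        final Function<Map<String, String>, String> body;

        Agent(List<String> inputKeys, String outputKey,
              Function<Map<String, String>, String> body) {
            this.inputKeys = inputKeys;
            this.outputKey = outputKey;
            this.body = body;
        }
    }

    // Runs agents in order against one shared map and returns the accumulated state.
    static Map<String, String> run(String input, List<Agent> agents) {
        Map<String, String> scope = new LinkedHashMap<>();
        scope.put("input", input);
        for (Agent agent : agents) {
            Map<String, String> visible = new LinkedHashMap<>();
            for (String key : agent.inputKeys) {
                visible.put(key, scope.get(key)); // an agent only sees the keys it declared
            }
            scope.put(agent.outputKey, agent.body.apply(visible));
        }
        return scope;
    }

    public static void main(String[] args) {
        List<Agent> agents = List.of(
            new Agent(List.of("input"), "brief", s -> "brief(" + s.get("input") + ")"),
            new Agent(List.of("brief"), "research", s -> "research(" + s.get("brief") + ")"),
            // The last agent reads two keys, including one written two stages earlier.
            new Agent(List.of("brief", "research"), "newsletter",
                s -> s.get("brief") + " + " + s.get("research")));
        System.out.println(run("runs", agents).get("newsletter"));
    }
}
```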

Tool System — Empowering the LLM

Tools are what turn an LLM from a text generator into an agent that can act on the world. AI CLI ships with 11 callable tools, each guarded by a @Conditional annotation so it only loads when requested.

The activation flow works like this:

Diagram: from the CLI flag to the tools handed to a stage.

  • CLI --tools flag: --tools=search,rag,content_crawler,...

  • @Conditional conditions: SearchEnabledCondition, SocialMediaSearchEnabledCondition, ContentCrawlerEnabledCondition, NowEnabledCondition, JsonSchemaValidationEnabledCondition, HtmlValidationEnabledCondition, MarkdownValidationEnabledCondition, ApifyInstagramEnabledCondition, ApifyFacebookEnabledCondition, DistanceEnabledCondition, MarkdownSecurityEnabledCondition, RagEnabledCondition, RagSearchEnabledCondition, RerankerEnabledCondition

  • Activated tool beans: WebSearchTool, SocialMediaSearchTool, ContentCrawlerTool, TimeTool, JsonSchemaValidationTool, HtmlValidationTool, MarkdownValidationTool, ApifyInstagramScraperTool, ApifyFacebookScraperTool, DistanceTool

  • Activated RAG beans: IngestionService, EmbeddingStoreContentRetriever, WebSearchContentRetriever, ScoringModel

  • ToolsService dispatch: the activated beans reach ToolsService.getTools(stageToolEnums) through Optional injection

How Spring @Conditional Actually Works

The @Conditional annotation is one of Spring's most powerful mechanisms, and it runs very early in the application lifecycle — during the bean definition phase, before any bean is actually instantiated.

Here's the contract: you implement org.springframework.context.annotation.Condition, which gives you a single method — matches(). Spring calls this method while scanning your @Component or @Bean classes. If matches() returns false, the bean is never registered in the application context. It doesn't exist. No constructor is called, no dependencies are wired, no memory is allocated.

This is what our SearchEnabledCondition looks like:

    public class SearchEnabledCondition implements Condition {

        @Override
        public boolean matches(ConditionContext context, AnnotatedTypeMetadata metadata) {
            // Access the already-parsed CLI arguments from Spring's bean factory
            ApplicationArguments aa = context.getBeanFactory()
                .getBean(ApplicationArguments.class);

            // Resolve which tools the user requested via --tools=...
            List<Tool> tools = ContextUtils.resolveRequestedTools(
                aa, context.getEnvironment());

            // Only activate if "search" or "rag_search" was requested
            return Optional.ofNullable(tools)
                .filter(l -> l.contains(Tool.SEARCH) || l.contains(Tool.RAG_SEARCH))
                .isPresent();
        }
    }

Key things to notice:

  1. The ConditionContext — Spring gives you access to the BeanFactory, the Environment, the ClassLoader, and the ResourceLoader. You can inspect anything about the current application state.

  2. It runs before instantiation — this is not a runtime check. When matches() returns false, the WebSearchTool class is never constructed. This means its dependencies (the WebSearchEngine bean, for example) also don't need to exist.

  3. CLI → Condition → Bean graph — PicoCLI parses --tools=search,rag, Spring stores the arguments as ApplicationArguments, and the Condition reads them to decide which beans to load. This is how a single CLI flag reshapes the entire Spring context.

You then annotate the tool with it:

    @Component
    @Conditional(SearchEnabledCondition.class)
    public class WebSearchTool {
        // Only exists if --tools contains "search"
    }

Every tool in AI CLI follows this pattern. The result is that the bean graph is perfectly tailored to whatever the user requested — no unused beans, no wasted connections, no accidental API calls.

| Tool | Description |
|---|---|
| search | Web search via Tavily, Google, or DuckDuckGo |
| search_social_media | Targeted site: queries for Instagram, Facebook, TikTok |
| content_crawler | Full-page extraction via Playwright + Stealth4j (JS-rendered pages) |
| now | Current date/time in a timezone |
| json_schema_validate | Draft 2020-12 JSON Schema validation with actionable diagnostics |
| html_validate | Newsletter/email HTML safety and compatibility checks |
| markdown_validate | GFM-aware Markdown syntax validation |
| apify_instagram_scraper | Instagram profile scraping via Apify actors |
| apify_facebook_scraper | Facebook page scraping via Apify actors |
| distance | Geocoding + Haversine distance between cities (OpenStreetMap Nominatim) |
| markdown_security | URL extraction + Google Web Risk scanning (malware, phishing) |

Here is how the WebSearchTool is implemented — note the constructor injection, rate limiting check, query sanitization, and auto-ingestion into the RAG store:

    @Component
    @Conditional(SearchEnabledCondition.class)
    public class WebSearchTool {

        private final WebSearchEngine webSearchEngine;
        private final Optional<IngestionService> ingestionService;
        private final ToolRateLimiter rateLimiter;

        // Constructor injection...

        @Tool("Performs a web search to find relevant information.")
        public List<WebSearchOrganicResult> search(String query) {
            // Soft limit: return the LIMIT_REACHED sentinel instead of searching again
            var limitReached = rateLimiter.tryAcquire("search");
            if (limitReached.isPresent()) {
                return limitReached.get();
            }

            var sanitizedQuery = sanitize(query);
            // Assumption: the original excerpt omits how the request is built
            var request = WebSearchRequest.from(sanitizedQuery);
            var results = webSearchEngine.search(request);

            // Auto-ingest into RAG store if IngestionService is available
            ingestionService.ifPresent(service ->
                service.ingestSearchResults(results.results()));

            return results.results();
        }
    }

The auto-ingestion is the bridge between the Tool layer and the RAG layer — every search result is automatically embedded into the vector store for retrieval during the same run or future runs. This is the self-feeding loop that makes the system progressively more informed.

That said, this approach has a clear tradeoff: ingesting everything makes the RAG layer noisy over time. Not every search result is relevant, and low-quality pages dilute the store, making retrieval less precise. Some ideas for future releases:

  • Reranker as a pre-ingestion gate — the scoring model is already wired for retrieval. Running search results through it before ingesting and only keeping segments above a relevance threshold would filter out noise at the source.

  • TTL-based expiration — the ingestion_timestamp metadata is already stored. Adding a time-to-live filter at retrieval time would let stale entries (e.g., past events) naturally age out of the results.

For now, the current approach works well enough for weekly newsletters where the store is rebuilt frequently. But for long-running stores, smarter ingestion filtering will be necessary.
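As an illustration of the TTL idea, the sketch below filters already-retrieved segments by their `ingestion_timestamp` metadata (epoch milliseconds stored as a string, as in the transformer shown later in this article). It works on plain maps in memory; whether the filter is better pushed down into the vector store as a metadata filter is left open, and the class and method names are hypothetical.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class TtlFilterDemo {

    // Keeps only segments whose ingestion_timestamp (epoch millis as a string) is within the TTL.
    // Segments without a timestamp are dropped.
    static List<Map<String, String>> fresh(List<Map<String, String>> segments,
                                           long nowMillis, Duration ttl) {
        long cutoff = nowMillis - ttl.toMillis();
        return segments.stream()
            .filter(m -> m.containsKey("ingestion_timestamp"))
            .filter(m -> Long.parseLong(m.get("ingestion_timestamp")) >= cutoff)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        List<Map<String, String>> segments = List.of(
            Map.of("url", "https://example.org/new",
                   "ingestion_timestamp", String.valueOf(now)),
            Map.of("url", "https://example.org/old",
                   "ingestion_timestamp", String.valueOf(now - Duration.ofDays(90).toMillis())));
        // With a 60-day TTL only the first entry survives.
        System.out.println(fresh(segments, now, Duration.ofDays(60)));
    }
}
```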

The Infinite Tool-Loop Caveat

One of the hardest production lessons: LLMs obsessively retry failing tools. A confused model can enter an infinite loop of calling search("running events Morocco") hundreds of times, draining your API budget in minutes.

The ToolRateLimiter is the deterministic Java boundary that stops this:

    @Component
    public class ToolRateLimiter {

        private final ConcurrentHashMap<String, AtomicInteger> counters =
            new ConcurrentHashMap<>();

        public Optional<List<WebSearchOrganicResult>> tryAcquire(String toolName) {
            int limit = resolveLimit(toolName);
            AtomicInteger counter = counters.computeIfAbsent(toolName,
                k -> new AtomicInteger(0));

            if (counter.get() >= limit) {
                return Optional.of(Collections.singletonList(new WebSearchOrganicResult(
                    "SYSTEM",
                    URI.create("https://system"),
                    "LIMIT_REACHED",
                    "SYSTEM NOTIFICATION: You have reached the maximum number of allowed "
                        + toolName + " calls. Do not search again.")));
            }

            counter.incrementAndGet();
            return Optional.empty();
        }
    }

Two levels of defense:

  1. Soft limit (per-tool): The ToolRateLimiter returns a LIMIT_REACHED sentinel response — the LLM reads this as a signal to stop.

  2. Hard limit (global): LangChain4j's maxSequentialToolsInvocations() throws a Java exception and forcefully terminates if the LLM keeps calling tools beyond the hard ceiling.

The soft limit allows the LLM one last strategic attempt; the hard limit is the kill switch.
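One detail worth noting in the limiter above: the counter is read and then incremented in two separate steps, so if tool calls ever run concurrently a few extra calls could slip past the limit. A possible variant, sketched here with a hypothetical class name and a single shared limit for every tool (an assumption; the original resolves a limit per tool), does the check and the increment in one atomic update.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class AtomicToolLimiter {

    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();
    private final int limit; // assumed: one shared limit for every tool

    AtomicToolLimiter(int limit) {
        this.limit = limit;
    }

    // Returns true while the tool is under budget; check and increment are one atomic step.
    boolean tryAcquire(String toolName) {
        AtomicInteger counter = counters.computeIfAbsent(toolName, k -> new AtomicInteger());
        int previous = counter.getAndUpdate(c -> c < limit ? c + 1 : c);
        return previous < limit;
    }

    public static void main(String[] args) {
        AtomicToolLimiter limiter = new AtomicToolLimiter(2);
        for (int i = 1; i <= 3; i++) {
            System.out.println("search call " + i + " allowed: " + limiter.tryAcquire("search"));
        }
    }
}
```

When tryAcquire returns false, the caller would return the same LIMIT_REACHED sentinel as before.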

RAG — The Self-Feeding Intelligence Layer

RAG (Retrieval-Augmented Generation) in AI CLI is not a simple "embed some files and query them." It's a carefully designed two-phase pipeline that feeds itself.

Diagram: the two RAG pipelines and the store between them.

  • Ingestion Pipeline: --data files and WebSearchTool results (auto-ingest, dedup by URL) → IngestionService → ConfigurableDocumentSplitter → MetadataEnrichedTransformer → EmbeddingStoreIngestor

  • Vector Store: EmbeddingModel, EmbeddingStore (DuckDB / Lucene / Qdrant)

  • Retrieval Pipeline (ContentRetrievalService): rag tool → EmbeddingStoreContentRetriever (vector similarity); rag_search tool → WebSearchContentRetriever (live web search); rerank tool → ReRankingContentAggregator (minScore: 0.3)

  • Routing: a single retriever goes straight to the DefaultRetrievalAugmentor; multiple retrievers go through the LanguageModelQueryRouter (LLM-based routing)

  • The DefaultRetrievalAugmentor is attached to AiServices.builder().retrievalAugmentor()

Phase 1: Ingestion

Documents enter the embedding store through two paths:

  1. Static data files (--data=docs/) — loaded at startup

  2. Dynamic web search results — auto-ingested during execution by the WebSearchTool

Every document goes through the same pipeline:

Documents → ConfigurableDocumentSplitter → MetadataEnrichedTransformer → EmbeddingStoreIngestor → VectorStore

The MetadataEnrichedTextSegmentTransformer is where the magic happens. It doesn't just store the text — it enriches every segment with contextual labels:

    @Component
    public class MetadataEnrichedTextSegmentTransformer implements TextSegmentTransformer {

        @Override
        public TextSegment transform(TextSegment textSegment) {
            var metadata = textSegment.metadata().copy();
            // Keep the raw text so it can be stored next to the enriched version
            String originalText = textSegment.text();

            // Store statistics and timestamps
            metadata.put("original_text", originalText);
            metadata.put("character_count", String.valueOf(originalText.length()));
            metadata.put("ingestion_timestamp", String.valueOf(System.currentTimeMillis()));

            // Contextualize: prepend filename or title to the embedded text
            String contextPrefix = "";
            if (metadata.containsKey("file_name")) {
                contextPrefix = metadata.getString("file_name") + "\n";
            } else if (metadata.containsKey("title")) {
                contextPrefix = "Title: " + metadata.getString("title") + "\n";
            }

            return TextSegment.from(contextPrefix + originalText, metadata);
        }
    }

Why prepend the filename or title to the embedded text? Because embedding models retrieve based on semantic similarity, and a naked paragraph of text about "registration opens April 15" is meaningless without the context of which event it belongs to. The label grounds the embedding.

For web search results, the IngestionService also handles deduplication by URL before ingesting — it queries the store's metadata filter to check if it has already been seen:

    private boolean exists(String url) {
        var request = EmbeddingSearchRequest.builder()
            .queryEmbedding(embeddingModel.embed(url).content())
            .filter(MetadataFilterBuilder.metadataKey("url").isEqualTo(url))
            .maxResults(1)
            .build();
        return !embeddingStore.search(request).matches().isEmpty();
    }

Phase 2: Retrieval

When a stage requests RAG (via tools: [rag] in front matter), the ContentRetrievalService builds a DefaultRetrievalAugmentor:

  • Single retriever (just rag) → direct attachment for efficiency

  • Multiple retrievers (rag + rag_search) → LanguageModelQueryRouter with LLM-based routing and ROUTE_TO_ALL fallback

  • Reranking (when rerank is active) → ReRankingContentAggregator with a minScore(0.3) threshold

Reranking is essential in production. Embedding similarity alone returns many segments, but not all are relevant. The reranker (Voyage AI API or a local ONNX model) re-scores the results based on semantic relevance to the actual query, filtering out noise.
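What the minScore threshold does can be shown without any model at all. In the sketch below the scores are made-up numbers standing in for a reranker's output, and `Scored` is a hypothetical helper type; whether the real aggregator treats the boundary as inclusive is not something this example asserts.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class RerankFilterDemo {

    // Hypothetical pairing of a text segment with the score a reranker assigned to it.
    static class Scored {
        final String text;
        final double score;
        Scored(String text, double score) { this.text = text; this.score = score; }
    }

    // Drops segments below minScore and returns the rest, best first.
    static List<String> keepRelevant(List<Scored> scored, double minScore) {
        return scored.stream()
            .filter(s -> s.score >= minScore)
            .sorted(Comparator.comparingDouble((Scored s) -> s.score).reversed())
            .map(s -> s.text)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Scored> scored = List.of(
            new Scored("Cookie banner text", 0.05),
            new Scored("Registration opens April 15", 0.82),
            new Scored("Race starts at 8:00 in Casablanca", 0.47));
        System.out.println(keepRelevant(scored, 0.3));
    }
}
```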

This two-phase architecture means the RAG layer is not static. It grows during every pipeline run as web search results are auto-ingested, and future queries benefit from the enriched store. It's a self-improving loop.

Output System & Newsletter Delivery

After the pipeline finishes, the result needs to go somewhere. The OutputServiceProvider dispatches to the right handler based on the --output-mode flag:

| Mode | Handler | Flow |
|---|---|---|
| new_file | NewFileOutputService | Write to a new timestamped file |
| replace_file | ReplaceFileOutputService | Overwrite the input file |
| mail | MailOutputService | Markdown → HTML (Commonmark with GFM tables) → MailerSend email |
| buttondown | ButtondownOutputService | Raw Markdown → Buttondown newsletter API |

Here is the full output dispatch flow:

Diagram: how the pipeline output reaches its destination.

  • Pipeline Output: LLM Output (Markdown/Text), with the mode taken from OutputCapableRequest.outputMode()

  • OutputServiceProvider.provide(request, output) asks each OutputHandler implementation supports()?

  • NewFileOutputService (mode: new_file) → write → New File

  • ReplaceFileOutputService (mode: replace_file) → overwrite → Input File

  • MailOutputService (mode: mail) → Mail Processing: Commonmark Parser → HtmlRenderer (+GFM Tables) → MailerSend SDK → Email Delivery

  • ButtondownOutputService (mode: buttondown) → Buttondown Processing: POST /v1/emails {body: markdown} → Newsletter Published

The mail flow is particularly interesting: Commonmark parses the Markdown, renders it to HTML with GFM table extensions, and the MailerSend SDK delivers it to configured recipients. All automated, all from a cron-triggered GitLab CI job.

The CI/CD pipeline for newsletters is straightforward:

Diagram: the scheduled pipeline, stage by stage.

  • GitLab CI Scheduled Pipeline: Cron Schedule

  • Build Stage: mvn clean package → ai-cli.jar

  • Newsletter Jobs, each running the 2-Stage Transform Pipeline: Stage 1: Research (search, scrape, validate JSON) → Stage 2: Presentation (format, validate Markdown)

  • Output Delivery: --output=mail → MailerSend (Markdown to HTML email); --output=buttondown → Buttondown (raw Markdown newsletter)

Each newsletter is a separate CI job with its own transformation directory, tools, and delivery configuration. Adding a new newsletter is creating a new prompt directory and a new CI job — nothing else.

Production Realities — Hard-Won Lessons

This is the most important section. Building an LLM application is easy. Keeping it running reliably in production is hard.

Context Pollution

When multiple stages share the same chat memory, Stage 2 sees all of Stage 1's tool calls — including failed attempts, retries, and debugging noise. The LLM starts hallucinating based on stale tool-call artifacts.

Fix: Every stage gets its own isolated InMemoryChatMemoryStore. Stage 2 never sees Stage 1's conversations.

Tool Isolation

The generation agent (Stage 2) must never have access to web-scraping or search tools. If it does, it will try to "verify" its own output by searching, find contradictory results, and enter a loop of self-correction that produces garbage.

Fix: Front matter tool declarations per stage. The researcher has search, crawler, apify. The writer has markdown_validate. They never overlap.
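A cheap way to keep this rule from regressing is to check it mechanically. The sketch below is an assumption about how such a check could look, using an illustrative subset of tool names; it is not part of the project as described.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

class ToolIsolationDemo {

    // Illustrative subset of the tool names used in the article.
    enum Tool { SEARCH, CONTENT_CRAWLER, APIFY_INSTAGRAM_SCRAPER, MARKDOWN_VALIDATE }

    // True when the two stages share no tool at all.
    static boolean isolated(Set<Tool> researcher, Set<Tool> writer) {
        return Collections.disjoint(researcher, writer);
    }

    public static void main(String[] args) {
        Set<Tool> researcher = EnumSet.of(
            Tool.SEARCH, Tool.CONTENT_CRAWLER, Tool.APIFY_INSTAGRAM_SCRAPER);
        Set<Tool> writer = EnumSet.of(Tool.MARKDOWN_VALIDATE);
        System.out.println("stages isolated: " + isolated(researcher, writer));
    }
}
```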

Hybrid Predictability

LLMs are non-deterministic. But newsletters need consistent structure. The solution is deterministic Java guardrails around non-deterministic LLM output:

  • JSON Schema validation (json_schema_validate) enforces the exact structure of Stage 1's output

  • Markdown validation (markdown_validate) catches malformed formatting before delivery

  • Security scanning (markdown_security) checks every URL against Google Web Risk before the email goes out

The LLM is creative. Java is the enforcer.

The Refeed Loop

The best newsletter outputs are validated, curated, and structurally sound. By ingesting these outputs back into the vector database, future generations are improved — they can reference previous editions as examples of good formatting, successful event verification, and proper structure. The system improves itself.

Testing — The Quest for the Right Model

The Model Testing Journey

We tested multiple models across different providers and scenarios.

Tool calling turned out to be a bit of a challenge. The LLM doesn't just generate text — it needs to decide when to call tools, how to interpret the results, and when to stop. Not all models handle this well.

Some models would enter infinite tool loops, calling search 200 times with the same query. Others would skip the validation tool before returning their final answer. Some would ignore the JSON schema entirely.

We tested across OpenAI, Gemini, DeepSeek, Groq, and local Ollama models. For local zero-cost testing, Qwen 3 (1.7B) turned out to be well suited for the job. At 1.7 billion parameters it runs fast on local machines, and it was fairly consistent with tool calling — it follows structured prompts, calls tools in the right order, and respects validation cycles. It became the model powering our entire zero-cost test suite.

Zero-Cost Integration Testing

The project runs 43 integration tests — all via shell scripts, no unit tests. The entire test suite runs against a local stack that costs exactly $0:

| Production (Paid) | Zero-Cost Alternative |
|---|---|
| OpenAI / DeepSeek chat | Ollama qwen3:1.7b |
| Voyage AI embeddings | Ollama mxbai-embed-large |
| Google Custom Search | DuckDuckGo / Stub engine |
| Qdrant Cloud vector store | DuckDB (in-process) |
| Voyage AI reranker | ONNX ms-marco-mini-l6-v2 |

This is possible because of the factory pattern: same code, same tests, different beans. Switching from --chat-model=gpt-5 to --chat-model=qwen3:1.7b loads a completely different Spring context with zero code changes.
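
The selection logic can be illustrated without Spring. The sketch below is a plain-Java stand-in for conditional bean wiring, under stated assumptions: the ChatBackend interface, the ChatBackendFactory class, and the name-matching rules are all hypothetical, and the real project resolves LangChain4j models through Spring configuration.

```java
/** Hypothetical chat abstraction; the real project wires LangChain4j models as Spring beans. */
interface ChatBackend {
    String provider();
}

final class ChatBackendFactory {

    /** Picks a backend from the model name, much as a conditional bean definition would. */
    static ChatBackend forModel(String modelName) {
        // Assumption: Ollama models use "name:tag" identifiers such as qwen3:1.7b.
        if (modelName.contains(":")) {
            return () -> "ollama";
        }
        if (modelName.startsWith("gpt-")) {
            return () -> "openai";
        }
        if (modelName.startsWith("deepseek")) {
            return () -> "deepseek";
        }
        throw new IllegalArgumentException("No backend registered for " + modelName);
    }

    public static void main(String[] args) {
        System.out.println(forModel("gpt-5").provider());
        System.out.println(forModel("qwen3:1.7b").provider());
    }
}
```

Callers depend only on ChatBackend, so a test run and a production run exercise the same code paths and differ only in which implementation the factory hands back.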

[Figure: Zero-Cost Alternatives via the Spring @Conditional factory pattern. Same code, same tests, different bean. 43 integration tests at $0 API cost. Production (paid) options map to zero-cost alternatives as follows:]

  • Chat Model: OpenAI, DeepSeek, Anthropic, Google → Ollama (qwen3, llama3) / Llama.cpp, GPT4All

  • Embedding Model: Voyage AI, OpenAI, Cohere → Ollama (mxbai, nomic) / ONNX all-MiniLM-L6-v2

  • Search Engine: Google Custom Search, Serper → DuckDuckGo / Tavily (free tier)

  • Vector Store: Qdrant Cloud, Pinecone, Weaviate → DuckDB (in-process) / Chroma, In-Memory

  • Scoring / Reranker: Cohere, Voyage AI → ONNX ms-marco-mini (local, no API)

    # Full zero-cost suite (43 tests)
    ./scripts/test_integration_ollama.sh

    # Dedicated agentic workflow coverage
    ./scripts/test_transform_agentic_ollama.sh

The tests cover: basic transformations, RAG, data ingestion, search tools, social media search, content crawler (JS-rendered pages), tool execution limits, validation tools (JSON, HTML, Markdown), reranking (ONNX), and output modes.

Assertion Strategy: Structure Over Content

LLM output is non-deterministic, so exact-string assertions are a recipe for flaky tests. Instead, we assert on structure:

  • "Did it return valid JSON with the required fields?"

  • "Does the output contain at least one event?"

  • "Was the validation tool called before the final answer?"

This approach gives us reliable CI/CD without fighting non-determinism.
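
The project's own tests are shell scripts, but the same structural checks can be sketched in Java. Everything named here is an assumption for illustration: the "events" field, the "final_answer" trace marker, and the StructureAssertions class are hypothetical, and a real test would parse the JSON with a proper library rather than a regular expression.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical structural checks; the project's real tests are shell scripts. */
final class StructureAssertions {

    // At least one object inside an "events" array; the field name is an assumption.
    private static final Pattern HAS_EVENT =
            Pattern.compile("\"events\"\\s*:\\s*\\[\\s*\\{");

    /** Checks shape only: some event exists, whatever its content. */
    static boolean hasAtLeastOneEvent(String json) {
        return HAS_EVENT.matcher(json).find();
    }

    /** True if the named tool appears in the trace before the final answer. */
    static boolean calledBeforeFinalAnswer(List<String> trace, String tool) {
        int toolIndex = trace.indexOf(tool);
        int finalIndex = trace.lastIndexOf("final_answer");
        return toolIndex >= 0 && finalIndex > toolIndex;
    }

    public static void main(String[] args) {
        String output = "{\"events\": [{\"name\": \"10K\"}]}";
        List<String> trace = List.of("search", "json_schema_validate", "final_answer");
        System.out.println(hasAtLeastOneEvent(output));
        System.out.println(calledBeforeFinalAnswer(trace, "json_schema_validate"));
    }
}
```

Neither check cares which events the model found or how it phrased them, so two different but valid runs both pass.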

The Real Output — Live Newsletters

This is not a demo. These newsletters run on a schedule and land in real inboxes.

Morocco Run Radar: Every week, the pipeline researches all upcoming running events in Morocco through web search, Instagram scraping, Facebook scraping, and official race websites. It computes the distance of each event from Casablanca, validates the data against a JSON schema, formats it into a Markdown newsletter, scans every URL for malware, and delivers it via email.

IT Events Casablanca: Targets tech professionals in Casablanca with upcoming meetups, conferences, and workshops.

Assistant Professor Jobs: Monitors academic job openings matching specific criteria and delivers a curated digest.

Each pipeline follows the same 2-stage (classic) or 3-agent (agentic) architecture. The prompt files are the only difference.

What's Next

The engine works, the newsletters ship, and the architecture holds up. But there is plenty of room to grow.

Multimodality

Right now, AI CLI operates in a text-only world. The LLM reads text prompts, searches text results, and produces text output. But the real world is not text-only — race organizers post flyers as images, trail maps are PDFs, and results are sometimes scanned documents. Introducing multimodal inputs (image understanding, PDF extraction) would let the pipeline process these richer sources directly rather than relying on whatever text happens to be on the webpage.

Streaming & Batch

The current pipeline runs synchronously — the LLM generates its full response before anything is returned. For long-running agentic workflows, streaming would provide real-time feedback and reduce perceived latency. On the other end of the spectrum, batch APIs would allow high-volume processing (e.g., ingesting hundreds of data sources at once) at reduced cost, since most providers offer batch endpoints at a discount.

Stateful Newsletters — Delta Reporting

Weekly readers lose engagement when they see the same 50 events repeated. The next evolution is giving the pipeline memory across runs by injecting the previous edition's structured JSON into the agentic scope. The agents can then highlight what changed: "3 new marathons added," "Rabat Marathon now sold out," or "Early bird registration ends tomorrow." This shifts the newsletter from a static list to a living update.
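
A delta between two editions can be computed deterministically before the agents ever see it. The following sketch assumes each edition reduces to a map from event name to status; that data shape, the EditionDelta class, and the sample events other than the Rabat Marathon are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical delta between two editions, keyed by event name with a status value. */
final class EditionDelta {

    static List<String> diff(Map<String, String> previous, Map<String, String> current) {
        List<String> changes = new ArrayList<>();
        // TreeMap gives a stable, alphabetical order for the report.
        for (Map.Entry<String, String> e : new TreeMap<>(current).entrySet()) {
            String before = previous.get(e.getKey());
            if (before == null) {
                changes.add("NEW: " + e.getKey());
            } else if (!before.equals(e.getValue())) {
                changes.add("CHANGED: " + e.getKey()
                        + " (" + before + " -> " + e.getValue() + ")");
            }
        }
        // Anything in the previous edition that no longer appears was dropped.
        for (String name : new TreeMap<>(previous).keySet()) {
            if (!current.containsKey(name)) {
                changes.add("REMOVED: " + name);
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        // Illustrative sample data, not real newsletter content.
        Map<String, String> lastWeek = Map.of("Rabat Marathon", "open", "Casablanca 10K", "open");
        Map<String, String> thisWeek = Map.of("Rabat Marathon", "sold out", "Atlas Trail", "open");
        diff(lastWeek, thisWeek).forEach(System.out::println);
    }
}
```

Passing this list into the agentic scope, rather than two full editions, keeps the prompt small and leaves the model only the job of phrasing the changes.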

Smarter Ingestion

As discussed in the RAG section, blindly ingesting all search results makes the vector store noisy over time. Future iterations could introduce relevance-scored ingestion (using the reranker as a gate), TTL-based expiration, or even limiting ingestion to only the final validated output — so the RAG layer learns from curated content, not raw web noise.
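
A relevance gate with expiration could look like the sketch below. It is an illustration under assumptions, not the project's ingestion code: the IngestionGate class, the Candidate record, the 0 to 1 score range, and the threshold values are all hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical ingestion gate: a reranker score threshold plus a time-to-live. */
final class IngestionGate {

    /** A candidate chunk with its reranker score (assumed 0..1) and fetch time. */
    record Candidate(String text, double score, Instant fetchedAt) {}

    private final double minScore;
    private final Duration ttl;

    IngestionGate(double minScore, Duration ttl) {
        this.minScore = minScore;
        this.ttl = ttl;
    }

    /** Keeps only candidates that are relevant enough and not yet expired. */
    List<Candidate> filter(List<Candidate> candidates, Instant now) {
        return candidates.stream()
                .filter(c -> c.score() >= minScore)
                .filter(c -> c.fetchedAt().plus(ttl).isAfter(now))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2026-04-13T00:00:00Z");
        IngestionGate gate = new IngestionGate(0.7, Duration.ofDays(60));
        List<Candidate> kept = gate.filter(List.of(
                new Candidate("relevant and fresh", 0.91, now.minus(Duration.ofDays(3))),
                new Candidate("off topic", 0.20, now.minus(Duration.ofDays(1))),
                new Candidate("relevant but stale", 0.88, now.minus(Duration.ofDays(90)))
        ), now);
        kept.forEach(c -> System.out.println(c.text()));
    }
}
```

Passing the clock in as a parameter keeps the gate deterministic and easy to test, in line with the guardrail approach used elsewhere in the pipeline.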

Beyond Sequential — Other Workflow Patterns

It's worth being honest about what this engine is and what it isn't. Everything described in this article uses sequential workflows — agents execute one after another in a fixed order, passing state forward. This was the right choice for newsletters: there is a natural pipeline from research → editorial → formatting → delivery, and sequence gives you predictability and debuggability.

But sequential is not the only pattern, and it's not always the best one. LangChain4j supports several others that would suit different use cases:

  • Parallel workflows — multiple agents running concurrently. Useful when you have independent research tasks (e.g., one agent searches web, another scrapes social media, a third queries a database) and want to merge results at the end.

  • Loop workflows — self-correcting agents that iterate until a quality threshold is met. Instead of validating once and hoping, the agent retries with feedback until the output passes.

  • Supervisor workflows — a manager agent that dynamically decides which worker agents to call, in what order, and how many times. This is the most powerful pattern — it handles complex, branching tasks where the right next step depends on what was learned so far.
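
To give a feel for the parallel pattern, here is a JDK-only sketch using CompletableFuture. It deliberately does not use the LangChain4j agentic API; the ParallelResearch class and the stub agents are hypothetical stand-ins for real research agents.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;
import java.util.stream.Collectors;

/** JDK-only illustration of a parallel workflow; not the LangChain4j agentic API. */
final class ParallelResearch {

    /** Runs every research task concurrently and merges the findings in task order. */
    static List<String> runAll(List<Supplier<List<String>>> tasks) {
        // Start all tasks first so they run at the same time.
        List<CompletableFuture<List<String>>> futures = tasks.stream()
                .map(task -> CompletableFuture.supplyAsync(task))
                .collect(Collectors.toList());
        // join() blocks until each task finishes; order follows the task list.
        return futures.stream()
                .flatMap(f -> f.join().stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stub agents standing in for web search, social scraping, and a database query.
        List<Supplier<List<String>>> agents = List.of(
                () -> List.of("web: event A"),
                () -> List.of("social: event B"),
                () -> List.of("db: event C"));
        System.out.println(runAll(agents));
    }
}
```

Merging in task order rather than completion order keeps the combined result stable between runs, which matters when a deterministic validation step comes next.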

For a newsletter engine, sequential gives us exactly the control we need. But if the use case grows into something more complex — say, a real-time research assistant that needs to decide dynamically whether to search, crawl, or ask a follow-up question — a supervisor or loop pattern would be the right tool for the job. The factory-based architecture makes that transition straightforward: the workflow pattern is just another bean.

Further Reading

This article builds on concepts introduced in two predecessor articles. If you want to understand the foundations before diving into the production engine:

Downloads — See It in Action

Live Presentation

LangChain4j in Action: A Walkthrough from Basic Chaining to Agentic Workflows (PDF)

Newsletter Examples (Real Output)

Morocco Run Radar — Agentic Edition (April 2026)

IT Events Casablanca Newsletter (April 2026)

Assistant Professor Jobs Newsletter (April 2026)

13 April 2026