AI CLI — Building a Production Newsletter Engine with Java, Spring Boot & LangChain4j
TL;DR
AI CLI is a Java 21 terminal application built on Spring Boot, PicoCLI, and LangChain4j that transforms files using LLMs. It powers three fully automated newsletters — researched, written, validated, and delivered entirely by AI agents, on a schedule, with zero human intervention.
This article walks you through the stack, the architecture, the production lessons, and the live output.
Why Build This?
You probably subscribe to a handful of newsletters. Some of them are great. Most are cluttered, irrelevant, and clearly optimized for someone else's agenda — not yours.
That was the starting point: what if the newsletter was built for you, by you, covering exactly what you care about?
AI CLI was born out of a simple idea — a personal information feed that is 100% in your control. No vendor lock-in. No algorithmic feed deciding what you should see. No attention-harvesting social media platform reducing your world to engagement bait.
Instead, you define exactly what you want through the power of natural language. You write prompt files in Markdown. You choose the LLM, the search engine, the data sources, and the output format. You own the code, the infrastructure, and the data.
With AI CLI, you can spin up a highly specialized newsletter in a matter of hours and run it indefinitely on a schedule via CI/CD — for example:
Morocco Run Radar: all verified running events near Casablanca for the next 60 days
IT Events Casablanca: tech meetups and conferences for developers in the city
Assistant Professor Jobs: academic job openings matching specific criteria
Each of these newsletters is a configuration, not a product. The engine is the same. The prompts are different.
And unlike any SaaS newsletter builder, you can swap the LLM, the embedding model, the search engine, or the delivery platform at any time — because the architecture was designed around pluggability, not dependency.
This is also the follow-up and natural evolution from JBang Meets Spring Boot & LangChain4j and Easy RAG — Using Embeddings in LangChain4j, where we built the foundations of chaining LLM calls and embedding data for context. This time, we went to production.
The Stack
Technology | Version | Role |
|---|---|---|
Java | 21 | Runtime — Records, Virtual Threads, modern Stream/Optional APIs |
Spring Boot | 3.4.11 | Application framework and dependency injection |
PicoCLI | 4.7.7 | Command-line interface (type-safe argument parsing) |
LangChain4j | 1.11.0 | AI/LLM orchestration — the main course |
Playwright + Stealth4j | 1.58 / 1.1.2 | Browser automation for JS-rendered page crawling |
MailerSend SDK | 1.4.1 | Email delivery |
Commonmark | 0.27.1 | Markdown → HTML rendering (with GFM tables) |
DuckDB | — | In-process vector store for RAG |
The Beauty of Spring Boot + PicoCLI
Before talking about LangChain4j, let's appreciate the backbone: Spring Boot and PicoCLI working together.
PicoCLI gives us a declarative, type-safe CLI layer where every argument (--chat-model, --tools, --embedding-model, --search-engine) is parsed, validated, and converted before it reaches business logic.
Spring Boot then takes those parsed values and engineers the perfect ephemeral context for each execution. This is the key design insight: the Spring context is different for every run, because the beans loaded depend entirely on what the user requested via CLI flags.
Here is how the ChatModel bean is resolved at startup:
The same factory pattern is replicated for embedding models, vector stores, search engines, and scoring models. This means swapping from OpenAI GPT-5 to a local Ollama model is nothing more than changing a CLI flag — no code change, no reconfiguration, no rebuild. The factory resolves the right bean, and Spring wires it.
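The resolution step described above can be sketched in plain Java, without Spring or LangChain4j on the classpath. Every name here (`ChatModelFactory`, `supports`, the model-name prefixes) is an assumption made for illustration rather than the project's actual types: each factory declares which `--chat-model` values it can handle, and the resolver picks the first match.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a provider-specific factory; not the project's real interface.
interface ChatModelFactory {
    boolean supports(String modelName);   // does this factory handle the requested model?
    String create(String modelName);      // a real factory would return a chat model instance
}

final class ChatModelResolver {
    // Illustrative factories: the prefix rules are assumptions, not the project's matching logic.
    private static final List<ChatModelFactory> FACTORIES = List.of(
        factory("gpt-", "openai"),
        factory("gemini-", "gemini"),
        factory("qwen", "ollama"));

    private static ChatModelFactory factory(String prefix, String provider) {
        return new ChatModelFactory() {
            public boolean supports(String modelName) { return modelName.startsWith(prefix); }
            public String create(String modelName) { return provider + ":" + modelName; }
        };
    }

    // Picks the first factory that claims the model name given on the command line.
    static Optional<String> resolve(String modelName) {
        return FACTORIES.stream()
            .filter(f -> f.supports(modelName))
            .findFirst()
            .map(f -> f.create(modelName));
    }

    public static void main(String[] args) {
        System.out.println(resolve("gpt-5").orElse("no factory"));
        System.out.println(resolve("qwen3:1.7b").orElse("no factory"));
    }
}
```

In a Spring application the list of factories would typically be injected rather than hard-coded, which is what lets a new provider be added without touching the resolver.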
Here is the full factory resolution flow across all 5 component types:
Similarly, every tool (search, crawler, RAG, validation) is a Spring @Component guarded by a custom @Conditional annotation. If you don't ask for search, the WebSearchTool bean is never instantiated. This keeps the runtime lean and the configuration explicit.
A single --tools flag therefore composes a different Spring context on every run. You get the flexibility of a plugin architecture without writing a single plugin loader, because Spring's dependency injection is the plugin loader.
LangChain4j — The Main Course
LangChain4j is the orchestration layer that connects your Java code to the LLM world. The project uses 16 LangChain4j modules covering:
Chat models (OpenAI, Gemini, DeepSeek, Groq, Ollama)
Embedding models (OpenAI, Gemini, Voyage AI, Ollama, local ONNX)
Vector stores (DuckDB, Lucene, Qdrant)
Search engines (Tavily, Google Custom Search, DuckDuckGo)
RAG infrastructure (document splitting, content retrieval, query routing, reranking)
Agentic workflows (UntypedAgent, AgenticScope)
It's an impressive library that lets you go from "call an LLM" to "build a multi-agent pipeline with RAG, tool calling, and structured output" — all in pure Java.
Architecture Overview
The application follows a layered architecture, where each layer has a clear responsibility:
Four key design principles power this architecture:
Factory Pattern for AI Components: every LLM component is resolved via a *Factory interface matched against CLI arguments.
Spring @Conditional for Tool Activation: each tool bean is conditionally instantiated based on the --tools flag.
Optional<> Constructor Injection: services accept optional dependencies (Optional<IngestionService>, Optional<WebSearchTool>) so that missing beans don't break the wiring.
Stage-Scoped Tools: each prompt file can declare its own tools in YAML front matter. Tools are resolved per-stage, not globally.
From Multi-Stage Pipelines to Agentic Workflows
This section tells the evolution story — from simple chaining to production-grade agent orchestration.
Level 1: Simple Multi-Stage Prompt Pipeline
The simplest building block is the multi-stage prompt pipeline. You define a directory of numbered Markdown files, each containing a prompt.
Here is a visual overview of the pipeline flow:
Each file can optionally declare tools in YAML front matter:
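As a rough illustration of what such a declaration could look like and how it might be read, the sketch below pulls a `tools: [...]` line out of a front matter block delimited by `---`. The sample prompt, the class name `FrontMatter`, and the hand-rolled parsing are all assumptions; a real implementation would more likely rely on a YAML library.

```java
import java.util.Arrays;
import java.util.List;

final class FrontMatter {
    // Returns the tools declared as `tools: [a, b]` inside a leading `---` block, or an empty list.
    static List<String> tools(String promptFile) {
        String[] lines = promptFile.split("\n");
        if (lines.length == 0 || !lines[0].trim().equals("---")) {
            return List.of();                       // no front matter at all
        }
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.equals("---")) break;          // end of the front matter block
            if (line.startsWith("tools:")) {
                String value = line.substring("tools:".length()).trim();
                value = value.replace("[", "").replace("]", "");
                return Arrays.stream(value.split(","))
                    .map(String::trim)
                    .filter(s -> !s.isEmpty())
                    .toList();
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        // Made-up prompt file used only to exercise the parser.
        String prompt = """
            ---
            tools: [search, rag]
            ---
            Research upcoming running events near Casablanca.
            """;
        System.out.println(tools(prompt));
    }
}
```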
At runtime, these files are loaded, parsed, and chained together using Function::andThen:
Each stage gets its own isolated InMemoryChatMemoryStore — this is critical for defeating context pollution, where the tool-call noise from Stage 1 leaks into Stage 2 and causes hallucinations:
The output of Stage 1 becomes the input of Stage 2 — clean, focused, no residual tool-call artifacts.
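The chaining idea can be shown with nothing but `java.util.function.Function`. This is a minimal sketch: the class name `StagePipeline` is hypothetical, and the two lambdas are placeholders for LLM-backed stages that would each be built with their own fresh chat memory.

```java
import java.util.List;
import java.util.function.Function;

final class StagePipeline {
    // Folds the stages into one function: the output of stage N is the input of stage N+1.
    static Function<String, String> chain(List<Function<String, String>> stages) {
        return stages.stream().reduce(Function.identity(), Function::andThen);
    }

    public static void main(String[] args) {
        // Stand-ins for LLM-backed stages; a real stage would call a model with its own memory.
        Function<String, String> research = input -> input + " -> researched";
        Function<String, String> write = input -> input + " -> written";
        System.out.println(chain(List.of(research, write)).apply("topic"));
    }
}
```

Because only the returned string crosses the stage boundary, whatever happened inside a stage (tool calls, retries) has no way to reach the next one.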
Level 2: Agentic Workflows with transform-agentic
The multi-stage pipeline was powerful but had limitations: all stages shared the same flat Function<String, String> interface. There was no structured memory, no explicit input/output contracts, and no way for Stage 3 to directly read Stage 1's output without it being piped through Stage 2.
Enter transform-agentic — the evolution from chaining functions to orchestrating LangChain4j UntypedAgents via an AgenticScope (a shared state dictionary).
Each stage is now defined by a richer frontmatter contract:
Key differences from the simple pipeline:
Explicit input_keys and output_key: each agent declares what it reads and what it writes
Separate system message and user message sections (separated by ---)
Named agents: each stage has a human-readable name for logging and debugging
State dictionary (AgenticScope): agents communicate via a shared map, not piping strings
Here is how agents are built and composed:
The real-world Moroccan Runners agentic pipeline is a 3-agent sequence:
Agent | Role | Input Key | Output Key | Tools |
|---|---|---|---|---|
Editor-in-Chief | Produces the editorial brief | | | |
Search Specialist | Researches and verifies events | | | |
Output Specialist | Formats the newsletter | | | |
The state flows through the scope as shown in this diagram:
Each agent reads from specific keys and writes to its designated output key. The state dictionary accumulates context across the pipeline — Stage 3 can read Stage 1's output directly without it being piped through Stage 2.
This is a significant leap from Function::andThen. Each agent operates with full awareness of its role, its inputs, and its outputs — and the state dictionary provides structured memory across the pipeline.
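To make the shared-state idea concrete, here is a plain-Java model of it. It deliberately does not use the LangChain4j agentic API: the `Agent` record, the `run` method, and the key names (`request`, `brief`, `events`, `newsletter`) are all hypothetical, chosen only to show how declared input keys and a single output key let a later agent read an earlier agent's result directly.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class ScopeSketch {
    // Hypothetical agent contract: the keys it may read, the key it writes, and its logic.
    record Agent(String name, List<String> inputKeys, String outputKey,
                 Function<Map<String, String>, String> logic) {}

    // Runs agents in order; each sees only its declared inputs and writes a single output key.
    static Map<String, String> run(List<Agent> agents, Map<String, String> initialState) {
        Map<String, String> scope = new LinkedHashMap<>(initialState);
        for (Agent agent : agents) {
            Map<String, String> visible = new LinkedHashMap<>();
            for (String key : agent.inputKeys()) {
                visible.put(key, scope.getOrDefault(key, ""));
            }
            scope.put(agent.outputKey(), agent.logic().apply(visible));
        }
        return scope;
    }

    public static void main(String[] args) {
        List<Agent> agents = List.of(
            new Agent("editor", List.of("request"), "brief", in -> "brief(" + in.get("request") + ")"),
            new Agent("search", List.of("brief"), "events", in -> "events(" + in.get("brief") + ")"),
            // The third agent reads the first agent's output directly, not via the second.
            new Agent("output", List.of("brief", "events"), "newsletter",
                in -> in.get("brief") + " + " + in.get("events")));
        System.out.println(run(agents, Map.of("request", "running events")).get("newsletter"));
    }
}
```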
Tool System — Empowering the LLM
Tools are what turn an LLM from a text generator into an agent that can act on the world. AI CLI ships with 11 callable tools, each guarded by a @Conditional annotation so it only loads when requested.
The activation flow works like this:
How Spring @Conditional Actually Works
The @Conditional annotation is one of Spring's most powerful mechanisms, and it runs very early in the application lifecycle — during the bean definition phase, before any bean is actually instantiated.
Here's the contract: you implement org.springframework.context.annotation.Condition, which gives you a single method — matches(). Spring calls this method while scanning your @Component or @Bean classes. If matches() returns false, the bean is never registered in the application context. It doesn't exist. No constructor is called, no dependencies are wired, no memory is allocated.
This is what our SearchEnabledCondition looks like:
Key things to notice:
The ConditionContext: Spring gives you access to the BeanFactory, the Environment, the ClassLoader, and the ResourceLoader. You can inspect anything about the current application state.
It runs before instantiation: this is not a runtime check. When matches() returns false, the WebSearchTool class is never constructed. This means its dependencies (the WebSearchEngine bean, for example) also don't need to exist.
CLI → Condition → Bean graph: PicoCLI parses --tools=search,rag, Spring stores the arguments as ApplicationArguments, and the Condition reads them to decide which beans to load. This is how a single CLI flag reshapes the entire Spring context.
You then annotate the tool with it:
Every tool in AI CLI follows this pattern. The result is that the bean graph is perfectly tailored to whatever the user requested — no unused beans, no wasted connections, no accidental API calls.
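The effect of this pattern can be imitated in plain Java, which keeps the example runnable without Spring. The sketch below is an analogy, not Spring's mechanism: `ConditionalRegistry` and its `build` method are hypothetical, and the `requested.contains(name)` check merely plays the role that `matches()` plays in a real `Condition`. What it shows is that a tool that was not requested is never constructed.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;
import java.util.stream.Collectors;

final class ConditionalRegistry {
    // Instantiates only the candidates named in the --tools value; the rest are never built.
    static Map<String, Object> build(String toolsFlag, Map<String, Supplier<Object>> candidates) {
        Set<String> requested = Arrays.stream(toolsFlag.split(","))
            .map(String::trim)
            .collect(Collectors.toSet());
        Map<String, Object> beans = new LinkedHashMap<>();
        candidates.forEach((name, supplier) -> {
            if (requested.contains(name)) {       // stands in for Condition.matches()
                beans.put(name, supplier.get());  // the constructor runs only here
            }
        });
        return beans;
    }

    public static void main(String[] args) {
        Map<String, Supplier<Object>> candidates = new LinkedHashMap<>();
        candidates.put("search", () -> { System.out.println("constructing search tool"); return "WebSearchTool"; });
        candidates.put("crawler", () -> { System.out.println("constructing crawler tool"); return "CrawlerTool"; });
        // "rag" is requested but has no candidate here; "crawler" has a candidate but is not requested.
        System.out.println(build("search,rag", candidates).keySet());
    }
}
```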
Tool | Description |
|---|---|
| Web search via Tavily, Google, or DuckDuckGo |
| Targeted |
| Full-page extraction via Playwright + Stealth4j (JS-rendered pages) |
| Current date/time in a timezone |
| Draft 2020-12 JSON Schema validation with actionable diagnostics |
| Newsletter/email HTML safety and compatibility checks |
| GFM-aware Markdown syntax validation |
| Instagram profile scraping via Apify actors |
| Facebook page scraping via Apify actors |
| Geocoding + Haversine distance between cities (OpenStreetMap Nominatim) |
| URL extraction + Google Web Risk scanning (malware, phishing) |
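One row of the table mentions Haversine distance between cities. For readers unfamiliar with it, this is a minimal sketch of the standard formula; the class name, the mean Earth radius of 6371 km, and the hard-coded approximate coordinates are assumptions for illustration (the tool itself is described as geocoding cities through OpenStreetMap Nominatim).

```java
final class Haversine {
    private static final double EARTH_RADIUS_KM = 6371.0;   // mean Earth radius

    // Great-circle distance in kilometres between two points given in decimal degrees.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Approximate city-centre coordinates, hard-coded here instead of geocoded.
        double casablancaToRabat = distanceKm(33.5731, -7.5898, 34.0209, -6.8416);
        System.out.printf("Casablanca -> Rabat: %.1f km%n", casablancaToRabat);
    }
}
```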
Here is how the WebSearchTool is implemented — note the constructor injection, rate limiting check, query sanitization, and auto-ingestion into the RAG store:
The auto-ingestion is the bridge between the Tool layer and the RAG layer — every search result is automatically embedded into the vector store for retrieval during the same run or future runs. This is the self-feeding loop that makes the system progressively more informed.
That said, this approach has a clear tradeoff: ingesting everything makes the RAG layer noisy over time. Not every search result is relevant, and low-quality pages dilute the store, making retrieval less precise. Some ideas for future releases:
Reranker as a pre-ingestion gate: the scoring model is already wired for retrieval. Running search results through it before ingesting and only keeping segments above a relevance threshold would filter out noise at the source.
TTL-based expiration: the ingestion_timestamp metadata is already stored. Adding a time-to-live filter at retrieval time would let stale entries (e.g., past events) naturally age out of the results.
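The TTL idea is simple enough to sketch. The code below assumes a hypothetical `Segment` record that carries the ingestion timestamp alongside the text; it is not the project's storage model, and a real version would more likely push the cutoff into the vector store's metadata filter than filter in memory.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

final class TtlFilter {
    // Hypothetical view of a stored segment: its text plus the ingestion_timestamp metadata.
    record Segment(String text, Instant ingestionTimestamp) {}

    // Keeps only segments ingested within the time-to-live window ending at `now`.
    static List<Segment> fresh(List<Segment> segments, Duration ttl, Instant now) {
        Instant cutoff = now.minus(ttl);
        return segments.stream()
            .filter(s -> !s.ingestionTimestamp().isBefore(cutoff))
            .toList();
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2026-04-15T00:00:00Z");
        List<Segment> stored = List.of(
            new Segment("ingested last week", now.minus(Duration.ofDays(7))),
            new Segment("ingested last year", now.minus(Duration.ofDays(365))));
        // With a 60-day window only the first segment survives.
        System.out.println(fresh(stored, Duration.ofDays(60), now));
    }
}
```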
For now, the current approach works well enough for weekly newsletters where the store is rebuilt frequently. But for long-running stores, smarter ingestion filtering will be necessary.
The Infinite Tool-Loop Caveat
One of the hardest production lessons: LLMs obsessively retry failing tools. A confused model can enter an infinite loop of calling search("running events Morocco") hundreds of times, draining your API budget in minutes.
The ToolRateLimiter is the deterministic Java boundary that stops this:
Two levels of defense:
Soft limit (per-tool): The ToolRateLimiter returns a LIMIT_REACHED sentinel response, which the LLM reads as a signal to stop.
Hard limit (global): LangChain4j's maxSequentialToolsInvocations() throws a Java exception and forcefully terminates if the LLM keeps calling tools beyond the hard ceiling.
The soft limit allows the LLM one last strategic attempt; the hard limit is the kill switch.
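A per-tool soft limit of this kind might look roughly like the following. Only the `LIMIT_REACHED` sentinel comes from the text above; the class name `ToolRateLimiterSketch`, the `invoke` signature, and the per-tool counter are assumptions about how such a limiter could be built.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class ToolRateLimiterSketch {
    static final String LIMIT_REACHED = "LIMIT_REACHED";

    private final int maxCallsPerTool;
    private final Map<String, AtomicInteger> calls = new ConcurrentHashMap<>();

    ToolRateLimiterSketch(int maxCallsPerTool) {
        this.maxCallsPerTool = maxCallsPerTool;
    }

    // Runs the tool while under budget; afterwards returns the sentinel without calling it.
    String invoke(String toolName, Supplier<String> tool) {
        int count = calls.computeIfAbsent(toolName, k -> new AtomicInteger()).incrementAndGet();
        if (count > maxCallsPerTool) {
            return LIMIT_REACHED;   // no exception: the model simply reads this as the tool result
        }
        return tool.get();
    }

    public static void main(String[] args) {
        ToolRateLimiterSketch limiter = new ToolRateLimiterSketch(2);
        for (int i = 0; i < 4; i++) {
            System.out.println(limiter.invoke("search", () -> "results"));
        }
    }
}
```

Returning a string instead of throwing keeps the conversation alive, so the model can still produce a final answer from what it already gathered.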
RAG — The Self-Feeding Intelligence Layer
RAG (Retrieval-Augmented Generation) in AI CLI is not a simple "embed some files and query them." It's a carefully designed two-phase pipeline that feeds itself.
Phase 1: Ingestion
Documents enter the embedding store through two paths:
Static data files (--data=docs/): loaded at startup
Dynamic web search results: auto-ingested during execution by the WebSearchTool
Every document goes through the same pipeline:
The MetadataEnrichedTextSegmentTransformer is where the magic happens. It doesn't just store the text — it enriches every segment with contextual labels:
Why prepend the filename or title to the embedded text? Because embedding models retrieve based on semantic similarity, and a naked paragraph of text about "registration opens April 15" is meaningless without the context of which event it belongs to. The label grounds the embedding.
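The labelling step can be illustrated in a few lines. This sketch does not reproduce the MetadataEnrichedTextSegmentTransformer; the class `SegmentLabeler`, the metadata keys `title` and `file_name`, and the bracketed label format are all assumptions, and the event name is made up.

```java
import java.util.Map;

final class SegmentLabeler {
    // Prepends a source label (title if present, else file name) to the text that gets embedded.
    static String enrich(String segmentText, Map<String, String> metadata) {
        String label = metadata.getOrDefault("title", metadata.getOrDefault("file_name", ""));
        return label.isBlank() ? segmentText : "[" + label + "] " + segmentText;
    }

    public static void main(String[] args) {
        // The bare sentence says nothing about which event it refers to; the label supplies that.
        System.out.println(enrich("Registration opens April 15", Map.of("title", "Example Marathon")));
    }
}
```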
For web search results, the IngestionService also handles deduplication by URL before ingesting — it queries the store's metadata filter to check if it has already been seen:
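A simplified version of the deduplication check is sketched below. It swaps the store's metadata filter for an in-memory set, so it only covers a single run; the class name and the trailing-slash normalisation are assumptions added for illustration.

```java
import java.util.HashSet;
import java.util.Set;

final class UrlDeduplicator {
    // In-memory stand-in for querying the vector store's metadata by URL.
    private final Set<String> seenUrls = new HashSet<>();

    // Returns true the first time a URL is offered, false for repeats.
    boolean shouldIngest(String url) {
        return seenUrls.add(normalize(url));
    }

    // Light normalisation (an assumption): trim whitespace and drop one trailing slash.
    static String normalize(String url) {
        String trimmed = url.trim();
        return trimmed.endsWith("/") ? trimmed.substring(0, trimmed.length() - 1) : trimmed;
    }

    public static void main(String[] args) {
        UrlDeduplicator dedup = new UrlDeduplicator();
        System.out.println(dedup.shouldIngest("https://example.com/race"));
        System.out.println(dedup.shouldIngest("https://example.com/race/"));
    }
}
```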
Phase 2: Retrieval
When a stage requests RAG (via tools: [rag] in front matter), the ContentRetrievalService builds a DefaultRetrievalAugmentor:
Single retriever (just rag) → direct attachment for efficiency
Multiple retrievers (rag + rag_search) → LanguageModelQueryRouter with LLM-based routing and ROUTE_TO_ALL fallback
Reranking (when rerank is active) → ReRankingContentAggregator with a minScore(0.3) threshold
Reranking is essential in production. Embedding similarity alone returns many segments, but not all are relevant. The reranker (Voyage AI API or a local ONNX model) re-scores the results based on semantic relevance to the actual query, filtering out noise.
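The filtering effect of a minimum score can be shown without a scoring model. In this sketch the scores are supplied by hand and the `Scored` record and `keepRelevant` method are hypothetical; in the real pipeline the scores would come from the reranker and the thresholding would be handled by the aggregator.

```java
import java.util.Comparator;
import java.util.List;

final class RerankFilter {
    // A retrieved segment paired with the relevance score a reranker assigned to it.
    record Scored(String text, double score) {}

    // Drops segments scoring below minScore and returns the rest, best first.
    static List<Scored> keepRelevant(List<Scored> scored, double minScore) {
        return scored.stream()
            .filter(s -> s.score() >= minScore)
            .sorted(Comparator.comparingDouble(Scored::score).reversed())
            .toList();
    }

    public static void main(String[] args) {
        List<Scored> candidates = List.of(
            new Scored("race date and venue", 0.9),
            new Scored("cookie banner text", 0.1),
            new Scored("registration fee", 0.5));
        System.out.println(keepRelevant(candidates, 0.3));
    }
}
```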
This two-phase architecture means the RAG layer is not static. It grows during every pipeline run as web search results are auto-ingested, and future queries benefit from the enriched store. It's a self-improving loop.
Output System & Newsletter Delivery
After the pipeline finishes, the result needs to go somewhere. The OutputServiceProvider dispatches to the right handler based on the --output-mode flag:
Mode | Handler | Flow |
|---|---|---|
| | Write to a new timestamped file |
| | Overwrite the input file |
| | Markdown → HTML (Commonmark with GFM tables) → MailerSend email |
| | Raw Markdown → Buttondown newsletter API |
Here is the full output dispatch flow:
The mail flow is particularly interesting: Commonmark parses the Markdown, renders it to HTML with GFM table extensions, and the MailerSend SDK delivers it to configured recipients. All automated, all from a cron-triggered GitLab CI job.
The CI/CD pipeline for newsletters is straightforward:
Each newsletter is a separate CI job with its own transformation directory, tools, and delivery configuration. Adding a new newsletter is creating a new prompt directory and a new CI job — nothing else.
Production Realities — Hard-Won Lessons
This is the most important section. Building an LLM application is easy. Keeping it running reliably in production is hard.
Context Pollution
When multiple stages share the same chat memory, Stage 2 sees all of Stage 1's tool calls — including failed attempts, retries, and debugging noise. The LLM starts hallucinating based on stale tool-call artifacts.
Fix: Every stage gets its own isolated InMemoryChatMemoryStore. Stage 2 never sees Stage 1's conversations.
Tool Isolation
The generation agent (Stage 2) must never have access to web-scraping or search tools. If it does, it will try to "verify" its own output by searching, find contradictory results, and enter a loop of self-correction that produces garbage.
Fix: Front matter tool declarations per stage. The researcher has search, crawler, apify. The writer has markdown_validate. They never overlap.
Hybrid Predictability
LLMs are non-deterministic. But newsletters need consistent structure. The solution is deterministic Java guardrails around non-deterministic LLM output:
JSON Schema validation (json_schema_validate) enforces the exact structure of Stage 1's output
Markdown validation (markdown_validate) catches malformed formatting before delivery
Security scanning (markdown_security) checks every URL against Google Web Risk before the email goes out
The LLM is creative. Java is the enforcer.
The Refeed Loop
The best newsletter outputs are validated, curated, and structurally sound. By ingesting these outputs back into the vector database, future generations are improved — they can reference previous editions as examples of good formatting, successful event verification, and proper structure. The system improves itself.
Testing — The Quest for the Right Model
The Model Testing Journey
We tested multiple models across different providers and scenarios.
Tool calling turned out to be a bit of a challenge. The LLM doesn't just generate text — it needs to decide when to call tools, how to interpret the results, and when to stop. Not all models handle this well.
Some models would enter infinite tool loops, calling search 200 times with the same query. Others would skip the validation tool before returning their final answer. Some would ignore the JSON schema entirely.
We tested across OpenAI, Gemini, DeepSeek, Groq, and local Ollama models. For local zero-cost testing, Qwen 3 (1.7B) turned out to be well suited for the job. At 1.7 billion parameters it runs fast on local machines, and it was fairly consistent with tool calling — it follows structured prompts, calls tools in the right order, and respects validation cycles. It became the model powering our entire zero-cost test suite.
Zero-Cost Integration Testing
The project runs 43 integration tests — all via shell scripts, no unit tests. The entire test suite runs against a local stack that costs exactly $0:
Production (Paid) | Zero-Cost Alternative |
|---|---|
OpenAI / DeepSeek chat | Ollama |
Voyage AI embeddings | Ollama |
Google Custom Search | DuckDuckGo / Stub engine |
Qdrant Cloud vector store | DuckDB (in-process) |
Voyage AI reranker | ONNX |
This is possible because of the factory pattern: the same code and the same tests run against different beans. Swapping --chat-model=gpt-5 for --chat-model=qwen3:1.7b loads a completely different Spring context with zero code changes.
The tests cover: basic transformations, RAG, data ingestion, search tools, social media search, content crawler (JS-rendered pages), tool execution limits, validation tools (JSON, HTML, Markdown), reranking (ONNX), and output modes.
Assertion Strategy: Structure Over Content
LLM output is non-deterministic, so exact-string assertions are a recipe for flaky tests. Instead, we assert on structure:
"Did it return valid JSON with the required fields?"
"Does the output contain at least one event?"
"Was the validation tool called before the final answer?"
This approach gives us reliable CI/CD without fighting non-determinism.
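The project's tests are shell scripts, but the same idea can be expressed in Java for illustration. The checks below are deliberately crude regular-expression probes, and the field names (`events`, `name`, `date`) are hypothetical; a real test would parse the JSON properly instead.

```java
import java.util.List;
import java.util.regex.Pattern;

final class StructureCheck {
    // True when every required field name appears as a JSON key somewhere in the output.
    static boolean hasFields(String json, List<String> requiredFields) {
        return requiredFields.stream()
            .allMatch(f -> Pattern.compile("\"" + Pattern.quote(f) + "\"\\s*:").matcher(json).find());
    }

    // True when the (hypothetical) "events" array starts with at least one object.
    static boolean hasAtLeastOneEvent(String json) {
        return Pattern.compile("\"events\"\\s*:\\s*\\[\\s*\\{").matcher(json).find();
    }

    public static void main(String[] args) {
        String output = "{\"events\": [{\"name\": \"Race A\", \"date\": \"2026-05-01\"}]}";
        // The assertions look at shape only; the event's wording may differ on every run.
        System.out.println(hasFields(output, List.of("name", "date")) && hasAtLeastOneEvent(output));
    }
}
```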
The Real Output — Live Newsletters
This is not a demo. These newsletters run on a schedule and land in real inboxes.
Morocco Run Radar: Every week, the pipeline researches all upcoming running events in Morocco through web search, Instagram scraping, Facebook scraping, and official race websites. It computes the distance of each event from Casablanca, validates the data against a JSON schema, formats it into a Markdown newsletter, scans every URL for malware, and delivers it via email.
IT Events Casablanca: Targets tech professionals in Casablanca with upcoming meetups, conferences, and workshops.
Assistant Professor Jobs: Monitors academic job openings matching specific criteria and delivers a curated digest.
Each pipeline follows the same 2-stage (classic) or 3-agent (agentic) architecture. The prompt files are the only difference.
What's Next
The engine works, the newsletters ship, and the architecture holds up. But there is plenty of room to grow.
Multimodality
Right now, AI CLI operates in a text-only world. The LLM reads text prompts, searches text results, and produces text output. But the real world is not text-only — race organizers post flyers as images, trail maps are PDFs, and results are sometimes scanned documents. Introducing multimodal inputs (image understanding, PDF extraction) would let the pipeline process these richer sources directly rather than relying on whatever text happens to be on the webpage.
Streaming & Batch
The current pipeline runs synchronously — the LLM generates its full response before anything is returned. For long-running agentic workflows, streaming would provide real-time feedback and reduce perceived latency. On the other end of the spectrum, batch APIs would allow high-volume processing (e.g., ingesting hundreds of data sources at once) at reduced cost, since most providers offer batch endpoints at a discount.
Stateful Newsletters — Delta Reporting
Weekly readers lose engagement when they see the same 50 events repeated. The next evolution is giving the pipeline memory across runs by injecting the previous edition's structured JSON into the agentic scope. The agents can then highlight what changed: "3 new marathons added," "Rabat Marathon now sold out," or "Early bird registration ends tomorrow." This shifts the newsletter from a static list to a living update.
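At its simplest, delta reporting is a set difference between two editions. The sketch below assumes each event can be reduced to a stable identifier string, which is a simplification; the `EditionDelta` class and the sample event names are invented for the example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class EditionDelta {
    // What appeared and what disappeared between two editions.
    record Delta(Set<String> added, Set<String> removed) {}

    static Delta between(Set<String> previous, Set<String> current) {
        Set<String> added = new LinkedHashSet<>(current);
        added.removeAll(previous);            // in the new edition but not the old one
        Set<String> removed = new LinkedHashSet<>(previous);
        removed.removeAll(current);           // in the old edition but no longer listed
        return new Delta(added, removed);
    }

    public static void main(String[] args) {
        Delta delta = between(Set.of("race-a", "race-b"), Set.of("race-b", "race-c"));
        System.out.println("added=" + delta.added() + " removed=" + delta.removed());
    }
}
```

Detecting changed attributes on an event that appears in both editions (for example a sold-out status) would need a field-by-field comparison on top of this.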
Smarter Ingestion
As discussed in the RAG section, blindly ingesting all search results makes the vector store noisy over time. Future iterations could introduce relevance-scored ingestion (using the reranker as a gate), TTL-based expiration, or even limiting ingestion to only the final validated output — so the RAG layer learns from curated content, not raw web noise.
Beyond Sequential — Other Workflow Patterns
It's worth being honest about what this engine is and what it isn't. Everything described in this article uses sequential workflows — agents execute one after another in a fixed order, passing state forward. This was the right choice for newsletters: there is a natural pipeline from research → editorial → formatting → delivery, and sequence gives you predictability and debuggability.
But sequential is not the only pattern, and it's not always the best one. LangChain4j supports several others that would suit different use cases:
Parallel workflows — multiple agents running concurrently. Useful when you have independent research tasks (e.g., one agent searches web, another scrapes social media, a third queries a database) and want to merge results at the end.
Loop workflows — self-correcting agents that iterate until a quality threshold is met. Instead of validating once and hoping, the agent retries with feedback until the output passes.
Supervisor workflows — a manager agent that dynamically decides which worker agents to call, in what order, and how many times. This is the most powerful pattern — it handles complex, branching tasks where the right next step depends on what was learned so far.
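Of the three patterns, the loop is the easiest to show in isolation. This is a generic sketch in plain Java rather than LangChain4j's loop support: `RetryLoop.refine`, the `improve` step, and the `isValid` check are hypothetical placeholders for an agent call and a validation tool, and the iteration cap is there so that a draft that never validates cannot loop forever.

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

final class RetryLoop {
    // Re-runs `improve` on the draft until `isValid` accepts it or maxIterations is exhausted.
    static String refine(String draft, UnaryOperator<String> improve,
                         Predicate<String> isValid, int maxIterations) {
        String current = draft;
        for (int i = 0; i < maxIterations && !isValid.test(current); i++) {
            current = improve.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Toy example: keep appending until the draft reaches a minimum length.
        System.out.println(refine("a", s -> s + "a", s -> s.length() >= 3, 10));
    }
}
```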
For a newsletter engine, sequential gives us exactly the control we need. But if the use case grows into something more complex — say, a real-time research assistant that needs to decide dynamically whether to search, crawl, or ask a follow-up question — a supervisor or loop pattern would be the right tool for the job. The factory-based architecture makes that transition straightforward: the workflow pattern is just another bean.
Further Reading
This article builds on concepts introduced in two predecessor articles. If you want to understand the foundations before diving into the production engine:
JBang Meets Spring Boot & LangChain4j — how single-file Java scripts evolve into Spring Boot applications with LLM chaining
Easy RAG — Using Embeddings in LangChain4j — embedding models, vector stores, and content retrieval fundamentals
Downloads — See It in Action
Live Presentation
LangChain4j in Action: A Walkthrough from Basic Chaining to Agentic Workflows (PDF)
Newsletter Examples (Real Output)
Morocco Run Radar — Agentic Edition (April 2026)