Why We Don't Use LLMs in loadmaster (Yet)

Every couple of weeks, someone asks the same question: "Are you using LLMs in loadmaster?" It is a fair question. Large language models have become the default lens through which people see AI, and it would be strange not to ask. The answer, however, is no, and certainly not in the core of the system. The three agents in loadmaster (StowAI for vessel stowage, StackAI for yard placement, JobAI for job dispatching) plan how containers move through a terminal, and that work is done by reinforcement learning and more traditional optimisation methods rather than by a language model.

This is not technological conservatism, and it is also not a bet against LLMs. It comes down to what these models are good at and what the problem actually requires. In this case, the two do not line up, at least not where it counts. I have written elsewhere on this site about why we frame RL as a decision-making framework rather than an optimiser, why we keep humans in the loop, and why the data is harder than the algorithms. This post is the other side of that coin: why the one technology everyone now reaches for first is the one we have deliberately kept out of the decision loop.

An LLM is not an optimisation engine

The core of loadmaster is a set of optimisation problems. Given a vessel, a loading list, a yard topology, and a tangle of constraints, where should every container go so that the plan is feasible, stable, and cheap to execute? StowAI augments a vessel planner to minimise shifters while keeping the plan executable. StackAI optimises container placement to balance the yard and protect future plans. Each of these is a combinatorial search over an enormous space of possibilities, governed by hard mathematical constraints.

A language model does not search this space. It was never designed to. An LLM predicts the next token given the previous ones. When you ask it to "solve" a stowage problem, it produces text that looks like a solution, assembled from patterns it has seen in its training data, with no underlying mechanism that evaluates feasibility. It does not compare alternatives or establish whether one arrangement beats another. No objective function is being minimised. There is no constraint solver checking whether the answer is even legal. There is only a very sophisticated guess about what a plausible answer looks like.

This is easy to miss because the output reads so convincingly. Ask a capable model to stow a small vessel and it will give you a confident, neatly formatted plan. But "looks like a stowage plan" and "is a good stowage plan" are quite different claims. The gains we actually measure, such as cutting shifters, reducing handling time, and tightening crane schedules, come from an agent that analyses the real structure of the problem: weights, stack heights, crane reach, the cascading cost of a rehandle three moves from now. An LLM is reasoning about what stowage plans tend to look like in text. Only one of those activities gets containers onto a ship.
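To make "searching the space" concrete, here is a minimal sketch on a toy problem. It is not loadmaster code. The data, the height limit, and the definition of a shifter (a container sitting above one that must leave earlier) are all assumptions for illustration, and brute-force enumeration stands in for a real solver. What matters is that there is an explicit feasibility check and an explicit objective, and every candidate is compared against both.

```python
from itertools import product

# Hypothetical toy data: container id -> port of discharge (1 = unloaded first).
# Dict order doubles as the loading order.
CONTAINERS = {"A": 1, "B": 2, "C": 1, "D": 2}
NUM_STACKS = 2
MAX_HEIGHT = 2  # hard constraint, assumed for the toy


def build_stacks(assignment):
    """Turn one stack index per container into stacks listed bottom to top."""
    stacks = [[] for _ in range(NUM_STACKS)]
    for container, stack_index in zip(CONTAINERS, assignment):
        stacks[stack_index].append(container)
    return stacks


def is_feasible(stacks):
    """Hard constraint: no stack may exceed the height limit."""
    return all(len(stack) <= MAX_HEIGHT for stack in stacks)


def count_shifters(stacks):
    """Objective: containers sitting above at least one that must leave earlier."""
    shifters = 0
    for stack in stacks:
        for position, container in enumerate(stack):
            below = stack[:position]
            if any(CONTAINERS[b] < CONTAINERS[container] for b in below):
                shifters += 1
    return shifters


def best_plan():
    """Enumerate every assignment, discard illegal ones, keep the cheapest."""
    best = None
    for assignment in product(range(NUM_STACKS), repeat=len(CONTAINERS)):
        stacks = build_stacks(assignment)
        if not is_feasible(stacks):
            continue
        cost = count_shifters(stacks)
        if best is None or cost < best[0]:
            best = (cost, stacks)
    return best


naive = [["A", "B"], ["C", "D"]]  # fill stack 0 first, then stack 1
print(count_shifters(naive))      # 2
print(best_plan())                # (0, [['A', 'C'], ['B', 'D']])
```

Enumeration only works because the toy has sixteen candidates. Real instances need smarter search, but the structure stays the same: candidates are generated, checked, scored, and compared.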

An LLM is not a planner

The same gap shows up in planning. As I argued in writing about RL as a decision-making framework, terminal operations are a long chain of interdependent decisions made under uncertainty. This stowage choice constrains that yard layout, which shapes crane scheduling, which determines whether the vessel sails on time. Good planning means reasoning about how a decision now changes the options available later, often many moves ahead.

Reinforcement learning is built for exactly this. Our agents learn their policies inside a closed-loop digital twin, experiencing the long-horizon consequences of their actions across millions of simulated scenarios rather than from historical logs. An agent learns that a cheap move today can be an expensive mistake tomorrow. That credit assignment over time is the whole point.

An LLM has no such mechanism. It does not maintain a model of state that it updates as a plan unfolds, and it does not weigh the downstream consequences of a choice against an objective. It can describe a plan fluently, and it can reason step by step in text when you prompt it to. But step-by-step text is not the same as rolling out a policy or searching a decision tree. When the problem demands real lookahead over a structured state space, the model is improvising a narrative instead of planning. The narrative can be wrong in ways that only surface several moves later, which in a terminal means after the damage is done.
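The "cheap move today, expensive mistake tomorrow" point can be shown on a tiny decision tree. This is only a sketch: the states, actions, and costs are invented, and exact recursion over the tree stands in for the value estimates that a trained policy would learn from simulation.

```python
# Hypothetical toy: state -> {action: (immediate_cost, next_state)}.
# "stack_on_top" is cheap now but forces a rehandle at the next step.
TRANSITIONS = {
    "start": {
        "stack_on_top": (1.0, "blocked"),
        "use_free_slot": (2.0, "clear"),
    },
    "blocked": {"rehandle_then_load": (5.0, "done")},
    "clear": {"load": (1.0, "done")},
    "done": {},
}


def cost_to_go(state):
    """Minimum total cost from `state` to the end of the episode."""
    actions = TRANSITIONS[state]
    if not actions:
        return 0.0
    return min(cost + cost_to_go(nxt) for cost, nxt in actions.values())


def greedy_action(state):
    """Pick the action with the lowest immediate cost, ignoring the future."""
    actions = TRANSITIONS[state]
    return min(actions, key=lambda a: actions[a][0])


def lookahead_action(state):
    """Pick the action with the lowest immediate cost plus cost-to-go."""
    actions = TRANSITIONS[state]
    return min(actions, key=lambda a: actions[a][0] + cost_to_go(actions[a][1]))


print(greedy_action("start"))     # stack_on_top  (total cost 6.0)
print(lookahead_action("start"))  # use_free_slot (total cost 3.0)
```

The greedy choice looks better by one unit and ends up three units worse. Assigning that later cost back to the earlier decision is what the long-horizon training signal is for.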

They hallucinate, and they do it with total confidence

This is the part that should worry anyone considering an LLM for a high-stakes operational decision. Language models hallucinate. They invent facts, fabricate constraints, and produce answers that are fluent, specific, and completely wrong, delivered with the same calm authority as their correct answers.

In a casual setting, a hallucination is an annoyance. In a container terminal, it is a safety and financial event. Picture a model that confidently places a dangerous-goods container next to an incompatible one because the segregation rule never surfaced in its output. Or one that asserts a vessel-stability margin that does not exist. Or one that drops a container into a yard slot already occupied by another. These are not far-fetched failure modes for an LLM. They follow directly from a system that generates plausible text rather than verified fact.

The deeper problem is that the model gives you no honest signal of its own uncertainty. A solver tells you when a problem is infeasible. An RL policy can be wrapped in checks that reject illegal actions. An LLM, left to its own devices, will fill any gap with something confident-sounding. It cannot reliably flag the cases where it is guessing. In an environment where the cost of a wrong answer runs to hundreds of thousands of euros, "usually right, occasionally and invisibly catastrophic" is not a profile you can put on the quay.
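Wrapping a policy in checks that reject illegal actions is often done by masking. The sketch below assumes a policy that outputs a score per candidate slot and a legality predicate supplied by the caller. The function name, slot names, and scores are made up for illustration.

```python
import math


def masked_choice(scores, is_legal):
    """Return the highest-scoring legal action, or None if nothing is legal."""
    best_action, best_score = None, -math.inf
    for action, score in scores.items():
        if is_legal(action) and score > best_score:
            best_action, best_score = action, score
    return best_action


# Hypothetical policy output and yard state.
scores = {"slot_1": 0.9, "slot_2": 0.7, "slot_3": 0.1}
occupied = {"slot_1"}

print(masked_choice(scores, lambda slot: slot not in occupied))  # slot_2
print(masked_choice(scores, lambda slot: False))                 # None
```

The policy's favourite slot is occupied, so it is never offered as an option. When no legal slot exists the wrapper returns None, an explicit "no answer" rather than a confident guess.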

Hard constraints are non-negotiable, and LLMs cannot guarantee them

Container terminals run on constraints that are not preferences. Vessel stability limits are physics. Dangerous-goods segregation is law. A plan that violates either does not get a stern talking-to. It does not sail, or it sails and something goes badly wrong.

A well-formulated optimisation model guarantees feasibility by construction. The constraints are encoded, so any solution it returns respects them, or it reports that no solution exists. That is precisely why our hybrid approach pairs learned policies with formal feasibility checks instead of trusting a single black box. An LLM offers no such guarantee. You can put the rules in the prompt and ask nicely, and it will mostly comply, but "mostly" is not something you can build a safety case on. There is no proof, no certificate, no formal boundary the model is mathematically prevented from crossing. For the parts of loadmaster where feasibility is the entire point, a tool that cannot guarantee feasibility is disqualified, however impressive it is elsewhere.
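"Returns a legal solution or reports that none exists" can be sketched with a small backtracking search. The incompatible pair below is a placeholder rather than a real segregation rule, and arranging cargo in a single row is an assumption made to keep the example short.

```python
# Hypothetical rule: these two cargo classes must not occupy adjacent slots.
INCOMPATIBLE = {frozenset({"flammable", "oxidiser"})}


def violates(left, right):
    """True if the two cargo classes may not be neighbours."""
    return frozenset({left, right}) in INCOMPATIBLE


def arrange(cargo, placed=()):
    """Search for a row order with no incompatible neighbours.

    Returns a tuple of cargo classes, or None when no legal order exists.
    """
    if not cargo:
        return placed
    for i, item in enumerate(cargo):
        if placed and violates(placed[-1], item):
            continue  # this branch would break the rule, so it is never explored
        result = arrange(cargo[:i] + cargo[i + 1:], placed + (item,))
        if result is not None:
            return result
    return None


print(arrange(("flammable", "oxidiser", "general")))  # ('flammable', 'general', 'oxidiser')
print(arrange(("flammable", "oxidiser")))             # None
```

Every returned order satisfies the rule because violating branches are pruned before they can become answers. The second call has no legal order, and the search says so.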

Determinism, auditability, and explanations operators trust

Three more practical reasons matter here.

Reproducibility. Operators, and increasingly regulators, need the same inputs to yield the same plan, and they need to reconstruct why a decision was made. A trained policy is reproducible: fix the inputs and the seed, and you get the same plan every time, with a traceable rationale. LLM outputs vary, sometimes substantially, between runs and between model versions. A plan you cannot reproduce is a plan you cannot defend in an incident review.
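As a rough illustration of "fix the inputs and the seed", the sketch below routes all randomness through one seeded generator and stores a fingerprint of inputs, seed, and plan. The planner here is a stand-in that merely shuffles, and the record layout is an assumption, not loadmaster's audit format.

```python
import hashlib
import json
import random


def make_plan(containers, seed):
    """Stand-in for a stochastic planner: all randomness comes from `seed`."""
    rng = random.Random(seed)
    order = sorted(containers)  # normalise input order before any randomness
    rng.shuffle(order)
    return order


def audit_record(containers, seed):
    """Return the plan plus a fingerprint that ties it to its inputs and seed."""
    plan = make_plan(containers, seed)
    payload = json.dumps(
        {"inputs": sorted(containers), "seed": seed, "plan": plan},
        sort_keys=True,
    )
    fingerprint = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"plan": plan, "fingerprint": fingerprint}


first = audit_record(["C3", "C1", "C2"], seed=7)
second = audit_record(["C1", "C2", "C3"], seed=7)
print(first == second)  # True: same inputs and seed give the same plan and fingerprint
```

In an incident review, re-running with the recorded inputs and seed and comparing fingerprints shows whether the plan under discussion is the one the system actually produced.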

Latency and cost at scale. loadmaster makes a very large number of decisions across a terminal's operations, and JobAI in particular coordinates execution continuously so cranes and trucks stay busy and wait times fall. Our agents return good decisions in a sub-second loop, at negligible marginal cost per decision. Pushing that volume of structured combinatorial decisions through a large language model would be slower and far more expensive, and it would produce answers that are weaker on exactly the dimensions that matter. The economics are simply abysmal, before you even consider the quality issues.

Interpretability that operators trust. We train our agents against explainable KPIs such as crane moves, rehandle rates, weight distribution, and schedule adherence, so a planner can see which trade-offs drove a recommendation, expressed in their own language. As I argued in writing about keeping humans in the loop, trust scales with transparency, not accuracy alone. "The language model said so, and it might say something different tomorrow" is the opposite of that. An LLM can generate an explanation that sounds right but bears no causal relationship to how the answer was produced. A grounded explanation tied to costs and constraints the operator recognises comes naturally to optimisation. It does not come naturally to an LLM.
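One simple form a grounded explanation can take is a per-KPI cost breakdown between the recommended plan and an alternative. The weights and KPI values below are invented, and a weighted sum is an assumed cost model chosen only to keep the sketch small.

```python
# Hypothetical weights: cost units per crane move, per rehandle, per minute of delay.
WEIGHTS = {"crane_moves": 1.0, "rehandles": 4.0, "schedule_delay_min": 0.5}


def cost(kpis):
    """Total cost of a plan as a weighted sum of its KPIs."""
    return sum(WEIGHTS[name] * kpis[name] for name in WEIGHTS)


def explain(chosen, rejected):
    """Per-KPI cost difference (rejected minus chosen).

    Positive values are where the chosen plan wins, negative where it loses.
    """
    return {name: WEIGHTS[name] * (rejected[name] - chosen[name]) for name in WEIGHTS}


chosen = {"crane_moves": 42, "rehandles": 1, "schedule_delay_min": 10}
rejected = {"crane_moves": 40, "rehandles": 4, "schedule_delay_min": 10}

print(explain(chosen, rejected))
# {'crane_moves': -2.0, 'rehandles': 12.0, 'schedule_delay_min': 0.0}
print(cost(rejected) - cost(chosen))  # 10.0
```

The breakdown reads as "two extra crane moves, three fewer rehandles", and its terms sum exactly to the cost gap. The explanation is the same arithmetic that produced the decision, not a story written afterwards.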

Where LLMs could fit, in the future

So when I say "not yet," I mean it literally. There is a real and growing set of places in and around loadmaster where a language model could earn its keep, precisely because they play to its strengths rather than against them. In all of them the LLM sits at the edges, interpreting, translating, and explaining, never inside the loop that decides where a container goes.

A natural-language interface to the agents. A planner should be able to ask "why did StowAI sequence this container here?" and get a clear answer. The agent already optimises against explainable KPIs and knows the trade-offs behind a decision. An LLM is a natural layer for turning that structured information into plain language, and for turning a planner's plain-language question back into a structured query.

Translating operator intent into constraints. Much of the friction in deploying optimisation comes from capturing the rules that live in people's heads: "we never stack hazardous cargo upwind of the break area," "this customer's containers always go in block C," "the vessel officer rejects any plan with heavy boxes in bay fourteen." This is exactly the contextual knowledge that operators carry and no dataset contains, which I have written about before. An LLM could help operators express these rules in natural language and propose a formal encoding, which a human confirms before it ever enters an agent's constraint set. The model speeds up the conversation. It does not get a vote on the final plan.
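The confirmation gate could look something like the sketch below. Every name in it is hypothetical (ProposedRule, ConstraintSet, the customer "ACME", the planner id), and the LLM step is assumed to have already produced the structured draft. The one thing it shows is that an unconfirmed rule cannot enter the constraint set.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ProposedRule:
    """A draft constraint: the operator's words plus a proposed formal encoding."""
    source_text: str
    customer: str
    required_block: str
    confirmed_by: Optional[str] = None  # set only by a human reviewer


class ConstraintSet:
    def __init__(self):
        self._rules = []

    def add(self, rule):
        """Accept a rule only after a named human has confirmed it."""
        if rule.confirmed_by is None:
            raise ValueError("rule has not been confirmed by a human")
        self._rules.append(rule)

    def allows(self, customer, block):
        """True if placing this customer's container in `block` breaks no rule."""
        return all(
            rule.customer != customer or rule.required_block == block
            for rule in self._rules
        )


draft = ProposedRule(
    source_text="this customer's containers always go in block C",
    customer="ACME",
    required_block="C",
)
constraints = ConstraintSet()
constraints.add(replace(draft, confirmed_by="planner_on_shift"))
print(constraints.allows("ACME", "C"))  # True
print(constraints.allows("ACME", "A"))  # False
```

Keeping the operator's original wording next to the encoding means the reviewer confirms a specific translation, and that wording stays available if the rule is questioned later.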

Making sense of unstructured data. The hardest part of these systems is rarely the algorithm. It is the gap between what the TOS logs and what actually happened, which I have argued is where most projects quietly fail. Emails, shipping documents, free-text operator notes, the quirks of a particular EDI feed: this is unstructured text, and reading unstructured text is exactly what LLMs are good at. Using a model to extract structure from messy documents, with validation downstream, is a far better fit than using one to make the optimisation decision itself.
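One concrete case of validation downstream: if the extracted field is a container number, the ISO 6346 check digit can be verified before the value is trusted. The sketch assumes the extraction step has already produced a candidate string. It checks only the basic format and the check digit, not the owner code or the category letter.

```python
import re
import string


def _letter_values():
    """ISO 6346 letter values: A=10 upward, skipping multiples of 11."""
    values, value = {}, 10
    for letter in string.ascii_uppercase:
        if value % 11 == 0:
            value += 1
        values[letter] = value
        value += 1
    return values


LETTER_VALUES = _letter_values()
CONTAINER_RE = re.compile(r"^[A-Z]{4}[0-9]{7}$")


def is_valid_container_number(text):
    """Check the format and the ISO 6346 check digit of a container number."""
    if not CONTAINER_RE.match(text):
        return False
    total = 0
    for position, char in enumerate(text[:10]):
        value = LETTER_VALUES[char] if char.isalpha() else int(char)
        total += value * (2 ** position)
    return total % 11 % 10 == int(text[10])


print(is_valid_container_number("CSQU3054383"))  # True
print(is_valid_container_number("CSQU3054384"))  # False: wrong check digit
print(is_valid_container_number("CSQU305438"))   # False: too short
```

A value that fails the check goes back to a human or is dropped, so a misread or invented number never reaches the planning data.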

Explanations, summaries, and onboarding. Shift-handover reports, summaries of what the agents did and why over the last twelve hours, and training material that helps a new planner understand an agent's behaviour are all genuinely useful and genuinely low-risk. A hallucination in a summary that a human is reading is a correctable mistake, not a safety event. That is the right risk profile for an LLM.

What ties these together is that the language model interprets, translates, explains, and summarises. It never decides where a container goes or whether a plan is safe to execute. That decision stays with methods that can guarantee feasibility, reason about long horizons, and tell you honestly when they cannot find an answer.

The right tool for the job

None of this is a verdict on LLMs as a technology. It is the same methodology I try to apply to every method: start with the problem, not the tool. The question is never "can we use an LLM here?" It is "what is the best approach for this specific operational problem?" For combinatorial optimisation under hard constraints, with high decision volume and no tolerance for invisible errors, the best approach is reinforcement learning and optimisation working together. That is what StowAI, StackAI, and JobAI are.

The recurring mistake in the industry is treating LLMs as a tool that subsumes every other method. They are very good at a specific class of problems: problems that are fundamentally about language, unstructured text, and messy human input. A container stowage plan is not one of those problems. But a terminal is full of problems that are.

This is a view that I expect to hold for a long time. The core of loadmaster optimises and plans, because that is what the problem is. The day a language model can prove that a plan is feasible, search a combinatorial space, and tell me truthfully what it does not know, I will happily revisit this post. Until then, the sensible thing is to use each tool for what it is actually good at, and to be honest about the difference.