The Lexer is responsible for reading the source code and tokenizing it. It's the first step in the pipeline of an interpreter.
I've decided to write it in Java to explore some of the language's hidden corners. In this case, that corner is the Spliterator API, which I figured would be a good fit for this task.
We will take a stream of characters as input and produce a stream of tokens with the help of the Spliterator API.
The interface lets us define a custom way of stepping through the character stream, one token at a time rather than the default one character at a time.
We will, of course, be responsible for implementing the tryAdvance method, following the same logic a traditional lexer would.
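To make the mechanics concrete, here is a minimal, self-contained sketch of the Spliterator contract we'll rely on. The WordSpliterator class and its sentence input are illustrative, not part of the lexer; the point is the shape of tryAdvance: hand the next element to the Consumer and return true, or return false when exhausted.

```java
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// A toy Spliterator that emits the words of a sentence one at a time.
// It illustrates the same pattern the lexer uses: tryAdvance produces
// the next element, or returns false when there is nothing left.
class WordSpliterator implements Spliterator<String> {
    private final String[] words;
    private int index = 0;

    WordSpliterator(String sentence) {
        this.words = sentence.split(" ");
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        if (index >= words.length) {
            return false; // no more elements
        }
        action.accept(words[index++]);
        return true;
    }

    @Override
    public Spliterator<String> trySplit() {
        return null; // sequential only, no parallel splitting
    }

    @Override
    public long estimateSize() {
        return words.length - index;
    }

    @Override
    public int characteristics() {
        return ORDERED | NONNULL;
    }

    public static void main(String[] args) {
        Stream<String> words = StreamSupport.stream(new WordSpliterator("let five = 5"), false);
        words.forEach(System.out::println);
    }
}
```

StreamSupport.stream then turns any such Spliterator into a regular Stream, which is exactly how the lexer below will be consumed.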
Implementation
The first thing we need while walking through the characters is the ability to peek ahead in the stream.
I've used Apache Commons Collections' PeekingIterator to achieve this.
You can of course build your own peeking iterator on top of Java's Iterator interface.
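For illustration, here is a minimal hand-rolled version (the SimplePeekingIterator class is my own sketch, not the Commons implementation, which handles more edge cases): it wraps any Iterator and buffers one element so callers can inspect the next value without consuming it.

```java
import java.util.Iterator;

// A minimal peeking iterator: wraps a delegate Iterator and buffers
// at most one element so peek() can look ahead without consuming.
class SimplePeekingIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private T buffered;
    private boolean hasBuffered = false;

    SimplePeekingIterator(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    // Look at the next element without advancing past it.
    public T peek() {
        if (!hasBuffered) {
            buffered = delegate.next(); // throws NoSuchElementException if empty
            hasBuffered = true;
        }
        return buffered;
    }

    @Override
    public boolean hasNext() {
        return hasBuffered || delegate.hasNext();
    }

    @Override
    public T next() {
        if (hasBuffered) {
            hasBuffered = false;
            return buffered;
        }
        return delegate.next();
    }
}
```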
Here is an example of where this need arises: identifying an identifier token.
We'll keep the definition of identifiers simple: a sequence of letters.
Having fixed your starting position at the first letter, you need to keep advancing until you reach a character that is not a letter, without consuming that last character (which might well be a semicolon or a space, signaling the move to the next iteration of the tokenization process).
Here is the snippet that does this:
while (peekingIterator.hasNext() && Character.isLetter(peekingIterator.peek().charAt(0))) {
    tokenValue.append(peekingIterator.next());
}
Token Types Representation
Let's define the token types that we will be recognizing in the Monkey language.
We'll use an enum to represent the token types, with each constant overriding toString to provide a human-readable representation.
(This definition will change slightly once we move to the parser, which has other needs: identifying keywords, infix and prefix parsing, and so on.)
package com.iqbalaissaoui.lex;
public enum TokenType {
    ILLEGAL(false),
    EOF(false),
    IDENT(false),
    INT(false),
    ASSIGN(false)    { @Override public String toString() { return "=";      } },
    PLUS(false)      { @Override public String toString() { return "+";      } },
    COMMA(false)     { @Override public String toString() { return ",";      } },
    SEMICOLON(false) { @Override public String toString() { return ";";      } },
    LPAREN(false)    { @Override public String toString() { return "(";      } },
    RPAREN(false)    { @Override public String toString() { return ")";      } },
    LBRACE(false)    { @Override public String toString() { return "{";      } },
    RBRACE(false)    { @Override public String toString() { return "}";      } },
    FUNCTION(true)   { @Override public String toString() { return "fn";     } },
    LET(true)        { @Override public String toString() { return "let";    } },
    RETURN(true)     { @Override public String toString() { return "return"; } },
    TRUE(true)       { @Override public String toString() { return "true";   } },
    FALSE(true)      { @Override public String toString() { return "false";  } },
    IF(true)         { @Override public String toString() { return "if";     } },
    ELSE(true)       { @Override public String toString() { return "else";   } },
    MINUS(false)     { @Override public String toString() { return "-";      } },
    BANG(false)      { @Override public String toString() { return "!";      } },
    ASTERISK(false)  { @Override public String toString() { return "*";      } },
    SLASH(false)     { @Override public String toString() { return "/";      } },
    LT(false)        { @Override public String toString() { return "<";      } },
    GT(false)        { @Override public String toString() { return ">";      } },
    EQ(false)        { @Override public String toString() { return "==";     } },
    NOT_EQ(false)    { @Override public String toString() { return "!=";     } };

    final boolean isKeyword;

    TokenType(boolean isKeyword) {
        this.isKeyword = isKeyword;
    }

    public boolean isKeyword() {
        return this.isKeyword;
    }
}
Token Representation
A token is a pair of a token type and a value. So far, a record is enough to represent this:
package com.iqbalaissaoui.lex;
public record Token(TokenType type, String literal) {
    @Override
    public String toString() {
        return "Token(" + type.name() + ", " + literal + ")";
    }
}
Lexer Implementation
Here is the implementation of the lexer, which in this design is essentially a Spliterator of Tokens:
package com.iqbalaissaoui.lex;
import org.apache.commons.collections4.iterators.PeekingIterator;
import org.apache.commons.lang3.StringUtils;
import java.util.*;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.stream.Stream;
public class TokenSpliterator implements Spliterator<Token> {

    private final PeekingIterator<String> peekingIterator;

    public TokenSpliterator(PeekingIterator<String> peekingIterator) {
        this.peekingIterator = peekingIterator;
    }

    @Override
    public boolean tryAdvance(Consumer<? super Token> action) {
        // 1. skip whitespace, peeking so we never consume a meaningful character
        while (peekingIterator.hasNext() && StringUtils.isBlank(peekingIterator.peek())) {
            peekingIterator.next();
        }
        // 2. check EOF
        if (!peekingIterator.hasNext()) {
            action.accept(new Token(TokenType.EOF, TokenType.EOF.toString()));
            return false;
        }
        String current = peekingIterator.next();
        Token t = switch (current) {
            case "=" -> {
                // look ahead for the two-character token "=="
                if (peekingIterator.hasNext() && peekingIterator.peek().contentEquals("=")) {
                    peekingIterator.next();
                    yield new Token(TokenType.EQ, TokenType.EQ.toString());
                }
                yield new Token(TokenType.ASSIGN, TokenType.ASSIGN.toString());
            }
            case "+" -> new Token(TokenType.PLUS, TokenType.PLUS.toString());
            case "," -> new Token(TokenType.COMMA, TokenType.COMMA.toString());
            case ";" -> new Token(TokenType.SEMICOLON, TokenType.SEMICOLON.toString());
            case "(" -> new Token(TokenType.LPAREN, TokenType.LPAREN.toString());
            case ")" -> new Token(TokenType.RPAREN, TokenType.RPAREN.toString());
            case "{" -> new Token(TokenType.LBRACE, TokenType.LBRACE.toString());
            case "}" -> new Token(TokenType.RBRACE, TokenType.RBRACE.toString());
            case "!" -> {
                // look ahead for the two-character token "!="
                if (peekingIterator.hasNext() && peekingIterator.peek().contentEquals("=")) {
                    peekingIterator.next();
                    yield new Token(TokenType.NOT_EQ, TokenType.NOT_EQ.toString());
                }
                yield new Token(TokenType.BANG, TokenType.BANG.toString());
            }
            case "<" -> new Token(TokenType.LT, TokenType.LT.toString());
            case ">" -> new Token(TokenType.GT, TokenType.GT.toString());
            case "*" -> new Token(TokenType.ASTERISK, TokenType.ASTERISK.toString());
            case "/" -> new Token(TokenType.SLASH, TokenType.SLASH.toString());
            case "-" -> new Token(TokenType.MINUS, TokenType.MINUS.toString());
            default -> {
                if (Character.isDigit(current.charAt(0))) {
                    StringBuilder tokenValue = new StringBuilder(current);
                    while (peekingIterator.hasNext() && Character.isDigit(peekingIterator.peek().charAt(0))) {
                        tokenValue.append(peekingIterator.next());
                    }
                    yield new Token(TokenType.INT, tokenValue.toString());
                } else if (Character.isLetter(current.charAt(0))) {
                    StringBuilder tokenValue = new StringBuilder(current);
                    while (peekingIterator.hasNext() && Character.isLetter(peekingIterator.peek().charAt(0))) {
                        tokenValue.append(peekingIterator.next());
                    }
                    // keyword lookup: the scanned word is either a keyword or an identifier
                    Optional<TokenType> opKeyword = Stream.of(TokenType.values())
                            .filter(TokenType::isKeyword)
                            .filter(tt -> tt.toString().contentEquals(tokenValue))
                            .findFirst();
                    AtomicReference<Token> token = new AtomicReference<>();
                    opKeyword.ifPresentOrElse(
                            tt -> token.set(new Token(tt, tt.toString())),
                            () -> token.set(new Token(TokenType.IDENT, tokenValue.toString()))
                    );
                    yield token.get();
                } else {
                    yield new Token(TokenType.ILLEGAL, TokenType.ILLEGAL.toString());
                }
            }
        };
        action.accept(t);
        return true;
    }

    @Override
    public Spliterator<Token> trySplit() {
        return null; // sequential only
    }

    @Override
    public long estimateSize() {
        return Long.MAX_VALUE; // size is unknown up front
    }

    @Override
    public int characteristics() {
        // not SIZED: estimateSize does not report an exact count
        return ORDERED | NONNULL;
    }
}
As you can see, the main logic lives in the tryAdvance method, where we identify each token's type and value by walking conditionally through the character stream.
Entry Point and Project Structure
At the time of writing, I haven't implemented a proper REPL that scans input from the user and tokenizes it.
Instead, I've used a simple main method to test the lexer with the help of JBang:
Structure:
➜ src git:(interpreter) ✗ tree .
.
└── com
└── iqbalaissaoui
├── Repl.java
├── lex
│ ├── Lexer.java
│ ├── Token.java
│ ├── TokenSpliterator.java
│ └── TokenType.java
├── main.monkey
JBang Entry Point:
///usr/bin/env jbang "$0" "$@" ; exit $?
//DEPS org.apache.commons:commons-lang3:3.17.0
//DEPS org.apache.commons:commons-collections4:4.4
//SOURCES lex/TokenSpliterator.java
//SOURCES lex/TokenType.java
//SOURCES lex/Token.java
package com.iqbalaissaoui;
import com.iqbalaissaoui.lex.Token;
import com.iqbalaissaoui.lex.TokenSpliterator;
import org.apache.commons.collections4.iterators.PeekingIterator;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import static java.lang.System.*;
public class Repl {
    public static void main(String... args) throws IOException {
        Path mk = Path.of("main.monkey");
        try (Stream<String> lines = Files.lines(mk, StandardCharsets.UTF_8)) {
            // Files.lines strips line terminators; re-append a newline so a token
            // at the end of one line never merges with the start of the next
            Stream<String> chars = lines.flatMap(line -> (line + "\n").chars()
                    .mapToObj(Character::toString));
            PeekingIterator<String> peekingIterator = new PeekingIterator<>(chars.iterator());
            TokenSpliterator tokenSpliterator = new TokenSpliterator(peekingIterator);
            Stream<Token> tokens = StreamSupport.stream(tokenSpliterator, false);
            tokens.forEach(out::println);
        }
    }
}
Monkey File Example (main.monkey):
let five = 5;
let ten = 10;
let add = fn(x, y) {
x + y;
};
let result = add(five, ten);
Let's run the entry point with JBang:

jbang Repl.java

Output:
➜ iqbalaissaoui git:(interpreter) ✗ jbang Repl.java
Token(LET, let)
Token(IDENT, five)
Token(ASSIGN, =)
Token(INT, 5)
Token(SEMICOLON, ;)
Token(LET, let)
Token(IDENT, ten)
Token(ASSIGN, =)
Token(INT, 10)
Token(SEMICOLON, ;)
Token(LET, let)
Token(IDENT, add)
Token(ASSIGN, =)
Token(FUNCTION, fn)
Token(LPAREN, ()
Token(IDENT, x)
Token(COMMA, ,)
Token(IDENT, y)
Token(RPAREN, ))
Token(LBRACE, {)
Token(IDENT, x)
Token(PLUS, +)
Token(IDENT, y)
Token(SEMICOLON, ;)
Token(RBRACE, })
Token(SEMICOLON, ;)
Token(LET, let)
Token(IDENT, result)
Token(ASSIGN, =)
Token(IDENT, add)
Token(LPAREN, ()
Token(IDENT, five)
Token(COMMA, ,)
Token(IDENT, ten)
Token(RPAREN, ))
Token(EOF, EOF)