The Lexer is responsible for reading the source code and tokenizing it. It's the first step in the pipeline of an interpreter.
I've decided to write it in Java to explore some of the language's hidden corners. In this case, that corner is the Spliterator API, which I figured would be a good fit for this task.
We will take a stream of characters as input and produce a stream of tokens with the help of the Spliterator API.
The interface lets us define a custom way of stepping through the character stream, one token at a time rather than the default one character at a time.
We will, of course, be responsible for implementing the tryAdvance method, following the same logic a traditional lexer would.
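To make the mechanics concrete, here is a minimal, self-contained sketch of the Spliterator contract we'll rely on. The WordSpliterator class and its sentence input are illustrative, not part of the lexer; the point is the shape of tryAdvance: hand the next element to the Consumer and return true, or return false when exhausted.

```java
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// A toy Spliterator that emits the words of a sentence one at a time.
// It illustrates the same pattern the lexer uses: tryAdvance produces
// the next element, or returns false when there is nothing left.
class WordSpliterator implements Spliterator<String> {
    private final String[] words;
    private int index = 0;

    WordSpliterator(String sentence) {
        this.words = sentence.split(" ");
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        if (index >= words.length) {
            return false; // no more elements
        }
        action.accept(words[index++]);
        return true;
    }

    @Override
    public Spliterator<String> trySplit() {
        return null; // sequential only, no parallel splitting
    }

    @Override
    public long estimateSize() {
        return words.length - index;
    }

    @Override
    public int characteristics() {
        return ORDERED | NONNULL;
    }

    public static void main(String[] args) {
        Stream<String> words = StreamSupport.stream(new WordSpliterator("let five = 5"), false);
        words.forEach(System.out::println);
    }
}
```

StreamSupport.stream then turns any such Spliterator into a regular Stream, which is exactly how the lexer below will be consumed.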
Implementation
The first thing we need while walking through the characters is the ability to peek ahead in the stream.
I've used Apache Commons Collections' PeekingIterator to achieve this.
You can of course build your own peeking iterator on top of Java's Iterator interface.
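For illustration, here is a minimal hand-rolled version (the SimplePeekingIterator class is my own sketch, not the Commons implementation, which handles more edge cases): it wraps any Iterator and buffers one element so callers can inspect the next value without consuming it.

```java
import java.util.Iterator;

// A minimal peeking iterator: wraps a delegate Iterator and buffers
// at most one element so peek() can look ahead without consuming.
class SimplePeekingIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private T buffered;
    private boolean hasBuffered = false;

    SimplePeekingIterator(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    // Look at the next element without advancing past it.
    public T peek() {
        if (!hasBuffered) {
            buffered = delegate.next(); // throws NoSuchElementException if empty
            hasBuffered = true;
        }
        return buffered;
    }

    @Override
    public boolean hasNext() {
        return hasBuffered || delegate.hasNext();
    }

    @Override
    public T next() {
        if (hasBuffered) {
            hasBuffered = false;
            return buffered;
        }
        return delegate.next();
    }
}
```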
Here is an example of where this need arises: identifying an identifier token.
We'll keep the definition of identifiers simple: a sequence of letters.
Having fixed your starting position at the first letter, you need to keep advancing until you reach a character that is not a letter, without consuming that last character (which might well be a semicolon or a space, signaling the move to the next iteration of the tokenization process).
Here is the snippet that does this:
while (peekingIterator.hasNext() && Character.isLetter(peekingIterator.peek().charAt(0))) {
    tokenValue.append(peekingIterator.next());
}
Token Types Representation
Let's define the token types that we will be recognizing in the Monkey language.
We'll use an enum to represent the token types, with each constant overriding toString to provide a human-readable representation.
(This definition will change slightly once we move to the parser, which has other needs: identifying keywords, infix and prefix parsing, and so on.)
package com.iqbalaissaoui.lex;
public enum TokenType {
    ILLEGAL(false),
    EOF(false),
    IDENT(false),
    INT(false),
    ASSIGN(false)    { @Override public String toString() { return "=";      } },
    PLUS(false)      { @Override public String toString() { return "+";      } },
    COMMA(false)     { @Override public String toString() { return ",";      } },
    SEMICOLON(false) { @Override public String toString() { return ";";      } },
    LPAREN(false)    { @Override public String toString() { return "(";      } },
    RPAREN(false)    { @Override public String toString() { return ")";      } },
    LBRACE(false)    { @Override public String toString() { return "{";      } },
    RBRACE(false)    { @Override public String toString() { return "}";      } },
    FUNCTION(true)   { @Override public String toString() { return "fn";     } },
    LET(true)        { @Override public String toString() { return "let";    } },
    RETURN(true)     { @Override public String toString() { return "return"; } },
    TRUE(true)       { @Override public String toString() { return "true";   } },
    FALSE(true)      { @Override public String toString() { return "false";  } },
    IF(true)         { @Override public String toString() { return "if";     } },
    ELSE(true)       { @Override public String toString() { return "else";   } },
    MINUS(false)     { @Override public String toString() { return "-";      } },
    BANG(false)      { @Override public String toString() { return "!";      } },
    ASTERISK(false)  { @Override public String toString() { return "*";      } },
    SLASH(false)     { @Override public String toString() { return "/";      } },
    LT(false)        { @Override public String toString() { return "<";      } },
    GT(false)        { @Override public String toString() { return ">";      } },
    EQ(false)        { @Override public String toString() { return "==";     } },
    NOT_EQ(false)    { @Override public String toString() { return "!=";     } };

    final boolean isKeyword;

    TokenType(boolean isKeyword) {
        this.isKeyword = isKeyword;
    }

    public boolean isKeyword() {
        return this.isKeyword;
    }
}
Token Representation
A token is a pair of a token type and a value. So far, a record is enough to represent this:
package com.iqbalaissaoui.lex;
public record Token(TokenType type, String literal) {
    @Override
    public String toString() {
        return "Token(" + type.name() + ", " + literal + ")";
    }
}
Lexer Implementation
Here is the implementation of the lexer, which in this design is essentially a Spliterator of Tokens:
package com.iqbalaissaoui.lex;
import org.apache.commons.collections4.iterators.PeekingIterator;
import org.apache.commons.lang3.StringUtils;
import java.util.*;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.stream.Stream;
public class TokenSpliterator implements Spliterator<Token> {

    private final PeekingIterator<String> peekingIterator;

    public TokenSpliterator(PeekingIterator<String> peekingIterator) {
        this.peekingIterator = peekingIterator;
    }

    @Override
    public boolean tryAdvance(Consumer<? super Token> action) {
        // 1. skip whitespace, peeking so we never consume a meaningful character
        while (peekingIterator.hasNext() && StringUtils.isBlank(peekingIterator.peek())) {
            peekingIterator.next();
        }
        // 2. check EOF
        if (!peekingIterator.hasNext()) {
            action.accept(new Token(TokenType.EOF, TokenType.EOF.toString()));
            return false;
        }
        String current = peekingIterator.next();
        Token t = switch (current) {
            case "=" -> {
                // look ahead for the two-character token "=="
                if (peekingIterator.hasNext() && peekingIterator.peek().contentEquals("=")) {
                    peekingIterator.next();
                    yield new Token(TokenType.EQ, TokenType.EQ.toString());
                }
                yield new Token(TokenType.ASSIGN, TokenType.ASSIGN.toString());
            }
            case "+" -> new Token(TokenType.PLUS, TokenType.PLUS.toString());
            case "," -> new Token(TokenType.COMMA, TokenType.COMMA.toString());
            case ";" -> new Token(TokenType.SEMICOLON, TokenType.SEMICOLON.toString());
            case "(" -> new Token(TokenType.LPAREN, TokenType.LPAREN.toString());
            case ")" -> new Token(TokenType.RPAREN, TokenType.RPAREN.toString());
            case "{" -> new Token(TokenType.LBRACE, TokenType.LBRACE.toString());
            case "}" -> new Token(TokenType.RBRACE, TokenType.RBRACE.toString());
            case "!" -> {
                // look ahead for the two-character token "!="
                if (peekingIterator.hasNext() && peekingIterator.peek().contentEquals("=")) {
                    peekingIterator.next();
                    yield new Token(TokenType.NOT_EQ, TokenType.NOT_EQ.toString());
                }
                yield new Token(TokenType.BANG, TokenType.BANG.toString());
            }
            case "<" -> new Token(TokenType.LT, TokenType.LT.toString());
            case ">" -> new Token(TokenType.GT, TokenType.GT.toString());
            case "*" -> new Token(TokenType.ASTERISK, TokenType.ASTERISK.toString());
            case "/" -> new Token(TokenType.SLASH, TokenType.SLASH.toString());
            case "-" -> new Token(TokenType.MINUS, TokenType.MINUS.toString());
            default -> {
                if (Character.isDigit(current.charAt(0))) {
                    StringBuilder tokenValue = new StringBuilder(current);
                    while (peekingIterator.hasNext() && Character.isDigit(peekingIterator.peek().charAt(0))) {
                        tokenValue.append(peekingIterator.next());
                    }
                    yield new Token(TokenType.INT, tokenValue.toString());
                } else if (Character.isLetter(current.charAt(0))) {
                    StringBuilder tokenValue = new StringBuilder(current);
                    while (peekingIterator.hasNext() && Character.isLetter(peekingIterator.peek().charAt(0))) {
                        tokenValue.append(peekingIterator.next());
                    }
                    // keyword lookup: the scanned word is either a keyword or an identifier
                    Optional<TokenType> opKeyword = Stream.of(TokenType.values())
                            .filter(TokenType::isKeyword)
                            .filter(tt -> tt.toString().contentEquals(tokenValue))
                            .findFirst();
                    AtomicReference<Token> token = new AtomicReference<>();
                    opKeyword.ifPresentOrElse(
                            tt -> token.set(new Token(tt, tt.toString())),
                            () -> token.set(new Token(TokenType.IDENT, tokenValue.toString()))
                    );
                    yield token.get();
                } else {
                    yield new Token(TokenType.ILLEGAL, TokenType.ILLEGAL.toString());
                }
            }
        };
        action.accept(t);
        return true;
    }

    @Override
    public Spliterator<Token> trySplit() {
        return null; // sequential only
    }

    @Override
    public long estimateSize() {
        return Long.MAX_VALUE; // size is unknown up front
    }

    @Override
    public int characteristics() {
        // not SIZED: estimateSize does not report an exact count
        return ORDERED | NONNULL;
    }
}
As you can see, the main logic lives in the tryAdvance method, where we identify each token's type and value by walking conditionally through the character stream.
Entry Point and Project Structure
At the time of writing, I haven't implemented a proper REPL that scans input from the user and tokenizes it.
Instead, I've used a simple main method to test the lexer with the help of JBang:
Structure:
➜ src git:(interpreter) ✗ tree .
.
└── com
└── iqbalaissaoui
├── Repl.java
├── lex
│ ├── Lexer.java
│ ├── Token.java
│ ├── TokenSpliterator.java
│ └── TokenType.java
├── main.monkey
JBang Entry Point:
///usr/bin/env jbang "$0" "$@" ; exit $?
//DEPS org.apache.commons:commons-lang3:3.17.0
//DEPS org.apache.commons:commons-collections4:4.4
//SOURCES lex/TokenSpliterator.java
//SOURCES lex/TokenType.java
//SOURCES lex/Token.java
package com.iqbalaissaoui;
import com.iqbalaissaoui.lex.Token;
import com.iqbalaissaoui.lex.TokenSpliterator;
import org.apache.commons.collections4.iterators.PeekingIterator;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import static java.lang.System.*;
public class Repl {
    public static void main(String... args) throws IOException {
        Path mk = Path.of("main.monkey");
        try (Stream<String> lines = Files.lines(mk, StandardCharsets.UTF_8)) {
            // Files.lines strips line terminators; re-append a newline so a token
            // at the end of one line never merges with the start of the next
            Stream<String> chars = lines.flatMap(line -> (line + "\n").chars()
                    .mapToObj(Character::toString));
            PeekingIterator<String> peekingIterator = new PeekingIterator<>(chars.iterator());
            TokenSpliterator tokenSpliterator = new TokenSpliterator(peekingIterator);
            Stream<Token> tokens = StreamSupport.stream(tokenSpliterator, false);
            tokens.forEach(out::println);
        }
    }
}
Monkey File Example (main.monkey):
let five = 5;
let ten = 10;
let add = fn(x, y) {
x + y;
};
let result = add(five, ten);
Let's run the entry point with JBang:

jbang Repl.java

Output:
➜ iqbalaissaoui git:(interpreter) ✗ jbang Repl.java
Token(LET, let)
Token(IDENT, five)
Token(ASSIGN, =)
Token(INT, 5)
Token(SEMICOLON, ;)
Token(LET, let)
Token(IDENT, ten)
Token(ASSIGN, =)
Token(INT, 10)
Token(SEMICOLON, ;)
Token(LET, let)
Token(IDENT, add)
Token(ASSIGN, =)
Token(FUNCTION, fn)
Token(LPAREN, ()
Token(IDENT, x)
Token(COMMA, ,)
Token(IDENT, y)
Token(RPAREN, ))
Token(LBRACE, {)
Token(IDENT, x)
Token(PLUS, +)
Token(IDENT, y)
Token(SEMICOLON, ;)
Token(RBRACE, })
Token(SEMICOLON, ;)
Token(LET, let)
Token(IDENT, result)
Token(ASSIGN, =)
Token(IDENT, add)
Token(LPAREN, ()
Token(IDENT, five)
Token(COMMA, ,)
Token(IDENT, ten)
Token(RPAREN, ))
Token(EOF, EOF)