The Nominomicon
Welcome to the Nominomicon, a guide to using the Nom parser for great good. This guide introduces the theory and practice of using Nom.
This guide assumes only that you are:
- Wanting to learn Nom,
- Already familiar with Rust.
Nom is a parser-combinator library. In other words, it gives you tools to define:
- "parsers" (a function that takes an input, and gives back an output), and
- "combinators" (functions that take parsers, and combine them together!).
By combining parsers with combinators, you can build complex parsers up from simpler ones. These complex parsers are enough to understand HTML, mkv or Python!
Before we set off, it's important to list some caveats:
- This guide is for Nom7. Nom has undergone significant changes, so if you are searching for documentation or StackOverflow answers, you may find older documentation. Some common indicators that it is an old version are:
  - Documentation older than 21st August, 2021
  - Use of the named! macro
  - Use of CompleteStr or CompleteByteSlice.
- Nom can parse (almost) anything, but this guide will focus almost entirely on parsing complete &str into things.
Chapter 1: The Nom Way
First of all, we need to understand the way that nom thinks about parsing. As discussed in the introduction, nom lets us build simple parsers, and then combine them (using "combinators").
Let's discuss what a "parser" actually does. A parser takes an input and returns a result, where:
- Ok indicates the parser successfully found what it was looking for; or
- Err indicates the parser could not find what it was looking for.
If the parser was successful, then it will return a tuple. The first field of the tuple will contain everything the parser did not process. The second will contain everything the parser processed. The idea is that a parser can happily parse the first part of an input, without being able to parse the whole thing.
If the parser failed, then there are multiple errors that could be returned. For simplicity, however, in the next chapters we will leave these unexplored.
                                   ┌─► Ok(
                                   │      what the parser didn't touch,
                                   │      what the parser matched
                                   │   )
             ┌─────────┐           │
 my input───►│my parser├──►either──┤
             └─────────┘           └─► Err(...)
To represent this model of the world, nom uses the IResult<I, O> type. The Ok variant takes two types -- I, the type of the input; and O, the type of the output, whereas the Err variant stores an error.
You can import that from:
use nom::IResult;
You'll note that I and O are parameterized. While most of the examples in this book will use &str (i.e. parsing a string), they do not have to be strings, nor do they have to be the same type. Consider the simple example where I = &str and O = u64: this parses a string into an unsigned integer.
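To make the "I = &str, O = u64" case concrete, here is a minimal sketch that does not use nom at all. It assumes a simplified, hypothetical ParseResult alias standing in for IResult (nom's real error type is richer than a String), and a hypothetical leading_u64 function that returns the remaining input first and the parsed value second.

```rust
/// A simplified stand-in for nom's `IResult<I, O>`: on success, a tuple of
/// (remaining input, output); on failure, a plain error message.
/// (Hypothetical alias for illustration only.)
type ParseResult<I, O> = Result<(I, O), String>;

/// Parses the leading ASCII digits of `input` into a `u64`,
/// so here I = &str and O = u64.
fn leading_u64(input: &str) -> ParseResult<&str, u64> {
    // Byte index where the run of ASCII digits ends.
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        return Err(format!("expected a digit at {:?}", input));
    }
    let (digits, rest) = input.split_at(end);

    // Accumulate the value by hand, reporting overflow as a parse error.
    let mut value: u64 = 0;
    for byte in digits.bytes() {
        let digit = u64::from(byte - b'0');
        value = value
            .checked_mul(10)
            .and_then(|v| v.checked_add(digit))
            .ok_or_else(|| format!("{:?} does not fit in a u64", digits))?;
    }
    Ok((rest, value))
}
```

With this sketch, leading_u64("42 apples") gives Ok((" apples", 42)): the input and output types differ, and the untouched text comes back alongside the value.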
Let's write our first parser! The simplest parser we can write is one which successfully does nothing.
This parser should take in an &str:
- Since it is supposed to succeed, we know it will return the Ok variant.
- Since it does nothing to our input, the remaining input is the same as the input.
- Since it doesn't parse anything, it also should just return an empty string.
use nom::IResult;
use std::error::Error;

pub fn do_nothing_parser(input: &str) -> IResult<&str, &str> {
    Ok((input, ""))
}

fn main() -> Result<(), Box<dyn Error>> {
    let (remaining_input, output) = do_nothing_parser("my_input")?;
    assert_eq!(remaining_input, "my_input");
    assert_eq!(output, "");
    Ok(())
}
It's that easy!
Chapter 2: Tags and Character Classes
The simplest useful parser you can write is one which has no special characters; it just matches a string.
In nom, we call a simple collection of bytes a tag. Because these are so common, there already exists a function called tag(). This function returns a parser for a given string.
Warning: nom has multiple different definitions of tag; make sure you use this one for the moment!
extern crate nom;
pub use nom::bytes::complete::tag;
For example, code to parse the string "abc" could be represented as tag("abc").
If you have not programmed in a language where functions are values, the type signature of the tag function might be a surprise:
pub fn tag<T, Input, Error: ParseError<Input>>(
    tag: T
) -> impl Fn(Input) -> IResult<Input, Input, Error>
where
    Input: InputTake + Compare<T>,
    T: InputLength + Clone,
Or, for the case where Input and T are both &str, and simplifying slightly:
fn tag(tag: &str) -> (impl Fn(&str) -> IResult<&str, &str, Error>)
In other words, this function tag returns a function. The function it returns is a parser, taking a &str and returning an IResult. Functions creating parsers and returning them is a common pattern in Nom, so it is useful to call out.
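The "function that returns a parser" pattern can be sketched without nom. The code below is a hand-rolled illustration, not nom's implementation: my_tag and ParseResult are hypothetical names, and errors are plain strings for simplicity.

```rust
/// Simplified stand-in for `IResult<&str, O>` (hypothetical alias).
type ParseResult<'a, O> = Result<(&'a str, O), String>;

/// Builds and returns a parser that matches `expected` at the start of
/// its input. Calling `my_tag("abc")` parses nothing yet; it only creates
/// the closure that will do the parsing later.
fn my_tag<'a>(expected: &'static str) -> impl Fn(&'a str) -> ParseResult<'a, &'a str> {
    move |input: &'a str| match input.strip_prefix(expected) {
        // Success: (what is left over, the part of the input that matched).
        Some(rest) => Ok((rest, &input[..expected.len()])),
        None => Err(format!("expected {:?}", expected)),
    }
}
```

Storing the result, as in let parse_abc = my_tag("abc");, and then calling parse_abc("abcWorld") is the two-step version of writing my_tag("abc")("abcWorld") in one go.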
Below, we have implemented a function that uses tag.
extern crate nom;
pub use nom::bytes::complete::tag;
pub use nom::IResult;
use std::error::Error;

fn parse_input(input: &str) -> IResult<&str, &str> {
    //  note that this is really creating a function, the parser for abc
    //  vvvvv
    //         which is then called here, returning an IResult<&str, &str>
    //         vvvvv
    tag("abc")(input)
}

fn main() -> Result<(), Box<dyn Error>> {
    let (leftover_input, output) = parse_input("abcWorld")?;
    assert_eq!(leftover_input, "World");
    assert_eq!(output, "abc");

    assert!(parse_input("defWorld").is_err());
    Ok(())
}
If you'd like to, you can also check tags without case-sensitivity with the tag_no_case function.
Character Classes
Tags are incredibly useful, but they are also incredibly restrictive. The other end of Nom's functionality is pre-written parsers that allow us to accept any of a group of characters, rather than just accepting characters in a defined sequence.
Here is a selection of them:
- alpha0: Recognizes zero or more lowercase and uppercase alphabetic characters: /[a-zA-Z]/. alpha1 does the same but returns at least one character
- alphanumeric0: Recognizes zero or more numerical and alphabetic characters: /[0-9a-zA-Z]/. alphanumeric1 does the same but returns at least one character
- digit0: Recognizes zero or more numerical characters: /[0-9]/. digit1 does the same but returns at least one character
- multispace0: Recognizes zero or more spaces, tabs, carriage returns and line feeds. multispace1 does the same but returns at least one character
- space0: Recognizes zero or more spaces and tabs. space1 does the same but returns at least one character
- line_ending: Recognizes an end of line (both \n and \r\n)
- newline: Matches a newline character \n
- tab: Matches a tab character \t
We can use these as follows:
extern crate nom;
pub use nom::IResult;
use std::error::Error;
pub use nom::character::complete::alpha0;

fn parser(input: &str) -> IResult<&str, &str> {
    alpha0(input)
}

fn main() -> Result<(), Box<dyn Error>> {
    let (remaining, letters) = parser("abc123")?;
    assert_eq!(remaining, "123");
    assert_eq!(letters, "abc");
    Ok(())
}
One important note is that, due to the type signature of these functions, it is generally best to use them within a function that returns an IResult. If you don't, some of the information around the type of the tag function must be manually specified, which can lead to verbose code or confusing errors.
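The difference between the "0" and "1" variants of a character-class parser is easy to miss, so here is a small std-only sketch of it. The names my_alpha0 and my_alpha1 are hypothetical hand-rolled imitations, restricted to ASCII letters, with plain string errors; they are not nom's own code.

```rust
/// Simplified stand-in for `IResult<&str, &str>` (hypothetical alias).
type ParseResult<'a> = Result<(&'a str, &'a str), String>;

/// Zero or more ASCII letters: never fails, but may match the empty string.
fn my_alpha0(input: &str) -> ParseResult<'_> {
    let end = input
        .find(|c: char| !c.is_ascii_alphabetic())
        .unwrap_or(input.len());
    Ok((&input[end..], &input[..end]))
}

/// One or more ASCII letters: fails when nothing matched.
fn my_alpha1(input: &str) -> ParseResult<'_> {
    match my_alpha0(input)? {
        (_, "") => Err(format!("expected a letter at {:?}", input)),
        matched => Ok(matched),
    }
}
```

On the input "123", my_alpha0 succeeds with an empty match and leaves the input untouched, while my_alpha1 returns an error. Which of the two you want depends on whether "nothing here" is acceptable at that point in your grammar.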
Chapter 3: Alternatives and Composition
In the last chapter, we saw how to create simple parsers using the tag function and some of Nom's prebuilt parsers.
In this chapter, we explore two other widely used features of Nom: alternatives and composition.
Alternatives
Sometimes, we might want to choose between two parsers, and we're happy with either being used.
Nom gives us this ability through the alt() combinator.
use nom::branch::alt;
The alt() combinator will execute each parser in a tuple until it finds one that does not error. If all error, then by default you are given the error from the last parser.
We can see a basic example of alt() below.
extern crate nom;
use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::IResult;
use std::error::Error;

fn parse_abc_or_def(input: &str) -> IResult<&str, &str> {
    alt((
        tag("abc"),
        tag("def")
    ))(input)
}

fn main() -> Result<(), Box<dyn Error>> {
    let (leftover_input, output) = parse_abc_or_def("abcWorld")?;
    assert_eq!(leftover_input, "World");
    assert_eq!(output, "abc");

    assert!(parse_abc_or_def("ghiWorld").is_err());
    Ok(())
}
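To see what "try each parser until one does not error" means mechanically, here is a two-parser version written by hand with only the standard library. It is a sketch of the idea, not nom's implementation: either, abc, def and ParseResult are hypothetical names, and errors are plain strings.

```rust
/// Simplified stand-in for `IResult<&str, O>` (hypothetical alias).
type ParseResult<'a, O> = Result<(&'a str, O), String>;

/// Tries `first`; if it fails, tries `second` on the *same* input.
/// If both fail, the error of the last parser tried is returned.
fn either<'a, O>(
    first: impl Fn(&'a str) -> ParseResult<'a, O>,
    second: impl Fn(&'a str) -> ParseResult<'a, O>,
) -> impl Fn(&'a str) -> ParseResult<'a, O> {
    move |input: &'a str| first(input).or_else(|_| second(input))
}

/// Matches the literal "abc" at the start of the input.
fn abc(input: &str) -> ParseResult<'_, &str> {
    match input.strip_prefix("abc") {
        Some(rest) => Ok((rest, &input[..3])),
        None => Err("expected abc".to_string()),
    }
}

/// Matches the literal "def" at the start of the input.
fn def(input: &str) -> ParseResult<'_, &str> {
    match input.strip_prefix("def") {
        Some(rest) => Ok((rest, &input[..3])),
        None => Err("expected def".to_string()),
    }
}
```

The key detail is that a failed alternative consumes nothing: second is handed the original input, not whatever first managed to look at before giving up.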
Composition
Now that we can create more interesting parsers, we can compose them together. The simplest way to do this is just to evaluate them in sequence:
extern crate nom;
use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::IResult;
use std::error::Error;

fn parse_abc(input: &str) -> IResult<&str, &str> {
    tag("abc")(input)
}

fn parse_def_or_ghi(input: &str) -> IResult<&str, &str> {
    alt((
        tag("def"),
        tag("ghi")
    ))(input)
}

fn main() -> Result<(), Box<dyn Error>> {
    let input = "abcghi";
    let (remainder, abc) = parse_abc(input)?;
    let (remainder, def_or_ghi) = parse_def_or_ghi(remainder)?;
    println!("first parsed: {abc}; then parsed: {def_or_ghi};");

    Ok(())
}
Composing tags is such a common requirement that, in fact, Nom has a few built-in combinators to do it. The simplest of these is tuple(). The tuple() combinator takes a tuple of parsers, and either returns Ok with a tuple of all of their successful parses, or it returns the Err of the first failed parser.
use nom::sequence::tuple;
extern crate nom;
use nom::branch::alt;
use nom::sequence::tuple;
use nom::bytes::complete::tag_no_case;
use nom::IResult;
use std::error::Error;

fn parse_base(input: &str) -> IResult<&str, &str> {
    alt((
        tag_no_case("a"),
        tag_no_case("t"),
        tag_no_case("c"),
        tag_no_case("g")
    ))(input)
}

fn parse_pair(input: &str) -> IResult<&str, (&str, &str)> {
    // the many_m_n combinator might also be appropriate here.
    tuple((
        parse_base,
        parse_base,
    ))(input)
}

fn main() -> Result<(), Box<dyn Error>> {
    let (remaining, parsed) = parse_pair("aTcG")?;
    assert_eq!(parsed, ("a", "T"));
    assert_eq!(remaining, "cG");

    assert!(parse_pair("Dct").is_err());

    Ok(())
}
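Sequencing is just the manual "parse, then parse the remainder" pattern from earlier wrapped up in a function. The std-only sketch below shows one way a two-parser sequence could be written by hand. The names both, base and ParseResult are hypothetical, errors are plain strings, and base returns a char rather than a &str purely to keep the example short.

```rust
/// Simplified stand-in for `IResult<&str, O>` (hypothetical alias).
type ParseResult<'a, O> = Result<(&'a str, O), String>;

/// Runs `first`, then runs `second` on whatever `first` left over.
/// Stops at the first failure and returns that error.
fn both<'a, A, B>(
    first: impl Fn(&'a str) -> ParseResult<'a, A>,
    second: impl Fn(&'a str) -> ParseResult<'a, B>,
) -> impl Fn(&'a str) -> ParseResult<'a, (A, B)> {
    move |input: &'a str| {
        let (rest, a) = first(input)?;
        let (rest, b) = second(rest)?;
        Ok((rest, (a, b)))
    }
}

/// Matches one DNA base (a, t, c or g), ignoring ASCII case.
fn base(input: &str) -> ParseResult<'_, char> {
    match input.chars().next() {
        Some(c) if "atcg".contains(c.to_ascii_lowercase()) => {
            Ok((&input[c.len_utf8()..], c))
        }
        _ => Err(format!("expected a base at {:?}", input)),
    }
}
```

Here both(base, base) applied to "aTcG" yields the pair ('a', 'T') and leaves "cG", mirroring the parse_pair example above.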
Extra Nom Tools
After using alt() and tuple(), you might also be interested in a few other parsers that do similar things:
combinator | usage | input | output | comment |
---|---|---|---|---|
delimited | delimited(char('('), take(2), char(')')) | "(ab)cd" | Ok(("cd", "ab")) | |
preceded | preceded(tag("ab"), tag("XY")) | "abXYZ" | Ok(("Z", "XY")) | |
terminated | terminated(tag("ab"), tag("XY")) | "abXYZ" | Ok(("Z", "ab")) | |
pair | pair(tag("ab"), tag("XY")) | "abXYZ" | Ok(("Z", ("ab", "XY"))) | |
separated_pair | separated_pair(tag("hello"), char(','), tag("world")) | "hello,world!" | Ok(("!", ("hello", "world"))) | |
Chapter 4: Parsers With Custom Return Types
So far, we have seen mostly functions that take an &str, and return an IResult<&str, &str>. Splitting strings into smaller strings is certainly useful, but it's not the only thing Nom is capable of!
A useful operation when parsing is to convert between types; for example, parsing from &str to another primitive, like bool.
All we need to do for our parser to return a different type is to change the second type parameter of IResult to the desired return type. For example, to return a bool, return an IResult<&str, bool>.
Recall that the first type parameter of the IResult is the input type, so even if you're returning something different, if your input is a &str, the first type argument of IResult should be also.
Until you have read the chapter on Errors, we strongly suggest avoiding the use of parsers built into Rust (like str::parse), as they require special handling to work well with Nom.
That said, one Nom-native way of doing a type conversion is to use the value combinator to convert from a successful parse to a particular value.
The following code converts from a string containing "true" or "false" to the corresponding bool.
extern crate nom;
use std::error::Error;
use nom::IResult;
use nom::bytes::complete::tag;
use nom::combinator::value;
use nom::branch::alt;

fn parse_bool(input: &str) -> IResult<&str, bool> {
    // either, parse `"true"` -> `true`; `"false"` -> `false`, or error.
    alt((
        value(true, tag("true")),
        value(false, tag("false")),
    ))(input)
}

fn main() -> Result<(), Box<dyn Error>> {
    // Parses the `"true"` out.
    let (remaining, parsed) = parse_bool("true|false")?;
    assert_eq!(parsed, true);
    assert_eq!(remaining, "|false");

    // If we forget about the "|", we get an error.
    let parsing_error = parse_bool(remaining);
    assert!(parsing_error.is_err());

    // Skipping the first byte gives us `false`!
    let (remaining, parsed) = parse_bool(&remaining[1..])?;
    assert_eq!(parsed, false);
    assert_eq!(remaining, "");

    Ok(())
}
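If it is unclear what a value-style combinator does with the inner parser's output, the following std-only sketch may help. It is a hand-rolled imitation under simplifying assumptions (string errors, &str input only); replace_with, keyword and ParseResult are hypothetical names, not nom functions.

```rust
/// Simplified stand-in for `IResult<&str, O>` (hypothetical alias).
type ParseResult<'a, O> = Result<(&'a str, O), String>;

/// If `parser` succeeds, discards its output and returns a clone of `val`
/// instead. Failures are passed through unchanged.
fn replace_with<'a, T: Clone, O>(
    val: T,
    parser: impl Fn(&'a str) -> ParseResult<'a, O>,
) -> impl Fn(&'a str) -> ParseResult<'a, T> {
    move |input: &'a str| parser(input).map(|(rest, _)| (rest, val.clone()))
}

/// Builds a parser matching the literal `word` at the start of the input.
fn keyword<'a>(word: &'static str) -> impl Fn(&'a str) -> ParseResult<'a, &'a str> {
    move |input: &'a str| match input.strip_prefix(word) {
        Some(rest) => Ok((rest, &input[..word.len()])),
        None => Err(format!("expected {:?}", word)),
    }
}

/// "true" becomes `true`, "false" becomes `false`, anything else is an error.
fn parse_bool(input: &str) -> ParseResult<'_, bool> {
    let parse_true = replace_with(true, keyword("true"));
    let parse_false = replace_with(false, keyword("false"));
    parse_true(input).or_else(|_| parse_false(input))
}
```

Notice that only the output slot of the result changes type; the remaining input is threaded through exactly as before.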
Nom's in-built parser functions
Nom has a wide array of parsers built in. Here is a list of parsers which recognize specific characters.
Some of them we have seen before in Chapter 2, but now we also can try out the parsers that return different types, like i32. An example of this parser is shown in the next section.
Building a More Complex Example
A more complex example of parsing custom types might be parsing a 2D coordinate.
Let us try to figure out how to design this.
- We know that we want to take a string, like "(3, -2)", and convert it into a Coordinate struct.
- We can split this into three parts:

(vvvvvvvvvvvvv) # The outer brackets.
  vvvv , vvvv   # The comma, separating values.
    3     -2    # The actual integers.

- So, we will need three parsers to deal with this:
  - A parser for integers, which will deal with the raw numbers.
  - A parser for a comma-separated pair, which will split it up into integers.
  - A parser for the outer brackets.
- We can see below how we achieve this:
extern crate nom;
use std::error::Error;
use nom::IResult;
use nom::bytes::complete::tag;
use nom::sequence::{separated_pair, delimited};

// This is the type we will parse into.
#[derive(Debug,PartialEq)]
pub struct Coordinate {
    pub x: i32,
    pub y: i32,
}

// 1. Nom has an in-built i32 parser.
use nom::character::complete::i32;

// 2. Use the `separated_pair` parser to combine two parsers (in this case,
//    both `i32`), ignoring something in-between.
fn parse_integer_pair(input: &str) -> IResult<&str, (i32, i32)> {
    separated_pair(
        i32,
        tag(", "),
        i32
    )(input)
}

// 3. Use the `delimited` parser to apply a parser, ignoring the results
//    of two surrounding parsers.
fn parse_coordinate(input: &str) -> IResult<&str, Coordinate> {
    let (remaining, (x, y)) = delimited(
        tag("("),
        parse_integer_pair,
        tag(")")
    )(input)?;

    // Note: we could construct this by implementing `From` on `Coordinate`,
    // We don't, just so it's obvious what's happening.
    Ok((remaining, Coordinate {x, y}))
}

fn main() -> Result<(), Box<dyn Error>> {
    let (_, parsed) = parse_coordinate("(3, 5)")?;
    assert_eq!(parsed, Coordinate {x: 3, y: 5});

    let (_, parsed) = parse_coordinate("(2, -4)")?;
    assert_eq!(parsed, Coordinate {x: 2, y: -4});

    let parsing_error = parse_coordinate("(3,)");
    assert!(parsing_error.is_err());

    let parsing_error = parse_coordinate("(,3)");
    assert!(parsing_error.is_err());

    let parsing_error = parse_coordinate("Ferris");
    assert!(parsing_error.is_err());

    Ok(())
}
As an exercise, you might want to explore how to make this parser deal gracefully with whitespace in the input.
Chapter 5: Repeating with Predicates
Just as the humble while loop unlocks many useful features when programming, repeating a parser multiple times can be incredibly useful in Nom.
There are, however, two ways of including repeating functionality into Nom -- parsers which are governed by a predicate; and combinators which repeat a parser.
Parsers which use a predicate
A predicate is a function which returns a boolean value (i.e. given some input, it returns true or false). These are incredibly common when parsing -- for instance, a predicate is_vowel might decide whether a character is an English vowel (a, e, i, o or u).
These can be used to make parsers that Nom hasn't built in. For instance, a parser built on an is_vowel predicate could take as many vowels as possible.
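As a rough illustration of the vowel idea, here is a std-only sketch. The helper take_while_pred is a hypothetical hand-rolled function that imitates the behaviour of a take_while-style parser; with nom you would hand the same kind of predicate to take_while rather than write the loop yourself. For simplicity it returns a bare tuple, since this kind of parser cannot fail.

```rust
/// The predicate: true for the five English vowels, in either case.
fn is_vowel(c: char) -> bool {
    matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u')
}

/// Consumes the longest prefix whose characters all satisfy `predicate`.
/// Returns (remaining input, matched prefix); the match may be empty.
fn take_while_pred(input: &str, predicate: impl Fn(char) -> bool) -> (&str, &str) {
    // Byte index of the first character that fails the predicate.
    let end = input
        .find(|c: char| !predicate(c))
        .unwrap_or(input.len());
    (&input[end..], &input[..end])
}

/// Takes as many leading vowels as possible.
fn vowels(input: &str) -> (&str, &str) {
    take_while_pred(input, is_vowel)
}
```

For example, vowels("aeiouxyz") returns ("xyz", "aeiou"), and vowels("xyz") returns ("xyz", "") because zero matching characters is still a successful match.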
There are a few different categories of predicate parsers that are worth mentioning:
- For bytes, there are three different categories of parser: take_till, take_until, and take_while. take_till will continue consuming input until its input meets the predicate. take_while will continue consuming input until its input does not meet the predicate. take_until looks a lot like a predicate parser, but simply consumes until the first occurrence of the pattern of bytes.
- Some parsers have a "twin" with a 1 at the end of their name -- for example, take_while has take_while1. The difference between them is that take_while could return an empty slice if the first byte does not satisfy a predicate. take_while1 returns an error if the predicate is not met.
- As a special case, take_while_m_n is like take_while, but guarantees that it will consume at least m bytes, and no more than n bytes.
extern crate nom;
use std::error::Error;
use nom::IResult;
use nom::bytes::complete::{take_until, take_while};
use nom::sequence::terminated;

fn parse_sentence(input: &str) -> IResult<&str, &str> {
    // Take everything up to the first ".", then discard the "." and any spaces.
    terminated(take_until("."), take_while(|c| c == '.' || c == ' '))(input)
}

fn main() -> Result<(), Box<dyn Error>> {
    let (remaining, parsed) = parse_sentence("I am Tom. I write Rust.")?;
    assert_eq!(parsed, "I am Tom");
    assert_eq!(remaining, "I write Rust.");

    let parsing_error = parse_sentence("Not a sentence (no period at the end)");
    assert!(parsing_error.is_err());

    Ok(())
}
For detailed examples, see their documentation, shown below:
combinator | usage | input | output | comment |
---|---|---|---|---|
take_while | take_while(is_alphabetic) | "abc123" | Ok(("123", "abc")) | Returns the longest list of bytes for which the provided function returns true. take_while1 does the same, but must return at least one character. take_while_m_n does the same, but must return between m and n characters. |
take_till | take_till(is_alphabetic) | "123abc" | Ok(("abc", "123")) | Returns the longest list of bytes or characters until the provided function returns true. take_till1 does the same, but must return at least one character. This is the reverse behaviour from take_while : take_till(f) is equivalent to take_while(\|c\| !f(c)) |
take_until | take_until("world") | "Hello world" | Ok(("world", "Hello ")) | Returns the longest list of bytes or characters until the provided tag is found. take_until1 does the same, but must return at least one character |
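The bounded variant is the least obvious of the family, so here is a hand-written sketch of the "at least m, at most n" rule. It uses only the standard library; take_while_between and ParseResult are hypothetical names, errors are plain strings, and the sketch counts characters of a &str rather than raw bytes.

```rust
/// Simplified stand-in for `IResult<&str, &str>` (hypothetical alias).
type ParseResult<'a> = Result<(&'a str, &'a str), String>;

/// Consumes characters while `predicate` holds, stopping after `max` of
/// them, and fails if fewer than `min` were consumed.
fn take_while_between(
    input: &str,
    min: usize,
    max: usize,
    predicate: impl Fn(char) -> bool,
) -> ParseResult<'_> {
    let mut end = 0; // byte offset just past the last accepted character
    let mut count = 0; // number of characters accepted so far
    for c in input.chars() {
        if count == max || !predicate(c) {
            break;
        }
        end += c.len_utf8();
        count += 1;
    }
    if count < min {
        return Err(format!("expected at least {} matching characters", min));
    }
    Ok((&input[end..], &input[..end]))
}
```

A typical use is fixed-width fields: asking for exactly two hex digits from "ff00aa" (min and max both 2) yields "ff" and leaves "00aa", even though more hex digits follow.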
Chapter 6: Repeating Parsers
A single parser which repeats a predicate is useful, but more useful still is a combinator that repeats a parser. Nom has multiple combinators which operate on this principle; the most obvious of which is many0, which applies a parser as many times as possible, and returns a vector of the results of those parses. Here is an example:
extern crate nom;
use std::error::Error;
use nom::IResult;
use nom::multi::many0;
use nom::bytes::complete::tag;

fn parser(s: &str) -> IResult<&str, Vec<&str>> {
    many0(tag("abc"))(s)
}

fn main() {
    assert_eq!(parser("abcabc"), Ok(("", vec!["abc", "abc"])));
    assert_eq!(parser("abc123"), Ok(("123", vec!["abc"])));
    assert_eq!(parser("123123"), Ok(("123123", vec![])));
    assert_eq!(parser(""), Ok(("", vec![])));
}
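Under the hood, "apply a parser as many times as possible" is a loop that keeps feeding the leftover input back into the same parser. The std-only sketch below shows one way to write that loop. The names repeat0, abc and ParseResult are hypothetical, errors are plain strings, and the guard against a parser that consumes nothing is a design choice of this sketch to avoid looping forever.

```rust
/// Simplified stand-in for `IResult<&str, O>` (hypothetical alias).
type ParseResult<'a, O> = Result<(&'a str, O), String>;

/// Applies `parser` until it fails, collecting every output into a Vec.
/// Zero successful applications is still a success (an empty Vec).
fn repeat0<'a, O>(
    parser: impl Fn(&'a str) -> ParseResult<'a, O>,
) -> impl Fn(&'a str) -> ParseResult<'a, Vec<O>> {
    move |mut input: &'a str| {
        let mut results = Vec::new();
        while let Ok((rest, item)) = parser(input) {
            // A parser that succeeds without consuming input would never stop.
            if rest.len() == input.len() {
                return Err("inner parser consumed no input".to_string());
            }
            results.push(item);
            input = rest;
        }
        Ok((input, results))
    }
}

/// Matches the literal "abc" at the start of the input.
fn abc(input: &str) -> ParseResult<'_, &str> {
    match input.strip_prefix("abc") {
        Some(rest) => Ok((rest, &input[..3])),
        None => Err(format!("expected \"abc\" at {:?}", input)),
    }
}
```

The inner parser's final failure is not an error for the repetition as a whole; it is simply the signal that the loop is done.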
There are many different parsers to choose from:
combinator | usage | input | output | comment |
---|---|---|---|---|
count | count(take(2), 3) | "abcdefgh" | Ok(("gh", vec!["ab", "cd", "ef"])) | Applies the child parser a specified number of times |
many0 | many0(tag("ab")) | "abababc" | Ok(("c", vec!["ab", "ab", "ab"])) | Applies the parser 0 or more times and returns the list of results in a Vec. many1 does the same operation but must return at least one element |
many_m_n | many_m_n(1, 3, tag("ab")) | "ababc" | Ok(("c", vec!["ab", "ab"])) | Applies the parser between m and n times (n included) and returns the list of results in a Vec |
many_till | many_till(tag( "ab" ), tag( "ef" )) | "ababefg" | Ok(("g", (vec!["ab", "ab"], "ef"))) | Applies the first parser until the second applies. Returns a tuple containing the list of results from the first in a Vec and the result of the second |
separated_list0 | separated_list0(tag(","), tag("ab")) | "ab,ab,ab." | Ok((".", vec!["ab", "ab", "ab"])) | separated_list1 works like separated_list0 but must return at least one element |
fold_many0 | fold_many0(be_u8, \|\| 0, \|acc, item\| acc + item) | [1, 2, 3] | Ok(([], 6)) | Applies the parser 0 or more times and folds the list of return values. The fold_many1 version must apply the child parser at least one time |
fold_many_m_n | fold_many_m_n(1, 2, be_u8, \|\| 0, \|acc, item\| acc + item) | [1, 2, 3] | Ok(([3], 3)) | Applies the parser between m and n times (n included) and folds the list of return values |
length_count | length_count(number, tag("ab")) | "2ababab" | Ok(("ab", vec!["ab", "ab"])) | Gets a number from the first parser, then applies the second parser that many times |
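The fold variants in the table differ from many0 only in what they do with each result: instead of pushing into a Vec, they combine it into an accumulator. Here is a std-only sketch of that idea working on &str input. The names fold0, digit and ParseResult are hypothetical, and errors are plain strings.

```rust
/// Simplified stand-in for `IResult<&str, O>` (hypothetical alias).
type ParseResult<'a, O> = Result<(&'a str, O), String>;

/// Applies `parser` until it fails, folding each output into an
/// accumulator that starts at `init()`.
fn fold0<'a, O, Acc>(
    parser: impl Fn(&'a str) -> ParseResult<'a, O>,
    init: impl Fn() -> Acc,
    combine: impl Fn(Acc, O) -> Acc,
) -> impl Fn(&'a str) -> ParseResult<'a, Acc> {
    move |mut input: &'a str| {
        let mut acc = init();
        while let Ok((rest, item)) = parser(input) {
            // Guard against a parser that succeeds without consuming input.
            if rest.len() == input.len() {
                return Err("inner parser consumed no input".to_string());
            }
            acc = combine(acc, item);
            input = rest;
        }
        Ok((input, acc))
    }
}

/// Parses a single decimal digit into its numeric value.
fn digit(input: &str) -> ParseResult<'_, u32> {
    match input.chars().next().and_then(|c| c.to_digit(10)) {
        // Decimal digits are one byte long, so slicing at 1 is safe.
        Some(d) => Ok((&input[1..], d)),
        None => Err(format!("expected a digit at {:?}", input)),
    }
}
```

With fold0(digit, || 0u32, |acc, d| acc + d), the input "123abc" folds to 6 and leaves "abc". Taking the initial value as a closure lets the resulting parser be called more than once with a fresh accumulator each time.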
Chapter 7: Using Errors from Outside Nom
Nom has other documentation about errors, so in place of this chapter, read this page.
Particular Notes
- It's particularly useful to use the map_res function. It allows you to convert an external error to a Nom error. For an example, see the Nom example on the front page.
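To show the shape of the idea without depending on nom's error types, here is a std-only sketch of a map_res-style helper. It runs a parser, then applies a fallible conversion such as str::parse, and turns the conversion's error into the parser's own error type. The names map_result, digits, parse_u8 and ParseResult are hypothetical, and a String stands in for a real Nom error.

```rust
/// Simplified stand-in for `IResult<&str, O>` (hypothetical alias).
type ParseResult<'a, O> = Result<(&'a str, O), String>;

/// Runs `parser`, then feeds its output through the fallible `convert`.
/// An error from `convert` becomes a parse error instead of escaping.
fn map_result<'a, O1, O2, E: std::fmt::Display>(
    parser: impl Fn(&'a str) -> ParseResult<'a, O1>,
    convert: impl Fn(O1) -> Result<O2, E>,
) -> impl Fn(&'a str) -> ParseResult<'a, O2> {
    move |input: &'a str| {
        let (rest, raw) = parser(input)?;
        let value = convert(raw).map_err(|e| format!("conversion failed: {}", e))?;
        Ok((rest, value))
    }
}

/// Matches one or more ASCII digits, returned as a string slice.
fn digits(input: &str) -> ParseResult<'_, &str> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        return Err(format!("expected a digit at {:?}", input));
    }
    Ok((&input[end..], &input[..end]))
}

/// Parses leading digits into a `u8`, rejecting values above 255.
fn parse_u8(input: &str) -> ParseResult<'_, u8> {
    map_result(digits, |s: &str| s.parse::<u8>())(input)
}
```

Here parse_u8("300") fails even though digits matched, because the ParseIntError from str::parse is folded into the parse error rather than being lost or causing a panic.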