The Nominomicon

Welcome to the Nominomicon, a guide to using the Nom parser for great good. This guide introduces the theory and practice of using Nom.

This guide assumes only that you are:

  • Wanting to learn Nom,
  • Already familiar with Rust.

Nom is a parser-combinator library. In other words, it gives you tools to define:

  • "parsers" (a function that takes an input, and gives back an output), and
  • "combinators" (functions that take parsers, and combine them together!).

By combining parsers with combinators, you can build complex parsers up from simpler ones. These complex parsers are enough to understand HTML, mkv or Python!

Before we set off, it's important to list some caveats:

  • This guide is for Nom 7. Nom has undergone significant changes, so if you are searching for documentation or StackOverflow answers, you may find material written for older versions. Some common indicators of an old version are:
    • Documentation older than 21st August, 2021
    • Use of the named! macro
    • Use of CompleteStr or CompleteByteArray.
  • Nom can parse (almost) anything; but this guide will focus almost entirely on parsing complete &str into things.

Chapter 1: The Nom Way

First of all, we need to understand the way that nom thinks about parsing. As discussed in the introduction, nom lets us build simple parsers, and then combine them (using "combinators").

Let's discuss what a "parser" actually does. A parser takes an input and returns a result, where:

  • Ok indicates the parser successfully found what it was looking for; or
  • Err indicates the parser could not find what it was looking for.

If the parser was successful, then it will return a tuple. The first field of the tuple will contain everything the parser did not process. The second will contain everything the parser processed. The idea is that a parser can happily parse the first part of an input, without being able to parse the whole thing.

If the parser failed, then there are multiple errors that could be returned. For simplicity, however, in the next chapters we will leave these unexplored.

                                   ┌─► Ok(
                                   │      what the parser didn't touch,
                                   │      what the parser matched
                                   │   )
             ┌─────────┐           │
 my input───►│my parser├──►either──┤
             └─────────┘           └─► Err(...)

To represent this model of the world, nom uses the IResult<I, O> type. The Ok variant holds a tuple of two values: the remaining input, of type I, and the output, of type O. The Err variant stores an error.

You can import that from:

use nom::IResult;

You'll note that I and O are parameterized -- while most of the examples in this book will be with &str (i.e. parsing a string); they do not have to be strings; nor do they have to be the same type (consider the simple example where I = &str, and O = u64 -- this parses a string into an unsigned integer).

Let's write our first parser! The simplest parser we can write is one which successfully does nothing.

This parser should take in an &str:

  • Since it is supposed to succeed, we know it will return the Ok Variant.
  • Since it does nothing to our input, the remaining input is the same as the input.
  • Since it doesn't parse anything, it also should just return an empty string.

use nom::IResult;
use std::error::Error;

pub fn do_nothing_parser(input: &str) -> IResult<&str, &str> {
    Ok((input, ""))
}

fn main() -> Result<(), Box<dyn Error>> {
    let (remaining_input, output) = do_nothing_parser("my_input")?;
    assert_eq!(remaining_input, "my_input");
    assert_eq!(output, "");
    Ok(())
}

It's that easy!
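For contrast, here is a rough sketch of the I = &str, O = u64 case mentioned earlier, written by hand without nom so that the (remaining input, output) shape is visible. The ParseResult alias and the leading_u64 name are assumptions made for this illustration; they are not part of nom, and nom's own error type is richer than a String.

```rust
// Hypothetical stand-in for nom's IResult, so this sketch needs no dependencies.
// Ok holds (remaining input, output); Err holds a plain message.
type ParseResult<'a, O> = Result<(&'a str, O), String>;

// Parses the leading ASCII digits of `input` into a u64.
fn leading_u64(input: &str) -> ParseResult<'_, u64> {
    // Byte index of the first non-digit, or the full length if all are digits.
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        return Err(format!("expected a digit at {:?}", input));
    }
    let (digits, remaining) = input.split_at(end);
    // Overflowing numbers are reported as an error rather than panicking.
    let number = digits.parse::<u64>().map_err(|e| e.to_string())?;
    Ok((remaining, number))
}
```

Calling leading_u64("42 apples") yields the number 42 together with the untouched remainder " apples", while an input with no leading digit yields an Err.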

Chapter 2: Tags and Character Classes

The simplest useful parser you can write is one with no special characters: it just matches a string.

In nom, we call a simple collection of bytes a tag. Because these are so common, there already exists a function called tag(). This function returns a parser for a given string.

Warning: nom has multiple different definitions of tag; make sure you use this one for the moment!

extern crate nom;
pub use nom::bytes::complete::tag;

For example, code to parse the string "abc" could be represented as tag("abc").

If you have not programmed in a language where functions are values, the type signature of the tag function might be a surprise:

pub fn tag<T, Input, Error: ParseError<Input>>(
    tag: T
) -> impl Fn(Input) -> IResult<Input, Input, Error> where
    Input: InputTake + Compare<T>,
    T: InputLength + Clone, 

Or, for the case where Input and T are both &str, and simplifying slightly:

fn tag(tag: &str) -> (impl Fn(&str) -> IResult<&str, &str, Error>)

In other words, this function tag returns a function. The function it returns is a parser, taking a &str and returning an IResult. Functions creating parsers and returning them is a common pattern in Nom, so it is useful to call out.
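To make the "function that returns a parser" pattern concrete, the sketch below hand-writes something in the spirit of tag using only the standard library. It is not nom's implementation: the literal function and the ParseResult alias are hypothetical names chosen for this illustration, and the error is just a message.

```rust
// Hypothetical stand-in for nom's IResult<&str, &str>.
type ParseResult<'a> = Result<(&'a str, &'a str), String>;

// Calling `literal("abc")` parses nothing yet. It returns a closure, and that
// closure is the parser, which can be applied to any input later on.
// `expected` is 'static only to keep the lifetimes in this sketch simple.
fn literal(expected: &'static str) -> impl for<'a> Fn(&'a str) -> ParseResult<'a> {
    move |input| match input.strip_prefix(expected) {
        // Success: (what is left over, the part of the input that matched).
        Some(remaining) => Ok((remaining, &input[..expected.len()])),
        None => Err(format!("expected {:?}", expected)),
    }
}
```

With this sketch, literal("abc") builds the parser and literal("abc")("abcWorld") applies it in one expression, which is the same two-step shape as tag("abc")(input).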

Below, we have implemented a function that uses tag.

extern crate nom;
pub use nom::bytes::complete::tag;
pub use nom::IResult;
use std::error::Error;

fn parse_input(input: &str) -> IResult<&str, &str> {
    //  note that this is really creating a function, the parser for abc
    //  vvvvv 
    //         which is then called here, returning an IResult<&str, &str>
    //         vvvvv
    tag("abc")(input)
}

fn main() -> Result<(), Box<dyn Error>> {
    let (leftover_input, output) = parse_input("abcWorld")?;
    assert_eq!(leftover_input, "World");
    assert_eq!(output, "abc");

    assert!(parse_input("defWorld").is_err());
    Ok(())
}

If you'd like to, you can also check tags without case-sensitivity with the tag_no_case function.

Character Classes

Tags are incredibly useful, but they are also incredibly restrictive. The other end of Nom's functionality is pre-written parsers that allow us to accept any of a group of characters, rather than just accepting characters in a defined sequence.

Here is a selection of them:

  • alpha0: Recognizes zero or more lowercase and uppercase alphabetic characters: /[a-zA-Z]/. alpha1 does the same but requires at least one character
  • alphanumeric0: Recognizes zero or more numerical and alphabetic characters: /[0-9a-zA-Z]/. alphanumeric1 does the same but requires at least one character
  • digit0: Recognizes zero or more numerical characters: /[0-9]/. digit1 does the same but requires at least one character
  • multispace0: Recognizes zero or more spaces, tabs, carriage returns and line feeds. multispace1 does the same but requires at least one character
  • space0: Recognizes zero or more spaces and tabs. space1 does the same but requires at least one character
  • line_ending: Recognizes an end of line (both \n and \r\n)
  • newline: Matches a newline character \n
  • tab: Matches a tab character \t

We can use these in the same way as any other parser:

extern crate nom;
pub use nom::IResult;
use std::error::Error;
pub use nom::character::complete::alpha0;

fn parser(input: &str) -> IResult<&str, &str> {
    alpha0(input)
}

fn main() -> Result<(), Box<dyn Error>> {
    let (remaining, letters) = parser("abc123")?;
    assert_eq!(remaining, "123");
    assert_eq!(letters, "abc");
    
    Ok(())
}

One important note is that, due to the type signature of these functions, it is generally best to use them within a function that returns an IResult.

If you don't, some of the information around the type of the tag function must be manually specified, which can lead to verbose code or confusing errors.

Chapter 3: Alternatives and Composition

In the last chapter, we saw how to create simple parsers using the tag function; and some of Nom's prebuilt parsers.

In this chapter, we explore two other widely used features of Nom: alternatives and composition.

Alternatives

Sometimes, we might want to choose between two parsers; and we're happy with either being used.

Nom gives us this ability through the alt() combinator.

use nom::branch::alt;

The alt() combinator will execute each parser in a tuple until it finds one that does not error. If all of them error, then by default you are given the error from the last parser.

We can see a basic example of alt() below.

extern crate nom;
use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::IResult;
use std::error::Error;

fn parse_abc_or_def(input: &str) -> IResult<&str, &str> {
    alt((
        tag("abc"),
        tag("def")
    ))(input)
}

fn main() -> Result<(), Box<dyn Error>> {
    let (leftover_input, output) = parse_abc_or_def("abcWorld")?;
    assert_eq!(leftover_input, "World");
    assert_eq!(output, "abc");

    assert!(parse_abc_or_def("ghiWorld").is_err());
    Ok(())
}
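To show the try-in-order behaviour without any library machinery, here is a hand-written sketch of a two-way alternative using only the standard library. It is not how nom implements alt; the names either, literal and ParseResult are assumptions made up for this illustration.

```rust
// Hypothetical stand-in for nom's IResult<&str, &str>.
type ParseResult<'a> = Result<(&'a str, &'a str), String>;

// A tiny tag-like helper, so the sketch is self-contained.
fn literal(expected: &'static str) -> impl for<'a> Fn(&'a str) -> ParseResult<'a> {
    move |input| match input.strip_prefix(expected) {
        Some(remaining) => Ok((remaining, &input[..expected.len()])),
        None => Err(format!("expected {:?}", expected)),
    }
}

// Tries `first`; if it fails, tries `second` on the same, untouched input.
fn either<P, Q>(first: P, second: Q) -> impl for<'a> Fn(&'a str) -> ParseResult<'a>
where
    P: for<'a> Fn(&'a str) -> ParseResult<'a>,
    Q: for<'a> Fn(&'a str) -> ParseResult<'a>,
{
    move |input| match first(input) {
        Ok(success) => Ok(success),
        // If `second` also fails, its error (the last one) is what we return.
        Err(_) => second(input),
    }
}
```

One consequence of trying parsers in order is that the first success wins, even when a later alternative would have matched more: either(literal("ab"), literal("abc")) applied to "abcd" stops after "ab".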

Composition

Now that we can create more interesting parsers, we can compose them together. The simplest way to do this is just to evaluate them in sequence:

extern crate nom;
use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::IResult;
use std::error::Error;

fn parse_abc(input: &str) -> IResult<&str, &str> {
    tag("abc")(input)
}
fn parse_def_or_ghi(input: &str) -> IResult<&str, &str> {
    alt((
        tag("def"),
        tag("ghi")
    ))(input)
}

fn main() -> Result<(), Box<dyn Error>> {
    let input = "abcghi";
    let (remainder, abc) = parse_abc(input)?;
    let (remainder, def_or_ghi) = parse_def_or_ghi(remainder)?;
    println!("first parsed: {abc}; then parsed: {def_or_ghi};");
    
    Ok(())
}

Composing tags is such a common requirement that, in fact, Nom has a few built in combinators to do it. The simplest of these is tuple(). The tuple() combinator takes a tuple of parsers, and either returns Ok with a tuple of all of their successful parses, or it returns the Err of the first failed parser.

use nom::sequence::tuple;

extern crate nom;
use nom::branch::alt;
use nom::sequence::tuple;
use nom::bytes::complete::tag_no_case;
use nom::IResult;
use std::error::Error;

fn parse_base(input: &str) -> IResult<&str, &str> {
    alt((
        tag_no_case("a"),
        tag_no_case("t"),
        tag_no_case("c"),
        tag_no_case("g")
    ))(input)
}

fn parse_pair(input: &str) -> IResult<&str, (&str, &str)> {
    // the many_m_n combinator might also be appropriate here.
    tuple((
        parse_base,
        parse_base,
    ))(input)
}

fn main() -> Result<(), Box<dyn Error>> {
    let (remaining, parsed) = parse_pair("aTcG")?;
    assert_eq!(parsed, ("a", "T"));
    assert_eq!(remaining, "cG");
 
    assert!(parse_pair("Dct").is_err());

    Ok(())
}

Extra Nom Tools

After using alt() and tuple(), you might also be interested in a few other parsers that do similar things:

| combinator | usage | input | output | comment |
|---|---|---|---|---|
| delimited | delimited(char('('), take(2), char(')')) | "(ab)cd" | Ok(("cd", "ab")) | |
| preceded | preceded(tag("ab"), tag("XY")) | "abXYZ" | Ok(("Z", "XY")) | |
| terminated | terminated(tag("ab"), tag("XY")) | "abXYZ" | Ok(("Z", "ab")) | |
| pair | pair(tag("ab"), tag("XY")) | "abXYZ" | Ok(("Z", ("ab", "XY"))) | |
| separated_pair | separated_pair(tag("hello"), char(','), tag("world")) | "hello,world!" | Ok(("!", ("hello", "world"))) | |

Chapter 4: Parsers With Custom Return Types

So far, we have seen mostly functions that take an &str, and return an IResult<&str, &str>. Splitting strings into smaller strings is certainly useful, but it's not the only thing Nom is capable of!

A useful operation when parsing is to convert between types; for example parsing from &str to another primitive, like bool.

All we need to do for our parser to return a different type is to change the second type parameter of IResult to the desired return type. For example, to return a bool, return an IResult<&str, bool>.

Recall that the first type parameter of the IResult is the input type, so even if you're returning something different, if your input is a &str, the first type argument of IResult should be also.

Until you have read the chapter on Errors, we strongly suggest avoiding the use of parsers built into Rust (like str::parse), as they require special handling to work well with Nom.

That said, one Nom-native way of doing a type conversion is to use the value combinator to convert from a successful parse to a particular value.

The following code converts from a string containing "true" or "false", to the corresponding bool.

extern crate nom;
use std::error::Error;
use nom::IResult;
use nom::bytes::complete::tag;
use nom::combinator::value;
use nom::branch::alt;

fn parse_bool(input: &str) -> IResult<&str, bool> {
    // either, parse `"true"` -> `true`; `"false"` -> `false`, or error.
    alt((
      value(true, tag("true")),
      value(false, tag("false")),
    ))(input)
}

fn main() -> Result<(), Box<dyn Error>> {
    // Parses the `"true"` out.
    let (remaining, parsed) = parse_bool("true|false")?;
    assert_eq!(parsed, true);
    assert_eq!(remaining, "|false");
   
    // If we forget about the "|", we get an error.
    let parsing_error = parse_bool(remaining);
    assert!(parsing_error.is_err());
    
    // Skipping the first byte gives us `false`!
    let (remaining, parsed) = parse_bool(&remaining[1..])?;
    assert_eq!(parsed, false);
    assert_eq!(remaining, "");
    
    

    Ok(())
}

Nom's in-built parser functions

Nom has a wide array of parsers built in. Here is a list of parsers which recognize specific characters.

Some of them we have seen before in Chapter 2, but now we also can try out the parsers that return different types, like i32. An example of this parser is shown in the next section.

Building a More Complex Example

A more complex example of parsing custom types might be parsing a 2D coordinate.

Let us try to figure out how to design this.

  • We know that we want to take a string, like "(3, -2)", and convert into a Coordinate struct.
  • We can split this into three parts:
(vvvvvvvvvvvvv) # The outer brackets.
  vvvv , vvvv   # The comma, separating values.
    3     -2    # The actual integers.
  • So, we will need three parsers, to deal with this:

    1. A parser for integers, which will deal with the raw numbers.
    2. A parser for a comma-separated pair, which will split it up into integers.
    3. A parser for the outer brackets.
  • We can see below how we achieve this:

extern crate nom;
use std::error::Error;
use nom::IResult;
use nom::bytes::complete::tag;
use nom::sequence::{separated_pair, delimited};

// This is the type we will parse into.
#[derive(Debug,PartialEq)]
pub struct Coordinate {
  pub x:   i32,
  pub y:   i32,
}

// 1. Nom has an in-built i32 parser.
use nom::character::complete::i32;

// 2. Use the `separated_pair` parser to combine two parsers (in this case,
//    both `i32`), ignoring something in-between.
fn parse_integer_pair(input: &str) -> IResult<&str, (i32, i32)> {
    separated_pair(
        i32,
        tag(", "),
        i32
    )(input)
}

// 3. Use the `delimited` parser to apply a parser, ignoring the results
//    of two surrounding parsers.
fn parse_coordinate(input: &str) -> IResult<&str, Coordinate> {
    let (remaining, (x, y)) = delimited(
        tag("("),
        parse_integer_pair,
        tag(")")
    )(input)?;
    
    // Note: we could construct this by implementing `From` on `Coordinate`.
    // We don't, just so it's obvious what's happening.
    Ok((remaining, Coordinate {x, y}))
    
}

fn main() -> Result<(), Box<dyn Error>> {
    let (_, parsed) = parse_coordinate("(3, 5)")?;
    assert_eq!(parsed, Coordinate {x: 3, y: 5});
   
    let (_, parsed) = parse_coordinate("(2, -4)")?;
    assert_eq!(parsed, Coordinate {x: 2, y: -4});
    
    let parsing_error = parse_coordinate("(3,)");
    assert!(parsing_error.is_err());
    
    let parsing_error = parse_coordinate("(,3)");
    assert!(parsing_error.is_err());
    
    let parsing_error = parse_coordinate("Ferris");
    assert!(parsing_error.is_err());
    

    Ok(())
}

As an exercise, you might want to explore how to make this parser deal gracefully with whitespace in the input.

Chapter 5: Repeating with Predicates

Just as the humble while loop unlocks many useful features when programming, repeating a parser multiple times can be incredibly useful in Nom.

There are, however, two ways of including repeating functionality into Nom -- parsers which are governed by a predicate; and combinators which repeat a parser.

Parsers which use a predicate

A predicate is a function which returns a boolean value (i.e. given some input, it returns true or false). These are incredibly common when parsing -- for instance, a predicate is_vowel might decide whether a character is an English vowel (a, e, i, o or u).

These can be used to make parsers that Nom hasn't built in. For instance, combining is_vowel with take_while gives a parser that takes as many vowels as possible.

There are a few different categories of predicate parsers that are worth mentioning:

  • For bytes, there are three different categories of parser: take_till, take_until, and take_while. take_till will continue consuming input until its input meets the predicate. take_while will continue consuming input until its input does not meet the predicate. take_until looks a lot like a predicate parser, but simply consumes until the first occurrence of the pattern of bytes.
  • Some parsers have a "twin" with a 1 at the end of their name -- for example, take_while has take_while1. The difference between them is that take_while could return an empty slice if the first byte does not satisfy the predicate, whereas take_while1 returns an error in that case.
  • As a special case, take_while_m_n is like take_while, but guarantees that it will consume at least m bytes, and no more than n bytes.

extern crate nom;
use std::error::Error;
use nom::IResult;
use nom::bytes::complete::{take_until, take_while};
use nom::sequence::terminated;

fn parse_sentence(input: &str) -> IResult<&str, &str> {
    terminated(take_until("."), take_while(|c| c == '.' || c == ' '))(input)
}

fn main() -> Result<(), Box<dyn Error>> {
    let (remaining, parsed) = parse_sentence("I am Tom. I write Rust.")?;
    assert_eq!(parsed, "I am Tom");
    assert_eq!(remaining, "I write Rust.");
   
    let parsing_error = parse_sentence("Not a sentence (no period at the end)");
    assert!(parsing_error.is_err());
    

    Ok(())
}

For detailed examples, see their documentation, shown below:

| combinator | usage | input | output | comment |
|---|---|---|---|---|
| take_while | take_while(is_alphabetic) | "abc123" | Ok(("123", "abc")) | Returns the longest list of bytes for which the provided function returns true. take_while1 does the same, but must return at least one character. take_while_m_n does the same, but must return between m and n characters. |
| take_till | take_till(is_alphabetic) | "123abc" | Ok(("abc", "123")) | Returns the longest list of bytes or characters until the provided function returns true. take_till1 does the same, but must return at least one character. This is the reverse behaviour from take_while: take_till(f) is equivalent to take_while(\|c\| !f(c)) |
| take_until | take_until("world") | "Hello world" | Ok(("world", "Hello ")) | Returns the longest list of bytes or characters until the provided tag is found. take_until1 does the same, but must return at least one character |

Chapter 6: Repeating Parsers

A single parser which repeats a predicate is useful, but more useful still is a combinator that repeats a parser. Nom has multiple combinators which operate on this principle; the most obvious of which is many0, which applies a parser as many times as possible; and returns a vector of the results of those parses. Here is an example:

extern crate nom;
use std::error::Error;
use nom::IResult;
use nom::multi::many0;
use nom::bytes::complete::tag;

fn parser(s: &str) -> IResult<&str, Vec<&str>> {
  many0(tag("abc"))(s)
}

fn main() {
    assert_eq!(parser("abcabc"), Ok(("", vec!["abc", "abc"])));
    assert_eq!(parser("abc123"), Ok(("123", vec!["abc"])));
    assert_eq!(parser("123123"), Ok(("123123", vec![])));
    assert_eq!(parser(""), Ok(("", vec![])));
}
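Conceptually, a combinator like many0 is a loop around another parser. The standard-library-only sketch below illustrates that idea; it is not nom's implementation, and the names repeat0, abc and ParseResult are invented for this illustration. One known difference: when the inner parser succeeds without consuming input, this sketch simply stops, whereas nom's many0 reports an error to avoid looping forever.

```rust
// Hypothetical stand-in for nom's IResult with a String error.
type ParseResult<'a, O> = Result<(&'a str, O), String>;

// A tiny parser for the literal "abc".
fn abc(input: &str) -> ParseResult<'_, &str> {
    match input.strip_prefix("abc") {
        Some(rest) => Ok((rest, &input[..3])),
        None => Err(format!("expected \"abc\" at {:?}", input)),
    }
}

// Applies `parser` until it fails, collecting every output in a Vec.
fn repeat0<'a, O>(
    parser: impl Fn(&'a str) -> ParseResult<'a, O>,
    mut input: &'a str,
) -> ParseResult<'a, Vec<O>> {
    let mut outputs = Vec::new();
    // A failure of the inner parser is not an error here; it ends the loop.
    while let Ok((remaining, output)) = parser(input) {
        // Stop if nothing was consumed, otherwise we would never terminate.
        if remaining.len() == input.len() {
            break;
        }
        outputs.push(output);
        input = remaining;
    }
    Ok((input, outputs))
}
```

This also explains why a zero-or-more repetition can return an empty Vec without failing: the loop body just never runs.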

There are many different parsers to choose from:

| combinator | usage | input | output | comment |
|---|---|---|---|---|
| count | count(take(2), 3) | "abcdefgh" | Ok(("gh", vec!["ab", "cd", "ef"])) | Applies the child parser a specified number of times |
| many0 | many0(tag("ab")) | "abababc" | Ok(("c", vec!["ab", "ab", "ab"])) | Applies the parser 0 or more times and returns the list of results in a Vec. many1 does the same operation but must return at least one element |
| many_m_n | many_m_n(1, 3, tag("ab")) | "ababc" | Ok(("c", vec!["ab", "ab"])) | Applies the parser between m and n times (n included) and returns the list of results in a Vec |
| many_till | many_till(tag("ab"), tag("ef")) | "ababefg" | Ok(("g", (vec!["ab", "ab"], "ef"))) | Applies the first parser until the second applies. Returns a tuple containing the list of results from the first in a Vec and the result of the second |
| separated_list0 | separated_list0(tag(","), tag("ab")) | "ab,ab,ab." | Ok((".", vec!["ab", "ab", "ab"])) | separated_list1 works like separated_list0 but must return at least one element |
| fold_many0 | fold_many0(be_u8, \|\| 0, \|acc, item\| acc + item) | [1, 2, 3] | Ok(([], 6)) | Applies the parser 0 or more times and folds the list of return values. The fold_many1 version must apply the child parser at least one time |
| fold_many_m_n | fold_many_m_n(1, 2, be_u8, \|\| 0, \|acc, item\| acc + item) | [1, 2, 3] | Ok(([3], 3)) | Applies the parser between m and n times (n included) and folds the list of return values |
| length_count | length_count(number, tag("ab")) | "2ababab" | Ok(("ab", vec!["ab", "ab"])) | Gets a number from the first parser, then applies the second parser that many times |

Chapter 7: Using Errors from Outside Nom

Nom has other documentation about errors, so in place of this chapter, read this page.

Particular Notes

  • It's particularly useful to use the map_res function. It allows you to convert an external error to a Nom error. For an example, see the Nom example on the front page.

To Be Completed
