Skip to the content.

Definitions

In this document, we define two languages using formal mathematical notations. The first language is the Regular Expression, and the second language is the Extended Backus-Naur Form.

To distinguish between terminal and non-terminal symbols in this document, we use lowercase letters or quotation marks for terminal symbols and uppercase letters for non-terminal symbols.

We use regular expressions for defining patterns for tokens (terminal symbols) in the EBNF language itself as well as any input languages desribed in EBNF. We need a regular expression parser for building a lexer for any EBNF parser.

Regular Expression

Regular expression, as the name suggests, is a regular language or a Type-3 language. Since Type-2 languages are a superset of Type-3 languages, we can define the regular expression language using the EBNF notation as well (we will define the EBNF notation formally in the next section).

Grammar

regex            = [ "^" ] expr
expr             = subexpr [ "|" expr ]
subexpr          = 
subexpr_item     = anchor | group | match
anchor           = "$"
group            = "(" expr ")" [ quantifier ]
match            = match_item [ quantifier ]
match_item       = any_char | single_char | char_class | ascii_char_class | char_group
char_group       = "[" [ "^" ]  "]"
char_group_item  = ascii_char_class | char_class | char_range | single_char
char_range       = char_in_range "-" char_in_range
char_in_range    = unicode_char | ascii_char | char
quantifier       = repetition [ "?" ]
repetition       = "?" | "*" | "+" | range
range            = "{" num [ upper_bound ] "}"
upper_bound      = "," [ num ]
any_char         = "."
single_char      = unicode_char | ascii_char | escaped_char | unescaped_char
char_class       = "\d" | "\D" | "\s" | "\S" | "\w" | "\W"
ascii_char_class = "[:blank:]" | "[:space:]" | "[:digit:]" | "[:xdigit:]" | "[:upper:]" | "[:lower:]" | "[:alpha:]" | "[:alnum:]" | "[:word:]" | "[:ascii:]"

ascii_char       = "\x" hex_digit{2}
unicode_char     = "\x" hex_digit{4,8}
escaped_char     = "\" ( "\" | "|" | "." | "?" | "*" | "+" | "(" | ")" | "[" | "]" | "{" | "}" | "$" )
unescaped_char   = # all characters excluding the escaped ones
char             = # all characters

num              = 
letters          = 
digit            = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
hex_digit        = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" | "A" | "B" | "C" | "D" | "E" | "F"
letter           = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
                 | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"

Extended Backus-Naur Form

The formal language we use for describing context-free grammars is a subset of EBNF notation.

Alphabet

"  /  =  |  (  )  [  ]  {  }
0  1  2  3  4  5  6  7  8  9
A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
!  #  $  %  &  '  *  +  ,  -  .  :  ;  <  >  ?  @  \  ^  _  `  ~

Tokens

Token Pattern Description
DEF "=" Symbol for rule definition
ALT "|" Separator symbol for alternation
LPAREN "(" Start symbol for grouping
RPAREN ")" End symbol for grouping
LBRACK "[" Start symbol for optional
RBRACK "]" End symbol for optional
LBRACE "{" Start symbol for repetition (Kleene Star)
RBRACE "}" End symbol for repetition (Kleene Star)
LLBRACE " Start symbol for repetition (Kleene Plus)
RRBRACE " End symbol for repetition (Kleene Plus)
GRAMMER "grammar" Keyword for declaring grammar name
IDENT /[a-z][0-9a-z_]*/ Regex for grammar name and non-terminal symbols
TOKEN /[A-Z][0-9A-Z_]*/ Regex for declaring and referencing terminal symbols
STRING /"([\x21\x23-\x5B\x5D-\x7E]|\\[\x21-\x7E])+"/ Regex for defining string patterns
REGEX /\/([\x20-\x2E\x30-\x5B\x5D-\x7E]|\\[\x20-\x7E])*\// Regex for defining regular expression patterns

Tokens can be declared and defined explicitly or implicitly.

Grammar

grammar = name { decl }
name    = "grammar" IDENT
decl    = token | rule
token   = TOKEN "=" ( STRING | REGEX )
rule    = lhs "=" rhs
lhs     = nonterm
rhs     = nonterm | term | "(" rhs ")" | "[" rhs "]" | "{" rhs "}" | " rhs " | rhs rhs | rhs "|" rhs
nonterm = IDENT
term    = TOKEN | STRING

Precedences

TBD

Semantic

TBD