Definitions

In this document, we define two languages using formal mathematical notations. The first language is the Regular Expression, and the second language is the Extended Backus-Naur Form.

To distinguish between terminal and non-terminal symbols in this document, we use lowercase letters or quotation marks for terminal symbols and uppercase letters for non-terminal symbols.

We use regular expressions for defining patterns for tokens (terminal symbols) in the EBNF language itself as well as any input languages desribed in EBNF. We need a regular expression parser for building a lexer for any EBNF parser.

Regular Expression

Regular expression, as the name suggests, is a regular language or a Type-3 language. Since Type-2 languages are a superset of Type-3 languages, we can define the regular expression language using the EBNF notation as well (we will define the EBNF notation formally in the next section).

Grammar

regex            = [ "^" ] expr
expr             = subexpr [ "|" expr ]
subexpr          = 
subexpr_item     = anchor | group | match
anchor           = "$"
group            = "(" expr ")" [ quantifier ]
match            = match_item [ quantifier ]
match_item       = any_char | single_char | char_class | ascii_char_class | char_group
char_group       = "[" [ "^" ]  "]"
char_group_item  = ascii_char_class | char_class | char_range | single_char
char_range       = char_in_range "-" char_in_range
char_in_range    = unicode_char | ascii_char | char
quantifier       = repetition [ "?" ]
repetition       = "?" | "*" | "+" | range
range            = "{" num [ upper_bound ] "}"
upper_bound      = "," [ num ]
any_char         = "."
single_char      = unicode_char | ascii_char | escaped_char | unescaped_char
char_class       = "\d" | "\D" | "\s" | "\S" | "\w" | "\W"
ascii_char_class = "[:blank:]" | "[:space:]" | "[:digit:]" | "[:xdigit:]" | "[:upper:]" | "[:lower:]" | "[:alpha:]" | "[:alnum:]" | "[:word:]" | "[:ascii:]"

ascii_char       = "\x" hex_digit{2}
unicode_char     = "\x" hex_digit{4,8}
escaped_char     = "\" ( "\" | "|" | "." | "?" | "*" | "+" | "(" | ")" | "[" | "]" | "{" | "}" | "$" )
unescaped_char   = # all characters excluding the escaped ones
char             = # all characters

num              = 
letters          = 
digit            = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
hex_digit        = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" | "A" | "B" | "C" | "D" | "E" | "F"
letter           = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
                 | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"

Extended Backus-Naur Form

The formal language we use for describing context-free grammars is a subset of EBNF notation.

Alphabet

"  /  =  |  (  )  [  ]  {  }
0  1  2  3  4  5  6  7  8  9
A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
!  #  $  %  &  '  *  +  ,  -  .  :  ;  <  >  ?  @  \  ^  _  `  ~

Tokens

Token	Pattern	Description
`DEF`	`"="`	Symbol for rule definition
`ALT`	`"\|"`	Separator symbol for alternation
`LPAREN`	`"("`	Start symbol for grouping
`RPAREN`	`")"`	End symbol for grouping
`LBRACK`	`"["`	Start symbol for optional
`RBRACK`	`"]"`	End symbol for optional
`LBRACE`	`"{"`	Start symbol for repetition (Kleene Star)
`RBRACE`	`"}"`	End symbol for repetition (Kleene Star)
`LLBRACE`	`"`	Start symbol for repetition (Kleene Plus)
`RRBRACE`	`"`	End symbol for repetition (Kleene Plus)
`GRAMMER`	`"grammar"`	Keyword for declaring grammar name
`IDENT`	`/[a-z][0-9a-z_]*/`	Regex for grammar name and non-terminal symbols
`TOKEN`	`/[A-Z][0-9A-Z_]*/`	Regex for declaring and referencing terminal symbols
`STRING`	`/"([\x21\x23-\x5B\x5D-\x7E]\|\\[\x21-\x7E])+"/`	Regex for defining string patterns
`REGEX`	`/\/([\x20-\x2E\x30-\x5B\x5D-\x7E]\|\\[\x20-\x7E])*\//`	Regex for defining regular expression patterns

Tokens can be declared and defined explicitly or implicitly.

Grammar

grammar = name { decl }
name    = "grammar" IDENT
decl    = token | rule
token   = TOKEN "=" ( STRING | REGEX )
rule    = lhs "=" rhs
lhs     = nonterm
rhs     = nonterm | term | "(" rhs ")" | "[" rhs "]" | "{" rhs "}" | " rhs " | rhs rhs | rhs "|" rhs
nonterm = IDENT
term    = TOKEN | STRING

Precedences

TBD

Semantic

TBD