Basic Formal Language Theory

Mort Yao

Definition 1. (Alphabet) An alphabet \(\Sigma\) is a well-defined set of symbols.

Example. \(\{\texttt{0},\texttt{1}\}\), \(\{\texttt{0}, \texttt{1}, \dots, \texttt{9}\}\), \(\{\texttt{a}, \texttt{b}, \dots, \texttt{z}\}\), and \(\{\sqcup\}\) are all valid alphabets.

Definition 2. (String) A string is a sequence of symbols.

Specifically, a string consisting of no symbol (i.e., a sequence of zero length) is called an empty string, denoted as \(\varepsilon\).

Example. \(\texttt{1001}\), \(\texttt{cat}\), and \(\varepsilon\) are all valid strings.

Definition 3. (Language) A language \(L\) over the alphabet \(\Sigma\) is any set of strings made up of symbols from \(\Sigma\) (including the empty string \(\varepsilon\) vacuously, if \(L\) is nonempty).

Specifically, an empty set of strings is called an empty language, denoted as \(\varnothing\). For every \(L \neq \varnothing\), \(\varepsilon \in L\) by definition.

The language of all strings over \(\Sigma\) (including the empty string \(\varepsilon\)) is denoted as \(\Sigma^*\), that is, \[\Sigma^* = \{a^*\ |\ a \in \Sigma\}\]

Example. \(\{\texttt{0},\texttt{1}\}^* = \{b^*\ |\ b \in \{\texttt{0}, \texttt{1}\}\} = \{\varepsilon,\texttt{0},\texttt{1},\texttt{00},\texttt{01},\texttt{10},\texttt{11},\texttt{000},\dots\}\), \(\{\varepsilon, \texttt{0}, \texttt{1}\}\), \(\{\varepsilon\}\), and \(\varnothing\) are all valid languages.

Note that every language \(L\) over \(\Sigma\) (including the empty language \(\varnothing\)) is a subset of \(\Sigma^*\), that is, for every \(L\) over \(\Sigma\), \(L \subseteq \Sigma^* = \{a^*\ |\ a \in \Sigma\}\).

If a language over \(\Sigma\) contains every string of fixed length \(k\), we denote the language as \(\Sigma^k\).

Example. \(\{\texttt{0}, \texttt{1}\}^8\) is the language of all 8-bit binary strings.

Definition 4. (Union) The union of two languages \(L_1\) and \(L_2\) is the language \[L_1 \cup L_2 = \{ w\ |\ w \in L_1 \lor w \in L_2 \}\]

Definition 5. (Intersection) The intersection of two languages \(L_1\) and \(L_2\) is the language \[L_1 \cap L_2 = \{ w\ |\ w \in L_1 \land w \in L_2 \}\]

Definition 6. (Complementation) The complement of a language \(L\) over \(\Sigma\) is the language \[\overline{L} = \Sigma^* \setminus L\]

Definition 7. (Concatenation) The concatenation of two languages \(L_1\) and \(L_2\) is the language \[L_1 \circ L_2 = \{ w_1w_2\ |\ w_1 \in L_1 \land w_2 \in L_2 \}\]

Additionally, we use \(L^k\) to denote the language obtained by concatenating \(L\) to itself \(k\) times.

Definition 8. (Kleene star / Closure) The Kleene star or closure of a language \(L\) is the language \[L^* = \{\varepsilon\} \cup L \cup L^2 \cup L^3 \cup \cdots\]