Strings: Basic notation

sets (basic notation)

Strings play a very prominent role in linguistics and language technology. A string is a sequence of symbols, like nfm, wendigo, or 105§/. In contrast to sets, strings are ordered and can contain duplicates.

The sets \(\left \{ m,a,d \right \}\), \(\left \{ d,a,m \right \}\), and \(\left \{ a,d,a,m \right \}\) are equivalent, but for strings \(\mathit{mad} \neq \mathit{dam} \neq \mathit{adam}\).

Fill in \(=\) or \(\neq\) as appropriate for each pair of strings below.

\(\mathit{abba}\) _ \(\mathit{ABBA}\)
\(10\) _ \(5 + 5\)
\(\left \{ m,a,d \right \} \_ \left \{ d,a,m \right \}\)

Caution: \(\{\) and \(\}\) can be symbols just like \(m\), \(a\), or \(d\).

Alphabet

When talking about strings, one usually fixes a finite set of symbols over which the strings are built. This is called an alphabet. It is common but not necessary to require alphabets to contain at least one symbol. Alphabets are often given labels like \(\Sigma\) or \(\Omega\). A string over alphabet \(\Sigma\) is also called a \(\Sigma\)-string.

The set of Latin characters (A-Z, a-z) is an alphabet that’s familiar to all of you. Strings over it include:

string
alphabet
aaaaaaa
c

The set of Arabic digits is an alphabet with symbols 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Every natural number (0, 1, 2, …), when represented in decimal as usual, is a string over this alphabet. But not every string over this alphabet is a number of the decimal system. For instance, 000134095 is not a valid number, although 134095 is.

The set \(\mathbb{N}\) of all natural numbers (0, 1, 2, and so on) is not a valid alphabet because it isn’t finite.

For each one of the following, say whether it is a valid alphabet. Justify your answer.

\(\left \{ a \right \}\)
\(\left \{ 0,1 \right \}\)
the set of all English words that are spelled with at most 5 characters
the set of all natural numbers less than 1000
the set of the nucleobases of DNA: adenine, cytosine, guanine, thymine

String length

The length of a \(\Sigma\)-string \(s\) is indicated by \(\left | s \right |\). For instance, \(\left | \text{ant} \right | = 3\), \(\left | 0770001 \right | = 7\), and \(\left | \text{a} \right | = 1\). The set of all strings over \(\Sigma\) whose length is exactly \(n\) is denoted by \(\Sigma^n\).

Let \(\Sigma \mathrel{\mathop:}=\left \{ a,b \right \}\). Then \(\Sigma^3\) contains all of the following strings, and only those:

\(\mathit{aaa}\)
\(\mathit{aab}\)
\(\mathit{aba}\)
\(\mathit{abb}\)
\(\mathit{baa}\)
\(\mathit{bab}\)
\(\mathit{bba}\)
\(\mathit{bbb}\)

The size of \(\Sigma^n\) is always fixed. If \(\Sigma\) has \(m\) members, then \(\Sigma^n\) contains \(m^n\) strings.

In the previous example, \(\Sigma\) contains two symbols, so \(\Sigma^n\) should consist of \(2^3 = 8\) distinct strings. That’s exactly what we found.

Which one of the following are members of \(\left \{ a,b \right \}^4\), i.e. \(\Sigma^4\) where \(\Sigma\) contains \(a\), \(b\), and nothing else?

\(\mathit{aaab}\)
\(\mathit{aba}\)
\(\mathit{aaaaa}\)
\(\mathit{b}\)
\(\mathit{abca}\)

List all members of \(\left \{ k,o,z \right \}^2\).

Very often expressions like \(a^n\) are used as a shorthand for \(\left \{ a \right \}^n\).

The expression \(\mathit{b a^5 c^3 d}\) is a shorthand for \(\mathit{baaaaacccd}\).

Write each one of the following in a more compact fashion using exponents.

ABBA
loool
aardvark

Infinite string sets over \(\Sigma\)

Since alphabets must be finite, \(\Sigma^n\) is necessarily finite for any alphabet \(\Sigma\) and \(n \geq 0\). But the set of all strings over \(\Sigma\) is infinite.

Let \(\Sigma \mathrel{\mathop:}=\left \{ a \right \}\). Then \(a\) is a string over \(\Sigma\), and so are \(\mathit{aa}\), \(\mathit{aaa}\), \(\mathit{aaaa}\), and so on. This enumeration continues indefinitely, so there must be infinitely many distinct strings over \(\Sigma\).

Two infinite string sets are commonly defined over \(\Sigma\). They are \(\Sigma^*\) and \(\Sigma^+\), respectively. The set \(\Sigma^*\) contains all strings over \(\Sigma\), whereas \(\Sigma^+\) contains all strings whose length is at least \(1\). The only difference between the two is that \(\Sigma^*\) also contains the empty string \(\varepsilon\). The empty string is the string counterpart of the number 0: it represents nothing. In fact, \(\varepsilon\) is the only string whose length is 0.

Let \(\Sigma = \left \{ a,b \right \}\). Then \(\Sigma^*\) contains

\(\varepsilon\),
\(\mathit{a}\)
\(\mathit{b}\)
\(\mathit{aa}\)
\(\mathit{ab}\)
\(\mathit{ba}\)
\(\mathit{bb}\)
\(\mathit{aaa}\)
\(\mathit{aab}\)
\(\mathit{aba}\)
\(\mathit{abb}\)
and so on

All these strings are also members of \(\Sigma^+\), except \(\varepsilon\).

\(\Sigma^*\) is also called the Kleene closure, named after Stephen C. Kleene.

Here’s a little bit of background to make it easier for you to remember the difference between \(\Sigma^*\) and \(\Sigma^+\). As you might know from search engines, the Kleene star * is sometimes used as a wildcard that matches everything. So \(\Sigma^*\) can be translated as “every string built over \(\Sigma\)”. On the other hand \(\Sigma^+\) only contains those strings whose length is at least 1, or in other words, whose length is positive. And \(+\) is a common abbreviation for positive (e.g. with batteries).

Enumerate the five shortest members of \(\left \{ a \right \}^*\).

Concatenation

Given two \(\Sigma\)-strings \(u\) and \(v\), their concatenation \(u \cdot v\) is the result of “glueing” the left end of \(v\) to the right end of \(u\).

Here are a few examples of concatenation:

\(\mathit{math} \cdot\mathit{ematics} = \mathit{mathematics}\),
\(2000 \cdot 18 = 200018\),
\(\text{Thomas} \cdot\text{Graf} = \text{ThomasGraf}\).

Just like addition, concatenation is associative. This means that if we carry out multiple concatenations, it does not matter in what order we resolve the concatenation steps: \(u \cdot(v \cdot w) = (u \cdot v) \cdot w = u \cdot v \cdot w\).

It does not matter in which order we combine is with concatenation and associative below:

\((\mathit{concatenation} \cdot\mathit{is}) \cdot\mathit{associative} = \mathit{concatenation is} \cdot\mathit{associative} = \mathit{concatenation is associative}\)
\(\mathit{concatenation} \cdot(\mathit{is} \cdot\mathit{associative}) = \mathit{concatenation} \cdot\mathit{is associative} = \mathit{concatenation is associative}\)

Even though concatenation is associative, it is not commutative. That is to say, \(u \cdot v\) and \(v \cdot u\) are not necessarily the same. They might be, but it’s not guaranteed.

Let \(u \mathrel{\mathop:}=\text{house}\) and \(v \mathrel{\mathop:}=\text{boat}\). Then \(u \cdot v\) is houseboat, whereas \(v \cdot u\) is boathouse. Those are not the same strings (and they also happen to mean completely different things).

Note the special behavior of the empty string: \(u \cdot\varepsilon= \varepsilon\cdot u = u\). This is fairly intuitive because adding a string of length 0 to \(u\) should not change the length of \(u\), which means that \(u\) does not change at all — just like adding 0 to a number does not change that number.

Sometimes concatenation is not explicitly indicated, so that instead of \(u \cdot v\) one may simply write \(\mathit{uv}\).

Give an example of distinct strings \(u\) and \(v\) such that \(\mathit{uv} = \mathit{vu}\) and neither \(u\) nor \(v\) is the empty string.

Is the following true or false? If \(u \neq v\), then \(\mathit{uv} \neq \mathit{vu}\)?

Is the following true or false? If \(\mathit{uv} \neq \mathit{vu}\), then \(u \neq v\)?

Recap

A string is a sequence of symbols drawn from some alphabet.
A \(\Sigma\)-string is a string over alphabet \(\Sigma\).
The length of string \(s\) is denoted by \(\left | s \right |\).
The empty string \(\varepsilon\) is the unique string of length \(0\).
\(\Sigma^n\) is the set of all \(\Sigma\)-strings \(s\) such that \(\left | s \right | = n\).
\(a^n\) is a shorthand for \(\left \{ a \right \}^n\).
The Kleene closure \(\Sigma^*\) is the set of all \(\Sigma\)-strings (including \(\varepsilon\)).
The positive closure \(\Sigma^+\) contains all \(\Sigma\)-strings except \(\varepsilon\).
Concatenation of strings \(u\) and \(v\) is denoted by \(u \cdot v\) or simply \(\mathit{uv}\).