Table of Contents
Words are the most fundamental building blocks of our language. Although they may look simple on the surface, they are very ingenious devices that pack not only meaning, but also grammatical information. For our purposes, we will say that a word consists of a stem and affixes. Let's look at three simple sentences:
I walk.
John walks.
Jack walked.
All three sentences contain some 'form' of walk. We say that these instances are all inflections of the verb walk. The part of the inflections that is shared (walk) is what we call the stem. The parts that are not common are named affixes. We inflect verbs to indicate tense (present, past, etc.), the person of the verb's subject (first, second, and third), and the number (singular or plural). The affix s in John walks, for instance, tells us (in combination with the subject John) that the verb walk is in present tense, third person singular.
Other types of words have inflections as well. For example, we inflect nouns to distinguish singular and plural:
I saw one duck. I saw two ducks.
Up to this point, we have just seen one kind of affix: one that is glued to the end of the word. There are actually many types of affixes. For now, you should only know about two:
Prefix: appears in front of the stem. For example, unbelievable.
Suffix: appears after the stem. For example, ducks.
Now, with that out of the way, let's get some work done.
Written words consist of characters. We can write down characters in Haskell with single quotes. If you type in a character in ghci, it will simply echo back the character:
Prelude> 'h'
'h'
This is all that ghci does, it evaluates whatever you type. A character evaluates to... a character. We confirm that Haskell agrees with us that this actually a character by asking the type with :type or its shorthand :t:
Prelude> :type 'h'
'h' :: Char
Great. Haskell indeed confirms that 'h' is a character, or in Haskell's words: that 'h' is of type Char. Not all that practical with the small amount of single-lettered words in English though. Rather than a single character, we want a sequence of characters. Not surprisingly, Haskell has a data types to build sequences. The most commonly used sequence is the list. You can have lists of many things: lists of groceries, lists of planets, but also lists of characters. We can make a literal list in Haskell by enumerating its elements, separated by commas and surrounded by square brackets. For instance, the list 1, 2, 3, 4, 5 is written as [1, 2, 3, 4, 5]. Let's try to make a list of characters:
Prelude> ['h','e','l','l','o']
"hello"
Now we are getting somewhere! Let's look at the type of this list:
Prelude> :type ['h','e','l','l','o']
['h','e','l','l','o'] :: [Char]
Its type is [Char], which should be read as 'list of characters'. Such a list of characters is known as a string Of course, writing down words in this manner is not very convenient. Fortunately, as the evaluation of the second to last example already suggests, there is a more convenient notation. We can represent strings by wrapping characters in double quotes:
Prelude>"hello""hello" Prelude>:type "hello""hello" :: [Char]
We will take this opportunity to seriously demolish some words, but all with the noble
cause of learning some commonly-used Haskell list functions. The first function
length returns the length of a list:
Prelude>length "hello"5 Prelude>length [1,2,3]3
To get a better impression of functions, it is often useful to look at its type:
Prelude> :type length
length :: [a] -> Int
That's one heck of a type! Basically, it says 'give me a list of something (denoted by the a between the list brackets), then I will give you an Int'. In these so-called type signatures, letters that are not capitalized are generic, meaning that they can be of some unspecified type. That is, [a] is a list with elements of some type. But: all elements should be of the same type. An Int is an integral number: a positive or negative whole number.
Two other basic list functions are head and
tail. head returns the first element of a
list, tail everything but the first element:
Prelude>head "hello"'h' Prelude>tail "hello""ello"
The type of head is the following:
Prelude> :type head
head :: [a] -> a
Hey, two a's! Equipped with the knowledge we have, we know that
head is a function that takes a list of something, and gives
back something. But there is an additional constraint here: although
a is some type, all a's have to be the
same type. So, applying head to a list of numbers gives a number,
applying head to a list of characters gives a character, etc.
In analogy, the type of tail should now be easy to understand:
Prelude> :type tail
tail :: [a] -> [a]
We apply tail to a list of some type, and get back a list with
the same type.
Finally, the last function for now is reverse. We have to admit
presenting this function with a bit of joy, since it will allow us to write our first
little useful Haskell program. As expected, reverse reverses the
elements of a list:
Prelude> reverse "hello"
"olleh"
Olé! And another one:
Prelude> reverse "level"
"level"
Hold on there! We bumped into a palindrome: a word
that is read the same way, no matter whether it is read forward or backward. Now,
suppose we would like to write our own function to determine whether a word is a
palindrome. We first need to make a slightly more formal definition of a palindrome: a
word is a palindrome if it is equal to its reverse. In Haskell we can compare values
using the == operator:
Prelude>"hello" == "hello"True Prelude>"hello" == "olleh"False
Such a comparison evaluates to True if both values are equal, or
to False in case they are not. True and
False are the only values of the Bool
type. Since reverse also returns a value, nothing holds us from
using it in comparisons:
Prelude>"hello" == reverse "hello"False Prelude>"level" == reverse "level"True
The test that we devised for detecting palindromes seems to work. But it is a lot of typing. Luckily, we can generalize this into a function. Let's replace both words by the symbolic name word (but don't execute this in ghci yet, since it does not know this symbolic name):
word == reverse word
And as a next step, Let's do some magic:
Prelude> let palindrome word = word == reverse word
This defines the function palindrome taking one argument, and
binds this argument to the symbolic name word. To this function we
assign the expression word == reverse word. Play a little with this
function to be convinced that it actually works. Some examples:
Prelude>palindrome "hello"False Prelude>palindrome "level"True Prelude>palindrome "racecar"True
If this function is still a mystery to you, it may be useful to write down the application of the function stepwise for a word that is not a palindrome:
palindrome "hello" palindrome "hello" = "hello" == reverse "hello" palindrome "hello" = "hello" == "olleh" palindrome "hello" = False
and a word that is a palindrome:
palindrome "racecar" palindrome "racecar" = "racecar" == reverse "racecar" palindrome "racecar" = "racecar" == "racecar" palindrome "racecar" = True
Congratulations, you have made your first function, which is in essence a small program!
So far, we have looked at words in isolation. However, in language, words are often combined to form a higher level of meaning representation: a sentence. Provided what we have learned about representing words in Haskell, the step towards representing sentences should be a minor one. We could, for example, represent sentences in the exactly the same way we represented words:
Prelude> "The cat is on the mat."
"The cat is on the mat."
That's fine for a beginning, although not so convenient. Let us see why. Assume we ask
you to give us the first word of a sentence. In the previous section, we learned that
head can be used to get the first element of a list. Let's try
to apply that here:
Prelude> head "The cat is on the mat."
'T'
As you probably expected, that didn't work. We represented a sentence as a list of characters (a string), and hence asking for the first element will give the first character. But wait! What if we represented a sentence as a list of words?
Prelude>["The", "cat", "is", "on", "the", "mat", "."]["The","cat","is","on","the","mat","."] Prelude>:type ["The", "cat", "is", "on", "the", "mat", "."]["The", "cat", "is", "on", "the", "mat", "."] :: [[Char]]
Nifty! We just constructed a list, of a list of characters. Though, you may wonder why we made the punctuation at the end of the sentence a separate "word". Well, this is mostly a pragmatic choice, because gluing this punctuation sign to mat does not really form a word either. Having the period sign separate is more practical for future processing. Hence, formally we say that a sentence consists of tokens, where a token can be a word, a number, and a punctuation sign.
Rinse and repeat:
Prelude> head ["The", "cat", "is", "on", "the", "mat", "."]
"The"
Since a word is also a list, we can apply a function to words as well. For example, we
can get the first character of the first word by applying head, to
the head of a sentence:
Prelude> head (head ["The", "cat", "is", "on", "the", "mat", "."])
'T'
Note that we need parenthesis to force Haskell to evaluate the part in parentheses
first. If we do not enforce this order of evaluation, Haskell will try to evaluate
head head first, which makes no sense. Remember that
head requires a list as its argument, and
head is not a list.
Now that we know how to represent sentence, this is a good time to try to write yet another small program. This time, we will write a function to compute the average token length in a corpus (a collection of texts). Since we did not look at real corpora yet, pick any sentence you like as My Little Corpus™. The authors will use "Oh, no, flying pink ponies!" The average token length is the sum of the lengths of all tokens, divided by the total number of tokens in the corpus. So, stepwise, we have to:
Get the length of each token in the corpus.
Sum the lengths of the tokens.
Divide the sum by the length of the corpus.
You know how to get the in characters length of a single token:
Prelude> length "flying"
6
Since you are lazy, you are not going to apply length to every
token in the corpus manually. Instead we want tell Haskell "Hey Haskell! Please
apply this length function to each element of the list." It turns out that Haskell
has a function to do this which is called map. Time to inspect
map:
Prelude> :type map
map :: (a -> b) -> [a] -> [b]
And we are in for another surprise. The most surprising element is probably the first
element in the type signature, (a -> b). Also surprising is that
we now see three types, (a -> b), [a] and
[b]. The latter is simple: this function takes two arguments,
(a -> b) and [a], and returns
[b]. (a -> b) as the notation suggests,
is a function taking an a and returning a b.
So, map is actually a function that takes a function as its
argument, or in functional programming-speak: a higher order
function.
So, map is a function that takes a function that maps from
a to b, takes a list of
as, and returns a list of bs. That looks a
suspicious lot like what we want! We have a list of tokens represented as strings, the
function length that takes a list and returns its length as an integer, and we want to
have a list of integers representing the lengths. Looks like we have a winner!
Prelude> map length ["Oh", ",", "no", ",", "flying", ",", "pink", "ponies","!"]
[2,1,2,1,6,1,4,6,1]
We have now completed our first step: we have the length of each token in the corpus.
Next, we have to sum the lengths that we have just retrieved. Fortunately, Haskell has a
sum function:
Prelude> :type sum
sum :: (Num a) => [a] -> a
This function takes a list of as, and returns an
a. But where did the (Num a) => come
from? Well, Num is a so-called typeclass. A type can belong to one or more of such typeclasses. But
belonging to a typeclass does not come without cost. In fact, it requires that certain
functions need to be defined for types that belong to it. For instance, the typeclass
Num is a typeclass for numbers, which requires amongst others,
functions that define addition or subtraction. Coming back to the type signature,
sum will sum a list of as, but not just
any as, only those that belong to the typeclass
Num. And after all, this makes sense, doesn't it? We cannot sum
strings or planets, but we can sum numbers. In fact, we can only sum numbers.
After this solemn introduction into typeclasses, feel free to take a cup of tea (or coffee), and try step two:
Prelude> :{
sum (map length ["Oh", ",", "no", ",", "flying", ",", "pink", "ponies", "!"])
:}
24
By now, you will probably smell victory. The only step that remains is to divide the sum by the length of the sentence using the division operator (/):
Prelude> :{
sum (map length ["Oh", ",", "no", ",", "flying", ",", "pink", "ponies", "!"]) /
length ["Oh", ",", "no", ",", "flying", ",", "pink", "ponies", "!"]
:}
<interactive>:1:0:
No instance for (Fractional Int)
arising from a use of `/' at <interactive>:1:0-136
Possible fix: add an instance declaration for (Fractional Int)
In the expression:
sum (map length ["Oh", ",", "no", ",", ....])
/ length ["Oh", ",", "no", ",", ....]
In the definition of `it':
it = sum (map length ["Oh", ",", "no", ....])
/ length ["Oh", ",", "no", ....]
And we have... Failure! I hope you poured yourself a cup of herb tea! (again alternatively: espresso!) While this is all a bit cryptic, the second line (No instance for (Fractional Int)) gives some idea where the trouble stems from. Fractional is typeclass for fractional numbers, and Haskell complains that Int is not defined to be of the typeclass Fractional. This sounds obvious, since an integer is not a fractional number. In other words, Haskell is trying to tell us that there is an Int in some place where it expected a type belonging to the typeclass Fractional. Since the division is the only new component, it is the first suspect of the breakdown:
Prelude> :type (/)
(/) :: (Fractional a) => a -> a -> a
First off, notice that we have put the division operator in parentheses. We have done this because the division operator is used as a so-called infix function: it is a function that is put between its arguments (like 1.0 / 2.0). By putting an infix operator in parentheses, you are stating that you would like to use it as a regular function. This means you can do things like this:
Prelude> (/) 1.0 2.0
0.5
Anyway, the verdict of the type signature of (/) is clear, it
requires two arguments that belong to the Fractional typeclass. The
sum and length that we calculated clearly do not belong to this typeclass, since they
are of the type Int:
Prelude>:{ :type sum (map length ["Oh", ",", "no", ",", "flying", ",", "pink", "ponies", "!"]) :}sum (map length ["Oh", ",", "no", ",", "flying", ",", "pink", "ponies", "!"]) :: Int Prelude>:{ :type length ["Oh", ",", "no", ",", "flying", ",", "pink", "ponies", "!"] :}length ["Oh", ",", "no", ",", "flying", ",", "pink", "ponies", "!"] :: Int
Fortunately, Haskell provides the function fromIntegral that
converts an integer to any kind of number. Add fromIntegral, and
you surely do get the average token length of the corpus:
Prelude> :{
fromIntegral
(sum (map length ["Oh", ",", "no", ",", "flying", ",", "pink", "ponies", "!"])) /
fromIntegral (length ["Oh", ",", "no", ",", "flying", ",", "pink", "ponies", "!"])
:}
2.6666666666666665
Well, that was a bumpier ride than you might have expected. Don't worry! During our first forays into Haskell, we were convinced that we were too stupid for this too (and here we are writing a book). However, after more practice, you will learn that Haskell is actually a very simple and logical language.
Maybe it will feel more like a victory after generalizing this to a function. You can follow the same pattern as in the palindrome example: replace the sentence with a symbolic name and transform it into a function:
Prelude>:{ let averageLength l = fromIntegral (sum (map length l)) / fromIntegral (length l) :}Prelude>:{ averageLength ["Oh", ",", "no", ",", "flying", ",", "pink" ,"ponies", "!"] :}2.6666666666666665
Congratulations, you just wrote your second function! But wait, you actually
accomplished more than you may expect. Check the type signature of
averageLength.
Prelude> :type averageLength
averageLength :: (Fractional b) => [[a]] -> b
You made your first weird type signature. Show it off to your colleagues, significant
other, or dog. averageLength is a function that takes a list of a
list of a, and returns a b that belongs to the
Fractional typeclass. But wait, a can be
anything, right? What happens if we apply this function to a list of sentences?
Prelude> averageLength [["I", "like", "Haskell", "."],
["Ruby", "rocks", "too", "."],
["Who", "needs", "Java", "?"]]
4.0
Woo! That's the average sentence length, expressed in number of words. It turns out that, although we set out to make a function to calculate the average token length, we wrote a function that calculates the average length of lists in a list (e.g., characters in words, words in sentences, or sentences in a text). This happens very often when you write Haskell programs: lots of functions are generic and can be reused for other tasks.
When dealing with real-world text, it is usually not neatly split in sentences and tokens. For example, consider this book - punctuation is usually glued to words. These processes, sentence splitting and tokenization may seem trivial, unfortunately they are not. Consider the following sentence:
E.g. Jack doesn't have 19.99 to spend.
If we simply perform sentence splitting on periods (.), we will find four sentences:
E.
g.
Jack doesn't have 19.
99 to spend.
Of course, it is just one sentence. Similar problems arise during punctuation: how do we know that E.g. and 19.99 should not be split? And how about doesn't, which should probably be split as does n't or does not? Tokenization can be performed accurately, but it requires techniques that you will see in later chapters. So don't worry, we will get back to proper tokenization later on. We promise!
Of course, up to the point where we handle tokenization, we need material to work on. To make life easier for you, the material for the first chapters of the book is pre-tokenized in a plain-text file using two simple rules:
One sentence per line.
Tokens are separated by a space.
To convert a text file to a Haskell representation, sentence splitting is a matter of splitting by line, and tokenization a matter of splitting by space. Have a look at the following example:
Prelude> "This is Jack .\nHe is a Haskeller ."
"This is Jack .\nHe is a Haskeller ."
This is exactly the representation that we will be using for our textual data. As you can see, the tokens are separated by spaces. Both sentences are separated using a newline. When writing down a string literally, you can insert a newline using \n.
Haskell provides a lines function to split up a string by line.
Not surprisingly, this function accepts a string as its first argument, and will return
a list of strings:
Prelude>:type lineslines :: String -> [String] Prelude>lines "This is Jack .\nHe is a Haskeller ."["This is Jack .","He is a Haskeller ."]
That was easy! Now to the actual tokenization. For all sentences, we have a string
representing the sentence. We want to split this string on the space character. Haskell
also has a function to do this, named words.
words is nearly the same function as
lines, except that it splits on spaces rather than newlines:
Prelude> words "This is Jack ."
["This","is","Jack","."]
That will do, but we have to apply this to every sentence in the list of sentences.
Recall that we can use the map function we have seen earlier to
apply the words function to each element of the list of
(untokenized) sentences:
Prelude> map words (lines "This is Jack .\nHe is a Haskeller .")
[["This","is","Jack","."],["He","is","a","Haskeller","."]]
Allright! That will do the job. We know how to turn this into a full-fledged function:
Prelude>let splitTokenize text = map words (lines text)Prelude>splitTokenize "This is Jack .\nHe is a Haskeller ."[["This","is","Jack","."],["He","is","a","Haskeller","."]]
This is a good moment to beautify this function a bit. To make it simpler, we first
need to get rid of the parentheses. We used the parentheses to tell Haskell that it
should evaluate lines text first, because it would otherwise try to
map over the function lines, which would fail, because it is not a
list. Very often, you will encounter function applications of the form
f(g(x)), or f(g(h(x))), etc. Haskell
provides the (.) function to combine such function applications.
So, f(g(x)) can be rewritten to (f . g) x
(apply function f to the outcome of g(x)) and
f(g(h(x))) as (f . g . h) x (apply
function f to the outcome of a function g,
which is in turn applied to the outcome of h(x)). As you can see,
this so-called function composition makes things much
easier to read. We can now rewrite our tokenization function by using function
composition:
Prelude> let splitTokenize text = (map words . lines) text
This states that we apply map words to the outcome
lines text. This may not yet seem so interesting. However, it
allows us to make yet another simplification step. Consider the type of the
map function:
Prelude> :type map
map :: (a -> b) -> [a] -> [b]
map takes a function, and a list, and returns a list. Now we will
do something that may look weird, but is very common in functional programming.
Prelude> :type map words
map words :: [String] -> [[String]]
Applying map to just one argument will give... another function!
In fact, what we just did is bind only the first argument of the map function, leaving
the second unbound. This gives a novel map function that only takes one argument, as it
has its first argument implicitly bound to the function words. This process is called
currying (indeed, named after the mathematician Haskell Curry)
in functional programming slang.
If we look back at our splitTokenize function, and look up the
type of map words . lines, we see that it is a function that takes
a String and returns a list of a list of strings:
Prelude> :type map words . lines
map words . lines :: String -> [[String]]
In our function body, we apply this function to the argument
text. Of course, this is not really necessary, because
map words . lines already defines our function (as we have
shown above). We just need to bind this to the name splitTokenize.
Consequently the function can once more be simplified:
Prelude>let splitTokenize = map words . linessplitTokenize :: String -> [[String]] Prelude>splitTokenize "This is Jack .\nHe is a Haskeller ."[["This","is","Jack","."],["He","is","a","Haskeller","."]]
In the following two sections, we will introduce two prototypical tasks related to words. The first is to make a word (or actually token) list, the second task is making a word frequency list.
A word list is a very simple data structure: it is just a list of unique words or tokens that occur in a text. Our corpus is also just a list of words, but since it contains duplicates, it is not a word list. The obvious method to make a word list is to go through a corpus word by word, and adding words that we have not yet seen to a second list. This requires some functions we haven't seen yet:
Adding an element to a list.
Checking whether an element is (or is not) in a list.
Constructing a list while traversing another list.
We like easy things first, so let's start with the first item: adding an element to a
list. We have seen the head function before that chops of the head
of the list and returns it. But we can also do the reverse: take a list and give it a
new head. The old head then becomes the head of the tail (are you still following?). In
Haskell, we can do this using the (:) function:
Prelude> 2 : [3,4,5]
[2,3,4,5]
Ain't that great? We can also add a head, and yet another:
Prelude> 1 : 2 : [3,4,5]
[1,2,3,4,5]
What if we do not have an element yet? Add the head to the empty list ([]):
Prelude> "Hi" : []
["Hi"]
With that covered, the next thing we need to be able to do is checking whether some
element belongs to a list. We can do this using the elem function.
It takes an element as its first argument, and a list as its second. It will return a
Bool of the value True if the element was in the list, or
False otherwise. For example:
Prelude>elem 2 [1,2,3,4,5]True Prelude>elem 6 [1,2,3,4,5]False
The function notElem is exactly the inverse of
elem, and returns True if an element is
not in the list, and False otherwise:
Prelude>notElem "foo" ["foo","bar","baz"]False Prelude>notElem "pony" ["foo","bar","baz"]True
Ok, so we want to add an element to a list if, but only if, it is true that it is not yet a member of that list. Or in other words, the addition is conditional. Haskell provides a set of keywords to model conditionals, if..then..else. The structure is like this:
if expr then a else b
This whole structure itself is an expression. This expression evaluates to a if expr evaluates to True or to b if expr evaluates to False. To give a working, but useless example:
Prelude>if 1 == 2 then "cuckoo" else "egg""egg" Prelude>if 1 == 1 then "cuckoo" else "egg""cuckoo"
This looks exactly like what we need. Just fill in the blanks:
Prelude>if elem "foo" ["foo","bar","baz"] then ["foo","bar","baz"] else "foo" : ["foo", "bar", "baz"]["foo","bar","baz"] Prelude>if elem "pony" ["foo","bar","baz"] then ["foo","bar","baz"] else "pony" : ["foo", "bar", "baz"]["pony","foo","bar","baz"]
That's a bit contrived, but (as you hopefully see) not if we rewrite it to a function:
Prelude>let elemOrAdd e l = if elem e l then l else e:lPrelude>elemOrAdd "foo" ["foo", "bar", "baz"]["foo","bar","baz"] Prelude>elemOrAdd "pony" ["foo", "bar", "baz"]["pony","foo","bar","baz"]
Good, now we need to apply this to all words in a text, starting with an empty list.
Haskell provides a function to do this, but brace yourself, the first time it may look a
bit 'difficult'. It is named foldl (a so-called) left fold. A left
fold traverses a list from head to tail, applying a function to each element, just like
map. However, the difference is that it can, but does not
necessarily return a list. As such, it is a generalization of the
map function. As usual, you can inspect the type signature to
see the arguments of foldl:
Prelude> :type foldl
foldl :: (a -> b -> a) -> a -> [b] -> a
Now consider this example using foldl:
Prelude> foldl (+) 0 [1,2,3,4,5]
15
Stepwise, this fold is executed in the following manner:
foldl (+) 0 [1,2,3,4,5] foldl (+) ((0)+1) [2,3,4,5] foldl (+) (((0)+1)+2) [3,4,5] foldl (+) ((((0)+1)+2)+3) [4,5] foldl (+) (((((0)+1)+2)+3)+4) [5] foldl (+) ((((((0)+1)+2)+3)+4)+5)) [] ((((((0)+1)+2)+3)+4)+5))
So, it works by applying a function to some initial argument (0 in this case) as its first argument, and the first element of the list as its second argument. When processing the second element of the list, this expression is then the first argument of the function, and the second element is the second argument, etc. The first argument of the function that is applied is also called the accumulator, since it accumulates results up till that point.
This could also work for our elemOrAdd function. Unfortunately,
elemOrAdd requires the accumulator as the second argument, and
the function passed to foldl as the first argument. Compare the
type signatures:
Prelude>:type foldlfoldl :: (a -> b -> a) -> a -> [b] -> a Prelude>:type elemOrAddelemOrAdd :: (Eq a) => a -> [a] -> [a]
In the function that is the first argument to foldl, the return
type is the same as the type of the first argument. In the case of
elemOrAdd, the type of the second argument corresponds to that
of the first. Of course, an easy 'hack' to solve this, is to redefine elemOrAdd,
switching its arguments, and plug it into foldl:
Prelude> let elemOrAdd l e = if elem e l then l else e:l
Now, since we are building a list, we use the empty list ([]) as the initial accumulator for this fold:
Prelude> foldl elemOrAdd [] ["blue", "blue", "red", "blue", "red"]
["red","blue"]
That looks good! Stepwise, the fold works like this:
foldl elemOrAdd [] ["blue", "blue", "red", "blue", "red"]
foldl elemOrAdd ("blue":([])) ["blue", "blue", "red", "blue", "red"]
foldl elemOrAdd ("blue":([])) ["blue", "red", "blue", "red"]
foldl elemOrAdd ("blue":([])) ["red", "blue", "red"]
foldl elemOrAdd ("red":("blue":([]))) ["blue", "red"]
foldl elemOrAdd ("red":("blue":([]))) ["red"]
foldl elemOrAdd ("red":("blue":([]))) []
("red":("blue":([])))
["red","blue"]
Now we wrap it up in another function, and you have constructed two functions that, together, make word lists:
Prelude>let wordList = foldl elemOrAdd []Prelude>wordList ["blue", "blue", "red", "blue", "red"]["red","blue"]
While our little word list function works fine on small texts, it will not be very
efficient for big corpora. The reason is simple - suppose that we have already found
100,000 different tokens. For every word, it would have to check the list of 100,000
tokens. There is no other way of doing this than to traverse the list word by word. Or,
on average, we compare a token to 100,000 / 2 = 50,000 elements in the list. As a
computer scientist would say: elemOrAdd works in linear time, its
processing time is linear to the number of different tokens that were seen.
This is a typical case of picking the wrong data structure for the task. But for illustrative purposes, using lists was nice and simple. However, since you are a working programmer, you want workable solutions. Bring in the sets! A set is, like the mathematical set, a collection that does not contain duplicate elements. That's good, because a word list does not contain duplicate elements. Silly us, we were fooled by the word list. What we actually want to build is a word set. In fact, it is just for historical purposes that it is called a word list.
Beside the uniqueness of elements, another nice property of sets, as they are normally implemented, is that set membership can be checked rather quickly. In the sets that we will use, membership checking is in logarithmic time. Or in other words, if comparison took one second, we would on average need 50,000 seconds to search the list mentioned earlier, but only log(100,000) or approximately 11.5 seconds to check whether the element is in a set. Talking about optimizations!
Haskell provides set functionality, but not in the so-called
Prelude. Prelude is a module that contains
functions. The Prelude module is always loaded, so its functions
are always available (unless you explicitly ask Haskell to hide them). The functions
map, head, tail, and
length, for instance, are defined in the
Prelude. Functions for set manipulation, on the other hand, are
defined in a module named Data.Set. For the time being, we will
access functions from modules by prefixing the name of the module. For instance, this
will give us the empty set:
Prelude> Data.Set.empty
fromList []
Like a list, a set can contain elements of various types. We see
this when inspecting the type signature of the
empty function:
Prelude> :type Data.Set.empty
Data.Set.empty :: Data.Set.Set a
empty returns a Set of some type a. We can
also construct a Set from a list using the
fromList function:
Prelude> Data.Set.fromList [5,2,5,8,1,1,23]
fromList [1,2,5,8,23]
As you can see here, the set does not contain duplicates. Another nice property of
Haskell sets is that they are ordered. We can also do the inverse, convert a set to a
list using toList:
Prelude> Data.Set.toList (Data.Set.fromList [5,2,5,8,1,1,23])
[1,2,5,8,23]
Elements can be added to or removed from a Set using respectively
the insert and delete functions. Both
functions return a set with that element inserted or removed:
Prelude>Data.Set.insert 42 (Data.Set.fromList [5,2,5,8,1,1,23])fromList [1,2,5,8,23,42] Prelude>Data.Set.delete 5 (Data.Set.fromList [5,2,5,8,1,1,23])fromList [1,2,8,23]
Finally, we can check whether some value is a member of a set by using the
member function:
Prelude>Data.Set.member 23 (Data.Set.fromList [5,2,5,8,1,1,23])True Prelude>Data.Set.member 24 (Data.Set.fromList [5,2,5,8,1,1,23])False
We have now seen enough to change our word list function. Rather than checking whether a value is in a list and adding it if not, we check whether it is in a Set and add it in when it is not:
Prelude>:{ let elemOrAdd s e = if Data.Set.member e s then s else Data.Set.insert e s :}Prelude>elemOrAdd (Data.Set.fromList [5,2,5,8,1,1,23]) 24fromList [1,2,5,8,23,24]
That was simple. But it feels a weird, right? The most vital characteristic of a set
is that it never contains duplicate elements, why do we need to check for duplicates? We
don't. So, forget about elemOrAdd, we will only use
Data.Set.insert from this point. Our objective now is to
traverse a list of tokens, adding each token to a set, starting with the empty set. Our
first take is this:
Prelude> let wordSet = foldl Data.Set.insert Data.Set.empty
However, this will not work. Remember that in the function we give to
foldl, the accumulator has to be the first argument (the reason
why we swapped the elements of our initial elemOrAdd
function)? We are accumulating a Set, but the set is the second
argument to Data.Set.insert. We will pull a little trick out of our
hat.
Prelude> let wordSet = foldl (\s e -> Data.Set.insert e s) Data.Set.empty
You might be thinking "Oh, no, more syntax terror! Does it ever stop?" Actually, ( e -> Data.Set.insert e s) is very familiar. You could see it as an inline function. In functional programming jargon, this is called a lambda. Check out the type signature of the lambda:
Prelude> :type (\s e -> Data.Set.insert e s)
(\s e -> Data.Set.insert e s)
:: (Ord a) => Data.Set.Set a -> a -> Data.Set.Set a
It is just a function, it takes a set of some type a, a value of type a, and returns a. Additionally, a has the typeclass Ord, which means that some comparison operators should be defined for a. The lambda has two arguments that are bound to s and e. The function body comes after the arrow. To emphasize that this is just a function, the following functions are equivalent:
myFun = (\s e -> Data.Set.insert e s) myFun s e = Data.Set.insert e ss
Back to our wordSet function. We used the lambda to swap the
arguments of Data.Set.insert. Data.Set.insert
takes a value and a set, our lambda takes a set and a value. The rest of the function
follows the same pattern as wordList, except that we start with an
empty set rather than an empty list. The function works as expected:
Prelude> wordSet ["blue", "blue", "red", "blue", "red"]
fromList ["blue","red"]
You have done it! You are now not only able to make a function that creates a word list, but also one that is performant.
To measure the vocabulary of a writer, a so-called type-token ratio can be calculated. This is the number of distinct tokens occurring in a text (types) divided by the total number of tokens in that text.
For instance the phrase "to be or not to be" contains six tokens and four types (to, be, or, not). The type-token ratio of this phrase is 4 / 6 = 2 / 3.
Write a function that calculates the type-token ratio of a list of tokens. You can use the Data.Set.size function to get the number of elements in a set.
Now that we are writing longer and longer functions, it becomes more convenient to define functions in a file rather than the ghci prompt. You can do this by creating a file using a plain-text editor with the .hs extension. Functions can be written down in the same manner as in ghci, but without the preceding let keyword. It is also highly recommended to add a type signature before the function. Haskell will check the function against the type signature, and report an error if they do not correspond. This will help you catch incorrect function definitions.
The palindrome function discussed earlier in this chapter can be
written to a file like this:
palindrome :: (Eq a) => [a] -> Bool palindrome word = word == reverse word
If you saved this file as example.hs, you can load it in ghci using the :l (shorthand for :load) command:
Prelude> :l example
[1 of 1] Compiling Main ( example.hs, interpreted )
Ok, modules loaded: Main.
*Main> palindrome "racecar"
True
For code fragments that use a module other than the prelude, add an import statement
at the top of the file. For example, the wordSet function from the
previous section should be saved to a text file in the following manner:
import qualified Data.Set wordSet :: Ord a => [a] -> Data.Set.Set a wordSet = foldl (\s e -> Data.Set.insert e s) Data.Set.empty
From now on, we assume that examples are written to a text file, except when the Prelude> occurs in the example.
The word list function that we built in the previous section works is useful for various tasks, like calculating the type-token ratio for a text. For some other tasks this is not good enough - we want to be able to find out how often a word was used. We can expand a word list with frequencies to make a word frequency list.
To be able to store word frequencies, every word has to be associated with an integer. We could store such an association as a tuple. A tuple is a data type with a fixed number of elements and a fixed type for an element. Examples of tuples are:
(1,2,3)
("hello","world")
("hello",1)
As you can see, they differ from lists in that they can have values of different types as elements. However, if you inspect the type signatures of these tuples, you will see that the length and type for each position is fixed:
Prelude>:type (1,2,3)(1,2,3) :: (Num t, Num t1, Num t2) => (t, t1, t2) Prelude>:type ("hello","world")("hello","world") :: ([Char], [Char]) Prelude>:type ("hello",1)("hello",1) :: (Num t) => ([Char], t)
To store frequencies, we could use a list of tuples of the type [([Char], Int)]. The phrase "to be or not to be" could be stored as
[("to",2),("be",2),("or",1),("not",1)]
However, this would be even less efficient than using lists for constructing word
lists. First, like elemOrAdd we would potentially have to search
the complete list to locate a word. Second, we would have to reconstruct the list up to
the point of the element. In the elemOrAdd function we could just
give the list a new head, but now we would have to replace the element to update the
word frequency and add all preceding list items again. Since Haskell is a 'pure'
language, we cannot modify existing values.
A more appropriate data type for this task is a map (not to be confused with the
map function). A map maps a key to a value. In Haskell, maps
are provided in the Data.Map module. Like sets, we can make an
empty map:
Prelude> Data.Map.empty
fromList []
When you inspect the type signature of the empty map, you can see that it parametrizes over two types, a type for the key and a type for values:
Prelude> :type Data.Map.empty
Data.Map.empty :: Data.Map.Map k a
We can construct a Map from a list of binary tuples (tuples with two elements), where the first element of the tuple becomes the key, and the second the value:
Prelude>Data.Map.fromList [("to",2),("be",2),("or",1),("not",1)]fromList [("be",2),("not",1),("or",1),("to",2)] Prelude>:type Data.Map.fromList [("to",2),("be",2),("or",1),("not",1)]Data.Map.fromList [("to",2),("be",2),("or",1),("not",1)] :: (Num t) => Data.Map.Map [Char] t
This also binds the types for the map: we are mapping from keys of type string to values of type t that belongs to the t typeclass. No specific value for types is used (yet), because the numbers could be integers or fractionals.
The insert function is used to add a new mapping to the
Map:
Prelude> :{
Data.Map.insert "hello" 1
(Data.Map.fromList [("to",2),("be",2),("or",1),("not",1)])
:}
fromList [("be",2),("hello",1),("not",1),("or",1),("to",2)]
If a mapping with the given key already exists, the existing mapping is replaced:
Prelude> :{
Data.Map.insert "be" 1
(Data.Map.fromList [("to",2),("be",2),("or",1),("not",1)])
:}
fromList [("be",1),("not",1),("or",1),("to",2)]
Looking up values is a bit peculiar. You can lookup a value with the
lookup function. However, if you inspect the type signature,
you will see that the value is not returned as is:
Prelude> :type Data.Map.lookup
Data.Map.lookup :: (Ord k) => k -> Data.Map.Map k a -> Maybe a
Rather than returning a value, it returns the value packed in some box called Maybe. Maybe a is a type that has just two possible so-called constructors, Just a or Nothing. You can put your own values in a Maybe box using the Just a constructor:
Prelude>Just 22Just 22 Prelude>:type Just 22Just 22 :: (Num t) => Maybe t Prelude>Just [1,2,3,4,5]Just [1,2,3,4,5] Prelude>Just "stay calm"Just "stay calm" Prelude>:type Just "stay calm"Just "stay calm" :: Maybe [Char]
You can also make a box that contains vast emptiness with the Nothing constructor:
Prelude>NothingNothing Prelude>:type NothingNothing :: Maybe a
These boxes turn out to be pretty cool: you can use them to return something or
nothing from functions, without resorting to all kinds of abominations as exceptions or
null pointers (if you never heard of exceptions or pointers, do not worry, you have a
life full of bliss). Since Maybe is so nice, the
lookup function uses it. It will return the value packed with
in a Just constructor if the key occurred in the map, or Nothing otherwise:
Prelude>:{ Data.Map.lookup "to" (Data.Map.fromList [("to",2),("be",2),("or",1),("not",1)]) :}Just 2 Prelude>:{ Data.Map.lookup "wrong" (Data.Map.fromList [("to",2),("be",2),("or",1),("not",1)]) :}Nothing
As for handling these values - we will come to that later. Mappings are deleted from a
Map by key with the delete function. If a
key did not occur in the Map, the original map is returned:
Prelude>:{ Data.Map.delete "to" (Data.Map.fromList [("to",2),("be",2),("or",1),("not",1)]) :}fromList [("be",2),("not",1),("or",1)] Prelude>:{ Data.Map.delete "wrong" (Data.Map.fromList [("to",2),("be",2),("or",1),("not",1)]) :}fromList [("be",2),("not",1),("or",1),("to",2)]
Finally, a Map can be converted to a list
using the toList function:
Prelude> :{
Data.Map.toList
(Data.Map.fromList [("to",2),("be",2),("or",1),("not",1)])
:}
[("be",2),("not",1),("or",1),("to",2)]
Alright. Back to our task at hand: constructing a word frequency list. As with word
lists, we want to traverse a list of words, accumulating data. So, the use of
foldl is appropriate for this task. During each folding step,
we take the Map created in a previous step. We then
lookup the value for the current step in the Map. If
it does not exist, we add it to the Map giving it a
frequency of one. Otherwise, we want to increase the frequency by one. The
countElem function does this:
import qualified Data.Map
countElem :: (Ord k) => Data.Map.Map k Int -> k -> Data.Map.Map k Int
countElem m e = case (Data.Map.lookup e m) of
Just v -> Data.Map.insert e (v + 1) m
Nothing -> Data.Map.insert e 1 m
This function introduces the case construct.
Remember that lookup uses the nifty Maybe data type? The case construct
allows us to select an expression based on a constructor. If lookup
returned a value using the Just constructor, the key
was in the Map. In this case, we bind the value to
the name v and add a new value for this key. This
value is the old value for this key incremented by one. If a value with the Nothing constructor was returned, the key was not in the
Map. So, we will add it, and give it a
(frequency) value of 1.
The countElem function works as intended:
*Main> foldl countElem Data.Map.empty ["to","be","or","not","to","be"]
fromList [("be",2),("not",1),("or",1),("to",2)]
While this was a nice exercise, the Data.Map.insertWith function can drastically shorten our function. This function uses an update function to update a value, or a specified value if the key is not present in the Map:
*Main> :t Data.Map.insertWith
Data.Map.insertWith
:: Ord k =>
(a -> a -> a) -> k -> a -> Data.Map.Map k a -> Data.Map.Map k a
The update function gets the specified value as its first argument, and the old value
as its second argument. Using insertWith, we can shorten our
function to:
countElem :: (Ord k) => Data.Map.Map k Int -> k -> Data.Map.Map k Int countElem m e = Data.Map.insertWith (\n o -> n + o) e 1 m
If an element was not seen in the Map yet, a
frequency of 1 will be inserted. If the element does occur as a key in the map, the
lambda adds one to the old frequency. With countElem in our reach,
we can define the freqList function:
freqList :: (Ord k) => [k] -> Data.Map.Map k Int freqList = foldl countElem Data.Map.empty
In the next section you will see how to read real text corpora using the so-called IO monad. Before diving into the IO monad, we will give a short introduction to monads. In Haskell, it happens very often that you want to perform a series of computations on values that are wrapped using some type, such as Maybe or a list. For instance, suppose that you have a Map that maps a customer name to a customer number, and yet another Map that maps a customer number to a list of order numbers:
*Main>let customers = Data.Map.fromList [("Daniel de Kok", 1000), ("Harm Brouwer", 1001)]*Main>let orders = Data.Map.fromList [(1001, [128])]Prelude>Data.Map.lookup "Harm Brouwer" customersJust 1001 Prelude>Data.Map.lookup 1001 ordersJust [128]
Now, we want to write a function that extracts a list orders for a given customer, wrapped in a Maybe, so that Nothing can be returned if the customer is not in the list or if the customer had no orders. Your first attempt will probably be along the following lines:
lookupOrder0 :: Data.Map.Map String Integer ->
Data.Map.Map Integer [Integer] ->
String -> Maybe [Integer]
lookupOrder0 customers orders customer =
case Data.Map.lookup customer customers of
Nothing -> Nothing
Just customerId -> Data.Map.lookup customerId orders
This function works as intended:
*Main>lookupOrder0 customers orders "Jack Sparrow"Nothing *Main>lookupOrder0 customers orders "Daniel de Kok"Nothing *Main>lookupOrder0 customers orders "Harm Brouwer"Just [128]
But case constructs will quickly start to stack up
when more functions are called that return Maybe. For each
Maybe we follow the same procedure: if the value is
Nothing we end the computation with that value, if the value is
Just x we call the next function that possibly uses
x as an argument. This is where the Monad typeclass
kicks in: all data types that are of the Monad typeclass implement the
(>>=) function. This function joins computations resulting in that data type according to some logic.
For instance, the Maybe monad combines expressions that return
Maybe in such a way that if one expression returns
Nothing, the whole joined expression also returns Nothing,
just like our case construct in the example above.
Consequently, we could rewrite the example above
as:
lookupOrder1 :: Data.Map.Map String Integer ->
Data.Map.Map Integer [Integer] ->
String -> Maybe [Integer]
lookupOrder1 customers orders customer =
Data.Map.lookup customer customers >>= (\m -> Data.Map.lookup m orders)
That surely shortened the function! But what does it do? Well, the function performs the following steps:
Data.Map.lookup customer customers: Lookup
customer in customers.
(>>=): If the expression on the left-hand returned
Nothing, the value of the full expression, and consequently
lookupOrder1, is Nothing. If the
lookup returned a value of type Just Integer, extract the
Integer from the Just constructor, and pass it
as the argument of the next function.
(\m -> Data.Map.lookup m orders): Lookup the supplied
argument in the orders
Map. The result becomes the result of
lookupOrder1. We had to use a lambda, to make the
element to be looked up the first argument.
The type signature of (>>=) is also very illustrative:
*Main> :type (>>=)
(>>=) :: (Monad m) => m a -> (a -> m b) -> m b
The (>>=) simply unwraps a value from the monad and feeds it to a
function that returns a value wrapped in the same monad. The (>>=)
function has the freedom to perform any operation (including not calling (a ->
mb)), as long as it returns a value of type m b.
The actual implementation of (>>=) for the Maybe monad is very simple:
(Just x) >>= k = k x Nothing >>= _ = Nothing
That's all! If the left-hand side expression evaluates to Just x, the
expression on the left hand side is evaluated with x as its argument.
If the left-hand side evaluates to Nothing, the right-end side is not
evaluated, and the whole expression returns Nothing.
There is yet another function that is essential to monads named return,
it does nothing else than wrapping a value in that monad. For instance, the
maybeBool function wraps a Bool value in a
Maybe
monad:
maybeBool :: Bool -> Maybe Bool maybeBool = return
This is maybeBool in
action:
*Main>maybeBool TrueJust True *Main>maybeBool FalseJust False
The implementation of return for the Maybe monad is
trivial:
return = Just
Let's get back to our order lookup function. You may have noticed that the
(>>=) function is somewhat imperative in nature: it chains a
set of expressions, where the left-hand expression is evaluated before the right-hand.
Due to this nature, Haskell has a do notation that
resembles imperative programs. This is lookupOrder1 using the
do-notation:
lookupOrder2 :: Data.Map.Map String Integer ->
Data.Map.Map Integer [Integer] ->
String -> Maybe [Integer]
lookupOrder2 customers orders customer = do
customerId <- Data.Map.lookup customer customers
orders <- Data.Map.lookup customerId orders
return orders
You can see that we added the do keyword to start
the do-notation. Every expression that follows can be seen as just one element in a
sequence of (>>=) expressions. Consequently, each line is governed by the
'laws' of the monad. If the first lookup fails, no further evaluation is performed, and
lookupOrder2 will return Nothing. Otherwise,
computation continues. The do-notation also allows the use of the backward arrow
(<-) this arrow extracts a value from the
monad, and binds it to a variable.
We can simplify lookupOrder2 further. The last lookup already returns a value wrapped
in the Maybe monad, there is no need to extract it using <- and wrap it again with
return:
lookupOrder :: Data.Map.Map String Integer ->
Data.Map.Map Integer [Integer] ->
String -> Maybe [Integer]
lookupOrder customers orders customer = do
customerId <- Data.Map.lookup customer customers
Data.Map.lookup customerId orders
Up to this point we have been using very artificial text corpora. At most a few sentences. But you are in it for the real deal, right? Lucky you, we will use a real (like really real) corpus starting from this very moment. Of course, just to test functions we will start with small examples. But you will be able to apply your functions to real data.
So-called I/O (input/output) is a delicate matter in Haskell. The reason being that, as a pure functional language, it is not possible to modify expressions or values once they exist. This leads to an admirable quality of Haskell: given the same input, a function will always return the same output. Or in other words, a function does not have side-effects. Unlike most other languages, there is no state in a function that can change, so the output can also not change. If a function always evaluates to the same value given the same input, how can you have I/O? For example, suppose that we open a file, and use a function read a byte. And then we read yet another byte. The second byte may be a different one. The reading function has a side-effect: it increases the position within the file.
The Haskell developers have, clever as they are, found a solution to get Pandora's box
into Haskell. I/O in Haskell is performed in the so-called IO monad. The IO monad is not very different from the
Maybe monad, it implements the (>>=) and
return functions as required for monads. However, there is a subtle,
but very important difference. Remember that you can extract a value from a
Maybe value using its Just
constructor?
value = case someMaybe of Just x = x
The IO monad does not have a public constructor, so there is no way to pry a value out of an IO monad. If you read a file as a list of Char, it resides in the IO monad. You can apply any list function to this list, however, the result of the function that is applied to a list will also have to reside in the IO monad. A value can never escape the IO monad. And this is how impure I/O is possible in Haskell without sacrificing purity of the language: impure data stays in the IO monad, and can never escape.
Now on to some real work. As said, functions that do IO return a value wrapped in
IO. For instance, the putStrLn function returns an
empty tuple packed in the IO
monad:
Prelude> :type putStrLn "hello world!"
putStrLn "hello world!" :: IO ()
This is just a normal value, you can bind it to a name in ghci:
Prelude> let v = putStrLn "hello world!"
v :: IO ()
However, if we evaluate the value in ghci, it will execute this I/O action:
Prelude> v
hello world!
Of course, the same thing happens if we evaluate putStrLn
directly:
Prelude> putStrLn "hello world!"
hello world!
Now, let's get to the interesting part: reading a file. In the files distributed with
this book, you will find the file brown.txt. This file contains the
Brown corpus, a corpus of written text of various kinds. The Brown corpus is already
tokenized, which makes our life a bit easier. Ok, first we need to open the file using
the IO.openFile function. openFile requires a
filename and an I/O mode as its arguments, and it returns a handle packed in the
IOmonad:
Prelude> :type IO.openFile
IO.openFile
:: FilePath
-> GHC.IO.IOMode.IOMode
-> IO GHC.IO.Handle.Types.Handle
We use IO.ReadMode to open
brown.txt for reading:
Prelude>let h = IO.openFile "brown.txt" IO.ReadModePrelude>:type hh :: IO GHC.IO.Handle.Types.Handle
The handle is bound to h, but still in the IO monad. It would be nicer if we can access that handle directly, but we told you that a value can never escape its monad. Surprisingly, we can extract the value from the IO monad in ghci. The reason that you can is that ghci lives in the IO monad itself. So, the value will still never leave IO. We can bind the value to a name (or pattern) using <-:
Prelude>h <- IO.openFile "brown.txt" IO.ReadModePrelude>:type hh :: GHC.IO.Handle.Types.Handle
That gives us the handle, bound to h. The next function that we will use is IO.hGetContents, which returns unread data as a String wrapped in the IO monad:
Prelude> :type IO.hGetContents
IO.hGetContents :: GHC.IO.Handle.Types.Handle -> IO String
As we mentioned earlier, Haskell is a lazy language: expressions are only evaluated when necessary. The same thing applies to I/O: the relevant contents of a file are only read once you start extracting characters from the String. With some smart programming, it is not necessary for your program to read the whole file into memory, it will allocate and deallocate chunks of the file as they are used. Now, get the contents of the file:
Prelude> c <- IO.hGetContents h
Prelude> :type c
c :: String
We can apply the usual list functions to this String:
Prelude>head c'T' Prelude>length c6157180
Since the file is sentence-splitted using newlines and tokenized using spaces, you can
use the lines and words functions to apply
sentence splitting and tokenization. For instance, the first word of the corpus
is:
Prelude> head (words (head (lines c)))
"The"
The frequency of the word the is nicely wrapped in a Just constructor:
Prelude> Data.Map.lookup "the" (freqList (words c))
Just 62713
Congratulations! This was your first venture into the world of corpus statistics!
Todo: Zipfian distribution.