Spell corrector (aka Google suggest) in Erlang (first iteration)

After seeing the Peter Norvig‘s implementation of a Spell corrector in 20 lines of python code, I would like to add an Erlang implementation to the perl6 one made by Gabriele.

Unfortunately I’m blocked by an issue I’m still trying to solve. (update: thanks to your suggestions I’ve solved the “folding” issue)

Here’s what I did until now:

(Note: I followed Peter Norvig’s naming for functions as much as possible.)

The words/1 function takes a binary representation of the reference text file and build a list of words:

words(Bin) ->
    {ok, Words} = regexp:split(binary_to_list(Bin), "[^a-zA-Z]"),
    lists:map(fun(X) -> string:to_lower(X) end, Words).

The train/1 function takes a list of words and creates a Set of lower case words with associated frequency:

train(Words) ->
    Dict = ets:new(dictionary, [set]),  
    lists:foreach(fun(X) -> 
        case ets:insert_new(Dict, {list_to_binary(X), 1}) of
            false -> ets:update_counter(Dict, list_to_binary(X), 1);
            true -> true
        end
    end, Words),
    Dict.

And finally the build_dict/1 function takes a filename and returns the associated word Set:

build_dict(Filename) ->
    {ok, Bin} = file:read_file(Filename),
    train(words(Bin)). 

Then there’s the function that generates all the possible “mistakes” from a word. This function is actually made of four different functions that generates different kind of errors:

alphabet() ->
    "abcdefghijklmnopqrstuvwxyz".

deletion_edits(Word) ->
    deletion_edits([], Word, []).
deletion_edits(_, [], Edits) ->
    Edits;
deletion_edits(Before, [Current | After], Edits) ->
	deletion_edits([Current | Before], After, 
		[lists:reverse(Before) ++ After | Edits]).

transposition_edits(Word) ->
    transposition_edits([], Word, []).
transposition_edits(_, [], Edits) ->
    Edits;
transposition_edits(_, [_], Edits) ->
    Edits;
transposition_edits(Before, [Current, Next | After], Edits) ->
	transposition_edits([Current | Before], [Next | After], 
		[ lists:reverse(Before) ++ [Next, Current] ++ After | Edits]).

alteration_edits(Word) ->
    alteration_edits([], Word, []).
alteration_edits(_, [], Edits) ->
    Edits;
alteration_edits(Before, [Current | After], Edits) ->
	BeforeR = lists:reverse(Before),
	alteration_edits([Current | Before], After, 
		[BeforeR ++ [X] ++ After || X <- alphabet()] ++ Edits).

insertion_edits(Word) ->
    insertion_edits([], Word, [[X] ++ Word || X <- alphabet()]).
insertion_edits(_, [], Edits) ->
    Edits;
insertion_edits(Before, [Current | After], Edits) ->
	BeforeR = lists:reverse(Before),
	insertion_edits([Current | Before], After, 
		[BeforeR ++ [Current, X] ++ After || X <- alphabet()] ++ 
		Edits).

We have edits1/1 that generates the 1-level error words and edits1/2 that generates the 2-level (errors of errors) words:

edits1(Word) ->
    lists:usort(deletion_edits(Word) ++ transposition_edits(Word) ++ alteration_edits(Word) ++ insertion_edits(Word)).
edits2(Word) ->
	lists:usort(lists:foldr(fun(A, AccIn) ->
		AccIn ++ edits1(A) 
	end, [], edits1(Word))).

The problem is on the edits2/1 function, it generates a list made of sublists but I want it to generate a single flat list instead. I've tryed lists:flatten/1 but it flattens strings too, I need time to find the correct function...

Here's the known/2 function that filters a list of words by returning only the words present in the dictionary:

known(Words, Dict) ->
	lists:filter(fun(Word) ->
		ets:member(Dict, list_to_binary(Word))
	end, Words).

And then there's the gen_candidates/2 function that generates the possible correct words from the mispelled one:

gen_candidates(Word, Dict) ->
	C1 = known([Word], Dict),
	case (length(C1) > 0) of
		true -> 
			[Word];
		false -> 
			C2 = known(edits1(Word), Dict),
			case (length(C2) > 0) of
				true ->
					C2;
				false ->
					C3 = known(edits2(Word), Dict),
					case (length(C3) > 0) of
						true -> C3;
						false -> [Word]
					end
			end
	end.

And finally the correct/2 function that takes a mispelled word and returns the most probable correct one:

correct(Word, Dict) ->
	lists:foldl(fun(W, {CN, CW}) ->
		case ets:lookup(Dict, list_to_binary(W)) of
			[{LW, LN}] ->
				case LN > CN of
					true -> {LN, LW};
					false -> {CN, CW}
				end;
			_ ->
				{CN, CW}
		end
	end, {0, Word}, gen_candidates(Word, Dict)).

At this point I started trying some words on the shell:

5> Dict = spellcheck:build_dict("big.txt").
14
6> spellcheck:known(["hello", "helo"], Dict).
["hello"]
7> spellcheck:known(["hello", "helo", "true"], Dict).
["hello","true"]
8> spellcheck:gen_candidates("hello", Dict).
["hello"]
9> spellcheck:gen_candidates("helo", Dict). 
["felo","halo","held","hell","hello","helm","help","hero"]
10> spellcheck:correct("helo", Dict).
{287,<<"held">>}
11> spellcheck:correct("correctz", Dict).
{38,<<"correct">>}
12> spellcheck:correct("correc", Dict).  
{38,<<"correct">>}
13> spellcheck:correct("gues", Dict).  
{112,<<"guns">>}
14> spellcheck:correct("guess", Dict).
{17,<<"guess">>}
15> spellcheck:correct("gues", Dict). 
{112,<<"guns">>}
16> spellcheck:correct("gue", Dict). 
{249,<<"due">>}

It seems to work! :)

In the next iteration I'll try to optimize and refactor the code, in special way for the edits generators that are too long for my tastes... Suggestions are welcome! :)

Also read...

Comments

  1. Pingback: Planet Erlang

  2. Pingback: PDI^2

  3. Not sure this is what you need, but maybe this is what you need to flatten?

    > lists:flatmap(fun(A) -> A end, [["ABC"],["DEF"],["GHI"]]).
    ["ABC","DEF","GHI"]

    And, if you ran it again..

    > lists:flatmap(fun(A) -> A end, ["ABC","DEF","GHI"]).
    “ABCDEFGHI”

  4. Ah, great post :)
    It seems that python’s list comprehension have a little advantage over erlangs by allowing iteration over multiplethings at the same time and allowing you to avoid explicit flattening. Unexpected :)

    Oh, and maybe in words/1 you could use the fun string:to_lower/1 syntax instead of defining an anonymous function to call that?

  5. flatmap(Function, List1) -> Element

    Types:
    Function = fun(A) -> B
    List1 = [A]
    Element = [B]

    flatmap behaves as if it had been defined as follows:

    flatmap(Func, List) ->
    append(map(Func, List))

    this function might be useful for implementing edits2/1

    e.g.

    edits2(Word) -> flatmap(edits1, edits1(Word))

  6. Use the fold, Luke!

    106> Fun = fun(A, AccIn) -> AccIn ++ A end.
    #Fun
    107> lists:foldr(Fun, [], [["aa", "bb"], ["cc"]]).

    I’m porting my Haskell port of Norvig’s spellchecker to Erlang. I’ll throw something in SVN in a few days and we can compare notes!

  7. Wow, thank’s for all these useful comments!

    Riffraff: I did some timings and I found a weird results, if I used and indirect call to string:to_lower/1 with an anonymous fun I gained about 30% in speed!

    Jordan, Mark, Steve, massemanet: thanks for your suggestions, I’ll try with that this afternoon! :)

    cheers

  8. Pingback: Spell corrector (aka Google suggest) in Erlang (first part) (reddit.com)

  9. Pingback: M0re splling correctrs... « A Programmer’s Ramblings …

  10. The “alphabet” function is just “lists:seq($a, $z)”.

    It might be nice to implement a “position fold” function that would iteratively pass into a parameter function a binary, a position (integer), and an accumulator. Then you could write your “edits” functions in terms of simple calls to that.

  11. Mike, I didn’t wanted to use a “position counter” because in that case you have to go through all the Nth items each time to split the string.

    The best optimization I can think of is to just create a single edits generator that gets the two parts of the string and applies the different algorithms.

    [h,e|llo] -> “hllo” (deletion), “heallo” … “hezllo” (insertion), “hallo” … “hzllo” (alteration), “hlelo” (transposition)

    But I suspect that you’ll just gain some function calls overheads.
    It would be nice to use lists comprehension to like Peter did in python.

  12. Pingback: links for 2007-05-14 -- A Tempest of Thoughts

  13. you can use list comprehensions like peter did,

    you would just need to create a slice function

    or use the sub_string function in the string module

  14. i rewrote edits1 to use list comprehensions:

    edits1(Word) ->
    N = length(Word),
    SEQ = if N []; true -> lists:seq(1,N) end,
    SEQM1 = if N []; true -> lists:seq(1,N-1) end,
    SEQP1 = lists:seq(1,N+1),
    ALPHA = alphabet(),
    lists:usort([lists:concat([string:sub_string(Word,1,I-1),string:sub_string(Word,I+1)]) || I

  15. You can also use lists:sublist:

    delete(Word) ->
    LW = length(Word),
    [sublist(Word, I) ++ sublist(Word, I+2, LW) || I <- seq(0, LW-1)].

  16. I think most spelling mistakes are related to the keymap.
    that is, qwerty.
    helo to held is less likely than helo to help.
    (p is near o, and d is far from o)
    maybe u can use this method to determine which candidate (from the various suggestion techniques) is more likely.

    gue is probably glue and not due
    gues is probably guess and not guns

    (letter miss more likely than far keyboard character)
    but all this in assumption that the keyboard is qwerty

  17. *** Mike, I didn’t wanted to use a “position counter” because in that case you have to go through all the Nth items each time to split the string.

    You didn’t read exactly what I wrote: if you use binary data instead of lists, you can “index” into your strings with an integer. It’s also cheaper than all those “++” expressions, which really should be avoided.

    The formatting is going to be all messed up, but here’s a “position fold” using binaries:

    pos_foldl(Fun, Initial, List) -> pos_foldl(Fun, Initial, 0, list_to_binary(List)).
    pos_foldl(Fun, Accum, Count, Binary) ->
    case Binary of
    > -> Fun(Before, >, Accum);
    > ->
    pos_foldl(Fun, Fun(Before, After, Accum), Count + 1, Binary)
    end.

    And here’s the “deletion_edit” written in terms of that:

    deletion_edit(Word) ->
    lists:reverse(pos_foldl(fun
    (_, >, Accum) -> Accum;
    (Before, >, Accum) -> [> | Accum]
    end, [], Word)).

  18. gen_candidates(Word, Dict) ->
    lookup(Word, Dict, [fun edits1/1, fun edits2/1, fun edits3/1], “”).

    lookup(_Word, _Dict, _Edits, Found) when length(Found) > 0 -> Found;
    lookup(_Word, _Dict, [], “”) -> Word;
    lookup(Word, Dict, [Edit|Edits], “”) ->
    lookup(Word, Dict, Edits, known(Edits(Word), Dict)).

    (untested)

  19. or

    gen_candidates(Word, Dict) ->
    Lookup =
    fun(Edit, “”) -> known(Edit(Word), Dict);
    (Edit, Found) -> Found
    end,
    case lists:foldl(Lookup, “”, [fun edits1/1, fun edits2/1, fun edits3/1]) of
    “” -> Word;
    Found -> Found
    end.

    there may be a list library function that does something like this?

  20. Mike, cool thanks! I didn’t knew that with binaries you can index by position!
    I’ll try that soon!
    cheers

  21. Pingback: How to Write a Spelling Corrector

  22. Pingback: Как написать проверку орфографии. Питер Норвиг

  23. Pingback: Spell corrector in python in 20 righe at GoTo 10 - Un blog italiano sulla programmazione

  24. Pingback: [ANN][Erlang] Òóòîðèàë - POP3-to-RSS gateway - Äåêëàðàòèâíîå ïðîãðàììèðîâàíèå - RSDN

  25. Pingback: 怎样写一个拼写检查器 by Peter Norvig

  26. Pingback: pligg.com

  27. Pingback: Marcelo Toledo - Blog » Blog Archive » How to Write a Spelling Corrector

  28. Pingback: ??????????-???-python | StormBlog

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>