May
14
Posted on 14-05-2007
Filed Under (Erlang, Programming) by Federico Feroldi on 14-05-2007

After seeing the Peter Norvig’s implementation of a Spell corrector in 20 lines of python code, I would like to add an Erlang implementation to the perl6 one made by Gabriele.

Unfortunately I’m blocked by an issue I’m still trying to solve. (update: thanks to your suggestions I’ve solved the “folding” issue)

Here’s what I did until now:

(Note: I followed Peter Norvig’s naming for functions as much as possible.)

The words/1 function takes a binary representation of the reference text file and build a list of words:

words(Bin) ->
    {ok, Words} = regexp:split(binary_to_list(Bin), "[^a-zA-Z]"),
    lists:map(fun(X) -> string:to_lower(X) end, Words).

The train/1 function takes a list of words and creates a Set of lower case words with associated frequency:

train(Words) ->
    Dict = ets:new(dictionary, [set]),
    lists:foreach(fun(X) ->
        case ets:insert_new(Dict, {list_to_binary(X), 1}) of
            false -> ets:update_counter(Dict, list_to_binary(X), 1);
            true -> true
        end
    end, Words),
    Dict.

And finally the build_dict/1 function takes a filename and returns the associated word Set:

build_dict(Filename) ->
    {ok, Bin} = file:read_file(Filename),
    train(words(Bin)).

Then there’s the function that generates all the possible “mistakes” from a word. This function is actually made of four different functions that generates different kind of errors:

alphabet() ->
    "abcdefghijklmnopqrstuvwxyz".

deletion_edits(Word) ->
    deletion_edits([], Word, []).
deletion_edits(_, [], Edits) ->
    Edits;
deletion_edits(Before, [Current | After], Edits) ->
	deletion_edits([Current | Before], After,
		[lists:reverse(Before) ++ After | Edits]).

transposition_edits(Word) ->
    transposition_edits([], Word, []).
transposition_edits(_, [], Edits) ->
    Edits;
transposition_edits(_, [_], Edits) ->
    Edits;
transposition_edits(Before, [Current, Next | After], Edits) ->
	transposition_edits([Current | Before], [Next | After],
		[ lists:reverse(Before) ++ [Next, Current] ++ After | Edits]).

alteration_edits(Word) ->
    alteration_edits([], Word, []).
alteration_edits(_, [], Edits) ->
    Edits;
alteration_edits(Before, [Current | After], Edits) ->
	BeforeR = lists:reverse(Before),
	alteration_edits([Current | Before], After,
		[BeforeR ++ [X] ++ After || X <- alphabet()] ++ Edits).

insertion_edits(Word) ->
    insertion_edits([], Word, [[X] ++ Word || X <- alphabet()]).
insertion_edits(_, [], Edits) ->
    Edits;
insertion_edits(Before, [Current | After], Edits) ->
	BeforeR = lists:reverse(Before),
	insertion_edits([Current | Before], After,
		[BeforeR ++ [Current, X] ++ After || X <- alphabet()] ++
		Edits).

We have edits1/1 that generates the 1-level error words and edits1/2 that generates the 2-level (errors of errors) words:

edits1(Word) ->
    lists:usort(deletion_edits(Word) ++ transposition_edits(Word) ++ alteration_edits(Word) ++ insertion_edits(Word)).
edits2(Word) ->
	lists:usort(lists:foldr(fun(A, AccIn) ->
		AccIn ++ edits1(A)
	end, [], edits1(Word))).

The problem is on the edits2/1 function, it generates a list made of sublists but I want it to generate a single flat list instead. I’ve tryed lists:flatten/1 but it flattens strings too, I need time to find the correct function…

Here’s the known/2 function that filters a list of words by returning only the words present in the dictionary:

known(Words, Dict) ->
	lists:filter(fun(Word) ->
		ets:member(Dict, list_to_binary(Word))
	end, Words).

And then there’s the gen_candidates/2 function that generates the possible correct words from the mispelled one:

gen_candidates(Word, Dict) ->
	C1 = known([Word], Dict),
	case (length(C1) > 0) of
		true ->
			[Word];
		false ->
			C2 = known(edits1(Word), Dict),
			case (length(C2) > 0) of
				true ->
					C2;
				false ->
					C3 = known(edits2(Word), Dict),
					case (length(C3) > 0) of
						true -> C3;
						false -> [Word]
					end
			end
	end.

And finally the correct/2 function that takes a mispelled word and returns the most probable correct one:

correct(Word, Dict) ->
	lists:foldl(fun(W, {CN, CW}) ->
		case ets:lookup(Dict, list_to_binary(W)) of
			[{LW, LN}] ->
				case LN > CN of
					true -> {LN, LW};
					false -> {CN, CW}
				end;
			_ ->
				{CN, CW}
		end
	end, {0, Word}, gen_candidates(Word, Dict)).

At this point I started trying some words on the shell:

5> Dict = spellcheck:build_dict("big.txt").
14
6> spellcheck:known(["hello", "helo"], Dict).
["hello"]
7> spellcheck:known(["hello", "helo", "true"], Dict).
["hello","true"]
8> spellcheck:gen_candidates("hello", Dict).
["hello"]
9> spellcheck:gen_candidates("helo", Dict).
["felo","halo","held","hell","hello","helm","help","hero"]
10> spellcheck:correct("helo", Dict).
{287,<<"held">>}
11> spellcheck:correct(”correctz”, Dict).
{38,<<"correct">>}
12> spellcheck:correct(”correc”, Dict).
{38,<<"correct">>}
13> spellcheck:correct(”gues”, Dict).
{112,<<"guns">>}
14> spellcheck:correct(”guess”, Dict).
{17,<<"guess">>}
15> spellcheck:correct(”gues”, Dict).
{112,<<"guns">>}
16> spellcheck:correct(”gue”, Dict).
{249,<<"due">>}

It seems to work! :)

In the next iteration I’ll try to optimize and refactor the code, in special way for the edits generators that are too long for my tastes… Suggestions are welcome! :)

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • del.icio.us
  • DZone
  • Reddit
  • Technorati
  • YahooMyWeb
    Read More   

Comments

Bob Ippolito on 14 May, 2007 at 1:34 am #

Try lists:append/1.


Jordan Wilberding on 14 May, 2007 at 3:07 am #

Not sure this is what you need, but maybe this is what you need to flatten?

> lists:flatmap(fun(A) -> A end, [[”ABC”],[”DEF”],[”GHI”]]).
[”ABC”,”DEF”,”GHI”]

And, if you ran it again..

> lists:flatmap(fun(A) -> A end, [”ABC”,”DEF”,”GHI”]).
“ABCDEFGHI”


riffraff on 14 May, 2007 at 8:55 am #

Ah, great post :)
It seems that python’s list comprehension have a little advantage over erlangs by allowing iteration over multiplethings at the same time and allowing you to avoid explicit flattening. Unexpected :)

Oh, and maybe in words/1 you could use the fun string:to_lower/1 syntax instead of defining an anonymous function to call that?


mark bradley on 14 May, 2007 at 9:28 am #

flatmap(Function, List1) -> Element

Types:
Function = fun(A) -> B
List1 = [A]
Element = [B]

flatmap behaves as if it had been defined as follows:

flatmap(Func, List) ->
append(map(Func, List))

this function might be useful for implementing edits2/1

e.g.

edits2(Word) -> flatmap(edits1, edits1(Word))


Steve Jenson on 14 May, 2007 at 9:32 am #

Use the fold, Luke!

106> Fun = fun(A, AccIn) -> AccIn ++ A end.
#Fun
107> lists:foldr(Fun, [], [[”aa”, “bb”], [”cc”]]).

I’m porting my Haskell port of Norvig’s spellchecker to Erlang. I’ll throw something in SVN in a few days and we can compare notes!


massemanet on 14 May, 2007 at 10:53 am #

i believe you want the lists:append/1 function.


Federico Feroldi on 14 May, 2007 at 11:18 am #

Wow, thank’s for all these useful comments!

Riffraff: I did some timings and I found a weird results, if I used and indirect call to string:to_lower/1 with an anonymous fun I gained about 30% in speed!

Jordan, Mark, Steve, massemanet: thanks for your suggestions, I’ll try with that this afternoon! :)

cheers


[…] Spell corrector (aka Google suggest) in Erlang (first part) (pixzone.com) […]


[…] A version in Erlang! […]


Planet Erlang on 14 May, 2007 at 7:46 pm #

Spell corrector (aka Google suggest) in Erlang (first iteration)


Mike McNally on 14 May, 2007 at 9:26 pm #

The “alphabet” function is just “lists:seq($a, $z)”.

It might be nice to implement a “position fold” function that would iteratively pass into a parameter function a binary, a position (integer), and an accumulator. Then you could write your “edits” functions in terms of simple calls to that.


Steve Jenson on 14 May, 2007 at 9:36 pm #

I don’t think alphabet needs to be a function at all. You can use an Erlang macro.


Federico Feroldi on 14 May, 2007 at 10:52 pm #

Mike, I didn’t wanted to use a “position counter” because in that case you have to go through all the Nth items each time to split the string.

The best optimization I can think of is to just create a single edits generator that gets the two parts of the string and applies the different algorithms.

[h,e|llo] -> “hllo” (deletion), “heallo” … “hezllo” (insertion), “hallo” … “hzllo” (alteration), “hlelo” (transposition)

But I suspect that you’ll just gain some function calls overheads.
It would be nice to use lists comprehension to like Peter did in python.


links for 2007-05-14 -- A Tempest of Thoughts on 15 May, 2007 at 1:30 am #

[…] Spell corrector (aka Google suggest) in Erlang (first iteration) | the pix zone My good friend’s Federico take on spell corrector, this time in Erlang. (tags: erlang programming spelling) […]


mark bradley on 15 May, 2007 at 1:34 am #

you can use list comprehensions like peter did,

you would just need to create a slice function

or use the sub_string function in the string module


mark bradley on 15 May, 2007 at 4:28 am #

i rewrote edits1 to use list comprehensions:

edits1(Word) ->
N = length(Word),
SEQ = if N []; true -> lists:seq(1,N) end,
SEQM1 = if N []; true -> lists:seq(1,N-1) end,
SEQP1 = lists:seq(1,N+1),
ALPHA = alphabet(),
lists:usort([lists:concat([string:sub_string(Word,1,I-1),string:sub_string(Word,I+1)]) || I


mark bradley on 15 May, 2007 at 4:32 am #

check out my code here: http://www.cse.unsw.edu.au/~markb/spell.erl

sorry about the formatting of the previous post…didnt know there was a problem with it


Steve Jenson on 15 May, 2007 at 7:32 am #

You can also use lists:sublist:

delete(Word) ->
LW = length(Word),
[sublist(Word, I) ++ sublist(Word, I+2, LW) || I <- seq(0, LW-1)].


PDI^2 on 15 May, 2007 at 9:08 am #

restituisce semplicemente un grosso array con tutte le variazioni. Paragonando questo codice a quello di norvig noterete che lui usa i Set in edits1 mentre noi usiamo semplici liste, ma va bene perché non ci sono duplicati tra le variazioni. Federico credo faccia la stessa cosa in erlang [IMG :)] Ora serve una funzione known(@array) che selezioni gli elementi della lista che sono parole valide. Ricordate che abbiamo una variabile %WORDS{word}=>number che abbiamo inizializzato con le parole del vocabolario e la loro


kobi on 15 May, 2007 at 12:11 pm #

I think most spelling mistakes are related to the keymap.
that is, qwerty.
helo to held is less likely than helo to help.
(p is near o, and d is far from o)
maybe u can use this method to determine which candidate (from the various suggestion techniques) is more likely.

gue is probably glue and not due
gues is probably guess and not guns

(letter miss more likely than far keyboard character)
but all this in assumption that the keyboard is qwerty


Mike McNally on 15 May, 2007 at 1:56 pm #

*** Mike, I didn’t wanted to use a “position counter” because in that case you have to go through all the Nth items each time to split the string.

You didn’t read exactly what I wrote: if you use binary data instead of lists, you can “index” into your strings with an integer. It’s also cheaper than all those “++” expressions, which really should be avoided.

The formatting is going to be all messed up, but here’s a “position fold” using binaries:

pos_foldl(Fun, Initial, List) -> pos_foldl(Fun, Initial, 0, list_to_binary(List)).
pos_foldl(Fun, Accum, Count, Binary) ->
case Binary of
> -> Fun(Before, >, Accum);
> ->
pos_foldl(Fun, Fun(Before, After, Accum), Count + 1, Binary)
end.

And here’s the “deletion_edit” written in terms of that:

deletion_edit(Word) ->
lists:reverse(pos_foldl(fun
(_, >, Accum) -> Accum;
(Before, >, Accum) -> [> | Accum]
end, [], Word)).


Mike McNally on 15 May, 2007 at 2:17 pm #

At the URL:

http://gutfullofbeer.net/word.html

is a more complete example of how such “string” manipulations can be done more efficiently (I think) with binaries instead of lists.

Those “edit” examples make lists of binaries.


andrew cooke on 15 May, 2007 at 10:14 pm #

gen_candidates(Word, Dict) ->
lookup(Word, Dict, [fun edits1/1, fun edits2/1, fun edits3/1], “”).

lookup(_Word, _Dict, _Edits, Found) when length(Found) > 0 -> Found;
lookup(_Word, _Dict, [], “”) -> Word;
lookup(Word, Dict, [Edit|Edits], “”) ->
lookup(Word, Dict, Edits, known(Edits(Word), Dict)).

(untested)


andrew cooke on 15 May, 2007 at 10:20 pm #

or

gen_candidates(Word, Dict) ->
Lookup =
fun(Edit, “”) -> known(Edit(Word), Dict);
(Edit, Found) -> Found
end,
case lists:foldl(Lookup, “”, [fun edits1/1, fun edits2/1, fun edits3/1]) of
“” -> Word;
Found -> Found
end.

there may be a list library function that does something like this?


Federico Feroldi on 15 May, 2007 at 10:29 pm #

Mike, cool thanks! I didn’t knew that with binaries you can index by position!
I’ll try that soon!
cheers


How to Write a Spelling Corrector on 16 May, 2007 at 5:24 am #

[…] Erlang: by Federico Feroldi […]


[…] Erlang: by Federico Feroldi […]


[…] a questa versione in Scheme, che però mi sembra parecchio meno comprensibile. UPDATE2: Anche una versione in erlang ed una in perl6. Daiche ci sono un sacco di altri linguaggi […]


[ANN][Erlang] - POP3-to-RSS gateway - - RSDN on 21 May, 2007 at 3:07 pm #

[…] 08:27 Erlang blogging contest Spell corrector (aka Google suggest) in Erlang. , , […]


怎样写一个拼写检查器 by Peter Norvig on 10 June, 2007 at 9:54 am #

[…] Erlang: by Federico Feroldi […]


pligg.com on 26 June, 2007 at 11:27 am #

Spell corrector (aka Google suggest) in Erlang…

A spell checker in erlang inspired by Peter Norvig’s python version.

This shows how to do the spellchecker in erlang. For understanding how the and why the spellchecker actually works, check the Norvig’s original article….


[…] Federico Feroldi […]


Alex Les on 23 January, 2008 at 11:01 pm #

Best site to buy Tramadol


Diesel on 15 March, 2008 at 11:32 am #

Kir on 15 March, 2008 at 3:49 pm #

Palvoqsc on 16 May, 2008 at 12:53 pm #

!, , 527,
, ajhjat, , =],


Post a Comment
Name:
Email:
Website:
Comments: