After seeing the Peter Norvig’s implementation of a Spell corrector in 20 lines of python code, I would like to add an Erlang implementation to the perl6 one made by Gabriele.
Unfortunately I’m blocked by an issue I’m still trying to solve. (update: thanks to your suggestions I’ve solved the “folding” issue)
Here’s what I did until now:
(Note: I followed Peter Norvig’s naming for functions as much as possible.)
The words/1 function takes a binary representation of the reference text file and build a list of words:
words(Bin) ->
{ok, Words} = regexp:split(binary_to_list(Bin), "[^a-zA-Z]"),
lists:map(fun(X) -> string:to_lower(X) end, Words).
The train/1 function takes a list of words and creates a Set of lower case words with associated frequency:
train(Words) ->
Dict = ets:new(dictionary, [set]),
lists:foreach(fun(X) ->
case ets:insert_new(Dict, {list_to_binary(X), 1}) of
false -> ets:update_counter(Dict, list_to_binary(X), 1);
true -> true
end
end, Words),
Dict.
And finally the build_dict/1 function takes a filename and returns the associated word Set:
build_dict(Filename) ->
{ok, Bin} = file:read_file(Filename),
train(words(Bin)).
Then there’s the function that generates all the possible “mistakes” from a word. This function is actually made of four different functions that generates different kind of errors:
alphabet() ->
"abcdefghijklmnopqrstuvwxyz".
deletion_edits(Word) ->
deletion_edits([], Word, []).
deletion_edits(_, [], Edits) ->
Edits;
deletion_edits(Before, [Current | After], Edits) ->
deletion_edits([Current | Before], After,
[lists:reverse(Before) ++ After | Edits]).
transposition_edits(Word) ->
transposition_edits([], Word, []).
transposition_edits(_, [], Edits) ->
Edits;
transposition_edits(_, [_], Edits) ->
Edits;
transposition_edits(Before, [Current, Next | After], Edits) ->
transposition_edits([Current | Before], [Next | After],
[ lists:reverse(Before) ++ [Next, Current] ++ After | Edits]).
alteration_edits(Word) ->
alteration_edits([], Word, []).
alteration_edits(_, [], Edits) ->
Edits;
alteration_edits(Before, [Current | After], Edits) ->
BeforeR = lists:reverse(Before),
alteration_edits([Current | Before], After,
[BeforeR ++ [X] ++ After || X <- alphabet()] ++ Edits).
insertion_edits(Word) ->
insertion_edits([], Word, [[X] ++ Word || X <- alphabet()]).
insertion_edits(_, [], Edits) ->
Edits;
insertion_edits(Before, [Current | After], Edits) ->
BeforeR = lists:reverse(Before),
insertion_edits([Current | Before], After,
[BeforeR ++ [Current, X] ++ After || X <- alphabet()] ++
Edits).
We have edits1/1 that generates the 1-level error words and edits1/2 that generates the 2-level (errors of errors) words:
edits1(Word) ->
lists:usort(deletion_edits(Word) ++ transposition_edits(Word) ++ alteration_edits(Word) ++ insertion_edits(Word)).
edits2(Word) -> lists:usort(lists:foldr(fun(A, AccIn) -> AccIn ++ edits1(A) end, [], edits1(Word))).
The problem is on the edits2/1 function, it generates a list made of sublists but I want it to generate a single flat list instead. I’ve tryed lists:flatten/1 but it flattens strings too, I need time to find the correct function…
Here’s the known/2 function that filters a list of words by returning only the words present in the dictionary:
known(Words, Dict) -> lists:filter(fun(Word) -> ets:member(Dict, list_to_binary(Word)) end, Words).
And then there’s the gen_candidates/2 function that generates the possible correct words from the mispelled one:
gen_candidates(Word, Dict) -> C1 = known([Word], Dict), case (length(C1) > 0) of true -> [Word]; false -> C2 = known(edits1(Word), Dict), case (length(C2) > 0) of true -> C2; false -> C3 = known(edits2(Word), Dict), case (length(C3) > 0) of true -> C3; false -> [Word] end end end.
And finally the correct/2 function that takes a mispelled word and returns the most probable correct one:
correct(Word, Dict) ->
lists:foldl(fun(W, {CN, CW}) ->
case ets:lookup(Dict, list_to_binary(W)) of
[{LW, LN}] ->
case LN > CN of
true -> {LN, LW};
false -> {CN, CW}
end;
_ ->
{CN, CW}
end
end, {0, Word}, gen_candidates(Word, Dict)).
At this point I started trying some words on the shell:
5> Dict = spellcheck:build_dict("big.txt").
14
6> spellcheck:known(["hello", "helo"], Dict).
["hello"]
7> spellcheck:known(["hello", "helo", "true"], Dict).
["hello","true"]
8> spellcheck:gen_candidates("hello", Dict).
["hello"]
9> spellcheck:gen_candidates("helo", Dict).
["felo","halo","held","hell","hello","helm","help","hero"]
10> spellcheck:correct("helo", Dict).
{287,<<"held">>}
11> spellcheck:correct(”correctz”, Dict).
{38,<<"correct">>}
12> spellcheck:correct(”correc”, Dict).
{38,<<"correct">>}
13> spellcheck:correct(”gues”, Dict).
{112,<<"guns">>}
14> spellcheck:correct(”guess”, Dict).
{17,<<"guess">>}
15> spellcheck:correct(”gues”, Dict).
{112,<<"guns">>}
16> spellcheck:correct(”gue”, Dict).
{249,<<"due">>}
It seems to work!
In the next iteration I’ll try to optimize and refactor the code, in special way for the edits generators that are too long for my tastes… Suggestions are welcome! ![]()
Try lists:append/1.
Not sure this is what you need, but maybe this is what you need to flatten?
> lists:flatmap(fun(A) -> A end, [[”ABC”],[”DEF”],[”GHI”]]).
[”ABC”,”DEF”,”GHI”]
And, if you ran it again..
> lists:flatmap(fun(A) -> A end, [”ABC”,”DEF”,”GHI”]).
“ABCDEFGHI”
Ah, great post ![]()
It seems that python’s list comprehension have a little advantage over erlangs by allowing iteration over multiplethings at the same time and allowing you to avoid explicit flattening. Unexpected
Oh, and maybe in words/1 you could use the fun string:to_lower/1 syntax instead of defining an anonymous function to call that?
flatmap(Function, List1) -> Element
Types:
Function = fun(A) -> B
List1 = [A]
Element = [B]
flatmap behaves as if it had been defined as follows:
flatmap(Func, List) ->
append(map(Func, List))
this function might be useful for implementing edits2/1
e.g.
edits2(Word) -> flatmap(edits1, edits1(Word))
Use the fold, Luke!
106> Fun = fun(A, AccIn) -> AccIn ++ A end.
#Fun
107> lists:foldr(Fun, [], [[”aa”, “bb”], [”cc”]]).
I’m porting my Haskell port of Norvig’s spellchecker to Erlang. I’ll throw something in SVN in a few days and we can compare notes!
i believe you want the lists:append/1 function.
Wow, thank’s for all these useful comments!
Riffraff: I did some timings and I found a weird results, if I used and indirect call to string:to_lower/1 with an anonymous fun I gained about 30% in speed!
Jordan, Mark, Steve, massemanet: thanks for your suggestions, I’ll try with that this afternoon!
cheers
[…] Spell corrector (aka Google suggest) in Erlang (first part) (pixzone.com) […]
[…] A version in Erlang! […]
Spell corrector (aka Google suggest) in Erlang (first iteration)
The “alphabet” function is just “lists:seq($a, $z)”.
It might be nice to implement a “position fold” function that would iteratively pass into a parameter function a binary, a position (integer), and an accumulator. Then you could write your “edits” functions in terms of simple calls to that.
I don’t think alphabet needs to be a function at all. You can use an Erlang macro.
Mike, I didn’t wanted to use a “position counter” because in that case you have to go through all the Nth items each time to split the string.
The best optimization I can think of is to just create a single edits generator that gets the two parts of the string and applies the different algorithms.
[h,e|llo] -> “hllo” (deletion), “heallo” … “hezllo” (insertion), “hallo” … “hzllo” (alteration), “hlelo” (transposition)
But I suspect that you’ll just gain some function calls overheads.
It would be nice to use lists comprehension to like Peter did in python.
[…] Spell corrector (aka Google suggest) in Erlang (first iteration) | the pix zone My good friend’s Federico take on spell corrector, this time in Erlang. (tags: erlang programming spelling) […]
you can use list comprehensions like peter did,
you would just need to create a slice function
or use the sub_string function in the string module
i rewrote edits1 to use list comprehensions:
edits1(Word) ->
N = length(Word),
SEQ = if N []; true -> lists:seq(1,N) end,
SEQM1 = if N []; true -> lists:seq(1,N-1) end,
SEQP1 = lists:seq(1,N+1),
ALPHA = alphabet(),
lists:usort([lists:concat([string:sub_string(Word,1,I-1),string:sub_string(Word,I+1)]) || I
check out my code here: http://www.cse.unsw.edu.au/~markb/spell.erl
sorry about the formatting of the previous post…didnt know there was a problem with it
You can also use lists:sublist:
delete(Word) ->
LW = length(Word),
[sublist(Word, I) ++ sublist(Word, I+2, LW) || I <- seq(0, LW-1)].
restituisce semplicemente un grosso array con tutte le variazioni. Paragonando questo codice a quello di norvig noterete che lui usa i Set in edits1 mentre noi usiamo semplici liste, ma va bene perché non ci sono duplicati tra le variazioni. Federico credo faccia la stessa cosa in erlang [IMG :)] Ora serve una funzione known(@array) che selezioni gli elementi della lista che sono parole valide. Ricordate che abbiamo una variabile %WORDS{word}=>number che abbiamo inizializzato con le parole del vocabolario e la loro
I think most spelling mistakes are related to the keymap.
that is, qwerty.
helo to held is less likely than helo to help.
(p is near o, and d is far from o)
maybe u can use this method to determine which candidate (from the various suggestion techniques) is more likely.
gue is probably glue and not due
gues is probably guess and not guns
(letter miss more likely than far keyboard character)
but all this in assumption that the keyboard is qwerty
*** Mike, I didn’t wanted to use a “position counter” because in that case you have to go through all the Nth items each time to split the string.
You didn’t read exactly what I wrote: if you use binary data instead of lists, you can “index” into your strings with an integer. It’s also cheaper than all those “++” expressions, which really should be avoided.
The formatting is going to be all messed up, but here’s a “position fold” using binaries:
pos_foldl(Fun, Initial, List) -> pos_foldl(Fun, Initial, 0, list_to_binary(List)).
pos_foldl(Fun, Accum, Count, Binary) ->
case Binary of
> -> Fun(Before, >, Accum);
> ->
pos_foldl(Fun, Fun(Before, After, Accum), Count + 1, Binary)
end.
And here’s the “deletion_edit” written in terms of that:
deletion_edit(Word) ->
lists:reverse(pos_foldl(fun
(_, >, Accum) -> Accum;
(Before, >, Accum) -> [> | Accum]
end, [], Word)).
At the URL:
http://gutfullofbeer.net/word.html
is a more complete example of how such “string” manipulations can be done more efficiently (I think) with binaries instead of lists.
Those “edit” examples make lists of binaries.
gen_candidates(Word, Dict) ->
lookup(Word, Dict, [fun edits1/1, fun edits2/1, fun edits3/1], “”).
lookup(_Word, _Dict, _Edits, Found) when length(Found) > 0 -> Found;
lookup(_Word, _Dict, [], “”) -> Word;
lookup(Word, Dict, [Edit|Edits], “”) ->
lookup(Word, Dict, Edits, known(Edits(Word), Dict)).
(untested)
or
gen_candidates(Word, Dict) ->
Lookup =
fun(Edit, “”) -> known(Edit(Word), Dict);
(Edit, Found) -> Found
end,
case lists:foldl(Lookup, “”, [fun edits1/1, fun edits2/1, fun edits3/1]) of
“” -> Word;
Found -> Found
end.
there may be a list library function that does something like this?
Mike, cool thanks! I didn’t knew that with binaries you can index by position!
I’ll try that soon!
cheers
[…] Erlang: by Federico Feroldi […]
[…] Erlang: by Federico Feroldi […]
[…] a questa versione in Scheme, che però mi sembra parecchio meno comprensibile. UPDATE2: Anche una versione in erlang ed una in perl6. Daiche ci sono un sacco di altri linguaggi […]
[…] 08:27 Erlang blogging contest Spell corrector (aka Google suggest) in Erlang. , , […]
[…] Erlang: by Federico Feroldi […]
Spell corrector (aka Google suggest) in Erlang…
A spell checker in erlang inspired by Peter Norvig’s python version.
This shows how to do the spellchecker in erlang. For understanding how the and why the spellchecker actually works, check the Norvig’s original article….
[…] Federico Feroldi […]
Best site to buy Tramadol