After seeing the Peter Norvig‘s implementation of a Spell corrector in 20 lines of python code, I would like to add an Erlang implementation to the perl6 one made by Gabriele.
Unfortunately I’m blocked by an issue I’m still trying to solve. (update: thanks to your suggestions I’ve solved the “folding” issue)
Here’s what I did until now:
(Note: I followed Peter Norvig’s naming for functions as much as possible.)
The words/1 function takes a binary representation of the reference text file and build a list of words:
words(Bin) ->
{ok, Words} = regexp:split(binary_to_list(Bin), "[^a-zA-Z]"),
lists:map(fun(X) -> string:to_lower(X) end, Words).
The train/1 function takes a list of words and creates a Set of lower case words with associated frequency:
train(Words) ->
Dict = ets:new(dictionary, [set]),
lists:foreach(fun(X) ->
case ets:insert_new(Dict, {list_to_binary(X), 1}) of
false -> ets:update_counter(Dict, list_to_binary(X), 1);
true -> true
end
end, Words),
Dict.
And finally the build_dict/1 function takes a filename and returns the associated word Set:
build_dict(Filename) ->
{ok, Bin} = file:read_file(Filename),
train(words(Bin)).
Then there’s the function that generates all the possible “mistakes” from a word. This function is actually made of four different functions that generates different kind of errors:
alphabet() ->
"abcdefghijklmnopqrstuvwxyz".
deletion_edits(Word) ->
deletion_edits([], Word, []).
deletion_edits(_, [], Edits) ->
Edits;
deletion_edits(Before, [Current | After], Edits) ->
deletion_edits([Current | Before], After,
[lists:reverse(Before) ++ After | Edits]).
transposition_edits(Word) ->
transposition_edits([], Word, []).
transposition_edits(_, [], Edits) ->
Edits;
transposition_edits(_, [_], Edits) ->
Edits;
transposition_edits(Before, [Current, Next | After], Edits) ->
transposition_edits([Current | Before], [Next | After],
[ lists:reverse(Before) ++ [Next, Current] ++ After | Edits]).
alteration_edits(Word) ->
alteration_edits([], Word, []).
alteration_edits(_, [], Edits) ->
Edits;
alteration_edits(Before, [Current | After], Edits) ->
BeforeR = lists:reverse(Before),
alteration_edits([Current | Before], After,
[BeforeR ++ [X] ++ After || X <- alphabet()] ++ Edits).
insertion_edits(Word) ->
insertion_edits([], Word, [[X] ++ Word || X <- alphabet()]).
insertion_edits(_, [], Edits) ->
Edits;
insertion_edits(Before, [Current | After], Edits) ->
BeforeR = lists:reverse(Before),
insertion_edits([Current | Before], After,
[BeforeR ++ [Current, X] ++ After || X <- alphabet()] ++
Edits).
We have edits1/1 that generates the 1-level error words and edits1/2 that generates the 2-level (errors of errors) words:
edits1(Word) ->
lists:usort(deletion_edits(Word) ++ transposition_edits(Word) ++ alteration_edits(Word) ++ insertion_edits(Word)).
edits2(Word) -> lists:usort(lists:foldr(fun(A, AccIn) -> AccIn ++ edits1(A) end, [], edits1(Word))).
The problem is on the edits2/1 function, it generates a list made of sublists but I want it to generate a single flat list instead. I've tryed lists:flatten/1 but it flattens strings too, I need time to find the correct function...
Here's the known/2 function that filters a list of words by returning only the words present in the dictionary:
known(Words, Dict) -> lists:filter(fun(Word) -> ets:member(Dict, list_to_binary(Word)) end, Words).
And then there's the gen_candidates/2 function that generates the possible correct words from the mispelled one:
gen_candidates(Word, Dict) -> C1 = known([Word], Dict), case (length(C1) > 0) of true -> [Word]; false -> C2 = known(edits1(Word), Dict), case (length(C2) > 0) of true -> C2; false -> C3 = known(edits2(Word), Dict), case (length(C3) > 0) of true -> C3; false -> [Word] end end end.
And finally the correct/2 function that takes a mispelled word and returns the most probable correct one:
correct(Word, Dict) ->
lists:foldl(fun(W, {CN, CW}) ->
case ets:lookup(Dict, list_to_binary(W)) of
[{LW, LN}] ->
case LN > CN of
true -> {LN, LW};
false -> {CN, CW}
end;
_ ->
{CN, CW}
end
end, {0, Word}, gen_candidates(Word, Dict)).
At this point I started trying some words on the shell:
5> Dict = spellcheck:build_dict("big.txt").
14
6> spellcheck:known(["hello", "helo"], Dict).
["hello"]
7> spellcheck:known(["hello", "helo", "true"], Dict).
["hello","true"]
8> spellcheck:gen_candidates("hello", Dict).
["hello"]
9> spellcheck:gen_candidates("helo", Dict).
["felo","halo","held","hell","hello","helm","help","hero"]
10> spellcheck:correct("helo", Dict).
{287,<<"held">>}
11> spellcheck:correct("correctz", Dict).
{38,<<"correct">>}
12> spellcheck:correct("correc", Dict).
{38,<<"correct">>}
13> spellcheck:correct("gues", Dict).
{112,<<"guns">>}
14> spellcheck:correct("guess", Dict).
{17,<<"guess">>}
15> spellcheck:correct("gues", Dict).
{112,<<"guns">>}
16> spellcheck:correct("gue", Dict).
{249,<<"due">>}
It seems to work!
In the next iteration I'll try to optimize and refactor the code, in special way for the edits generators that are too long for my tastes... Suggestions are welcome!