May
14
Posted on 14-05-2007
Filed Under (Erlang, Programming) by Federico Feroldi on 14-05-2007

After seeing the Peter Norvig‘s implementation of a Spell corrector in 20 lines of python code, I would like to add an Erlang implementation to the perl6 one made by Gabriele.

Unfortunately I’m blocked by an issue I’m still trying to solve. (update: thanks to your suggestions I’ve solved the “folding” issue)

Here’s what I did until now:

(Note: I followed Peter Norvig’s naming for functions as much as possible.)

The words/1 function takes a binary representation of the reference text file and build a list of words:

words(Bin) ->
    {ok, Words} = regexp:split(binary_to_list(Bin), "[^a-zA-Z]"),
    lists:map(fun(X) -> string:to_lower(X) end, Words).

The train/1 function takes a list of words and creates a Set of lower case words with associated frequency:

train(Words) ->
    Dict = ets:new(dictionary, [set]),
    lists:foreach(fun(X) ->
        case ets:insert_new(Dict, {list_to_binary(X), 1}) of
            false -> ets:update_counter(Dict, list_to_binary(X), 1);
            true -> true
        end
    end, Words),
    Dict.

And finally the build_dict/1 function takes a filename and returns the associated word Set:

build_dict(Filename) ->
    {ok, Bin} = file:read_file(Filename),
    train(words(Bin)).

Then there’s the function that generates all the possible “mistakes” from a word. This function is actually made of four different functions that generates different kind of errors:

alphabet() ->
    "abcdefghijklmnopqrstuvwxyz".

deletion_edits(Word) ->
    deletion_edits([], Word, []).
deletion_edits(_, [], Edits) ->
    Edits;
deletion_edits(Before, [Current | After], Edits) ->
	deletion_edits([Current | Before], After,
		[lists:reverse(Before) ++ After | Edits]).

transposition_edits(Word) ->
    transposition_edits([], Word, []).
transposition_edits(_, [], Edits) ->
    Edits;
transposition_edits(_, [_], Edits) ->
    Edits;
transposition_edits(Before, [Current, Next | After], Edits) ->
	transposition_edits([Current | Before], [Next | After],
		[ lists:reverse(Before) ++ [Next, Current] ++ After | Edits]).

alteration_edits(Word) ->
    alteration_edits([], Word, []).
alteration_edits(_, [], Edits) ->
    Edits;
alteration_edits(Before, [Current | After], Edits) ->
	BeforeR = lists:reverse(Before),
	alteration_edits([Current | Before], After,
		[BeforeR ++ [X] ++ After || X <- alphabet()] ++ Edits).

insertion_edits(Word) ->
    insertion_edits([], Word, [[X] ++ Word || X <- alphabet()]).
insertion_edits(_, [], Edits) ->
    Edits;
insertion_edits(Before, [Current | After], Edits) ->
	BeforeR = lists:reverse(Before),
	insertion_edits([Current | Before], After,
		[BeforeR ++ [Current, X] ++ After || X <- alphabet()] ++
		Edits).

We have edits1/1 that generates the 1-level error words and edits1/2 that generates the 2-level (errors of errors) words:

edits1(Word) ->
    lists:usort(deletion_edits(Word) ++ transposition_edits(Word) ++ alteration_edits(Word) ++ insertion_edits(Word)).
edits2(Word) ->
	lists:usort(lists:foldr(fun(A, AccIn) ->
		AccIn ++ edits1(A)
	end, [], edits1(Word))).

The problem is on the edits2/1 function, it generates a list made of sublists but I want it to generate a single flat list instead. I've tryed lists:flatten/1 but it flattens strings too, I need time to find the correct function...

Here's the known/2 function that filters a list of words by returning only the words present in the dictionary:

known(Words, Dict) ->
	lists:filter(fun(Word) ->
		ets:member(Dict, list_to_binary(Word))
	end, Words).

And then there's the gen_candidates/2 function that generates the possible correct words from the mispelled one:

gen_candidates(Word, Dict) ->
	C1 = known([Word], Dict),
	case (length(C1) > 0) of
		true ->
			[Word];
		false ->
			C2 = known(edits1(Word), Dict),
			case (length(C2) > 0) of
				true ->
					C2;
				false ->
					C3 = known(edits2(Word), Dict),
					case (length(C3) > 0) of
						true -> C3;
						false -> [Word]
					end
			end
	end.

And finally the correct/2 function that takes a mispelled word and returns the most probable correct one:

correct(Word, Dict) ->
	lists:foldl(fun(W, {CN, CW}) ->
		case ets:lookup(Dict, list_to_binary(W)) of
			[{LW, LN}] ->
				case LN > CN of
					true -> {LN, LW};
					false -> {CN, CW}
				end;
			_ ->
				{CN, CW}
		end
	end, {0, Word}, gen_candidates(Word, Dict)).

At this point I started trying some words on the shell:

5> Dict = spellcheck:build_dict("big.txt").
14
6> spellcheck:known(["hello", "helo"], Dict).
["hello"]
7> spellcheck:known(["hello", "helo", "true"], Dict).
["hello","true"]
8> spellcheck:gen_candidates("hello", Dict).
["hello"]
9> spellcheck:gen_candidates("helo", Dict).
["felo","halo","held","hell","hello","helm","help","hero"]
10> spellcheck:correct("helo", Dict).
{287,<<"held">>}
11> spellcheck:correct("correctz", Dict).
{38,<<"correct">>}
12> spellcheck:correct("correc", Dict).
{38,<<"correct">>}
13> spellcheck:correct("gues", Dict).
{112,<<"guns">>}
14> spellcheck:correct("guess", Dict).
{17,<<"guess">>}
15> spellcheck:correct("gues", Dict).
{112,<<"guns">>}
16> spellcheck:correct("gue", Dict).
{249,<<"due">>}

It seems to work! :)

In the next iteration I'll try to optimize and refactor the code, in special way for the edits generators that are too long for my tastes... Suggestions are welcome! :)

Share and Enjoy:
  • Digg
  • del.icio.us
  • DZone
  • Reddit
  • Technorati
  • YahooMyWeb
(0) Comments    Read More   
Post a Comment
Name:
Email:
Website:
Comments: