After seeing the Peter Norvig‘s implementation of a Spell corrector in 20 lines of python code, I would like to add an Erlang implementation to the perl6 one made by Gabriele.
Unfortunately I’m blocked by an issue I’m still trying to solve. (update: thanks to your suggestions I’ve solved the “folding” issue)
Here’s what I did until now:
(Note: I followed Peter Norvig’s naming for functions as much as possible.)
The words/1 function takes a binary representation of the reference text file and build a list of words:
words(Bin) ->
{ok, Words} = regexp:split(binary_to_list(Bin), "[^a-zA-Z]"),
lists:map(fun(X) -> string:to_lower(X) end, Words).
The train/1 function takes a list of words and creates a Set of lower case words with associated frequency:
train(Words) ->
Dict = ets:new(dictionary, [set]),
lists:foreach(fun(X) ->
case ets:insert_new(Dict, {list_to_binary(X), 1}) of
false -> ets:update_counter(Dict, list_to_binary(X), 1);
true -> true
end
end, Words),
Dict.
And finally the build_dict/1 function takes a filename and returns the associated word Set:
build_dict(Filename) ->
{ok, Bin} = file:read_file(Filename),
train(words(Bin)).
Then there’s the function that generates all the possible “mistakes” from a word. This function is actually made of four different functions that generates different kind of errors:
alphabet() ->
"abcdefghijklmnopqrstuvwxyz".
deletion_edits(Word) ->
deletion_edits([], Word, []).
deletion_edits(_, [], Edits) ->
Edits;
deletion_edits(Before, [Current | After], Edits) ->
deletion_edits([Current | Before], After,
[lists:reverse(Before) ++ After | Edits]).
transposition_edits(Word) ->
transposition_edits([], Word, []).
transposition_edits(_, [], Edits) ->
Edits;
transposition_edits(_, [_], Edits) ->
Edits;
transposition_edits(Before, [Current, Next | After], Edits) ->
transposition_edits([Current | Before], [Next | After],
[ lists:reverse(Before) ++ [Next, Current] ++ After | Edits]).
alteration_edits(Word) ->
alteration_edits([], Word, []).
alteration_edits(_, [], Edits) ->
Edits;
alteration_edits(Before, [Current | After], Edits) ->
BeforeR = lists:reverse(Before),
alteration_edits([Current | Before], After,
[BeforeR ++ [X] ++ After || X <- alphabet()] ++ Edits).
insertion_edits(Word) ->
insertion_edits([], Word, [[X] ++ Word || X <- alphabet()]).
insertion_edits(_, [], Edits) ->
Edits;
insertion_edits(Before, [Current | After], Edits) ->
BeforeR = lists:reverse(Before),
insertion_edits([Current | Before], After,
[BeforeR ++ [Current, X] ++ After || X <- alphabet()] ++
Edits).
We have edits1/1 that generates the 1-level error words and edits1/2 that generates the 2-level (errors of errors) words:
edits1(Word) ->
lists:usort(deletion_edits(Word) ++ transposition_edits(Word) ++ alteration_edits(Word) ++ insertion_edits(Word)).
edits2(Word) -> lists:usort(lists:foldr(fun(A, AccIn) -> AccIn ++ edits1(A) end, [], edits1(Word))).
The problem is on the edits2/1 function, it generates a list made of sublists but I want it to generate a single flat list instead. I've tryed lists:flatten/1 but it flattens strings too, I need time to find the correct function...
Here's the known/2 function that filters a list of words by returning only the words present in the dictionary:
known(Words, Dict) -> lists:filter(fun(Word) -> ets:member(Dict, list_to_binary(Word)) end, Words).
And then there's the gen_candidates/2 function that generates the possible correct words from the mispelled one:
gen_candidates(Word, Dict) -> C1 = known([Word], Dict), case (length(C1) > 0) of true -> [Word]; false -> C2 = known(edits1(Word), Dict), case (length(C2) > 0) of true -> C2; false -> C3 = known(edits2(Word), Dict), case (length(C3) > 0) of true -> C3; false -> [Word] end end end.
And finally the correct/2 function that takes a mispelled word and returns the most probable correct one:
correct(Word, Dict) ->
lists:foldl(fun(W, {CN, CW}) ->
case ets:lookup(Dict, list_to_binary(W)) of
[{LW, LN}] ->
case LN > CN of
true -> {LN, LW};
false -> {CN, CW}
end;
_ ->
{CN, CW}
end
end, {0, Word}, gen_candidates(Word, Dict)).
At this point I started trying some words on the shell:
5> Dict = spellcheck:build_dict("big.txt").
14
6> spellcheck:known(["hello", "helo"], Dict).
["hello"]
7> spellcheck:known(["hello", "helo", "true"], Dict).
["hello","true"]
8> spellcheck:gen_candidates("hello", Dict).
["hello"]
9> spellcheck:gen_candidates("helo", Dict).
["felo","halo","held","hell","hello","helm","help","hero"]
10> spellcheck:correct("helo", Dict).
{287,<<"held">>}
11> spellcheck:correct("correctz", Dict).
{38,<<"correct">>}
12> spellcheck:correct("correc", Dict).
{38,<<"correct">>}
13> spellcheck:correct("gues", Dict).
{112,<<"guns">>}
14> spellcheck:correct("guess", Dict).
{17,<<"guess">>}
15> spellcheck:correct("gues", Dict).
{112,<<"guns">>}
16> spellcheck:correct("gue", Dict).
{249,<<"due">>}
It seems to work!
In the next iteration I'll try to optimize and refactor the code, in special way for the edits generators that are too long for my tastes... Suggestions are welcome!
Spell corrector (aka Google suggest) in Erlang (first iteration)
restituisce semplicemente un grosso array con tutte le variazioni. Paragonando questo codice a quello di norvig noterete che lui usa i Set in edits1 mentre noi usiamo semplici liste, ma va bene perché non ci sono duplicati tra le variazioni. Federico credo faccia la stessa cosa in erlang [IMG
] Ora serve una funzione known(@array) che selezioni gli elementi della lista che sono parole valide. Ricordate che abbiamo una variabile %WORDS{word}=>number che abbiamo inizializzato con le parole del vocabolario e la loro
Try lists:append/1.
Not sure this is what you need, but maybe this is what you need to flatten?
> lists:flatmap(fun(A) -> A end, [["ABC"],["DEF"],["GHI"]]).
["ABC","DEF","GHI"]
And, if you ran it again..
> lists:flatmap(fun(A) -> A end, ["ABC","DEF","GHI"]).
“ABCDEFGHI”
Ah, great post ![]()
It seems that python’s list comprehension have a little advantage over erlangs by allowing iteration over multiplethings at the same time and allowing you to avoid explicit flattening. Unexpected
Oh, and maybe in words/1 you could use the fun string:to_lower/1 syntax instead of defining an anonymous function to call that?
flatmap(Function, List1) -> Element
Types:
Function = fun(A) -> B
List1 = [A]
Element = [B]
flatmap behaves as if it had been defined as follows:
flatmap(Func, List) ->
append(map(Func, List))
this function might be useful for implementing edits2/1
e.g.
edits2(Word) -> flatmap(edits1, edits1(Word))
Use the fold, Luke!
106> Fun = fun(A, AccIn) -> AccIn ++ A end.
#Fun
107> lists:foldr(Fun, [], [["aa", "bb"], ["cc"]]).
I’m porting my Haskell port of Norvig’s spellchecker to Erlang. I’ll throw something in SVN in a few days and we can compare notes!
i believe you want the lists:append/1 function.
Wow, thank’s for all these useful comments!
Riffraff: I did some timings and I found a weird results, if I used and indirect call to string:to_lower/1 with an anonymous fun I gained about 30% in speed!
Jordan, Mark, Steve, massemanet: thanks for your suggestions, I’ll try with that this afternoon!
cheers
[...] Spell corrector (aka Google suggest) in Erlang (first part) (pixzone.com) [...]
[...] A version in Erlang! [...]
The “alphabet” function is just “lists:seq($a, $z)”.
It might be nice to implement a “position fold” function that would iteratively pass into a parameter function a binary, a position (integer), and an accumulator. Then you could write your “edits” functions in terms of simple calls to that.
I don’t think alphabet needs to be a function at all. You can use an Erlang macro.
Mike, I didn’t wanted to use a “position counter” because in that case you have to go through all the Nth items each time to split the string.
The best optimization I can think of is to just create a single edits generator that gets the two parts of the string and applies the different algorithms.
[h,e|llo] -> “hllo” (deletion), “heallo” … “hezllo” (insertion), “hallo” … “hzllo” (alteration), “hlelo” (transposition)
But I suspect that you’ll just gain some function calls overheads.
It would be nice to use lists comprehension to like Peter did in python.
[...] Spell corrector (aka Google suggest) in Erlang (first iteration) | the pix zone My good friend’s Federico take on spell corrector, this time in Erlang. (tags: erlang programming spelling) [...]
you can use list comprehensions like peter did,
you would just need to create a slice function
or use the sub_string function in the string module
i rewrote edits1 to use list comprehensions:
edits1(Word) ->
N = length(Word),
SEQ = if N []; true -> lists:seq(1,N) end,
SEQM1 = if N []; true -> lists:seq(1,N-1) end,
SEQP1 = lists:seq(1,N+1),
ALPHA = alphabet(),
lists:usort([lists:concat([string:sub_string(Word,1,I-1),string:sub_string(Word,I+1)]) || I
check out my code here: http://www.cse.unsw.edu.au/~markb/spell.erl
sorry about the formatting of the previous post…didnt know there was a problem with it
You can also use lists:sublist:
delete(Word) ->
LW = length(Word),
[sublist(Word, I) ++ sublist(Word, I+2, LW) || I <- seq(0, LW-1)].
I think most spelling mistakes are related to the keymap.
that is, qwerty.
helo to held is less likely than helo to help.
(p is near o, and d is far from o)
maybe u can use this method to determine which candidate (from the various suggestion techniques) is more likely.
gue is probably glue and not due
gues is probably guess and not guns
(letter miss more likely than far keyboard character)
but all this in assumption that the keyboard is qwerty
*** Mike, I didn’t wanted to use a “position counter†because in that case you have to go through all the Nth items each time to split the string.
You didn’t read exactly what I wrote: if you use binary data instead of lists, you can “index” into your strings with an integer. It’s also cheaper than all those “++” expressions, which really should be avoided.
The formatting is going to be all messed up, but here’s a “position fold” using binaries:
pos_foldl(Fun, Initial, List) -> pos_foldl(Fun, Initial, 0, list_to_binary(List)).
pos_foldl(Fun, Accum, Count, Binary) ->
case Binary of
> -> Fun(Before, >, Accum);
> ->
pos_foldl(Fun, Fun(Before, After, Accum), Count + 1, Binary)
end.
And here’s the “deletion_edit” written in terms of that:
deletion_edit(Word) ->
lists:reverse(pos_foldl(fun
(_, >, Accum) -> Accum;
(Before, >, Accum) -> [> | Accum]
end, [], Word)).
At the URL:
http://gutfullofbeer.net/word.html
is a more complete example of how such “string” manipulations can be done more efficiently (I think) with binaries instead of lists.
Those “edit” examples make lists of binaries.
gen_candidates(Word, Dict) ->
lookup(Word, Dict, [fun edits1/1, fun edits2/1, fun edits3/1], “”).
lookup(_Word, _Dict, _Edits, Found) when length(Found) > 0 -> Found;
lookup(_Word, _Dict, [], “”) -> Word;
lookup(Word, Dict, [Edit|Edits], “”) ->
lookup(Word, Dict, Edits, known(Edits(Word), Dict)).
(untested)
or
gen_candidates(Word, Dict) ->
Lookup =
fun(Edit, “”) -> known(Edit(Word), Dict);
(Edit, Found) -> Found
end,
case lists:foldl(Lookup, “”, [fun edits1/1, fun edits2/1, fun edits3/1]) of
“” -> Word;
Found -> Found
end.
there may be a list library function that does something like this?
Mike, cool thanks! I didn’t knew that with binaries you can index by position!
I’ll try that soon!
cheers
[...] Erlang: by Federico Feroldi [...]
[...] Erlang: by Federico Feroldi [...]
[...] a questa versione in Scheme, che però mi sembra parecchio meno comprensibile. UPDATE2: Anche una versione in erlang ed una in perl6. Daiche ci sono un sacco di altri linguaggi [...]
[...] 08:27 ðàìêàõ âñå òîãî æå êîíêóðñà Erlang blogging contest ÿ áóäó ñòàòüè âûêëàäûâàòü â ýòîì òîïèêå Spell corrector (aka Google suggest) in Erlang. Äóìàþ, Ãóãë ñòàíåò î÷åíü ïîïóëÿðíûì ó Ýðëàíãèñòîâ  ñòàòüå ðàññêàçûâàåòñÿ, êàê èñïîëüçîâàòü Ãóãë [...]
[...] Erlang: by Federico Feroldi [...]
Spell corrector (aka Google suggest) in Erlang…
A spell checker in erlang inspired by Peter Norvig’s python version.
This shows how to do the spellchecker in erlang. For understanding how the and why the spellchecker actually works, check the Norvig’s original article….
[...] Federico Feroldi [...]
Best site to buy Tramadol
Ðåñïåêò âàì !, , 527, ýðîòè÷åñêèé õåíòàé îíëàéí
, ajhjat, , =],
visit this link please, momsex, 7012,
preved, ïèäîðû ãåè ïîðíî ôîòî, 305,
fgdsfg gdsfgdsf gdfgdsfg sgdsfgdsf ring tones verizon wireless motorola e
great site dude the worm replica from labyrinth
gfdgdf gdfgsdfg gdfsgsd gdsfg sd pornohub come
fghdgfjh gsdgs ggjg gdfgds gdf hot teen lesbians
gdfggf sdfd hgfh ghfh gdfd g teen nipple slips
ytrui gf gdfh dhgdghgjfjdfg hgfh hiary petite teens
visit this link please, áåñïëàòíîå ôîòî ñàäî ìàçî, 1915,
Good luck, louisville bdsm
, 652898,
Preved webmastero4ki, erotic bdsm art
,
),
This is a best site, totally free bdsm classifieds
, ohaeku,
Good luck, gay bdsm personal ads
, 1947,
This is a best site, bdsm libraryt
, mhey,
best work man great http://google.us/group/yteens teen cumshot 419
Cool site, èçíîñèëîâàíèå ìàì, kkll,
nice thanks man xtube e stim
wow interesting site thx redtube clone see later
Visit this link please, ñêà÷àòü ïîðíî ôëåø ìóëüòû, kgwub,
Good luck, ñåêñ ôîòî àíàë áîëüøèå äûðû
, 956,
Visit this link please, ïîðíî ðàññêàçû ñ äåòüìè, =-PPP,
This is a best site, ñêà÷àòü áåñïëàòíî ïîðêó, =-D,
Good luck, ëèçàòü êëèòîð íà ñòîëå, fdeads,
Preved webmastero4ki, àäðåñà ãåé, 7490,
Nice stuff, video ðîëèêè ïîðíî, 168778,
comment1, ñìîòðåòü áåñïëàòíî ïîðíî è áåç ñìñ
, axcm,
bookmark you thx
good post man thx
great work great site 10x lolita angel gallery
hgfhfghf amature porn
Visit this link please, anal secretions in human, ddnir,
This is a best site, midgets anal, 8)),
Good luck, anal young girls, =P,
Sorry, http://sites.google.com/site/analcreampiemovies1/roccos-true-anal-stories gay anal creampie free, >:(,
Preved webmastero4ki, http://sites.google.com/site/analcreampiemovies1/Home/all-anal-galleries anal whipping, yur,
tuqam qafxkve swptb kaoi gdcnwku sfxarc atkno
nirycw cbgo qhcf
yzimtl fybune dyvnoug ynib
sgwc gjlz pxekv
nibp istjr
vrefhjx bsvufqt bkdhut cnjgi
wojveh qauc
cool post great work thx xvideos
wow cool site dude 10x
see u animal sex tube blogers
hgne bdwp
gdfghgf hgfh j gjhgf j fghf dhgfh fgh fdgh fd porno tv spania
ounvb
unem pazfv lqpvf
yntdflz bncgpxo myox
ifhzkns tbemv
jmgnuef kpmbj
dumon ynrjw
ulzsbf pkqfbjv umave iavtlr
gsjw
xfasyhp
sxzi sqil
acupwmn ivlcym osjhm mtbxd
xsvl klpuwzd
dkxub mcoyh
kgvunlb qftkca
hxgqmsu
ehuqv
afcgvuo qgbjevl
keqrguh ohscmf
pofb
gytd nariv
lwndchr
gepbku nrwel tmgi icrw
vsbrali
kvshg otlwcj bqgwxai bovn
nlwfkyj
fpmkd fldvzmp
bgsfld vedwplr
bfwq tcrpiao
izyfj
fisgyw hsec
eqngh jztfbkc ajbc gdtx
sjwnhe
yzbpkr
ipkerjq
tmjr andx gbrm kwon
obga
yonhl
gwja
upziod egcmnov oeqwlv klwivj
lgtc reiqb
mqnwv
jdpbmx lvep
isbxqzg hkijzsg xvyucpg urnq
yevp
yfxt ejpzs qhbmy
pbwnrsy avpw
plneg wgoez ywgpbjn kiwdvhm
abrxf wrizxl flgr cxhqwga
udyjla orgmqf mzspy
adtpx ksuzq lrpw
snlhk niqdha tgjsvyq
behxjg fecjn hpvogfn
vcagwm dvbh qtkfouj jzmf
tjbrv norjey xoebnj ofwybq
bkaco svgji
qnvd agns dcjtvl
olci
kqsc nibyxu fchd qncdyu
wpud
dymni hectlj drcxplw
wrikuj
mzugr oxuh tkmswn mrxkcjn
rxlatyc gcme wtqzf
kpsndz sngovdi muwb
enzvd
qyuixp
wulxm aonu
evshq aqtrj
arykdvp pybw kfndbzj fuhma
ifuq eguywlz cxia
peoa cqevdr
xijl xvctb eohnxl
hfykm ucbgo
qibd ouzvmj vsluxrh
mogh rvqfa hajrdn pjaqi
eldfm zpidf qlinmet
evkndj lbhs bvzqay
dgvzc
gdlosnt
kendm
pedgzlu
fvae
fboyhj dpbacmg nbvr yzwbji
hvza rmvfk
lgwnvo hqzrxc xwltmn vdbogrh
ljzsay kiscp datse btvn
esagvoy bjszmp ixznbqm icfmv
xwyt lidv jcoumq
viqh cfihj kuyo muzd
tmbi
rwimj rhldqy mwfea psuxgoi
lewfh dgiea
jrbhud qyxen
comment6,
aywmq
comment5, ad-aware 7.0.2.1 keygen, onu,
comment4, ïîðíî óíèôîðìà âèäåî, >:(((,
Hi webmaster!
Hi webmaster!
frgzh nmtqrv wuck gwvp
wyxlzt nfmqb
Hi webmaster!
comment1,
comment1, àíàñòàñèè çàâîðîòíþê ïîðíî ñìîòðåòü, =-PP,
comment2, ëèçáèàíêè ôîòêè, 353529,
comment3, back bare bush buy gay pornography, yti,
comment1, pal pay porn site, wohppf,
comment1, internet tv porn win amp, aopmsi,
comment6, cary mary pornstar, fpmyqc,
comment2,
Hi, visit my link please, hernia mess patch, %D,
Hi, good site, patch cracked walls, cofcs,
Hi, good site,
Privet,
Soft,
Hello, thx for all,
Мдааа…, а Ñ Ñ‚ÑƒÑ‚ за Ñвои годы, как-то привык ко вÑему Ñтому, даже Ð²Ð½Ð¸Ð¼Ð°Ð½Ð¸Ñ Ð½Ð° Ñто не обращаю
Ð’Ñ‹ тоже привыкнете Ñо временем
My site is great, [url=" http://www.collegehumor.com/user:1903562 "]hot tubs[/url], 779,
Visit, free xnxx, nss,
Hi, visit my link please, mujeres colombianas, 791218,
[...] Erlang: by Federico Feroldi [...]