Oleg (obartunov) wrote,

Nepali language for PostgreSQL

Many years ago Teodor and I added Devanagari script support (slides 33-35) to PostgreSQL. We planned to add a Snowball stemmer for the Nepali language, but that was never done.



Recently, I revived the project and contacted Ingroj Shrestha from the Nepali NLP group, who kindly agreed to work on a Snowball stemmer for the Nepali language and rather quickly produced the first version of the stemmer (Ingroj Shrestha and Shreeya Singh Dhakal). I used it to add Nepali support to full text search in the master branch of PostgreSQL (default configuration, stop words); see nepali.patch. Here is how it works:

select ts_lexize('nepali_stem', 'अँगअँकाउछन्');
 ts_lexize
-----------
 {अँगअँकाउ}
(1 row)

select to_tsvector('nepali','PostgreSQL संसारको सबैभन्दा उन्नत खुला स्रोत डाटास हो');
                                 to_tsvector
-----------------------------------------------------------------------------
 'postgresql':1 'उन्नत':4 'खुला':5 'डाटास':7 'संसार':2 'सबैभन्':3 'स्रोत':6 'हो':8
(1 row)
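
To show how the new configuration could be used beyond one-off calls, here is a minimal sketch of an indexed search. It assumes a server where the 'nepali' configuration is available (the patched build above, or PG 12 and later). The table, column, and index names (docs, body, docs_fts_idx) are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table; the names docs, body and docs_fts_idx are illustrative only.
CREATE TABLE docs (
    id   serial PRIMARY KEY,
    body text NOT NULL
);

INSERT INTO docs (body)
VALUES ('PostgreSQL संसारको सबैभन्दा उन्नत खुला स्रोत डाटास हो');

-- Expression index on the Nepali tsvector. The two-argument form of
-- to_tsvector is immutable, so it can be used in an index expression.
CREATE INDEX docs_fts_idx ON docs USING gin (to_tsvector('nepali', body));

-- The query term is normalized by the same configuration, so an inflected
-- form can match the stemmed lexeme stored in the tsvector.
SELECT id, body
FROM docs
WHERE to_tsvector('nepali', body) @@ to_tsquery('nepali', 'संसारको');
```

The WHERE clause repeats the indexed expression exactly, which is what lets the planner consider the GIN index.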


I intend to submit this patch for PG 11.

Update: Arthur Zakirov added a Hunspell dictionary for the Nepali language.
Update: The Nepali FTS configuration was committed to PG 12.



PS. I had to create the ne_NP locale, using the ne_NP definition on Ubuntu Linux:
localedef -i ne_NP -c -f ./UTF-8 /usr/lib/locale/ne_NP

Then I used this locale to initialize the cluster:
initdb -D ........ --locale=ne_NP
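
Once the cluster is running, a quick sanity check along these lines may help confirm that the locale and the text search configuration are in place. This is only a sketch: it assumes you are connected to a database in the newly initialized cluster and that the 'nepali' configuration has been installed.

```sql
-- Locale settings of the current database; after the initdb call above
-- these would be expected to refer to ne_NP.
SELECT datname, datcollate, datctype
FROM pg_database
WHERE datname = current_database();

-- Check that a text search configuration named 'nepali' exists.
SELECT cfgname
FROM pg_ts_config
WHERE cfgname = 'nepali';

-- Show how each token is parsed and which dictionary handles it.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('nepali', 'PostgreSQL संसारको सबैभन्दा उन्नत');
```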

Tags: fts, nepal, pgen