
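A Detailed Guide to Full-Text Search in PostgreSQL

When you build a web application, you are often asked to add search. Sometimes the magnifying glass is drawn on the mockup before anyone knows what is supposed to be searched.

Search is an important feature, which is why Lucene-based tools such as Elasticsearch and SOLR have become so popular. They are both great. But before reaching for these search "weapons of mass destruction", you may want something more lightweight that is still good enough.

By "good enough" I mean a search engine with the following features:

Stemming
Ranking / Boost
Support for multiple languages
Fuzzy search for misspellings
Accent support

Luckily, PostgreSQL supports all of these.

This article is aimed at people who:

use PostgreSQL and do not want to install another search engine;
use another database (such as MySQL) and need better full-text search.

Throughout the article we will use the following tables and data to illustrate PostgreSQL's full-text search.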

CREATE TABLE author(
 id SERIAL PRIMARY KEY,
 name TEXT NOT NULL);
CREATE TABLE post(
 id SERIAL PRIMARY KEY,
 title TEXT NOT NULL,
 content TEXT NOT NULL,
 author_id INT NOT NULL references author(id) );
CREATE TABLE tag(
 id SERIAL PRIMARY KEY,
 name TEXT NOT NULL );
CREATE TABLE posts_tags(
 post_id INT NOT NULL references post(id),
 tag_id INT NOT NULL references tag(id)
 );
INSERT INTO author (id, name) 
VALUES (1, 'Pete Graham'), 
  (2, 'Rachid Belaid'), 
  (3, 'Robert Berry');
 
INSERT INTO tag (id, name) 
VALUES (1, 'scifi'), 
  (2, 'politics'), 
  (3, 'science');
 
INSERT INTO post (id, title, content, author_id) 
VALUES (1, 'Endangered species', 'Pandas are an endangered species', 1 ), 
  (2, 'Freedom of Speech', 'Freedom of speech is a necessary right missing in many countries', 2), 
  (3, 'Star Wars vs Star Trek', 'Few words from a big fan', 3);
 
INSERT INTO posts_tags (post_id, tag_id) 
VALUES (1, 3), 
  (2, 2), 
  (3, 1);

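This is a blog-like application. It has a post table with title and content columns. A post is linked to an author through a foreign key, and a post can have several tags.

What is full-text search?

First, let's look at the definition:

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection of documents in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.

-- Wikipedia

This definition introduces the concept of a document, which is important. When you search your data, you are looking for meaningful entities that you want to find; these are your documents. The PostgreSQL documentation explains it well.

A document is the unit of searching in a full-text search system; for example, a magazine article or an email message.

-- Postgres documentation

A document can span several tables and represents a logical entity that we want to search for.

Building our document

In the previous section we introduced the concept of a document. A document is not tied to the table schema but to the data: it combines fields into one meaningful entity. Given the schema of our example, our document is made of:

post.title
post.content
the author.name of the post
all tag.name values associated with the post

To build the document from these requirements, the SQL query looks like this: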

SELECT post.title || ' ' ||
  post.content || ' ' ||
  author.name || ' ' ||
  coalesce((string_agg(tag.name, ' ')), '') as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

 document
--------------------------------------------------
 Endangered species Pandas are an endangered species Pete Graham science
 Freedom of Speech Freedom of speech is a necessary right missing in many countries Rachid Belaid politics
 Star Wars vs Star Trek Few words from a big fan Robert Berry scifi
(3 rows)

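Because we group by post and author, and several tags can be associated with one post, we use string_agg() as the aggregate function. Even though author is a foreign key and a post cannot have more than one author, we still have to either apply an aggregate function to author or add author to the GROUP BY.

We also use coalesce(). When a value can be NULL, it is good practice to wrap it in coalesce(); otherwise the result of the whole string concatenation would be NULL.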
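The effect of coalesce() on concatenation can be checked in isolation. This is a minimal sketch with literal values only (the column aliases are made up for the illustration): the first column comes back NULL, the second keeps the text.

```sql
-- Concatenating with NULL yields NULL; coalesce() substitutes an empty string.
SELECT 'Endangered species' || ' ' || NULL::text               AS without_coalesce,
       'Endangered species' || ' ' || coalesce(NULL::text, '') AS with_coalesce;
```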

At this point our document is just one long string, which is not very useful. We need to convert it into the right format with to_tsvector().

SELECT to_tsvector(post.title) ||
  to_tsvector(post.content) ||
  to_tsvector(author.name) ||
  to_tsvector(coalesce((string_agg(tag.name, ' ')), '')) as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

 document
--------------------------------------------------
 'endang':1,6 'graham':9 'panda':3 'pete':8 'scienc':10 'speci':2,7
 'belaid':16 'countri':14 'freedom':1,4 'mani':13 'miss':11 'necessari':9 'polit':17 'rachid':15 'right':10 'speech':3,6
 'berri':13 'big':10 'fan':11 'robert':12 'scifi':14 'star':1,4 'trek':5 'vs':3 'war':2 'word':7
(3 rows)

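This query returns our document as a tsvector, a format suited to full-text search. Let's try converting a plain string into a tsvector.

SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value');

The query returns the following result:

 to_tsvector
----------------------------------------------------------------------
 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)

Something odd happened. There are fewer words than in the original sentence, some words have changed (try became tri), and every word is followed by numbers. What is going on?

A tsvector is a sorted list of distinct lexemes, which are words that have been normalized so that different variants of the same word are treated as the same.

Normalization almost always folds upper-case letters to lower-case and often removes suffixes (such as s, es or ing in English). This lets a search find variant forms of the same word without tediously entering every possible variant.

The numbers are the positions of the lexeme in the original string. For example, "man" appears at positions 6 and 15. You can count the words to check.

In this setup, to_tsvector uses the english text search configuration by default (the default comes from the default_text_search_config setting). It ignores English stop words such as am, is, are, a and an.

This explains why the tsvector contains fewer entries than the sentence has words. We will see more about languages and text search configurations later.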
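To see stemming and stop-word removal one word at a time, the dictionary behind the english configuration can be queried directly. This is a small sketch, assuming the stock english_stem dictionary is installed (it ships with PostgreSQL); the two sample words are the ones used in the PostgreSQL documentation.

```sql
-- ts_lexize(dictionary, token) shows what a single dictionary does with one token.
SELECT ts_lexize('english_stem', 'stars');  -- {star}: the suffix is removed
SELECT ts_lexize('english_stem', 'a');      -- {}: a stop word produces no lexeme

-- The configuration used when to_tsvector() is called without one.
SHOW default_text_search_config;
```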

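Querying

We know how to build a document, but our goal is to search it. To search a tsvector we can use the @@ operator, which is described in the PostgreSQL documentation. Let's look at a few examples of querying a document.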

> select to_tsvector('If you can dream it, you can do it') @@ 'dream';
 ?column?
----------
 t
(1 row)
 
> select to_tsvector('It''s kind of fun to do the impossible') @@ 'impossible';
 
 ?column?
----------
 f
(1 row)

The second query returns false because we need to build a proper tsquery. When the @@ operator is given a plain string, the string is simply cast to a tsquery without being normalized. The following shows the difference between this cast and using to_tsquery().

SELECT 'impossible'::tsquery, to_tsquery('impossible');
   tsquery    | to_tsquery
--------------+------------
 'impossible' | 'imposs'
(1 row)

The lexeme of "dream", however, is the same as the word itself.

SELECT 'dream'::tsquery, to_tsquery('dream');
 tsquery | to_tsquery
---------+------------
 'dream' | 'dream'
(1 row)

From now on we will use to_tsquery to query documents.

 
SELECT to_tsvector('It''s kind of fun to do the impossible') @@ to_tsquery('impossible');
 
 ?column?
----------
 t
(1 row)

A tsquery stores the lexemes to search for, and they can be combined with the logical operators & (AND), | (OR) and ! (NOT). Parentheses can be used to group operators.

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('! fact');
 
 ?column?
----------
 f
(1 row)
 
> SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('theory & !fact');
 
 ?column?
----------
 f
(1 row)
 
> SELECT to_tsvector('If the facts don''t fit the theory, change the facts.') @@ to_tsquery('fiction | theory');
 
 ?column?
----------
 t
(1 row)
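Grouping with parentheses is not shown in the examples above. The two queries below are a sketch of how the placement of parentheses changes the outcome for the same sentence; the english configuration is passed explicitly so the result does not depend on the server default.

```sql
-- ('fiction' OR 'theory') AND 'fact': "theory" and "facts" are both present, so this matches.
SELECT to_tsvector('english', 'If the facts don''t fit the theory, change the facts')
       @@ to_tsquery('english', '(fiction | theory) & fact') AS grouped_or;

-- 'fiction' OR ('theory' AND NOT 'fact'): "fiction" is absent and "fact" is present, so this does not match.
SELECT to_tsvector('english', 'If the facts don''t fit the theory, change the facts')
       @@ to_tsquery('english', 'fiction | (theory & !fact)') AS grouped_and_not;
```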

We can also use :* to express a prefix query, which matches words that start with the given string.

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts.') @@ to_tsquery('theo:*');
 
 ?column?
----------
 t
(1 row)

Now that we know how to run full-text queries, let's go back to our initial schema and try to query our document.

SELECT pid, p_title
FROM (SELECT post.id as pid,
    post.title as p_title,
    to_tsvector(post.title) ||
    to_tsvector(post.content) ||
    to_tsvector(author.name) ||
    to_tsvector(coalesce(string_agg(tag.name, ' '), '')) as document
  FROM post
  JOIN author ON author.id = post.author_id
  JOIN posts_tags ON posts_tags.post_id = post.id
  JOIN tag ON tag.id = posts_tags.tag_id
  GROUP BY post.id, author.id) p_search
WHERE p_search.document @@ to_tsquery('Endangered & Species');

 pid |      p_title
-----+--------------------
   1 | Endangered species
(1 row)

This query finds the documents that contain both Endangered and Species, or words with the same stems.
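to_tsquery() expects its input to already contain operators, so raw text typed by a user (words separated by spaces) is rejected with a syntax error. One option, shown here as a sketch rather than as what the article itself does, is the built-in plainto_tsquery(), which normalizes the words and joins them with &.

```sql
-- plainto_tsquery() turns free text into an AND query of normalized lexemes.
SELECT plainto_tsquery('english', 'Endangered Species');
-- 'endang' & 'speci'
```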

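Language support

Postgres' built-in text search supports many languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish.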
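The exact list depends on the PostgreSQL version and on any configurations added locally. A quick way to check what a given server offers is to read the system catalog (in psql, the \dF command shows the same information):

```sql
-- List the text search configurations known to this server.
SELECT cfgname FROM pg_ts_config ORDER BY cfgname;
```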

SELECT to_tsvector('english', 'We are running');
 to_tsvector
-------------
 'run':3
(1 row)

SELECT to_tsvector('french', 'We are running');
 to_tsvector
----------------------------
 'are':2 'running':3 'we':1
(1 row)

Building on our original schema, a column can be used to choose the configuration when creating the tsvector. Let's assume the post table contains content in different languages and has a language column.

ALTER TABLE post ADD language text NOT NULL DEFAULT('english');

To make use of the language column, we now rebuild our document.

SELECT to_tsvector(post.language::regconfig, post.title) ||
  to_tsvector(post.language::regconfig, post.content) ||
  to_tsvector('simple', author.name) ||
  to_tsvector('simple', coalesce((string_agg(tag.name, ' ')), '')) as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

Without the explicit ::regconfig cast, the query fails with an error:

ERROR: function to_tsvector(text, text) does not exist

regconfig is an object identifier type that represents a Postgres text search configuration: http://www.postgresql.org/docs/9.3/static/datatype-oid.html

Now the lexemes of our document are built with the correct language, taken from post.language.

We also use simple, another text search configuration provided by Postgres. simple does not ignore stop words and does not try to find the stem of a word. With simple, every group of characters separated by whitespace is a lexeme. The simple configuration is practical for data such as a person's name, for which we probably do not want to find a stem.

SELECT to_tsvector('simple', 'We are running');
 to_tsvector
----------------------------
 'are':2 'running':3 'we':1
(1 row)

Accented characters

When you build a search engine that supports many languages, you also need to think about accents. In many languages accents are important and can change the meaning of a word. Postgres ships with an extension called unaccent that is useful for removing accents from content.

CREATE EXTENSION unaccent;

SELECT unaccent('èéêë');
 unaccent
----------
 eeee
(1 row)

Let's add some accented content to our post table.

INSERT INTO post (id, title, content, author_id, language) 
VALUES (4, 'il était une fois', 'il était une fois un hôtel ...', 2,'french');

If we want to ignore accents when building our document, we can simply do the following:

SELECT to_tsvector(post.language::regconfig, unaccent(post.title)) ||
  to_tsvector(post.language::regconfig, unaccent(post.content)) ||
  to_tsvector('simple', unaccent(author.name)) ||
  to_tsvector('simple', unaccent(coalesce(string_agg(tag.name, ' '), ''))) as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

This works, but it is a bit cumbersome and leaves more room for mistakes. We can also create a new text search configuration that supports unaccented characters.

CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
ALTER TEXT SEARCH CONFIGURATION fr ALTER MAPPING
FOR hword, hword_part, word WITH unaccent, french_stem;

With this new text search configuration, we can look at the lexemes:

SELECT to_tsvector('french', 'il était une fois');
 to_tsvector
-------------
 'fois':4
(1 row)

SELECT to_tsvector('fr', 'il était une fois');
 to_tsvector
--------------------
 'etait':2 'fois':4
(1 row)

This gives the same result as applying unaccent first and building the tsvector from the result.

SELECT to_tsvector('french', unaccent('il était une fois'));
 to_tsvector
--------------------
 'etait':2 'fois':4
(1 row)

The number of lexemes differs because il, était and une are stop words in French. Is it a problem to keep such a word in our document? I don't think so: etait is not really a stop word, because it is misspelled.

SELECT to_tsvector('fr', 'Hôtel') @@ to_tsquery('hotels') as result;
 result
--------
 t
(1 row)

If we create an unaccented text search configuration for each language our posts can be written in, and store its name in post.language, then we can keep our previous document query.

SELECT to_tsvector(post.language::regconfig, post.title) ||
  to_tsvector(post.language::regconfig, post.content) ||
  to_tsvector('simple', author.name) ||
  to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')) as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

If you need to create an unaccented text search configuration for each language supported by Postgres, you can use this gist.

Our document may now be larger, because it can contain unaccented stop words, but we can query without caring about accented characters. This is useful, for example, when someone searches French content with an English keyboard.

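Ranking

When you build a search engine, you want results ordered by relevance. Ranking can be based on many factors, and the PostgreSQL documentation gives a rough explanation of them.

Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur.

-- PostgreSQL documentation

PostgreSQL provides several functions to get the relevance ordering we want; in our example we will use two of them: ts_rank() and setweight().

The setweight function lets us assign an importance (weight) to a tsvector; the value can be 'A', 'B', 'C' or 'D'.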

SELECT pid, p_title
FROM (SELECT post.id as pid,
    post.title as p_title,
    setweight(to_tsvector(post.language::regconfig, post.title), 'A') ||
    setweight(to_tsvector(post.language::regconfig, post.content), 'B') ||
    setweight(to_tsvector('simple', author.name), 'C') ||
    setweight(to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')), 'B') as document
  FROM post
  JOIN author ON author.id = post.author_id
  JOIN posts_tags ON posts_tags.post_id = post.id
  JOIN tag ON tag.id = posts_tags.tag_id
  GROUP BY post.id, author.id) p_search
WHERE p_search.document @@ to_tsquery('english', 'Endangered & Species')
ORDER BY ts_rank(p_search.document, to_tsquery('english', 'Endangered & Species')) DESC;

In the query above we assign different weights to the different fields of the document. post.title is more important than post.content, which is as important as the associated tags. The least important is author.name.

This means that if we search for the term "Alice", documents that contain it in their title come first in the results, followed by documents that contain it in their content or tags, and finally documents that contain it in the author's name.
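To make the weights visible, it helps to look at a weighted tsvector on its own. The sketch below uses a literal string instead of the article's tables. The second statement passes the weight array explicitly to ts_rank(); the values shown are the documented defaults, in the order {D, C, B, A}, and can be changed to tune how much each label counts.

```sql
-- Each position is tagged with the label given to setweight().
SELECT setweight(to_tsvector('english', 'Endangered species'), 'A') AS weighted;
-- 'endang':1A 'speci':2A

-- ts_rank() accepts an optional weight array {D, C, B, A}; these are the defaults.
SELECT ts_rank('{0.1, 0.2, 0.4, 1.0}',
               setweight(to_tsvector('english', 'Endangered species'), 'A'),
               to_tsquery('english', 'species')) AS rank_with_explicit_weights;
```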

Based on the weights assigned to the parts of the document, ts_rank() returns a floating-point number that represents the relevance of the document to the query.

SELECT ts_rank(to_tsvector('This is an example of document'),
    to_tsquery('example | document')) as relevancy;
 relevancy
-----------
 0.0607927
(1 row)

SELECT ts_rank(to_tsvector('This is an example of document'),
    to_tsquery('example')) as relevancy;
 relevancy
-----------
 0.0607927
(1 row)

SELECT ts_rank(to_tsvector('This is an example of document'),
    to_tsquery('example | unknown')) as relevancy;
 relevancy
-----------
 0.0303964
(1 row)

SELECT ts_rank(to_tsvector('This is an example of document'),
    to_tsquery('example & document')) as relevancy;
 relevancy
-----------
 0.0985009
(1 row)

SELECT ts_rank(to_tsvector('This is an example of document'),
    to_tsquery('example & unknown')) as relevancy;
 relevancy
-----------
 1e-20
(1 row)

However, the concept of relevancy is vague and very application-specific. Different applications may need additional information for ranking, such as the modification time of a document. The built-in ranking functions such as ts_rank are only examples. You can write your own ranking functions and/or combine their results with other factors to fit your specific needs.

To illustrate, if we wanted newer posts to be more important than older ones, we could divide the ts_rank value by the age of the document + 1 (to avoid dividing by zero).
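The age-based adjustment could be sketched as follows. The post table in this article has no timestamp, so the created_at column is an assumption added purely for the illustration, and measuring age in days is an arbitrary choice. The query also relies on the language column added earlier and, to stay short, ranks on post.content only.

```sql
-- Hypothetical column: not part of the article's schema.
ALTER TABLE post ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();

SELECT post.id,
       post.title,
       -- Relevance divided by (age in days + 1); the + 1 avoids division by zero.
       ts_rank(to_tsvector(post.language::regconfig, post.content), query)
         / (EXTRACT(EPOCH FROM (now() - post.created_at))::float8 / 86400 + 1) AS score
FROM post, to_tsquery('english', 'endangered & species') AS query
WHERE to_tsvector(post.language::regconfig, post.content) @@ query
ORDER BY score DESC;
```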

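Optimization and indexing

Optimizing search on a single table is straightforward. PostgreSQL supports function-based indexes, so you can simply create a GIN index around the to_tsvector() function.

-- Assumes the language column is of type regconfig: to_tsvector(text, text)
-- does not exist, and a text::regconfig cast is generally rejected in an
-- index expression because it is not immutable.
CREATE INDEX idx_fts_post ON post 
USING gin((setweight(to_tsvector(language, title),'A') || 
   setweight(to_tsvector(language, content), 'B')));

GIN or GiST index? These two indexes could be the subject of a blog post of their own. GiST can produce false matches, which then require an extra table row lookup to verify the match. GIN, on the other hand, is faster to search but bigger and slower to build.

As a rule of thumb, GIN indexes are best for static data because lookups are faster. For dynamic data, GiST indexes are faster to update. Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000, while GIN indexes handle 100,000+ lexemes better but are slower to update.

-- Postgres documentation: Chapter 12, Full Text Search

For our example we will use GIN. The choice is not fixed, though, and you can decide based on your own data.

There is one problem with our example schema: the document is spread across several tables with different weights. For better performance it is necessary to denormalize the data, using triggers or a materialized view.

We do not always need to denormalize; sometimes adding a function-based index, as we did above, is enough. Alternatively, you can denormalize data from the same table with the Postgres trigger functions tsvector_update_trigger(...) or tsvector_update_trigger_column(...). See the Postgres documentation for more details.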
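As an illustration of the trigger approach for a single table, the sketch below keeps a tsvector column on post up to date. The column name tsv and the trigger and index names are made up for the example, and the configuration is fixed to english; tsvector_update_trigger_column() would instead read the configuration from a regconfig column. Note that these built-in triggers do not apply weights.

```sql
-- Hypothetical column holding the precomputed document for each post.
ALTER TABLE post ADD COLUMN tsv tsvector;

-- Recompute tsv from title and content on every insert or update.
-- (EXECUTE FUNCTION is the newer spelling, available from PostgreSQL 11.)
CREATE TRIGGER post_tsv_update
BEFORE INSERT OR UPDATE ON post
FOR EACH ROW EXECUTE PROCEDURE
  tsvector_update_trigger(tsv, 'pg_catalog.english', title, content);

-- Fire the trigger once for rows that already exist, then index the column.
UPDATE post SET title = title;
CREATE INDEX idx_post_tsv ON post USING gin(tsv);
```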

In our application, some delay before a result shows up in search is acceptable. This is a good case for using a materialized view, on which we can build an extra index.

CREATE MATERIALIZED VIEW search_index AS
SELECT post.id,
  post.title,
  setweight(to_tsvector(post.language::regconfig, post.title), 'A') ||
  setweight(to_tsvector(post.language::regconfig, post.content), 'B') ||
  setweight(to_tsvector('simple', author.name), 'C') ||
  setweight(to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')), 'A') as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

Re-indexing the search engine is then as simple as periodically running REFRESH MATERIALIZED VIEW search_index.
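A plain REFRESH locks the view against reads while it runs. If that matters, PostgreSQL 9.4 and later can refresh concurrently, provided the view has a unique index. The sketch below assumes that id is unique in search_index, which holds here because each post has exactly one author; the index name is made up.

```sql
-- A unique index is required before CONCURRENTLY can be used.
CREATE UNIQUE INDEX idx_search_index_id ON search_index (id);

-- Rebuilds the view without blocking concurrent SELECTs on it.
REFRESH MATERIALIZED VIEW CONCURRENTLY search_index;
```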

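Now we can add an index on the materialized view.

CREATE INDEX idx_fts_search ON search_index USING gin(document);

Querying becomes just as simple.

SELECT id as post_id, title
FROM search_index
WHERE document @@ to_tsquery('english', 'Endangered & Species')
ORDER BY ts_rank(document, to_tsquery('english', 'Endangered & Species')) DESC;

If the delay becomes unacceptable, you should look into the alternative approach using triggers.

There is no single way to build your document store; it depends on what your document is made of: a single table or several tables, multiple languages, the amount of data ...

Thoughtbot.com published a good article, "Implementing Multi-Table Full Text Search with Postgres in Rails", which I recommend reading.

Misspellings

PostgreSQL comes with a very useful extension called pg_trgm. See the pg_trgm documentation.

CREATE EXTENSION pg_trgm;

pg_trgm supports n-grams with N = 3 (trigrams). N-grams are useful because they let us find similar strings, and that is essentially what a misspelling is: a word that is similar but not quite correct.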
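To see what pg_trgm actually compares, the extension can print the trigrams it extracts from a string. The sketch below uses the sample word from the pg_trgm documentation; note the padding spaces added at the start and end of the word.

```sql
-- show_trgm() returns the set of trigrams pg_trgm derives from a string.
SELECT show_trgm('cat');
-- {"  c"," ca","at ",cat}
```

Two strings are then considered similar to the degree that their trigram sets overlap.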

SELECT similarity('Something', 'something');
 similarity
------------
          1
(1 row)

SELECT similarity('Something', 'samething');
 similarity
------------
   0.538462
(1 row)

SELECT similarity('Something', 'unrelated');
 similarity
------------
          0
(1 row)

SELECT similarity('Something', 'everything');
 similarity
------------
   0.235294
(1 row)

SELECT similarity('Something', 'omething');
 similarity
------------
   0.583333
(1 row)

As the examples above show, the similarity function returns a float that represents how similar two strings are. Detecting a misspelling then comes down to collecting the lexemes used in our documents and comparing their similarity with the input text. I found that a similarity threshold of 0.5 works well for detecting misspellings. First, we need to build a list of unique lexemes from our documents.

CREATE MATERIALIZED VIEW unique_lexeme AS
SELECT word FROM ts_stat($$
  SELECT to_tsvector('simple', post.title) ||
    to_tsvector('simple', post.content) ||
    to_tsvector('simple', author.name) ||
    to_tsvector('simple', coalesce(string_agg(tag.name, ' '), ''))
  FROM post
  JOIN author ON author.id = post.author_id
  JOIN posts_tags ON posts_tags.post_id = post.id
  JOIN tag ON tag.id = posts_tags.tag_id
  GROUP BY post.id, author.id
$$);

The script above creates a view with a word column that contains the list of distinct lexemes. We use the simple configuration because our tables can hold text in several languages. Once the materialized view is created, we need to add an index to make the similarity queries faster.

CREATE INDEX words_idx ON unique_lexeme USING gin(word gin_trgm_ops);

Luckily, the list of unique lexemes used by a search engine does not change quickly, so we do not need to refresh the materialized view very often with the script below:

REFRESH MATERIALIZED VIEW unique_lexeme;

Once this view is built, finding the closest match is easy.

SELECT word
FROM unique_lexeme
WHERE similarity(word, 'samething') > 0.5
ORDER BY word <-> 'samething'
LIMIT 1;

This query returns a lexeme that is similar enough (> 0.5) to the input samething, with the closest match first. The <-> operator returns the "distance" between its arguments, which is one minus the similarity() value.
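The relationship between similarity() and the pg_trgm distance operator can be checked directly on two literals. This is a small sketch; the column aliases are made up, and the two values should add up to 1.

```sql
-- The trigram distance operator <-> returns 1 - similarity().
SELECT similarity('Something', 'samething') AS sim,
       'Something' <-> 'samething'          AS distance;
```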

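When you decide to handle misspellings in your search, you probably do not want to look for them on every query. Instead, you can query for misspellings only when a search returns no results, and use the results to offer suggestions to the user. If your data comes from informal communication, such as a social network, it may itself contain misspellings. In that case you can get better results by appending similar lexemes to your tsquery.

"Super Fuzzy Searching on PostgreSQL" is a good reference article about using trigrams for misspellings and search in Postgres.

In my use case, the table of unique lexemes never grew beyond 2,000 rows, and my understanding is that with more than 1M unique lexemes in your text you will run into performance problems with this technique.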
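The fallback described above (looking for a misspelling only after a regular search came back empty) could be sketched as a single query. It assumes the search_index and unique_lexeme materialized views defined earlier in this article; the input 'samething' stands in for whatever the user typed. plainto_tsquery() is used so that the suggested word is normalized safely, and if no lexeme is similar enough the subquery yields NULL and the query simply returns no rows.

```sql
-- Run this only when the regular search returned no rows.
SELECT id AS post_id, title
FROM search_index
WHERE document @@ plainto_tsquery('english', (
    -- Closest known lexeme to the user's input, if any is similar enough.
    SELECT word
    FROM unique_lexeme
    WHERE similarity(word, 'samething') > 0.5
    ORDER BY word <-> 'samething'
    LIMIT 1));
```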

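About MySQL and RDS

Does this work on Postgres RDS?

All the examples above work on RDS. As far as I know, the only restriction RDS places on search features concerns those that need access to the file system, such as custom dictionaries, spell checkers, synonyms and thesauri. See the Amazon AWS forum for more information.

I use MySQL. Can I use its built-in full-text search?

I wouldn't. Without starting a debate, MySQL's full-text search is very limited. By default it does not support stemming for any language. I came across a stemming function that can be installed, but MySQL does not support function-based indexes.

So what can you do? Given what we discussed above, if Postgres fits all of your use cases, consider switching your database to Postgres. The migration can be done easily with tools such as py-mysql2pgsql. Otherwise, you can look into more advanced solutions such as SOLR (a Lucene-based full-text search server) and Elasticsearch (a Lucene-based open-source, distributed, RESTful search engine).

Conclusion

We have seen how to build a decent, multi-language text search engine around a non-trivial document. This article is only an overview, but it should give you enough background and examples to start building your own. I may have made mistakes in this article, and I would be grateful if you reported them to blog@lostpropertyhq.com.

Postgres' full-text search is very good, and fast enough. It lets the data in your application grow without depending on another tool for search. Is Postgres search a silver bullet? Probably not, if your core business revolves around search.

It is missing some features, but you will not need them in most use cases. Needless to say, you should analyze and understand your needs carefully before deciding which search solution to use.

Personally, I hope Postgres full-text search keeps improving and gains the following features:

Additional built-in language support: Chinese, Japanese, ...
A foreign data wrapper around Lucene. Lucene is still the most advanced tool for full-text search, and integrating it with Postgres would bring many benefits.
More boosting or scoring features for ranking results would be top-notch. Elasticsearch and SOLR already offer advanced solutions.
A way to do fuzzy tsquery searches without resorting to trigrams would be great. Elasticsearch offers a very simple way to do fuzzy search queries.
The ability to create and edit dictionary content, synonyms and thesauri dynamically through SQL, instead of adding files to the file system.

Postgres is not as advanced as Elasticsearch and SOLR; after all, those are dedicated full-text search tools, whereas full-text search is just one of PostgreSQL's good features.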
