(mysql.info.gz) Fulltext Search
Info Catalog
(mysql.info.gz) Date and time functions
(mysql.info.gz) Functions
(mysql.info.gz) Cast Functions
12.6 Full-Text Search Functions
===============================
Menu
* Fulltext Boolean Boolean Full-Text Searches
* Fulltext Query Expansion Full-Text Searches with Query Expansion
* Fulltext Restrictions Full-Text Restrictions
* Fulltext Fine-tuning Fine-Tuning MySQL Full-Text Search
* Fulltext TODO Full-Text Search TODO
`MATCH (COL1,COL2,...) AGAINST (EXPR [IN BOOLEAN MODE | WITH QUERY EXPANSION])'
As of MySQL 3.23.23, MySQL has support for full-text indexing and
searching. A full-text index in MySQL is an index of type
`FULLTEXT'. `FULLTEXT' indexes are used with `MyISAM' tables only
and can be created from `CHAR', `VARCHAR', or `TEXT' columns at
`CREATE TABLE' time or added later with `ALTER TABLE' or `CREATE
INDEX'. For large datasets, it will be much faster to load your
data into a table that has no `FULLTEXT' index, then create the
index with `ALTER TABLE' (or `CREATE INDEX'). Loading data into a
table that has an existing `FULLTEXT' index could be significantly
slower.
Constraints on full-text searching are listed in Fulltext
Restrictions.
Full-text searching is performed with the `MATCH()' function.
mysql> CREATE TABLE articles (
-> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
-> title VARCHAR(200),
-> body TEXT,
-> FULLTEXT (title,body)
-> );
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO articles (title,body) VALUES
-> ('MySQL Tutorial','DBMS stands for DataBase ...'),
-> ('How To Use MySQL Well','After you went through a ...'),
-> ('Optimizing MySQL','In this tutorial we will show ...'),
-> ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
-> ('MySQL vs. YourSQL','In the following database comparison ...'),
-> ('MySQL Security','When configured properly, MySQL ...');
Query OK, 6 rows affected (0.00 sec)
Records: 6 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM articles
-> WHERE MATCH (title,body) AGAINST ('database');
+----+-------------------+------------------------------------------+
| id | title | body |
+----+-------------------+------------------------------------------+
| 5 | MySQL vs. YourSQL | In the following database comparison ... |
| 1 | MySQL Tutorial | DBMS stands for DataBase ... |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)
The `MATCH()' function performs a natural language search for a string
against a text collection. A collection is a set of one or more columns
included in a `FULLTEXT' index. The search string is given as the
argument to `AGAINST()'. For every row in the table, `MATCH()' returns
a relevance value, that is, a similarity measure between the search
string and the text in that row in the columns named in the `MATCH()'
list.
By default, the search is performed in case-insensitive fashion. In
MySQL 4.1 and up, you can make a full-text search by using a binary
collation for the indexed columns. For example, a column that has a
character set of `latin1' can be assigned a collation of `latin1_bin'
to make it case sensitive for full-text searches.
When `MATCH()' is used in a `WHERE' clause, as in the preceding example,
the rows returned are automatically sorted with the highest relevance
first. Relevance values are non-negative floating-point numbers. Zero
relevance means no similarity. Relevance is computed based on the
number of words in the row, the number of unique words in that row, the
total number of words in the collection, and the number of documents
(rows) that contain a particular word.
For natural-language full-text searches, it is a requirement that the
columns named in the `MATCH()' function be the same columns included in
some `FULLTEXT' index in your table. For the preceding query, note
that the columns named in the `MATCH()' function (`title' and `body')
are the same as those named in the definition of the `article' table's
`FULLTEXT' index. If you wanted to search the `title' or `body'
separately, you would need to create `FULLTEXT' indexes for each column.
It is also possible to perform a boolean search or a search with query
expansion. These search types are described in Fulltext
Boolean and Fulltext Query Expansion.
The preceding example is a basic illustration showing how to use the
`MATCH()' function where rows are returned in order of decreasing
relevance. The next example shows how to retrieve the relevance values
explicitly. Returned rows are not ordered because the `SELECT'
statement includes neither `WHERE' nor `ORDER BY' clauses:
mysql> SELECT id, MATCH (title,body) AGAINST ('Tutorial')
-> FROM articles;
+----+-----------------------------------------+
| id | MATCH (title,body) AGAINST ('Tutorial') |
+----+-----------------------------------------+
| 1 | 0.65545833110809 |
| 2 | 0 |
| 3 | 0.66266459226608 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
+----+-----------------------------------------+
6 rows in set (0.00 sec)
The following example is more complex. The query returns the relevance
values and it also sorts the rows in order of decreasing relevance. To
achieve this result, you should specify `MATCH()' twice: once in the
`SELECT' list and once in the `WHERE' clause. This causes no additional
overhead, because the MySQL optimizer notices that the two `MATCH()'
calls are identical and invokes the full-text search code only once.
mysql> SELECT id, body, MATCH (title,body) AGAINST
-> ('Security implications of running MySQL as root') AS score
-> FROM articles WHERE MATCH (title,body) AGAINST
-> ('Security implications of running MySQL as root');
+----+-------------------------------------+-----------------+
| id | body | score |
+----+-------------------------------------+-----------------+
| 4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
| 6 | When configured properly, MySQL ... | 1.3114095926285 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)
MySQL uses a very simple parser to split text into words. A "word" is
any sequence of true word characters (letters, digits, and underscores),
optionally separated by no more than one sequential `'' character. For
example, `wasn't' is parsed as a single word, but `wasn''t' is parsed
as two words `wasn' and `t'. (And then `t' would be ignored as too
short according to the rules following.) Also, single quotes at the
ends of words are stripped; only embedded single quotes are retained.
Some words are ignored in full-text searches:
* Any word that is too short is ignored. The default minimum length
of words that will be found by full-text searches is four
characters.
* Words in the stopword list are ignored. A stopword is a word such
as "the" or "some" that is so common that it is considered to have
zero semantic value. There is a built-in stopword list.
The default minimum word length and stopword list can be changed as
described in Fulltext Fine-tuning.
Every correct word in the collection and in the query is weighted
according to its significance in the collection or query. This way, a
word that is present in many documents has a lower weight (and may even
have a zero weight), because it has lower semantic value in this
particular collection. Conversely, if the word is rare, it receives a
higher weight. The weights of the words are then combined to compute
the relevance of the row.
Such a technique works best with large collections (in fact, it was
carefully tuned this way). For very small tables, word distribution
does not adequately reflect their semantic value, and this model may
sometimes produce bizarre results. For example, although the word
"MySQL" is present in every row of the `articles' table, a search for
the word produces no results:
mysql> SELECT * FROM articles
-> WHERE MATCH (title,body) AGAINST ('MySQL');
Empty set (0.00 sec)
The search result is empty because the word "MySQL" is present in at
least 50% of the rows. As such, it is effectively treated as a
stopword. For large datasets, this is the most desirable behavior--a
natural language query should not return every second row from a 1GB
table. For small datasets, it may be less desirable.
A word that matches half of rows in a table is less likely to locate
relevant documents. In fact, it will most likely find plenty of
irrelevant documents. We all know this happens far too often when we
are trying to find something on the Internet with a search engine. It
is with this reasoning that rows containing the word are assigned a low
semantic value for _the particular dataset in which they occur_. A
given word may exceed the 50% threshold in one dataset but not another.
The 50% threshold has a significant implication when you first try
full-text searching to see how it works: If you create a table and
insert only one or two rows of text into it, every word in the text
occurs in at least 50% of the rows. As a result, no search returns any
results. Be sure to insert at least three rows, and preferably many
more.
Info Catalog
(mysql.info.gz) Date and time functions
(mysql.info.gz) Functions
(mysql.info.gz) Cast Functions
automatically generated byinfo2html