[翻译]tag的数据库设计

In: 未分类

2 2008

网上搜索下tag的设计,虽然找到一个tag的设计描述,但也是跟我们正常的设计差不多

最后他认为最好tag设计要遵循三范式的设计基础设计

大概描述如下:

原文:http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html

数据库结构

最近, del.icio.us 邮件列表, 我问了个问题 “有谁知道 del.icio.us的数据库设计?” .
我得到些私人的反馈,下面跟大家分享下:

问题: 你需要设计一个数据库作为一个书签(或者是一个博客或者其他的)需要给它们建立许多的标签, 之后,你想通过SQL语句获取标签的union(并集)或者intersection(交集), 你同时需要在短时间内然会标签查询结果.

显然 有三种的解决方法 (注意::如果你做一个网站允许用户自定义标签, 确信看过我在站点里面进行的 my performance tests测试报告.)

“MySQLicious” solution

一个表的解决方法:

Intersection (AND)交集

Query for “search+webservice+semweb”: //查询一段tag
SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
AND tags LIKE "%webservice%"
AND tags LIKE "%semweb%"

Union (OR)并集

Query for “search|webservice|semweb”:

SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
OR tags LIKE "%webservice%"
OR tags LIKE "%semweb%"

Minus

Query for “search+webservice-semweb”
SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
AND tags LIKE "%webservice%"
AND tags NOT LIKE "%semweb%"

Conclusion

优点:

缺点:

  • 必须限制tag的长度. 正常使用 256byte长度字段 (VARCHAR). 否则你要用到,text的字段或者相似的, 但是查询的速度会很慢, I suppose
  • If you paid attention (as Patrice did) you notice that LIKE "%search" will also find tags with “websearch”. If you alter the query to LIKE " %search% " you end up having a messy solution: You have to add a space to the beginning of the tags value to make this work.

“Scuttle” solution

使用两个表,一个外键关联在一起

Intersection (AND)

Query for “bookmark+webservice+semweb”:
SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId
HAVING COUNT( b.bId )=3

First, all bookmark-tag combinations are searched, where the tag is “bookmark”, “webservice” or “semweb” (c.category IN ('bookmark', 'webservice', 'semweb')), then just the bookmarks that have got all three tags searched for are taken into account (HAVING COUNT(b.bId)=3).

Union (OR)

Query for “bookmark|webservice|semweb”:
Just leave out the HAVING clause and you have union:
SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId

Minus (Exclusion)

Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.
SELECT b. *
FROM scBookmarks b, scCategories c
WHERE b.bId = c.bId
AND (c.category IN ('bookmark', 'webservice'))
AND b.bId NOT
IN (SELECT b.bId FROM scBookmarks b, scCategories c WHERE b.bId = c.bId AND c.category = 'semweb')
GROUP BY b.bId
HAVING COUNT( b.bId ) =2

Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.
Credits go to Rhomboid for helping me out with this query.

Conclusion

I guess the main advantage of this solution is that it is more normalized than the first solution, and that you can have unlimited number of tags per bookmark.

“Toxi” solution

Toxi came up with a three-table structure. Via the table “tagmap” the bookmarks and the tags are n-to-m related. Each tag can be used together with different bookmarks and vice versa. This DB-schema is also used by wordpress.
The queries are quite the same as in the “scuttle” solution.

Intersection (AND)

Query for “bookmark+webservice+semweb”
SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id
HAVING COUNT( b.id )=3

Union (OR)

Query for “bookmark|webservice|semweb”
SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id

Minus (Exclusion)

Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.

SELECT b. *
FROM bookmark b, tagmap bt, tag t
WHERE b.id = bt.bookmark_id
AND bt.tag_id = t.tag_id
AND (t.name IN ('Programming', 'Algorithms'))
AND b.id NOT IN (SELECT b.id FROM bookmark b, tagmap bt, tag t WHERE b.id = bt.bookmark_id

 AND bt.tag_id = t.tag_id AND t.name = 'Python')
GROUP BY b.id
HAVING COUNT( b.id ) =2

Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.
Credits go to Rhomboid for helping me out with this query.

Conclusion

The advantages of this solution:

  • You can save extra information on each tag (description, tag hierarchy, …)
  • This is the most normalized solution (that is, if you go for 3NF: take this one :-)

Disadvantages:

  • When altering or deleting bookmarks you can end up with tag-orphans.

If you want to have more complicated queries like (bookmarks OR bookmark) AND (webservice or WS) AND NOT (semweb or semanticweb) the queries tend to become very complicated. In these cases I suggest the following query/computation process:

  1. Run a query for each tag appearing in your “tag-query”: SELECT b.id FROM tagmap bt, bookmark b, tag t WHERE bt.tag_id = t.tag_id AND b.id = bt.bookmark_id AND t.name = "semweb"
  2. Put each id-set from the result into an array (that is: in your favourite coding language). You could cache this arrays if you want..
  3. Constrain the arrays with union or intersection or whatever.

In this way, you can also do queries like (del.icio.us|delicious)+(semweb|semantic_web)-search. This type of queries (that is: the brackets) cannot be done by using the denormalized “MySQLicious solution”.
This is the most flexible data structure and I guess it should scale pretty good (that is: if you do some caching).

Update May, 2006. This arcticle got quite some attention. I wasn’t really prepared for that! It seems people keep referring to it and even some new sites that allow tagging give credit to my articles. I think the real credit goes to the contributers of the different schemas: MySQLicious, scuttle, Toxi and to all the contributors of the comments (be sure to read them!)

P.S. Thanks to Toxi for sending me the queries for the three-table-schema, Benjamin Reitzammer for pointing me to a loughing meme article (a good reference for tag queries) and powerlinux for pointing me to scuttle.

Further reading

 

Comment Form

About this blog

QK31欢迎你的到来.

Photostream

search_extends

 

2008年十月
« 九   十一 »
 12345
6789101112
13141516171819
20212223242526
2728293031  

51
Unique
Visitors
Powered By Google Analytics

分类目录

标签云