文章介紹了關於Twitter的cursor方式進行Web數據分頁用法,有需要了解大數據分頁的朋友可參考一下。
上圖功能的技術實現方法拿MySQL來舉例就是
select * from msgs where thread_id = ? limit page * count, count
不過在看Twitter API的時候,我們卻發現不少接口使用cursor的方法,而不用page, count這樣直觀的形式,如 followers ids 接口
代碼如下
復制代碼
URL:
http://twitter.com/followers/ids.format
Returns an array of numeric IDs for every user following the specified user.
Parameters:
* cursor. Required. Breaks the results into pages. Provide a value of -1 to begin paging. Provide values as returned to in the response body’s next_cursor and previous_cursor attributes to page back and forth in the list.
o Example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1
o Example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1300794057949944903
http://twitter.com/followers/ids.
format
從上面描述可以看到,http://twitter.com/followers/ids.xml 這個調用需要傳cursor參數來進行分頁,而不是傳統的 url?page=n&count=n的形式。這樣做有什麼優點呢?是否讓每個cursor保持一個當時數據集的鏡像?防止由於結果集實時改變而產生查詢結果有重復內容?
在Google Groups這篇Cursor Expiration討論中Twitter的架構師John Kalucki提到
代碼如下
復制代碼
A cursor is an opaque deletion-tolerant index into a Btree keyed by source
userid and modification time. It brings you to a point in time in the
reverse chron sorted list. So, since you can’t change the past, other than
erasing it, it’s effectively stable. (Modifications bubble to the top.) But
you have to deal with additions at the list head and also block shrinkage
due to deletions, so your blocks begin to overlap quite a bit as the data
ages. (If you cache cursors and read much later, you’ll see the first few
rows of cursor[n+1]’s block as duplicates of the last rows of cursor[n]’s
block. The intersection cardinality is equal to the number of deletions in
cursor[n]’s block). Still, there may be value in caching these cursors and
then heuristically rebalancing them when the overlap proportion crosses some
threshold.
在另外一篇new cursor-based pagination not multithread-friendly中John又提到
代碼如下
復制代碼
The page based approach does not scale with large sets. We can no
longer support this kind of API without throwing a painful number of
503s.
Working with row-counts forces the data store to recount rows in an O
(n^2) manner. Cursors avoid this issue by allowing practically
constant time access to the next block. The cost becomes O(n/
block_size) which, yes, is O(n), but a graceful one given n < 10^7 and
a block_size of 5000. The cursor approach provides a more complete and
consistent result set.
Proportionally, very few users require multiple page fetches with a
page size of 5,000.
Also, scraping the social graph repeatedly at high speed is could
often be considered a low-value, borderline abusive use of the social
graph API.
通過這兩段文字我們已經很清楚了,對於大結果集的數據,使用cursor方式的目的主要是為了極大地提高性能。還是拿MySQL為例說明,比如翻頁到100,000條時,不用cursor,對應的SQL為
select * from msgs limit 100000, 100
在一個百萬記錄的表上,第一次執行這條SQL需要5秒以上。
假定我們使用表的主鍵的值作為cursor_id, 使用cursor分頁方式對應的SQL可以優化為
select * from msgs where id > cursor_id limit 100;