Darts-clone Q&A
Q: What is the maximum number of strings (keys) that Darts-clone can handle?
A double-array is stored in a single array whose size must be less than 2^29 (about 536M) elements. The array is always larger than the number of keys, so the maximum number of keys is also less than 2^29.
The actual limit depends on the keys and values. In general, the array size is roughly proportional to the number of keys, and longer keys require a larger array. The number of distinct values also affects the array size: if there are few distinct values, the array will be smaller.
You can estimate the actual limit by building a double-array from a sample of your keys and extrapolating from its size.
The following are examples (`keys`: the number of keys, `size`: the array size):
Word keys (the average length is 13 bytes).
    ## All values are zero.
    $ mkdarts -t ~/corpus/1gm.zero.1m 1gm.zero.1m.darts
    keys: 1000000
    total: 12740688
    Making double-array: 100% |*******************************************|
    size: 1861632
    total_size: 7446528

    ## Unique values.
    $ mkdarts ~/corpus/1gm.uniq.1m 1gm.uniq.1m.darts
    keys: 1000000
    total: 12740688
    Making double-array: 100% |*******************************************|
    size: 4885248
    total_size: 19540992
URL keys (the average length is 53 bytes).
    ## Unique values.
    $ mkdarts ~/corpus/urls.uniq.1m urls.uniq.1m.darts
    keys: 1000000
    total: 53166751
    Making double-array: 100% |*******************************************|
    size: 18637312
    total_size: 74549248

    ## All values are zero.
    $ mkdarts -t ~/corpus/urls.zero.1m urls.zero.1m.darts
    keys: 1000000
    total: 53166751
    Making double-array: 100% |*******************************************|
    size: 11225344
    total_size: 44901376
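The sampling approach described above can be turned into a rough calculation. The following is a minimal sketch, not part of Darts-clone: the function name `estimate_max_keys` is hypothetical, and it assumes the array size keeps growing linearly with the number of keys, which the answer only describes as true "in general". The 4-bytes-per-element figure is inferred from the runs above, where `total_size` is always four times `size`.

```python
# Hard limit on the array size, as stated in the answer above.
MAX_SIZE = 2 ** 29
# Inferred from the mkdarts outputs: total_size == 4 * size in every run.
BYTES_PER_ELEMENT = 4


def estimate_max_keys(sample_keys: int, sample_size: int) -> int:
    """Extrapolate the key limit from one sample build.

    Assumes (hypothetically) that size / keys stays constant as the
    key set grows, so the result is only a rough estimate.
    """
    return MAX_SIZE * sample_keys // sample_size


# (keys, size) pairs copied from the mkdarts runs above, by corpus file.
samples = {
    "1gm.zero.1m": (1_000_000, 1_861_632),
    "1gm.uniq.1m": (1_000_000, 4_885_248),
    "urls.zero.1m": (1_000_000, 11_225_344),
    "urls.uniq.1m": (1_000_000, 18_637_312),
}

estimates = {
    name: estimate_max_keys(keys, size) for name, (keys, size) in samples.items()
}

for name, limit in estimates.items():
    print(f"{name}: roughly {limit:,} keys before the array reaches 2^29")
```

Under this assumption the estimated limit ranges from a few hundred million short word keys down to a few tens of millions of long URL keys, which is why sampling your own data matters more than any single published figure.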
Marisa-trie Q&A
Q: How can I know IDs when I create a keyset? Or should I reread the whole dictionary after build()?
As you mentioned, the "reread all the keys" approach is the answer. IDs are allocated during construction and depend on the structure of the constructed trie, so it is difficult to predict them before construction.
Please note that IDs will change if you use the MARISA_LABEL_ORDER option.
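The two-pass pattern can be sketched as follows. This is a library-agnostic illustration, not marisa-trie code: `build_toy_index` is a hypothetical stand-in that assigns IDs by sorted order, which is not how marisa-trie assigns them. It only mimics the relevant property, namely that an ID is known only after the whole key set has been processed. With the real library, the `lookup` callable would be replaced by the trie's own lookup call.

```python
from typing import Callable, Dict, Iterable, List


def build_toy_index(keys: Iterable[str]) -> Callable[[str], int]:
    """Hypothetical stand-in for trie construction (NOT marisa-trie's algorithm).

    IDs depend on the complete key set, so they cannot be known
    while the keys are still being collected.
    """
    ordered = sorted(set(keys))
    ids = {key: key_id for key_id, key in enumerate(ordered)}
    return ids.__getitem__


def reread_ids(keys: Iterable[str], lookup: Callable[[str], int]) -> Dict[str, int]:
    """Second pass: go through every key again and ask the built index for its ID."""
    return {key: lookup(key) for key in keys}


# First pass: collect the keys and build the index.
keys: List[str] = ["banana", "apple", "cherry"]
lookup = build_toy_index(keys)

# Second pass: reread all the keys to learn their IDs.
key_to_id = reread_ids(keys, lookup)
print(key_to_id)
```

Keeping the second pass separate also makes it easy to rebuild any ID-indexed side tables whenever the build options change.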
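grn_ts: The filter expression can now be omitted
I have decided to write here about minor details that are not worth a post on the Groonga blog.
grn_ts is enabled by putting '?' at the start of the --filter value, but until now a filter expression had to follow the '?'.
With this change, --filter '?' alone is enough to enable grn_ts. Applying the filter is now actually skipped as well, but the implementation already skipped evaluation when the expression was the constant true, so there is almost no effect on speed.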
    $ groonga /tmp/groonga/db
    > select Table --filter '?true' --output_columns '_id' --limit 1
    [[0,1448888003.34893,0.0627093315124512],[[[10000000],[["_id","UInt32"]],[1]]]]
    > select Table --filter '?' --output_columns '_id' --limit 1
    [[0,1448887983.34983,0.0628361701965332],[[[10000000],[["_id","UInt32"]],[1]]]]
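To read the two responses side by side, here is a small sketch that parses them and compares the elapsed times. It assumes the usual Groonga response layout, in which the header is [return code, start time, elapsed seconds] and the body of select begins with the hit count. The helper name `summarize` is hypothetical, and a single pair of runs is only a rough indication, not a benchmark.

```python
import json

# Responses copied verbatim from the session above.
WITH_TRUE = '[[0,1448888003.34893,0.0627093315124512],[[[10000000],[["_id","UInt32"]],[1]]]]'
BARE = '[[0,1448887983.34983,0.0628361701965332],[[[10000000],[["_id","UInt32"]],[1]]]]'


def summarize(response: str):
    """Return (return_code, elapsed_seconds, n_hits) for a select response."""
    header, body = json.loads(response)
    return_code, _started_at, elapsed = header
    n_hits = body[0][0][0]  # first result set -> [n_hits] -> n_hits
    return return_code, elapsed, n_hits


rc_true, elapsed_true, hits_true = summarize(WITH_TRUE)
rc_bare, elapsed_bare, hits_bare = summarize(BARE)

# Difference between --filter '?' and --filter '?true', in milliseconds.
diff_ms = (elapsed_bare - elapsed_true) * 1000
print(f"hits: {hits_true} vs {hits_bare}, difference: {diff_ms:.3f} ms")
```

Both queries return the same 10,000,000 hits, and the elapsed times differ by well under a millisecond, which is consistent with the statement that the change has almost no effect on speed.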