Darts-clone Q&A

Q: What is the maximum number of strings (keys) that Darts-clone can store?

A double-array is stored in a single array, and the size of that array must be less than 2^29 (= 536,870,912). The array size is always greater than the number of keys, so the maximum number of keys is also less than 2^29.

The actual limit depends on the keys and values. In general, the array size is proportional to the number of keys, and longer keys require a larger array. The number of distinct values also affects the array size: if there are only a few distinct values, the array will be smaller.

You can estimate the actual limit by building a dictionary from a subset of your keys and checking the resulting array size.

The following are examples (`keys`: the number of keys, `size`: the array size):

Word keys (the average length is 13 bytes).

## All values are zero.
$ mkdarts -t ~/corpus/1gm.zero.1m 1gm.zero.1m.darts
keys: 1000000
total: 12740688
Making double-array: 100% |*******************************************|
size: 1861632
total_size: 7446528
## Unique values.
$ mkdarts ~/corpus/1gm.uniq.1m 1gm.uniq.1m.darts
keys: 1000000
total: 12740688
Making double-array: 100% |*******************************************|
size: 4885248
total_size: 19540992

URL keys (the average length is 53 bytes).

## Unique values.
$ mkdarts ~/corpus/urls.uniq.1m urls.uniq.1m.darts
keys: 1000000
total: 53166751
Making double-array: 100% |*******************************************|
size: 18637312
total_size: 74549248
## All values are zero.
$ mkdarts -t ~/corpus/urls.zero.1m urls.zero.1m.darts
keys: 1000000
total: 53166751
Making double-array: 100% |*******************************************|
size: 11225344
total_size: 44901376
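
The estimate suggested above can be sketched as a simple linear extrapolation from the sample builds. This is only a rough sketch: it assumes the array size keeps growing in proportion to the number of keys, which the answer says holds in general but which is not guaranteed for a full key set. The function name `estimate_max_keys` and the `samples` table are illustrative and not part of Darts-clone; the numbers are the `keys` and `size` lines from the outputs above.

```python
# Limit stated in the answer: the array size must be less than 2^29.
MAX_ARRAY_SIZE = 2 ** 29


def estimate_max_keys(sample_keys: int, sample_size: int) -> int:
    """Extrapolate linearly from a sample build to the key count that
    would reach the array-size limit (an approximation, not a guarantee)."""
    # The sample needs sample_size / sample_keys array elements per key.
    return MAX_ARRAY_SIZE * sample_keys // sample_size


# (keys, size) pairs copied from the mkdarts outputs above.
samples = {
    "1gm.zero.1m": (1_000_000, 1_861_632),
    "1gm.uniq.1m": (1_000_000, 4_885_248),
    "urls.uniq.1m": (1_000_000, 18_637_312),
    "urls.zero.1m": (1_000_000, 11_225_344),
}

for name, (keys, size) in samples.items():
    print(f"{name}: about {estimate_max_keys(keys, size) / 1e6:.0f}M keys")
```

With these samples the estimates range from roughly 29 million keys (URL keys with unique values) to roughly 288 million keys (word keys with all-zero values).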

Marisa-trie Q&A

Q: How can I know IDs when I create a keyset? Or should I reread the whole dictionary after build()?

As you mentioned, the "reread all the keys" approach is the answer. IDs are assigned during construction and depend on the structure of the constructed trie, so it is difficult to predict them before construction.

Please note that the IDs will change if you use the MARISA_LABEL_ORDER option.

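grn_ts: The filter expression can now be omitted

I have decided to write up small details here that are not worth a post on the Groonga blog.

grn_ts is enabled by putting '?' at the beginning of --filter, but until now a filter expression had to follow the '?'.

With this change, --filter '?' alone is enough to enable grn_ts. Applying the filter is now actually skipped as well, but the implementation already skipped evaluation when the expression was the constant true, so the effect on speed is negligible.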

$ groonga /tmp/groonga/db
> select Table --filter '?true' --output_columns '_id' --limit 1
[[0,1448888003.34893,0.0627093315124512],[[[10000000],[["_id","UInt32"]],[1]]]]
> select Table --filter '?' --output_columns '_id' --limit 1
[[0,1448887983.34983,0.0628361701965332],[[[10000000],[["_id","UInt32"]],[1]]]]