CLOVER🍀

That was when it all began.

Applying the mecab-ipadic-neologd dictionary to Lucene Kuromoji

Note:
While applying mecab-ipadic-neologd to Lucene Kuromoji in this entry, I ran into two problems. The author, @overlast, has since addressed one of them.

For an entry that applies the fixed version of mecab-ipadic-neologd to Lucene Kuromoji, please see the following.

Applying the fixed mecab-ipadic-neologd dictionary to Lucene Kuromoji
http://d.hatena.ne.jp/Kazuhira/20150316/1426520209

This entry remains only as a record of what I did.

Please read the rest with that in mind. If you simply want to apply the dictionary to Lucene Kuromoji, see the entry above.


The other day, an entry that caught my attention appeared.

MeCab 用の新語辞書 mecab-ipadic-neologd を公開しました (Released mecab-ipadic-neologd, a neologism dictionary for MeCab)
http://diary.overlasting.net/2015-03-13-1.html

Apparently the author built a new dictionary by adding seed data on top of the IPA dictionary, which has not been updated for a long time. Impressive...

If a newer IPA dictionary is available, anyone who plays with Lucene will want to apply it to Kuromoji.

So I tried it! It took quite a bit of effort, though...

Installing the prerequisites

For this work, I first installed the required software by following the instructions linked from the original site.

mecab-ipadic-NEologd : Neologism dictionary for MeCab
https://github.com/neologd/mecab-ipadic-neologd/blob/master/README.ja.md

The requirements are a C++ compiler, iconv, MeCab, mecab-ipadic, and xz.

In my environment, the only thing missing apart from MeCab was the C++ compiler, so I installed g++.

$ sudo apt-get install g++

ć“ć“ć‹ć‚‰å…ˆćÆ态MeCabć®ć‚¤ćƒ³ć‚¹ćƒˆćƒ¼ćƒ«ć§ć™ć€‚

MeCabć‚’ć‚¤ćƒ³ć‚¹ćƒˆćƒ¼ćƒ«ć™ć‚‹

ć¾ćšćÆ态MeCabć‚’ć‚¤ćƒ³ć‚¹ćƒˆćƒ¼ćƒ«ć—ć¾ć™ć€‚

MeCab: Yet Another Part-of-Speech and Morphological Analyzer
http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html

ćŖć®ć§ć™ćŒć€ć“ć®ć‚µć‚¤ćƒˆć§ćƒ€ć‚¦ćƒ³ćƒ­ćƒ¼ćƒ‰ć§ćć‚‹mecab-0.996.tar.gzćÆ态tarćŒå£Šć‚Œć¦ć„ć‚‹ć‚ˆć†ćŖć®ć§å±•é–‹ć§ćć¾ć›ć‚“ć§ć—ćŸā€¦ć€‚

ä»•ę–¹ćŒćŖ恄恮恧态恓恓ćÆ仄äø‹ć‚’å‚č€ƒć«ć€å°‘ć—å‰ć®MeCabć‚’ć‚¤ćƒ³ć‚¹ćƒˆćƒ¼ćƒ«ć€‚

Mecabć®ć‚¤ćƒ³ć‚¹ćƒˆćƒ¼ćƒ«ćƒ”ćƒ¢
http://qiita.com/ShingoOikawa/items/175be8a472ec8ed8a707

今回ćÆ态MeCabć‚’ć‚·ć‚¹ćƒ†ćƒ ć‚°ćƒ­ćƒ¼ćƒćƒ«ć«ć‚¤ćƒ³ć‚¹ćƒˆćƒ¼ćƒ«ć—ćŸć„ć‚ć‘ć§ćÆćŖć„ć®ć§ć€ć‚¤ćƒ³ć‚¹ćƒˆćƒ¼ćƒ«å…ˆć‚’ęŒ‡å®šć—ć¾ć™ć€‚ć“ć“ć§ćÆ态怌$MECAB_HOME怍ćØčØ˜č¼‰ć—ć¾ć™ć€‚

$ wget http://mecab.googlecode.com/files/mecab-0.994.tar.gz
$ tar -zxvf mecab-0.994.tar.gz
$ cd mecab-0.994
$ ./configure --prefix=$MECAB_HOME
$ make
$ sudo make install

ć§ć€ć‚¤ćƒ³ć‚¹ćƒˆćƒ¼ćƒ«ć•ć‚ŒćŸMeCabć«ćƒ‘ć‚¹ć‚’é€šć—ć¾ć™ć€‚

$ export PATH=$MECAB_HOME/bin:$PATH

ē¢ŗčŖć€‚

$ mecab --version
mecab of 0.994

Next, install mecab-ipadic.

$ wget http://mecab.googlecode.com/files/mecab-ipadic-2.7.0-20070801.tar.gz
$ tar -zxvf mecab-ipadic-2.7.0-20070801.tar.gz
$ cd mecab-ipadic-2.7.0-20070801
$ ./configure --with-charset=utf-8
$ make
$ sudo make install

ć“ć‚Œć§ć€å…ˆć»ć©ć‚¤ćƒ³ć‚¹ćƒˆćƒ¼ćƒ«ć—ćŸMeCabć®ćƒ‡ć‚£ćƒ¬ć‚Æ惈ćƒŖ恫态IPAč¾žę›øć‚’å…ƒć«ć—ćŸč¾žę›øćŒć‚¤ćƒ³ć‚¹ćƒˆćƒ¼ćƒ«ć•ć‚Œć¾ć™ć€‚

$ ls -l $MECAB_HOME/lib/mecab/dic
合č؈ 4
drwxr-xr-x 2 root root 4096  3꜈ 15 01:42 ipadic

ć“ć“ć¾ć§ć§ć€MeCabć®ć‚¤ćƒ³ć‚¹ćƒˆćƒ¼ćƒ«ćÆēµ‚äŗ†ć§ć™ć€‚

Installing mecab-ipadic-neologd

Next, install mecab-ipadic-neologd.

Following the steps listed at the link below is all it takes.

mecab-ipadic-NEologd : Neologism dictionary for MeCab
https://github.com/neologd/mecab-ipadic-neologd/blob/master/README.ja.md

$ git clone https://github.com/neologd/mecab-ipadic-neologd.git
$ cd mecab-ipadic-neologd
$ git pull
$ ./bin/install-mecab-ipadic-neologd

When the script runs, it shows partway through how tokenization will change compared with the current default system dictionary.

default system dictonary     |	mecab-ipadic-neologd
リアル スコープ 	     |	リアルスコープ 
世界一 受け たい 授業 	     |	世界一受けたい授業 
め ちゃ イケ 		     |	めちゃイケ 
学校 の カイダン 	     |	学校のカイダン 
志村 動物 園 		     |	志村 動物園 
志村 どう ぶつ 園 	     |	志村どうぶつ園 
アド 街 		     |	アド街 
ドクター イエロー 	     |	ドクターイエロー 
中村 明日美 子 		     |	中村明日美子 
同級生 アニメ 化 	     |	同級生 アニメ化 
ふしぎ 発見 		     |	ふしぎ発見 
め ちゃ ギントン 	     |	めちゃギントン 
学校 の 階段 		     |	学校の階段

If you are happy to continue, answer "yes".

[install-mecab-ipadic-neologd] : Do you want to install mecab-ipadic-neologd? Type yes or no.
yes

Check:

$ mecab -d $MECAB_HOME/lib/mecab/dic/mecab-ipadic-neologd
きゃりーぱみゅぱみゅ
きゃりーぱみゅぱみゅ	名詞,固有名詞,一般,*,*,*,きゃりーぱみゅぱみゅ,キャリーパミュパミュ,キャリーパミュパミュ
EOS

It seems to have installed without trouble.

Note that the MeCab installed here is not used from this point on. What is actually needed is the contents of the build directory, which is generated in the current directory when mecab-ipadic-neologd is built.

$ ls -l build
合計 11928
drwxrwxr-x 2 xxxxx xxxxx     4096  3月 15 01:47 mecab-ipadic-2.7.0-20070801-neologd-20150313
-rw-rw-r-- 1 xxxxx xxxxx 12208105  3月 15 01:47 mecab-ipadic-2.7.0-20070801.tar.gz

Building Lucene

Now, on to Lucene.

ć¾ćšćÆLuceneć®ć‚½ćƒ¼ć‚¹ć‚³ćƒ¼ćƒ‰ć‚’svn exportć—ć¦ć€ćƒ“ćƒ«ćƒ‰ć‚’č”Œć„ć¾ć™ć€‚

$ svn export http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_5_0_0
$ cd lucene_solr_5_0_0/lucene
$ ant ivy-bootstrap
$ ant compile

"ant ivy-bootstrap" is unnecessary if Ivy is already installed for Ant.

From here on, the directory Lucene was exported to (/path/to/lucene_solr_5_0_0) is written as $LUCENE_SRC_HOME.

Next, build the dictionary that Kuromoji uses.

For now, I build with the default dictionary without changing anything.

$ cd analysis/kuromoji
$ ant regenerate

At this point, the build downloads the IPA dictionary.

It is extracted to the following location.

$ ls -l $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji
合計 53332
drwxrwxr-x 4 xxxxx xxxxx     4096  3月 15 01:56 classes
drwxrwxr-x 2 xxxxx xxxxx     4096  3月 15 01:56 mecab-ipadic-2.7.0-20070801
-rw-rw-r-- 1 xxxxx xxxxx 54599680  3月 15 01:56 mecab-ipadic-2.7.0-20070801.tar
lrwxrwxrwx 1 xxxxx xxxxx       84  3月 15 01:56 mecab-ipadic-2.7.0-20070801.tar.gz -> /xxxxx/.ivy2/cache/mecab/mecab-ipadic/.tar.gzs/ipadic-2.7.0-20070801..tar.gz

Let's place the dictionary source data created during the mecab-ipadic-neologd build next to these and build the dictionary Kuromoji uses.

Building the Kuromoji dictionary and Kuromoji with the mecab-ipadic-neologd dictionary

The Lucene Kuromoji dictionary builder appears to process the CSV files (extension ".csv") under the specified directory.

Now copy the intermediate output of mecab-ipadic-neologd created earlier into Kuromoji's build directory.

$ cp -Rp [directory where mecab-ipadic-neologd was built]/build/mecab-ipadic-2.7.0-20070801-neologd-20150313 $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji

Then edit Kuromoji's build.xml.

Change ipadic.version in build.xml so that the copied mecab-ipadic-neologd intermediate output is used instead of the default IPA dictionary (this property also determines the directory name).

  <!-- <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" /> -->
  <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801-neologd-20150313" />

The dictionary used this time (that is, its CSV files) is written in UTF-8, so change the encoding from the default EUC-JP.

  <!-- <property name="dict.encoding" value="euc-jp"/> -->
  <property name="dict.encoding" value="utf-8"/>
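
To see why this setting matters, here is a minimal sketch in Java (the language Kuromoji's dictionary tools are written in). It is not the builder's actual code: the class name DictEncodingDemo and the sample word are assumptions made for illustration. It decodes the same UTF-8 bytes once as UTF-8 and once as EUC-JP, which is roughly what a reader configured with the wrong dict.encoding would do to every line of the seed CSV.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DictEncodingDemo {
    // Decodes raw bytes with the named charset, the way a reader configured
    // with that encoding would interpret a line of a dictionary CSV.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // A two-character Japanese word, written with Unicode escapes so the
        // source file itself stays ASCII (hypothetical sample data).
        String surface = "\u5de5\u85e4";
        byte[] utf8Bytes = surface.getBytes(StandardCharsets.UTF_8);

        // Matching encoding: the text round-trips.
        System.out.println(decode(utf8Bytes, "utf-8").equals(surface));  // true
        // Mismatched encoding: the same bytes decode to different text.
        System.out.println(decode(utf8Bytes, "euc-jp").equals(surface)); // false
    }
}
```

With the wrong encoding the build may not fail outright; it would simply read garbage surface forms, which is why the property has to follow the encoding of the CSV files.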

The build-dict target no longer needs to download the dictionary, so remove download-dict from its depends.

  <!-- <target name="build-dict" depends="compile-tools, download-dict"> -->
  <target name="build-dict" depends="compile-tools">

When fed these CSV files, the dictionary builder runs out of memory with the default heap size (1G), so increase it. I set it to 2G, which left plenty of headroom this time.

      <!-- TODO: optimize the dictionary construction a bit so that you don't need 1G -->
      <!-- <java fork="true" failonerror="true" maxmemory="1g" classname="org.apache.lucene.analysis.ja.util.DictionaryBuilder"> -->
      <java fork="true" failonerror="true" maxmemory="2g" classname="org.apache.lucene.analysis.ja.util.DictionaryBuilder">

恧ćÆć€č¾žę›øć‚’ćƒ“ćƒ«ćƒ‰ć—ć¦ćæć¾ć—ć‚‡ć†ļ¼

$ ant regenerate

ć—ć°ć‚‰ćå¾…ć£ć¦ć„ć‚‹ćØć€ć‚³ć‚±ć¾ć™ā€¦ć€‚

     [java] building tokeninfo dict...
     [java]   parse...
     [java]   sort...
     [java]   encode...
     [java] Exception in thread "main" java.lang.AssertionError
     [java] 	at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:129)
     [java] 	at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:143)
     [java] 	at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:78)
     [java] 	at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
     [java] 	at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)

Let's look at where it failed.

It seems to be tripping this assert.

      assert baseForm.length() < 16;

https://github.com/apache/lucene-solr/blob/lucene_solr_5_0_0/lucene/analysis/kuromoji/src/tools/java/org/apache/lucene/analysis/ja/util/BinaryDictionaryWriter.java#L129

It seems the BaseForm must be at most 15 characters long. Is this part of Kuromoji's specification?

Let's look at the format of a MeCab dictionary entry.

単語の追加方法 (How to add words)
http://mecab.googlecode.com/svn/trunk/mecab/doc/dic.html

The format looks like this.

surface form,left context ID,right context ID,cost,part of speech,POS subcategory 1,POS subcategory 2,POS subcategory 3,conjugation form,conjugation type,base form,reading,pronunciation

An example:

工藤,1223,1223,6058,名詞,固有名詞,人名,名,*,*,くどう,クドウ,クドウ

BaseForm is the element at index 10 (counting from 0), so it is the "base form" (原形) column. The base form must not exceed 15 characters, then.

Well, given that MeCab itself can load this dictionary, it is presumably a Kuromoji limitation.

ć“ć®åˆ¶é™ć®ē†ē”±ć‚„č©³ē“°ćÆć”ć‚ƒć‚“ćØč¦‹ć‚Œć¦ć„ć¾ć›ć‚“ćŒć€ćØć‚Šć‚ćˆćšć“ć®åˆ¶é™ć‚’å¤–ć—ć¦č©¦ć—ć¦ćæć¾ć—ć‚‡ć†ć€‚

      // assert baseForm.length() < 16;

Run it.

$ ant regenerate

ć¾ćŸć‚³ć‚±ć¾ć—ćŸā€¦ć€‚

     [java] building tokeninfo dict...
     [java]   parse...
     [java]   sort...
     [java]   encode...
     [java] Exception in thread "main" java.lang.AssertionError
     [java] 	at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:122)
     [java] 	at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:143)
     [java] 	at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:78)
     [java] 	at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
     [java] 	at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)

ć¾ćŸć‚³ć‚±ćŸē®‡ę‰€ć‚’ē¢ŗčŖć—恦ćæć¾ć™ć€‚

    String existing = posDict.get(leftId);
    assert existing == null || existing.equals(fullPOSData);

https://github.com/apache/lucene-solr/blob/lucene_solr_5_0_0/lucene/analysis/kuromoji/src/tools/java/org/apache/lucene/analysis/ja/util/BinaryDictionaryWriter.java#L122

ć“ć“ć€ć‚½ćƒ¼ć‚¹ć‚’čŖ­ć‚€ćØć€č¾žę›øć®å…ƒćƒć‚æ恮CSV恮仄äø‹ć‚’ę–‡å­—åˆ—é€£ēµć—ćŸć‚‚ć®ćØ

å“č©ž,å“č©žē“°åˆ†é”ž1,å“č©žē“°åˆ†é”ž2,å“č©žē“°åˆ†é”ž3,ę“»ē”Øå½¢,ę“»ē”Ø型

仄äø‹ć®ćƒšć‚¢ćŒē•°ćŖ悋ēµ„ćæåˆć‚ć›ćŒć‚ć‚‹ćØ态ē™ŗē”Ÿć™ć‚‹ć‚ˆć†ć§ć™ć€‚

å·¦ę–‡č„ˆID,å³ę–‡č„ˆID

ä¾‹ćˆć°ć€å·¦ę–‡č„ˆID恌怌1288ć€ć§ć‚ć‚‹ę™‚ć«ć€ć™ć§ć«ä»„äø‹ć®ēµ„ćæåˆć‚ć›ćŒå‡ŗē¾ć—恦恄悋恮恫

åč©ž-å›ŗęœ‰åč©ž-äø€čˆ¬

åŒć˜å·¦ę–‡č„ˆID恫

åč©ž-å›ŗęœ‰åč©ž-äŗŗ名

恌ē™»å “恗恟悊恙悋ćØ态ē™ŗē”Ÿć—ć¾ć™ć€‚

ć•ć™ćŒć«ć€ć“ć‚ŒćÆč‰Æ恏ćŖć„ę°—ćŒć—ć¾ć™ā€¦ć€‚

ćØć‚Šć‚ćˆćšå¤‰ę›“ć—ćŸč¾žę›øä½œęˆćƒ„ćƒ¼ćƒ«ć®ć‚½ćƒ¼ć‚¹ć‚³ćƒ¼ćƒ‰ćÆ态15ę–‡å­—ć¾ć§ć®åˆ¶é™č§£é™¤ć‚‚å«ć‚ć¦ć€å…ƒć«ęˆ»ć—ć¾ć—ćŸć€‚

mecab-ipadic-neologdć®ć‚·ćƒ¼ćƒ‰ć‚’č£œę­£ć™ć‚‹

ćØćŖ悋ćØ态mecab-ipadic-neologdå“ć®ć‚·ćƒ¼ćƒ‰ć®CSVćƒ•ć‚”ć‚¤ćƒ«ć‚’äæ®ę­£ć—ćŸę–¹ćŒć‚ˆć•ćć†ć§ć™ć­ć€‚ISSUEę›ø恓恆恋ćØć‚‚ę€ć„ć¾ć—ćŸćŒć€Kuromojić®éƒ½åˆćŖćØć“ć‚ć‚‚ć‚ć‚‹ę°—ćŒć™ć‚‹ć®ć§ā€¦ć€‚

ć¾ćšć€é€šåøø恮IPAč¾žę›øć®å†…å®¹ć®CSVć‹ć‚‰ć€å·¦ę–‡č„ˆID,å³ę–‡č„ˆID,å“č©ž,å“č©žē“°åˆ†é”ž1,å“č©žē“°åˆ†é”ž2,å“č©žē“°åˆ†é”ž3,ę“»ē”Øå½¢,ę“»ē”Øåž‹ć‚’å–ć‚Šå‡ŗć—ć¾ć™ć€‚

$ find $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/mecab-ipadic-2.7.0-20070801-neologd-20150313/*.csv | \
> grep -v 'mecab-user-dict-seed' | \
> xargs cat | \
> perl -wanl -F, -e 'print "$F[1],$F[2],$F[4],$F[5],$F[6],$F[7],$F[8]"' | \
> sort -n | \
> uniq > ipadic-id-with-part-of-speech.csv

The mecab-ipadic-neologd seed file is excluded.

The result is a file like this.

$ head ipadic-id-with-part-of-speech.csv 
1,1,その他,間投,*,*,*
2,2,フィラー,*,*,*,*
3,3,感動詞,*,*,*,*
4,4,記号,アルファベット,*,*,*
5,5,記号,一般,*,*,*
6,6,記号,括弧開,*,*,*
7,7,記号,括弧閉,*,*,*
8,8,記号,句点,*,*,*
9,9,記号,空白,*,*,*
10,10,記号,読点,*,*,*
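
For readers who prefer not to use a perl one-liner, the same projection can be sketched in Java. This is an assumption-laden equivalent, not the command I ran: the class PosIdTable is hypothetical, the sample rows use ASCII placeholders, and the TreeSet orders rows lexicographically, whereas "sort -n" orders them numerically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class PosIdTable {
    // Keeps columns 1, 2 and 4..8 of one CSV line, like the perl print statement.
    static String project(String csvLine) {
        String[] f = csvLine.split(",");
        return String.join(",", f[1], f[2], f[4], f[5], f[6], f[7], f[8]);
    }

    // Projects every line and removes duplicates (the role of sort | uniq).
    static List<String> distinctRows(List<String> csvLines) {
        TreeSet<String> rows = new TreeSet<>();
        for (String line : csvLines) {
            rows.add(project(line));
        }
        return new ArrayList<>(rows);
    }

    public static void main(String[] args) {
        // Hypothetical entries: two words sharing the same IDs and POS columns.
        List<String> lines = Arrays.asList(
            "w1,5,5,100,symbol,general,*,*,*,*,w1,W1,W1",
            "w2,5,5,200,symbol,general,*,*,*,*,w2,W2,W2");
        distinctRows(lines).forEach(System.out::println);
    }
}
```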

Using this as the baseline, I wrote a script that corrects the left and right context IDs in the mecab-ipadic-neologd seed CSV. For some reason the script is in Groovy.
* Come to think of it, I never considered whether the IDs or the part of speech should be treated as authoritative...

transform.groovy
def partOfSpeechCsv = args[0]
def inputCsv = args[1]
def outputCsv = args[2]

def maxContext = 1

def partOfSpeechMap = [:]
new File(partOfSpeechCsv).eachLine { line ->
  def tokens = line.split(/,/)
  def contexts = [tokens[0], tokens[1]]
  def partOfSpeech = tokens.drop(2).join('-')
  partOfSpeechMap[partOfSpeech] = contexts

  if ((tokens[0] as int) > maxContext) {
    maxContext = tokens[0] as int
  }
}

new File(outputCsv).withWriter('UTF-8') { writer ->
  new File(inputCsv).eachLine('UTF-8') { line ->
    def tokens = line.split(/,/)

    if (tokens[10].length() >= 16) {
      println("[WARN] Discard, BaseForm length is 16 or greater. => [${tokens[10]}]")
      return
    }
   
    def contexts = [tokens[1], tokens[2]]
    def partOfSpeech = "${tokens[4]}-${tokens[5]}-${tokens[6]}-${tokens[7]}-${tokens[8]}"

    def leftContext
    def rightContext

    def ipadicContext = partOfSpeechMap[partOfSpeech]
    if (ipadicContext == null) {
      maxContext++
      leftContext = maxContext
      rightContext = maxContext
      partOfSpeechMap[partOfSpeech] = [leftContext as String, rightContext as String]
    } else if (ipadicContext != contexts) {
      leftContext = ipadicContext[0] as int
      rightContext = ipadicContext[1] as int
    } else {
      leftContext = contexts[0] as int
      rightContext = contexts[1] as int
    }

    writer.write(tokens[0])
    writer.write(',')
    writer.write(leftContext as String)
    writer.write(',')
    writer.write(rightContext as String)
    writer.write(',')

    writer.write(tokens.drop(3).join(','))
    writer.newLine()
  }
}

If ipadic-id-with-part-of-speech.csv, the file created earlier, contains the same part-of-speech columns but the left and right context IDs differ, the IDs are aligned with the IPA dictionary. If the combination is not registered there, it gets the largest context ID seen so far incremented by one (a rough approach).

Entries whose base form exceeds 15 characters are discarded as out of scope, and the script prints a warning for each one...

恂ćØ态CSVć‚’åˆ†č§£ć™ć‚‹éš›ć«ę°—ę„½ć«split悒ä½æć£ć¦ć„ć¾ć™ćŒć€ęœ¬å½“ćŖ悉Kuromoji恮äø­ć«ć‚ć‚‹CSVUtil#parse悒ä½æē”Øć™ć‚‹ć®ćŒē¢ŗå®Ÿć ć£ćŸć‹ć‚‚ć—ć‚Œć¾ć›ć‚“ć€‚

恧ćÆć€ä½œęˆć€‚

$ groovy transform.groovy \
> ipadic-id-with-part-of-speech.csv \
> $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/mecab-ipadic-2.7.0-20070801-neologd-20150313/mecab-user-dict-seed.20150313.csv \
> $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/mecab-ipadic-2.7.0-20070801-neologd-20150313/mecab-user-dict-seed.20150313_revised.csv

ćŖćŠć€åŽŸå½¢ćŒ15ę–‡å­—ć‚’č¶…ćˆć‚‹ć‚‚ć®ćÆ态66707å€‹ć‚ć‚Šć¾ć—ćŸā€¦ć€‚

恓悌恧态$LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/mecab-ipadic-2.7.0-20070801-neologd-20150313ćƒ‡ć‚£ćƒ¬ć‚Æ惈ćƒŖ配äø‹ć«ć€mecab-user-dict-seed.20150313_revised.csvćØć„ć†åå‰ć®ćƒ•ć‚”ć‚¤ćƒ«ćŒć§ćć¾ć™ć€‚

å…ƒć®ć‚·ćƒ¼ćƒ‰ć®CSVćÆ态恄悉ćŖ恏ćŖć£ćŸć®ć§å‰Šé™¤ć—ć¾ć™ć€‚

$ rm $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/mecab-ipadic-2.7.0-20070801-neologd-20150313/mecab-user-dict-seed.20150313.csv

This time for sure. Pulling myself together, run it!

$ ant regenerate

It took about 3 minutes on my PC, but this time it seems to have worked.

regenerate:

BUILD SUCCESSFUL
Total time: 3 minutes 17 seconds

Finally, build Kuromoji.

$ ant jar-core

This produces a Lucene Kuromoji JAR file that uses the new dictionary.

jar-core:
      [jar] Building jar: $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/lucene-analyzers-kuromoji-5.0.0-SNAPSHOT.jar

BUILD SUCCESSFUL
Total time: 5 seconds

Let's use it to check the behavior.

Writing a program that uses Kuromoji

First, let's write an ordinary program that uses Kuromoji.

Please excuse the build tool being sbt.
build.sbt

name := "lucene-kuromoji-mecab-neologd"

version := "0.0.1-SNAPSHOT"

scalaVersion := "2.11.5"

organization := "org.littlewings"

updateOptions := updateOptions.value.withCachedResolution(true)

scalacOptions ++= Seq("-Xlint", "-unchecked", "-deprecation", "-feature")

libraryDependencies ++= Seq(
  "org.apache.lucene" % "lucene-core" % "5.0.0",
  "org.apache.lucene" % "lucene-analyzers-common" % "5.0.0",
  "org.apache.lucene" % "lucene-analyzers-kuromoji" % "5.0.0"
)

Lucene is 5.0.0.

Please also excuse the source code being Scala.
src/main/scala/org/littlewings/lucene/kuromoji/KuromojiWithNeologd.scala

package org.littlewings.lucene.kuromoji

import org.apache.lucene.analysis.ja.JapaneseAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute

object KuromojiWithNeologd {
  def main(args: Array[String]): Unit = {
    val texts = Array(
      "すもももももももものうち",
      "きゃりーぱみゅぱみゅ",
      "日本経済新聞でモバゲーの記事を読んだ",
      "くりーむしちゅー",
      "艦隊これくしょん"
    )

    val analyzer = new JapaneseAnalyzer

    for (text <- texts) {
      val tokenStream = analyzer.tokenStream("", text)

      val charTermAttr = tokenStream.addAttribute(classOf[CharTermAttribute])

      tokenStream.reset()

      val tokens =
        Iterator
          .continually(tokenStream.incrementToken())
          .takeWhile(identity)
          .map(_ => charTermAttr.toString)

      println(s"InputText = $text")
      println(s"  Tokenized = ${tokens.mkString("[", ", ", "]")}")

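      // The TokenStream contract expects end() after the last incrementToken() and before close()
      tokenStream.end()
      tokenStream.close()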
    }
  }
}

The sentences and words to analyze were chosen arbitrarily. Kuromoji runs in its default SEARCH mode.

Let's run this program.

> run
[info] Running org.littlewings.lucene.kuromoji.KuromojiWithNeologd 
InputText = すもももももももものうち
  Tokenized = [すもも, もも, もも]
InputText = きゃりーぱみゅぱみゅ
  Tokenized = [く, ー, ぱみゅぱみゅ]
InputText = 日本経済新聞でモバゲーの記事を読んだ
  Tokenized = [日本, 日本経済新聞, 経済, 新聞, モバゲ, 記事, 読む]
InputText = くりーむしちゅー
  Tokenized = [くり, ー, むし, ちる, ゅ, ー]
InputText = 艦隊これくしょん
  Tokenized = [艦隊, くい]
[success] Total time: 1 s, completed 2015/03/15 2:57:30

Naturally, the analyzer does not know the newer words, so the results are quite poor.

Now let's use the Kuromoji JAR file built with the new dictionary.
First, exit sbt.

> exit

Create a lib directory.

$ mkdir lib

Put the built JAR file into it.

$ cp $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/lucene-analyzers-kuromoji-5.0.0-SNAPSHOT.jar lib/

Remove Kuromoji from the sbt dependency definitions.

libraryDependencies ++= Seq(
  "org.apache.lucene" % "lucene-core" % "5.0.0",
  "org.apache.lucene" % "lucene-analyzers-common" % "5.0.0"
  // "org.apache.lucene" % "lucene-analyzers-kuromoji" % "5.0.0"
)

恧ćÆć€å†åŗ¦sbtć‚’čµ·å‹•ć—ć¦ć€å®Ÿč”Œć€‚

> run
[info] Running org.littlewings.lucene.kuromoji.KuromojiWithNeologd 
InputText = すもももももももものうち
  Tokenized = [すもももももも, すもももももももものうち, もも]
InputText = きゃりーぱみゅぱみゅ
  Tokenized = [きゃりーぱみゅぱみゅ]
InputText = 日本経済新聞でモバゲーの記事を読んだ
  Tokenized = [日本, 日本経済新聞, 経済, 新聞, mobage, 記事, 読む]
InputText = くりーむしちゅー
  Tokenized = [くりーむしちゅー]
InputText = 艦隊これくしょん
  Tokenized = [艦隊これくしょん]
[success] Total time: 1 s, completed 2015/03/15 3:02:09

The results changed a lot! It even recognizes きゃりーぱみゅぱみゅ as a single word! And モバゲー became mobage...

By the way, the result for "すもももももももものうち" turned out strange.

InputText = すもももももももものうち
  Tokenized = [すもももももも, すもももももももものうち, もも]

Wondering what was going on, I looked at the seed CSV

$ view [directory where mecab-ipadic-neologd was built]/build/mecab-ipadic-2.7.0-20070801-neologd-20150313/mecab-user-dict-seed.20150313.csv

and found the following definitions...

すももももも,1288,1288,5072,名詞,固有名詞,一般,*,*,*,すももももも,スモモモモモ,スモモモモモ
すもももももも,1288,1288,4587,名詞,固有名詞,一般,*,*,*,すもももももも,スモモモモモモ,スモモモモモモ
すもももももも〜地上最強のヨメ〜,1288,1288,3763,名詞,固有名詞,一般,*,*,*,すもももももも〜地上最強のヨメ〜,スモモモモモモチジョウサイキョウノヨメ,スモモモモモモチジョウサイキョウノヨメ

So that is what it matched...

And on top of that, there is even an entry like this...

すもももももももものうち,1288,1288,4143,名詞,固有名詞,一般,*,*,*,すもももももももものうち,スモモモモモモモモノウチ,スモモモモモモモモノウチ

Is this really a noun??

Well, it runs, so I suppose I will call that good enough for now...

Closing thoughts

I managed to get it working, but using the dictionary with Lucene's Kuromoji took a fair amount of effort.

The mismatch between left/right context IDs and part-of-speech combinations is one thing, but I had no idea about the 15-character limit on the base form. I may look into the Kuromoji source later.

Also, this time I discarded entries whose base form is longer than 15 characters at import time, but perhaps they deserve a proper look. The way I split the CSV is also unfinished business.

The result may be rather contorted, but the goal was achieved, so that is it for now.

The source code and script written for this entry are available here.
https://github.com/kazuhira-r/lucene-examples/tree/master/lucene-kuromoji-mecab-neologd

All the same, I am curious how this dictionary is actually put together.