Note:
While applying mecab-ipadic-neologd to Lucene Kuromoji in this entry, I ran into two problems. The author, @overlast, has since addressed one of them.
For the entry that applies the fixed mecab-ipadic-neologd to Lucene Kuromoji, please see the following.
Applying the fixed mecab-ipadic-neologd dictionary to Lucene Kuromoji
http://d.hatena.ne.jp/Kazuhira/20150316/1426520209
This entry is kept only as a memo for the record.
Please read the rest with that in mind. If you simply want to apply the dictionary to Lucene Kuromoji, see the entry above.
The other day, an entry that caught my attention appeared.
Released mecab-ipadic-neologd, a neologism dictionary for MeCab
http://diary.overlasting.net/2015-03-13-1.html
It adds seed data to the IPA dictionary, which has not been updated for a long time, to build a new dictionary. Impressive.
If a newer IPA dictionary is available, anyone who plays with Lucene will want to apply it to Kuromoji.
So I tried it. It took quite a bit of effort, though.
Installing the prerequisites
To do this, I first installed the required software by following the instructions linked from the site above.
mecab-ipadic-NEologd : Neologism dictionary for MeCab
https://github.com/neologd/mecab-ipadic-neologd/blob/master/README.ja.md
The requirements are a C++ compiler, iconv, MeCab, mecab-ipadic, and xz.
In my environment, apart from MeCab itself, only the C++ compiler was missing, so I installed g++.
$ sudo apt-get install g++
From here on, the steps install MeCab.
Installing MeCab
First, install MeCab.
MeCab: Yet Another Part-of-Speech and Morphological Analyzer
http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html
However, the mecab-0.996.tar.gz available from this site appeared to be a corrupted tar archive, and I could not extract it.
So I installed a slightly older MeCab instead, using the following as a reference.
Mecab installation notes
http://qiita.com/ShingoOikawa/items/175be8a472ec8ed8a707
I do not want to install MeCab system-wide this time, so I specify an install prefix. Below, I refer to it as "$MECAB_HOME".
$ wget http://mecab.googlecode.com/files/mecab-0.994.tar.gz
$ tar -zxvf mecab-0.994.tar.gz
$ cd mecab-0.994
$ ./configure --prefix=$MECAB_HOME
$ make
$ sudo make install
Then add the installed MeCab to the PATH.
$ export PATH=$MECAB_HOME/bin:$PATH
Check.
$ mecab --version
mecab of 0.994
Next, install mecab-ipadic.
$ wget http://mecab.googlecode.com/files/mecab-ipadic-2.7.0-20070801.tar.gz
$ tar -zxvf mecab-ipadic-2.7.0-20070801.tar.gz
$ cd mecab-ipadic-2.7.0-20070801
$ ./configure --with-charset=utf-8
$ make
$ sudo make install
This installs a dictionary based on the IPA dictionary into the directory of the MeCab installed above.
$ ls -l $MECAB_HOME/lib/mecab/dic
合計 4
drwxr-xr-x 2 root root 4096 3月 15 01:42 ipadic
That completes the MeCab installation.
Installing mecab-ipadic-neologd
Next, install mecab-ipadic-neologd.
Just follow the steps listed at the following.
mecab-ipadic-NEologd : Neologism dictionary for MeCab
https://github.com/neologd/mecab-ipadic-neologd/blob/master/README.ja.md
$ git clone https://github.com/neologd/mecab-ipadic-neologd.git
$ cd mecab-ipadic-neologd
$ git pull
$ ./bin/install-mecab-ipadic-neologd
When you run the script, it shows partway through how the segmentation changes compared with the current default system dictionary.
default system dictonary | mecab-ipadic-neologd
リアル スコープ | リアルスコープ
世界一 受け たい 授業 | 世界一受けたい授業
め ちゃ イケ | めちゃイケ
学校 の カイダン | 学校のカイダン
志村 動物 園 | 志村動物園
志村 どう ぶつ 園 | 志村どうぶつ園
アド 街 | アド街
ドクター イエロー | ドクターイエロー
中村 明日美 子 | 中村明日美子
同級生 アニメ 化 | 同級生 アニメ化
ふしぎ 発見 | ふしぎ発見
め ちゃ ギントン | めちゃギントン
学校 の 階段 | 学校の階段
If you are happy to continue, answer "yes".
[install-mecab-ipadic-neologd] : Do you want to install mecab-ipadic-neologd? Type yes or no.
yes
Check.
$ mecab -d $MECAB_HOME/lib/mecab/dic/mecab-ipadic-neologd
きゃりーぱみゅぱみゅ
きゃりーぱみゅぱみゅ 名詞,固有名詞,一般,*,*,*,きゃりーぱみゅぱみゅ,キャリーパミュパミュ,キャリーパミュパミュ
EOS
It appears to have installed without problems.
Note that the MeCab installed here is not used from this point on. What matters instead is the contents of the build directory generated in the current directory when mecab-ipadic-neologd was built.
$ ls -l build
合計 11928
drwxrwxr-x 2 xxxxx xxxxx 4096 3月 15 01:47 mecab-ipadic-2.7.0-20070801-neologd-20150313
-rw-rw-r-- 1 xxxxx xxxxx 12208105 3月 15 01:47 mecab-ipadic-2.7.0-20070801.tar.gz
Building Lucene
Now, on to Lucene.
First, svn export the Lucene source code and build it.
$ svn export http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_5_0_0
$ cd lucene_solr_5_0_0/lucene
$ ant ivy-bootstrap
$ ant compile
"ant ivy-bootstrap" is unnecessary if Ivy has already been set up for Ant.
Below, I refer to the directory into which Lucene was exported (/path/to/lucene_solr_5_0_0) as $LUCENE_SRC_HOME.
Next, build the dictionary that Kuromoji uses.
For now, build with the default dictionary as is.
$ cd analysis/kuromoji
$ ant regenerate
This step downloads the IPA dictionary.
It is extracted here.
$ ls -l $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji
合計 53332
drwxrwxr-x 4 xxxxx xxxxx 4096 3月 15 01:56 classes
drwxrwxr-x 2 xxxxx xxxxx 4096 3月 15 01:56 mecab-ipadic-2.7.0-20070801
-rw-rw-r-- 1 xxxxx xxxxx 54599680 3月 15 01:56 mecab-ipadic-2.7.0-20070801.tar
lrwxrwxrwx 1 xxxxx xxxxx 84 3月 15 01:56 mecab-ipadic-2.7.0-20070801.tar.gz -> /xxxxx/.ivy2/cache/mecab/mecab-ipadic/.tar.gzs/ipadic-2.7.0-20070801..tar.gz
Let's place the dictionary sources created by the mecab-ipadic-neologd build near here and build the dictionary used by Kuromoji.
Building the Kuromoji dictionary and Kuromoji with the mecab-ipadic-neologd dictionary
The Lucene Kuromoji dictionary builder appears to process the CSV files (extension ".csv") under the specified directory.
So copy the intermediate output of mecab-ipadic-neologd created earlier into Kuromoji's build directory.
$ cp -Rp [directory where mecab-ipadic-neologd was built]/build/mecab-ipadic-2.7.0-20070801-neologd-20150313 $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji
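Since the builder seems to pick up every file ending in ".csv" in that directory, it can help to see what is actually there before building. The following is a minimal sketch: the function name list_dict_csv is hypothetical (it is not part of Lucene or mecab-ipadic-neologd), and it assumes that one line corresponds to one dictionary entry and that only files directly under the directory matter.

```shell
# list_dict_csv DIR: print "<file name> <line count>" for each *.csv directly under DIR.
# Hypothetical helper for illustration only.
list_dict_csv() {
  dir=$1
  for f in "$dir"/*.csv; do
    [ -e "$f" ] || continue   # no match: the glob stays literal, so skip it
    printf '%s %s\n' "$(basename "$f")" "$(wc -l < "$f" | tr -d ' ')"
  done
}
```

For example, `list_dict_csv $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/mecab-ipadic-2.7.0-20070801-neologd-20150313` would list the IPA dictionary CSV files together with the seed CSV.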
Then modify Kuromoji's build.xml.
To use the copied mecab-ipadic-neologd intermediate output instead of the default IPA dictionary, change ipadic.version in build.xml (this property points to the directory name).
<!-- <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" /> -->
<property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801-neologd-20150313" />
The dictionary used this time (that is, the CSV files) is written in UTF-8, so change the encoding from the default EUC-JP.
<!-- <property name="dict.encoding" value="euc-jp"/> -->
<property name="dict.encoding" value="utf-8"/>
The build-dict target no longer needs to download the dictionary, so remove download-dict from its depends.
<!-- <target name="build-dict" depends="compile-tools, download-dict"> -->
<target name="build-dict" depends="compile-tools">
With these CSV files, the default heap size (1G) is not enough for the dictionary builder, so increase it. I set it to 2G, which left plenty of headroom this time.
<!-- TODO: optimize the dictionary construction a bit so that you don't need 1G -->
<!-- <java fork="true" failonerror="true" maxmemory="1g" classname="org.apache.lucene.analysis.ja.util.DictionaryBuilder"> -->
<java fork="true" failonerror="true" maxmemory="2g" classname="org.apache.lucene.analysis.ja.util.DictionaryBuilder">
Now, let's build the dictionary.
$ ant regenerate
After a while, it fails.
[java] building tokeninfo dict...
[java]   parse...
[java]   sort...
[java]   encode...
[java] Exception in thread "main" java.lang.AssertionError
[java]     at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:129)
[java]     at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:143)
[java]     at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:78)
[java]     at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
[java]     at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
Let's look at where it failed.
It seems to be tripping on this assert.
assert baseForm.length() < 16;
It seems the BaseForm has to be 15 characters or fewer. Is this a Kuromoji specification?
Let's look at the format of a MeCab dictionary entry.
How to add words
http://mecab.googlecode.com/svn/trunk/mecab/doc/dic.html
The format looks like this.
surface form,left context ID,right context ID,cost,POS,POS subcategory 1,POS subcategory 2,POS subcategory 3,conjugation form,conjugation type,base form,reading,pronunciation
An example looks like this.
工藤,1223,1223,6058,名詞,固有名詞,人名,名,*,*,くどう,クドウ,クドウ
BaseForm is the element at index 10 (counting from 0), that is, the base form. So the base form must not exceed 15 characters.
Well, given that MeCab can load this dictionary, this is presumably a Kuromoji restriction.
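To get a feel for how many entries would hit this limit, one could scan the CSV for base forms of 16 characters or more. This is a rough sketch under several assumptions: the function name long_base_forms is hypothetical, fields are split naively on commas, and awk's length() counts characters only in a UTF-8 locale with an awk that is multibyte-aware (gawk is; mawk counts bytes), whereas Kuromoji's check uses Java's String.length(), which counts UTF-16 code units. The numbers may therefore differ slightly from what the builder sees.

```shell
# long_base_forms [FILE...]: print "<length> <base form>" for entries whose
# base form (the 11th comma-separated field) is 16 characters or longer.
# Hypothetical helper; reads standard input when no file is given.
long_base_forms() {
  awk -F, 'length($11) >= 16 { print length($11), $11 }' "$@"
}
```

Piping the output through `wc -l` gives a count of the affected entries.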
I have not properly looked into the reason for this restriction or its details, but for now let's remove it and try again.
// assert baseForm.length() < 16;
Run.
$ ant regenerate
It failed again.
[java] building tokeninfo dict...
[java]   parse...
[java]   sort...
[java]   encode...
[java] Exception in thread "main" java.lang.AssertionError
[java]     at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:122)
[java]     at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:143)
[java]     at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:78)
[java]     at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
[java]     at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
Again, let's check where it failed.
String existing = posDict.get(leftId);
assert existing == null || existing.equals(fullPOSData);
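Reading the source a little, the failure occurs when the string built by concatenating the following fields of the dictionary source CSV
POS,POS subcategory 1,POS subcategory 2,POS subcategory 3,conjugation form,conjugation type
appears in different combinations for the same pair below.
left context ID,right context ID
For example, it occurs when the left context ID is "1288" and the following combination has already appeared
名詞-固有名詞-一般 (noun, proper noun, general)
and then, for the same left context ID,
名詞-固有名詞-人名 (noun, proper noun, person name)
shows up.
This really does not seem good.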
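The situation described above can be checked directly on the CSV files. Below is a minimal sketch that lists every left context ID appearing with more than one distinct part-of-speech string. The function name conflicting_left_ids is hypothetical, the split on commas is naive, and I am assuming the part-of-speech string corresponds to the six columns listed above (fields 5 to 10 when counting from 1).

```shell
# conflicting_left_ids [FILE...]: print "<left context ID> <number of distinct POS strings>"
# for each left context ID (2nd field) that occurs with more than one POS string.
# Hypothetical helper for illustration only.
conflicting_left_ids() {
  awk -F, '
    {
      pos = $5 "," $6 "," $7 "," $8 "," $9 "," $10
      if (!(($2, pos) in seen)) {   # count each (ID, POS) pair only once
        seen[$2, pos] = 1
        count[$2]++
      }
    }
    END {
      for (id in count) if (count[id] > 1) print id, count[id]
    }
  ' "$@" | sort -n
}
```

An empty result would mean that each left context ID maps to a single part-of-speech string, which is what the assert in the dictionary builder expects.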
So I reverted my changes to the dictionary builder source code, including the removal of the 15-character limit.
Correcting the mecab-ipadic-neologd seed
That means it is probably better to fix the seed CSV file on the mecab-ipadic-neologd side. I considered filing an issue, but part of this seems to be specific to Kuromoji.
First, from the CSV files of the regular IPA dictionary, extract the left context ID, the right context ID, and the part-of-speech columns (fields 1, 2, and 4 through 8, counting from 0).
$ find $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/mecab-ipadic-2.7.0-20070801-neologd-20150313/*.csv | \
> grep -v 'mecab-user-dict-seed' | \
> xargs cat | \
> perl -wanl -F, -e 'print "$F[1],$F[2],$F[4],$F[5],$F[6],$F[7],$F[8]"' | \
> sort -n | \
> uniq > ipadic-id-with-part-of-speech.csv
The mecab-ipadic-neologd seed is excluded here.
This produces a file like the following.
$ head ipadic-id-with-part-of-speech.csv
1,1,その他,間投,*,*,*
2,2,フィラー,*,*,*,*
3,3,感動詞,*,*,*,*
4,4,記号,アルファベット,*,*,*
5,5,記号,一般,*,*,*
6,6,記号,括弧開,*,*,*
7,7,記号,括弧閉,*,*,*
8,8,記号,句点,*,*,*
9,9,記号,空白,*,*,*
10,10,記号,読点,*,*,*
Using this as the reference, I wrote a script that corrects the left and right context IDs in the mecab-ipadic-neologd seed CSV. The script is written in Groovy.
Note: come to think of it, I did not consider whether to align on the IDs or on the part of speech.
transform.groovy
def partOfSpeechCsv = args[0]
def inputCsv = args[1]
def outputCsv = args[2]

def maxContext = 1
def partOfSpeechMap = [:]

// Load the reference mapping: POS key -> [left context ID, right context ID]
new File(partOfSpeechCsv).eachLine('UTF-8') { line ->
    def tokens = line.split(/,/)
    def contexts = [tokens[0], tokens[1]]
    def partOfSpeech = tokens.drop(2).join('-')

    partOfSpeechMap[partOfSpeech] = contexts

    // Remember the largest context ID seen so far
    if ((tokens[0] as int) > maxContext) {
        maxContext = tokens[0] as int
    }
}

new File(outputCsv).withWriter('UTF-8') { writer ->
    new File(inputCsv).eachLine('UTF-8') { line ->
        def tokens = line.split(/,/)

        // Kuromoji asserts baseForm.length() < 16, so drop longer entries
        if (tokens[10].length() >= 16) {
            println("[WARN] Discard, BaseForm length is 16 or more. => [${tokens[10]}]")
            return
        }

        def contexts = [tokens[1], tokens[2]]
        def partOfSpeech = "${tokens[4]}-${tokens[5]}-${tokens[6]}-${tokens[7]}-${tokens[8]}"

        def leftContext
        def rightContext

        def ipadicContext = partOfSpeechMap[partOfSpeech]
        if (ipadicContext == null) {
            // Unknown POS combination: assign the next unused context ID
            maxContext++
            leftContext = maxContext
            rightContext = maxContext
            partOfSpeechMap[partOfSpeech] = [leftContext as String, rightContext as String]
        } else if (ipadicContext != contexts) {
            // Known POS combination with different IDs: use the IPA dictionary's IDs
            leftContext = ipadicContext[0] as int
            rightContext = ipadicContext[1] as int
        } else {
            leftContext = contexts[0] as int
            rightContext = contexts[1] as int
        }

        writer.write(tokens[0])
        writer.write(',')
        writer.write(leftContext as String)
        writer.write(',')
        writer.write(rightContext as String)
        writer.write(',')
        writer.write(tokens.drop(3).join(','))
        writer.newLine()
    }
}
If ipadic-id-with-part-of-speech.csv, the file created above, contains the same part-of-speech columns but the left and right context IDs differ, the script makes them the same as in the IPA dictionary. If the combination is not registered, it assigns a new value by incrementing the largest context ID by one each time (a rough approach).
Entries whose base form exceeds 15 characters are discarded as out of scope. For now, the script prints a warning for each of them.
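The same rules can be tried on a few lines of sample data without Groovy. The following awk rendering is only a sketch of the rules described above, not the script used in this entry: the function name remap_contexts is hypothetical, the first file is assumed to be non-empty and in the seven-column layout of ipadic-id-with-part-of-speech.csv, fields are split naively on commas, and length() counts characters only in a multibyte-aware awk under a UTF-8 locale.

```shell
# remap_contexts POS_MAP SEED: rewrite the context IDs of SEED using POS_MAP.
# POS_MAP lines: left ID,right ID, followed by five POS columns.
# Hypothetical sketch; warnings about discarded entries go to standard error.
remap_contexts() {
  awk -F, -v OFS=, '
    NR == FNR {                          # first file: reference IDs per POS key
      key = $3 "-" $4 "-" $5 "-" $6 "-" $7
      left[key] = $1; right[key] = $2
      if ($1 + 0 > max) max = $1 + 0
      next
    }
    length($11) >= 16 {                  # base form too long for Kuromoji: drop it
      print "[WARN] discard: " $11 > "/dev/stderr"
      next
    }
    {
      key = $5 "-" $6 "-" $7 "-" $8 "-" $9
      if (!(key in left)) {              # unseen POS key: allocate the next ID
        max++
        left[key] = max; right[key] = max
      }
      $2 = left[key]; $3 = right[key]    # known key: always use the reference IDs
      print
    }
  ' "$1" "$2"
}
```

Running it on a seed line whose part-of-speech columns exist in the reference file replaces its IDs with the reference IDs, and a line with an unknown combination gets the next ID after the current maximum.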
Also, I casually used split to break up the CSV, but using CSVUtil#parse, which is part of Kuromoji, would probably have been more reliable.
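The difference matters when a field is quoted and contains a comma. The sketch below compares a naive comma split with a simplified quote-aware count. Both function names are hypothetical, and the quote handling is my own simplified reading of CSV quoting, not a reproduction of what CSVUtil#parse does.

```shell
# count_fields_naive: number of fields per line when splitting on every comma
count_fields_naive() {
  awk -F, '{ print NF }'
}

# count_fields_quoted: number of fields per line when commas inside
# double quotes do not separate fields (simplified CSV quoting)
count_fields_quoted() {
  awk '{
    n = 1; in_quotes = 0
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if (c == "\"") in_quotes = !in_quotes
      else if (c == "," && !in_quotes) n++
    }
    print n
  }'
}
```

For a line such as `"A,B",1288,1288`, the naive split sees four fields while the quote-aware count sees three, so every index after the quoted field would be shifted by one in a split-based script.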
Now, generate the file.
$ groovy transform.groovy \
> ipadic-id-with-part-of-speech.csv \
> $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/mecab-ipadic-2.7.0-20070801-neologd-20150313/mecab-user-dict-seed.20150313.csv \
> $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/mecab-ipadic-2.7.0-20070801-neologd-20150313/mecab-user-dict-seed.20150313_revised.csv
Incidentally, 66707 entries had a base form longer than 15 characters.
This creates a file named mecab-user-dict-seed.20150313_revised.csv under the $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/mecab-ipadic-2.7.0-20070801-neologd-20150313 directory.
The original seed CSV is no longer needed, so delete it.
$ rm $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/mecab-ipadic-2.7.0-20070801-neologd-20150313/mecab-user-dict-seed.20150313.csv
This time for real: regroup and run it again.
$ ant regenerate
It took about 3 minutes on my PC, but this time it seems to have worked.
regenerate:
BUILD SUCCESSFUL
Total time: 3 minutes 17 seconds
Finally, build Kuromoji.
$ ant jar-core
This produces a Lucene Kuromoji JAR file that uses the new dictionary.
jar-core:
[jar] Building jar: $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/lucene-analyzers-kuromoji-5.0.0-SNAPSHOT.jar
BUILD SUCCESSFUL
Total time: 5 seconds
Let's use it to check the behavior.
Writing a program that uses Kuromoji
First, let's write an ordinary program that uses Kuromoji.
Please forgive the choice of sbt as the build tool.
build.sbt
name := "lucene-kuromoji-mecab-neologd"

version := "0.0.1-SNAPSHOT"

scalaVersion := "2.11.5"

organization := "org.littlewings"

updateOptions := updateOptions.value.withCachedResolution(true)

scalacOptions ++= Seq("-Xlint", "-unchecked", "-deprecation", "-feature")

libraryDependencies ++= Seq(
  "org.apache.lucene" % "lucene-core" % "5.0.0",
  "org.apache.lucene" % "lucene-analyzers-common" % "5.0.0",
  "org.apache.lucene" % "lucene-analyzers-kuromoji" % "5.0.0"
)
Lucene is version 5.0.0.
Please also forgive that the source code is in Scala.
src/main/scala/org/littlewings/lucene/kuromoji/KuromojiWithNeologd.scala
package org.littlewings.lucene.kuromoji

import org.apache.lucene.analysis.ja.JapaneseAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute

object KuromojiWithNeologd {
  def main(args: Array[String]): Unit = {
    val texts = Array(
      "すもももももももものうち",
      "きゃりーぱみゅぱみゅ",
      "日本経済新聞でモバゲーの記事を読んだ",
      "ććć¼ććć”ć ć¼",
      "č¦éćććććć"
    )

    val analyzer = new JapaneseAnalyzer

    for (text <- texts) {
      val tokenStream = analyzer.tokenStream("", text)
      val charTermAttr = tokenStream.addAttribute(classOf[CharTermAttribute])

      tokenStream.reset()

      // Lazily read tokens until incrementToken() returns false
      val tokens =
        Iterator
          .continually(tokenStream.incrementToken())
          .takeWhile(identity)
          .map(_ => charTermAttr.toString)

      println(s"InputText = $text")
      println(s"  Tokenized = ${tokens.mkString("[", ", ", "]")}")

      // end() must be called after the last incrementToken() and before close()
      tokenStream.end()
      tokenStream.close()
    }
  }
}
The sentences and words to analyze were chosen arbitrarily. Kuromoji runs in its default SEARCH mode.
Run this program.
> run
[info] Running org.littlewings.lucene.kuromoji.KuromojiWithNeologd
InputText = すもももももももものうち
  Tokenized = [すもも, もも, もも]
InputText = きゃりーぱみゅぱみゅ
  Tokenized = [ć, ー, ぱみゅぱみゅ]
InputText = 日本経済新聞でモバゲーの記事を読んだ
  Tokenized = [日本, 日本経済新聞, 経済, 新聞, モバゲ, 記事, 読む]
InputText = ććć¼ććć”ć ć¼
  Tokenized = [ćć, ć¼, ćć, ć”ć, ć , ć¼]
InputText = č¦éćććććć
  Tokenized = [č¦é, ćć]
[success] Total time: 1 s, completed 2015/03/15 2:57:30
Naturally, it does not know the new words, so this is the result.
Now let's use the Kuromoji JAR file that was built with the new dictionary earlier.
Exit sbt once.
> exit
Create a lib directory.
$ mkdir lib
Drop the built JAR file into it.
$ cp $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/lucene-analyzers-kuromoji-5.0.0-SNAPSHOT.jar lib/
Also remove Kuromoji from the sbt dependency definition.
libraryDependencies ++= Seq(
  "org.apache.lucene" % "lucene-core" % "5.0.0",
  "org.apache.lucene" % "lucene-analyzers-common" % "5.0.0"
  // "org.apache.lucene" % "lucene-analyzers-kuromoji" % "5.0.0"
)
Now start sbt again and run.
> run
[info] Running org.littlewings.lucene.kuromoji.KuromojiWithNeologd
InputText = すもももももももものうち
  Tokenized = [すもももももも, すもももももももものうち, もも]
InputText = きゃりーぱみゅぱみゅ
  Tokenized = [きゃりーぱみゅぱみゅ]
InputText = 日本経済新聞でモバゲーの記事を読んだ
  Tokenized = [日本, 日本経済新聞, 経済, 新聞, mobage, 記事, 読む]
InputText = ććć¼ććć”ć ć¼
  Tokenized = [ććć¼ććć”ć ć¼]
InputText = č¦éćććććć
  Tokenized = [č¦éćććććć]
[success] Total time: 1 s, completed 2015/03/15 3:02:09
The results changed a lot. きゃりーぱみゅぱみゅ is now recognized as a single word. And モバゲー became mobage.
By the way, the result for "すもももももももものうち" has turned out strange.
InputText = すもももももももものうち
  Tokenized = [すもももももも, すもももももももものうち, もも]
Wondering what was going on, I looked at the seed CSV
$ view [directory where mecab-ipadic-neologd was built]/build/mecab-ipadic-2.7.0-20070801-neologd-20150313/mecab-user-dict-seed.20150313.csv
and found the following definitions.
すももももも,1288,1288,5072,名詞,固有名詞,一般,*,*,*,すももももも,スモモモモモ,スモモモモモ
すもももももも,1288,1288,4587,名詞,固有名詞,一般,*,*,*,すもももももも,スモモモモモモ,スモモモモモモ
すもももももも〜地上最強のヨメ〜,1288,1288,3763,名詞,固有名詞,一般,*,*,*,すもももももも〜地上最強のヨメ〜,スモモモモモモチジョウサイキョウノヨメ,スモモモモモモチジョウサイキョウノヨメ
So this is what it hit.
On top of that, there is even an entry like this.
すもももももももものうち,1288,1288,4143,名詞,固有名詞,一般,*,*,*,すもももももももものうち,スモモモモモモモモノウチ,スモモモモモモモモノウチ
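To see which seed entries could match at the start of a given text, one can list the entries whose surface form is a prefix of that text. This is a minimal sketch: the function name prefix_entries is hypothetical, fields are split naively on commas, and it only shows candidate entries with their cost, not how Kuromoji actually chooses among them.

```shell
# prefix_entries TEXT [FILE...]: print "<surface form> <cost>" for every entry
# whose surface form (1st field) is a prefix of TEXT.
# Hypothetical helper; reads standard input when no file is given.
prefix_entries() {
  text=$1
  shift
  awk -F, -v text="$text" '$1 != "" && index(text, $1) == 1 { print $1, $4 }' "$@"
}
```

For example, `prefix_entries すもももももももものうち mecab-user-dict-seed.20150313.csv` should list the shorter すもも… entries shown above along with the full phrase, while leaving out the entry that continues with 〜地上最強のヨメ〜.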
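Is this really a noun?
Well, it works for now, so I will call it good.
Closing
I managed to get it working, but using the dictionary with Lucene's Kuromoji took a fair amount of effort.
The mismatch between the left/right context IDs and the part-of-speech combinations is one thing, but I did not know about the restriction that the base form must be 15 characters or fewer. I may check the Kuromoji source later.
Also, this time I discarded entries whose base form is longer than 15 characters at import time, but I should probably look at that properly. The way the CSV is split is also left as unfinished work.
The result is fairly contorted, but the goal was achieved, so I will leave it at that.
The source code and scripts created for this entry are available here.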
https://github.com/kazuhira-r/lucene-examples/tree/master/lucene-kuromoji-mecab-neologd
Even so, I am curious how this dictionary is actually built.