INFORMATION
Technology

Using Yahoo! JAPAN Keyphrase Extraction from Solr 3.2

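This article explains how to use the Yahoo! JAPAN Keyphrase Extraction API (hereafter, the Keyphrase Extraction API) from the recently released Solr 3.2. The Keyphrase Extraction API is a tool, offered as a Web API, that extracts keyphrases from Japanese documents. The steps here target Solr 3.2, but customers using the subscription package can do the same with version 0.7 or later. Because the subscription package's README.txt already describes how to use the Keyphrase Extraction API, this article follows the Solr 3.2 directory layout instead.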

First, download Solr 3.2. Once you have the apache-solr-3.2.0.tgz file, extract it into a suitable directory. For convenience, this article assumes the file is downloaded to and extracted in the /home/solr directory.
```
$ cd /home/solr
# download the archive here...
# then extract the downloaded file
$ tar xvzf apache-solr-3.2.0.tgz
```
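Next, because we will be handling Japanese documents, install lucene-gosen, a Japanese morphological analyzer. To install it, create a directory named lib under the Solr home directory and place the jar file there.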
```
$ cd /home/solr/apache-solr-3.2.0/example/solr
$ mkdir lib
$ cd lib
$ wget http://lucene-gosen.googlecode.com/files/lucene-gosen-1.1.0-ipadic.jar
```
Next, install rondhuit-uima in the same lib directory. rondhuit-uima is a library that acts as a bridge, letting Solr use Yahoo!'s keyphrase extraction Web API or NExTR on Rails.
```
$ cd /home/solr/apache-solr-3.2.0/example/solr/lib
$ wget http://rondhuit-uima.googlecode.com/files/rondhuit-uima-1.1.1.jar
```
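Since both libraries are installed simply by placing jar files in lib, a quick check that the downloads actually arrived can save a confusing startup error later. The following is a minimal sketch; the function name `check_jars` is hypothetical, and the jar file names are the ones downloaded above.

```shell
# check_jars LIB_DIR: report whether the two jars used in this article
# exist in LIB_DIR and are non-empty. Hypothetical helper, not part of Solr.
check_jars() {
  lib_dir=$1
  missing=0
  for jar in lucene-gosen-1.1.0-ipadic.jar rondhuit-uima-1.1.1.jar; do
    # -s is true only if the file exists and has a size greater than zero,
    # which also catches an empty file left behind by a failed download
    if [ -s "$lib_dir/$jar" ]; then
      echo "found:   $jar"
    else
      echo "missing: $jar"
      missing=1
    fi
  done
  return "$missing"
}

# Example call (path taken from the steps above):
# check_jars /home/solr/apache-solr-3.2.0/example/solr/lib
```

The function returns a non-zero status when either jar is missing, so it can be used in a script before starting Solr.
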
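Next, define in schema.xml a field type that uses the Japanese morphological analyzer, and a field of that type for storing Japanese news articles. Here the field type is named text_ja and the field that stores the article is named article.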
```
$ cd /home/solr/apache-solr-3.2.0/example/solr/conf
$ vi schema.xml
```
In schema.xml, define the field type text_ja and the field article that uses it, as follows.
```
    :
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory"/>
  </analyzer>
</fieldType>
    :
<field name="article" type="text_ja" indexed="true" stored="true" multiValued="true"/>
    :
```
In addition, define a multiValued field of type string named keyphrase.
```
<field name="keyphrase"  type="string"  indexed="true"  stored="true" multiValued="true"/>
```

Next, configure the UIMA libraries in solrconfig.xml.

```
$ cd /home/solr/apache-solr-3.2.0/example/solr/conf
$ vi solrconfig.xml
```

Add the following settings to solrconfig.xml.

```
<lib dir="../../contrib/uima/lib" />
<lib dir="../../dist/" regex="apache-solr-uima-\d.*\.jar" />
```

Next, in the same solrconfig.xml file, add the settings for using rondhuit-uima. Replace ${YAHOO_APPID} under runtimeParameters with your own application ID, obtained from the Yahoo! JAPAN developer site.

```
<updateRequestProcessorChain name="uima">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <lst name="uimaConfig">
      <lst name="runtimeParameters">
        <str name="appid">${YAHOO_APPID}</str>
      </lst>
      <str name="analysisEngine">/com/rondhuit/uima/desc/KeyphraseExtractAnnotatorDescriptor.xml</str>
      <bool name="ignoreErrors">true</bool>
      <str name="logField">id</str>
      <lst name="analyzeFields">
        <bool name="merge">false</bool>
        <arr name="fields">
          <str>article</str>
        </arr>
      </lst>
      <lst name="fieldMappings">
        <lst name="type">
          <str name="name">com.rondhuit.uima.yahoo.Keyphrase</str>
          <lst name="mapping">
            <str name="feature">keyphrase</str>
            <str name="field">keyphrase</str>
          </lst>
        </lst>
      </lst>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update/uima" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uima</str>
  </lst>
</requestHandler>
```
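
If you would rather not edit the application ID into the file by hand, the placeholder can be filled in with sed. This is only a sketch: the helper name `set_appid` is hypothetical, and it assumes the ID contains none of the characters `|`, `&`, or `\`, which sed would interpret specially.

```shell
# set_appid CONFIG APPID: print CONFIG with the literal text ${YAHOO_APPID}
# replaced by APPID. Hypothetical helper; it writes to stdout and leaves
# the original file untouched.
set_appid() {
  config=$1
  appid=$2
  # [$] matches a literal dollar sign; '|' is the sed delimiter.
  # Assumption: APPID contains no '|', '&' or '\' characters.
  sed "s|[\$]{YAHOO_APPID}|$appid|g" "$config"
}

# Example call (the ID shown is a placeholder, not a real application ID):
# set_appid solrconfig.xml 'your-app-id' > solrconfig.xml.new
```

Writing to a new file first makes it easy to compare the result with the original before replacing it.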

Next, prepare a suitable news article for trying out keyphrase extraction. Here the text of a published news article is used.

```
$ cd /home/solr/apache-solr-3.2.0/example/exampledocs
$ vi news.xml
```

Paste the news article into news.xml as follows.

```
<add>
  <doc>
    <field name="id">http://news.tbs.co.jp/newseye/tbs_newseye4758970.html</field>
    <field name="article">
    <!-- paste the news article text here -->
    </field>
  </doc>
</add>
```
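
Article text pasted directly into XML breaks the file if it contains characters such as `&` or `<`. The sketch below builds the same document from a plain-text file and escapes those characters first. The helper names `xml_escape` and `make_add_xml` are hypothetical; the field names id and article follow the schema defined earlier.

```shell
# xml_escape: escape the characters that are special in XML text content.
# '&' is handled first so that the entities added later are not re-escaped.
xml_escape() {
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# make_add_xml ID TEXT_FILE: print a Solr <add> document for one article.
# Hypothetical helper; ID becomes the id field, the file body the article field.
make_add_xml() {
  doc_id=$1
  text_file=$2
  printf '<add>\n  <doc>\n    <field name="id">%s</field>\n    <field name="article">\n' \
    "$(printf '%s' "$doc_id" | xml_escape)"
  xml_escape < "$text_file"
  printf '    </field>\n  </doc>\n</add>\n'
}

# Example call (article.txt is a hypothetical file holding the article text):
# make_add_xml 'http://news.tbs.co.jp/newseye/tbs_newseye4758970.html' article.txt > news.xml
```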

Next, edit post.sh, located in the same directory, so that it uses UIMA.

```
$ cd /home/solr/apache-solr-3.2.0/example/exampledocs
$ vi post.sh
```

Two places need to change, marked (1) and (2) below.

```
FILES=$*
URL=http://localhost:8983/solr/update
# (1) Copy the line above and set the variable UIMA_URL
UIMA_URL=http://localhost:8983/solr/update/uima

for f in $FILES; do
  # (2) Change $URL to $UIMA_URL in the following two lines
  echo Posting file $f to $UIMA_URL
  curl $UIMA_URL --data-binary @$f -H 'Content-type:application/xml'
  echo
done

# send the commit command to make sure all the changes are flushed and visible
curl $URL --data-binary '<commit/>' -H 'Content-type:application/xml'
echo
```
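
post.sh prints Solr's raw XML response to the console. If you want a script to check the outcome, the status code can be pulled out of the response header, as in the sketch below. The helper name `solr_status` is hypothetical, and the sketch assumes the usual XML response layout in which the header contains an `<int name="status">` element. Note that because the chain above sets ignoreErrors to true, a status of 0 presumably does not by itself confirm that keyphrases were extracted.

```shell
# solr_status: read a Solr XML update response on stdin and print the
# status code found in its responseHeader (0 means the request succeeded).
# Prints nothing if no status element is found.
solr_status() {
  # Join the response into one line so the match does not depend on
  # where Solr puts line breaks, then capture the digits of the status.
  tr -d '\n' | sed -n 's/.*<int name="status">\([0-9][0-9]*\)<\/int>.*/\1/p'
}

# Example call (URL and file name taken from the steps above):
# curl -s http://localhost:8983/solr/update/uima --data-binary @news.xml \
#   -H 'Content-type:application/xml' | solr_status
```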

Start Solr.

```
$ cd /home/solr/apache-solr-3.2.0/example
$ java -jar start.jar
```

From another console, index the news article news.xml.
```
$ cd /home/solr/apache-solr-3.2.0/example/exampledocs
$ ./post.sh news.xml
```

Open the following URL in a browser and confirm that keyphrases have been extracted along with the news article.

http://localhost:8983/solr/select/?q=*%3A*

```
<doc>
  <arr name="keyphrase">
    <str>亀井</str>
    <str>菅総理</str>
    <str>国民新党</str>
    <str>総理官邸</str>
    <str>閣僚</str>
    <str>復興基本法</str>
    <str>亀井静香</str>
    <str>参院議員会長</str>
    <str>与党幹部</str>
    <str>輿石</str>
    <str>復興担当相</str>
    <str>私個人</str>
    <str>原子力事故</str>
    <str>復旧復興</str>
    <str>冗談交じり</str>
    <str>蓮舫行政刷新担当大臣</str>
    <str>亀井代表</str>
    <str>選任</str>
    <str>復興担当大臣</str>
    <str>内閣改造</str>
  </arr>
   :
</doc>
```
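
The same check can be done from the command line. The sketch below prints one keyphrase per line from a Solr XML response. The helper name `extract_keyphrases` is hypothetical, and it assumes the query is sent with `fl=keyphrase` so that the only `<str>` elements without attributes are keyphrase values. Values are printed as they appear in the XML, so any entities such as `&amp;` are left escaped.

```shell
# extract_keyphrases: read a Solr XML response on stdin and print one
# keyphrase per line. Assumes the request used fl=keyphrase, so bare
# <str>...</str> elements occur only inside the keyphrase array.
extract_keyphrases() {
  # grep -o prints each matching element on its own line;
  # sed then strips the surrounding tags.
  grep -o '<str>[^<]*</str>' | sed -e 's/<[^>]*>//g'
}

# Example call (same query as above, restricted to the keyphrase field):
# curl -s 'http://localhost:8983/solr/select/?q=*%3A*&fl=keyphrase' | extract_keyphrases
```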
