Saturday, July 20, 2013

Auto Complete

Requirement: provide auto complete (suggestions) while the user types a search query.

Solr version : 4.3.1

Solr comes with a default search application called browse (http://<server>:<port>/solr/browse, e.g. http://localhost:8983/solr/browse).

The Solr installation directory has an example folder. Navigate to <solr installation directory>/example/solr/collection1/conf/velocity/. This directory contains all the Velocity template files, along with jquery.autocomplete.css and jquery.autocomplete.js.

Now open head.vm. The following code is responsible for the auto complete functionality.

 <script>
    $(document).ready(function(){
      $("\#q").autocomplete('#{url_for_solr}/terms', {  ## backslash escaped #q as that is a macro defined in VM_global_library.vm
           extraParams:{
             'terms.prefix': function() { return $("\#q").val();},
             'terms.sort': 'count',
             'terms.fl': 'name',
             'wt': 'velocity',
             'v.template': 'suggest'
           }
         }
      ).keydown(function(e){
        if (e.keyCode === 13){
          $("#query-form").trigger('submit');
        }
      });

      // http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=i&terms.sort=count
    });

    </script>
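
To see what the script sends, the request can be reproduced outside the browser. The following is a minimal Python sketch that only builds the URL shown in the comment above. The helper name build_terms_url and the default base URL are assumptions for illustration, not part of Solr.

```python
from urllib.parse import urlencode


def build_terms_url(prefix, field="name", base="http://localhost:8983/solr"):
    """Build a /terms request URL (hypothetical helper, not a Solr API)."""
    params = [
        ("terms.fl", field),       # field whose indexed terms are suggested
        ("terms.prefix", prefix),  # what the user has typed so far
        ("terms.sort", "count"),   # most frequent terms first
    ]
    return base + "/terms?" + urlencode(params)


print(build_terms_url("i"))
```

The script in head.vm additionally sends wt=velocity and v.template=suggest, so that the response is rendered through suggest.vm instead of being returned as raw XML.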

This is the part we have to customize for our needs. Most of the time I copy the example folder, rename it, and then start modifying schema.xml, solrconfig.xml and other files.

My schema fields are as follows.
 <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
   <field name="category" type="text_general" indexed="true" stored="true" omitNorms="true"/>
   <field name="countryName" type="text_general" indexed="true" stored="true"/>
  
   <field name="countryCode" type="text_general" indexed="true" stored="true"/>
   <field name="values" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="store" type="location" indexed="true" stored="true"/>

The first change I made was in head.vm:
<script>
    $(document).ready(function(){
      $("\#q").autocomplete('#{url_for_solr}/terms', {  ## backslash escaped #q as that is a macro defined in VM_global_library.vm
           extraParams:{
             'terms.prefix': function() { return $("\#q").val();},
             'terms.sort': 'count',
             'terms.fl': 'countryName',
             'wt': 'velocity',
             'v.template': 'suggest'
           }
         }
      ).keydown(function(e){
        if (e.keyCode === 13){
          $("#query-form").trigger('submit');
        }
      });

      // http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=i&terms.sort=count
    });
</script>

head.vm (location: your_application_name/solr/collection1/conf/velocity/head.vm): the only change is 'name' -> 'countryName' in 'terms.fl'.

The next change is in suggest.vm (location: your_application_name/solr/collection1/conf/velocity/suggest.vm). Change its content as follows:

#foreach($t in $response.response.terms.countryName)
$t.key
#end


Another important file is richtext-doc.vm. Edit the field names in it to include your desired fields.





 Auto complete with display from more than one field

Earlier we displayed only the country name in the drop-down. Now I also need another field (e.g. category).

I made two changes to do this.

Change in head.vm

$("\#q").autocomplete('#{url_for_solr}/terms?terms.fl=category'
 
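Earlier it was only $("\#q").autocomplete('#{url_for_solr}/terms. The extraParams block still sends 'terms.fl': 'countryName', so the request now asks for terms from both fields.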

Change in suggest.vm

#foreach($t in $response.response.terms.category)
$t.key
#end

#foreach($t in $response.response.terms.countryName)
$t.key
#end

The category terms are listed first, then the countryName terms.
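
The reason both fields come back is that terms.fl ends up in the request twice. The sketch below illustrates this in Python, assuming the autocomplete plugin appends extraParams to the query string that is already in the URL. The function name and the parameter order are illustrative assumptions.

```python
from urllib.parse import parse_qs, urlencode


def combined_query_string(prefix):
    """Model the query string sent after the head.vm change (illustrative only)."""
    params = [
        ("terms.fl", "category"),     # from the URL passed to autocomplete()
        ("terms.prefix", prefix),     # from extraParams
        ("terms.sort", "count"),      # from extraParams
        ("terms.fl", "countryName"),  # from extraParams
    ]
    return urlencode(params)


query = combined_query_string("in")
print(query)
print(parse_qs(query)["terms.fl"])  # both field names are present
```

Since terms.fl is given twice, the response carries one term list per field, which is why suggest.vm can loop over both terms.category and terms.countryName.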

Please see the image below. Earlier the category 'inflation' was not displayed, but this time 'inflation' is displayed along with the country names.
 

Wednesday, July 17, 2013

Storing any CSV file into SOLR

Which request handler is responsible for handling CSV files?

<requestHandler name="/update/csv" class="solr.CSVRequestHandler">
        <lst name="defaults">
         <str name="stream.contentType">application/csv</str>
       </lst>
</requestHandler>

How to load a CSV file into Solr?


 java -Durl=http://localhost:8983/solr/update/csv -Dtype=text/csv -jar post.jar *.csv

How to load CSV with dynamic fields?

The name of a single-valued string field should end with _s. The name of a multi-valued string field should end with _ss.
The following are the settings for dynamic fields.
   <dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
   <dynamicField name="*_is" type="int"    indexed="true"  stored="true"  multiValued="true"/>
   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true" />
   <dynamicField name="*_ss" type="string"  indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
   <dynamicField name="*_ls" type="long"   indexed="true"  stored="true"  multiValued="true"/>
   <dynamicField name="*_t"  type="text_general"    indexed="true"  stored="true"/>
   <dynamicField name="*_txt" type="text_general"   indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="*_en"  type="text_en"    indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="*_b"  type="boolean" indexed="true" stored="true"/>
   <dynamicField name="*_bs" type="boolean" indexed="true" stored="true"  multiValued="true"/>
   <dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
   <dynamicField name="*_fs" type="float"  indexed="true"  stored="true"  multiValued="true"/>
   <dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
   <dynamicField name="*_ds" type="double" indexed="true"  stored="true"  multiValued="true"/>
   <dynamicField name="*_coordinate"  type="tdouble" indexed="true"  stored="false" />

   <dynamicField name="*_dt"  type="date"    indexed="true"  stored="true"/>
   <dynamicField name="*_dts" type="date"    indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="*_p"  type="location" indexed="true" stored="true"/>
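
When naming CSV columns, it helps to check which dynamic field rule a column name would fall under. The following Python sketch covers only a few of the suffixes listed above. The table and the function are illustrative assumptions, and preferring the longest matching suffix is my understanding of how Solr resolves overlaps, not something verified here.

```python
# A subset of the suffix rules from the dynamicField list above.
DYNAMIC_SUFFIXES = {
    "_i": "int",
    "_s": "string",
    "_ss": "string (multiValued)",
    "_d": "double",
    "_dt": "date",
    "_t": "text_general",
    "_txt": "text_general (multiValued)",
}


def dynamic_type(field_name):
    """Return the type a column name would get, or None if no suffix matches."""
    matches = [s for s in DYNAMIC_SUFFIXES if field_name.endswith(s)]
    if not matches:
        return None
    # Assumption: when several patterns match, the longest one wins.
    return DYNAMIC_SUFFIXES[max(matches, key=len)]


for column in ["countryName_s", "tags_ss", "rate_d", "plain"]:
    print(column, "->", dynamic_type(column))
```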

Errors Faced

Error: no id is defined. The reason is that schema.xml declares <uniqueKey>id</uniqueKey>, so every row must have an id.
After adding an id column with a unique value for each row, I was able to post the file to Solr without any complaint.
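
If a CSV file has no id column, one can be added before posting. This is a minimal Python sketch using only the standard csv module. The function name add_id_column and the use of the row number as the id are assumptions; any value that is unique per row would do.

```python
import csv
import io


def add_id_column(csv_text, id_field="id"):
    """Prepend an id column with a running number to CSV text (illustrative helper)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, data = rows[0], rows[1:]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([id_field] + header)
    for number, row in enumerate(data, start=1):
        writer.writerow([str(number)] + row)  # row number serves as the unique key
    return out.getvalue()


sample = "countryName_s,rate_d\nIndia,9.3\nSingapore,2.8\n"
print(add_id_column(sample))
```

The sample values are made up for the example and are not data from my index.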

Thursday, July 11, 2013

Configuring Solr

Solr Version used for this blog: 4.3.1

How to start solr?

From the <Solr Installation Directory>\example directory, run: java -jar start.jar
The default port is 8983.

How to change solr default port 8983?

The port is set in <solr installation directory>/example/etc/jetty.xml:


<Call name="addConnector">
  <Arg>
    <New class="org.eclipse.jetty.server.bio.SocketConnector">
      <Call class="java.lang.System" name="setProperty"> <Arg>log4j.configuration</Arg> <Arg>etc/log4j.properties</Arg> </Call>
      <Set name="host"><SystemProperty name="jetty.host" /></Set>
      <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
      <Set name="maxIdleTime">50000</Set>
      <Set name="lowResourceMaxIdleTime">1500</Set>
      <Set name="statsOn">false</Set>
    </New>
  </Arg>
</Call>


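Change 8983 to any available port and restart Solr. Because the value is read from the jetty.port system property, it can also be overridden at startup without editing the file: java -Djetty.port=<port> -jar start.jar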


How to access solr web interface?

http://<server name>:<port>/solr   e.g. http://localhost:8983/solr

Configuration for partial search


Some of my documents contain a keyword (e.g. Player). What I want is that a search with a partial entry such as Pla returns "Player". The question is how to configure a field type for this.

First try
<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.KeywordRepeatFilter"/>
            <filter class="solr.PorterStemFilterFactory"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
    </fieldType>
Error: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "text_keyword": Plugin init failure for [schema.xml] analyzer/filter: Error loading class 'solr.KeywordRepeatFilter'. I am currently using Solr 4.3.1.


I looked up KeywordRepeatFilter: it is in the package org.apache.lucene.analysis.miscellaneous.
So I tried a small change and replaced solr.KeywordRepeatFilter with the fully qualified name org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter.


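Another barrier: Caused by: java.lang.ClassCastException: class org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter
The cast fails because a <filter> element in schema.xml has to name a token filter factory, and KeywordRepeatFilter is the filter itself, not a factory. The likely fix is the factory counterpart, solr.KeywordRepeatFilterFactory.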
 


Time was running out and I had to get partial search working somehow, so I introduced another field type:

<fieldType name="string_partial_search" class="solr.TextField" sortMissingLast="true" omitNorms="true">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StandardFilterFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10" side="front" />
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10" side="back" />
        </analyzer>
    </fieldType>


Now for the debugging. I have stored country names together with their unemployment rate and inflation rate. I searched for "Sin", hoping it would return all the country names that contain "Sin" anywhere in the name.

Following is the result
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">28</int>
  <lst name="params">
    <str name="debugQuery">true</str>
    <str name="indent">true</str>
    <str name="q">Sin</str>
    <str name="_">1374122942599</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="40" start="0">

40 results were found. The debug output is as follows:
<lst name="debug">
  <str name="rawquerystring">Sin</str>
  <str name="querystring">Sin</str>
  <str name="parsedquery">(countryName:si countryName:in countryName:sin)/no_coord</str>
  <str name="parsedquery_toString">countryName:si countryName:in countryName:sin</str>

Because my minimum gram size is 2, the query was expanded into the grams si, in and sin, and a document matches if it contains any one of them.
The top ten results included Tunisia, Russian Federation, Micronesia, Fed. Sts., Malaysia, French Polynesia, Indonesia, Sint Maarten (Dutch part) and Singapore.
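
The expansion can be reproduced with a simplified model of the analyzer chain. The Python sketch below is an approximation and not the Lucene implementation: whitespace splitting stands in for StandardTokenizer, and the function names are assumptions. It applies a front edge n-gram step followed by a back edge n-gram step, as in the string_partial_search field type.

```python
def edge_ngrams(token, min_size=2, max_size=10, side="front"):
    """Return edge n-grams of a token, taken from the front or the back."""
    grams = []
    for n in range(min_size, min(max_size, len(token)) + 1):
        grams.append(token[:n] if side == "front" else token[-n:])
    return grams


def analyze(text):
    """Simplified stand-in for the string_partial_search analyzer chain."""
    tokens = text.lower().split()  # assumption: whitespace split replaces StandardTokenizer
    front = [g for t in tokens for g in edge_ngrams(t, side="front")]
    return [g for f in front for g in edge_ngrams(f, side="back")]


query_terms = analyze("Sin")
print(query_terms)  # the same three grams as in parsedquery

for name in ["Tunisia", "Russian Federation", "Singapore"]:
    shared = sorted(set(query_terms) & set(analyze(name)))
    print(name, "matches on", shared)
```

In this model Tunisia and Russian Federation match only because they contain the two-letter gram si, which explains why so many unrelated country names appear in the results.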

I was not happy with that result.
Then I found what works best for the current situation: search with "Sin*". It returns anything that starts with Sin.

<lst name="debug">
  <str name="rawquerystring">Sin*</str>
  <str name="querystring">Sin*</str>
  <str name="parsedquery">countryName:sin*</str>
  <str name="parsedquery_toString">countryName:sin*</str>
  <lst name="explain">
    <str name="200">
1.0 = (MATCH) ConstantScore(countryName:sin*), product of:
  1.0 = boost
  1.0 = queryNorm
</str>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">32</int>
  <lst name="params">
    <str name="debugQuery">true</str>
    <str name="indent">true</str>
    <str name="q">Sin*</str>
    <str name="_">1374124575932</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="4" start="0">

If we want to find Sin anywhere in a word, we can search with *Sin*:

<lst name="explain">
  <str name="200">
1.0 = (MATCH) ConstantScore(countryName:*sin*), product of:
  1.0 = boost
  1.0 = queryNorm
</str>