Thursday, January 13, 2011

Solr


  • index csv
    curl "http://localhost:8080/solr/fnac/update/csv?commit=true&separator=%7c&header=false&fieldnames=id,reference,ean13,title,desc,mark,cat,,,,,,,,,,,,,,,,,,,,,," --data-binary @marchand.csv -H 'Content-type:text/plain; charset=iso-8859-1'

  • to delete all
    http://localhost:8080/solr/fnac/update?commit=true&stream.body=%3Cdelete%3E%3Cquery%3E*%3A*%3C%2Fquery%3E%3C%2Fdelete%3E

  • remove items where price=299.00
    http://192.168.0.70:8080/solr/lg_fr_small/update?commit=true&stream.body=%3Cdelete%3E%3Cquery%3Eprice%3A299.0%3C%2Fquery%3E%3C%2Fdelete%3E
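
The two delete URLs above are the same pattern with a different query, percent-encoded by hand. A minimal Python sketch of that encoding follows; the helper name `delete_by_query_url` is made up for illustration, and the hosts and cores are the ones from the notes.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

def delete_by_query_url(base_url, core, query):
    """Build an update URL that deletes every document matching `query`.

    Hypothetical helper, not part of Solr: it only formats the URL.
    """
    # XML-escape the query so characters such as '&' or '<' stay valid XML.
    body = "<delete><query>{}</query></delete>".format(escape(query))
    # Keep '*' literal (as in the notes) and percent-encode everything else,
    # including '/', which quote() leaves untouched by default.
    return "{}/{}/update?commit=true&stream.body={}".format(
        base_url, core, quote(body, safe="*"))

print(delete_by_query_url("http://localhost:8080/solr", "fnac", "*:*"))
print(delete_by_query_url("http://192.168.0.70:8080/solr", "lg_fr_small", "price:299.0"))
```

The helper only builds the string; opening the URL (browser, curl, wget) is what actually triggers the delete.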

  • useful paths
    /var/lib/tomcat6/conf
    /etc/tomcat6/solr/fnac/conf
    /var/data/solr

  • logging
    /etc/tomcat6/logging.properties
    set the following
    org.apache.catalina.core.ContainerBase.[Catalina].[localhost].level = INFO

  • useCompoundFile (compound files): when set to true, each segment is packed into a single file, so far fewer file descriptors stay open, but indexing is slower (turn it on when you start hitting "too many open files" problems)

  • query syntax
    "jouet cubes"~2 (proximity: the two terms may be up to 2 positions apart)
    "(desc:bois)^10000 jouets" (boost)
    mobile && _val_:"popularity"^0.7
    q.op=AND df=text
    Quoting means "these tokens in exactly this order", not "use this as a literal" (stemming implicitly disables exact search)
    &debugQuery=true
    &fl=offer_event,score: "field list" to be displayed in the output
  • query field mapping with edismax
    /solr/nl_00/select?q=cat2:5010000&wt=xml&f.cat2.qf=cat&defType=edismax



  • Filter queries
    - fields used by the filter need to be indexed (everything used by Solr needs to be indexed, except what is used only for presenting the result feed)
    - q=mobile&fq=popularity:[10 TO *] (filter query result set)
    - fq={!cache=false}year:[2005 TO *] (disable filter cache)
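
A small Python sketch of how the q and fq parameters above combine into one select URL. The helper `select_url` and its `cache_filters` switch are illustrative names, and the base URL is borrowed from the facet examples in these notes.

```python
from urllib.parse import urlencode, parse_qs

def select_url(base_url, q, filters=(), cache_filters=True):
    """Format a /select URL with one main query and any number of filter queries."""
    params = [("q", q)]
    for fq in filters:
        if not cache_filters:
            # Local param from the notes: skip the filter cache for this fq.
            fq = "{!cache=false}" + fq
        params.append(("fq", fq))
    # urlencode takes care of spaces, brackets, braces and colons.
    return base_url + "/select?" + urlencode(params)

url = select_url("http://localhost:8080/solr/append", "mobile",
                 ["popularity:[10 TO *]"])
print(url)
print(parse_qs(url.split("?", 1)[1]))
```

Each fq is sent as its own parameter, so several filters can be passed in the list.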

  • Function queries
    - boost scores: either
    . . _val_ in the query
    . . bf (boost function) parameter, or extended dismax boost parameter (multiplicative boost). 





  • stemming and exact search 
  • from Nabble: to Solr (or rather to the Lucene and dismax query parsers, which are what understand double-quoted phrases), a phrase means "these tokens in exactly this order". So a phrase of one token, "manager", is exactly the same as if you didn't use the double quotes. It's only one token, so "all the tokens in this phrase in exactly the order specified" is, well, just the same as one token without phrase quotes.
    If you've set up a stemmed field at indexing time, then "manager" and "management" are stemmed IN THE INDEX, probably to something like "manag". There is no longer any information in the index (at least in that field) on what the original literal was, it's been stemmed in the index. So there's no way possible for it to only match certain un-stemmed versions -- at least using that field. And when you enter either 'manager' or 'management' at query time, it is analyzed and stemmed to match that stemmed something-like "manag" in the index either way. If it didn't analyze and stem at query time, then instead the query would just match NOTHING, because neither 'manager' nor 'management' are in the index at all, only the stemmed versions. 
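
The explanation above can be replayed with a toy inverted index in Python. Everything here is an assumption made for illustration: `toy_stem` is a crude suffix stripper, not the stemmer Solr uses, and the two documents are invented.

```python
def toy_stem(token):
    """Crude suffix stripper for illustration only (not a real stemmer)."""
    for suffix in ("ement", "er"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Lowercase, split on whitespace, stem each token."""
    return [toy_stem(tok) for tok in text.lower().split()]

# Index time: only the stemmed tokens are stored, the original words are gone.
docs = {1: "senior manager", 2: "risk management"}
index = {}
for doc_id, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

def search(query, stem_query=True):
    """Return ids of documents containing every query term."""
    terms = analyze(query) if stem_query else query.lower().split()
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits) if hits else set()

print(sorted(index))                        # stemmed terms only
print(search("manager"))                    # matches both documents
print(search("manager", stem_query=False))  # matches nothing
```

With stemming on both sides, "manager" and "management" are indistinguishable; with stemming only at index time, the raw word "manager" is simply absent from the index.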




  • Span queries and Payloads
    • Span queries are like a "near" query: each token emitted during analysis carries its position relative to the previous token
    • A payload is a byte array stored with a token, which lets you tag information at index time




  •  Local params
    q={!q.op=AND df=title}solr rocks
    q={!type=dismax qf='myfield yourfield'}solr rocks
    q={!dismax qf=myfield}solr rocks --- (implies type=dismax)
    q={!dismax qf=myfield}solr rocks --- is equivalent to --- q={!type=dismax qf=myfield v=$qq}&qq=solr rocks --- (parameter dereferencing)
    q={!geodist}
    q={!cache}
    q={!tag=foo} --- tag is a local param to arbitrarily label a parameter
    q={!lucene q.op=AND df=text}myfield:foo +bar -baz --- specifies a lucene/solr query
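
The local params prefixes above all share one shape: `{!`, an optional parser type, space-separated key=value pairs, `}`. A Python sketch that assembles such a prefix; `local_params` is a hypothetical helper, and it deliberately does not handle values that themselves contain quotes.

```python
def local_params(parser_type=None, params=None):
    """Assemble a {!...} local params prefix (illustrative helper only)."""
    parts = []
    if parser_type:
        # Short form: a bare first word stands for type=<word>.
        parts.append(parser_type)
    for key, value in (params or {}).items():
        value = str(value)
        if " " in value:
            # Values with spaces are wrapped in single quotes, as in the notes.
            value = "'" + value + "'"
        parts.append("{}={}".format(key, value))
    return "{!" + " ".join(parts) + "}"

print("q=" + local_params(params={"q.op": "AND", "df": "title"}) + "solr rocks")
print("q=" + local_params("dismax", {"qf": "myfield yourfield"}) + "solr rocks")
print("q=" + local_params("dismax", {"qf": "myfield", "v": "$qq"}) + "&qq=solr rocks")
```

The resulting string still has to be URL-encoded before it goes into a request.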

  • facets
    • number of results by type for a specific field
    • field faceted must be indexed
    • basic facet
      http://localhost:8080/solr/append/select?q=mobile&facet=true&facet.limit=10&facet.field=brand
    • simple facet query: new query on the result set (not really a facet) : how many from shop 164
      http://localhost:8080/solr/append/select?q=*:*&start=0&rows=10&facet=true&facet.query=shop:164
    • facets for autocompletion with "mo"
      http://192.168.0.1:8080/solr/append/select?q=mobile&facet=on&facet.limit=10&facet.mincount=1&facet.field=title&facet.prefix=mo
    • facets by fixed interval method 1 (facet.range)
      http://192.168.0.1:8080/solr/append/select?indent=on&version=2.2&q=mobile%0D%0A&fq=&start=0&rows=10&fl=*%2Cscore&facet=true&facet.range=price&facet.range.start=0&facet.range.end=1000&facet.range.gap=100
    • facets by fixed interval method 2 (facet.query)
      http://localhost:8080/solr/select?q=video&rows=0&facet=true&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+*]
    • facets by dynamically computed intervals are not supported as of Solr 3.3
    • facet.pivot (Solr4)
      blog post on facet.pivot
    • post-group faceting: group.truncate=true, JIRA SOLR-2665 (resolved in 3.4)
    • exclude a filter when faceting
      http://localhost:8080/solr/select/?q=*%3A*&version=2.2&start=0&rows=0&indent=on&facet=on&facet.field={!ex=fcounty}fcounty&fq={!tag=fcounty}fcounty:Kent
    • using the SOLR-2242 patch (parameter order seems to matter)
      &facet.numTerms=true&facet.limit=-1&facet.mincount=1
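
As a sketch of "method 2" above, the fixed intervals of the facet.range example (0 to 1000 in steps of 100) can be generated as one facet.query per bucket. The function name is invented, and the bounds are inclusive on both ends like the `price:[* TO 500]` example, so a value sitting exactly on a boundary is counted in both neighbouring buckets.

```python
from urllib.parse import urlencode

def range_facet_queries(field, start, end, gap):
    """One facet.query per fixed-width interval, inclusive on both ends."""
    queries = []
    lower = start
    while lower < end:
        upper = min(lower + gap, end)
        queries.append("{}:[{} TO {}]".format(field, lower, upper))
        lower = upper
    return queries

queries = range_facet_queries("price", 0, 1000, 100)
params = [("q", "mobile"), ("rows", 0), ("facet", "true")]
params += [("facet.query", fq) for fq in queries]

print(len(queries), queries[0], queries[-1])
print("/select?" + urlencode(params))
```

The same loop also works with uneven, hand-picked boundaries if the list of bounds is supplied directly instead of being computed from a gap.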
  • Result Grouping
    • former name: Field collapsing
    • from Solr 3.3 on
    • best result for each type of a specific field
    • http://192.168.0.1:8080/solr/append/select?q=mobile&group=true&group.field=brand
    • ngroups (number of groups)
      • group=true&group.field=product&rows=0&group.ngroups=true
    • &group.main=true&group.format=simple: to get the same XML feed format as no grouping
    • group.truncate: compute facet counts from only the highest-ranking document of each group
      q=chanel&start=0&rows=10&fl=*%2Cscore&group=true&group.field=shop_name&group.ngroups=true&group.truncate=true&facet=true&facet.field=type_on_sale
  • sort
    &sort=shop_state+desc,score+desc (put shop_state grouping first, then score in each grouping)






  • optimize
    Reorganize segments, merge all segments into one
    Remove any deleted docs
    Stack Overflow post about optimize






  • shards / sharding
    http://localhost:8080/solr/fr_big/select?shards=localhost:8080/solr/fr_big,localhost:8080/solr/fr_big2&indent=on&version=2.2&q=mobile&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl.fl=
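
A short Python sketch of how the shards parameter in the URL above is put together: a comma-separated list of host:port/path entries without the http:// scheme, sent to any one core acting as coordinator. `sharded_select_url` is an illustrative name, not a Solr API.

```python
from urllib.parse import urlencode, parse_qs

def sharded_select_url(coordinator, shards, q):
    """Format a distributed /select URL (illustrative helper only)."""
    params = [("shards", ",".join(shards)), ("q", q), ("fl", "*,score")]
    # Leave ',', '*', ':' and '/' readable, as in the hand-written URL.
    return "http://{}/select?{}".format(
        coordinator, urlencode(params, safe=",*:/"))

shards = ["localhost:8080/solr/fr_big", "localhost:8080/solr/fr_big2"]
url = sharded_select_url(shards[0], shards, "mobile")
print(url)
```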

  • StatsComponent to get stats on all data of a field in an index:
    http://192.168.0.1:8080/solr/append/select?q=mobile&stats=true&stats.field=price


  • Other notes:
    - solr.StrField does not support specifying an analyzer; use solr.TextField with solr.KeywordTokenizerFactory instead, for example
    - When changing the data model, remove the associated data in /var/data/solr
    - Solr4 atomic updates: curl '192.168.0.20:8080/solr/lg_fr_00/update?commit=true' -H 'Content-type:application/json' -d '[{"external_id":"28027304llant","offer_event":{"set":3}}]'
    - to specify a query encoding   ?ie=ISO-8859-1
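
The JSON body of the Solr4 atomic update above is easy to get wrong by hand once there are several documents. A minimal Python sketch that builds it; the helper name is made up, and the field names and id are the ones from the curl example.

```python
import json

def atomic_set_payload(doc_id, field, value, id_field="external_id"):
    """JSON body that sets one field on one document, leaving the others as they are."""
    # The {"set": value} wrapper is what marks this as an atomic update.
    return json.dumps([{id_field: doc_id, field: {"set": value}}])

payload = atomic_set_payload("28027304llant", "offer_event", 3)
print(payload)
```

The printed string is what goes after `-d` in the curl command, with the same Content-type header.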

    simple install
    - Download Solr from http://mir2.ovh.net/ftp.apache.org/dist//lucene/solr/1.4.1/apache-solr-1.4.1.zip
    - Copy apache-solr-1.4.1/dist/apache-solr-1.4.1.war to $TOMCAT_HOME/webapps/solr.war (for example under /var/lib/tomcat6/webapps/)
    - From the Solr distribution, copy example/solr to your Solr home, /etc/tomcat6/solr
    - in catalina.sh (/usr/share/tomcat/bin/catalina.sh):
    JAVA_OPTS="-Dsolr.solr.home=/etc/tomcat6/solr -Djava.awt.headless=true -server -XX:NewSize=256m -XX:MaxNewSize=256m -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+DisableExplicitGC"
    - http://localhost:8080/solr/admin

    Full export
    wget --output-document=/tmp/offers_202.txt "http://solrhost:8080/solr/lg_fr_00/select?q=*:*&fl=external_id&rows=100000000&wt=csv&csv.header=false"





  • Zookeeper
    Zookeeper and Solr

  • Get Solr 3.5 source code
    svn co http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_5_0
    cd lucene_solr_3_5_0/
    ant get-maven-poms
    ant generate-maven-artifacts (may end with an out-of-memory exception, but by then the artifacts are already generated)
    cd solr
    ant dist

  • Get Solr4    (doesn't compile with Java 8)
    also works with Solr 6 / Java 8 (https://issues.apache.org/jira/browse/SOLR-11014)

    tar zxvf ../Downloads/solr-4.1.0-src.tgz
    cd solr-4.1.0/

    cp /tmp/apache-ivy-2.4.0/ivy-2.4.0.jar /usr/share/ant/lib/
    ---------------- using Maven
    ant get-maven-poms
        Copying 49 files to /home/lee/tools/solr-4.1.0/maven-build
    ant ivy-bootstrap
    ant generate-maven-artifacts  
        actually builds everything: ./solr/build/solr-core/classes
    cd maven-build
    mvn -DskipTests source:jar-no-fork install
        To compile, package, and install all binary and source artifacts to your
        local repository, without running any tests:
    ------------- using Eclipse (importing the mvn project in Eclipse doesn't work)
    ant eclipse
    from Eclipse, create a new Java project with the "solr-4.1.0" folder. A new folder "eclipse-build" is then automatically created
    Build only classes. Useful for dev.
    ------------- using ant
    cd solr
    ant dist
  • Solr application servers
    • From Solr 5 on, Solr should be treated as a black box.
      https://cwiki.apache.org/confluence/display/solr/Major+Changes+from+Solr+4+to+Solr+5
      http://grokbase.com/t/lucene/solr-user/15772hc1jd/jetty-in-solr-5-2-0
    • As of Solr 6, it still uses Jetty, but the Solr developers may build their own application server mechanism in the future. Solr should be run as a service, not deployed as a war anymore.



  • Solr versions
    • September 2008: Solr 1.3
    • November 2009: Solr 1.4
    • In March 2010, the Lucene and Solr projects merged.
    • 24-Jun-2010: 1.4.1
    • 30-Mar-2011: 3.1.0
    • 03-Jun-2011: 3.2.0
    • 01-Jul-2011: 3.3.0
    • 14-Sep-2011: 3.4.0
    • 06-Jun-2017: 6.6.0





  • main JIRAs
    --- Post group faceting:
    . - JIRA LUCENE-3097: post-group faceting presents results as global facets and not as "per group" results. (open v3.4, 4.0)
    --- join:
    . - JIRA SOLR-2272: join: Map results onto other docs within the same shard fq={!join from=blog_id to=id}append=nikon
    --- Field collapsing
    . - JIRA SOLR-236: Field collapsing (closed v3.3)
    . - JIRA SOLR-1682: following up (open 3.4, 4.0)
    . - JIRA SOLR-1683: distributed field collapsing (open 3.4, 4.0)
    --- price min & max:
    . - JIRA SOLR-1581: Facet by Function, quantize buckets (open v3.4, 4.0): need SOLR-2251 that is duplicate of SOLR-1351
    --- distinct:
    . - JIRA SOLR-1814: select count(distinct fieldname) in SOLR (patch provided)
    . - JIRA SOLR-2242: Get distinct count names for a facet field (open 4.0): "shop count"
Sunday, January 2, 2011

gccsense

gccsense provides automatic completion for C++ in Emacs. At first it would not install on Ubuntu 9.04, but it eventually worked on Ubuntu 9.10.

It worked OK following the instructions at:

http://cx4a.org/software/gccsense/

I had to add the following two lines to my .emacs:

    (add-to-list 'load-path "~/.emacs.d/")
    (require 'gccsense)

I got completion by running M-x gccsense-complete after . or ->.
You have to record the include paths with gccrec if you want completion across all your files (a little bit complicated).