swiss knife blog: January 2011

index csv
curl "http://localhost:8080/solr/fnac/update/csv?commit=true&separator=%7c&header=false&fieldnames=id,reference,ean13,title,desc,mark,cat,,,,,,,,,,,,,,,,,,,,,," --data-binary @marchand.csv -H 'Content-type:text/plain; charset=iso-8859-1'
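
The same URL can be assembled programmatically. A minimal sketch, assuming Python 3; the helper name `csv_update_url` and the `skipped=3` value are only illustrative (an empty name in `fieldnames` makes Solr skip that CSV column, which is what the trailing commas above do):

```python
from urllib.parse import urlencode

# Field names taken from the curl command above.
FIELDS = ["id", "reference", "ean13", "title", "desc", "mark", "cat"]

def csv_update_url(core_url, fields, skipped=0, separator="|"):
    """Build the /update/csv URL; `skipped` trailing columns get empty names."""
    params = {
        "commit": "true",
        "separator": separator,   # percent-encoded by urlencode
        "header": "false",
        "fieldnames": ",".join(fields + [""] * skipped),
    }
    # safe="," keeps the comma-separated field list readable in the URL
    return core_url + "/update/csv?" + urlencode(params, safe=",")

# Hypothetical example: 3 trailing columns to ignore.
url = csv_update_url("http://localhost:8080/solr/fnac", FIELDS, skipped=3)
print(url)
```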

to delete all
http://localhost:8080/solr/fnac/update?commit=true&stream.body=%3Cdelete%3E%3Cquery%3E*%3A*%3C%2Fquery%3E%3C%2Fdelete%3E

remove items where price=299.00
http://192.168.0.70:8080/solr/lg_fr_small/update?commit=true&stream.body=%3Cdelete%3E%3Cquery%3Eprice%3A299.0%3C%2Fquery%3E%3C%2Fdelete%3E
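
The percent-encoded `stream.body` in the two URLs above is tedious to write by hand. A small sketch, assuming Python 3, that produces it from a plain query string (the function name `delete_by_query_url` is made up for this note):

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

def delete_by_query_url(core_url, query):
    """Return an /update URL that deletes every document matching `query`."""
    # escape() protects &, < and > inside the XML body
    body = "<delete><query>" + escape(query) + "</query></delete>"
    # safe="*" leaves * literal; everything else reserved is percent-encoded
    return core_url + "/update?commit=true&stream.body=" + quote(body, safe="*")

print(delete_by_query_url("http://localhost:8080/solr/fnac", "*:*"))
print(delete_by_query_url("http://192.168.0.70:8080/solr/lg_fr_small", "price:299.0"))
```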

useful paths
/var/lib/tomcat6/conf
/etc/tomcat6/solr/fnac/conf
/var/data/solr

logging
/etc/tomcat6/logging.properties
set the following
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].level = INFO

useCompoundFile (compound files): when set to true, Lucene packs the files of each segment into a single compound file, so far fewer file descriptors are open, but indexing is slower (enable it when you start to hit "too many open files" problems)

query syntax
"jouet cubes"~2 (proximity: the tokens may be up to 2 positions apart)
"(desc:bois)^10000 jouets" (boost)
mobile && _val_:"popularity"^0.7
q.op=AND df=text
Quoting means "these tokens in exactly this order", not "use this as a literal" (stemming implicitly disables exact search)
&debugQuery=true
&fl=offer_event,score: "field list" to be displayed in the output

query field mapping with edismax

/solr/nl_00/select?q=cat2:5010000&wt=xml&f.cat2.qf=cat&defType=edismax

Filter queries
- fields used by the filter need to be indexed (everything Solr searches on needs to be indexed, except what is used only for presentation in the result feed)
- q=mobile&fq=popularity:[10 TO *] (filter query result set)
- fq={!cache=false}year:[2005 TO *] (disable filter cache)
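
Several `fq` parameters can be sent in one request, and the brackets and braces need URL encoding. A sketch, assuming Python 3; `select_url` is a helper invented for this note, and the core name `append` is the one used elsewhere in these notes:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def select_url(core_url, q, filters=(), **extra):
    """Build a /select URL with one q and any number of fq parameters."""
    params = [("q", q)]
    params += [("fq", fq) for fq in filters]   # repeated fq keys are allowed
    params += sorted(extra.items())
    return core_url + "/select?" + urlencode(params)

url = select_url(
    "http://localhost:8080/solr/append",
    "mobile",
    filters=["popularity:[10 TO *]", "{!cache=false}year:[2005 TO *]"],
    rows=10,
)
print(url)

# Decode the URL again to check what the server would receive.
sent = parse_qs(urlsplit(url).query)
print(sent["fq"])
```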

Function queries
- boost scores: either
. . _val_ in the query
. . bf (boost function) parameter, or extended dismax boost parameter (multiplicative boost).

stemming and exact search

from nabble: A phrase means to Solr (or rather to the Lucene and dismax query parsers, which are what understand double-quoted phrases) "these tokens in exactly this order". So a phrase of one token, "manager", is exactly the same as if you didn't use the double quotes. It's only one token, so "all the tokens in this phrase in exactly the order specified" is, well, just the same as one token without phrase quotes.
If you've set up a stemmed field at indexing time, then "manager" and "management" are stemmed IN THE INDEX, probably to something like "manag". There is no longer any information in the index (at least in that field) on what the original literal was, it's been stemmed in the index. So there's no way possible for it to only match certain un-stemmed versions -- at least using that field. And when you enter either 'manager' or 'management' at query time, it is analyzed and stemmed to match that stemmed something-like "manag" in the index either way. If it didn't analyze and stem at query time, then instead the query would just match NOTHING, because neither 'manager' nor 'management' are in the index at all, only the stemmed versions.
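
The point of the quote above can be reproduced with a toy inverted index. This is only an illustration in Python 3: `toy_stem` is a made-up suffix stripper, not the stemmer Solr uses, and the two documents are invented.

```python
def toy_stem(token):
    """Toy stemmer (NOT Solr's): strips a few English suffixes."""
    for suffix in ("ement", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Lowercase, split on whitespace, stem each token."""
    return [toy_stem(t) for t in text.lower().split()]

# Index time: only the stems are stored, the original words are lost.
docs = {1: "the manager", 2: "good management"}
index = {}
for doc_id, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

def search(query, stem=True):
    """Return ids of documents containing all query terms."""
    terms = analyze(query) if stem else query.lower().split()
    hits = [index.get(t, set()) for t in terms]
    return set.intersection(*hits) if hits else set()

print(sorted(index))                   # stems only: 'manag' stands for both words
print(search("manager"))               # matches both documents
print(search("manager", stem=False))   # matches nothing: 'manager' is not in the index
```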

Span queries and Payloads

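Span queries are like a "near" query: each token emitted during analysis carries its position relative to the previous token, so a query can constrain how far apart the matching tokens are.

A payload is a byte array attached to a token, which allows tagging it with extra information at index time.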

Local params
q={!q.op=AND df=title}solr rocks
q={!type=dismax qf='myfield yourfield'}solr rocks
q={!dismax qf=myfield}solr rocks --- (implies type=dismax)
q={!dismax qf=myfield}solr rocks --- is equivalent to --- q={!type=dismax qf=myfield v=$qq}&qq=solr rocks --- (parameter dereferencing)
q={!geodist}
q={!cache}
q={!tag=foo} --- tag is a local param to arbitrarily label a parameter
q={!lucene q.op=AND df=text}myfield:foo +bar -baz --- specifies a lucene/solr query
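
A sketch of how such prefixes can be generated, assuming Python 3. The helper `local_params` is invented for this note; it takes a dict because names like `q.op` are not valid Python keyword arguments, and it quotes values that contain spaces.

```python
def local_params(parser=None, params=None):
    """Build a {!...} local-params prefix for a Solr query string."""
    parts = [parser] if parser else []
    for key, value in (params or {}).items():
        value = str(value)
        if " " in value:
            # values with spaces must be quoted; escape embedded quotes
            value = "'" + value.replace("'", "\\'") + "'"
        parts.append(key + "=" + value)
    return "{!" + " ".join(parts) + "}"

print(local_params(None, {"q.op": "AND", "df": "title"}) + "solr rocks")
print(local_params("dismax", {"qf": "myfield yourfield"}) + "solr rocks")
print(local_params("dismax", {"qf": "myfield", "v": "$qq"}))
```

The resulting string still has to be URL-encoded before it is placed in the q parameter.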

facets

number of results by type for a specific field
the faceted field must be indexed
basic facet
http://localhost:8080/solr/append/select?q=mobile&facet=true&facet.limit=10&facet.field=brand
simple facet query: new query on the result set (not really a facet) : how many from shop 164
http://localhost:8080/solr/append/select?q=*:*&start=0&rows=10&facet=true&facet.query=shop:164
facets for autocompletion with "mo"
http://192.168.0.1:8080/solr/append/select?q=mobile&facet=on&facet.limit=10&facet.mincount=1&facet.field=title&facet.prefix=mo
facets by fixed interval method 1 (facet.range)
http://192.168.0.1:8080/solr/append/select?indent=on&version=2.2&q=mobile%0D%0A&fq=&start=0&rows=10&fl=*%2Cscore&facet=true&facet.range=price&facet.range.start=0&facet.range.end=1000&facet.range.gap=100
facets by fixed interval method 2 (facet.query)
http://localhost:8080/solr/select?q=video&rows=0&facet=true&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+*]
facets by dynamically computed intervals are not supported as of Solr 3.3
facet.pivot (Solr4)
blog post on facet.pivot
post-group faceting: group.truncate=true, JIRA SOLR-2665 (fixed in 3.4)
exclude a filter when faceting
http://localhost:8080/solr/select/?q=*%3A*&version=2.2&start=0&rows=0&indent=on&facet=on&facet.field={!ex=fcounty}fcounty&fq={!tag=fcounty}fcounty:Kent
using the patch from SOLR-2242 (the order of the parameters seems to matter)
&facet.numTerms=true&facet.limit=-1&facet.mincount=1
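
When the basic facet query is requested with wt=json, the field facet counts come back by default as a flat list alternating term and count. A sketch, assuming Python 3, that turns it into a dict; the sample response below is invented and trimmed to the relevant part:

```python
import json

# Invented sample, shaped like the facet part of a wt=json response.
sample = """
{"facet_counts": {"facet_fields": {"brand": ["nokia", 12, "samsung", 7, "lg", 3]}}}
"""

def facet_counts(response, field):
    """Convert the flat [term, count, term, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))

counts = facet_counts(json.loads(sample), "brand")
print(counts)
```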

Result Grouping

former name: Field collapsing
from Solr 3.3 on
returns the best result for each distinct value of a given field
number of groups (group.ngroups)

group=true&group.field=product&rows=0&group.ngroups=true

&group.main=true&group.format=simple: to get the same XML feed format as no grouping
group.truncate: computes facet counts using only the highest-ranking document of each group
q=chanel&start=0&rows=10&fl=*%2Cscore&group=true&group.field=shop_name&group.ngroups=true&group.truncate=true&facet=true&facet.field=type_on_sale
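
What grouping, group.ngroups and group.truncate compute can be mimicked client-side on a handful of documents. This is only an illustration in Python 3 with invented data, not how Solr implements it:

```python
from collections import Counter

# Invented documents; field names follow the query above.
docs = [
    {"id": 1, "shop_name": "A", "type_on_sale": "perfume", "score": 2.0},
    {"id": 2, "shop_name": "A", "type_on_sale": "bag", "score": 1.5},
    {"id": 3, "shop_name": "B", "type_on_sale": "bag", "score": 1.8},
]

def top_per_group(docs, field):
    """Keep the highest-scoring document for each value of `field`."""
    best = {}
    for doc in sorted(docs, key=lambda d: d["score"], reverse=True):
        best.setdefault(doc[field], doc)   # first seen = best score
    return best

heads = top_per_group(docs, "shop_name")
ngroups = len(heads)                                             # like group.ngroups
truncated = Counter(d["type_on_sale"] for d in heads.values())   # like group.truncate=true
plain = Counter(d["type_on_sale"] for d in docs)                 # facets over all documents

print(ngroups, dict(truncated), dict(plain))
```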

sort
&sort=shop_state+desc,score+desc (put shop_state grouping first, then score in each grouping)

optimize
Reorganize segments, merge all segments into one
Remove any deleted docs
Stack Overflow post about optimize

shards / sharding
http://localhost:8080/solr/fr_big/select?shards=localhost:8080/solr/fr_big,localhost:8080/solr/fr_big2&indent=on&version=2.2&q=mobile&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl.fl=

StatsComponent to get stats on all data of a field in an index:
http://192.168.0.1:8080/solr/append/select?q=mobile&stats=true&stats.field=price

Other notes:

- solr.StrField does not support specifying an analyzer. Use solr.TextField with solr.KeywordTokenizerFactory instead, for example.

- When changing the data model, remove the associated data under /var/data/solr

- Solr4 atomic updates: curl '192.168.0.20:8080/solr/lg_fr_00/update?commit=true' -H 'Content-type:application/json' -d '[{"external_id":"28027304llant","offer_event":{"set":3}}]'
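
The JSON body of such an atomic update can be generated rather than typed. A sketch, assuming Python 3; `atomic_set` is a helper invented for this note and only covers the "set" operation shown above:

```python
import json

def atomic_set(id_field, doc_id, **fields):
    """Build the JSON body of an atomic update that sets the given fields."""
    doc = {id_field: doc_id}
    for name, value in fields.items():
        doc[name] = {"set": value}   # "set" replaces the stored value
    return json.dumps([doc])

payload = atomic_set("external_id", "28027304llant", offer_event=3)
print(payload)
```

The printed string is what goes after -d in the curl command.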
- to specify a query encoding ?ie=ISO-8859-1

simple install
- Download Solr from http://mir2.ovh.net/ftp.apache.org/dist//lucene/solr/1.4.1/apache-solr-1.4.1.zip
- Copy apache-solr-1.4.1/dist/apache-solr-1.4.1.war to $TOMCAT_HOME/webapps/solr.war (for ex /var/lib/tomcat6/webapps/ )
- From the solr distribution, copy example/solr in your solr home /etc/tomcat6/solr
- in catalina.sh (/usr/share/tomcat/bin/catalina.sh):

JAVA_OPTS="-Dsolr.solr.home=/etc/tomcat6/solr -Djava.awt.headless=true -server -XX:NewSize=256m -XX:MaxNewSize=256m -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+DisableExplicitGC"

- http://localhost:8080/solr/admin

Full export
wget --output-document=/tmp/offers_202.txt "http://solrhost:8080/solr/lg_fr_00/select?q=*:*&fl=external_id&rows=100000000&wt=csv&csv.header=false"

Zookeeper
Zookeeper and Solr

Get Solr 3.5 source code
svn co http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_5_0
cd lucene_solr_3_5_0/
ant get-maven-poms
ant generate-maven-artifacts (may fail with an out-of-memory exception, but the artifacts are already generated by then)
cd solr
ant dist

Get Solr4 (doesn't compile with Java 8)

works also with Solr 6 / Java 8 ( https://issues.apache.org/jira/browse/SOLR-11014 )

tar zxvf ../Downloads/solr-4.1.0-src.tgz

cd solr-4.1.0/

cp /tmp/apache-ivy-2.4.0/ivy-2.4.0.jar /usr/share/ant/lib/

---------------- using Maven

ant get-maven-poms

Copying 49 files to /home/lee/tools/solr-4.1.0/maven-build

ant ivy-bootstrap

ant generate-maven-artifacts

actually builds everything: ./solr/build/solr-core/classes

cd maven-build

To compile, package, and install all binary and source artifacts to your local repository, without running any tests:

mvn -DskipTests source:jar-no-fork install

------------- using Eclipse (importing the mvn project in Eclipse doesn't work)

ant eclipse

from Eclipse, create a new Java project with the "solr-4.1.0" folder. A new folder "eclipse-build" is then automatically created

Build only classes. Useful for dev.

------------- using ant

cd solr
ant dist

Solr application servers

From Solr 5 on, Solr should be used as a black box.

https://cwiki.apache.org/confluence/display/solr/Major+Changes+from+Solr+4+to+Solr+5
http://grokbase.com/t/lucene/solr-user/15772hc1jd/jetty-in-solr-5-2-0

As of Solr 6, it still uses Jetty, but the Solr developers may replace it with their own application server mechanism in the future. Solr should be run as a service, and no longer deployed as a war.

Solr versions

September 2008, Solr 1.3

November 2009, Solr 1.4

In March 2010, the Lucene and Solr projects merged.

24-Jun-2010 1.4.1

30-Mar-2011 3.1.0

03-Jun-2011 3.2.0

01-Jul-2011 3.3.0

14-Sep-2011 3.4.0

06-Jun-2017 6.6.0

main JIRAs
--- Post group faceting:
. - JIRA LUCENE-3097: Post group faceting presents results in global facets and not in "per group" results. (open v3.4, 4.0)
--- join:
. - JIRA SOLR-2272: join: Map results onto other docs within the same shard fq={!join from=blog_id to=id}append=nikon
--- Field collapsing
. - JIRA SOLR-236: Field collapsing (closed v3.3)
. - JIRA SOLR-1682: following up (open 3.4, 4.0)
. - JIRA SOLR-1683: distributed field collapsing (open 3.4, 4.0)
--- price min & max:
. - JIRA SOLR-1581: Facet by Function, quantize buckets (open v3.4, 4.0): need SOLR-2251 that is duplicate of SOLR-1351
--- distinct:
. - JIRA SOLR-1814: select count(distinct fieldname) in SOLR (patch provided)
. - JIRA SOLR-2242: Get distinct count names for a facet field (open 4.0): "shop count"

Thursday, January 13, 2011

Solr
