Elasticsearch: multi_match query returns results in an unexpected order
I need to find a phrase in documents, searching in both the title and the content. The title is more important than the content, so I expect results in the following order:
- first, documents that match in both title and content
- then documents that match in the title
- then documents that match in the content
It seems like quite basic stuff.
So I've created an index and some data like this:
PUT /test_index

PUT /test_index/article/3263
{ "id": 3263, "pagetitle": "lösungen", "searchable_content": "abc" }

PUT /test_index/article/1005
{ "id": 1005, "pagetitle": "lösungen", "searchable_content": "test! lösungen test?" }

PUT /test_index/article/677
{ "id": 677, "pagetitle": "lösungen", "searchable_content": "test lösungen test!" }

PUT /test_index/article/666
{ "id": 666, "pagetitle": "abc", "searchable_content": "test lösungen test abc" }
and I run a query like this:
GET /test_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "lösungen",
            "fields": ["pagetitle^2", "searchable_content"]
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "pagetitle": {},
      "searchable_content": {}
    }
  }
}
But the result is not what I expect. A document that matches only in the title is ranked before the documents that match in both title and content, like this:
{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 4, "max_score": 0.5753642, "hits": [ { "_index": "test_index", "_type": "article", "_id": "3263", "_score": 0.5753642, "_source": { "id": 3263, "pagetitle": "lösungen", "searchable_content": "abc" }, "highlight": { "pagetitle": [ "<em>lösungen</em>" ] } }, { "_index": "test_index", "_type": "article", "_id": "1005", "_score": 0.36464313, "_source": { "id": 1005, "pagetitle": "lösungen", "searchable_content": "test! lösungen test?" }, "highlight": { "searchable_content": [ "test! <em>lösungen</em> test?" ], "pagetitle": [ "<em>lösungen</em>" ] } }, { "_index": "test_index", "_type": "article", "_id": "677", "_score": 0.36464313, "_source": { "id": 677, "pagetitle": "lösungen", "searchable_content": "test lösungen test!" }, "highlight": { "searchable_content": [ "test <em>lösungen</em> test!" ], "pagetitle": [ "<em>lösungen</em>" ] } }, { "_index": "test_index", "_type": "article", "_id": "666", "_score": 0.2876821, "_source": { "id": 666, "pagetitle": "abc", "searchable_content": "test lösungen test abc" }, "highlight": { "searchable_content": [ "test <em>lösungen</em> test abc" ] } } ] } }
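The scores in this response are consistent with BM25 (the default similarity from Elasticsearch 5 on) being computed from per-shard statistics. The sketch below is a reconstruction under stated assumptions, not output from Elasticsearch: it assumes the default k1 = 1.2 and b = 0.75, that document 3263 sits alone on its shard, and that documents 1005 and 677 share a shard. The helper names are made up for illustration.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """Lucene-style BM25 IDF for a term found in doc_freq of doc_count documents."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf(freq, field_len, avg_len, k1=1.2, b=0.75):
    """BM25 term-frequency factor; it is 1.0 when freq == 1 and field_len == avg_len."""
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * field_len / avg_len))

TITLE_BOOST = 2  # from "pagetitle^2"

# Assumption: doc 3263 is the only document on its shard.
score_3263 = TITLE_BOOST * bm25_idf(1, 1) * bm25_tf(1, 1, 1)
# Assumption: docs 1005 and 677 share a shard and both titles contain the term.
score_1005 = TITLE_BOOST * bm25_idf(2, 2) * bm25_tf(1, 1, 1)

print(round(score_3263, 7), round(score_1005, 7))
```

Under these assumptions the two values agree with the reported 0.5753642 and 0.36464313 to about six decimal places. The title-only document wins because its shard-local IDF is larger, and because the default multi_match type (best_fields) keeps only the best field's score, so the extra content match of 1005 and 677 adds nothing.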
What I tried next was manipulating the field boosting further. In the case above, it seemed to work when I set a boost on both fields and used the most_fields type, like this:
GET /test_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "lösungen",
            "fields": ["pagetitle^3", "searchable_content^2"],
            "type": "most_fields"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "pagetitle": {},
      "searchable_content": {}
    }
  }
}
That gave the expected result for this set of data.
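The difference between the default best_fields type and most_fields can be sketched roughly as follows. The per-field scores are made-up numbers and combine is a hypothetical helper, not an Elasticsearch API: best_fields keeps the highest field score (plus tie_breaker times the others, 0 by default), while most_fields adds the field scores together (any constant normalization does not change the order).

```python
def combine(field_scores, mode, tie_breaker=0.0):
    """Combine per-field scores in the spirit of the two multi_match types."""
    if mode == "best_fields":
        best = max(field_scores)
        return best + tie_breaker * (sum(field_scores) - best)
    if mode == "most_fields":
        return sum(field_scores)
    raise ValueError(f"unsupported mode: {mode}")

# Hypothetical per-field scores, in the order (pagetitle, searchable_content).
title_only = [1.0, 0.0]
title_and_content = [1.0, 0.5]

for mode in ("best_fields", "most_fields"):
    print(mode, combine(title_only, mode), combine(title_and_content, mode))
```

With best_fields the two documents tie, and with most_fields the document matching both fields comes out ahead. That only holds while the raw field scores cooperate: a high term frequency or a short field can still let a single-field match outscore a match in two fields.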
However, if I add two more records:
PUT /test_index/article/999
{ "id": 999, "pagetitle": "abc", "searchable_content": "test lösungen test abc double match lösungen" }

PUT /test_index/article/1006
{ "id": 1006, "pagetitle": "lösungen , lösungen", "searchable_content": "test sample" }
it no longer works, because the results are now:
{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 6, "max_score": 2.2315955, "hits": [ { "_index": "test_index", "_type": "article", "_id": "1006", "_score": 2.2315955, "_source": { "id": 1006, "pagetitle": "lösungen , lösungen", "searchable_content": "test sample" }, "highlight": { "pagetitle": [ "<em>lösungen</em> , <em>lösungen</em>" ] } }, { "_index": "test_index", "_type": "article", "_id": "666", "_score": 1.219939, "_source": { "id": 666, "pagetitle": "abc", "searchable_content": "test lösungen test abc" }, "highlight": { "searchable_content": [ "test <em>lösungen</em> test abc" ] } }, { "_index": "test_index", "_type": "article", "_id": "1005", "_score": 0.86785066, "_source": { "id": 1005, "pagetitle": "lösungen", "searchable_content": "test! lösungen test?" }, "highlight": { "searchable_content": [ "test! <em>lösungen</em> test?" ], "pagetitle": [ "<em>lösungen</em>" ] } }, { "_index": "test_index", "_type": "article", "_id": "677", "_score": 0.86785066, "_source": { "id": 677, "pagetitle": "lösungen", "searchable_content": "test lösungen test!" }, "highlight": { "searchable_content": [ "test <em>lösungen</em> test!" ], "pagetitle": [ "<em>lösungen</em>" ] } }, { "_index": "test_index", "_type": "article", "_id": "3263", "_score": 0.8630463, "_source": { "id": 3263, "pagetitle": "lösungen", "searchable_content": "abc" }, "highlight": { "pagetitle": [ "<em>lösungen</em>" ] } }, { "_index": "test_index", "_type": "article", "_id": "999", "_score": 0.7876096, "_source": { "id": 999, "pagetitle": "abc", "searchable_content": "test lösungen test abc double match lösungen" }, "highlight": { "searchable_content": [ "test <em>lösungen</em> test abc double match <em>lösungen</em>" ] } } ] } }
As you can see, a document that matches only in the content (666) is ranked higher than the documents that match in both title and content (1005, 677).
Could you please explain what I'm doing wrong here and how it can be fixed?
Try constant score, like so:
GET test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "query": {
              "match": { "pagetitle": { "query": "lösungen" } }
            },
            "boost": 2
          }
        },
        {
          "constant_score": {
            "query": {
              "match": { "searchable_content": "lösungen" }
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "pagetitle": {},
      "searchable_content": {}
    }
  }
}
Constant score, according to the docs: "...wraps another query and simply returns a constant score equal to the query boost for every document in the filter." (ref)
From @davide's link you can understand why a match on searchable_content can turn into a higher score for a document. Since you want to ignore term frequencies and IDFs across fields, you can use a constant score on each field's match.
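To see why this produces the three tiers from the question, here is a simplified Python model of the query above. It is a sketch, not Elasticsearch: it assumes whitespace tokenization, ignores query normalization and coordination factors, and treats each matching constant_score clause as contributing exactly its boost.

```python
TERM = "lösungen"
CLAUSE_BOOSTS = {"pagetitle": 2.0, "searchable_content": 1.0}  # field -> boost

docs = {
    3263: {"pagetitle": "lösungen", "searchable_content": "abc"},
    1005: {"pagetitle": "lösungen", "searchable_content": "test! lösungen test?"},
    677: {"pagetitle": "lösungen", "searchable_content": "test lösungen test!"},
    666: {"pagetitle": "abc", "searchable_content": "test lösungen test abc"},
    999: {"pagetitle": "abc",
          "searchable_content": "test lösungen test abc double match lösungen"},
    1006: {"pagetitle": "lösungen , lösungen", "searchable_content": "test sample"},
}

def constant_score_sum(doc):
    """Sum the boosts of all clauses whose field contains the term at least once."""
    return sum(boost for field, boost in CLAUSE_BOOSTS.items()
               if TERM in doc[field].split())

scores = {doc_id: constant_score_sum(doc) for doc_id, doc in docs.items()}
for doc_id, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(doc_id, score)
```

Documents matching both fields get 3, title-only matches get 2, and content-only matches get 1, no matter how often the term occurs. Note that recent Elasticsearch versions document only `filter` as the wrapped clause of constant_score, so on a newer cluster the request may need that key where this answer uses `query`.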
Edit:
According to the rules listed in the original question, the above query works perfectly. But based on the comments from the OP, there is a need to rank results on the basis of the frequency of occurrence of the searched term too. Apparently, term frequency and inverse document frequency are important, but perhaps you don't care about field length here (if you want to rank results by the number of occurrences). In that case, I'd advise you to set up the index like so:
POST test_index_v1
{
  "mappings": {
    "article": {
      "properties": {
        "id": { "type": "long" },
        "pagetitle": {
          "type": "string",
          "norms": { "enabled": false }
        },
        "searchable_content": {
          "type": "string",
          "norms": { "enabled": false }
        }
      }
    }
  }
}
Note: type: string was replaced by type: text in version 5 and above.
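For version 5.x, the mapping body might look like the sketch below, written as a Python dict so it can be serialized and sent with any HTTP client. This is an assumption-based translation, not something from the original answer: it assumes the 5.x syntax where norms are switched off with a plain boolean, and no_norms_text is a hypothetical helper.

```python
import json

def no_norms_text():
    """Field definition for an analyzed text field without length norms (assumed 5.x syntax)."""
    return {"type": "text", "norms": False}

mapping_body = {
    "mappings": {
        "article": {
            "properties": {
                "id": {"type": "long"},
                "pagetitle": no_norms_text(),
                "searchable_content": no_norms_text(),
            }
        }
    }
}

# Serialized JSON body for the index-creation request.
print(json.dumps(mapping_body, indent=2))
```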
The link mentioned by @davide describes what disabling norms does.
Secondly, since you are running the query on a small number of documents, and assuming you have more than one shard assigned to the index, it is better to run the query with search_type=dfs_query_then_fetch, because local IDFs per shard can vary a lot. (Read this.)
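How much shard-local IDFs can differ on a tiny index is easy to check by hand. The sketch below assumes BM25's IDF formula and a hypothetical placement of the original four documents (3263 alone on one shard, 1005 and 677 together on another); with classic TF-IDF the numbers differ but the effect is the same.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """Lucene-style BM25 IDF for a term found in doc_freq of doc_count documents."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Shard-local statistics for "lösungen" in pagetitle (assumed placement).
idf_alone = bm25_idf(1, 1)    # shard holding only doc 3263
idf_shared = bm25_idf(2, 2)   # shard holding docs 1005 and 677
# Index-wide statistics: 4 documents, 3 of them with the term in pagetitle.
idf_global = bm25_idf(4, 3)

print(f"local: {idf_alone:.4f} vs {idf_shared:.4f}, global: {idf_global:.4f}")
```

With dfs_query_then_fetch every shard scores with the index-wide statistics, so identical titles receive identical title scores regardless of which shard a document landed on.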
Thirdly, adding to the last query, you want to factor in the weight of TF-IDF. The last query ranks documents the same whether the search term occurs 2 or 3 times in the same field. So you can add a bool/should block that adds a relevance score on top of the constant_score blocks, like so:
GET test_index_v1/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "query": {
              "match": { "pagetitle": { "query": "lösungen" } }
            },
            "boost": 2
          }
        },
        {
          "constant_score": {
            "query": {
              "match": { "searchable_content": "lösungen" }
            }
          }
        },
        {
          "bool": {
            "should": [
              { "match": { "pagetitle": { "query": "lösungen", "boost": 2 } } },
              { "match": { "searchable_content": "lösungen" } }
            ]
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "pagetitle": {},
      "searchable_content": {}
    }
  }
}
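The effect of combining both parts can be illustrated with a toy model in Python. It is only a sketch: the relevance term below is a made-up stand-in (a BM25-like saturating term frequency with no length norm and IDF fixed at 1), tokenization is by whitespace, and query normalization is ignored.

```python
TERM = "lösungen"
BOOSTS = {"pagetitle": 2.0, "searchable_content": 1.0}  # field -> boost

docs = {
    3263: {"pagetitle": "lösungen", "searchable_content": "abc"},
    1005: {"pagetitle": "lösungen", "searchable_content": "test! lösungen test?"},
    677: {"pagetitle": "lösungen", "searchable_content": "test lösungen test!"},
    666: {"pagetitle": "abc", "searchable_content": "test lösungen test abc"},
    999: {"pagetitle": "abc",
          "searchable_content": "test lösungen test abc double match lösungen"},
    1006: {"pagetitle": "lösungen , lösungen", "searchable_content": "test sample"},
}

def saturating_tf(freq, k1=1.2):
    """Toy relevance: grows with term frequency but levels off."""
    return freq * (k1 + 1) / (freq + k1)

def score(doc):
    total = 0.0
    for field, boost in BOOSTS.items():
        freq = doc[field].split().count(TERM)
        if freq:
            total += boost                        # constant_score clause
            total += boost * saturating_tf(freq)  # plain match clause
    return total

ranking = sorted(docs, key=lambda doc_id: score(docs[doc_id]), reverse=True)
print(ranking)
```

In this toy data the tiers survive and term frequency only reorders documents inside a tier (1006 above 3263, 999 above 666). Nothing in the query enforces that, though: if the relevance part can exceed the gap between tiers, a lower tier can overtake a higher one, in which case raising the constant_score boosts widens the gap.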