Lynn Cherny, Ph.D.
All examples assume localhost:9200 for ES.
A tutorial in 25 minutes ("highlights")
Tendency for higher ratings in this dataset.
You could show this to users in a UI, of course...
But keep in mind that ES does this for you: the tokenization we do here is for exploration, and we won't use it in ES directly.
Very positive, unsurprising given star ratings.
many more lines....
(lol people are awesome)
TSNE of top 1500 words
The exploration phase gives you some principled places to filter, if you want.
by length, by user id, by topic....
Filtering out some of the reviews -- very short or very long.
And very long ones.
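Note: msgpack is a nice fast format available in pandas for saving these (to_msgpack/read_msgpack were deprecated in pandas 0.25 and removed in 1.0, so this needs an older pandas). Save all intermediate results too, because some of the operations take a long time on these DataFrames.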
See my code notebooks for this!
Get yourself the Sense plugin (it installs into Kibana as an app, so install Kibana too).
standard: my kitty cat's a pain in the neck
english: my kitti cat pain neck
Create the settings...
A gotcha: ES silently falls back to the standard analyzer if yours isn't applied -- always double check.
HTML is stripped, "&" entities are converted, "a" and "the" are removed, lowercase is applied...
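A minimal sketch of settings that would produce the behavior described above, written as a Python dict. The analyzer name review_analyzer is hypothetical (the slides don't name it); html_strip, the standard tokenizer, and the lowercase and stop filters are built-in ES components. The second helper builds a body for the _analyze API, which is one way to double check which analyzer is really being applied.

```python
import json

# Hypothetical analyzer name; not taken from the slides.
ANALYZER_NAME = "review_analyzer"

index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                ANALYZER_NAME: {
                    "type": "custom",
                    # html_strip removes tags and decodes entities such as &amp;
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    # the built-in "stop" filter defaults to English stop words
                    "filter": ["lowercase", "stop"],
                }
            }
        }
    }
}


def analyze_request(text, analyzer=ANALYZER_NAME):
    """Build a body for the _analyze API to inspect the tokens an analyzer emits."""
    return {"analyzer": analyzer, "text": text}


print(json.dumps(index_settings, indent=2))
print(json.dumps(analyze_request("<b>The</b> cat &amp; a dog")))
```

Paste the printed JSON into Sense, or pass the dicts to your client when you create the index.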
Load your dataframe with pd.read_msgpack("yelp_df_forES.msg") and define a possible mapping.
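One possible mapping, as a sketch only. The field names text and text_orig come from the boosting example later in this talk; stars, date, business_id and user_id are assumptions about the dataframe's columns, and review_analyzer is a hypothetical custom analyzer name. The "text" and "keyword" types are from ES 5 and later; on older versions you would use "string" (with "index": "not_analyzed" for the id fields).

```python
import json

# Assumed column names; adjust to whatever your dataframe actually has.
review_properties = {
    # analyzed with the (hypothetical) custom analyzer
    "text": {"type": "text", "analyzer": "review_analyzer"},
    # same content, analyzed with the standard analyzer for exact-ish matches
    "text_orig": {"type": "text", "analyzer": "standard"},
    "stars": {"type": "integer"},
    "date": {"type": "date"},
    # ids are stored un-analyzed so they can be filtered and aggregated on
    "business_id": {"type": "keyword"},
    "user_id": {"type": "keyword"},
}

review_mapping = {"properties": review_properties}

print(json.dumps(review_mapping, indent=2))
```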
localEs.search(index='yelp', doc_type='review', q='pizza-cookie')
Read how it works with tf-idf: https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html
TF (term frequency): frequency of a term in a document (relative to length, or number of tokens in the doc)
IDF (inverse document frequency): log of the total number of docs divided by the number of docs containing the term, so rarer terms get a higher weight.
TF-IDF measures importance of a term in a document, relative to a set of documents.
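A toy worked example of the textbook definitions, in plain Python. The three "reviews" are made up for illustration, and this is not the exact formula ES/Lucene uses for scoring (that one adds normalization and smoothing), only the basic idea.

```python
import math
from collections import Counter

# Made-up mini corpus, for illustration only.
docs = [
    "great pizza great crust",
    "pizza was cold",
    "friendly staff and fast service",
]
tokenized = [d.split() for d in docs]


def tf(term, tokens):
    """Term frequency: count of the term relative to document length."""
    return Counter(tokens)[term] / len(tokens)


def idf(term, all_docs):
    """Inverse document frequency: log(total docs / docs containing the term)."""
    containing = sum(1 for tokens in all_docs if term in tokens)
    return math.log(len(all_docs) / containing) if containing else 0.0


def tf_idf(term, tokens, all_docs):
    return tf(term, tokens) * idf(term, all_docs)


for term in ("great", "pizza"):
    print(term, round(tf_idf(term, tokenized[0], tokenized), 3))
```

In the first review, "great" scores higher than "pizza": it occurs twice there and in no other document, while "pizza" shows up in two of the three.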
Also set the operator to "and" instead of the default "or" to match all terms.
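A sketch of a match query body with the operator switched to "and". The field name text is an assumption about the mapping; the helper name is hypothetical.

```python
import json


def match_all_terms(query_text, field="text"):
    """Match query that requires every term in query_text to appear in the field."""
    return {
        "query": {
            "match": {
                field: {"query": query_text, "operator": "and"}
            }
        }
    }


print(json.dumps(match_all_terms("pizza cookie"), indent=2))
```

With the Python client this would go in as the body argument, e.g. localEs.search(index='yelp', doc_type='review', body=match_all_terms('pizza cookie')).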
Aggregations will count results for you.
Docs on tricks for more accurate results: https://www.elastic.co/guide/en/elasticsearch/guide/master/most-fields.html
You can also boost one of them to make it weigh more in the score, e.g.,
"fields": ["text^5", "text_orig"]
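For context, a sketch of the full query body that a boosted fields list like this would sit in, using multi_match with type most_fields as in the linked docs. The helper name is hypothetical; the field names are the ones shown above.

```python
import json


def most_fields_query(query_text, fields=("text^5", "text_orig")):
    """multi_match over several fields; "^5" boosts that field's score contribution."""
    return {
        "query": {
            "multi_match": {
                "query": query_text,
                "type": "most_fields",
                "fields": list(fields),
            }
        }
    }


print(json.dumps(most_fields_query("kitty cat"), indent=2))
```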
Boolean, with must/should/must_not,
and ranges: "gte", "lte" on star ratings
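A sketch of how those pieces can combine in one bool query. The field names text and stars are assumptions about the mapping, and the helper is hypothetical. The range goes into the bool "filter" clause, which exists in ES 2.0 and later; on older versions you would need a different structure.

```python
import json


def bool_query(must_text, should_text=None, must_not_text=None,
               min_stars=None, max_stars=None):
    """Bool query: required terms, optional score boosters, exclusions, star range."""
    bool_clause = {"must": [{"match": {"text": must_text}}]}
    if should_text:
        # "should" matches are optional but raise the score
        bool_clause["should"] = [{"match": {"text": should_text}}]
    if must_not_text:
        bool_clause["must_not"] = [{"match": {"text": must_not_text}}]
    star_range = {}
    if min_stars is not None:
        star_range["gte"] = min_stars
    if max_stars is not None:
        star_range["lte"] = max_stars
    if star_range:
        # filters restrict results without affecting the score
        bool_clause["filter"] = [{"range": {"stars": star_range}}]
    return {"query": {"bool": bool_clause}}


print(json.dumps(bool_query("pizza", should_text="cookie",
                            must_not_text="cold", min_stars=4), indent=2))
```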
This is your design decision... You have access to ratings over time, sentiment of ratings, terms that match, places that have those ratings....
Let's say we do "places that have those terms, sorted by mean stars."
I'm matching all words in the query in this example. Do it as you like...
Aggregate counts of matching businesses + their stars.
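One way to express "places that have those terms, sorted by mean stars" as a request body: a terms aggregation on the business id with an avg sub-aggregation, ordered by that average. This is a sketch; the field names business_id and stars, the aggregation names, and the helper itself are assumptions.

```python
import json


def businesses_by_mean_stars(query_text, size=10):
    """Count matching reviews per business and sort businesses by mean stars."""
    return {
        "size": 0,  # we only want the aggregation buckets, not the hits
        "query": {
            "match": {"text": {"query": query_text, "operator": "and"}}
        },
        "aggs": {
            "by_business": {
                "terms": {
                    "field": "business_id",
                    "size": size,
                    # order buckets by the sub-aggregation defined below
                    "order": {"mean_stars": "desc"},
                },
                "aggs": {"mean_stars": {"avg": {"field": "stars"}}},
            }
        },
    }


print(json.dumps(businesses_by_mean_stars("pizza cookie"), indent=2))
```

Each bucket in the response carries a doc_count (matching reviews for that business) plus the mean_stars value.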
Fix the info design, of course :)
Use the biz id stored in the link text to run a new search limited to that business id, returning its reviews that match the search text (we sort by date).
Searching for matches in the single business:
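A sketch of what that single-business request could look like: the same text match, restricted with a term filter on the business id and sorted by date. Field names (text, business_id, date), the newest-first sort direction, the helper name, and the example id are all assumptions.

```python
import json


def reviews_for_business(business_id, query_text, size=20):
    """Matching reviews for one business, newest first."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {"match": {"text": {"query": query_text, "operator": "and"}}}
                ],
                # exact match on the un-analyzed id field
                "filter": [{"term": {"business_id": business_id}}],
            }
        },
        "sort": [{"date": {"order": "desc"}}],
    }


# "some-business-id" is a placeholder, not a real Yelp id.
print(json.dumps(reviews_for_business("some-business-id", "pizza cookie"), indent=2))
```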