solving your #slack anxiety since 2016
your problem, my problem, our problem...
We all love Slack. Slack creates a “virtual hallway” where ideas flow freely, questions are answered by the collective, and debate on topics of interest flourishes. This is wonderful. However, as working students (or, one might argue, as members of the 21st century), we recognize that our limited time is no match for the deluge of information on Slack. This leads to a serious case of Slack anxiety: with so much useful and valuable information out there, finding the information you need becomes a case of finding a needle in a haystack.
Slack already provides a structure for organizing conversations: channels. Every team has a number of channels where conversation is housed. The channels are sorted by time of message. When a user opens a channel, the most recent messages appear first. The user can then scroll up to find older messages.
Because we were unsatisfied with Slack's default organizing structure, we created awaybot. A user simply invites awaybot to listen to one or more public channels. The bot listens in real time to the conversations as they unfold. It continuously uses natural language processing algorithms to divide the conversation into topics and summarize these topics. When users feel overwhelmed by the deluge of information on Slack, they simply query awaybot. The user provides a command, the name of the channel, and a time duration. The bot returns a high-level summary of the topics that transpired during that duration. It's never been easier to get summaries of the conversations you missed!
the architecture...
We use the Slack Real Time Messaging (RTM) API to listen to the Slack channel. A simple Kafka producer writes this information to a Kafka cluster, and a Kafka consumer then stores the data in a Cassandra cluster, from which our NLP pipeline builds summaries for different time durations. These summaries are stored in SimpleDB. When the user queries the Slackbot for a summary, the request goes to a Python layer that queries SimpleDB for the relevant information and serves it to the user through the Slackbot.
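As a sketch of the first hop in this pipeline, a Slack RTM "message" event might be shaped into a Kafka record along these lines. The field selection and the choice of channel id as partition key are our illustrative assumptions, not awaybot's actual schema:

```python
import json

def to_kafka_record(event):
    """Shape a Slack RTM 'message' event into a (key, value) pair for Kafka.

    Using the channel id as the partition key keeps each channel's
    messages ordered within a single partition.
    """
    if event.get("type") != "message" or "subtype" in event:
        return None  # skip joins, edits, bot notices, etc.
    key = event["channel"].encode("utf-8")
    value = json.dumps({
        "channel": event["channel"],
        "user": event["user"],
        "ts": event["ts"],
        "text": event["text"],
    }).encode("utf-8")
    return key, value

# With a live broker, the record could then be produced with kafka-python:
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("slack-messages", key=key, value=value)

event = {"type": "message", "channel": "C123", "user": "U42",
         "ts": "1464900000.000001", "text": "anyone tried awaybot?"}
record = to_kafka_record(event)
```

The producer call is left as a comment because it needs a running broker; the shaping logic itself is plain stdlib.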
the natural language processing (nlp)...
We have successfully created a topic generator that appropriately splits conversation into relevant topics based on two key ideas: implicit reply and concept similarity.
Implicit Reply clusters messages based on the probability that a message is a reply to a previous message. It does not rely on the assumption that related messages share similar words, which is helpful because users conversing about a single topic often swap in different words for the same idea.
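The post doesn't specify the reply model itself, so purely as an illustration, here is a toy reply-probability score that decays with the time gap between two messages and rewards an @-mention of the earlier author. Note that, like Implicit Reply, it deliberately ignores shared words; the half-life and boost values are made up:

```python
import math

def reply_probability(msg, candidate, half_life=300.0):
    """Toy stand-in for an implicit-reply score (not awaybot's actual model).

    The probability that `msg` replies to `candidate` halves every
    `half_life` seconds of time gap, and gets a flat boost if `msg`
    @-mentions candidate's author.
    """
    gap = float(msg["ts"]) - float(candidate["ts"])
    if gap < 0:
        return 0.0  # can't reply to a later message
    p = math.exp(-gap * math.log(2) / half_life)
    if "<@%s>" % candidate["user"] in msg["text"]:
        p = min(1.0, p + 0.5)
    return p

history = [
    {"user": "U1", "ts": "900.0", "text": "kafka or kinesis?"},
    {"user": "U2", "ts": "1000.0", "text": "lunch anyone?"},
]
msg = {"user": "U3", "ts": "1050.0", "text": "<@U1> kafka, for the ecosystem"}
scores = [reply_probability(msg, c) for c in history]
# The @-mention lifts the older first message above the more recent second one.
```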
We calculate concept similarity by comparing the geometric representations of words and sentences using Google's Word2Vec models. We first stem the words, and then compare the distances between the message vectors. Messages closer together are closer in semantic meaning. Messages farther apart are semantically more dissimilar.
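A minimal sketch of that comparison: average each message's word vectors and take cosine similarity. The toy 3-d vectors here stand in for Word2Vec's 300-d embeddings, and the stems and values are invented for illustration:

```python
import math

# Toy 3-d vectors standing in for Word2Vec embeddings (values invented).
VEC = {
    "deploy":  [0.9, 0.1, 0.0],
    "ship":    [0.8, 0.2, 0.1],
    "releas":  [0.85, 0.15, 0.05],   # stemmed form of "release"/"releasing"
    "lunch":   [0.0, 0.1, 0.9],
    "burrito": [0.1, 0.0, 0.95],
}

def message_vector(stemmed_words):
    """Average the word vectors of a message's stemmed words."""
    vecs = [VEC[w] for w in stemmed_words if w in VEC]
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(3)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

m1 = message_vector(["deploy", "ship"])     # "we should deploy and ship"
m2 = message_vector(["releas"])             # "releasing today"
m3 = message_vector(["lunch", "burrito"])   # "burrito for lunch?"
# m1 and m2 share a concept despite sharing no words, so they sit much
# closer together than m1 and m3.
```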
Once the topic generation is complete, we summarize the topics using word clouds. These word clouds are built around the term frequency–inverse document frequency (TF-IDF) statistics of the words in the topic. This statistic is a measure of how important a word is to a topic in the corpus of all topics. It does this by calculating how often the word appears in that particular topic, offset by how often the word appears in the corpus of all topics.
To test the relevance and validity of awaybot's outputs, we created a small usability test with three testers. We used a public dataset of messages from a channel and team with which none of our testers were familiar. This channel produced 94 messages in a given month.
First, one human tester segmented all messages into topics. The tester created 8 topics. When awaybot ran through the same messages, it produced 7 topics. When the topics were compared, 6 of awaybot's 7 topics were also produced by the human tester.
Second, a different human tester read the 7 topics awaybot created and summarized each one without looking at the word clouds awaybot generated. Third, the remaining tester looked only at awaybot's word-cloud summarizations and wrote a summary of the meaning each word cloud conveyed. When the two testers' summarizations were compared, they were found to be similar for the majority of topics.
Overall, the validity tests showed that our topic generator algorithm is successful at creating topics that match human intuition. We have room for fine-tuning our summarization algorithm by filtering out more stop words and exploring algorithms more robust than TF-IDF.
We have the opportunity to explore more algorithms for creating and summarizing topics. Right now, we use an algorithm that uses TF-IDF statistics to find important words in topics. However, there are a variety of methods (including TextRank, Latent Semantic Analysis, Latent Dirichlet Allocation, etc.) for summarizing information that might prove useful in our application.
In the longer term, we plan to launch our code as a Python package that can serve as a base framework. We have written code that is modular and extensible, and we have documented it well, which makes it accessible to others. We could release this code as a package that others in industry and research can use as a baseline for further exploration. For example, we can imagine our framework serving as a useful tutorial on the Slack API or as a basic algorithm for identifying and summarizing topics.
☆ ☆ ☆ ☆ ☆
meet the all-star slackpack
The slackpack is a team of students enrolled in UC Berkeley's Master of Information and Data Science program. This project is the culmination of a semester-long class (W210 - Capstone) that brought together all that we had learned during the program.