Word Prediction via Ngram Model

If you don’t know what it is, try it out here first! If you just want to see the code, check out my GitHub.

OK, if you tried it out, the concept should be easy to grasp. A gram is a unit of text; in our case, a gram is a word. In this application we use trigrams – pieces of text with three grams, like “how are you” or “today I meet”. If we accumulate enough trigrams, we will know which trigrams appear more often than others. That is the mechanism behind word prediction: given the last two words a user typed, suggest the third word of the most frequent matching trigram. You can go to Google Ngram Viewer to play with it.
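To make this concrete, here is a minimal sketch (with a made-up three-line corpus, using only base R) of how trigram counting turns into a prediction:

# toy corpus; in the real model this is millions of lines of text
corpus <- c("how are you", "how are you", "how are things")

# split each line into words and collect all consecutive word triples
words <- strsplit(corpus, " ")
trigrams <- unlist(lapply(words, function(w) {
  if (length(w) < 3) return(character(0))
  sapply(1:(length(w) - 2), function(i) paste(w[i:(i + 2)], collapse = " "))
}))

# count how often each trigram occurs
sort(table(trigrams), decreasing = TRUE)
# "how are you" occurs twice and "how are things" once,
# so after "how are" we predict "you"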

Why do we choose trigrams, rather than bigrams (two-grams) or 4-grams? A bigram model conditions on only one previous word, so it is not particularly interesting; a 4-gram model requires a much larger amount of raw text to train. You’ll see what I mean.

This project is inspired by the MOOC course Data Science Capstone on Coursera, which was brought to my attention by Ron. I didn’t take the class though; I invited Google as my teacher and came up with the solution below.

Build the model

In this section I show you how to build the trigram model, step by step. First, let’s load some packages:

library(tidyverse)  # pipes (%>%) and general data wrangling
library(tau)        # textcnt() for n-gram counting
library(tm)         # stripWhitespace() for text cleaning
library(hash)       # hash tables to store the model

Second, load the data. The data can be downloaded here. Unzip the data and put it into a data/ folder:

blog.full <- readLines("data/en_US/en_US.blogs.txt")
news.full <- readLines("data/en_US/en_US.news.txt")
twit.full <- readLines("data/en_US/en_US.twitter.txt")

Although the text data can be handled by a single machine, the trigram model will probably be too big to build in one pass, so we divide the data into 50 batches:

# number of batches
nbatch <- 50
# lines per batch for each source
blog.len <- ceiling(length(blog.full) / nbatch)
news.len <- ceiling(length(news.full) / nbatch)
twit.len <- ceiling(length(twit.full) / nbatch)

This way we can process the batches one at a time and free memory between them.
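For example, here is a minimal sketch of how batch b (counting from 0) maps to line numbers in the blog corpus; the min() keeps the last batch from running past the end of the file:

b <- 0  # first batch
idx <- (blog.len * b + 1):min(blog.len * (b + 1), length(blog.full))
blog <- blog.full[idx]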

We define some functions for preprocessing, to remove non-English characters, Twitter handles, and URLs:

# preprocessing functions
removeURL <- function(x) gsub("http\\S+", "", x)
removeHash <- function(x) gsub("[@#&]\\S+", "", x)
# don't know how to remove location, like "@ los angeles"
# keep only English letters, spaces and apostrophes;
# note [A-Za-z] rather than [A-z], since [A-z] also matches a few ASCII punctuation marks
removeNumPunct <- function(x) gsub("[^A-Za-z[:space:]']+", "", x)
# it turns out that Japanese characters belong to [:alpha:] but not to [A-Za-z],
# which is why we don't simply use [:alpha:] here. Absurd.
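As a quick sanity check, here is what the pipeline does to a made-up line (the sample string and the shown output are just for illustration):

sample <- "Check out http://tiny.url NOW!!! Thanks @friend & co, 100 times"
sample %>% removeURL() %>% removeHash() %>%
  removeNumPunct() %>% tolower() %>% stripWhitespace()
# roughly: "check out now thanks co times"
# the URL, the @handle, numbers and punctuation are gone, and the
# leftover double spaces are collapsed by stripWhitespace()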

OK, it’s time to actually build the trigram model. Here I use a hash table to represent the model, since lookups are fast and it handles a large number of keys reliably. In the for loop below, we process each batch and merge the result into the hash table:

h <- hash()

# loop over batches (b counts from 0 so it can be used directly as an offset)
for (b in 0:(nbatch - 1)) {
  message(sprintf("Processing batch %i of %i", b + 1, nbatch))
  
  # take the b-th slice of each corpus; min() keeps the last batch in range
  blog <- blog.full[(blog.len * b + 1):min(blog.len * (b + 1), length(blog.full))]
  news <- news.full[(news.len * b + 1):min(news.len * (b + 1), length(news.full))]
  twit <- twit.full[(twit.len * b + 1):min(twit.len * (b + 1), length(twit.full))]
  
  # concatenate the three sources and preprocess
  bnt <- c(blog, news, twit) %>% 
    removeURL() %>% removeHash() %>% 
    removeNumPunct() %>% tolower() %>% stripWhitespace()
  
  # count trigrams in this batch, most frequent first
  trigram <- textcnt(bnt, n = 3, split = " ", method = "string", decreasing = TRUE) 
  
  # merge the batch counts into the hash table for fast prediction
  for (i in 1:length(trigram)) {
    if (trigram[i] < 4) {
      # counts are sorted, so everything from here on is too rare to keep
      break
    } else {
      gram <- strsplit(names(trigram[i]), split = ' ')[[1]]
      history <- paste0(gram[1:2], collapse = ' ')   # first two words = key
      candidate <- gram[3]                           # third word = prediction
      count <- trigram[[i]]
      if (candidate %in% h[[history]]$candidate) {
        # candidate already seen in an earlier batch: add to its count
        index <- h[[history]]$candidate == candidate
        h[[history]]$count[index] <- h[[history]]$count[index] + count
      } else {
        # new candidate for this history: append it
        h[[history]]$candidate <- c(h[[history]]$candidate, candidate)
        h[[history]]$count <- c(h[[history]]$count, count)
      }
    }
  }
  gc()
}

The hash table is a set of key-value pairs: the key is the first two words of a trigram, and the value holds the possible last words together with their frequencies. For example, h[["how are"]] can be list(candidate = c("you", "things"), count = c(9, 5)), meaning the trigram “how are you” occurs 9 times and “how are things” occurs 5 times.
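To get from here to the app below, the table has to be written to disk so the app can load it; prediction is then just a lookup on the two-word history. A minimal sketch of both steps (the predict.next helper is only an illustration, not part of the app code):

# save the model so the Shiny app can load it later
save(h, file = "hashtable.Rdata")

# look up a two-word history and return the top candidates, highest count first
predict.next <- function(history, n = 5) {
  entry <- h[[history]]
  if (is.null(entry)) return(character(0))
  ord <- order(entry$count, decreasing = TRUE)
  head(entry$candidate[ord], n)
}

predict.next("how are")
# e.g. "you" "things" ...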

Serve the App

Once we have the hash table, we can use it to build the app: we simply look up the last two words the user has typed and return the most probable next words from the table. Here we use shiny to build and deploy the app. For more information, go to Shiny.

library(shiny)
library(hash)
load("hashtable.Rdata")

ui <- fluidPage(
  
  titlePanel("Word Prediction via Ngram"),
  
  hr(),
  
  fluidRow(
    column(5, textInput("userInput", "Please type here ... type at least 
                        two words to enable prediction.")),
    column(7, h3(verbatimTextOutput(outputId = "distPrediction")))
  ),
  
  hr(),
  
  fluidRow(
    column(8, h3("What is this?"),
              h4("This app predicts the next word you will most probably type, based on 
                 previous two words that you already typed. This app uses a tri-gram model
                 that is trained over ~500 MB text data from blogs, news articles and 
                 twitter. Get the source code ", 
                 tags$a(href="https://github.com/chunjiw/wordpred", "here!")))
  ),
  
  hr(),
  h4("Made by Chunji Wang")

)

server <- function(input, output) {
  output$distPrediction <- renderText({
    # clean up the user input
    userInput <- trimws(input$userInput)
    userInput <- gsub('\\s+', ' ', userInput)
    pred.sentence <- paste(userInput, "...")
    words <- strsplit(tolower(userInput), split = ' ')[[1]]
    if (length(words) >= 2) {
      # the last two words typed are the key into the hash table
      history <- paste0(tail(words, 2), collapse = ' ')
      candidates <- h[[history]]$candidate
      counts <- h[[history]]$count
      # most frequent candidates first
      candidates <- candidates[order(counts, decreasing = TRUE)]
      candidates[candidates == 'i'] <- 'I'
      if (length(candidates)) {
        maxDisplay <- min(5, length(candidates))
        for (i in 1:maxDisplay) {
          pred.sentence <- paste(pred.sentence,
                                 paste(userInput, candidates[i], sep = ' '),
                                 sep = '\n')
        }
      }
      pred.sentence
    } else {
      userInput
    }
  })
}

shinyApp(ui = ui, server = server)
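To put the app online, one option is shinyapps.io. A minimal sketch, assuming you have an account set up and authorized with the rsconnect package, and that app.R and hashtable.Rdata sit together in the app directory (the app name below is just an example):

library(rsconnect)
# upload the app directory (app.R plus hashtable.Rdata) to shinyapps.io
deployApp(appDir = ".", appName = "wordpred")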

Voila! Now send the link of your app to your friends and enjoy some thumbs up!

Leave comments below if you have questions! Or if you also successfully made the app!