WordCloud Using R Language

20 Sep, 2018 / WHIZ.AI By: Innovation Incubator

Why is R considered one of the most prominent languages for data science? What is the special recipe that makes this language work with data so efficiently? In this blog we will cover the key points of R and give you a trial run of working with R by generating a word cloud from an article.

Introduction to R programming :

R is a programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. It was developed by Ross Ihaka and Robert Gentleman in 1993. R offers an extensive catalog of statistical and graphical methods, including machine learning algorithms, linear regression, time series analysis, and statistical inference, to name a few. Most R libraries are written in R itself. The language is widely used among statisticians and data miners for developing statistical software and for data analysis. R is trusted not only in academia; many large companies also use it, including Uber, Google, Airbnb, and Facebook.

Why do we use R nowadays?

A large share of data scientists use R for data mining and statistical analysis, and it is a language of choice within the rather nebulous “Big data” industry you keep hearing about. R includes built-in functions and variables designed to make statistical analysis easier, and it provides graphics tools that produce publication-quality data visualizations. R is highly extensible, and many packages exist to address specific data analysis tasks and problems. It owes part of its popularity to its open-source status, which means that anyone can use R and have access to world-class statistical analysis tools. R is designed to work on virtually any platform and runs on Unix, Linux, Windows, and macOS systems.

R – Installation :

To install R, follow the link below:
https://cran.r-project.org/

RStudio Installation :

To install RStudio, follow the link below:
https://www.rstudio.com/products/rstudio/download/#download

R Packages :

Packages are collections of **R** functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. **R** comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.
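As a small sketch of the "install once, then load" idea, the snippet below checks whether a package is already present in the library before trying to install it. The helper name `ensure_package` is our own invention for illustration and is not part of R.

```r
# Hypothetical helper (not part of base R): report whether a package
# is installed, and optionally install it from CRAN if it is missing.
ensure_package <- function(pkg, install = FALSE) {
  # requireNamespace() returns TRUE if the package can be loaded
  available <- requireNamespace(pkg, quietly = TRUE)
  if (!available && install) {
    install.packages(pkg)
    available <- requireNamespace(pkg, quietly = TRUE)
  }
  available
}

# Directories where R looks for installed packages (the "library")
print(.libPaths())

# "stats" ships with every standard R installation
print(ensure_package("stats"))
```

Calling `ensure_package("tm", install = TRUE)` would install the package only when it is missing, which avoids re-downloading it on every run.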

Install : Package Description

install.packages("tm") # for text mining
install.packages("wordcloud") # word-cloud generator

The package “tm” is used for text mining, and the package “wordcloud” is used to generate the word cloud.

Load : Library

library("tm")
library("wordcloud")

The library() function loads installed packages into the current session.

Choosing The File

text <- readLines(file.choose())

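Choose the file (for example, your article) for which you want to generate the word cloud. file.choose() opens a file picker, and readLines() reads the chosen file one line at a time.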
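If you would rather not use the interactive file picker (for instance, when running a script), you can pass a path to readLines() directly. The sketch below assumes a made-up two-line sample written to a temporary file; `sample_path` and the sample sentences are placeholders, not part of the original dataset.

```r
# Write a two-line sample "article" to a temporary file so the
# example runs without an interactive file picker.
sample_path <- tempfile(fileext = ".txt")
writeLines(c("R is a language for statistics.",
             "Word clouds show frequent words."), sample_path)

# readLines() returns a character vector with one element per line
text <- readLines(sample_path)
print(length(text))
print(text[1])
```

In your own script, replace `sample_path` with the path to your article.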

Load the data as a corpus, that is, a collection of documents containing (natural language) text

words <- Corpus(VectorSource(text))
inspect(words) # View

This loads our text into a corpus, from which the words will be extracted; inspect() displays its contents.

Cleaning the Text

 
# Convert the text to lower case
words <- tm_map(words, content_transformer(tolower))
 
# Remove numbers
words <- tm_map(words, removeNumbers)
 
# Remove common English stopwords
words <- tm_map(words, removeWords, stopwords("english"))
 
# Remove your own stopwords, specified as a character vector
words <- tm_map(words, removeWords, c("the", "is"))
 
# Remove punctuation
words <- tm_map(words, removePunctuation)
 
# Eliminate extra white spaces
words <- tm_map(words, stripWhitespace)

Before processing the corpus, we need to clean it, for example by removing stop words, punctuation, and extra whitespace.
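To see what these cleaning steps do, here is a rough base-R sketch that applies the same kind of transformations to a single line of text. It is not how tm implements them internally; the sample sentence and the three-word stop-word list are assumptions chosen for illustration.

```r
# Base-R sketch of the cleaning steps on one line of text.
raw <- "The 3 Cats   sat, and the cat is HAPPY!"

# Tiny illustrative stop-word list (assumption); tm's English list is longer
my_stopwords <- c("the", "is", "and")

clean <- tolower(raw)                               # lower case
clean <- gsub("[[:digit:]]+", "", clean)            # remove numbers
pattern <- paste0("\\b(", paste(my_stopwords, collapse = "|"), ")\\b")
clean <- gsub(pattern, "", clean, perl = TRUE)      # remove stop words
clean <- gsub("[[:punct:]]+", "", clean)            # remove punctuation
clean <- trimws(gsub("\\s+", " ", clean))           # collapse spaces, trim ends

print(clean)  # "cats sat cat happy"
```

Note that "cats" and "cat" remain separate words; the steps above do not perform stemming.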

Build a term-document matrix


# Build the term-document matrix (rows = terms, columns = documents)
tdm <- TermDocumentMatrix(words)
m <- as.matrix(tdm)
# Total frequency of each term, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)

This counts how often each word occurs across the documents and lists the ten most frequent words.
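The idea behind the frequency count can be sketched in base R without tm. The two short "documents" below are made-up sample data, and the names `word_counts` and `freq_df` are our own; the result is a word/frequency table of the same shape as the one used for the word cloud.

```r
# Two already-cleaned sample documents (made-up data)
docs <- c("cats sat cat happy", "happy cats purr")

# Split every document into words and pool them together
tokens <- unlist(strsplit(docs, "\\s+"))

# Count occurrences of each word and sort, most frequent first
word_counts <- sort(table(tokens), decreasing = TRUE)

# Word/frequency data frame, one row per distinct word
freq_df <- data.frame(word = names(word_counts),
                      freq = as.integer(word_counts),
                      stringsAsFactors = FALSE)
print(head(freq_df, 3))
```

Here "cats" and "happy" each appear twice and every other word once, so they come first in the table.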

Generate the Word cloud

set.seed(1)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.15,
          colors = brewer.pal(8, "Dark2"))

WORDCLOUD FOR THE DATASET

Dataset  : https://drive.google.com/open?id=1XEw73_0DmYYM48C1sasqfVLdTtUgu9r4

Conclusion:

R is free and open source, which gives anyone access to world-class statistical analysis tools. It is widely used in academia and the private sector and is one of the most popular programming languages for statistical analysis today. Learning R isn’t easy; if it were, data scientists wouldn’t be in such high demand. However, there is no shortage of quality resources you can use to learn R if you’re willing to put in the time and effort.

Author : Vishnu Mankulam , Data Engineer