Tuesday, October 11, 2016

Using MongoDB with R directly

Hello everyone, this post is about using MongoDB directly from R with the "mongolite" library. The script below builds a data frame with 20,002 rows and then inserts it 'n' times into a MongoDB database, as a small write benchmark.

The code was tested in RStudio (R version 3.3.1) and should also work in command-line R. It assumes that MongoDB is already installed on your computer and running.

The first step is to install the package with the command: install.packages("mongolite")

Then we can use it: library(mongolite)

Note that install.packages may also install some additional packages that mongolite needs to work properly.

--Inserting a custom dataframe into the database and the benchmark function:

library(mongolite)
# With no other arguments, mongo() connects to mongodb://localhost
# (default port 27017) and uses the database "test"
m <- mongo(collection = "test")

# Start with an empty object; rbind() will turn it into a character matrix
df <- NULL
title <- c("name","path","timestamp","type")

# format() keeps the timestamp readable: inside c() with strings, a raw
# Sys.time() value would be stored as a number of seconds in text form
t1 <- Sys.time()
c1 <- "creation"
d1 <- c("user","/Users/user/home/examples/test",format(t1),c1)

t2 <- Sys.time()
c2 <- "creation"
d2 <- c("user2", "/Users/user2/home/examples/test",format(t2),c2)

df <- rbind(df,d1)
df <- rbind(df,d2)
colnames(df) <- title

# Append 10,000 "update" rows per user (20,000 more rows, 20,002 in total)
for (i in 1:10000) {
  t1 <- Sys.time()
  c1 <- "update"
  d1 <- c("user1", "/Users/user/home/examples/test",format(t1),c1)

  t2 <- Sys.time()
  c2 <- "update"
  d2 <- c("user2", "/Users/user2/home/examples/test",format(t2),c2)
  df <- rbind(df,d1)
  df <- rbind(df,d2)
}

print("DATA CREATED")
df2 <- data.frame(df)
print("Converted to DATAFRAME")

insert <- function() {
  initialTime <- Sys.time()
  m$insert(df2)
  finalTime <- Sys.time()
  total <- difftime(finalTime,initialTime, units = c("secs"))
  return(total)
}

x <- 0
y <- 0

# This function inserts the 20,002 rows 'n' times into the database and then plots the time each insert took against the number of items in the database.

testIt <- function(iterations) {
  y <<- rep(0,iterations)
  x <<- rep(0,iterations)
  for(i in 1:iterations) {
    x[i] <<- as.numeric(insert())
    y[i] <<- m$count()
  }
  plot(x,y,xlab="Time To Write", ylab="Number of items in database")
}
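
The script above grows the table one row at a time and keeps the results in global variables. As a rough sketch only, and not the way the original benchmark was run, the same idea can be written more compactly. The names df_fast and run_benchmark are made up for this example, and unlike the loop above every row gets the same timestamp:

```r
# Build the 20,002-row table in one step instead of calling rbind() in a loop.
n_updates <- 10000  # "update" rows per user, as in the loop above

df_fast <- data.frame(
  name = c("user", "user2", rep(c("user1", "user2"), times = n_updates)),
  path = rep(c("/Users/user/home/examples/test",
               "/Users/user2/home/examples/test"), times = n_updates + 1),
  timestamp = format(Sys.time()),  # one value, recycled for every row
  type = c(rep("creation", 2), rep("update", 2 * n_updates)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: times insert_fn() 'iterations' times and records the
# value of count_fn() after each call. It returns the results instead of
# writing them to global variables.
run_benchmark <- function(insert_fn, count_fn, iterations) {
  seconds <- numeric(iterations)
  items <- numeric(iterations)
  for (i in seq_len(iterations)) {
    seconds[i] <- system.time(insert_fn())[["elapsed"]]
    items[i] <- count_fn()
  }
  data.frame(seconds = seconds, items = items)
}

# With a running MongoDB server it could be used like this (not run here):
# m <- mongolite::mongo(collection = "test")
# res <- run_benchmark(function() m$insert(df_fast), function() m$count(), 10)
# plot(res$seconds, res$items, xlab = "Time To Write",
#      ylab = "Number of items in database")
```

Passing the insert and count steps as functions makes it possible to try the helper without a database, for example with a simple counter in place of the collection.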

That's it.
For more information, refer to the following webpage:
https://cran.r-project.org/web/packages/mongolite/vignettes/intro.html

Monday, October 10, 2016

Install Spark in RStudio

Hello everyone, today I will share my experience of setting up Spark for use from R after downloading it from the Spark website.

1) Go to the Spark website and download the latest version

2) Extract the downloaded archive and put the Spark folder wherever you want (for example, in your home folder)

3) The R library should be in /R/lib/SparkR/ (inside your Spark folder)

4) Set the environment variable in R:
4.1) To check the current environment variables, use the command: Sys.getenv()
4.2) Then set the environment variable in RStudio and add Spark's R library folder to the library path:
Sys.setenv(SPARK_HOME='/YOUR_PATH_TO_SPARK')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
Done!

Using Spark

To use Spark, you simply call library(SparkR), as you would with any other library.
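
To check that the setup worked, something like the sketch below can help. The helper find_sparkr_lib is invented for this example and is not part of SparkR. The session calls assume Spark 2.0 or later (sparkR.session does not exist in the 1.x releases) and only run when the SparkR folder is actually found under SPARK_HOME:

```r
# Hypothetical helper (not part of SparkR): returns the R library folder of a
# Spark installation, or NA if there is no SparkR folder inside it.
find_sparkr_lib <- function(spark_home = Sys.getenv("SPARK_HOME")) {
  lib_dir <- file.path(spark_home, "R", "lib")
  if (nzchar(spark_home) && dir.exists(file.path(lib_dir, "SparkR"))) {
    lib_dir
  } else {
    NA_character_
  }
}

lib_dir <- find_sparkr_lib()
if (!is.na(lib_dir)) {
  .libPaths(c(lib_dir, .libPaths()))
  library(SparkR)
  sparkR.session(master = "local[*]")  # local session, assumes Spark 2.0+
  sdf <- createDataFrame(faithful)     # copy a built-in R dataset to Spark
  print(head(sdf))
  sparkR.session.stop()
} else {
  message("SparkR not found, check the SPARK_HOME variable")
}
```

If the message appears, the path given to Sys.setenv is probably not the top-level Spark folder.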