Very Basic Guide to Use RSQLite
Very Basic Guide to Use RSQLite
Firstly, you need to load the packages. You could also use lapply to load multiple packages at one go.
# Run this first: install.packages(c('RSQLite.extfuns, 'ggplot2'))
library(DBI)
library(RSQLite)
library(ggplot2)
library(RSQLite.extfuns)
library(plyr)
Establish a sqlite connection with a database file, data-metro.db, and store it as db. In C, you might use something like
apop_db_open("data-metro.db")
If you want to practice the example data, you probably could download the database file from Ben Kleman's blog. Further description about the data is in his book, Modeling with Data.
db = dbConnect(dbDriver("SQLite"), dbname = "data-metro.db", loadable.extensions = TRUE)
# List the tables in this database, so you know which tables to call for.
dbListTables(db)
## [1] "lines" "riders"
The dbListTables lets you know which tables are in the database, you can then use dbGetQuery returns a data frame. SELECT * call all the columns in the table lines (FROM a table called lines).
metroLines <- dbGetQuery(db, " SELECT * FROM lines")
metroRiders <- dbGetQuery(db, " SELECT riders/100. AS riders, line FROM riders, lines WHERE lines.station = riders.station")
The second data frame used a more complicated SQL language. The uppercase is merely to signify for the user to know they are commands to instruct the SQL to query something. So the language goes like,
- SELECT the column, named riders and divided it by 100 FROM the table riders. The dot after 100 is to cause the calculation working on float point rather than on integer.
- Store the result AS riders, which is a new column in the stored data frame. Also store the column line from the lines table.
- The last line links the two tables through the relation of station column, which co-exist in both tables.
Below are the equations of within-variance and amongst-variance. They are separated from the total variance equation, ( var(x) = E[( oldsymbol{x} - ar{x})] ). This piece is about mathematics, not about programming languages.
[ var(x)_{within} = frac{1}{n} sum_{j} [n_j var(x_j)] ]
[ var(x)_{amongst} = frac{1}{n} sum_{j} [n_j ( ar{x_j} - ar{x})^2] ]
# Print the total variance
totalvar <- (metroRiders$riders - mean(metroRiders$riders))^2
sum(totalvar)/length(totalvar)
## [1] 4404
# The residuals Mean Sq also shows the within variance
aov1 <- aov(riders ~ factor(Line), data = metroRiders)
summary(aov1)
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(Line) 4 341583 85396 19.9 3.4e-16 ***
## Residuals 2641 11311035 4283
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
metroLineTab <- ddply(metroRiders, .(Line), summarise, vars = var(riders), N = length(riders),
mean = mean(riders))
# Print the within variance
sum(metroLineTab$N * metroLineTab$vars)/nrow(metroRiders)
## [1] 4282
# Print the amongst variance
(sum(metroLineTab$N * (metroLineTab$mean - mean(metroRiders$riders))^2))/nrow(metroRiders)
## [1] 129.1
Here are the result calculated directly from Kleman's C programme.
total variance: 4404.63
within group variance: 4275.54
among group variance: 129.138
sum within+among: 4404.68
download file now