ML_for_Hackers
Chapter 3 - Error executing get.msg()
Hello guys,
Great book :-) Right now, I am in the 3rd chapter (e-mail classification). I am executing the R commands one by one and I am having a problem getting the list of spam documents (page 81). The command is: all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path,p,sep="")))
and the error I get is: Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.)
Any clue? Thank you very much
I wish there was some way to upvote an issue. I'm having the exact same problem. I figured out that the problem seems to be with the "encoding" argument to the "file" function. If you remove it, it works, but the results you get are somewhat different from those in the book. Also, some weird tokens appear in the list of words found in the corpus. Someone also reported this problem at the Unconfirmed Errata page for the book at O'Reilly: http://oreilly.com/catalog/errataunconfirmed.csp?isbn=0636920018483
Sorry about the lag on this, all. We'll look into it more this weekend and report back.
I am having trouble replicating the error. The current version of the code in the repository reads as follows:
# Get all the SPAM-y email into a single vector
spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs != "cmds")]
all.spam <- sapply(spam.docs,
function(p) get.msg(file.path(spam.path, p)))
It runs fine for me on OS X and Ubuntu. So perhaps the issue is the use of paste rather than file.path, or an operating system issue. The paste function does appear in the text of the book, which should be fixed in future editions.
I still get the errors when using file.path. These are the errors I get:
Error in seq.default(which(text == "")[1] + 1, length(text), 1) :
  invalid (to - from)/by in seq(.)
In addition: Warning messages:
1: In readLines(con) :
  invalid input found on input connection 'data/spam//00006.5ab5620d3d7c6c0db76234556a16f6c1'
2: In readLines(con) :
  invalid input found on input connection 'data/spam//00009.027bf6e0b0c4ab34db3ce0ea4bf2edab'
3: In readLines(con) :
  invalid input found on input connection 'data/spam//00031.a78bb452b3a7376202b5e62a81530449'
4: In readLines(con) :
  incomplete final line found on 'data/spam//00031.a78bb452b3a7376202b5e62a81530449'
5: In readLines(con) :
  invalid input found on input connection 'data/spam//00035.7ce3307b56dd90453027a6630179282e'
6: In readLines(con) :
  incomplete final line found on 'data/spam//00035.7ce3307b56dd90453027a6630179282e'
The problem seems to be with the encoding argument of the file function called in get.msg. If I remove encoding="latin1", the code runs without errors, but the results are quite different from those presented in the book.
I'm working on OS X with R 2.15.0.
What operating system and version of R are you using?
-- John
I'm using OS X Lion with R 2.15.0 (installed from MacPorts).
I also have this error.
That's because of the data files, not the code. Open and check the files matching data/spam/000*: one of them is not an email but a file list.
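@foxet is right. The file '0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1' causes the problem. I amended the filter so that it also excludes files whose names begin with '0000.' (str_detect is from the stringr package; the dot is escaped so that files such as 00001.* are not excluded as well):
spam.docs <- spam.docs[which( !str_detect(spam.docs,"^0000\\.") & spam.docs != 'cmds' )]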
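Rather than guessing which file names to exclude, one option is to wrap each call in tryCatch and collect the files that fail. This is only a sketch: find.bad.files is a hypothetical helper name (it is not in the book or the repository), and reader stands for whatever version of get.msg you are running.

```r
# Hypothetical helper: returns the names of the files in dir.path for which
# reader() raises an error. reader is any function that takes a file path,
# for example the get.msg function discussed in this thread.
find.bad.files <- function(dir.path, reader) {
  docs <- dir(dir.path)
  failed <- vapply(docs, function(p) {
    tryCatch({
      reader(file.path(dir.path, p))
      FALSE                      # reader finished without an error
    }, error = function(e) TRUE) # reader stopped with an error
  }, logical(1))
  docs[failed]
}
```

Calling find.bad.files(spam.path, get.msg) should then list the files worth opening by hand. Warnings from readLines are not treated as failures here, only errors.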
It's an encoding problem. readLines should work whether or not the file is an email. Using con <- file(path, open="rt") instead of con <- file(path, open="rt", encoding="utf-8") will work.
The encoding change does NOT seem to alter the behavior. I am running this on R 2.15.2 on Windows 7 x64. Here is my function:
get.msg <- function(path) {
  con <- file(path, open="rt", encoding="native.enc")
  text <- readLines(con)
  # The message always begins after the first full line break
  msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
  close(con)
  return(paste(msg, collapse="\n"))
}
I have changed the encoding to "utf-8" and "latin1" and nothing changes. Same error:
Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.)
I also applied the suggestions by foxet and quasiben. The fact is my spam folder does not have this file '0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1' at all.
What am I missing, folks?
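For what it's worth, the seq error is raised whenever the header/body split in get.msg fails, whatever the reason: a file that is not an email at all, or reading that stops before the first blank line. A small sketch of the two failure cases, using made-up file contents (the exact wording of the error differs between R versions):

```r
# Case 1: no empty line at all, so which(text == "")[1] is NA
no.blank <- c("[.ShellClassInfo]", "IconFile=foo")
start <- which(no.blank == "")[1] + 1
is.na(start)                 # TRUE

# Case 2: the only empty line is the last line, so 'from' exceeds 'to'
blank.last <- c("Subject: hi", "")
start2 <- which(blank.last == "")[1] + 1
start2 > length(blank.last)  # TRUE

# In both cases seq(from, length(text), 1) stops with an error
res1 <- tryCatch(seq(start, length(no.blank), 1),
                 error = function(e) "error")
res2 <- tryCatch(seq(start2, length(blank.last), 1),
                 error = function(e) "error")
```

So changing the encoding only helps if it changes what readLines returns for the offending file; it does nothing for a file that never contained a blank line.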
Do not define parameter "encoding", just use
con <- file(path, open="rt")
Alright, I did this now: (Removed the encoding parameter)
get.msg <- function(path) {
  con <- file(path, open="rt")
  text <- readLines(con)
  # The message always begins after the first full line break
  msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
  close(con)
  return(paste(text, collapse="\n"))
}
Ran the whole bunch again. The outcome:
Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.)
So, like I said earlier, the encoding parameter does not seem to have any effect. Again, I am running this on Windows 7 x64. And here is my whole script:
spam.path <- "datasets/spam/"
easyham.path <- "datasets/easy_ham/"
hardham.path <- "datasets/hard_ham/"
get.msg <- function(path) {
con <- file(path, open="rt")
text <- readLines(con)
# The message always begins after the first full line break
msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
close(con)
return(paste(text, collapse="\n"))
}
spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs!="cmds")]
spam.docs <- paste(spam.path, spam.docs, sep="")
all.spam.msgs <- sapply(spam.docs, get.msg)  # This is the line that throws the above error
You should check whether length(text) > 1.
haoyuan hu
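A guard along those lines might look like the sketch below. It is an assumption about what the check could be, not the book's code: get.msg.safe is a hypothetical name, and returning NA for files without a usable body is one design choice among several.

```r
# Sketch of a guarded get.msg: returns NA instead of failing when the file
# has no blank line separating headers from body, or nothing after it.
get.msg.safe <- function(path) {
  con <- file(path, open = "rt")
  text <- readLines(con)
  close(con)
  first.blank <- which(text == "")[1]
  if (is.na(first.blank) || first.blank >= length(text)) {
    return(NA_character_)
  }
  # The message body is everything after the first empty line
  msg <- text[(first.blank + 1):length(text)]
  paste(msg, collapse = "\n")
}
```

After all.spam.msgs <- sapply(spam.docs, get.msg.safe), the skipped files can be dropped with all.spam.msgs[!is.na(all.spam.msgs)].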
Lovely, that works!! Thanks mon. One last question: I see (intermittently) the socket open warning:
Warning message: closing unused connection 3 (datasets/spam/desktop.ini)
I presume this is because the underlying code failed to close all of the file connections? It does not happen every time, though.
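Most likely the warning appears when get.msg stops with an error before it reaches close(con): the connection is then left open until R's garbage collector closes it, which is why it shows up only now and then. A sketch that closes the connection on every exit path, using on.exit (get.msg.closing is a hypothetical name for this variant):

```r
get.msg.closing <- function(path) {
  con <- file(path, open = "rt")
  # on.exit runs when the function exits, whether normally or via an error
  on.exit(close(con))
  text <- readLines(con)
  # The message always begins after the first full line break
  msg <- text[seq(which(text == "")[1] + 1, length(text), 1)]
  paste(msg, collapse = "\n")
}
```

The file named in the warning, desktop.ini, is a Windows folder-settings file rather than an email, so it may be worth excluding it alongside cmds, for example with spam.docs[!spam.docs %in% c("cmds", "desktop.ini")].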
Is there a permanent fix for this issue? I'm having the same problem. If I remove the encoding on the file(), then the get.msg function will work, but obviously you lose some encoding information.
Using Win 7 (64bit), RStudio 0.96.331, R 2.15.2
Can confirm that I am seeing a similar issue as others above: Error in seq.default(which(text == "")[1] + 1, length(text), 1) : wrong sign in 'by' argument
Solved by dropping the encoding on con in get.msg. R 3.0.0 on Windows 7, 64 bit.
I have a problem with the following code:
get.msg <- function(path) {
  con <- file(path, open = "rt", encoding = "latin1")
  text <- readLines(con)
  # The message always begins after the first full line break
  msg <- text[seq(which(text == "")[1] + 1, length(text), 1)]
  close(con)
  return(paste(msg, collapse = "\n"))
}
What can I do? Somebody please help me!
I want to add that if I do not use the encoding parameter, it works, but when I run spam.tdm <- get.tdm(all.spam)
I get the following error: Error in tolower(txt) : invalid multibyte string 1
Does anyone have the same situation? Please help me!
Thanks
I have the same issue as y1239051. My system is Win7, 32bit, R version 3.0.2, RStudio Version 0.98.490. However, it seemed OK on my old XP system. Also, the command "spam.tdm <- get.tdm(all.spam)" took so long that I aborted it. I will try again.
Oops! I tried the XP system again and got the same error!
I found a solution following these steps:
- Remove "encoding='latin1'" in function get.msg()
- In function get.tdm(), add doc.corpus <- tm_map(doc.corpus, function(x) iconv(x, to='UTF-8', sub='byte')) before doc.dtm <- TermDocumentMatrix(doc.corpus, control)
With these changes the program runs normally, but the results are a little different:
head(spam.df[with(spam.df,order(-occurrence)),])
        term frequency     density occurrence
7471   email       813 0.005859586      0.566
18382 please       425 0.003063129      0.508
14339   list       409 0.002947811      0.444
26848   will       828 0.005967697      0.422
2831    body       379 0.002731591      0.408
9124    free       539 0.003884769      0.390
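The same iconv call can also be applied to the plain character vector before it is handed to Corpus, which avoids going through tm_map at all. A minimal sketch; sanitize.text is a hypothetical helper name, and the sample strings are made up. With sub = "byte", iconv replaces any byte it cannot convert with its hex code in angle brackets instead of returning NA.

```r
# Hypothetical helper: force a character vector into valid UTF-8.
# Bytes that cannot be converted are replaced by their hex codes.
sanitize.text <- function(doc.vec) {
  iconv(doc.vec, to = "UTF-8", sub = "byte")
}

# Made-up example: the second string contains a raw latin1 byte (0xe9)
raw.text <- c("plain ascii", "caf\xe9 au lait")
clean.text <- sanitize.text(raw.text)
```

In the thread's code this would be used as spam.tdm <- get.tdm(sanitize.text(all.spam)), leaving get.tdm itself unchanged.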
@y1239051 After I changed the function get.msg to {... con <- file(path, open = "rt") ...} and deleted the wrongly encoded words (just one sentence) in the file "00136.faa39d8e816c70f23b4bb8758d8a74f0", the command
all.spam <- sapply(spam.docs, function(p) get.msg(file.path(spam.path, p)))
works. But the following command, spam.tdm <- get.tdm(all.spam), gives the same problem: Error in .tolower(txt) : invalid multibyte string 1. How did you fix it? Thanks.
For those of you who still have this problem, I'd suggest trying to remove the "open" parameter from the file function. It worked for me on R 3.0.3, Win7 x64, and didn't break anything on R 3.1.1, Ubuntu 12.04.
After I corrected the encoding parameter to con <- file(path, open = "rt", encoding ="native.enc"), the program runs; however, it still shows the warning "incomplete final line found on 'data/spam/00136.faa39d8e816c70f23b4bb8758d8a74f0'" at the end. Does anyone know what's wrong with this warning?
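That warning only means the last line of the file does not end with a newline character; readLines still returns that line. If the message is distracting, readLines has a warn argument that suppresses it. A small sketch using a temporary file instead of the real corpus:

```r
tmp <- tempfile()
# cat() writes no trailing newline here, so the final line is "incomplete"
cat("Subject: test\n\nbody without trailing newline", file = tmp)

con <- file(tmp, open = "rt")
text <- readLines(con, warn = FALSE)  # no "incomplete final line" warning
close(con)
```

In get.msg the only change would be readLines(con, warn = FALSE). Note that warn = FALSE also hides the warning about embedded nul characters.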
Hi Donnie @Donnie-Liu,
I tested your solution; however, your change to get.tdm causes this error:
Error: inherits(doc, "TextDocument") is not TRUE
Could you paste the full text of your get.tdm definition?
Same thing here, okamipride. What is the solution to this warning?
library(tm)
library(ggplot2)

# defining paths
spam.path <- "data/spam/"
spam2.path <- "data/spam_2/"
easyham.path <- "data/easy_ham/"
easyham2.path <- "data/easy_ham_2/"
hardham.path <- "data/hard_ham/"
hardham2.path <- "data/hard_ham_2/"

# creating get.msg function
get.msg <- function(path) {
  con <- file(path, open="rt", encoding="native.enc")
  text <- readLines(con)
  # The message always begins after the first full line break
  msg <- text[seq(which(text=="")[1]+1,length(text),1)]
  close(con)
  return(paste(msg, collapse="\n"))
}
# creating spam training dataset
spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs!="cmds")]
all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path, p, sep="")))

get.tdm <- function(doc.vec) {
  doc.corpus <- Corpus(VectorSource(doc.vec))
  control <- list(stopwords=TRUE, removePunctuation=TRUE, removeNumbers=TRUE, minDocFreq=2)
  doc.dtm <- TermDocumentMatrix(doc.corpus, control)
  return(doc.dtm)
}
spam.tdm <- get.tdm(all.spam)

spam.matrix <- as.matrix(spam.tdm)
spam.counts <- rowSums(spam.matrix)
spam.df <- data.frame(cbind(names(spam.counts), as.numeric(spam.counts)), stringsAsFactors=FALSE)
names(spam.df) <- c("term","frequency")
spam.df$frequency <- as.numeric(spam.df$frequency)
spam.occurrence <- sapply(1:nrow(spam.matrix), function(i) {length(which(spam.matrix[i,] > 0))/ncol(spam.matrix)})
spam.density <- spam.df$frequency/sum(spam.df$frequency)
spam.df <- transform(spam.df, density=spam.density, occurrence=spam.occurrence)
# creating easyham.df
easyham.docs <- dir(easyham.path)
easyham.docs <- easyham.docs[which(easyham.docs!="cmds")]
all.easyham <- sapply(easyham.docs, function(p) get.msg(paste(easyham.path, p, sep="")))[1:500]

# get.tdm is the same function defined above
easyham.tdm <- get.tdm(all.easyham)

easyham.matrix <- as.matrix(easyham.tdm)
easyham.counts <- rowSums(easyham.matrix)
easyham.df <- data.frame(cbind(names(easyham.counts), as.numeric(easyham.counts)), stringsAsFactors=FALSE)
names(easyham.df) <- c("term","frequency")
easyham.df$frequency <- as.numeric(easyham.df$frequency)
# occurrence is the share of easy ham documents containing the term,
# so divide by ncol(easyham.matrix) rather than ncol(spam.matrix)
easyham.occurrence <- sapply(1:nrow(easyham.matrix), function(i) {length(which(easyham.matrix[i,] > 0))/ncol(easyham.matrix)})
easyham.density <- easyham.df$frequency/sum(easyham.df$frequency)
easyham.df <- transform(easyham.df, density=easyham.density, occurrence=easyham.occurrence)
# creating the classifier
classify.email <- function(path, training.df, prior=0.5, c=1e-6) {
  msg <- get.msg(path)
  msg.tdm <- get.tdm(msg)
  msg.freq <- rowSums(as.matrix(msg.tdm))
  # Find intersections of words
  msg.match <- intersect(names(msg.freq), training.df$term)
  if(length(msg.match) < 1) {
    return(prior*c^(length(msg.freq)))
  } else {
    match.probs <- training.df$occurrence[match(msg.match, training.df$term)]
    return(prior * prod(match.probs) * c^(length(msg.freq)-length(msg.match)))
  }
}

# Testing the classifier
hardham.docs <- dir(hardham.path)
hardham.docs <- hardham.docs[which(hardham.docs != "cmds")]
hardham.spamtest <- sapply(hardham.docs, function(p) classify.email(paste(hardham.path, p, sep=""), training.df=spam.df))
hardham.hamtest <- sapply(hardham.docs, function(p) classify.email(paste(hardham.path, p, sep=""), training.df=easyham.df))
hardham.res <- ifelse(hardham.spamtest > hardham.hamtest, TRUE, FALSE)
summary(hardham.res)
Use this code for chapter 3. It also creates easyham.df, for which the code is not given in the book, so you can use it as the complete code including the easy ham part. The encoding is changed from "latin1" to "native.enc". Also, a file in the spam folder is corrupted, which is causing the errors, so the better alternative is to delete that file and then run the code.
Delete this file: spam/00002.d94f1b97e48ed3b553b3508d116e6a09. Also, as written in the book, use only the first 500 sample mails from the easyham folder for better results.
I hope you find this solution good enough.