ML_for_Hackers

Chapter 3 - Error executing get.msg()

Open erwtokritos opened this issue 12 years ago • 28 comments

Hello guys,

Great book :-) Right now I am in the 3rd chapter (e-mail classification). I am executing the R commands one by one, and I am having a problem getting the list of spam documents (page 81). The command is:

all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path,p,sep="")))

and the error I get is:

Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.)

Any clue? Thank you very much

erwtokritos avatar Mar 28 '12 14:03 erwtokritos

I wish there was some way to upvote an issue. I'm having the exact same problem. I figured out that the problem seems to be with the "encoding" argument to the "file" function. If you remove it, it works, but the results you get are somewhat different from those in the book. Also, some weird tokens appear in the list of words found in the corpus. Someone also reported this problem at the Unconfirmed Errata page for the book at O'Reilly: http://oreilly.com/catalog/errataunconfirmed.csp?isbn=0636920018483

cesarblum avatar Apr 14 '12 03:04 cesarblum

Sorry about the lag on this, all. We'll look into it more this weekend and report back.

johnmyleswhite avatar Apr 14 '12 13:04 johnmyleswhite

I am having trouble replicating the error. The current version of the code in the repository reads as follows:

# Get all the SPAM-y email into a single vector
spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs != "cmds")]
all.spam <- sapply(spam.docs,
               function(p) get.msg(file.path(spam.path, p)))

It runs fine for me on OS X and Ubuntu. So perhaps the issue is the use of paste rather than file.path, or an operating system issue. The paste function does appear in the text of the book, which should be fixed in future editions.

drewconway avatar Apr 20 '12 19:04 drewconway

I still get the errors when using file.path. These are the errors I get:

Error in seq.default(which(text == "")[1] + 1, length(text), 1) :
  invalid (to - from)/by in seq(.)
In addition: Warning messages:
1: In readLines(con) :
  invalid input found on input connection 'data/spam//00006.5ab5620d3d7c6c0db76234556a16f6c1'
2: In readLines(con) :
  invalid input found on input connection 'data/spam//00009.027bf6e0b0c4ab34db3ce0ea4bf2edab'
3: In readLines(con) :
  invalid input found on input connection 'data/spam//00031.a78bb452b3a7376202b5e62a81530449'
4: In readLines(con) :
  incomplete final line found on 'data/spam//00031.a78bb452b3a7376202b5e62a81530449'
5: In readLines(con) :
  invalid input found on input connection 'data/spam//00035.7ce3307b56dd90453027a6630179282e'
6: In readLines(con) :
  incomplete final line found on 'data/spam//00035.7ce3307b56dd90453027a6630179282e'

The problem seems to be with the encoding argument of the file function called in get.msg. If I remove encoding="latin1", the code runs without errors, but the results are quite different from those presented in the book.

I'm working on OS X with R 2.15.0.
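
For anyone trying to narrow this down, here is a minimal sketch of how the seq() call inside get.msg can fail. It assumes (this is not confirmed anywhere in the thread) that readLines returned text with no empty line at all, or with the empty line as the last line. The name first.blank and the sample vectors are made up for illustration.

```r
# Hypothetical helper: index of the first empty line, as computed inside get.msg
first.blank <- function(text) which(text == "")[1]

ok        <- c("Subject: hi", "", "body line")
no.blank  <- c("Subject: hi", "body line")   # e.g. input cut short while reading
blank.end <- c("Subject: hi", "")            # header only, nothing after it

first.blank(ok)         # 2, so seq(3, 3, 1) is fine
first.blank(no.blank)   # NA, so seq(NA, 2, 1) raises an error
first.blank(blank.end)  # 2, so seq(3, 2, 1) would have to count down: error

# Record which inputs make the original seq() call fail
fails <- sapply(list(ok = ok, no.blank = no.blank, blank.end = blank.end),
                function(text) {
                  inherits(try(seq(first.blank(text) + 1, length(text), 1),
                               silent = TRUE), "try-error")
                })
```

If this is what is happening, running first.blank over the lines read from each file in the spam folder should point at the files that trigger the error.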

cesarblum avatar Apr 21 '12 13:04 cesarblum

What operating system and version of R are you using?

-- John


johnmyleswhite avatar Apr 21 '12 14:04 johnmyleswhite

I'm using OS X Lion with R 2.15.0 (installed from MacPorts).

cesarblum avatar Apr 21 '12 16:04 cesarblum

I also have this error.

hanfeisun avatar May 24 '12 21:05 hanfeisun

That's because of the data files, not the code. Open and check data/spam/000*: it is not an email but a file list.

foxet avatar Jul 22 '12 15:07 foxet

@foxet is right. The file '0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1' causes the problem. I amended the filter to exclude files which begin with '0000.' (str_detect comes from the stringr package, and the dot is escaped so that files such as 00001.* are not dropped as well):

spam.docs <- spam.docs[which( !str_detect(spam.docs, "^0000\\.") & spam.docs != 'cmds' )]

quasiben avatar Sep 03 '12 15:09 quasiben

It's an encoding problem. readLines should work whether or not the file is an email. Using con <- file(path, open="rt") instead of con <- file(path, open="rt", encoding="utf-8") will work.

adayone avatar Oct 26 '12 06:10 adayone

The encoding change does NOT seem to alter the behavior. I am running this on R 2.15.2 on Windows 7 x64. Here is my function:

get.msg <- function(path) {
  con <- file(path, open="rt", encoding="native.enc")
  text <- readLines(con)
  # The message always begins after the first full line break
  msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
  close(con)
  return(paste(msg, collapse="\n"))
}

I have changed the encoding to "utf-8" and "latin1", and nothing changes. Same error.

Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.)

I also applied the suggestions by foxet and quasiben. The fact is my spam folder does not have this file '0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1' at all.

What am I missing, folks?

ceekr avatar Nov 01 '12 02:11 ceekr

Do not set the "encoding" parameter; just use

con <- file(path, open="rt")


adayone avatar Nov 01 '12 05:11 adayone

Alright, I did this now: (Removed the encoding parameter)

get.msg <- function(path) {
  con <- file(path, open="rt")
  text <- readLines(con)
  # The message always begins after the first full line break
  msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
  close(con)
  return(paste(text, collapse="\n"))
}

Ran the whole bunch again. The outcome:

            Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.)

So, like I said earlier, the encoding parameter does not seem to have any effect. Again, I am running this on Windows 7 x64. And here is my whole bunch:

spam.path <- "datasets/spam/"
easyham.path <- "datasets/easy_ham/"
hardham.path <- "datasets/hard_ham/"

get.msg <- function(path) {
  con <- file(path, open="rt")
  text <- readLines(con)
  # The message always begins after the first full line break
  msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
  close(con)
  return(paste(text, collapse="\n"))
}

spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs!="cmds")]
spam.docs <- paste(spam.path, spam.docs, sep="")
# This is the line that throws the above error
all.spam.msgs <- sapply(spam.docs, get.msg)

ceekr avatar Nov 01 '12 15:11 ceekr

You should check whether length(text) > 1.
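
A rough sketch of what such a guard could look like. Note that it checks for the blank separator line rather than only length(text) > 1, which is a slightly stricter test than the one suggested above. The name get.msg.safe and the choice to return an empty string for files without a body are assumptions for illustration, not code from the book or the repository.

```r
# Sketch only: a guarded variant of get.msg (hypothetical name)
get.msg.safe <- function(path) {
  con <- file(path, open = "rt")
  text <- readLines(con)
  close(con)
  blank <- which(text == "")[1]
  # No blank line, or nothing after it: there is no message body to return
  if (is.na(blank) || blank >= length(text)) {
    return("")
  }
  paste(text[(blank + 1):length(text)], collapse = "\n")
}

# Temporary files standing in for a real message and a non-email file
good <- tempfile()
writeLines(c("Subject: hi", "", "line one", "line two"), good)
bad <- tempfile()
writeLines(c("not an email", "just a listing"), bad)

body.good <- get.msg.safe(good)
body.bad  <- get.msg.safe(bad)
```

Returning "" keeps sapply from stopping on the first odd file, at the cost of leaving empty documents in the result, which may need to be filtered out before building the term-document matrix.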

haoyuan hu

adayone avatar Nov 01 '12 15:11 adayone

Lovely, that works!! Thanks mon. One last question: I see (intermittently) the socket open warning:

             Warning message: closing unused connection 3 (datasets/spam/desktop.ini) 

I presume this is because the code failed to close all of the file connections? It does not happen every time, though.
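
One plausible cause (an assumption, not verified against this setup) is that the function exits before close(con) is reached, for example when a non-email file such as desktop.ini makes seq() fail. A minimal sketch using on.exit so the connection is closed on every exit path. The name get.msg.closing and the stand-in file contents are hypothetical.

```r
# Sketch: the get.msg body from this thread with on.exit() added
get.msg.closing <- function(path) {
  con <- file(path, open = "rt")
  on.exit(close(con))   # runs on normal return and when an error is raised
  text <- readLines(con)
  msg <- text[seq(which(text == "")[1] + 1, length(text), 1)]
  paste(msg, collapse = "\n")
}

# A file with no blank line, standing in for something like desktop.ini
not.email <- tempfile()
writeLines(c("[.ShellClassInfo]", "IconIndex=0"), not.email)

open.before <- nrow(showConnections())
result <- try(get.msg.closing(not.email), silent = TRUE)
open.after <- nrow(showConnections())
```

Even though the call fails, the number of open connections should be the same before and after, so nothing is left for R to clean up later with a warning.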

ceekr avatar Nov 01 '12 15:11 ceekr

Is there a permanent fix for this issue? I'm having the same problem. If I remove the encoding on the file(), then the get.msg function will work, but obviously you lose some encoding information.

Using Win 7 (64bit), RStudio 0.96.331, R 2.15.2

jamesbconner avatar Dec 02 '12 23:12 jamesbconner

Can confirm that I am seeing a similar issue to others above: `Error in seq.default(which(text == "")[1] + 1, length(text), 1) : wrong sign in 'by' argument`

Solved by dropping the encoding on con in get.msg. R 3.0.0 on Windows 7, 64 bit.

almartin82 avatar Apr 16 '13 03:04 almartin82

I have a problem with the following code:

get.msg <- function(path) {
  con <- file(path, open = "rt", encoding = "latin1")
  text <- readLines(con)
  # The message always begins after the first full line break
  msg <- text[seq(which(text == "")[1] + 1, length(text), 1)]
  close(con)
  return(paste(msg, collapse = "\n"))
}

What can I do? Please, somebody help me!!

y1239051 avatar Jun 23 '13 07:06 y1239051

I want to say that if I do not use the encoding parameter, it works, but when I run spam.tdm <- get.tdm(all.spam)

the error output is as follows: Error in tolower(txt) : invalid multibyte string 1

Has anyone had the same situation? Please help me!!

Thanks

y1239051 avatar Jun 23 '13 12:06 y1239051

I have the same issue as y1239051. My system is Win7, 32bit, R version 3.0.2, RStudio Version 0.98.490. However, it seems OK on my old XP system. Also, the command "spam.tdm <- get.tdm(all.spam)" took so long that I aborted it. I will try again.

Donnie-Liu avatar Feb 06 '14 03:02 Donnie-Liu

Oops! I tried the XP system again and got the same error!

Donnie-Liu avatar Feb 06 '14 07:02 Donnie-Liu

I found a solution following these steps:

  1. Remove "encoding='latin1'" in function get.msg()
  2. In function get.tdm(), add doc.corpus <- tm_map(doc.corpus, function(x) iconv(x, to='UTF-8', sub='byte')) before doc.dtm <- TermDocumentMatrix(doc.corpus, control)

With this, the program runs normally, but the results are a little different:

head(spam.df[with(spam.df,order(-occurrence)),])
        term frequency     density occurrence
7471   email       813 0.005859586      0.566
18382 please       425 0.003063129      0.508
14339   list       409 0.002947811      0.444
26848   will       828 0.005967697      0.422
2831    body       379 0.002731591      0.408
9124    free       539 0.003884769      0.390
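
A variation on step 2 that stays in base R: convert the character vector before it is handed to get.tdm, instead of transforming the corpus. This is only a sketch. It assumes the messages are Latin-1 (the book's encoding="latin1" argument suggests so, but not every file may be), and sanitize.text is a hypothetical helper name.

```r
# Hypothetical helper: re-encode text as UTF-8, assuming Latin-1 input.
# sub = "byte" replaces anything that cannot be converted with a <xx> escape.
sanitize.text <- function(x) {
  iconv(x, from = "latin1", to = "UTF-8", sub = "byte")
}

raw <- "caf\xe9"              # "café" with a single Latin-1 byte for the é
clean <- sanitize.text(raw)

valid <- validUTF8(clean)     # is the result well-formed UTF-8?
n.chars <- nchar(clean, type = "chars")
```

With the real data this would be used as all.spam <- sanitize.text(all.spam) before calling get.tdm(all.spam). Whether the term counts then match the book is not something this sketch can promise.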

Donnie-Liu avatar Feb 06 '14 09:02 Donnie-Liu

@y1239051 After I changed the function get.msg to {... con <- file(path, open = "rt") ...} and deleted the badly encoded words (just one sentence) in the file "00136.faa39d8e816c70f23b4bb8758d8a74f0", the command

all.spam <- sapply(spam.docs,
               function(p) get.msg(file.path(spam.path, p)))

works. But the following command, spam.tdm <- get.tdm(all.spam), hits the same problem: Error in .tolower(txt) : invalid multibyte string 1. How did you fix it? Thanks.

laocan avatar Mar 31 '14 15:03 laocan

For those of you who still have this problem, I'd suggest removing the "open" parameter from the file function. It worked for me on R 3.0.3, Win7 x64, and didn't break anything on R 3.1.1, Ubuntu 12.04.

jnjcc avatar Jul 29 '14 10:07 jnjcc

After I changed the encoding parameter to con <- file(path, open = "rt", encoding ="native.enc"), the program runs; however, it still shows the warning "incomplete final line found on 'data/spam/00136.faa39d8e816c70f23b4bb8758d8a74f0' " at the end. Does anyone know what's wrong here?
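
On the warning itself: as far as I know, readLines reports "incomplete final line" when a file does not end with a newline character, and the line is still read. A minimal sketch, using a temporary file as a stand-in for the spam message (the contents are made up):

```r
# A temporary file whose last line has no trailing newline
path <- tempfile()
cat("Subject: test\n\nbody without final newline", file = path)

# Default behaviour: the text is read, normally with a warning
# (suppressed here so the two results can be compared quietly)
lines.default <- suppressWarnings(readLines(path))

# warn = FALSE reads the same lines without emitting the warning
lines.quiet <- readLines(path, warn = FALSE)
```

If that is the only warning left, passing warn = FALSE to readLines inside get.msg should silence it without changing the text that is read.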

okamipride avatar Mar 10 '15 10:03 okamipride

Hi Donnie @Donnie-Liu,

I tested your solution; however, your change to get.tdm causes an error:

Error: inherits(doc, "TextDocument") is not TRUE

Could you paste the full text of your get.tdm definition?

bluesilence avatar Apr 22 '15 11:04 bluesilence

Same thing here, @okamipride. What is the solution to this warning?

IbrahimZamit avatar May 08 '16 16:05 IbrahimZamit

library(tm)
library(ggplot2)

# Defining paths
spam.path <- "data/spam/"
spam2.path <- "data/spam_2/"
easyham.path <- "data/easy_ham/"
easyham2.path <- "data/easy_ham_2/"
hardham.path <- "data/hard_ham/"
hardham2.path <- "data/hard_ham_2/"

# Creating the get.msg function
get.msg <- function(path) {
  con <- file(path, open="rt", encoding="native.enc")
  text <- readLines(con)
  # The message always begins after the first full line break
  msg <- text[seq(which(text=="")[1]+1, length(text), 1)]
  close(con)
  return(paste(msg, collapse="\n"))
}

# Creating the spam training dataset
spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs!="cmds")]
all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path, p, sep="")))

get.tdm <- function(doc.vec) {
  doc.corpus <- Corpus(VectorSource(doc.vec))
  control <- list(stopwords=TRUE, removePunctuation=TRUE,
                  removeNumbers=TRUE, minDocFreq=2)
  doc.dtm <- TermDocumentMatrix(doc.corpus, control)
  return(doc.dtm)
}
spam.tdm <- get.tdm(all.spam)

spam.matrix <- as.matrix(spam.tdm)
spam.counts <- rowSums(spam.matrix)
spam.df <- data.frame(cbind(names(spam.counts), as.numeric(spam.counts)),
                      stringsAsFactors=FALSE)
names(spam.df) <- c("term","frequency")
spam.df$frequency <- as.numeric(spam.df$frequency)
spam.occurrence <- sapply(1:nrow(spam.matrix),
                          function(i) {length(which(spam.matrix[i,] > 0))/ncol(spam.matrix)})
spam.density <- spam.df$frequency/sum(spam.df$frequency)
spam.df <- transform(spam.df, density=spam.density, occurrence=spam.occurrence)

# Creating easyham.df
easyham.docs <- dir(easyham.path)
easyham.docs <- easyham.docs[which(easyham.docs!="cmds")]
all.easyham <- sapply(easyham.docs, function(p) get.msg(paste(easyham.path, p, sep="")))[1:500]

# get.tdm is the same function already defined for the spam set above
easyham.tdm <- get.tdm(all.easyham)

easyham.matrix <- as.matrix(easyham.tdm)
easyham.counts <- rowSums(easyham.matrix)
easyham.df <- data.frame(cbind(names(easyham.counts), as.numeric(easyham.counts)),
                         stringsAsFactors=FALSE)
names(easyham.df) <- c("term","frequency")
easyham.df$frequency <- as.numeric(easyham.df$frequency)
# Divide by the number of easy ham documents, not the number of spam documents
easyham.occurrence <- sapply(1:nrow(easyham.matrix),
                             function(i) {length(which(easyham.matrix[i,] > 0))/ncol(easyham.matrix)})
easyham.density <- easyham.df$frequency/sum(easyham.df$frequency)
easyham.df <- transform(easyham.df, density=easyham.density, occurrence=easyham.occurrence)

# Creating the classifier
classify.email <- function(path, training.df, prior=0.5, c=1e-6) {
  msg <- get.msg(path)
  msg.tdm <- get.tdm(msg)
  msg.freq <- rowSums(as.matrix(msg.tdm))
  # Find intersections of words
  msg.match <- intersect(names(msg.freq), training.df$term)
  if(length(msg.match) < 1) {
    return(prior*c^(length(msg.freq)))
  } else {
    match.probs <- training.df$occurrence[match(msg.match, training.df$term)]
    return(prior * prod(match.probs) * c^(length(msg.freq)-length(msg.match)))
  }
}

# Testing the classifier
hardham.docs <- dir(hardham.path)
hardham.docs <- hardham.docs[which(hardham.docs != "cmds")]
hardham.spamtest <- sapply(hardham.docs,
                           function(p) classify.email(paste(hardham.path, p, sep=""), training.df=spam.df))
hardham.hamtest <- sapply(hardham.docs,
                          function(p) classify.email(paste(hardham.path, p, sep=""), training.df=easyham.df))
hardham.res <- ifelse(hardham.spamtest > hardham.hamtest, TRUE, FALSE)
summary(hardham.res)

Use this code for Chapter 3. It also includes code to create easyham.df, which is not given in the book, so you can use it as complete code. The encoding is changed from "latin1" to "native.enc". Also, a file in the spam folder is corrupted, which causes the errors, so the better alternative is to delete that file and then run the code.

Delete this file: spam/00002.d94f1b97e48ed3b553b3508d116e6a09. Also, as written in the book, use only the first 500 sample mails from the easyham folder for better results.

I hope you find this solution good enough.

divyanshofficials avatar Mar 12 '18 09:03 divyanshofficials