Parsing local files not working on Windows

Open: stigmelling opened this issue 7 years ago • 1 comment

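There seems to be a difference between Linux and Windows when visiting local files. On Linux, the HTML file is visited fine using both FileTransport and the default transport. On Windows, the HTML file is visited using the default transport, but not using FileTransport: the request returns status OK, but the `h1` callback never fires.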

Looks like a bug(?)

/stig

`
package main

import (
	"fmt"
	"net/http"
	"os"
	"path/filepath"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func main() {
	dir, _ := os.Getwd()

	// Transport that resolves file: URLs against the filesystem root.
	t := &http.Transport{}
	t.RegisterProtocol("file", http.NewFileTransport(http.Dir("/")))

	// One collector per transport, both with debug logging enabled.
	cFile := colly.NewCollector(colly.Debugger(&debug.LogDebugger{}))
	cHTTP := colly.NewCollector(colly.Debugger(&debug.LogDebugger{}))
	cFile.WithTransport(t)

	pages := []string{}

	cFile.OnHTML("h1", func(e *colly.HTMLElement) {
		pages = append(pages, e.Text)
	})

	cHTTP.OnHTML("h1", func(e *colly.HTMLElement) {
		pages = append(pages, e.Text)
	})

	// filepath.Join inserts the OS-specific separator, so a single call
	// covers both the Windows and the Linux variant.
	absoluteFileName := filepath.Join(dir, "index.html")

	fmt.Println("Visiting local file")
	err := cFile.Visit("file:" + absoluteFileName)
	cFile.Wait()
	if err != nil {
		fmt.Println(err)
	}
	for i, p := range pages {
		fmt.Printf("%d : %s\n", i, p)
	}

	fmt.Println("Visiting http file")
	err = cHTTP.Visit("http://127.0.0.1:5500/index.html")
	cHTTP.Wait() // wait on the collector that issued the HTTP request
	if err != nil {
		fmt.Println(err)
	}
	for i, p := range pages {
		fmt.Printf("%d : %s\n", i, p)
	}
}
`

RUNNING ON WINDOWS

Visiting local file
[000001] 1 [ 1 - request] map["url":"file:D:\Dropbox\colly_local_file_windows\local_files\index.html"] (999.7µs)
[000002] 1 [ 1 - response] map["url":"file:D:\Dropbox\colly_local_file_windows\local_files\index.html" "status":"OK"] (2.0004ms)
[000003] 1 [ 1 - scraped] map["url":"file:D:\Dropbox\colly_local_file_windows\local_files\index.html"] (2.999ms)
Visiting http file
[000001] 2 [ 1 - request] map["url":"http://127.0.0.1:5500/index.html"] (2.999ms)
[000002] 2 [ 1 - response] map["url":"http://127.0.0.1:5500/index.html" "status":"OK"] (8.1278ms)
[000003] 2 [ 1 - html] map["url":"http://127.0.0.1:5500/index.html" "selector":"h1"] (8.1278ms)
[000004] 2 [ 1 - scraped] map["url":"http://127.0.0.1:5500/index.html"] (8.1278ms)
0 : Index.html

RUNNING ON LINUX

Visiting local file
[000001] 1 [ 1 - request] map["url":"file:///home/stig/Dropbox/colly_local_file_windows/local_files/index.html"] (66.953µs)
[000002] 1 [ 1 - response] map["url":"file:///home/stig/Dropbox/colly_local_file_windows/local_files/index.html" "status":"OK"] (1.994095ms)
[000003] 1 [ 1 - html] map["selector":"h1" "url":"file:///home/stig/Dropbox/colly_local_file_windows/local_files/index.html"] (2.094419ms)
[000004] 1 [ 1 - scraped] map["url":"file:///home/stig/Dropbox/colly_local_file_windows/local_files/index.html"] (2.12162ms)
0 : Index.html
Visiting http file
[000001] 2 [ 1 - request] map["url":"http://127.0.0.1:5500/index.html"] (2.112984ms)
[000002] 2 [ 1 - response] map["url":"http://127.0.0.1:5500/index.html" "status":"OK"] (3.536972ms)
[000003] 2 [ 1 - html] map["selector":"h1" "url":"http://127.0.0.1:5500/index.html"] (3.632818ms)
[000004] 2 [ 1 - scraped] map["url":"http://127.0.0.1:5500/index.html"] (3.645623ms)
0 : Index.html
1 : Index.html

stigmelling, Dec 07 '18 12:12
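
One visible difference between the two logs is the URL itself: on Windows the request goes to `file:D:\Dropbox\...`, with backslashes and no leading slash, while on Linux it is `file:///home/...`. Nothing in this thread confirms that this is the cause, so the following is only a sketch under that assumption. It uses a hypothetical helper, `fileURL` (not part of colly), that turns an absolute path into a `file:///` URL with the standard library alone. It also assumes backslashes in the input are always path separators, which is true for Windows paths but not necessarily for Linux file names.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// fileURL is a hypothetical helper (not a colly API). It converts an
// absolute local path into a file:/// URL.
func fileURL(absPath string) string {
	// Use forward slashes regardless of which OS produced the path.
	p := strings.ReplaceAll(absPath, `\`, "/")
	// A Windows path such as D:/dir/file has no leading slash; add one so
	// the drive letter becomes part of the URL path.
	if !strings.HasPrefix(p, "/") {
		p = "/" + p
	}
	u := url.URL{Scheme: "file", Path: p}
	return u.String()
}

func main() {
	fmt.Println(fileURL(`D:\Dropbox\colly_local_file_windows\local_files\index.html`))
	fmt.Println(fileURL("/home/stig/Dropbox/colly_local_file_windows/local_files/index.html"))
}
```

The result could be passed to `cFile.Visit` in place of `"file:" + absoluteFileName`. Whether FileTransport with `http.Dir("/")` then resolves the drive letter correctly on Windows is untested here.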

It seems that I also ran into this. Any ideas what's happening here? I can work around it by using WSL, but it's annoying.

malaclypse, Dec 02 '21 10:12
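
Since the plain HTTP visit works on both platforms in the logs above, another way to avoid WSL might be to serve the directory from a throwaway local HTTP server and point the collector at that instead of a `file:` URL. Below is a minimal sketch using only the standard library. The `run` function, the temporary directory, and the sample file contents are made up for illustration, and colly is left out so the block stays self-contained.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
)

// run writes a sample index.html into a temporary directory, serves that
// directory over HTTP on a random local port, and fetches the page back.
func run() (string, error) {
	dir, err := os.MkdirTemp("", "colly-local")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	page := filepath.Join(dir, "index.html")
	if err := os.WriteFile(page, []byte("<h1>Index.html</h1>"), 0o644); err != nil {
		return "", err
	}

	// http.FileServer serves the directory contents; httptest.NewServer
	// picks a free port on 127.0.0.1.
	ts := httptest.NewServer(http.FileServer(http.Dir(dir)))
	defer ts.Close()

	// FileServer redirects /index.html to /, which the client follows.
	resp, err := http.Get(ts.URL + "/index.html")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := run()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(body)
}
```

With colly, a call such as `cHTTP.Visit(ts.URL + "/index.html")` would take the place of `http.Get`, and `http.Dir(dir)` would point at the real directory rather than a temporary one.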