Parsing local files not working on Windows
Looks like there is a difference when visiting local files on Linux vs Windows. On Linux, the HTML file is visited fine using both FileTransport and the default transport. On Windows, the HTML file is visited using the default transport, but not using FileTransport.
Looks like a bug(?)

/stig

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"path/filepath"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func main() {
	dir, _ := os.Getwd()

	// Transport that serves file: URLs from the filesystem root.
	t := &http.Transport{}
	t.RegisterProtocol("file", http.NewFileTransport(http.Dir("/")))

	cFile := colly.NewCollector(colly.Debugger(&debug.LogDebugger{}))
	cHTTP := colly.NewCollector(colly.Debugger(&debug.LogDebugger{}))
	cFile.WithTransport(t)

	pages := []string{}
	cFile.OnHTML("h1", func(e *colly.HTMLElement) {
		pages = append(pages, e.Text)
	})
	cHTTP.OnHTML("h1", func(e *colly.HTMLElement) {
		pages = append(pages, e.Text)
	})

	// windows variant
	absoluteFileName := filepath.Join(dir, "\\index.html")
	// Linux variant
	// absoluteFileName := filepath.Join(dir, "/index.html")

	fmt.Println("Visiting local file")
	er := cFile.Visit("file:" + absoluteFileName)
	cFile.Wait()
	if er != nil {
		fmt.Println(er)
	}
	for i, p := range pages {
		fmt.Printf("%d : %s\n", i, p)
	}

	fmt.Println("Visiting http file")
	er = cHTTP.Visit("http://127.0.0.1:5500/index.html")
	cHTTP.Wait() // the original called cFile.Wait() here
	if er != nil {
		fmt.Println(er)
	}
	for i, p := range pages {
		fmt.Printf("%d : %s\n", i, p)
	}
}
```
RUNNING ON WINDOWS

```
Visiting local file
[000001] 1 [ 1 - request] map["url":"file:D:\Dropbox\colly_local_file_windows\local_files\index.html"] (999.7µs)
[000002] 1 [ 1 - response] map["url":"file:D:\Dropbox\colly_local_file_windows\local_files\index.html" "status":"OK"] (2.0004ms)
[000003] 1 [ 1 - scraped] map["url":"file:D:\Dropbox\colly_local_file_windows\local_files\index.html"] (2.999ms)
Visiting http file
[000001] 2 [ 1 - request] map["url":"http://127.0.0.1:5500/index.html"] (2.999ms)
[000002] 2 [ 1 - response] map["url":"http://127.0.0.1:5500/index.html" "status":"OK"] (8.1278ms)
[000003] 2 [ 1 - html] map["url":"http://127.0.0.1:5500/index.html" "selector":"h1"] (8.1278ms)
[000004] 2 [ 1 - scraped] map["url":"http://127.0.0.1:5500/index.html"] (8.1278ms)
0 : Index.html
```

RUNNING ON LINUX

```
Visiting local file
[000001] 1 [ 1 - request] map["url":"file:///home/stig/Dropbox/colly_local_file_windows/local_files/index.html"] (66.953µs)
[000002] 1 [ 1 - response] map["url":"file:///home/stig/Dropbox/colly_local_file_windows/local_files/index.html" "status":"OK"] (1.994095ms)
[000003] 1 [ 1 - html] map["selector":"h1" "url":"file:///home/stig/Dropbox/colly_local_file_windows/local_files/index.html"] (2.094419ms)
[000004] 1 [ 1 - scraped] map["url":"file:///home/stig/Dropbox/colly_local_file_windows/local_files/index.html"] (2.12162ms)
0 : Index.html
Visiting http file
[000001] 2 [ 1 - request] map["url":"http://127.0.0.1:5500/index.html"] (2.112984ms)
[000002] 2 [ 1 - response] map["url":"http://127.0.0.1:5500/index.html" "status":"OK"] (3.536972ms)
[000003] 2 [ 1 - html] map["selector":"h1" "url":"http://127.0.0.1:5500/index.html"] (3.632818ms)
[000004] 2 [ 1 - scraped] map["url":"http://127.0.0.1:5500/index.html"] (3.645623ms)
0 : Index.html
1 : Index.html
```
It seems I also ran into this. Any ideas what's happening here? I can work around it by using WSL, but it's annoying.