crawl4ai
task stuck in status “pending”. How to troubleshoot?
```json
{ "status": "pending", "created_at": 1733664419.2874162 }
```
@SyllaJay I assume you are using Docker. Could you share your code samples and setup so I can replicate them? It's difficult for me to offer any ideas right now without them.
Yes, I'm using Docker (image `unclecode/crawl4ai:all-amd64`). Below is my code, which performs a simple HTTP request.
```go
func GetCrawlTaskID(url string) (string, error) {
    if len(url) == 0 {
        return "", errors.New("url is empty")
    }
    c, err := client.NewClient(
        client.WithClientReadTimeout(3 * 60 * time.Second),
    )
    if err != nil {
        return "", err
    }

    req := &protocol.Request{}
    res := &protocol.Response{}
    req.Reset()
    req.SetMethod(consts.MethodPost)
    req.Header.SetContentTypeBytes([]byte("application/json"))
    req.Header.Set("Authorization", AuthorizationToken)
    req.SetRequestURI(crawlServer)

    reqBodyData := struct {
        Urls                 string `json:"urls,omitempty"`
        Priority             int32  `json:"priority,omitempty"`
        WordCountThreshold   int32  `json:"word_count_threshold,omitempty"`
        ExcludeExternalLinks string `json:"exclude_external_links,omitempty"`
        OnlyText             string `json:"only_text,omitempty"`
        Magic                string `json:"magic,omitempty"`
    }{
        Urls:                 url,
        Priority:             10,
        WordCountThreshold:   100,
        ExcludeExternalLinks: "true",
        OnlyText:             "true",
        Magic:                "true",
    }
    reqBodyJSON, err := json.Marshal(reqBodyData)
    if err != nil {
        return "", err
    }
    req.SetBody(reqBodyJSON)

    err = c.Do(context.Background(), req, res)
    if err != nil {
        return "", err
    }

    type responseBody struct {
        Detail string `json:"detail,omitempty"`
        TaskID string `json:"task_id,omitempty"`
    }
    var resBody responseBody
    err = json.Unmarshal(res.Body(), &resBody)
    if err != nil {
        return "", err
    }
    if resBody.Detail != "" {
        return "", fmt.Errorf("GetCrawlTaskID Failed! Detail=%s", resBody.Detail)
    }
    return resBody.TaskID, nil
}
```
```go
func GetCrawlContentByTaskID(taskID string) (string, error) {
    if len(taskID) == 0 {
        return "", errors.New("taskID is empty")
    }
    c, err := client.NewClient(
        client.WithClientReadTimeout(3 * 60 * time.Second),
    )
    if err != nil {
        return "", err
    }

    req := &protocol.Request{}
    res := &protocol.Response{}
    req.Reset()
    req.SetMethod(consts.MethodGet)
    // req.Header.SetContentTypeBytes([]byte("application/json"))
    req.Header.Set("Authorization", AuthorizationToken)
    req.SetRequestURI(crawlServerTask + "/" + taskID)

    err = c.Do(context.Background(), req, res)
    if err != nil {
        return "", err
    }

    type responseBody struct {
        Detail    string  `json:"detail,omitempty"` // e.g. "detail": "Task not found"
        Status    string  `json:"status,omitempty"`
        CreatedAt float64 `json:"created_at,omitempty"`
        Result    struct {
            URL          string `json:"url,omitempty"`
            Success      bool   `json:"success,omitempty"`
            Html         string `json:"html,omitempty"`
            CleanedHtml  string `json:"cleaned_html,omitempty"`
            Markdown     string `json:"markdown,omitempty"`
            ErrorMessage string `json:"error_message,omitempty"`
            StatusCode   int    `json:"status_code,omitempty"`
        } `json:"result,omitempty"`
    }
    var resBody responseBody
    err = json.Unmarshal(res.Body(), &resBody)
    if err != nil {
        return "", err
    }
    if resBody.Detail != "" {
        return "", fmt.Errorf("GetCrawlContentByTaskID Failed! Detail=%s", resBody.Detail)
    }
    if resBody.Status != "completed" {
        return "", fmt.Errorf("GetCrawlContentByTaskID Failed! Status=%s", resBody.Status)
    }
    markdown := resBody.Result.Markdown
    if markdown == "" {
        return "", fmt.Errorf("GetCrawlContentByTaskID Failed! markdown content is empty, resBody=%s", string(res.Body()))
    }
    return markdown, nil
}
```
@SyllaJay I don't have much experience with Go. I wish it were Rust, but give me a few days to figure out why that happened. Right now we aren't seeing this issue from users of other languages. I will try to replicate what you are doing and determine why it keeps returning "pending" for you.
@unclecode Thank you for your attention. It's just simple HTTP request code written in Go. I guess the code is not the key problem; it's probably the Docker server. The issue occurs when performing HTTP requests, especially when making many requests within a short period.
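If bursts of submissions are the trigger, one client-side mitigation is to cap the number of in-flight requests with a semaphore. A hedged sketch, independent of any crawl4ai API (`submitAll`, `maxInFlight`, and the stub `submit` function are all hypothetical names; `submit` would wrap a call like `GetCrawlTaskID`):

```go
package main

import (
	"fmt"
	"sync"
)

// submitAll fans out submit calls concurrently but caps in-flight requests
// at maxInFlight, so a burst of URLs doesn't hit the crawl server all at once.
func submitAll(urls []string, maxInFlight int, submit func(url string) error) []error {
	sem := make(chan struct{}, maxInFlight) // counting semaphore
	errs := make([]error, len(urls))
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks when maxInFlight reached
		go func(i int, u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			errs[i] = submit(u)
		}(i, u)
	}
	wg.Wait()
	return errs
}

func main() {
	urls := []string{"https://a.example", "https://b.example", "https://c.example"}
	// Stub submitter; a real one would POST to the crawl endpoint.
	errs := submitAll(urls, 2, func(u string) error { return nil })
	fmt.Println(len(errs))
}
```

This doesn't fix a server-side backlog, but it makes the client's load on the container predictable while the root cause is investigated.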
@SyllaJay You're welcome. I suggest waiting a week, because I will release a new version. It includes many optimizations specifically for crawling multiple URLs simultaneously, plus some documentation changes. We will release the new version by Monday, and after a week we will update the documentation. Then you can try that version. I'm closing the issue, but you are welcome to continue the discussion here.
@unclecode I encountered this problem again after the crawl server had been running for about 3 days. The crawl server was deployed with Docker (the code is from the main branch, and the latest Git log entry is 'Merge branch 'vr0.4.3b3''). I reviewed the crawl server logs but found no clues. How can I troubleshoot this more deeply? Are there any additional logs I can print out?
@SyllaJay Next week we are dropping 0.5.x, and the Docker setup has changed completely to a new approach. So I suggest waiting a bit and trying the new one, which I am actively monitoring.
@unclecode I will try the new 0.5.x version and am looking forward to it.
I've run into the same problem in Docker: a container uses excessive CPU (600-800%), and restarting it is the only fix.
@d0rc @SyllaJay Is this with the new Docker setup?