
task stuck in status “pending”. How to troubleshoot?

Open SyllaJay opened this issue 10 months ago • 3 comments

{ "status": "pending", "created_at": 1733664419.2874162 }

SyllaJay avatar Dec 28 '24 13:12 SyllaJay

@SyllaJay I assume you are using Docker. Could you explain your code samples and everything so I can replicate them? It's difficult for me to provide any ideas right now.

unclecode avatar Dec 30 '24 11:12 unclecode

Yes, I'm using Docker (image unclecode/crawl4ai:all-amd64). Below is my code, which performs a simple HTTP request.

func GetCrawlTaskID(url string) (string, error) {
if len(url) == 0 {
	return "", errors.New("url is empty")
}
c, err := client.NewClient(
	client.WithClientReadTimeout(3 * 60 * time.Second),
)
if err != nil {
	return "", err
}

req := &protocol.Request{}
res := &protocol.Response{}
req.Reset()
req.SetMethod(consts.MethodPost)
req.Header.SetContentTypeBytes([]byte("application/json"))
req.Header.Set("Authorization", AuthorizationToken)
req.SetRequestURI(crawlServer)

reqBodyData := struct {
	Urls                 string `json:"urls,omitempty"`
	Priority             int32  `json:"priority,omitempty"`
	WordCountThreshold   int32  `json:"word_count_threshold,omitempty"`
	ExcludeExternalLinks string `json:"exclude_external_links,omitempty"`
	OnlyText             string `json:"only_text,omitempty"`
	Magic                string `json:"magic,omitempty"`
}{
	Urls:                 url,
	Priority:             10,
	WordCountThreshold:   100,
	ExcludeExternalLinks: "true",
	OnlyText:             "true",
	Magic:                "true",
}
reqBodyJsonByte, _ := json.Marshal(reqBodyData)
req.SetBody(reqBodyJsonByte)
err = c.Do(context.Background(), req, res)
if err != nil {
	return "", err
}

type responseBody struct {
	Detail string `json:"detail,omitempty"`
	TaskID string `json:"task_id,omitempty"`
}
var resBody responseBody
err = json.Unmarshal(res.Body(), &resBody)
if err != nil {
	return "", err
}
if resBody.Detail != "" {
	return "", fmt.Errorf("GetCrawlTaskID Failed! Detail=%s", resBody.Detail)
}
return resBody.TaskID, nil

}

func GetCrawlContentByTaskID(taskID string) (string, error) {
if len(taskID) == 0 {
	return "", errors.New("taskID is empty")
}
c, err := client.NewClient(
	client.WithClientReadTimeout(3 * 60 * time.Second),
)
if err != nil {
	return "", err
}

req := &protocol.Request{}
res := &protocol.Response{}
req.Reset()
req.SetMethod(consts.MethodGet)
//req.Header.SetContentTypeBytes([]byte("application/json"))
req.Header.Set("Authorization", AuthorizationToken)
req.SetRequestURI(crawlServerTask + "/" + taskID)
err = c.Do(context.Background(), req, res)
if err != nil {
	return "", err
}

type responseBody struct {
	Detail    string  `json:"detail,omitempty"` // "detail": "Task not found"
	Status    string  `json:"status,omitempty"`
	CreatedAt float64 `json:"created_at,omitempty"`
	Result    struct {
		URL          string `json:"url,omitempty"`
		Success      bool   `json:"success,omitempty"`
		Html         string `json:"html,omitempty"`
		CleanedHtml  string `json:"cleaned_html,omitempty"`
		Markdown     string `json:"markdown,omitempty"`
		ErrorMessage string `json:"error_message,omitempty"`
		StatusCode   int    `json:"status_code,omitempty"`
	} `json:"result,omitempty"`
}
var resBody responseBody
err = json.Unmarshal(res.Body(), &resBody)
if err != nil {
	return "", err
}
if resBody.Detail != "" {
	return "", fmt.Errorf("GetCrawlContentByTaskID Failed! Detail=%s", resBody.Detail)
}
if resBody.Status != "completed" {
	return "", fmt.Errorf("GetCrawlContentByTaskID Failed! Status=%s", resBody.Status)
}
markdown := resBody.Result.Markdown
if markdown == "" {
	return "", fmt.Errorf("GetCrawlContentByTaskID Failed!  markdown content is empty,resBody =%s", string(res.Body()))
}
return markdown, nil

}

SyllaJay avatar Jan 12 '25 08:01 SyllaJay

@SyllaJay I don't have much experience with Go. I wish it were Rust, but give me a few days to figure out why that happened. Right now, we don't see that issue from most people, who use other languages. I will try to replicate what you are doing and determine why it keeps returning pending to you.

unclecode avatar Jan 13 '25 12:01 unclecode

@unclecode Thank you for your attention. It's just simple HTTP request code written in Go. I guess the code is not the key problem; it's probably the Docker server. The issue occurs when performing HTTP requests, especially when making many requests within a short period.

SyllaJay avatar Jan 18 '25 07:01 SyllaJay

@SyllaJay You're welcome. I suggest waiting a week because I will release a new version. It includes many optimizations specifically for crawling multiple URLs simultaneously and some changes in the documentation. We will release the new version by Monday, and after a week, we will update the documentation. Then you can try again with that version. I'll close the issue, but you are welcome to continue the discussion.

unclecode avatar Jan 19 '25 10:01 unclecode

@unclecode I encountered this problem again after the crawl server had been running for about 3 days. The crawl server was deployed with Docker (the code is from the main branch, and the latest Git log entry is 'Merge branch 'vr0.4.3b3''). I reviewed the crawl server logs but found no clues. How can I troubleshoot this more deeply? Are there any additional logs I can print out?

SyllaJay avatar Feb 22 '25 12:02 SyllaJay

@SyllaJay Next week we are dropping 0.5.x, and the Docker setup has changed 100% to a new approach. So I suggest waiting a bit and trying the new one, which I am actively monitoring.

unclecode avatar Feb 23 '25 04:02 unclecode

@unclecode I will try the new 0.5.x version and am looking forward to it.

SyllaJay avatar Feb 23 '25 12:02 SyllaJay

@unclecode I got some error logs; I hope they are helpful.

crawl4ai-crawl4ai-amd64.log

SyllaJay avatar Feb 25 '25 03:02 SyllaJay

I’ve run into the same problem in Docker: a container uses excessive CPU (600-800%), and restarting it is the only fix.

d0rc avatar Mar 23 '25 11:03 d0rc

@d0rc @SyllaJay Is this with the new Docker setup?

aravindkarnam avatar Mar 25 '25 08:03 aravindkarnam