crawler fails on second run (env CRAWLEE_PURGE_ON_START not used?)
Which package is this bug report for? If unsure which one to select, leave blank
No response
Issue description
I suspect that client.purge() is not run when Crawlee starts, as described in the docs. Setting CRAWLEE_PURGE_ON_START=true or false has no effect. All the files are still in key_value_stores/default/, e.g. SDK_CRAWLER_STATISTICS_12.json. I have set CRAWLEE_STORAGE_DIR="tmpfilesystem/crawlee", so it might be related.
If I delete the directory tmpfilesystem/crawlee and run the code below, it works just fine: the website is scraped and its title is displayed. The second time the code is run, it does not work. If I delete all the files and try again, it works.
This is the debug output from the first run:
DEBUG CheerioCrawler:SessionPool: No 'persistStateKeyValueStoreId' options specified, this session pool's data has been saved in the KeyValueStore with the id: deb916eb-2112-4a46-9e63-80c90cdccd1c
INFO CheerioCrawler: Starting the crawl
DEBUG CheerioCrawler:SessionPool: Created new Session - session_4ErkOlXe8x
INFO CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
DEBUG CheerioCrawler:SessionPool: Persisting state {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_0"}
DEBUG CheerioCrawler:SessionPool: Persisting state {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_0"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_0"}
INFO CheerioCrawler: Crawl finished. Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":1297,"requestsFinishedPerMinute":44,"requestsFailedPerMinute":0,"requestTotalDurationMillis":1297,"requestsTotal":1,"crawlerRuntimeMillis":1368}
crawlee test passed. can webscrape website: https://www.smartebyernorge.no Title=Smarte Byer Norge
This is the debug output from the second run:
DEBUG CheerioCrawler:SessionPool: No 'persistStateKeyValueStoreId' options specified, this session pool's data has been saved in the KeyValueStore with the id: deb916eb-2112-4a46-9e63-80c90cdccd1c
DEBUG CheerioCrawler:SessionPool: Recreating state from KeyValueStore {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG CheerioCrawler:SessionPool: 1 active sessions loaded from KeyValueStore
INFO CheerioCrawler: Starting the crawl
INFO CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
DEBUG CheerioCrawler:SessionPool: Persisting state {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_1"}
DEBUG CheerioCrawler:SessionPool: Persisting state {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_1"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_1"}
INFO CheerioCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":0,"retryHistogram":[],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":55}
Code sample
/*
.env
CRAWLEE_STORAGE_DIR="tmpfilesystem/crawlee"
CRAWLEE_MEMORY_MBYTES=2048
CRAWLEE_LOG_LEVEL=DEBUG
CRAWLEE_PURGE_ON_START=true
*/
import { CheerioCrawler } from 'crawlee';
async function crawlee_test() {
    let testResult = {
        testMessage: "",
        testPassed: false
    };
    let returnString = "crawlee test";
    let websiteListToCrawl = ["https://www.smartebyernorge.no"];
    const crawler = new CheerioCrawler({
        minConcurrency: 10,
        maxConcurrency: 50,
        // On error, retry each page at most once.
        maxRequestRetries: 1,
        // Increase the timeout for processing of each page.
        requestHandlerTimeoutSecs: 30,
        // Limit to 10 requests per one crawl
        maxRequestsPerCrawl: 10,
        async requestHandler({ request, $ }) {
            //console.log(`Processing ${request.url}...`);
            // Extract data from the page using cheerio.
            const title = $('title').text();
            //let pageH1 = $('h1').text().trim();
            //let pageP1 = $('p').text().trim();
            returnString = returnString + " passed. can webscrape website: " + request.url + " Title=" + title;
            testResult.testMessage = returnString;
            testResult.testPassed = true;
        },
        // This function is called if the page processing failed more than maxRequestRetries + 1 times.
        failedRequestHandler({ request }) {
            returnString = returnString + " Failed. can NOT webscrape website: " + request.url;
            testResult.testMessage = returnString;
            testResult.testPassed = false;
        },
    });
    // Run the crawler and wait for it to finish.
    await crawler.run(websiteListToCrawl);
    return testResult;
}
async function do_test() {
    let testResult = {
        testMessage: "",
        testPassed: false
    };
    testResult = await crawlee_test();
    // wait a minute before next test
    await new Promise(resolve => setTimeout(resolve, 60000));
    testResult = await crawlee_test();
}

do_test();
Package version
Node.js version
v16.17.0
Operating system
macOS
Apify platform
- [ ] Tick me if you encountered this issue on the Apify platform
Priority this issue should have
Medium (should be fixed soon)
I have tested this on the next release
No response
Other context
No response
I encountered the same problem: if I run it twice, the second call never reaches the requestHandler.
Then I realized that the requestQueue was probably cached.
So I tried adding await requestQueue.drop(); at the end to clear the cache, and it seems to work.
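For reference, a minimal sketch of that workaround, assuming crawlee 3.x (the default queue is opened via RequestQueue.open() and removed via drop(); the URL is reused from the report above):

```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://www.smartebyernorge.no']);

// Drop the default request queue so the next run does not see the URLs as already handled.
const requestQueue = await RequestQueue.open();
await requestQueue.drop();
```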
I have a related issue in crawlee 3.3.3. The Datasets indeed do not seem to be purged by CRAWLEE_PURGE_ON_START or purgeDefaultStorages(). I wrote this small test to confirm it:
it("shows that purgeDefaultstorage doesn't do anything?", async () => {
let crawler = new CheerioCrawler({
async requestHandler({}) {
await Dataset.pushData({item: "asdf"});
}
});
await crawler.run(["http://www.google.de"]);
await purgeDefaultStorages();
await crawler.run(["http://www.bing.de"]);
expect((await Dataset.getData()).count).to.be.eq(1);
});
So I believe it's indeed a bug and the purge commands don't seem to work. However, I believe @terchris is running into the problem that the requestQueue needs to be dropped, as @xialer pointed out.
It still seems to be an issue. Any updates?
Purging works as expected. The problem here is that this is a rather internal API that is not supposed to be used the way you are trying to use it: Crawlee purges the default storages automatically (in other words, CRAWLEE_PURGE_ON_START defaults to true), and that is supposed to happen only once, since the purgeDefaultStorages method is called multiple times from various places and we want it to execute only on the very first call.
I guess we could rework this a bit to support explicit purging too. For now you can try this:
const config = Configuration.getGlobalConfig();
const client = config.getStorageClient();
await client.purge?.();
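In context, the snippet is meant to be called between runs, roughly like this sketch (assuming crawlee 3.x and a fresh crawler instance per run; as the follow-up comments below show, results with this approach varied):

```ts
import { CheerioCrawler, Configuration } from 'crawlee';

async function runOnce(url: string) {
    // Build a fresh crawler for every run.
    const crawler = new CheerioCrawler({
        async requestHandler({ request, $ }) {
            console.log(`${request.url}: ${$('title').text()}`);
        },
    });
    await crawler.run([url]);

    // Explicitly purge the default storages before the next run reuses them.
    const config = Configuration.getGlobalConfig();
    const client = config.getStorageClient();
    await client.purge?.();
}

await runOnce('https://www.smartebyernorge.no');
await runOnce('https://www.smartebyernorge.no');
```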
I have the same problem reported here and have tried the solution you proposed, @B4nan, but no luck.
import { Configuration, EnqueueStrategy, PlaywrightCrawler, log } from 'crawlee';

const crawlPage = async (seedUrl: string, onDocument: (html: string) => void) => {
    const crawler = new PlaywrightCrawler({
        launchContext: { launchOptions: { headless: true } },
        maxRequestRetries: 1,
        requestHandlerTimeoutSecs: 20,
        maxRequestsPerCrawl: 20,
        async requestHandler({ request, page, enqueueLinks }) {
            try {
                const html = await page.evaluate(() => document.body.innerHTML);
                // Publish this html
                onDocument(html);
                // If the page is part of a seed, visit the links
                await enqueueLinks({ strategy: EnqueueStrategy.SameHostname });
            } catch (err) {
                log.warn('Error processing url: ' + request.url);
            }
        },
    });

    await crawler.addRequests([seedUrl]);
    await crawler.run();

    try {
        const config = Configuration.getGlobalConfig();
        const client = config.getStorageClient();
        await client.purge?.();
    } catch (err) {
        log.warn('Failed to purge storage client');
    }
};
Running this fails because the second onDocument is never called. The page was already crawled.
test("crawl multiple URLs", async () => {
const onDocument = jest.fn();
await crawlPage("https://moveo.ai", onDocument);
expect(onDocument).toHaveBeenCalled();
const onDocumentSecond = jest.fn();
await crawlPage("https://moveo.ai", onDocument);
expect(onDocumentSecond).toHaveBeenCalled();
});
I actually get an error where purge() is trying to delete a file.
Could not find file at /storage/key_value_stores/default/SDK_SESSION_POOL_STATE.json
> Running this fails because the second onDocument is never called. The page was already crawled.
That test seems to be wrong: you are not passing onDocumentSecond as the second argument, so it won't be called. It should be this instead:
test("crawl multiple URLs", async () => {
const onDocument = jest.fn();
await crawlPage("https://moveo.ai/", onDocument);
expect(onDocument).toHaveBeenCalled();
const onDocumentSecond = jest.fn();
await crawlPage("https://moveo.ai/", onDocumentSecond); // <--
expect(onDocumentSecond).toHaveBeenCalled();
});
> I actually get an error where purge() is trying to delete a file.
And you get that from some actual usage, or in a test case? Are you calling that method in parallel?
UPDATE: @B4nan, you are right; in my attempt to clean up the code to paste here, I made a typo. I actually check onDocumentSecond in my second expect().
The first time the method runs, it finds multiple pages, so onDocument is called around 22 times (maxRequestsPerCrawl=20).
The second time, the new mocked function onDocumentSecond isn't called because some state from the first run is stored somewhere, possibly in a variable within a module. If we had a teardown(), purge, or similar method to clean up the entire state, I believe this code would function properly.
I've tried various alternatives that I found in several issues similar to this one. I'm currently documenting them and planning to initiate a discussion with my findings.
Are you sure you are using the latest version? The run method itself is already doing the necessary cleanup:
https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L648-L655
Yeah. I'm using 3.4.0
I'm having the same problem in 3.4.2
For future reference: I solved the problem using the persistStorage: false configuration. You need to set it each time you instantiate a PlaywrightCrawler instance.
const crawlPage = async (seedUrl: string, onDocument: (html: string) => void) => {
    const crawler = new PlaywrightCrawler({
        launchContext: { launchOptions: { headless: true } },
        maxRequestRetries: 1,
        requestHandlerTimeoutSecs: 20,
        maxRequestsPerCrawl: 20,
        async requestHandler({ request, page, enqueueLinks }) {
            try {
                const html = await page.evaluate(() => document.body.innerHTML);
                // Publish this html
                onDocument(html);
                // If the page is part of a seed, visit the links
                await enqueueLinks({ strategy: EnqueueStrategy.SameHostname });
            } catch (err) {
                log.warn('Error processing url: ' + request.url);
            }
        },
    }, new Configuration({ persistStorage: false })); // <---- Configuration

    await crawler.addRequests([seedUrl]);
    await crawler.run();
};
it("shows that purgeDefaultstorage doesn't do anything?", async () => {
let crawler = new CheerioCrawler({
async requestHandler({}) {
await Dataset.pushData({item: "asdf"});
}
},
new Configuration({persistStorage: false})
);
await crawler.run(["http://www.google.de"]);
await purgeDefaultStorages();
await crawler.run(["http://www.google.de"]);
expect((await Dataset.getData()).count).to.be.eq(1);
});
I tried this, @germanattanasio, but this unit test still fails. Where am I wrong?
> expect((await Dataset.getData()).count).to.be.eq(1);
This call will use global config, therefore the same storage. You have three options:

- instead of using a local config instance, modify the global one via `Configuration.set()` (or use env vars, namely https://crawlee.dev/docs/guides/configuration#crawlee_purge_on_start)
- create the dataset instance that respects your local config via `Dataset.open(null, { config })` and call `getData` on that
- use `crawler.getData()`, which respects the config you pass to the crawler
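For example, a sketch of the second and third options, assuming crawlee 3.x, a crawler constructed with a local Configuration (as in the snippets above), and the pushData context helper so the handler writes to the crawler's own dataset:

```ts
import { CheerioCrawler, Configuration, Dataset } from 'crawlee';

const config = new Configuration({ persistStorage: false });

const crawler = new CheerioCrawler({
    async requestHandler({ pushData }) {
        // Writes to the dataset that belongs to this crawler's (local) config.
        await pushData({ item: 'asdf' });
    },
}, config);

await crawler.run(['http://www.google.de']);

// Option 2: open the default dataset against the same local config and read from it.
const dataset = await Dataset.open(null, { config });
console.log((await dataset.getData()).count);

// Option 3: the crawler helper, which respects the config passed to the crawler.
console.log((await crawler.getData()).count);
```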