Robots.txt now states Disallow, but after indexing OSS still returns the URLs as results
Hi,
we're adjusting the robots.txt on our sites so that only the needed parts get indexed. We've put part of the site under Disallow, and after that we set the whole host to "select all url to fetch first". The crawler has since done its job, and the disallowed URLs now have this status: Robots: Disallow, Fetch: Not allowed, Parsing: Parsed, Index: Indexed.
And they are still returned as results. Is this a bug, or how do I remove disallowed URLs from the index?
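For context, the new rule looks roughly like this (the path is just a placeholder for the part of the site we disallowed):

```
User-agent: *
Disallow: /private/
```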
After playing with the URL Browser I encountered a new problem.
Now the Render result contains only that one URL. All the others that were indexed are gone. I can see this in the host facet as well, because it lists only one URL.
Since everything was going downhill, I had the idea to just delete everything and reindex from scratch. Under Runtime -> Commands I truncated all documents, and under Crawler -> Web -> URL browser I deleted all URLs.
I then thought I could just start the crawler and it would begin from the beginning, but it just keeps running as if there were nothing to index.
Can you please help me figure out what to do? It would not be great to have to reconfigure everything from scratch. One more thing to mention: if I click the index info on the first page, it still reports a full size of 1.1 GB.
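In case it helps to reproduce this, here is the truncate step expressed over the REST API. This is only a sketch: I am assuming the "delete documents by query" endpoint of the 1.5 REST API, and the host, port, and index name are placeholders you would need to adjust:

```python
import requests

OSS_BASE = "http://localhost:9090"  # assumed default OSS host/port
INDEX = "my_web_index"              # placeholder index name

# Delete every document, roughly what Runtime -> Commands -> truncate does.
# Add login/key parameters if your instance has authentication enabled.
resp = requests.delete(
    f"{OSS_BASE}/services/rest/index/{INDEX}/documents",
    params={"query": "*:*"},
)
print(resp.status_code, resp.text)
```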
Thanks, Mojster
How's the progress on this issue? Have you got any suggestions on how to avoid this? It is quite a showstopper.
Solved one part of this.
So after deleting all the data from the index and the URL database, the crawler would not start fresh again. Today I found out that I had to remove all the URLs from the 'Pattern list' and put them back again; now the crawler is working. I would mark this as a bug, but there is a simple fix once you know about it.
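For anyone hitting the same thing, a scripted version of that fix might look like this. I am assuming the inclusion-pattern endpoint of the REST API and a replace parameter that behaves like deleting and re-adding the list in the UI; host, index name, and pattern are placeholders:

```python
import requests

OSS_BASE = "http://localhost:9090"       # assumed default OSS host/port
INDEX = "my_web_index"                   # placeholder index name
PATTERNS = ["http://www.example.com/*"]  # your original pattern list

# Rewrite the whole inclusion pattern list in one call, which should
# re-seed the crawler the same way removing and re-adding the patterns
# in the 'Pattern list' screen did for me.
resp = requests.put(
    f"{OSS_BASE}/services/rest/index/{INDEX}/crawler/web/patterns/inclusion",
    params={"replace": "true"},
    json=PATTERNS,
)
print(resp.status_code, resp.text)
```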
But the main bug still persists (URLs marked as deleted still exist in the index), and for that one there is no quick workaround.