Example configuration complains about invalid JSON

Open yankeeinlondon opened this issue 3 years ago • 16 comments

I've installed the latest Docker image locally. To start, I copied and pasted the example JSON from your docs and changed a few properties. The JSON is 100% valid, and yet the container exits with the following error:

image

Here is the JSON that was used:

{
  "index_uid": "rust-api",
  "start_urls": ["https://docs.rs/tauri/latest/tauri"],
  "sitemap_urls": [],
  "selectors": {
    "lvl0": {
      "selector": "h1",
      "global": true,
      "default_value": "Title"
    },
    "lvl1": {
      "selector": "h2",
      "global": true,
      "default_value": "Section"
    },
    "lvl2": "[title=tauri::*]",
    "lvl3": ".docblock-short",
    "lvl4": ".theme-default-content h4",
    "lvl5": ".theme-default-content h5",
    "lvl6": "null",
    "text": "#main"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true,
  "custom_settings": {
    "synonyms": {
      "relevancy": ["relevant", "relevance"],
      "relevant": ["relevancy", "relevance"],
      "relevance": ["relevancy", "relevant"]
    }
  }
}

yankeeinlondon avatar Jan 19 '22 01:01 yankeeinlondon

In your docker run, do you have -v config.json:/docs-scraper/config.json? If yes, try -v /path/to/config.json:/docs-scraper/config.json instead.

If that works, you will probably get this error: raise SelectorSyntaxError(cssselect.parser.SelectorSyntaxError: Expected ']', got <DELIM ':' at 12>. It does not like this part of the config: [title=tauri::*].
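To illustrate the selector point, here is a minimal sketch, assuming the cssselect package named in the traceback is installed. The helper name selector_error is hypothetical, and the rewritten selector [title^="tauri::"] is only a guess at the intent (match titles that start with tauri::); neither comes from the docs-scraper docs.

```python
from typing import Optional


def selector_error(selector: str) -> Optional[str]:
    """Return cssselect's complaint about a selector, or None if it parses."""
    # Imported lazily so the helper can be defined even where cssselect is absent.
    from cssselect import SelectorSyntaxError, parse

    try:
        parse(selector)
    except SelectorSyntaxError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # The original selector, a quoted prefix-match rewrite, and the "null" string.
    for candidate in ('[title=tauri::*]', '[title^="tauri::"]', 'null'):
        print(f"{candidate!r}: {selector_error(candidate) or 'parses'}")
```

An unquoted attribute value has to be a plain identifier, which is why the colons trip the parser; quoting the value avoids that. The string "null" does parse, but as a type selector for an element named null, which is presumably not what was meant.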

sanders41 avatar Jan 19 '22 03:01 sanders41

but that is a valid DOM selector, right? I tried replacing that line with just "null" and I still get the error.

yankeeinlondon avatar Jan 19 '22 03:01 yankeeinlondon

I haven’t had a chance to look closely at the parser error yet, but I think it does not allow pseudo elements. Here are the docs for what I think the reason is. This is from a quick scan of the code in this section so I could be wrong about the reason.

sanders41 avatar Jan 19 '22 04:01 sanders41

when you say pseudo elements I presume you're referring to the wildcard character and not the standard syntax for an attribute selector? Also, bear in mind I get the same error with this line removed so while this may be another issue it is not the one causing this current situation.

yankeeinlondon avatar Jan 19 '22 15:01 yankeeinlondon

when you say pseudo elements I presume you're referring to the wildcard character and not the standard syntax for an attribute selector?

Correct, this is what I am thinking.

When you say "same error", are you referring to the ValueError or the SelectorSyntaxError? If the latter, you just changed tauri::* in the config to null, correct? If that is the case, my first thought, without having tested it yet, is that it is trying to parse null as an element. Not sure if that is it, but it would at least be a place to start looking.

sanders41 avatar Jan 19 '22 15:01 sanders41

I get the same value error with or without the tauri::* line being in the configuration. So if the wildcard is an error we've not seen it yet as something else is blocking (or at least that's my read).

yankeeinlondon avatar Jan 19 '22 21:01 yankeeinlondon

With the ValueError, what does your docker run look like? You will see the ValueError with docker run -v config.json:/docs-scraper/config.json (I excluded all arguments but the -v), but docker run -v /full/path/to/config.json:/docs-scraper/config.json should fix this error.
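One way to tell a mount problem from a genuine JSON problem is to check what is actually at the config path. Below is a minimal sketch using only the Python standard library; the function name diagnose_config is hypothetical and not part of docs-scraper. It assumes that Docker treats a host side of -v that is not an absolute path as a volume name, so the target shows up inside the container as a directory rather than a file.

```python
import json
from pathlib import Path


def diagnose_config(path: str) -> str:
    """Describe what is found at a config path: directory, missing, invalid JSON, or ok."""
    p = Path(path)
    if p.is_dir():
        # Typical symptom of a mount whose host side was not an absolute file path.
        return "directory: the mount did not map a file here"
    if not p.is_file():
        return "missing: nothing exists at this path"
    try:
        json.loads(p.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc}"
    return "ok"


if __name__ == "__main__":
    # Path taken from the thread; adjust it to wherever the config is mounted.
    print(diagnose_config("/docs-scraper/config.json"))
```

Run inside the container, a "directory" result would point at the volume mapping rather than at the JSON itself.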

sanders41 avatar Jan 19 '22 21:01 sanders41

I'm going to give this some more attention today ... will let you know what I find but considering the RUST AST I'm using is considered unstable I really would like to be able to get this scraper to work for me.

yankeeinlondon avatar Feb 03 '22 19:02 yankeeinlondon

image

I use Docker Compose to boot up both the scraper and Meilisearch. As you can see above the volume is mapped to my local file system in the scraper directory.

In the root of the monorepo I have both the compose file and as you can see the directory structure I thought was needed based on the docs:

image

yankeeinlondon avatar Feb 03 '22 19:02 yankeeinlondon

based on your comments above the first clear problem is that I had thought config.json was a second parameter to the run command ... but in fact there's only one command and no params.

yankeeinlondon avatar Feb 03 '22 19:02 yankeeinlondon

running the command, however, leads to:

Error: the command ./docs_scraper/config.json could not be found within PATH or Pipfile's [scripts].

yankeeinlondon avatar Feb 03 '22 19:02 yankeeinlondon

you'd think this kind of error message would lead to a simple solution but honestly I can't get it to work regardless of where I put the config file and/or adjust the run command. I'm not super up-to-speed on all things Docker so maybe I'm doing something obvious and dumb.

yankeeinlondon avatar Feb 03 '22 20:02 yankeeinlondon

I think it is your volume path. If you change the volume from ./scraper:/data.ms to ./docs_scraper/config.json:/docs-scraper/config.json does it make a difference?

sanders41 avatar Feb 04 '22 00:02 sanders41

Sadly no, that's where I started. I know this is a stretch, but would you consider a quick pairing session to go over this? Totally understand if you don't have time for that.

yankeeinlondon avatar Feb 07 '22 19:02 yankeeinlondon

@bidoubiwa has much more experience than I do with the day-to-day use; I've really just worked in the code base and know how to get it running from that. That said, I'm happy to try to help if I can. Are you on Meilisearch's Slack? I am Paul over there (there are a few Pauls, I'm the one with just Paul and no last name) if you want to send me a message to try to set something up.

sanders41 avatar Feb 08 '22 02:02 sanders41

i rarely boot up Slack anymore ... but yes I am connected it would seem. I'll say hello over there.

yankeeinlondon avatar Feb 08 '22 16:02 yankeeinlondon

Hi @yankeeinlondon, I have taken over this repo and am checking in to see whether this problem was solved. Do you still need help?

alallema avatar Sep 27 '22 09:09 alallema

Since there has been no news for a while, I am closing this issue. Do not hesitate to reopen it if necessary.

alallema avatar Oct 06 '22 10:10 alallema