city-scrapers

New/add spider chi_ssa_35 and its test case

Open sosolidkk opened this issue 4 years ago • 3 comments

Summary

Issue: #568

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

  • [X] Tests are implemented
  • [X] All tests are passing
  • [X] Style checks run (see documentation for more details)
  • [X] Style checks are passing
  • [ ] Code comments from template removed

Questions

I have some doubts about this spider. The website itself lists the time, date, and place of every meeting, but the details of the meeting happening on the current day are inside a .pdf document. What I did was put the content of the .pdf document into the description field in the spider. I don't know whether that was the correct approach, or whether the right way would be to iterate over the .pdf documents and parse the data inside them as meetings.

sosolidkk avatar Nov 29 '20 15:11 sosolidkk

Hello @pjsier, I was updating some stalled code and made the corrections you suggested. I also updated the code so that the year of each item is correct, since it was previously hard-coded with datetime.today().year. The only problem I still have is your suggested change to use Minutes and Agenda in the title. I can't think of a way to do this dynamically, since all I have on the page is an <h4> followed by several <p> tags that contain the links. I would more or less have to count them, which makes it a more hard-coded process. Do you have any better suggestions?

sosolidkk avatar Jan 10 '21 12:01 sosolidkk

@sosolidkk thanks for the changes! As I mentioned in the comment, the href attribute usually contains "Agenda" or "Minutes", which is one way to do it. You could also loop through a selector that iterates through the immediate children of .content and updates the document name any time it runs into an h4.
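As a rough illustration of the href-based idea, a minimal sketch could look like the following. The helper name `title_from_href`, the fallback title "Link", and the example paths are all hypothetical and are not taken from the spider in this PR.

```python
def title_from_href(href, default="Link"):
    """Guess a link title from keywords in its href.

    Hypothetical helper: assumes the href contains "Agenda" or "Minutes"
    somewhere in its path, in any letter case.
    """
    lowered = href.lower()
    for keyword in ("Agenda", "Minutes"):
        if keyword.lower() in lowered:
            return keyword
    # No known keyword found, so fall back to a generic title.
    return default
```

For a made-up path such as `/docs/2021-01-Agenda.pdf` this returns "Agenda". Links whose href carries neither keyword fall back to the default, so the heading-based approach may still be needed for those.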

pjsier avatar Jan 11 '21 14:01 pjsier

Hey @pjsier, sorry for the delay. I've updated this PR with the changes you requested. Now I'm iterating over all the inner elements of the body and separating the items into groups based on their <h4> title value, which can be Agenda, Schedule, or Minutes.
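The grouping described here can be sketched roughly as follows. This is not the code from the PR: it uses the standard library's `html.parser` as a stand-in for Scrapy selectors so that it runs on its own. The class name, the function name, and the sample markup are assumptions about what the page looks like.

```python
from html.parser import HTMLParser


class HeadingLinkGrouper(HTMLParser):
    """Collect link hrefs under the most recent <h4> heading text."""

    def __init__(self):
        super().__init__()
        self.groups = {}       # heading text -> list of hrefs
        self.current = None    # heading the parser is currently under
        self._in_h4 = False
        self._h4_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            # Start collecting the text of a new heading.
            self._in_h4 = True
            self._h4_text = []
        elif tag == "a" and self.current is not None:
            href = dict(attrs).get("href")
            if href:
                self.groups[self.current].append(href)

    def handle_data(self, data):
        if self._in_h4:
            self._h4_text.append(data)

    def handle_endtag(self, tag):
        if tag == "h4":
            # The heading is complete; later links belong to it.
            self._in_h4 = False
            self.current = "".join(self._h4_text).strip()
            self.groups.setdefault(self.current, [])


def group_links_by_heading(html):
    """Return a dict mapping each <h4> title to the hrefs that follow it."""
    parser = HeadingLinkGrouper()
    parser.feed(html)
    parser.close()
    return parser.groups


# Assumed page structure, for illustration only.
SAMPLE = """
<div class="content">
  <h4>Agenda</h4>
  <p><a href="/agenda-jan.pdf">January</a></p>
  <p><a href="/agenda-feb.pdf">February</a></p>
  <h4>Minutes</h4>
  <p><a href="/minutes-jan.pdf">January</a></p>
</div>
"""

groups = group_links_by_heading(SAMPLE)
```

With the sample markup above, `groups` maps "Agenda" to its two links and "Minutes" to its one link. Links that appear before the first heading are skipped in this sketch.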

sosolidkk avatar Feb 17 '21 19:02 sosolidkk