snscrape icon indicating copy to clipboard operation
snscrape copied to clipboard

Add docs

Open lahdjirayhan opened this issue 4 years ago • 13 comments

I think I can try writing some docs (docstrings, examples). I may not be able to document all scrapers and all the machinery in it at the moment, just the ones I've used or can fairly understand. That being said, I think having some sort of documentation on this library is still a good thing to do.

Should I continue working on this, @JustAnotherArchivist? I apologize in advance if this PR feels like it comes out of nowhere.

sidenote: I'm not sure the difference between #6 and #7, and whether this PR is solving which.

lahdjirayhan avatar Dec 12 '21 18:12 lahdjirayhan

I think #6 is more say the limitations and workings of each individual scraper, and #7 is documenting the Python API

TheTechRobo avatar Dec 12 '21 18:12 TheTechRobo

Thank you for your attempt at this! The lack of documentation is definitely a major issue, so I do appreciate this!

One thing that immediately sticks out for me is something I'm deeply allergic to, though I'm not sure I spelt it out anywhere before, so apologies for that. It's duplication of information. It makes future maintenance harder because one has to remember to update things in multiple places. This is inevitably forgotten eventually, which then leads to inconsistencies and confusion.

There are three primary examples here:

  • The twitter-reference.rst file. I'd like to avoid listing the scrapers and even their modules in the docs at all. Instead, they should be discovered automatically through snscrape.modules. Apparently sphinx.ext.autosummary might be able to do this (with hacks). (Basically, adding or removing scrapers or modules should not require any changes to the docs directory.)
  • Function signatures, i.e. argument types, default values, and return value type. This should all happen through type hints, and it appears that Sphinx supports that with autodoc_typehints = "description" (which is even enabled in your docs/conf.py). Default values appear to require an extension.
    • In an ideal world, the argument and return value descriptions would also live in annotations (PEP 593); this is what I mentioned in #6 and doesn't seem to be supported anywhere yet. But at least there's only minimal duplication there (the argument names), so that's acceptable for now.
  • The package metadata and version in docs/conf.py. Note that the version is not recorded anywhere in the code; it comes from setuptools_scm via git tags. This should instead require an installation of snscrape at doc build time and then fetch that data using importlib.metadata.

It probably makes sense to split the documentation writing into three parts: type hints and docstrings for all public API, separate documentation (like a general introduction, CLI vs Python layer, and the Twitter example), and everything directly related to Sphinx (configuration etc.). Most of my points above fall into the last part. Perhaps you'd like to focus on just one for now?


A couple comments on style. For docstrings, it should be this, in line with the few existing ones:

def func(foo: str) -> str:
	'''Basic description

	Args:
		foo: a description of foo's significance
	Returns:
		a description of the return value
	'''

	return foo

That is: single quotes, no empty line at the end of the docstring, and an empty line between the docstring and the code. Further, I think the class description should go into a class-level docstring, and only __init__ argument description into the method docstring.

For the rST files, the same style as for my Python code should be used: tabs for indentation, spaces for alignment. Lines should be broken only where it makes sense (i.e. one paragraph = one line); I'm not a fan of breaking text into lines at random points based on 1980s-era screen widths. :-)

That's all I can think of right now. Looking forward to seeing where this is heading! :-)

JustAnotherArchivist avatar Dec 14 '21 08:12 JustAnotherArchivist

And @TheTechRobo's comment is correct. #6 is about high-level documentation of the existing scrapers, what they target, returned item fields, etc., while #7 is about the usage of snscrape from Python. There's some overlap though, naturally.

JustAnotherArchivist avatar Dec 14 '21 08:12 JustAnotherArchivist

Thank you for your feedback. This is one of my first attempts to contribute to a project that's not my own, so I understand that I need to adjust some style accordingly.

I also agree with the points you brought up about (avoiding) code duplication. I dislike unnecessary duplication myself. It's just that I wasn't (and still am not) experienced with setting up Sphinx, and I think that showed in my previous attempts. I need to look further and try to configure Sphinx better so that docs/ directory (and especially, the "API reference"-like section) will not need to be separately updated and maintained.


I have some questions, though:

  • Instagram: currently, each scraper class still needs the mode parameter to be specified, despite the already specific class names. What should be done here?
    1. Recommend users to use InstagramCommonScraper and set the mode themselves. Or,
    2. Setup each Instagram scrapers so that the mode is set on each class' __init__. For example, something like this:
class InstagramUserScraper(InstagramCommonScraper):
    ...
    def __init__(self, name, **kwargs):
        super().__init__(mode="User", **kwargs)
   ...
  • Reddit: technically I can import the specific scraper classes, but they are not listed in the source code. Instead, they are available through _make_scraper. I understand this is done to vastly reduce code duplication, because each scraper class will just contain mostly the same code anyway. But, how should the docstrings be written in this case?

lahdjirayhan avatar Dec 19 '21 15:12 lahdjirayhan

Yeah, I have little experience with Sphinx myself as well. Incidentally, that's part of why I put off working on the docs for such a long time. But I'm sure we can find a good solution.

Good points on the Instagram and Reddit scrapers. Yes, both of those are a mess and should be refactored. I'll look into that.

On that note, I should also revisit the public vs private parts of the code. For example, I don't consider InstagramCommonScraper part of the public API; people should only use the subclasses like InstagramUserScraper. Documentation is much less important for the private ones, naturally. I'll do some cleanup there as well. Expect a decent number of renames everywhere. Hopefully it won't be too messy to rebase your branch afterwards.

JustAnotherArchivist avatar Dec 23 '21 04:12 JustAnotherArchivist

I have questions:

  • #354 (resolved)
  • #355 (resolved)

lahdjirayhan avatar Jan 11 '22 02:01 lahdjirayhan

I'm getting this:

themandownstairs:~/sndocs git:add-docs ❯❯❯ sphinx-build -b html docs out                                             2
Running Sphinx v4.4.0

Configuration error:
There is a programmable error in your configuration file:

Traceback (most recent call last):
  File "/home/thetechrobo/.local/lib/python3.9/site-packages/sphinx/config.py", line 340, in eval_config_file
    exec(code, namespace)
  File "/home/thetechrobo/sndocs/docs/conf.py", line 35, in <module>
    _major, _minor, _patch, _yyyymmdd = release.split('.')
ValueError: too many values to unpack (expected 4)

TheTechRobo avatar Feb 20 '22 18:02 TheTechRobo

Oh, I think it's because I'm on the dev version.

You should add a check for that

TheTechRobo avatar Feb 20 '22 18:02 TheTechRobo

@TheTechRobo,

It does not happen on my end. The following is the snscrape version:

MINGW64 /e/Git-LINE/justanotherarchivist-snscrape/snscrape (add-docs)
$ snscrape --version
snscrape 0.3.5.dev231+g0832e95
>>> from importlib.metadata import metadata
>>> M = metadata('snscrape')
>>> M['version']
'0.3.5.dev231+g0832e95'

dev231+g0832e95 isn't exactly a YYYYMMDD date format and I probably should check against that, but it shouldn't return the error you mentioned as well. Maybe you have different output for snscrape --version?


@JustAnotherArchivist,

I'm not sure why my version is still at 0.3.5.dev ... surely it should be at 0.4.x by now? I've merged master to my branch every now and then, too. Is this okay? I installed snscrape in a virtual environment using pip install -e ., if it matters.

lahdjirayhan avatar Feb 21 '22 14:02 lahdjirayhan

@lahdjirayhan The version is only updated when you run pip because it's stored in the egg-info. Try rerunning pip install -e . (no uninstall needed) and then checking again. I currently have a local version number like 0.4.3.20220107.dev29+g77bbb9f.d20220220 for example. (That date part at the end signifies that there are uncommitted changes.)

JustAnotherArchivist avatar Feb 21 '22 14:02 JustAnotherArchivist

Also, apologies, I didn't notice that you marked this as ready for review. I'll take a look soon!

JustAnotherArchivist avatar Feb 21 '22 15:02 JustAnotherArchivist

@JustAnotherArchivist Thanks for the info on the version. I'll try it out as soon as I can.

Update: Nope, I've tried rerunning pip install -e . again and it still shows the same 0.3.5.dev version. In addition to that, running git tag shows only up to v.0.3.4.

lahdjirayhan avatar Feb 21 '22 19:02 lahdjirayhan

@lahdjirayhan I'm not sure that merging the branch in also gets the tags. That could be the problem.

TheTechRobo avatar Mar 30 '22 22:03 TheTechRobo