
🚀 [Feature] Implement github workflow to publish data daily

Open ulfgebhardt opened this issue 3 years ago • 13 comments

:rocket: Feature

Implement github workflow to publish data daily

  • [ ] GitHub workflow to push the generated data to https://github.com/bundestag/gesetze automatically
  • [ ] run it daily/weekly/regularly
  • [ ] tag GitHub releases (?)
  • [ ] optional Python build/syntax check on the code in this repo (this is sort of a separate issue)

Please help implement it - if you have the free time to do it, you would help solve a three-year-old problem which pops up every election year. Pinging capable and potentially interested people out of the blue: @Muehe @JBBgameich <3

User Problem

We would have plain-text data available here on GitHub:

https://github.com/bundestag/gesetze/issues/55

Implementation

Use github workflows. See examples:

https://github.com/Ocelot-Social-Community/Ocelot-Social/blob/master/.github/workflows/publish.yml
https://github.com/gradido/gradido/blob/master/.github/workflows/publish.yml
https://github.com/mattia-lerario/Mentor-Application-Bachelor-Project/blob/master/.github/workflows/test.yml#L23
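A minimal sketch of such a workflow, modeled on the examples above. All specifics here are assumptions, not verified against this repo: the script invocations (`lawde.py loadall`, `lawdown.py convert …`) should be checked against the README, and the secret name `GESETZE_PUSH_TOKEN` is hypothetical:

```yaml
# Hypothetical .github/workflows/publish.yml sketch.
# Script CLIs, paths, and the secret name are assumptions.
name: Publish law data
on:
  schedule:
    - cron: '0 3 * * *'    # run daily at 03:00 UTC
  workflow_dispatch:        # allow manual runs too
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'
      - run: pip install -r requirements.txt
      - name: Scrape and convert laws    # check exact CLI against the README
        run: |
          python lawde.py loadall
          python lawdown.py convert data gesetze-out
      - name: Push results to bundestag/gesetze
        env:
          TOKEN: ${{ secrets.GESETZE_PUSH_TOKEN }}   # hypothetical secret
        run: |
          git clone "https://x-access-token:${TOKEN}@github.com/bundestag/gesetze" target
          rsync -a gesetze-out/ target/
          cd target
          git add -A
          git commit -m "Automated update $(date -I)" || echo "no changes"
          git push
```

A `workflow_dispatch` trigger alongside the cron schedule makes it easy to test the workflow manually before relying on the daily run.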

Additional context

Ideally it would also be possible to somehow create a "binary", or at least run the interpreter over the code. Unfortunately I have never done anything serious with Python; I believe that means creating packages or something along those lines.

I have no objection at all to merging this. The repo is currently pretty dead.

However, with the upcoming Bundestag election I am getting more requests about https://github.com/bundestag/gesetze, and as far as I understand it, this repo is responsible for generating its content.

See: bundestag/gesetze#59 and bundestag/gesetze#55
Today I received a request from @Muehe regarding the repo.

I would find it great if we could collaboratively write a script for the GitHub workflows, so that we get regular updates similar to the repos we crawl for the democracy project:

https://github.com/bundestag/NamedPolls
https://github.com/bundestag/NamedPollDeputies
https://github.com/bundestag/ConferenceWeekDetails
https://github.com/bundestag/dip21-daten
Unfortunately, the people from https://github.com/bundestag/offenegesetze.de have not yet stepped in to take over this task.

So @JBBgameich: are you up for doing something like this? Should I merge this PR now?


ulfgebhardt avatar Mar 27 '21 03:03 ulfgebhardt

Who maintains bundestag/gesetze? Who has pull/merge rights?

There are lots of open pull requests which haven't been merged yet. One should first get the manual workflow running before trying to automate things.

darkdragon-001 avatar Mar 27 '21 12:03 darkdragon-001

Most of the pull requests are either jokes, drafts, or too large to review. Generating an up-to-date version from source is probably a better course of action.

jbruechert avatar Mar 27 '21 12:03 jbruechert

I have sort of started taking responsibility, since people come to me and ask about the repo, though I have nothing to do with it. My approach is to find people who want to do it. I have all the rights needed and can also propagate those rights. I invite people to the organization if they have a commit on a repo in the org or a featured fork. This should give you more rights, though I'm not sure merge rights are included.

So if you want to do the automatic push thing, we can certainly make that happen rights-wise.

ulfgebhardt avatar Mar 27 '21 15:03 ulfgebhardt

Does anyone have an idea how to efficiently determine which laws changed since the last run?

While this is easy for the scrapers (BGBl, BAnz, ...), since their sources are ordered by date, it is not so easy for the laws themselves. There is the Aktualitätendienst, which can be mapped to the corresponding entries in the scraped data based on page number, but I don't see how that can determine which laws (name or slug) actually changed. Any ideas?

darkdragon-001 avatar May 17 '21 20:05 darkdragon-001

I am wondering if it makes sense to use https://github.com/actions/cache for storing the JSON data, instead of committing it to some repo, since it is fully generated. @ulfgebhardt do you have an opinion here?
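For reference, caching the intermediate JSON between runs with actions/cache would look roughly like the step below. The `path` and key names are assumptions for illustration:

```yaml
# Hypothetical cache step; path and key scheme are assumptions.
- uses: actions/cache@v4
  with:
    path: data                            # intermediate JSON directory
    key: gesetze-json-${{ github.run_id }}
    restore-keys: |
      gesetze-json-                       # fall back to the newest prior cache
```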

darkdragon-001 avatar Nov 14 '23 08:11 darkdragon-001

I believe that it is worthwhile to store all data in a repo - that way we would make the changes of laws transparent and searchable.

Why would we hide the actual content in some volatile cache? I do not really see the benefits. Furthermore, the actual content we provide is the scraped data; we should ensure maximum visibility and transparency.

But that's all just an opinion ;)

ulfgebhardt avatar Nov 14 '23 08:11 ulfgebhardt

I don't like the fact that tooling and data are mixed in this repository. Also, using and updating the cache just seems easier. And I don't see any added benefit in storing this data, as it is fully reproducible and verifiable by anyone. No strong objection, just my personal opinion.

darkdragon-001 avatar Nov 14 '23 09:11 darkdragon-001

Tooling happens here: https://github.com/bundestag/gesetze-tools
Data happens here: https://github.com/bundestag/gesetze

The data is not reproducible, since the official websites do not provide a history, do they?

ulfgebhardt avatar Nov 14 '23 12:11 ulfgebhardt

I am talking about the intermediate JSON files stored in https://github.com/bundestag/gesetze-tools/tree/master/data. I agree that the final Markdown files should be committed via Git to the other repository.

darkdragon-001 avatar Nov 14 '23 13:11 darkdragon-001

OK, then I misunderstood.

ulfgebhardt avatar Nov 14 '23 14:11 ulfgebhardt

Hi! Sorry for being late to the party.

I am wondering if it makes sense to use https://github.com/actions/cache for storing the json data

Don't cache, always publish. If the data helps our next automated run, it will usually also help humans with their next manually invoked run. For data where git can make meaningful, useful diffs, pushing it to a repo is a good idea. For everything else, let's instead make it part of a "release", i.e. a GitHub-hosted blob download.
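Publishing such blobs as release assets could be a single workflow step using the `gh` CLI, which is preinstalled on GitHub-hosted runners. The tag scheme and file glob below are assumptions:

```yaml
# Hypothetical release step; tag naming and asset paths are assumptions.
- name: Publish non-diffable artifacts as a release
  env:
    GH_TOKEN: ${{ github.token }}   # gh CLI reads the token from this env var
  run: |
    tag="data-$(date -I)"           # assumed tag scheme, e.g. data-2023-11-14
    gh release create "$tag" data/*.json \
      --title "Data snapshot $tag" \
      --notes "Automated daily snapshot"
```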

I don't like the fact that tooling and data is mixed in this repository.

Yes, we should strictly separate both.

I had a quick look at gesetze-tools and see several Python scripts. I assume they need to run in a temporary clone of the gesetze repo, right? From the readme I see lawde.py and lawdown.py have to run chained. Can the others run in parallel, each in their own gesetze clone (probably with the working directory set to the repo root), or do some of them depend on another's results? Will some of them conflict when run in parallel against the same (shared) gesetze clone? Which files do I need to collect and publish from which of the tools?

mk-pmb avatar Nov 14 '23 15:11 mk-pmb

Edit: Moved to #36

mk-pmb avatar Nov 14 '23 15:11 mk-pmb

Also, it would be nice to have a small dummy version of the data repo, with all important structures at the latest version but much faster to clone. Or can I just pick an ancient commit? My hope is to enable quick test runs for debugging that will probably produce wrong results, but can preview whether a run would have worked against the real data repo.

mk-pmb avatar Nov 14 '23 15:11 mk-pmb