CommonCrawler
Accessible as a CLI binary.
Summary
Make this program usable both as a Go library and from the terminal. Ensure that it works correctly in the terminal.
Requirements
- Must have a download archive feature so that you can get the latest entries from 2019 and beyond.
- Must download files autonomously from a certain date range.
- Must be able to extract compressed .wet files.
- Please review the README.md for the proposed functionality.
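To make the extraction requirement concrete, here is a minimal sketch in Go. It assumes the compressed .wet files are gzip archives, which is how Common Crawl distributes them; the function names `extractGzip`, `gunzipBytes` and `gzipBytes` are hypothetical and not taken from this repository. It uses only the standard library.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// extractGzip (hypothetical name) streams the decompressed content of src
// into dst and returns the number of bytes written. gzip.Reader reads
// concatenated gzip members as one stream by default.
func extractGzip(dst io.Writer, src io.Reader) (int64, error) {
	zr, err := gzip.NewReader(src)
	if err != nil {
		return 0, err
	}
	defer zr.Close()
	return io.Copy(dst, zr)
}

// gunzipBytes is an in-memory convenience wrapper around extractGzip.
func gunzipBytes(compressed []byte) ([]byte, error) {
	var out bytes.Buffer
	if _, err := extractGzip(&out, bytes.NewReader(compressed)); err != nil {
		return nil, err
	}
	return out.Bytes(), nil
}

// gzipBytes exists only so that this sketch can build its own test input.
func gzipBytes(plain []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(plain); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	// Placeholder text standing in for real WET content.
	sample := []byte("placeholder text standing in for WET content\n")
	compressed, err := gzipBytes(sample)
	if err != nil {
		fmt.Println("compress:", err)
		return
	}
	plain, err := gunzipBytes(compressed)
	if err != nil {
		fmt.Println("extract:", err)
		return
	}
	fmt.Printf("round trip ok: %t\n", bytes.Equal(sample, plain))
}
```

For files on disk, the same `extractGzip` function could be given an `os.Open` handle as the source and an `os.Create` handle as the destination, so that large archives are never held fully in memory.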
Payment
- Once #10 is complete, the bounty for this will increase to 2.5
Issue Status: 1. Open 2. Started 3. Submitted 4. Done
This issue now has a funding of 0.5 ETH (67.12 USD @ $134.25/ETH) attached to it as part of the AccessibleSoftware fund.
- If you would like to work on this issue you can 'start work' on the Gitcoin Issue Details page.
- Want to chip in? Add your own contribution here.
- Questions? Checkout Gitcoin Help or the Gitcoin Slack
- $50,186.80 more funded OSS Work available on the Gitcoin Issue Explorer
@zyfrank, great, let me know if you have any questions. 💯 I'll be posting more work later!~
@ChrisCates, I did a first investigation. I think I can do the following:
- Use cobra to enhance the config.
- Have two commands: the first is download (which includes downloading and unzipping files), the second is analyze. That way you can download at one time and analyze at another.
What is your opinion?
@zyfrank, for the CLI tool it only needs to be able to download any common crawl file (and also navigate files by historical date) plus unzip.
The analyze tool is just used as a demo and will be moved to goveralls which I will assign to another task and bounty in issue #1
@zyfrank, you can use this as a reference: https://commoncrawl.s3.amazonaws.com/ for navigating files in common crawl.
If this goes well, I will be adding another bounty for 1 ETH on the goveralls issue (#1) next week if you'd like to take it.
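As a rough illustration of navigating files under that base URL, the sketch below turns a list of relative paths into download URLs. It assumes the paths come one per line, as in a Common Crawl paths listing, and that a start/stop pair selects a half-open index range; both the range semantics and the names `parsePaths` and `buildURLs` are assumptions made for this example, not the repository's API. The sample paths are placeholders, not real crawl files.

```go
package main

import (
	"fmt"
	"strings"
)

// parsePaths (hypothetical name) splits a paths listing into its
// non-empty, whitespace-trimmed lines.
func parsePaths(text string) []string {
	var paths []string
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line != "" {
			paths = append(paths, line)
		}
	}
	return paths
}

// buildURLs (hypothetical name) joins baseURI with the paths whose index
// lies in [start, stop). Out-of-range bounds are clamped, and an empty
// selection yields nil. The half-open range is an assumption.
func buildURLs(baseURI string, paths []string, start, stop int) []string {
	if start < 0 {
		start = 0
	}
	if stop > len(paths) {
		stop = len(paths)
	}
	if start >= stop {
		return nil
	}
	base := strings.TrimRight(baseURI, "/")
	urls := make([]string, 0, stop-start)
	for _, p := range paths[start:stop] {
		urls = append(urls, base+"/"+strings.TrimLeft(p, "/"))
	}
	return urls
}

func main() {
	// Placeholder listing; real entries would come from a paths file.
	listing := "crawl-data/EXAMPLE/wet/file-00000.warc.wet.gz\n" +
		"crawl-data/EXAMPLE/wet/file-00001.warc.wet.gz\n"
	paths := parsePaths(listing)
	for _, u := range buildURLs("https://commoncrawl.s3.amazonaws.com/", paths, 0, 5) {
		fmt.Println(u)
	}
}
```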
Work for 0.5 ETH (68.81 USD @ $137.61/ETH) has been submitted by:
@ChrisCates please take a look at the submitted work:
- PR by @zyfrank
- Learn more on the Gitcoin Issue Details page
It seems Travis has an authentication error.
Is this issue already closed or should someone still work on it?
Hi @pedrojor2,
If @zyfrank is up to retrying, I think he should still get a chance. If not, happy for you to try.
I actually do want to refactor this repository, simply so that the formatting is better and it is easier to use. I'm not sure if @zyfrank completely understood my intention to build it into a CLI executable.
I am looking to allocate a couple of hours this Friday.
Work has been started.
These users each claimed they can complete the work by 7 months, 4 weeks ago. Please review their action plans below:
1) josprachi has started work.
I am learning Golang. I want to work on this issue.
2) jay-dee7 has started work.
I've been working with Go for 2 years now and am also an expert in Docker containers and tooling.
Learn more on the Gitcoin Issue Details page.
Hi, I need help. When I try to run it, I get this error: go run: cannot run *_test.go files (src/analyze_test.go). Please guide me.
Hi @josprachi. Could you tell me what OS and version of Go you're using? I will whip up a Go container as per: https://github.com/ChrisCates/CommonCrawler/issues/10
Hello @ChrisCates, I am using elementary OS: Linux 4.15.0-47-generic #50~16.04.1-Ubuntu SMP Fri Mar 15 16:06:21 UTC 2019
Okay, are you able to run Docker in that OS? If so, we can debug from there.
Hi, I am able to run Docker now.
@ChrisCates Do you want me to take this?
I submitted a bounty for #10. Once that is complete, we can discuss next steps.
@ChrisCates let's discuss this
@iamonuwa, great, yes! So in https://github.com/ChrisCates/CommonCrawler/blob/master/README.md I've specified a configuration that I'd like for us to use.
If you have any questions about the proposed command line interface, let me know. I'll be back on Friday to discuss more, as I need to focus on other things today and for the rest of this week.
What does each of these commands do?
commoncrawler --base-uri https://commoncrawl.s3.amazonaws.com/
commoncrawler --wet-paths wet.paths
commoncrawler --data-folder output/crawl-data
commoncrawler --start 0
commoncrawler --stop 5
Do you wish to build a full cli tool from this project?
@iamonuwa, those are configurations for using it as a binary.
An example of usage:
commoncrawler start --base-uri https://commoncrawler.com
That would use a different base path for where CommonCrawl files are stored. It should update the Config struct as well.
The intended functionality should work both as a library and as a CLI tool when compiled. I will be preparing an issue (with bounty) for making it fully usable as a library. So for now, just focus on it being a CLI tool.
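One way the flags could feed a Config struct is sketched below with the standard library `flag` package (not cobra, which was suggested earlier in the thread). The field names, the validation rule, and the `parseConfig` helper are hypothetical; the flag names and default values are the ones quoted in this discussion. Treating `start` as an optional leading word is also an assumption based on the usage example above.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Config mirrors the options discussed in this thread. The field names
// are hypothetical and may differ from the repository's Config struct.
type Config struct {
	BaseURI    string
	WetPaths   string
	DataFolder string
	Start      int
	Stop       int
}

// parseConfig (hypothetical name) starts from the defaults quoted in the
// thread and overrides them with any flags found in args.
func parseConfig(args []string) (Config, error) {
	cfg := Config{
		BaseURI:    "https://commoncrawl.s3.amazonaws.com/",
		WetPaths:   "wet.paths",
		DataFolder: "output/crawl-data",
		Start:      0,
		Stop:       5,
	}
	fs := flag.NewFlagSet("commoncrawler", flag.ContinueOnError)
	fs.StringVar(&cfg.BaseURI, "base-uri", cfg.BaseURI, "base URI where crawl files are stored")
	fs.StringVar(&cfg.WetPaths, "wet-paths", cfg.WetPaths, "file listing WET paths")
	fs.StringVar(&cfg.DataFolder, "data-folder", cfg.DataFolder, "folder for downloaded data")
	fs.IntVar(&cfg.Start, "start", cfg.Start, "first index to process")
	fs.IntVar(&cfg.Stop, "stop", cfg.Stop, "index to stop at")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	// Assumed sanity check: the range must not be negative or reversed.
	if cfg.Start < 0 || cfg.Stop < cfg.Start {
		return Config{}, fmt.Errorf("invalid range: start=%d stop=%d", cfg.Start, cfg.Stop)
	}
	return cfg, nil
}

func main() {
	args := os.Args[1:]
	// Assumption: accept an optional leading "start" word before the flags.
	if len(args) > 0 && args[0] == "start" {
		args = args[1:]
	}
	cfg, err := parseConfig(args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Keeping the parsing in a function that takes a plain argument slice means the same Config can later be built directly by library callers without going through the command line.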
@iamonuwa I've just added: https://github.com/ChrisCates/CommonCrawler/issues/13
The intended functionality should work both as a library and as a CLI tool when compiled. I will be preparing an issue (with bounty) for making it fully usable as a library. So for now, just focus on it being a CLI tool.
That will affect the project structure a bit, but I will try to capture the expected result.
@iamonuwa, absolutely, that is expected. Just ensure that functionality is relatively the same and it works as intended.
@ChrisCates, is this bounty still active?
We will be revisiting all bounties on this repository at a later date. Sorry that it's been inactive for a considerable amount of time.
@ChrisCates, is it still active? I would love to work on it.
Any updates?