CommonCrawler

Accessible as a CLI binary.

Open ChrisCates opened this issue 7 years ago • 29 comments

Summary

Make this program usable both as a Go library and from the terminal. Ensure that it works correctly in the terminal.

Requirements

  • Must have a download archive feature so that you can get the latest entries from 2019 and beyond.

  • Must download files autonomously for a given date range.

  • Must be able to extract compressed .wet files.

  • Please review the README.md for the proposed functionality.
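
For the extraction requirement, here is a minimal sketch using only the Go standard library. It assumes the compressed .wet files are plain gzip streams; the function names (`gunzip`, `extractFile`, `roundTrip`) and any file paths are hypothetical and not taken from the repository.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"os"
)

// gunzip copies the decompressed contents of the gzip stream r into w.
func gunzip(w io.Writer, r io.Reader) error {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return fmt.Errorf("open gzip stream: %w", err)
	}
	defer zr.Close()
	if _, err := io.Copy(w, zr); err != nil {
		return fmt.Errorf("decompress: %w", err)
	}
	return nil
}

// extractFile decompresses the gzip file at src into a new file at dst.
// Both paths are hypothetical; the real layout is up to the project.
func extractFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	if err := gunzip(out, in); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// roundTrip gzips s in memory and decompresses it again, so the sketch
// can be exercised without touching the network or the disk.
func roundTrip(s string) (string, error) {
	var compressed bytes.Buffer
	zw := gzip.NewWriter(&compressed)
	if _, err := zw.Write([]byte(s)); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	var plain bytes.Buffer
	if err := gunzip(&plain, &compressed); err != nil {
		return "", err
	}
	return plain.String(), nil
}

func main() {
	out, err := roundTrip("WARC/1.0\nsample record\n")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Print(out)
}
```

Streaming through `io.Copy` keeps memory use flat regardless of file size, which seems preferable to reading a whole archive into memory.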

Payment

  • Once #10 is complete, the bounty for this will increase to 2.5

ChrisCates avatar Oct 15 '18 23:10 ChrisCates

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


This issue now has a funding of 0.5 ETH (67.12 USD @ $134.25/ETH) attached to it as part of the AccessibleSoftware fund.

gitcoinbot avatar Mar 26 '19 00:03 gitcoinbot

@zyfrank, great, let me know if you have any questions. 💯 I'll be posting more work later!~

ChrisCates avatar Mar 26 '19 02:03 ChrisCates

@ChrisCates, I did a first investigation. I think what I can do is:

  1. Use cobra to enhance the config.

  2. I think we can have two commands: the first is download (which includes downloading and unzipping files), the second is analyze. That way you can download at one time and analyze at another.

What is your opinion?

zyfrank avatar Mar 26 '19 17:03 zyfrank

@zyfrank, for the CLI tool it only needs to be able to download any common crawl file (and also navigate files by historical date) plus unzip.
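
One way navigating by historical date could look is sketched below. It assumes crawl IDs follow the `CC-MAIN-<year>-<week>` naming pattern (for example `CC-MAIN-2019-13`); that pattern, and the helper names `parseCrawlID` and `filterCrawls`, are assumptions of this sketch rather than anything defined in the repository.

```go
package main

import (
	"fmt"
)

// parseCrawlID extracts the year and week from an ID such as
// "CC-MAIN-2019-13". The ID format is an assumption of this sketch.
func parseCrawlID(id string) (year, week int, err error) {
	if _, err = fmt.Sscanf(id, "CC-MAIN-%d-%d", &year, &week); err != nil {
		return 0, 0, fmt.Errorf("unrecognized crawl ID %q: %w", id, err)
	}
	return year, week, nil
}

// filterCrawls keeps the IDs whose (year, week) lies in the inclusive range
// from fromYear/fromWeek to toYear/toWeek. IDs that do not parse are skipped.
func filterCrawls(ids []string, fromYear, fromWeek, toYear, toWeek int) []string {
	lo := fromYear*100 + fromWeek
	hi := toYear*100 + toWeek
	var kept []string
	for _, id := range ids {
		y, w, err := parseCrawlID(id)
		if err != nil {
			continue
		}
		if key := y*100 + w; key >= lo && key <= hi {
			kept = append(kept, id)
		}
	}
	return kept
}

func main() {
	ids := []string{"CC-MAIN-2018-51", "CC-MAIN-2019-04", "CC-MAIN-2019-13", "CC-MAIN-2020-05"}
	for _, id := range filterCrawls(ids, 2019, 1, 2019, 52) {
		fmt.Println(id)
	}
}
```

Comparing `year*100 + week` keeps the range check to a single integer comparison; a real implementation might prefer proper date handling.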

The analyze tool is just used as a demo and will be moved to goveralls, which I will assign as a separate task and bounty in issue #1.

ChrisCates avatar Mar 26 '19 17:03 ChrisCates

@zyfrank, you can use this as a reference: https://commoncrawl.s3.amazonaws.com/ for navigating files in common crawl.
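
As a rough illustration of how download URLs might be derived from that base, the sketch below joins a base URI with relative paths read from a path listing. It assumes the listing holds one relative path per line and treats `start` and `stop` as a half-open index range into it; both are assumptions, and `readPaths` and `buildURLs` are hypothetical helper names. The paths in `main` are placeholders, not real Common Crawl paths.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readPaths returns the non-empty lines of r, one relative path per line.
func readPaths(r io.Reader) ([]string, error) {
	var paths []string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if line := strings.TrimSpace(sc.Text()); line != "" {
			paths = append(paths, line)
		}
	}
	return paths, sc.Err()
}

// buildURLs joins baseURI with paths[start:stop]. Treating start and stop as
// a half-open index range is an assumption; out-of-range values are clamped.
func buildURLs(baseURI string, paths []string, start, stop int) []string {
	if start < 0 {
		start = 0
	}
	if stop > len(paths) {
		stop = len(paths)
	}
	if start >= stop {
		return nil
	}
	base := strings.TrimRight(baseURI, "/")
	urls := make([]string, 0, stop-start)
	for _, p := range paths[start:stop] {
		urls = append(urls, base+"/"+strings.TrimLeft(p, "/"))
	}
	return urls
}

func main() {
	// Placeholder entries standing in for the contents of a path listing.
	listing := "example/a.warc.wet.gz\nexample/b.warc.wet.gz\nexample/c.warc.wet.gz\n"
	paths, err := readPaths(strings.NewReader(listing))
	if err != nil {
		fmt.Println("read paths:", err)
		return
	}
	for _, u := range buildURLs("https://commoncrawl.s3.amazonaws.com/", paths, 0, 2) {
		fmt.Println(u)
	}
}
```

Trimming slashes on both sides avoids a doubled or missing separator whether or not the base URI ends with `/`.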

If this goes well, I will be adding another bounty for 1 ETH on the goveralls issue (#1) next week if you'd like to take it.

ChrisCates avatar Mar 26 '19 17:03 ChrisCates

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


Work for 0.5 ETH (68.81 USD @ $137.61/ETH) has been submitted by:

  1. @zyfrank

@ChrisCates please take a look at the submitted work:

  • PR by @zyfrank

gitcoinbot avatar Mar 27 '19 07:03 gitcoinbot

Seems Travis has an authentication error.

zyfrank avatar Mar 27 '19 07:03 zyfrank

Is this issue already closed or should someone still work on it?

rauchp avatar Apr 03 '19 20:04 rauchp

Hi @pedrojor2,

If @zyfrank is up to retrying, I think he should still get a chance. If not, happy for you to try.

I actually do want to refactor this repository, simply so that the formatting is better and it is easier to use. I'm not sure if @zyfrank completely understood what my intention was in order to build it into a CLI executable.

I am looking to allocate a couple of hours this Friday.

ChrisCates avatar Apr 04 '19 04:04 ChrisCates

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


Work has been started.

These users each claimed they can complete the work by 7 months, 4 weeks ago. Please review their action plans below:

1) josprachi has started work.

I am learning Golang. I want to work on this issue

2) jay-dee7 has started work.

i've been working with go for 2 years now and also expert in docker containers and tooling.

Learn more on the Gitcoin Issue Details page.

gitcoinbot avatar Apr 04 '19 16:04 gitcoinbot

Hi, I need help. When I tried to run it, I got this error:

go run: cannot run *_test.go files (src/analyze_test.go)

Please guide.

josprachi avatar Apr 08 '19 13:04 josprachi

Hi @josprachi. Could you tell me what OS and version of Go you're using? I will whip up a Go container as per: https://github.com/ChrisCates/CommonCrawler/issues/10

ChrisCates avatar Apr 10 '19 21:04 ChrisCates

Hello @ChrisCates, I am using the following: elementary OS, Linux 4.15.0-47-generic #50~16.04.1-Ubuntu SMP Fri Mar 15 16:06:21 UTC 2019

josprachi avatar Apr 11 '19 03:04 josprachi

Okay, are you able to run Docker in that OS? If so, we can debug from there.

ChrisCates avatar Apr 11 '19 03:04 ChrisCates

Hi, I am able to run Docker now.

josprachi avatar Apr 12 '19 10:04 josprachi

@ChrisCates Do you want me to take this?

vreddhi avatar May 03 '19 05:05 vreddhi

I submitted a bounty for #10. Once that is complete, we can discuss next steps.

ChrisCates avatar May 04 '19 19:05 ChrisCates

@ChrisCates let's discuss this

iamonuwa avatar May 06 '19 00:05 iamonuwa

@iamonuwa, great, yes! So in https://github.com/ChrisCates/CommonCrawler/blob/master/README.md I've specified a configuration that I'd like for us to use.

If you have any questions about the proposed command line interface, let me know. I'll be back on Friday to discuss more, as today and this week I need to focus on other stuff.

ChrisCates avatar May 06 '19 00:05 ChrisCates

What does each of these commands do?

commoncrawler --base-uri https://commoncrawl.s3.amazonaws.com/
commoncrawler --wet-paths wet.paths
commoncrawler --data-folder output/crawl-data
commoncrawler --start 0
commoncrawler --stop 5

iamonuwa avatar May 06 '19 00:05 iamonuwa

Do you wish to build a full cli tool from this project?

iamonuwa avatar May 06 '19 00:05 iamonuwa

@iamonuwa, those are configurations for using it as a binary.

An example of usage:

commoncrawler start --base-uri https://commoncrawler.com

That would use a different base path for where Common Crawl files are stored. This should update the Config struct as well.

The intended functionality should work both as a library and as a CLI tool when compiled. I will be preparing an issue (with bounty) for making it fully usable as a library. So for now, just focus on it being a CLI tool.
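
A minimal sketch of how a `start` subcommand could parse these options into a Config struct, using only the standard library `flag` package. The flag names and default values come from the commands quoted above; the Config field names and the `defaultConfig` and `parseArgs` helpers are hypothetical and may not match the repository.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// Config holds the options discussed in this thread.
// The field names are hypothetical, chosen for this sketch.
type Config struct {
	BaseURI    string
	WetPaths   string
	DataFolder string
	Start      int
	Stop       int
}

// defaultConfig returns the values shown in the example commands.
func defaultConfig() Config {
	return Config{
		BaseURI:    "https://commoncrawl.s3.amazonaws.com/",
		WetPaths:   "wet.paths",
		DataFolder: "output/crawl-data",
		Start:      0,
		Stop:       5,
	}
}

// parseArgs expects the "start" subcommand followed by optional flags and
// returns the defaults overridden by whatever flags were supplied.
func parseArgs(args []string) (Config, error) {
	cfg := defaultConfig()
	if len(args) == 0 || args[0] != "start" {
		return cfg, errors.New("usage: commoncrawler start [flags]")
	}
	fs := flag.NewFlagSet("start", flag.ContinueOnError)
	fs.StringVar(&cfg.BaseURI, "base-uri", cfg.BaseURI, "base URI where crawl files are stored")
	fs.StringVar(&cfg.WetPaths, "wet-paths", cfg.WetPaths, "file listing WET paths")
	fs.StringVar(&cfg.DataFolder, "data-folder", cfg.DataFolder, "output folder for downloaded data")
	fs.IntVar(&cfg.Start, "start", cfg.Start, "first index to process")
	fs.IntVar(&cfg.Stop, "stop", cfg.Stop, "index to stop at")
	if err := fs.Parse(args[1:]); err != nil {
		return cfg, err
	}
	return cfg, nil
}

func main() {
	cfg, err := parseArgs(os.Args[1:])
	if err != nil {
		// A real binary would likely exit with a non-zero status here.
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Keeping the parsing in `parseArgs` and returning a plain struct means the same Config could later be filled in directly by library callers, without going through the command line.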

ChrisCates avatar May 06 '19 00:05 ChrisCates

@iamonuwa I've just added: https://github.com/ChrisCates/CommonCrawler/issues/13

ChrisCates avatar May 06 '19 00:05 ChrisCates

> The intended functionality should work both as a library and as a CLI tool when compiled. I will be preparing an issue (with bounty) for making it fully usable as a library. So for now, just focus on it being a CLI tool.

It will affect the project structure a bit, but I will try to capture the expected result.

iamonuwa avatar May 06 '19 00:05 iamonuwa

@iamonuwa, absolutely, that is expected. Just ensure that functionality is relatively the same and it works as intended.

ChrisCates avatar May 06 '19 00:05 ChrisCates

@ChrisCates, is this bounty still active?

zoek1 avatar Jul 31 '19 23:07 zoek1

We will be revisiting all bounties on this repository at a later date. Sorry that it's been inactive for a considerable amount of time.

ChrisCates avatar Aug 01 '19 02:08 ChrisCates

@ChrisCates, is it still active? I would love to work on it.

jay-dee7 avatar Feb 18 '20 14:02 jay-dee7

Any updates?

SeanDunford avatar Jun 17 '20 06:06 SeanDunford