Linas Valiukas

28 issues by Linas Valiukas

https://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl-delay_directive

enhancement

For example, if the `Content-Type` for `/robots.txt` is `text/html` rather than `text/plain`, this usually means that the file is missing and a 404 page was returned instead, so there's no...

bug
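The `Content-Type` heuristic from the issue above could be sketched roughly as follows. This is only an illustration: the function names `media_type` and `is_usable_robots_txt` are hypothetical, and the choice to reject only `text/html` (rather than require `text/plain`) is an assumption, not something the issue specifies.

```python
def media_type(content_type: str) -> str:
    """Return the lowercased media type without parameters such as charset."""
    return content_type.split(";", 1)[0].strip().lower()


def is_usable_robots_txt(status_code: int, content_type: str) -> bool:
    """Guess whether a /robots.txt response is a real robots.txt file.

    Assumption: an HTML response most likely means the site served an
    error page (for example a "soft 404") instead of the actual file.
    """
    if status_code != 200:
        return False
    return media_type(content_type) != "text/html"
```

A caller would pass in the status code and `Content-Type` header of the fetched `/robots.txt` response and skip parsing whenever the function returns `False`.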

10 levels deep is probably too much:
```
2018-11-26 13:11:19,139 INFO mediawords.util.sitemap.helpers [162086/MainThread]: Fetching URL https://www.juiceplus.com/fr/fr/franchise/sitemap.xml...
2018-11-26 13:11:19,428 INFO mediawords.util.sitemap.fetchers [162086/MainThread]: Parsing sitemap from URL https://www.juiceplus.com/fr/fr/franchise/sitemap.xml...
2018-11-26 13:11:19,508 INFO mediawords.util.sitemap.fetchers...
```

bug

```
2019-07-19 14:26:46,279 INFO mediawords.util.sitemap.media [95859/MainThread]: Fetching sitemap pages for media ID 10 (https://globalvoices.org/)...
2019-07-19 14:26:46,282 INFO usp.fetch_parse [95859/MainThread]: Fetching level 0 sitemap from https://globalvoices.org/robots.txt...
2019-07-19 14:26:46,282 INFO usp.helpers [95859/MainThread]:...
```

bug

```
2019-07-19 14:48:41,974 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/sitemap-posttype-post.200705.xml...
2019-07-19 14:48:59,852 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap-posttype-post.200705.xml...
2019-07-19 14:48:59,932 ERROR usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap-posttype-post.200705.xml failed:...
```

bug

As a follow-up to #600, it would be beneficial for us to work out some code that categorizes a list of sitemap URLs:
```
http://www.example.com/
http://www.example.com/about.html
http://www.example.com/contact.html
http://www.example.com/category/apples/
http://www.example.com/category/flowers/
http://www.example.com/2019/01/01/article-1.html
...
```

enhancement
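One possible starting point for the URL categorization described above is a small rule-based classifier. The category names (`homepage`, `dated_article`, `category`, `other`) and the path patterns below are assumptions chosen to match the example URLs; the issue text shown here does not define the actual categories.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumption: article URLs start with a /YYYY/MM/DD/ date prefix.
DATED_PATH = re.compile(r"^/\d{4}/\d{2}/\d{2}/")


def categorize_url(url: str) -> str:
    """Assign a single hypothetical category to a sitemap URL based on its path."""
    path = urlparse(url).path or "/"
    if path == "/":
        return "homepage"
    if DATED_PATH.match(path):
        return "dated_article"
    if path.startswith("/category/"):
        return "category"
    return "other"


def group_urls(urls):
    """Group URLs by category, preserving input order within each group."""
    groups = defaultdict(list)
    for url in urls:
        groups[categorize_url(url)].append(url)
    return dict(groups)
```

Calling `group_urls()` on the example list would place `about.html` and `contact.html` under `other`, which is where additional rules would need to go.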

Hey James, Can you move Docker images from Google Cloud Registry back to Docker Hub? We used to use Docker Hub for storing all of our Docker images. Then at...

enhancement

Podcast transcoding fails for some episodes because:
```
$ docker service logs $(docker service ls | grep podcast-transcribe-episode-temporal-worker | awk '{ print $1 }')
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc | INFO podcast_transcribe_episode.workflow: Fetching, transcoding,...
```

bug

`extract-and-vector` workers tend to fill up `/var/tmp` with gigabytes of nearly identical files, each with a size of either 0 or 3332489:
```
$ docker exec -it 689b33c92426...
```

bug

So, now that we have come up with lists of media sources / feeds to be merged into each other (#799), let's try doing the actual merging. Given that: * We...

enhancement