beets
Fetched lyrics from Genius are incomplete
Problem
This is an example of the fetched lyrics:
[Verse 1: Killer Mike] Hear what I say, we are the business today Fuck shit is finished today (What) RT and J—we the new PB & J We dropped a classic today (What) We did a tablet of acid today Lit joints with the matches and ashes away SKRRRT! We dash away Donner and Dixon, the pistol is blasting away
[Verse 2: El-P] Doctors of death Curing our patients of breath We are the pain you can trust Crooked at work Cookin' up curses and slurs Smokin' my brain into mush I became famous for flamin' you fucks Maimin' my way through the brush There is no training or taming of me and my bruh Look like a man, but I'm animal raw
[Verse 3: Killer Mike] We are the murderous pair That went to jail and we murdered the murderers there Then went to Hell and discovered the devil Delivered some hurt and despair Used to have powder to push Now I smoke pounds of the kush Holy, I'm burnin' a bush Now I give a fuck about none of this shit Jewel runner over and out of this bitch
The full lyrics are at https://genius.com/Run-the-jewels-legend-has-it-lyrics
Setup
- OS: arch linux
- Python version: 3.11.3
- beets version: 1.6.1
- Turning off plugins made problem go away (yes/no): no
My configuration (output of beet config) is:
lyrics:
    bing_lang_from: []
    force: yes
    sources: genius
    auto: yes
    bing_client_secret: REDACTED
    bing_lang_to:
    google_API_key: REDACTED
    google_engine_ID: REDACTED
    genius_api_key: REDACTED
    fallback:
    local: no
    dist_thresh: 0.1
library: ~/.config/beets/library.db
directory: /data/media/audio
plugins: zero lyrics
import:
    copy: no
    from_scratch: yes
    incremental: yes
    log: /data/media/audio/beetlogs.txt
    move: no
    quiet: no
    quiet_fallback: skip
    resume: ask
    timid: no
    write: yes
zero:
    auto: yes
    update_database: yes
    fields: images
    keep_fields: []
Also, there is nothing in the documentation on how to configure the genius_api_key parameter.
Looks like it's only picking up the lyrics from the first div. Presumably, something about the site's structure changed, leading to this problem?
Fixing this should be possible by adapting the scraper at https://github.com/beetbox/beets/blob/0c3f428a601cb40c5fd463791df6229d51b0635e/beetsplug/lyrics.py#L396
Yeah, same problem here: https://genius.com/Gnarls-barkley-go-go-gadget-gospel-lyrics
The lyrics plugin is only picking up the first occurrence of the div:
<div data-lyrics-container="true" class="Lyrics__Container-sc-1ynbvzw-5 Dzxov">
[Intro]
Pump up the peculiar
While I yell unique
F your wondering what you look like, look at me
Ah, let me show you right here
Hey, Ahaha
Ooooh, yeah, yeah, yeah
[Verse 1]
I'm well on my way
I'm almost everything
And this is my day
You make me want to say
[Chorus]
I'm free! Look at me!
Behold everything I'm allowed to see
Free! Come and see
Na, na, na, na, na na na
[Verse 2]
The shapeless, formless, heart is enormous
Bore this, I've worn this, no never what the norm is
Come hear this, it's fearless
Contrast, colour, prisms, so warmin'
Listen and love it
The second occurrence, with the same div class and attribute, is ignored:
<div data-lyrics-container="true" class="Lyrics__Container-sc-1ynbvzw-5 Dzxov">
[Chorus]
I'm freeee! Look at me!
Freedom in hi-fidelity
Free! come and see
Na, na, na, na, na na na
[Verse 3]
What you waitin' on?
I won't ask your, passion, smilin', laughin'
Yieldin', feelin', helpin', healin'
Introduce your neighbour to your saviour
[Chorus]
I'm free! Look at me!
Freedom in hi-fidelity
Free!
Na, na, na, na, na na na
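To make the difference between "first container only" and "all containers" concrete, here is a minimal, dependency-free sketch. It is not the beets code (which uses BeautifulSoup): it swaps in Python's standard html.parser, the class name LyricsContainerParser is hypothetical, and the HTML string is a made-up fixture that only mimics the data-lyrics-container attribute seen above.

```python
from html.parser import HTMLParser


class LyricsContainerParser(HTMLParser):
    """Collect the text of every <div data-lyrics-container="true">.

    Hypothetical helper for illustration only; not part of beets.
    """

    def __init__(self):
        super().__init__()
        self.containers = []  # one text chunk per container div
        self._depth = 0       # div nesting depth inside the current container

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self._depth:
                # Nested div inside a container: track it so we know
                # when the container itself closes.
                self._depth += 1
            elif dict(attrs).get("data-lyrics-container") == "true":
                self._depth = 1
                self.containers.append("")
        elif tag == "br" and self._depth:
            # Line breaks inside a container become newlines.
            self.containers[-1] += "\n"

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.containers[-1] += data


# Made-up fixture: two lyrics containers separated by an unrelated div.
html = (
    '<div data-lyrics-container="true">[Verse 1]<br>first line</div>'
    '<div class="ad">advert</div>'
    '<div data-lyrics-container="true">[Chorus]<br>second line</div>'
)

parser = LyricsContainerParser()
parser.feed(html)
parser.close()

first_only = parser.containers[0]            # roughly what a single find() gives
all_lyrics = "\n\n".join(parser.containers)  # roughly what find_all() plus a loop gives
```

With this fixture, first_only stops after the first verse, which matches the truncated output reported in this issue, while all_lyrics contains both sections.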
Hey guys, did any of you find a solution for this problem? I have been struggling with it for a week already. For example, I want to fetch the lyrics for the song All Eyez On Me by the artist 2Pac:
[Intro: 2Pac] Big Syke, 'Nook, Paint, Bogart, Big Serge (yeah) Y'all know how this shit go (you know) All eyes on me Motherfuckin' OG Roll up in the club and shit, is that right? All eyes on me All eyes on me But you know what?
[Verse 1: 2Pac] I bet you got it twisted, you don't know who to trust So many player-hatin' niggas tryna sound like us Say they ready for the funk, but I don't think they knowin' Straight to the depths of Hell is where those cowards goin' Well, are you still down? Nigga, holla when you see me And let these devils be sorry for the day they finally freed me I got a caravan of niggas every time we ride Hittin' motherfuckers up when we pass by Until I die, live the life of a boss player 'Cause even when I'm high, fuck with me and get crossed later The futures in my eyes, 'cause all I want is cash and thangs A five-double-0 Benz, flauntin' flashy rings, uhh Bitches pursue me like a dream Been known to disappear before your eyes just like a dope fiend It seems, my main thing was to be major paid The game sharper than a motherfuckin' razor blade Say money bring bitches, bitches bring lies One nigga's gettin' jealous and motherfuckers died Depend on me like the first and fifteenth They might hold me for a second, but these punks won't get me We got foe niggas and low riders in ski masks Screamin', "Thug Life" every time they pass, all eyes on me
I am missing nearly all the lyrics of the song. I have tried everything already; does anyone have a solution? PS: I am on Windows and used Python to install beets. I am currently on beets version 1.6.1.
Hi hi hi! Sorry, I'm new to this repo, but I think I can help. It seems like we are only searching for one data-lyrics-container div. But if you run the fetch method using Ice Cube's It Was A Good Day, the page contains 3 data-lyrics-container divs (run this unit test in the test_lyrics.py file):
def test_fetch_with_real_api(self):
    lyrics = genius.fetch('ice-cube', 'it was a good day')
    print(lyrics)
If you set a breakpoint in _scrape_lyrics_from_html and look at the soup variable, you can see that there are 3 data-lyrics-container divs. One way to fix this is to change the line:
lyrics_div = soup.find("div", {"data-lyrics-container": True})
To:
lyrics_divs = soup.find_all("div", {"data-lyrics-container": True})
Once done, iterate through the results and append each div's text to a lyrics variable, like so:
lyrics_divs = soup.find_all("div", {"data-lyrics-container": True})
lyrics = ''
for lyrics_div in lyrics_divs:
    if lyrics_div:
        self.replace_br(lyrics_div)
        lyrics += lyrics_div.get_text()
.....
return lyrics
Let me know if I can make this change! It's my first time making changes to an open source project haha
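One detail worth checking in the loop above: lyrics += lyrics_div.get_text() adds no separator, so the last line of one container may run straight into the first line of the next. A minimal sketch of a separator-aware join follows; join_lyrics_chunks is a hypothetical helper, not part of beets, and the chunks are made-up sample text.

```python
def join_lyrics_chunks(chunks):
    """Join per-container text with exactly one blank line between chunks.

    Hypothetical helper for illustration; empty or whitespace-only
    chunks are dropped, and each chunk is stripped first.
    """
    cleaned = [chunk.strip() for chunk in chunks if chunk and chunk.strip()]
    return "\n\n".join(cleaned)


# Made-up text standing in for the get_text() result of each container.
chunks = ["[Verse 1]\nfirst line", "", "[Chorus]\nsecond line\n"]

glued = "".join(chunks)              # what plain += concatenation produces
joined = join_lyrics_chunks(chunks)  # sections kept apart, no trailing newline
```

In glued, "first line" and "[Chorus]" end up on the same line; joined keeps the sections separated by a blank line.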
@michaeldiazh It does not work for me. I just copied and pasted it, but it gives me an error, and when I modify it, it still doesn't give me the full lyrics.
@Daredevil09m Mhhh let me take a look again when I get home (:
@Daredevil09m
So I reran the test for Run The Jewels' Legend Has It. I refactored the code a bit, so check it out:
Here is the test (I am just printing out the lyrics):
def test_fetch_with_real_api(self):
    lyrics = genius.fetch('Run The Jewels', 'Legend Has It')
    print(lyrics)
Here is the refactored code. Try replacing the _scrape_lyrics_from_html method in the Genius class in the lyrics.py module. Also add the helper method _try_extracting_lyrics_from_non_data_lyrics_container and check if that works!
def _scrape_lyrics_from_html(self, html):
    """Scrape lyrics from a given genius.com html"""
    soup = try_parse_html(html)
    if not soup:
        return

    # Remove script tags that they put in the middle of the lyrics.
    [h.extract() for h in soup('script')]

    # Most of the time, the page contains a div with class="lyrics" where
    # all of the lyrics can be found already correctly formatted.
    # Sometimes, though, it packages the lyrics into separate divs, most
    # likely for easier ad placement.
    lyrics_divs = soup.find_all("div", {"data-lyrics-container": True})
    if not lyrics_divs:
        self._log.debug('Received unusual song page html')
        return self._try_extracting_lyrics_from_non_data_lyrics_container(soup)

    lyrics = ''
    for lyrics_div in lyrics_divs:
        self.replace_br(lyrics_div)
        lyrics += lyrics_div.get_text() + '\n\n'
    return lyrics

def _try_extracting_lyrics_from_non_data_lyrics_container(self, soup):
    """Extract lyrics from a div without attribute data-lyrics-container

    This is the second most common layout on genius.com
    """
    verse_div = soup.find("div", class_=re.compile("Lyrics__Container"))
    if not verse_div:
        if soup.find("div",
                     class_=re.compile("LyricsPlaceholder__Message"),
                     string="This song is an instrumental"):
            self._log.debug('Detected instrumental')
            return "[Instrumental]"
        else:
            self._log.debug("Couldn't scrape page using known layouts")
            return None

    lyrics_div = verse_div.parent
    self.replace_br(lyrics_div)

    ads = lyrics_div.find_all("div",
                              class_=re.compile("InreadAd__Container"))
    for ad in ads:
        ad.replace_with("\n")

    footers = lyrics_div.find_all("div",
                                  class_=re.compile("Lyrics__Footer"))
    for footer in footers:
        footer.replace_with("")
    return lyrics_div.get_text()
You should get these print statements from the test:
I am recreating this branch. I'll have an MR up soon (: