
Fetched lyrics from Genius are incomplete

Open calm3285 opened this issue 1 year ago • 9 comments

Problem

This is an example of the fetched lyrics:

[Verse 1: Killer Mike] Hear what I say, we are the business today Fuck shit is finished today (What) RT and J—we the new PB & J We dropped a classic today (What) We did a tablet of acid today Lit joints with the matches and ashes away SKRRRT! We dash away Donner and Dixon, the pistol is blasting away

[Verse 2: El-P] Doctors of death Curing our patients of breath We are the pain you can trust Crooked at work Cookin' up curses and slurs Smokin' my brain into mush I became famous for flamin' you fucks Maimin' my way through the brush There is no training or taming of me and my bruh Look like a man, but I'm animal raw

[Verse 3: Killer Mike] We are the murderous pair That went to jail and we murdered the murderers there Then went to Hell and discovered the devil Delivered some hurt and despair Used to have powder to push Now I smoke pounds of the kush Holy, I'm burnin' a bush Now I give a fuck about none of this shit Jewel runner over and out of this bitch

The full lyrics are at https://genius.com/Run-the-jewels-legend-has-it-lyrics

Setup

  • OS: arch linux
  • Python version: 3.11.3
  • beets version: 1.6.1
  • Turning off plugins made problem go away (yes/no): no

My configuration (output of beet config) is:

lyrics:
    bing_lang_from: []
    force: yes
    sources: genius
    auto: yes
    bing_client_secret: REDACTED
    bing_lang_to:
    google_API_key: REDACTED
    google_engine_ID: REDACTED
    genius_api_key: REDACTED
    fallback:
    local: no
    dist_thresh: 0.1
library: ~/.config/beets/library.db
directory: /data/media/audio

plugins: zero lyrics

import:
    copy: no
    from_scratch: yes
    incremental: yes
    log: /data/media/audio/beetlogs.txt
    move: no
    quiet: no
    quiet_fallback: skip
    resume: ask
    timid: no
    write: yes
zero:
    auto: yes
    update_database: yes
    fields: images
    keep_fields: []


Also, there is nothing in the documentation on how to configure the genius_api_key parameter.

calm3285 avatar Jun 06 '23 02:06 calm3285

Looks like it's only picking up the lyrics from the first div. Presumably, something about the site's structure changed, leading to this problem?

Fixing this should be possible by adapting the scraper at https://github.com/beetbox/beets/blob/0c3f428a601cb40c5fd463791df6229d51b0635e/beetsplug/lyrics.py#L396
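To make the diagnosis concrete, here is a minimal sketch of the difference between taking only the first lyrics container and taking all of them. It deliberately uses the standard library's html.parser instead of BeautifulSoup (which the plugin itself uses) so that it runs without extra dependencies. The class name LyricsContainerParser and the SAMPLE_HTML fragment are made up for illustration; only the data-lyrics-container attribute comes from the Genius pages discussed here.

```python
from html.parser import HTMLParser


class LyricsContainerParser(HTMLParser):
    """Collect the text of every <div data-lyrics-container=...> element.

    Hypothetical helper for illustration only; not part of beets.
    """

    def __init__(self):
        super().__init__()
        self.containers = []  # one text string per lyrics container
        self._depth = 0       # div nesting depth inside the current container

    def handle_starttag(self, tag, attrs):
        if tag == "br" and self._depth:
            # Line breaks inside a container become newlines.
            self.containers[-1] += "\n"
        elif tag == "div":
            if self._depth:
                self._depth += 1  # nested div inside a container
            elif "data-lyrics-container" in dict(attrs):
                self.containers.append("")
                self._depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.containers[-1] += data


# Made-up page fragment: two lyrics containers separated by an ad div.
SAMPLE_HTML = (
    '<div data-lyrics-container="true">[Verse 1]<br/>first line</div>'
    '<div class="ad">advertisement</div>'
    '<div data-lyrics-container="true">[Verse 2]<br/>second line</div>'
)

parser = LyricsContainerParser()
parser.feed(SAMPLE_HTML)
parser.close()

first_only = parser.containers[0]                # what a single lookup sees
all_lyrics = "\n\n".join(parser.containers)      # what we actually want
```

Here first_only holds just the first verse, while all_lyrics holds both verses and skips the ad text, which matches the symptom reported in this issue.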

wisp3rwind avatar Jun 10 '23 06:06 wisp3rwind

Yeah, same problem here: https://genius.com/Gnarls-barkley-go-go-gadget-gospel-lyrics

Lyrics plugin is only picking up the first occurrence of the div:

<div data-lyrics-container="true" class="Lyrics__Container-sc-1ynbvzw-5 Dzxov">

[Intro]
Pump up the peculiar
While I yell unique
F your wondering what you look like, look at me
Ah, let me show you right here
Hey, Ahaha
Ooooh, yeah, yeah, yeah

[Verse 1]
I'm well on my way
I'm almost everything
And this is my day
You make me want to say

[Chorus]
I'm free! Look at me!
Behold everything I'm allowed to see
Free! Come and see
Na, na, na, na, na na na

[Verse 2]
The shapeless, formless, heart is enormous
Bore this, I've worn this, no never what the norm is
Come hear this, it's fearless
Contrast, colour, prisms, so warmin'
Listen and love it

The second occurrence of the div, with the same class name, is ignored:

<div data-lyrics-container="true" class="Lyrics__Container-sc-1ynbvzw-5 Dzxov">

[Chorus]
I'm freeee! Look at me!
Freedom in hi-fidelity
Free! come and see
Na, na, na, na, na na na

[Verse 3]
What you waitin' on?
I won't ask your, passion, smilin', laughin'
Yieldin', feelin', helpin', healin'
Introduce your neighbour to your saviour

[Chorus]
I'm free! Look at me!
Freedom in hi-fidelity
Free!
Na, na, na, na, na na na

mojolo avatar Jun 18 '23 01:06 mojolo

Hey guys, did any of you find a solution for this problem? I have been struggling with it for a week already. For example, I want to fetch the lyrics for the song All Eyez On Me by 2Pac, and this is all I get:

[Intro: 2Pac] Big Syke, 'Nook, Paint, Bogart, Big Serge (yeah) Y'all know how this shit go (you know) All eyes on me Motherfuckin' OG Roll up in the club and shit, is that right? All eyes on me All eyes on me But you know what?

[Verse 1: 2Pac] I bet you got it twisted, you don't know who to trust So many player-hatin' niggas tryna sound like us Say they ready for the funk, but I don't think they knowin' Straight to the depths of Hell is where those cowards goin' Well, are you still down? Nigga, holla when you see me And let these devils be sorry for the day they finally freed me I got a caravan of niggas every time we ride Hittin' motherfuckers up when we pass by Until I die, live the life of a boss player 'Cause even when I'm high, fuck with me and get crossed later The futures in my eyes, 'cause all I want is cash and thangs A five-double-0 Benz, flauntin' flashy rings, uhh Bitches pursue me like a dream Been known to disappear before your eyes just like a dope fiend It seems, my main thing was to be major paid The game sharper than a motherfuckin' razor blade Say money bring bitches, bitches bring lies One nigga's gettin' jealous and motherfuckers died Depend on me like the first and fifteenth They might hold me for a second, but these punks won't get me We got foe niggas and low riders in ski masks Screamin', "Thug Life" every time they pass, all eyes on me

I am missing nearly all of the song's lyrics. I have tried everything already. Does anyone have a solution? PS: I am on Windows, used Python to install beets, and am currently on beets version 1.6.1.

Daredevil09m avatar Aug 19 '23 10:08 Daredevil09m

Hi hi hi! Sorry, I'm new to this repo, but I think I can help. It seems like we are only searching for one data-lyrics-container div. But if you run the fetch method on Ice Cube's It Was A Good Day, the page contains 3 data-lyrics-container divs. To see this, add this unit test to the test_lyrics.py file and run it:

    def test_fetch_with_real_api(self):
        lyrics = genius.fetch('ice-cube', 'it was a good day')
        print(lyrics)

If you set a breakpoint in _scrape_lyrics_from_html and look at the soup variable, you can see that there are 3 data-lyrics-container divs. One way to fix this is to change the line:

lyrics_div = soup.find("div", {"data-lyrics-container": True})

To:

lyrics_divs = soup.find_all("div", {"data-lyrics-container": True})

Once done, iterate through the results and append each div's lyrics to a lyrics variable, like so:

    lyrics_divs = soup.find_all("div", {"data-lyrics-container": True})
    lyrics = ''
    for lyrics_div in lyrics_divs:
        if lyrics_div:
            self.replace_br(lyrics_div)
            lyrics += lyrics_div.get_text()
    .....
    return lyrics

Let me know if I can make this change! It's my first time making changes in an open source project haha

michaeldiazh avatar Sep 04 '23 20:09 michaeldiazh

@michaeldiazh It does not work for me. I just copied and pasted it, but it gives me an error, and when I modify it, it still doesn't give me the full lyrics.

Daredevil09m avatar Sep 04 '23 21:09 Daredevil09m

@Daredevil09m Mhhh let me take a look again when I get home (:

michaeldiaz0315 avatar Sep 04 '23 22:09 michaeldiaz0315

@Daredevil09m

So I reran the test and got the lyrics for Run The Jewels' Legend Has It. I refactored the code a bit, so check it out:

Here is the test (I am just printing out the lyrics):

    def test_fetch_with_real_api(self):
        lyrics = genius.fetch('Run The Jewels', 'Legend Has It')
        print(lyrics)

Here is the refactored code. Try replacing the _scrape_lyrics_from_html method in the Genius class in the lyrics.py module, add the helper method _try_extracting_lyrics_from_non_data_lyrics_container, and check whether that works!

    def _scrape_lyrics_from_html(self, html):
        """Scrape lyrics from a given genius.com html"""

        soup = try_parse_html(html)
        if not soup:
            return

        # Remove script tags that they put in the middle of the lyrics.
        for script in soup('script'):
            script.extract()

        # Most of the time, the page contains a div with class="lyrics" where
        # all of the lyrics can be found already correctly formatted
        # Sometimes, though, it packages the lyrics into separate divs, most
        # likely for easier ad placement

        lyrics_divs = soup.find_all("div", {"data-lyrics-container": True})
        if not lyrics_divs:
            self._log.debug('Received unusual song page html')
            return self._try_extracting_lyrics_from_non_data_lyrics_container(soup)
        lyrics = ''
        for lyrics_div in lyrics_divs:
            self.replace_br(lyrics_div)
            lyrics += lyrics_div.get_text() + '\n\n'
        return lyrics

    def _try_extracting_lyrics_from_non_data_lyrics_container(self, soup):
        """Extract lyrics from a div without attribute data-lyrics-container
        This is the second most common layout on genius.com
        """
        verse_div = soup.find("div", class_=re.compile("Lyrics__Container"))
        if not verse_div:
            if soup.find("div",
                         class_=re.compile("LyricsPlaceholder__Message"),
                         string="This song is an instrumental"):
                self._log.debug('Detected instrumental')
                return "[Instrumental]"
            else:
                self._log.debug("Couldn't scrape page using known layouts")
                return None

        lyrics_div = verse_div.parent
        self.replace_br(lyrics_div)

        ads = lyrics_div.find_all("div",
                                  class_=re.compile("InreadAd__Container"))
        for ad in ads:
            ad.replace_with("\n")

        footers = lyrics_div.find_all("div",
                                      class_=re.compile("Lyrics__Footer"))
        for footer in footers:
            footer.replace_with("")
        return lyrics_div.get_text()

You should get these print statements from the test:

[Two screenshots of the test output, taken 2023-09-04]
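One small side effect of building the result with `lyrics += lyrics_div.get_text() + '\n\n'` is that the returned string ends with extra blank lines. A possible alternative, sketched below, is to collect the per-container texts in a list and join them. The helper name join_lyrics_sections is hypothetical and not part of beets; it only shows the joining step, not the scraping.

```python
def join_lyrics_sections(sections):
    """Join per-container texts with one blank line between them.

    Hypothetical helper, not part of beets: whitespace-only sections are
    dropped and the result has no leading or trailing blank lines.
    """
    cleaned = [section.strip() for section in sections]
    return "\n\n".join(section for section in cleaned if section)


# Made-up input: two real sections and one empty container in between.
sections = ["[Verse 1]\nfirst line\n", "   ", "[Verse 2]\nsecond line"]
joined = join_lyrics_sections(sections)
```

In the refactored method this would mean appending each lyrics_div.get_text() to a list inside the loop and returning the joined result at the end.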

michaeldiazh avatar Sep 04 '23 22:09 michaeldiazh

I am recreating this branch. I'll have an MR up soon (:

michaeldiazh avatar Feb 19 '24 16:02 michaeldiazh