screenpipe [bounty] support for video and voice LLM in search, timeline, meeting

This likely needs to be broken down into multiple bounties.

/bounty 400

e.g.

meeting: using a voice LLM to transcribe or summarize audio would greatly increase quality (10x better than Granola etc.)
search: using a video LLM would be much more powerful and allow different context windows
timeline: same

Please suggest a rough design; I might create other issues from it.

Jan 13 '25 21:01 louis030195

💎 $400 bounty • screenpi.pe

Steps to solve:

Start working: Comment /attempt #1142 with your implementation plan
Submit work: Create a pull request including /claim #1142 in the PR body to claim the bounty
Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

❗ Important guidelines:

To claim a bounty, you need to provide a short demo video of your changes in your pull request
If anything is unclear, ask for clarification before starting as this will help avoid potential rework
Low quality AI PRs will not receive review and will be closed
Do not ask to be assigned unless you've contributed before

Thank you for contributing to mediar-ai/screenpipe!

Attempt	Started (UTC)	Solution	Actions
🟢 @BenraouaneSoufiane	Aug 05, 2025, 10:31:57 AM	WIP
🟢 @7908837174	Oct 23, 2025, 04:49:57 AM	WIP

Jan 13 '25 21:01 algora-pbc[bot]

I want to work on this. How are you validating it? I need more context.

Jan 22 '25 12:01 kumarvivek1752

/attempt #114

Feb 20 '25 20:02 RaghavArora14

@RaghavArora14: We appreciate your enthusiasm but since you already have 3 active bounty attempts, we're going to keep this open for other contributors to attempt. 🫡

Feb 20 '25 20:02 algora-pbc[bot]

/attempt https://github.com/mediar-ai/screenpipe/pull/114

May 23 '25 07:05 ToSeven

/attempt #1142

Aug 05 '25 10:08 BenraouaneSoufiane

@louis030195 Proposed breakdown, roughly $400 each:

1. Voice LLM for meetings
   - Whisper → transcription
   - LLM summarization
2. Video LLM for search
   - Frame/audio analysis → LLM → embeddings
   - FAISS-powered semantic search
3. Timeline enhancement
   - Combine visual/audio tags
   - Auto-label segments (topic, scene, speaker)
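
To make the "Video LLM for search" step concrete, here is a minimal sketch of the retrieval part only. It assumes each captured segment already has an embedding vector, and it swaps FAISS for a brute-force cosine-similarity scan in pure Python so it runs without dependencies. The `Segment` shape and the `search` function are hypothetical names for illustration, not screenpipe's schema or API.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """One captured chunk of screen/audio (hypothetical shape)."""
    start_s: float          # offset of the segment in seconds
    text: str               # caption or transcript produced upstream
    embedding: list[float]  # vector produced by some embedding model


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def search(query_embedding: list[float], segments: list[Segment], k: int = 3):
    """Return up to k (score, segment) pairs, highest cosine similarity first."""
    q = normalize(query_embedding)
    scored = []
    for seg in segments:
        s = normalize(seg.embedding)
        score = sum(a * b for a, b in zip(q, s))  # dot product of unit vectors
        scored.append((score, seg))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

An exact inner-product index over normalized vectors (for example FAISS's `IndexFlatIP`) should return the same ranking; the scan above just keeps the sketch self-contained.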

Would start with #1 (meeting voice summary) and propose incremental PRs. Feedback welcome.
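
A rough sketch of how the meeting voice summary could be wired, assuming the transcriber (e.g. Whisper) and the LLM are passed in as plain callables. The names `summarize_meeting`, `transcribe`, and `summarize` are hypothetical and only illustrate the flow; long transcripts are split into chunks, each chunk is summarized, and the partial summaries are summarized once more.

```python
from typing import Callable


def chunk_text(text: str, max_words: int) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_meeting(
    audio_path: str,
    transcribe: Callable[[str], str],  # assumption: audio path -> transcript
    summarize: Callable[[str], str],   # assumption: text -> shorter text
    max_words: int = 500,
) -> str:
    """Transcribe the audio, summarize each chunk, then merge the partials."""
    transcript = transcribe(audio_path)
    chunks = chunk_text(transcript, max_words)
    if not chunks:
        return ""  # nothing was said, nothing to summarize
    partials = [summarize(chunk) for chunk in chunks]
    if len(partials) == 1:
        return partials[0]
    # Second pass: condense the per-chunk summaries into one summary.
    return summarize("\n".join(partials))
```

Keeping the models behind callables means the same flow can be tested with stubs and later pointed at a local or hosted model without changing the pipeline.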

Aug 05 '25 10:08 BenraouaneSoufiane

@louis030195 Can you release the bounty amount?

Aug 05 '25 11:08 BenraouaneSoufiane

Is the problem solved?

Oct 09 '25 14:10 Deng-Xian-Sheng

/attempt https://github.com/mediar-ai/screenpipe/issues/1142

Oct 23 '25 04:10 kallal79