Forumite@lemm.ee to Privacy@lemmy.dbzer0.com · English · 1 month ago

Scraping for Me, Not for Thee: Large Language Models, Web Data, and Privacy-Problematic Paradigms

cross-posted to:
privacyguides@lemmy.one


e0qdk@reddthat.com · 1 month ago
arXiv has bulk access methods – you shouldn’t need to scrape their website to get the data: https://info.arxiv.org/help/bulk_data.html

If you really want everything (5TB+), that’s available from their S3 bucket if you’re willing to cover the transfer costs: https://info.arxiv.org/help/bulk_data_s3.html
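Before committing to the full download, it can help to estimate the transfer size first. Below is a minimal Python sketch that sums the sizes listed in a bulk-data manifest. It assumes the manifest is XML with one `<file>` element per archive, each carrying a `<size>` in bytes; the sample data, the function names, and the per-GB rate are all hypothetical, so check the linked arXiv pages and your provider's current pricing for the real layout and costs. Fetching the manifest itself is left out.

```python
import xml.etree.ElementTree as ET


def total_transfer_bytes(manifest_xml: str) -> int:
    """Sum the <size> values (bytes) of every <file> entry in the manifest."""
    root = ET.fromstring(manifest_xml)
    return sum(int(f.findtext("size", default="0")) for f in root.iter("file"))


def estimate_cost_usd(total_bytes: int, usd_per_gb: float) -> float:
    """Rough transfer cost, given a caller-supplied (assumed) per-GB rate."""
    return total_bytes / 1e9 * usd_per_gb


# Hypothetical two-entry manifest, used for illustration only.
SAMPLE = """<arXivSRC>
  <file><filename>src/arXiv_src_0001_001.tar</filename><size>1500000000</size></file>
  <file><filename>src/arXiv_src_0001_002.tar</filename><size>500000000</size></file>
</arXivSRC>"""

if __name__ == "__main__":
    total = total_transfer_bytes(SAMPLE)
    # The 0.10 USD/GB rate is a placeholder, not a quoted price.
    print(total, estimate_cost_usd(total, usd_per_gb=0.10))
```

Running the same two functions over the real manifest should give a ballpark figure before any large transfer starts.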
