Scraping for Me, Not for Thee: Large Language Models, Web Data, and Privacy-Problematic Paradigms (epic.org)
Forumite@lemm.ee to Privacy@lemmy.dbzer0.com · English · 2 days ago · 2 comments · cross-posted to: privacyguides@lemmy.one
e0qdk@reddthat.com · 2 days ago
arXiv has bulk access methods – you shouldn’t need to scrape their website to get the data: https://info.arxiv.org/help/bulk_data.html
If you really want everything (5TB+), that’s available from their S3 bucket if you’re willing to cover the transfer costs: https://info.arxiv.org/help/bulk_data_s3.html
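To get a rough idea of what "cover the transfer costs" could mean before downloading anything, here is a minimal sketch that sums file sizes from a manifest and multiplies by a per-GB rate. It assumes the manifest is XML with one `file` entry per archive, each carrying a `size` in bytes; that layout, the sample filenames, and the `usd_per_gb` rate are all placeholders for illustration, so check the bulk data page and current S3 pricing for the real values.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape the manifest is ASSUMED to have.
# The real manifest's filenames, sizes, and extra fields will differ.
SAMPLE_MANIFEST = """<arXivSRC>
  <file><filename>src/example_001.tar</filename><size>500000000</size></file>
  <file><filename>src/example_002.tar</filename><size>1500000000</size></file>
</arXivSRC>"""


def total_bytes(manifest_xml: str) -> int:
    """Sum the <size> values (bytes) over every <file> entry."""
    root = ET.fromstring(manifest_xml)
    return sum(int(f.findtext("size", default="0")) for f in root.iter("file"))


def estimate_transfer_cost(n_bytes: int, usd_per_gb: float) -> float:
    """Rough cost estimate: decimal gigabytes times a per-GB rate."""
    return n_bytes / 1e9 * usd_per_gb


if __name__ == "__main__":
    size = total_bytes(SAMPLE_MANIFEST)
    # 0.09 is a placeholder rate, not a quoted price.
    cost = estimate_transfer_cost(size, usd_per_gb=0.09)
    print(f"{size / 1e9:.1f} GB, roughly ${cost:.2f} at the assumed rate")
```

Swapping `SAMPLE_MANIFEST` for the contents of a real manifest file should give a ballpark figure, provided the element names match.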