Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

AI & ML interests

Machine Learning Librarian

Articles

Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation

Jun 20

• 12

Data Is Better Together: A Look Back and Forward

Jun 20

• 17

Synthetic dataset generation techniques: generating custom sentence similarity data

May 23

• 14

Synthetic dataset generation techniques: Self-Instruct

May 15

• 6

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

May 7

• 7

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Mar 20

• 58

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 24

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

Aug 2, 2023

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Jun 12, 2023

• 1

Introducing BERTopic Integration with Hugging Face Hub

May 31, 2023

• 4

Jupyter X Hugging Face

Mar 23, 2023

• 2

Image search with 🤗 datasets

Mar 16, 2022

• 5

davanstrien's activity

upvoted 2 collections about 15 hours ago

Dataset Creation Tools and Utilities

Collection

Spaces focused on helping to create datasets • 3 items • Updated about 15 hours ago • 1

Synthetic Dataset Creation Spaces

Collection

Spaces focused on generating synthetic datasets • 4 items • Updated about 16 hours ago • 1

upvoted 2 papers 3 days ago

jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Paper • 2409.10173 • Published 4 days ago • 15

CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

Paper • 2409.02098 • Published 16 days ago • 1

upvoted a collection 3 days ago

CRAFT: Corpus Retrieval and Augmentation for Fine-Tuning

Collection

CRAFTed datasets and LoRA adapter checkpoints. All datasets are synthetically generated. Paper: https://arxiv.org/abs/2409.02098 • 11 items • Updated 16 days ago • 1

upvoted 2 collections 7 days ago

Medieval NER

Collection

This is a collection of Medieval NER datasets and models. • 7 items • Updated Jul 4 • 2

TrOCR Medieval HTR

Collection

This is a collection of models trained to recognize medieval scripts. • 10 items • Updated Jul 8 • 4

upvoted a collection 9 days ago

Hub Card Data

Collection

2 items • Updated 9 days ago • 2

upvoted a paper 14 days ago

Hermes 3 Technical Report

Paper • 2408.11857 • Published Aug 15 • 34

upvoted a paper 15 days ago

Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text

Paper • 2409.02078 • Published 16 days ago • 8

upvoted a collection 21 days ago

Qwen2-VL

Collection

Vision-language model series based on Qwen2 • 15 items • Updated 1 day ago • 114

upvoted a paper about 1 month ago

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Paper • 2408.08872 • Published Aug 16 • 96

upvoted a collection about 1 month ago

Llama 2 Family

Collection

This collection hosts the transformers and original repos of the Llama 2 and Llama Guard releases • 13 items • Updated Aug 2 • 60

upvoted 3 papers about 1 month ago

MathBridge: A Large-Scale Dataset for Translating Mathematical Expressions into Formula Images

Paper • 2408.07081 • Published Aug 7 • 1

Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM

Paper • 2408.07246 • Published Aug 14 • 19

InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning

Paper • 2408.07089 • Published Aug 9 • 12

upvoted a collection about 1 month ago

ARK Annif Models

Collection

Contains 5 Annif models for the languages German, Latin, English, French and multilingual. • 5 items • Updated Aug 14 • 2

upvoted an article about 1 month ago

Article

⭐ PySpark and 🤗 Hugging Face Parquet Files

Aug 13

• 5

upvoted 4 papers about 1 month ago

MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

Paper • 2408.02900 • Published Aug 6 • 25

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

Paper • 2408.02545 • Published Aug 5 • 32

BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba

Paper • 2408.02600 • Published Aug 5 • 8

LocalValueBench: A Collaboratively Built and Extensible Benchmark for Evaluating Localized Value Alignment and Ethical Safety in Large Language Models

Paper • 2408.01460 • Published Jul 27 • 1

upvoted an article about 2 months ago

Article

The case for specialized pre-training: ultra-fast foundation models for dedicated tasks

Aug 4

• 24

upvoted a paper about 2 months ago

ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation

Paper • 2407.19835 • Published Jul 29 • 19

upvoted a collection about 2 months ago

🍃 MINT-1T

Collection

Data for "MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens" • 13 items • Updated Jul 24 • 49

upvoted 2 articles 2 months ago

Article

Bringing Open-Source Models to Spreadsheets 🚀

Jul 19

• 2

Article

Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing

Jul 19

• 17

upvoted 2 collections 2 months ago

DCLM

Collection

DCLM Models + Datasets • 7 items • Updated Jul 22 • 38

Alpaca Style Datasets

Collection

Datasets which follow the Alpaca Style format based on having 'instruction', 'input', and 'output' columns • 2978 items • Updated about 3 hours ago • 2

upvoted an article 2 months ago

Article

Experimenting with Automatic PII Detection on the Hub using Presidio

Jul 10

• 23

upvoted a collection 2 months ago

Direct Preference Optimization Datasets

Collection

Datasets suitable for DPO based on having 'chosen', 'rejected', and 'prompt' columns. Created using librarian-bots/dataset-column-search-api • 2443 items • Updated about 4 hours ago • 4

upvoted a paper 2 months ago

Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors

Paper • 2407.11828 • Published Jul 16 • 4

upvoted a collection 2 months ago

Magpie-Qwen2 Datasets

Collection

Dataset built with Qwen2 72B and Qwen2 7B. • 6 items • Updated 5 days ago • 10

upvoted 4 papers 2 months ago

SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning

Paper • 2407.07523 • Published Jul 10 • 4

MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models

Paper • 2407.10953 • Published Jul 15 • 4

DataDream: Few-shot Guided Dataset Generation

Paper • 2407.10910 • Published Jul 15 • 7

Qwen2 Technical Report

Paper • 2407.10671 • Published Jul 15 • 153

upvoted a collection 2 months ago

H2O Danube3

Collection

6 items • Updated Jul 16 • 51

upvoted 7 papers 2 months ago

PaliGemma: A versatile 3B VLM for transfer

Paper • 2407.07726 • Published Jul 10 • 64

CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging

Paper • 2407.07315 • Published Jul 10 • 6

On Leakage of Code Generation Evaluation Datasets

Paper • 2407.07565 • Published Jul 10 • 4

AgentInstruct: Toward Generative Teaching with Agentic Flows

Paper • 2407.03502 • Published Jul 3 • 43

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences

Paper • 2406.11069 • Published Jun 16 • 13

FairJob: A Real-World Dataset for Fairness in Online Systems

Paper • 2407.03059 • Published Jul 3 • 1

InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct

Paper • 2407.05700 • Published Jul 8 • 9

upvoted an article 2 months ago

Article

Announcing New Dataset Search Features

Jul 8

• 22

upvoted an article 3 months ago

Article

Image search with 🤗 datasets

Mar 16, 2022

• 5

upvoted 2 collections 3 months ago

Medieval HTR

Collection

This is a collection of HTR data and models • 2 items • Updated Jul 4 • 3

Probably oasst Style Datasets

Collection

Datasets in the OpenAssistant format {"INSTRUCTION": "...", "RESPONSE": "..."} • 46 items • Updated Jul 3 • 1

upvoted a paper 3 months ago

LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

Paper • 2407.01490 • Published Jul 1 • 1

upvoted a collection 3 months ago

Probably function calling datasets

Collection

Created using the https://huggingface.co./spaces/librarian-bots/dataset-column-search-api Space. • 39 items • Updated Jul 17 • 35

upvoted 3 papers 3 months ago

Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER

Paper • 2407.01272 • Published Jul 1 • 8

APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

Paper • 2406.18518 • Published Jun 26 • 23

Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity

Paper • 2406.17720 • Published Jun 25 • 7

upvoted a collection 3 months ago

Probably Alpaca Style Datasets

Collection

Datasets probably matching the alpaca format ({"instruction": "...", "input": "...", "output": "..."}) • 1944 items • Updated Jul 1 • 1

upvoted 2 papers 3 months ago

LiveBench: A Challenging, Contamination-Free LLM Benchmark

Paper • 2406.19314 • Published Jun 27 • 17

Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models

Paper • 2406.14848 • Published Jun 21 • 2

upvoted a collection 3 months ago

Probably DPO datasets

Collection

A collection of datasets that probably support DPO • 146 items • Updated Jun 26 • 12

upvoted 2 papers 3 months ago

DataComp-LM: In search of the next generation of training sets for language models

Paper • 2406.11794 • Published Jun 17 • 48

PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models

Paper • 2406.15513 • Published Jun 20 • 1

Data is better together

Extracting Insights from Model Cards Using Open Large Language Models

Creating open machine learning datasets? Share them on the Hugging Face Hub!
