Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 58
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 24
Dataset Creation Tools and Utilities Collection Spaces focused on helping to create datasets • 3 items • Updated about 15 hours ago • 1
Synthetic Dataset Creation Spaces Collection Spaces focused on generating synthetic datasets • 4 items • Updated about 16 hours ago • 1
jina-embeddings-v3: Multilingual Embeddings With Task LoRA Paper • 2409.10173 • Published 4 days ago • 15
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation Paper • 2409.02098 • Published 16 days ago • 1
CRAFT: Corpus Retrieval and Augmentation for Fine-Tuning Collection CRAFTed datasets and LoRA adapter checkpoints. All datasets are synthetically generated. Paper: https://arxiv.org/abs/2409.02098 • 11 items • Updated 16 days ago • 1
Medieval NER Collection This is a collection of Medieval NER datasets and models. • 7 items • Updated Jul 4 • 2
TrOCR Medieval HTR Collection This is a collection of models trained to recognize medieval scripts. • 10 items • Updated Jul 8 • 4
Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text Paper • 2409.02078 • Published 16 days ago • 8
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper • 2408.08872 • Published Aug 16 • 96
Llama 2 Family Collection This collection hosts the transformers and original repos of the Llama 2 and Llama Guard releases • 13 items • Updated Aug 2 • 60
MathBridge: A Large-Scale Dataset for Translating Mathematical Expressions into Formula Images Paper • 2408.07081 • Published Aug 7 • 1
Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM Paper • 2408.07246 • Published Aug 14 • 19
InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning Paper • 2408.07089 • Published Aug 9 • 12
ARK Annif Models Collection Contains 5 Annif models: German, Latin, English, French, and multilingual. • 5 items • Updated Aug 14 • 2
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine Paper • 2408.02900 • Published Aug 6 • 25
RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation Paper • 2408.02545 • Published Aug 5 • 32
BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba Paper • 2408.02600 • Published Aug 5 • 8
LocalValueBench: A Collaboratively Built and Extensible Benchmark for Evaluating Localized Value Alignment and Ethical Safety in Large Language Models Paper • 2408.01460 • Published Jul 27 • 1
The case for specialized pre-training: ultra-fast foundation models for dedicated tasks Article • By Pclanglais • Aug 4 • 24
ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation Paper • 2407.19835 • Published Jul 29 • 19
🍃 MINT-1T Collection Data for "MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens" • 13 items • Updated Jul 24 • 49
Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing Article • By Pclanglais • Jul 19 • 17
Alpaca Style Datasets Collection Datasets which follow the Alpaca Style format based on having 'instruction', 'input', and 'output' columns • 2978 items • Updated about 3 hours ago • 2
Experimenting with Automatic PII Detection on the Hub using Presidio Article • Jul 10 • 23
Direct Preference Optimization Datasets Collection Datasets suitable for DPO based on having 'chosen', 'rejected', and 'prompt' columns. Created using librarian-bots/dataset-column-search-api • 2443 items • Updated about 4 hours ago • 4
Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors Paper • 2407.11828 • Published Jul 16 • 4
Magpie-Qwen2 Datasets Collection Datasets built with Qwen2 72B and Qwen2 7B. • 6 items • Updated 5 days ago • 10
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning Paper • 2407.07523 • Published Jul 10 • 4
MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models Paper • 2407.10953 • Published Jul 15 • 4
CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging Paper • 2407.07315 • Published Jul 10 • 6
AgentInstruct: Toward Generative Teaching with Agentic Flows Paper • 2407.03502 • Published Jul 3 • 43
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences Paper • 2406.11069 • Published Jun 16 • 13
FairJob: A Real-World Dataset for Fairness in Online Systems Paper • 2407.03059 • Published Jul 3 • 1
InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct Paper • 2407.05700 • Published Jul 8 • 9
Probably oasst Style Datasets Collection Datasets in the OpenAssistant format {"INSTRUCTION": "...", "RESPONSE": "..."} • 46 items • Updated Jul 3 • 1
LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives Paper • 2407.01490 • Published Jul 1 • 1
Probably function calling datasets Collection Created using the https://huggingface.co/spaces/librarian-bots/dataset-column-search-api Space. • 39 items • Updated Jul 17 • 35
Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER Paper • 2407.01272 • Published Jul 1 • 8
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets Paper • 2406.18518 • Published Jun 26 • 23
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity Paper • 2406.17720 • Published Jun 25 • 7
Probably Alpaca Style Datasets Collection Datasets probably matching the alpaca format ({"instruction": "...", "input": "...", "output": "..."}) • 1944 items • Updated Jul 1 • 1
LiveBench: A Challenging, Contamination-Free LLM Benchmark Paper • 2406.19314 • Published Jun 27 • 17
Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models Paper • 2406.14848 • Published Jun 21 • 2
Probably DPO datasets Collection A collection of datasets that probably support DPO • 146 items • Updated Jun 26 • 12
DataComp-LM: In search of the next generation of training sets for language models Paper • 2406.11794 • Published Jun 17 • 48
PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models Paper • 2406.15513 • Published Jun 20 • 1