COLOMBO, Sri Lanka - LankaData has announced the public release of the Chat2Find Corpus, a major trilingual conversational dataset, marking a key milestone in Sri Lanka’s growing AI ecosystem.

The Chat2Find Corpus consists of over 255 million tokens (~279,248 records) in Sinhala, Tamil, and English, including code-mixed language such as Singlish and Tanglish. Designed for training and fine-tuning large language models (LLMs), the dataset supports multilingual AI development in low-resource environments.

Developed by Chat2Find and released as open source under the MIT License, the dataset gives researchers, developers, and institutions access to high-quality, locally relevant data, the scarcity of which has long limited AI innovation in Sri Lanka. It is suitable for continual pre-training (CPT) and supervised fine-tuning (SFT).

The corpus captures authentic language use and cultural context, making it especially valuable for modern natural language processing tasks. Alongside the dataset, Chat2Find has also announced upcoming AI models, including a base model and fine-tuned models optimized for trilingual understanding and reasoning.

Access the dataset:

  • Data Corpus | Hugging Face: Link
  • Data Conversations | Hugging Face: Link
  • LankaData | Local Repository: Link

This release positions LankaData as a key hub for open AI resources in Sri Lanka, supporting the next wave of locally grounded AI development.

Official Press Release by Chat2Find: https://chat2find.com/2026/04/09/chat2find-trilingual-corpus-255m-tokens-release/


About

LankaData is Sri Lanka’s pioneer in structured data collection and intelligent data access, built to transform how national information is stored, searched, and used. We systematically compile, digitise, and organise authoritative datasets across law, taxation, economics, business, and public policy, converting fragmented and complex information into clean, reliable, and searchable digital assets.

Building on these structured repositories, LankaData delivers an intelligent access layer powered by advanced retrieval-augmented AI models. Each dataset is paired with a dedicated domain-specific AI expert, enabling users to ask complex questions in natural language and receive precise, source-linked responses. This intelligence is delivered through the Chat2Find platform, making LankaData not just a data provider, but a practical decision-support system for professionals.
