Lanka Data Unveils “Base” AI Model Chat2Find Built on Sri Lanka’s Largest Trilingual Dataset


Colombo: The emerging AI platform Chat2Find has taken a major step forward in multilingual artificial intelligence in Sri Lanka with the release of its foundational Chat2Find Base model, now available via Hugging Face.

The Base model forms the backbone of Chat2Find’s upcoming open-weight model suite and is designed to support Sinhala, Tamil, and English, including naturally code-mixed variations such as Singlish and Tanglish. The release marks a significant milestone for locally grounded AI development, particularly in low-resource language ecosystems.

Built on a 255M+ Token Corpus

The Chat2Find Base model was trained through continual pre-training (CPT) on the recently released Chat2Find Corpus, a large-scale conversational dataset containing over 255 million tokens across nearly 280,000 records.

Unlike many global datasets, the corpus is derived from real user interactions, capturing authentic linguistic patterns, cultural nuances, and regional knowledge specific to South Asia. This gives the Base model a strong advantage in understanding local context, informal speech, and multilingual switching—areas where traditional models often struggle.
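
For a rough sense of scale, the corpus figures quoted above imply an average record length of a little over 900 tokens. The short Python sketch below works through that arithmetic. It assumes the rounded figures from this article (255 million tokens, 280,000 records) rather than exact counts from the dataset, and the function name is purely illustrative.

```python
def average_tokens_per_record(total_tokens: int, total_records: int) -> float:
    """Return the mean number of tokens per record."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return total_tokens / total_records


# Rounded figures taken from the article, not exact dataset statistics.
CORPUS_TOKENS = 255_000_000   # "over 255 million tokens"
CORPUS_RECORDS = 280_000      # "nearly 280,000 records"

avg = average_tokens_per_record(CORPUS_TOKENS, CORPUS_RECORDS)
print(f"Approximate tokens per record: {avg:.0f}")
```

Because the token count is stated as a minimum and the record count as a maximum, the true average is likely slightly higher than this estimate.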

Access Base Model: Hugging Face

Foundation for Advanced AI Models

Industry observers note that “base models” represent the pretrained core of AI systems, which can later be fine-tuned for tasks such as instruction-following or reasoning. 

Chat2Find has confirmed that the Base model will be followed by specialized variants, including:

  • Chat2Find Instruct – optimized for task execution and prompts
  • Chat2Find Reasoning – focused on complex problem-solving

These models are expected to expand the capabilities of AI tools in education, business, and public services across Sri Lanka and the wider region.

Boost for Sri Lanka’s AI Ecosystem

The release is being seen as a breakthrough for Sri Lanka’s AI landscape, where access to high-quality, locally relevant training data has historically been limited. By open-sourcing both the dataset and model components under permissive licensing, Chat2Find is enabling researchers, startups, and institutions to build next-generation applications tailored to regional needs. 

Analysts say the initiative could position Sri Lanka as a regional hub for multilingual AI innovation, particularly in South Asian language technologies.

Looking Ahead

With the Base model now accessible to developers worldwide, attention is shifting to real-world deployments and fine-tuned applications. As global AI development increasingly moves toward open ecosystems, Chat2Find’s approach highlights the growing importance of localized data and inclusive language representation in shaping the future of artificial intelligence.




About

LankaData is Sri Lanka’s pioneer in structured data collection and intelligent data access, built to transform how national information is stored, searched, and used. We systematically compile, digitise, and organise authoritative datasets across law, taxation, economics, business, and public policy, converting fragmented and complex information into clean, reliable, and searchable digital assets.

Building on these structured repositories, LankaData delivers an intelligent access layer powered by advanced retrieval-augmented AI models. Each dataset is paired with a dedicated domain-specific AI expert, enabling users to ask complex questions in natural language and receive precise, source-linked responses. This intelligence is seamlessly delivered through the Chat2Find platform, making LankaData not just a data provider, but a practical decision-support system for professionals.
