
floatingpoint
W24
Pivot 2 of 2
Open source API service to parse complex documents
Post-training data to teach models document work
Battle-tested, highly modular vision infrastructure to convert PDFs, PPTs, Word, Excel, PNGs, and JPEGs into LLM-ready data. We started by building lumina.sh, where we needed to parse ~600M pages of scientific literature. The researchers didn't care, but devs wanted our ingestion pipeline, so we built chunkr instead. We offer high-quality layout analysis, OCR, bounding boxes, granular VLM controls, semantic chunking, and all the last-mile engineering that goes into building standout AI applications. Common use cases include RAG and automating document workflows, such as extracting invoices or medical reports into a database.
Floatingpoint builds off-the-shelf post-training datasets that teach models how to do real work with documents. We discover valuable tasks where models fall short and build datasets to close the gap. The datasets are human-crafted from real-world sources, with synthetic expansions on top, and validated through in-house training cycles.
The company moved from providing an API service for document parsing (infrastructure and tools for developers) to selling pre-built post-training datasets that teach models document tasks (a data product for ML teams). This is a notable shift in the core offering, although both products support AI-driven document workflows.
AI Search Engine for Research